WO2016134258A1 - Systems and methods for the identification and use of small RNAs

Systems and methods for the identification and use of small RNAs

Publication number
WO2016134258A1
Authority
WO
WIPO (PCT)
Prior art keywords
smrna
sequencing
sample
nucleic acid
identify
Application number
PCT/US2016/018680
Other languages
English (en)
Inventor
Todd MICHAEL
Connor MCENTEE
Original Assignee
Ibis Biosciences, Inc.
Application filed by Ibis Biosciences, Inc.

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6813: Hybridisation assays
    • C12Q1/6834: Enzymatic or biochemical coupling of nucleic acids to a solid phase
    • C12Q1/6837: Enzymatic or biochemical coupling of nucleic acids to a solid phase using probe arrays or probe chips
    • C12Q1/6869: Methods for sequencing
    • C12Q2600/00: Oligonucleotides characterized by their use
    • C12Q2600/158: Expression markers
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20: Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Definitions

  • Provided herein are systems and methods for the identification of small RNAs.
  • systems and methods are provided that identify small RNAs by identifying amplified small RNAs de novo from high throughput sequencing information.
  • Small RNAs such as microRNAs (miRNAs) guide the coordinated regulation of gene expression in processes ranging from plant and animal development (Carrington and Ambros, 2003) to complex animal behaviors such as learning (Gao et al., 2010) and addiction (Hollander et al., 2010).
  • The finding that piRNAs are also present in neurons (Lee et al., 2011) and regulate the expression of synaptic plasticity genes (Rajasethupathy et al., 2012) suggests that the function of piRNAs may be broader than first hypothesized (Peng and Lin, 2013).
  • the tools to properly identify and analyze piRNAs from high-throughput sequencing data are lacking.
  • a number of tools have also been developed to identify piRNA sequence and localization.
  • piRNA databases such as piRNABank (Sai Lakshmi and Agrawal, 2006), piRBase (Zhang et al., 2014), and piRNAQuest (Sarkar et al., 2014), combine piRNA data from various publicly available sources into a user-friendly format. Although these databases integrate information from many sources, they are of limited assistance to identify novel piRNA using high-throughput sequencing technologies. Two tools have been developed for identifying piRNA clusters in high-throughput sequencing data, proTRAC (Rosenkranz et al., 2012) and the newer piClust (Jung et al., 2014). Individual piRNA identification has been attempted using analysis to identify piRNA-enriched motifs.
  • methods comprising one or more or all of the steps of: a) generating or otherwise obtaining sequencing data from smRNAs (e.g., from a sample); b) inputting the sequencing data into a computer comprising a processor and software (e.g., encoding a computer program configured to analyze the data); c) applying successive statistical models to said sequencing data with the software to identify overexpressed smRNA relative to background; and d) reporting the overexpressed smRNA.
  • Data for any type of smRNA may be employed from any type of sample.
  • the smRNA is one or more of piRNA, miRNA, snoRNA, and tRNA.
  • the smRNA is a previously unknown smRNA (e.g., never previously identified as smRNA, smRNA with unknown function, or smRNA not previously known to exist in the sample).
  • generating sequencing data comprises obtaining RNA (e.g., total RNA, total smRNA, etc.) from a sample and conducting high-throughput next-generation sequencing on the RNA, directly or indirectly (e.g., sequencing cDNA, amplification products, or other molecules derived from the RNA).
  • the method further comprises one or more steps that further characterize an identified smRNA or use an identified smRNA. For example, in some embodiments, the method further comprises determining an expression level of a previously unknown (now identified) smRNA in one or more tissues.
  • the sequencing data can be derived from any type of sequencing technology or approach.
  • the sequencing data comprises size-selected high-throughput sequencing data.
  • the sequencing data comprises sorted smRNA short-read alignment data from size selected sequencing libraries.
  • the sequencing data is in SAM format (although any other desired format can be used).
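Reading such input can be sketched as follows. This is a minimal illustration, assuming single-end reads and the standard tab-separated SAM layout with eleven mandatory fields; the names `Alignment` and `read_sam_alignments` and the default size-selection bounds (18 to 35 nt) are hypothetical and do not come from the source.

```python
from typing import Iterable, Iterator, NamedTuple


class Alignment(NamedTuple):
    chrom: str
    start: int   # 1-based leftmost mapping position (SAM POS field)
    length: int  # read length taken from the SEQ field
    strand: str  # "+" or "-", derived from the FLAG field


def read_sam_alignments(lines: Iterable[str],
                        min_len: int = 18,
                        max_len: int = 35) -> Iterator[Alignment]:
    """Yield mapped reads whose length falls in the size-selected range."""
    for line in lines:
        if line.startswith("@"):        # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:            # not a complete alignment record
            continue
        flag = int(fields[1])
        if flag & 0x4:                  # FLAG bit 0x4: read is unmapped
            continue
        seq = fields[9]
        if not (min_len <= len(seq) <= max_len):
            continue
        strand = "-" if flag & 0x10 else "+"   # FLAG bit 0x10: reverse strand
        yield Alignment(fields[2], int(fields[3]), len(seq), strand)
```

Because the function is a generator, it can feed downstream steps one alignment at a time, for example `for aln in read_sam_alignments(open(path)): ...`, where `path` is whatever SAM file is being analyzed.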
  • the applying successive statistical models step comprises identify, call, and filter steps.
  • the identify, call, and filter steps are run in parallel, allowing, for example, separate processes to communicate with each other, and for downstream computation to begin before previous steps have completely finished (e.g., call and filter steps occur while identify steps are still in progress).
  • the identify step comprises streaming off an identified smRNA in the identify step to the calling prior to completion of the identify step.
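The streaming arrangement described above can be sketched with worker threads connected by queues. This is only an illustration of the idea: threads stand in for the separate communicating processes mentioned in the text, and the names `stage`, `run_pipeline`, `call`, and `keep` are hypothetical.

```python
import queue
import threading
from typing import Callable, Iterable, List

_DONE = object()  # sentinel marking the end of a stream


def stage(func: Callable, inbox: queue.Queue, outbox: queue.Queue) -> threading.Thread:
    """Run func on each item from inbox, forwarding non-None results to outbox."""
    def worker() -> None:
        while True:
            item = inbox.get()
            if item is _DONE:
                outbox.put(_DONE)       # propagate end-of-stream downstream
                return
            result = func(item)
            if result is not None:
                outbox.put(result)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


def run_pipeline(regions: Iterable, call: Callable, keep: Callable) -> List:
    """Stream regions through call and filter stages as they are identified."""
    to_call: queue.Queue = queue.Queue()
    to_filter: queue.Queue = queue.Queue()
    results: queue.Queue = queue.Queue()
    stage(call, to_call, to_filter)
    stage(lambda locus: locus if keep(locus) else None, to_filter, results)
    for region in regions:              # identify emits regions one at a time
        to_call.put(region)             # calling can start before identify ends
    to_call.put(_DONE)
    out = []
    while (item := results.get()) is not _DONE:
        out.append(item)
    return out
```

With one worker per stage and first-in, first-out queues, results come out in the order regions were identified. A production system would more likely use separate processes, as the text describes.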
  • the identify step comprises identification of putative smRNA loci by extracting regions that have statistically overrepresented numbers of smRNA alignments against a null model in which sequenced smRNAs are uniformly distributed across a genome.
  • a user settable cutoff is applied to filter out regions of low interest.
  • a locus that passes the cutoff, and all reads aligning to that region, as well as to base pairs upstream and/or downstream of it, are forwarded to the call step.
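One way to illustrate the identify step is a fixed-window Poisson test. The source specifies only a null model in which reads are uniformly distributed across the genome; the windowing scheme, the Poisson approximation, and the names `poisson_sf`, `identify_loci`, `window`, and `cutoff` are assumptions made for this sketch.

```python
import math
from collections import Counter
from typing import Iterable, List, Tuple


def poisson_sf(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam), computed as 1 minus the lower tail."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)               # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i                 # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def identify_loci(starts: Iterable[int], genome_size: int,
                  window: int = 100, cutoff: float = 1e-3
                  ) -> List[Tuple[int, int, int, float]]:
    """Return (window_start, window_end, read_count, p_value) for enriched windows."""
    counts = Counter((s - 1) // window for s in starts)   # 1-based positions
    total = sum(counts.values())
    lam = total * window / genome_size  # expected reads per window under the null
    hits = []
    for idx in sorted(counts):
        p = poisson_sf(counts[idx], lam)
        if p <= cutoff:                 # user-settable cutoff
            hits.append((idx * window + 1, (idx + 1) * window, counts[idx], p))
    return hits
```

The subtraction in `poisson_sf` loses precision for very small tail probabilities, which is acceptable for thresholding but not for ranking loci. Flanking base pairs upstream and downstream, as mentioned in the text, would be added to each reported window before forwarding it to the call step.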
  • the call step comprises a Maximum Likelihood Estimation.
  • the call step fits a simple model of smRNA amplification to an observed distribution of smRNA alignments to predict putative mature smRNA loci.
  • the call step assigns a score to each locus quantifying how amplified read support for that locus is compared to a surrounding region.
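The source does not spell out the amplification model, so the following is a toy stand-in rather than the method itself. It assumes each read start in a region of width W either comes from a single amplified locus (probability π) or falls uniformly across the region (probability 1 − π). For a candidate position supported by k of n reads, the maximum likelihood estimate of the probability q that a read starts there is max(k/n, 1/W), and the score is the log-likelihood ratio against the uniform-only model. The name `call_locus` is hypothetical.

```python
import math
from collections import Counter
from typing import Sequence, Tuple


def call_locus(starts: Sequence[int], region_start: int, region_end: int
               ) -> Tuple[int, float, float]:
    """Return (best_start, amplified_fraction, log_likelihood_ratio)."""
    width = region_end - region_start + 1
    n = len(starts)
    if n == 0 or width < 2:
        raise ValueError("need at least one read and a region wider than 1 bp")
    counts = Counter(starts)
    # The likelihood increases with read support, so the best candidate is the
    # most frequent start position (ties resolved toward the leftmost position).
    best_start, k = max(counts.items(), key=lambda kv: (kv[1], -kv[0]))
    q = max(k / n, 1.0 / width)         # MLE of P(read starts at best_start)
    llr = k * math.log(q * width)
    if k < n:
        llr += (n - k) * math.log((1.0 - q) * width / (width - 1))
    pi = (q * width - 1.0) / (width - 1)   # implied weight of the amplified component
    return best_start, pi, llr
```

A score near zero means the best position is no better supported than its surroundings; larger scores indicate stronger amplification relative to the surrounding region.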
  • the filter step comprises using amplification p-values assigned to each model locus from the calling stage to discard low quality loci.
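The filter step reduces to a threshold on the p-value carried by each called locus. In this minimal sketch the record type `CalledLocus`, the function `filter_loci`, and the default threshold of 0.01 are illustrative assumptions.

```python
from typing import Iterable, Iterator, NamedTuple


class CalledLocus(NamedTuple):
    chrom: str
    start: int
    end: int
    p_value: float   # amplification p-value assigned in the calling stage


def filter_loci(loci: Iterable[CalledLocus],
                max_p: float = 0.01) -> Iterator[CalledLocus]:
    """Yield only loci whose amplification p-value is at or below max_p."""
    for locus in loci:
        if locus.p_value <= max_p:
            yield locus
```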
  • systems comprising a computer and a computer readable media (e.g., non-transitory computer readable media) configured to carry out any of the above methods or other methods described herein.
  • the system may further comprise one or more additional components necessary, useful, or sufficient for obtaining and processing a sample to obtain the smRNA sequence information (e.g., a nucleic acid sequencing instrument and/or associated reagents such as polymerases, primers, buffers, etc.) and/or for further characterizing or using an smRNA identified by the methods.
  • small RNA refers to small RNA molecules (e.g., those having 1000 or fewer, 500 or fewer, 200 or fewer, 100 or fewer, 50 or fewer, 40 or fewer, 30 or fewer, or 20 or fewer nucleotides, and ranges therebetween) found in samples, such as plants, animals, microorganisms, and environmental samples.
  • Small RNAs include, but are not limited to, microRNAs (miRNAs), small nucleolar RNAs (snoRNAs), small interfering RNA (siRNAs), small nuclear RNAs (snRNAs), transfer RNAs (tRNAs), and piwi interacting RNAs (piRNAs).
  • “processor”, “central processing unit”, and “CPU” are used interchangeably and refer to a device that is able to read a program from a computer memory (e.g., ROM or other computer memory) and perform a set of steps according to the program.
  • computer readable medium refers to any device or system for storing and providing information (e.g., data and instructions) to a computer processor, including non-transitory computer readable media.
  • Examples of computer readable media include, but are not limited to, DVDs, CDs, hard disk drives, magnetic tape and servers for streaming information over networks.
  • the term “in electronic communication” refers to electrical devices (e.g., computers, processors, communications equipment, research equipment (e.g., nucleic acid sequencing instruments)) that are configured to communicate with one another through direct or indirect signaling.
  • a computer configured to transmit (e.g., through cables, wires, infrared signals, telephone lines, etc.) information to another computer or device, is in electronic communication with the other computer or device.
  • transmitting refers to the movement of information (e.g., data) from one location to another (e.g., from one device to another) using any suitable means.
  • sample is used in its broadest sense. In one sense, it is meant to include a specimen or culture obtained from any source, as well as biological and environmental samples. Biological samples may be obtained from animals (including humans) and encompass fluids, solids, tissues, and gases.
  • Biological samples such as, e.g., blood, serum, plasma, buffy coat, saliva, wound exudates, pus, lung and other respiratory aspirates, nasal aspirates and washes, sinus drainage, bronchial lavage fluids, sputum, medial and inner ear aspirates, cyst aspirates, cerebral spinal fluid, stool, diarrheal fluid, urine, tears, mammary secretions, ovarian contents, ascites fluid, mucous, gastric fluid, gastrointestinal contents, urethral discharge, synovial fluid, peritoneal fluid, meconium, vaginal fluid or discharge, amniotic fluid, semen, penile discharge, or the like may be used.
  • assay a sample from swabs or lavages (e.g., that are representative of mucosal secretions and epithelia), for example, mucosal swabs of the throat, tonsils, gingival, nasal passages, vagina, urethra, rectum, lower colon, and eyes, as are homogenates, lysates, and digests of tissue specimens of all sorts.
  • the sample comprises mammalian cells.
  • the term sample encompasses other samples such as, e.g., samples of water, industrial discharges, food products, milk, air filtrates, etc.
  • nucleic acid shall mean any nucleic acid molecule, including, without limitation, DNA, RNA, and hybrids thereof.
  • the nucleic acid bases that form nucleic acid molecules can be the bases A, C, G, T and U, as well as derivatives thereof. Derivatives of these bases are well known in the art.
  • the term should be understood to include, as equivalents, analogs of either DNA or RNA made from nucleotide analogs.
  • the term as used herein also encompasses cDNA that is complementary DNA produced from an RNA template, for example by the action of a reverse transcriptase.
  • DNA (deoxyribonucleic acid) comprises four types of nucleotides: A (adenosine), T (thymidine), C (cytidine), and G (guanosine).
  • RNA (ribonucleic acid) comprises four types of nucleotides: A (adenosine), U (uridine), G (guanosine), and C (cytidine).
  • all of these 5 types of nucleotides specifically bind to one another in combinations called complementary base pairing. That is, A pairs with T (in the case of RNA, however, A pairs with U) and C pairs with G, so that each of these base pairs forms a double strand.
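The pairing rules above (A with T, or with U in RNA, and C with G) translate directly into a lookup table. This short sketch computes the reverse complement of a sequence; the function name is illustrative.

```python
DNA_PAIRS = str.maketrans("ACGT", "TGCA")   # A-T and C-G
RNA_PAIRS = str.maketrans("ACGU", "UGCA")   # A-U and C-G


def reverse_complement(seq: str, rna: bool = False) -> str:
    """Return the complementary strand, read in the opposite direction."""
    table = RNA_PAIRS if rna else DNA_PAIRS
    return seq.upper().translate(table)[::-1]
```

For example, `reverse_complement("ATGCCTG")` gives `"CAGGCAT"`.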
  • nucleic acid sequencing data denotes any information or data that is indicative of the order of the nucleotide bases (e.g., adenine, guanine, cytosine, and thymine/uracil) in a molecule (e.g., a whole genome, a whole transcriptome, an exome, oligonucleotide, polynucleotide, fragment, etc.) of DNA or RNA.
  • nucleic acid sequencing data includes sequence information obtained using all available varieties of techniques, platforms or technologies, including, but not limited to: capillary electrophoresis, microarrays, ligation-based systems, polymerase-based systems, hybridization-based systems, direct or indirect nucleotide identification systems, pyrosequencing, ion- or pH-based detection systems, electronic signature-based systems, pore-based systems (e.g., nanopore), visualization-based systems, etc.
  • a base may refer to a single molecule of that base or to a plurality of the base, e.g., in a solution.
  • a “polynucleotide”, “nucleic acid”, or “oligonucleotide” refers to a linear polymer of nucleosides (including deoxyribonucleosides, ribonucleosides, or analogs thereof) joined by internucleosidic linkages.
  • a polynucleotide comprises at least three nucleosides.
  • oligonucleotides range in size from a few monomeric units, e.g., 3 to 4, to several hundreds of monomeric units.
  • when a polynucleotide such as an oligonucleotide is represented by a sequence of letters, such as "ATGCCTG", A denotes deoxyadenosine, C denotes deoxycytidine, G denotes deoxyguanosine, and T denotes thymidine, unless otherwise noted.
  • the letters A, C, G, and T may be used to refer to the bases themselves, to nucleosides, or to nucleotides comprising the bases, as is standard in the art.
  • dNTP refers to a deoxynucleotide triphosphate, where the nucleotide comprises a nucleotide base, such as A, T, C, G or U.
  • the term "monomer” as used herein means any compound that can be incorporated into a growing molecular chain by a given polymerase. Such monomers include, without limitation, naturally occurring nucleotides (e.g., ATP, GTP, TTP, UTP, CTP, dATP, dGTP, dTTP, dUTP, dCTP, synthetic analogs), precursors for each nucleotide, non-naturally occurring nucleotides and their precursors or any other molecule that can be incorporated into a growing polymer chain by a given polymerase.
  • “complementary” generally refers to specific nucleotide duplexing to form canonical Watson-Crick base pairs, as is understood by those skilled in the art.
  • complementary also includes base-pairing of nucleotide analogs that are capable of universal base-pairing with A, T, G or C nucleotides and locked nucleic acids that enhance the thermal stability of duplexes.
  • hybridization stringency is a determinant in the degree of match or mismatch in the duplex formed by hybridization.
  • moiety refers to one of two or more parts into which something may be divided, such as, for example, the various parts of a tether, a molecule, or a probe.
  • a "linker” is a molecule or moiety that joins two molecules or moieties and/or provides spacing between the two molecules or moieties such that they are able to function in their intended manner.
  • a linker can comprise a diamine hydrocarbon chain that is covalently bound through a reactive group on one end to an oligonucleotide analog molecule and through a reactive group on another end to a solid support, such as, for example, a bead surface.
  • Coupling of linkers to nucleotides and substrate constructs of interest can be accomplished through the use of coupling reagents that are known in the art (see, e.g., Efimov et al., Nucleic Acids Res.
  • a linker may also be cleavable (e.g., photocleavable) or reversible.
  • a "polymerase" is an enzyme generally for joining 3'-OH, 5'-triphosphate
  • Polymerases include, but are not limited to, DNA-dependent DNA polymerases, DNA-dependent RNA polymerases, RNA-dependent DNA polymerases, RNA-dependent RNA polymerases, T7 DNA polymerase, T3 DNA
  • KOD HiFi DNA polymerase (Novagen), KOD1 DNA polymerase, Q-beta replicase, terminal transferase, AMV reverse transcriptase, M-MLV reverse transcriptase, Phi6 reverse transcriptase, HIV-1 reverse transcriptase, novel polymerases discovered by bioprospecting, and polymerases cited in U.S. Pat. Appl. Pub. No. 2007/0048748 and in U.S. Pat. Nos. 6,329,178; 6,602,695; and 6,395,524.
  • polymerases include wild-type, mutant isoforms, and genetically engineered variants such as exo⁻ polymerases and other mutants, e.g., that tolerate modified (e.g., labeled) nucleotides and incorporate them into a strand of nucleic acid.
  • primer refers to an oligonucleotide, whether occurring naturally as in a purified restriction digest or produced synthetically, that is capable of acting as a point of initiation of synthesis when placed under conditions in which synthesis of a primer extension product that is complementary to a nucleic acid strand is induced, (e.g., in the presence of nucleotides and an inducing agent such as a polymerase and at a suitable temperature and pH).
  • the primer is preferably single stranded for maximum efficiency in amplification, but may alternatively be double stranded. If double stranded, the primer is first treated to separate its strands before being used to prepare extension products.
  • the primer is an oligodeoxyribonucleotide.
  • the primer must be sufficiently long to prime the synthesis of extension products in the presence of the inducing agent. The exact lengths of the primers depend on many factors, including temperature, source of primer, and the use of the method.
  • an "adaptor” is an oligonucleotide that is linked or is designed to be linked to a nucleic acid to introduce the nucleic acid into a sequencing workflow.
  • An adaptor may be single-stranded or double-stranded (e.g., a double-stranded DNA or a single-stranded DNA).
  • the term “adaptor” refers to the adaptor nucleic acid both in a state that is not linked to another nucleic acid and in a state that is linked to a nucleic acid.
  • the adaptor comprises a known sequence.
  • some embodiments of adaptors comprise a primer binding sequence for amplification of the nucleic acid and/or for binding of a sequencing primer.
  • Some adaptors comprise a sequence for hybridization of a complementary capture probe.
  • Some adaptors comprise a chemical or other moiety (e.g., a biotin moiety) for capture and/or immobilization to a solid support (e.g., comprising an avidin moiety).
  • Some embodiments of adaptors comprise a marker, index, barcode, tag, or other sequence by which the adaptor and a nucleic acid to which it is linked are identifiable.
  • Some adaptors comprise a universal sequence.
  • a universal sequence is a sequence shared by a plurality of adaptors that may otherwise have different sequences outside of the universal sequence.
  • a universal sequence provides a common primer binding site for a collection of nucleic acids from different target nucleic acids, e.g., that may comprise different barcodes.
  • Some embodiments of adaptors comprise a defined but unknown sequence.
  • some embodiments of adaptors comprise a degenerate sequence of a defined number of bases (e.g., a 1- to 20-base degenerate sequence). Such a sequence is defined even if each individual sequence is not known - such a sequence may nevertheless serve as an index, barcode, tag, etc. marking nucleic acid fragments from, e.g., the same target nucleic acid.
  • Some adaptors comprise a blunt end and some adaptors comprise an end with an overhang of one or more bases.
  • an adaptor comprises an azido moiety, e.g., the adaptor comprises an azido (e.g., an azido-methyl) moiety on its 5' end.
  • some embodiments are related to adaptors that are or that comprise a 5'-azido-modified oligonucleotide or a 5'-azido-methyl-modified oligonucleotide.
  • a unique index (a "marker" in some embodiments) is used to associate a fragment with the template nucleic acid from which it was produced.
  • a unique index is a unique sequence of synthetic nucleotides or a unique sequence of natural nucleotides that allows for easy identification of the target nucleic acid within a complicated collection of oligonucleotides (e.g., fragments) containing various sequences.
  • unique index identifiers are attached to nucleic acid fragments prior to attaching adaptor sequences.
  • unique index identifiers are contained within adaptor sequences such that the unique sequence is contained in the sequencing reads.
  • homologous fragments can be detected based upon the unique indices that are attached to each fragment, thus further providing for unambiguous reconstruction of a consensus sequence.
  • Homologous fragments may occur for example by chance due to genomic repeats, two fragments originating from homologous chromosomes, or fragments originating from overlapping locations on the same chromosome. Homologous fragments may also arise from closely related sequences (e.g., closely related gene family members, paralogs, orthologs, ohnologs, xenologs, and/or pseudogenes). Such fragments may be discarded to ensure that long fragment assembly can be computed unambiguously.
  • the markers may be attached as described above for the adaptor sequences.
  • the unique index (e.g., index identifier, tag, marker, etc.) is a "barcode".
  • barcode refers to a known nucleic acid sequence that allows some feature of a nucleic acid with which the barcode is associated to be identified.
  • the feature of the nucleic acid to be identified is the sample or source from which the nucleic acid is derived.
  • barcodes are at least 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, or more nucleotides in length. In some embodiments, barcodes are shorter than 10, 9, 8, 7, 6, 5, or 4 nucleotides in length.
  • barcodes associated with some nucleic acids are of a different length than barcodes associated with other nucleic acids.
  • barcodes are of sufficient length and comprise sequences that are sufficiently different to allow the identification of samples based on barcodes with which they are associated.
  • a barcode and the sample source with which it is associated can be identified accurately after the mutation, insertion, or deletion of one or more nucleotides in the barcode sequence, such as the mutation, insertion, or deletion of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more nucleotides.
  • each barcode in a plurality of barcodes differs from every other barcode in the plurality at two or more nucleotide positions, such as at 2, 3, 4, 5, 6, 7, 8, 9, 10, or more positions.
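The requirement that barcodes differ at two or more positions can be checked with a Hamming distance, and the same measure supports error-tolerant assignment of a sequenced tag to a barcode. This sketch handles substitutions only, not insertions or deletions, and the function names and the `max_mismatches` parameter are hypothetical.

```python
from itertools import combinations
from typing import Optional, Sequence


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(barcodes: Sequence[str]) -> int:
    """Smallest Hamming distance between any two barcodes in the set."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


def assign_barcode(read_tag: str, barcodes: Sequence[str],
                   max_mismatches: int = 1) -> Optional[str]:
    """Return the unique barcode within max_mismatches of read_tag, else None."""
    hits = [b for b in barcodes if hamming(read_tag, b) <= max_mismatches]
    return hits[0] if len(hits) == 1 else None
```

A set whose minimum pairwise distance is d can correct up to (d − 1) // 2 substitutions without ambiguity, so a set with distance 3 tolerates one mismatch per barcode.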
  • one or more adaptors comprise(s) at least one of a plurality of barcode sequences.
  • methods of the technology further comprise identifying the sample or source from which a target nucleic acid is derived based on a barcode sequence to which the target nucleic acid is joined.
  • methods of the technology further comprise identifying the target nucleic acid based on a barcode sequence to which the target nucleic acid is joined.
  • Some embodiments of the method further comprise identifying a source or sample of the target nucleotide sequence by determining a barcode nucleotide sequence. Some embodiments of the method further comprise molecular counting applications (e.g., digital barcode enumeration and/or binning) to determine expression levels or copy number status of desired targets.
  • a barcode may comprise a nucleic acid sequence that when joined to a target nucleic acid serves as an identifier of the sample from which the target polynucleotide was derived.
  • an oligonucleotide such as a primer, adaptor, etc. comprises a
  • a universal sequence is a known sequence, e.g., for use as a primer or probe binding site using a primer or probe of a known sequence (e.g., complementary to the universal sequence). While a template-specific sequence of a primer, a barcode sequence of a primer, and/or a barcode sequence of an adaptor might differ in embodiments of the technology, e.g., from fragment to fragment, from sample to sample, from source to source, or from region of interest to region of interest, embodiments of the technology provide that a universal sequence is the same from fragment to fragment, from sample to sample, from source to source, or from region of interest to region of interest so that all fragments comprising the universal sequence can be handled and/or treated in a same or similar manner, e.g., amplified, identified, sequenced, isolated, etc., using similar methods or techniques (e.g., using the same primer or probe).
  • a “system” denotes a set of components, real or abstract, comprising a whole or for performing a desired objective.
  • Figure 1 shows identification of known piRNA in mouse testes from high throughput sequencing data. Identification of a known piRNA on Chromosome 2 (chr2, position
  • Figure 2 shows data demonstrating that high throughput sequencing produces characteristic peaks for known and novel smRNAs.
  • Each type of smRNA has a characteristic profile after high throughput sequencing and alignment to the genome.
  • rRNA and siRNA result in clusters that are similar to gene models; tRNA and snoRNA make several peaks that are very close to one another; miRNA makes one predominant peak and a smaller peak proximal to it, which is the star sequence; piRNA makes a larger cluster, like rRNA and siRNA, with one to several highly amplified peaks.
  • Figure 3 shows that deconvolution of smRNA sequence identifies minor peaks.
  • the system identifies all putative peaks based on the underlying reads.
  • three distinct smRNA species (0; 1; 2) contribute to one larger peak alongside reads attributable to background noise but are de-convoluted in the system.
  • RNA clusters which employ a Maximum Likelihood Estimation to infer the sequence of all mature smRNAs from high-throughput sequencing data.
  • the systems and methods predict amplified smRNA peaks de novo from aligned sequencing data instead of relying on databases, which can be incomplete or inaccurate.
  • a calling algorithm determines the best-fit location for each mature smRNA; smRNAs include, but are not limited to, piRNAs, miRNAs, snoRNAs, and tRNAs, each class of which forms clusters above background (Figure 2).
  • the results are processed into a file (e.g., generic feature format version 3 (GFF3)), which is used to filter the smRNAs of interest or compare expression between groups.
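Writing called loci to GFF3 can be sketched as below. The nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes) follow the GFF3 specification; the function name, the source label `smRNA_caller`, the feature type `ncRNA`, and the ID scheme are placeholders chosen for this example.

```python
from typing import Iterable, List, Tuple


def to_gff3(loci: Iterable[Tuple[str, int, int, float, str]],
            source: str = "smRNA_caller") -> List[str]:
    """Format (chrom, start, end, score, strand) tuples as GFF3 lines."""
    lines = ["##gff-version 3"]
    for i, (chrom, start, end, score, strand) in enumerate(loci, 1):
        attributes = f"ID=smRNA_{i}"
        lines.append("\t".join([
            chrom, source, "ncRNA",     # seqid, source, feature type
            str(start), str(end),       # 1-based inclusive coordinates
            f"{score:.3g}", strand,     # score and strand
            ".",                        # phase does not apply to non-coding features
            attributes,
        ]))
    return lines
```

The resulting lines can be written to a file and loaded by standard genome browsers or interval tools to filter smRNAs of interest or compare expression between groups.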
  • samples comprising smRNAs are obtained and nucleic acid from the sample is sequenced.
  • samples include various tissues or fluid samples.
  • the sample is a bodily fluid sample from the subject.
  • the sample is an aqueous or a gaseous sample. In some embodiments, the sample is a gel. In some embodiments, the sample includes one or more fluid components. In some embodiments, solid or semi-solid samples are provided. In some embodiments, the sample comprises tissue collected from a subject. In some embodiments, the sample comprises a bodily fluid, secretion, and/or tissue of a subject. In some embodiments, the sample is a biological sample. In some embodiments, the biological sample is a bodily fluid, a secretion, and/or a tissue sample.
  • biological samples include, but are not limited to, blood, serum, saliva, urine, gastric and digestive fluid, tears, stool, semen, vaginal fluid, interstitial fluids derived from tumorous tissue, ocular fluids, sweat, mucus, earwax, oil, glandular secretions, breath, spinal fluid, hair, fingernails, skin cells, plasma, nasal swab or nasopharyngeal wash, cerebral spinal fluid, tissue, throat swab, biopsy, placental fluid, amniotic fluid, cord blood, lymphatic fluids, cavity fluids, sputum, pus, microbiota, meconium, breast milk, and/or other excretions.
  • the sample is provided from a human or an animal, e.g., in some embodiments the sample is provided from a mammal (e.g., a vertebrate) such as a murine, simian, human, farm animal, sport animal, or pet. In some embodiments, the sample is collected from a living subject and in some embodiments the sample is collected from a dead subject.
  • the sample is collected fresh from a subject and in some embodiments the sample has undergone some form of pre-processing, storage, or transport.
  • the sample is a formalin or formaldehyde fixed paraffin embedded (FFPE) sample.
  • FFPE samples e.g., FFPE tissue samples
  • the clinical utility of FFPE samples is substantial, where
  • retrospective analysis of archival tissue enables the correlation of molecular findings with the response to treatment and the clinical outcome.
  • a subject provides a sample and/or the sample may be collected from a subject.
  • the subject is a patient, clinical subject, or pre-clinical subject.
  • the subject is undergoing diagnosis, treatment, and/or disease management or lifestyle or preventative care.
  • the subject may or may not be under the care of a health care professional.
  • the sample is an environmental sample.
  • environmental samples include, but are not limited to, air samples, water samples, soil samples, biofilm samples, or plant samples. Additional samples include food products, beverages, manufacturing materials, textiles, chemicals, biologics, therapies, or any other samples.
  • samples are processed in a microfluidic device to isolate and/or sequence nucleic acid.
  • sample preparation operations are performed upon the sample to prepare the sequencing library.
  • sample preparation operations include such manipulations as cell lysis, extraction of nucleic acids from whole cell samples (e.g., cell lysates, viruses, and the like), transcription of nucleic acids, amplification of nucleic acids, fragmentation of nucleic acids, labeling, adaptor ligation, quantification, size selection, and/or extension reactions.
  • a nucleic acid sequencing library is prepared from an input sample comprising whole cells (e.g., eukaryotic, bacteria, archaea), viruses, environmental samples, and/or tissue. Accordingly, in some embodiments, nucleic acids are obtained from the cells or viruses prior to continuing with other various sample preparation operations. For example, in some embodiments, following sample collection, the collected cells, viral coat, etc., are treated to prepare a crude extract (e.g., a cell lysate), followed by additional treatments to prepare the sample for subsequent operations, e.g., denaturation of
  • denaturation of nucleic acid binding proteins may generally be performed by physical or chemical methods.
  • chemical methods generally employ lysing agents (e.g., detergents or solvents) to disrupt the cells and liberate the cellular contents, followed by treatment of the extract with chaotropic salts such as guanidinium isothiocyanate or urea to denature any contaminating and potentially interfering proteins.
  • cell lysis and denaturing of contaminating proteins are carried out by applying an alternating electrical current to the sample.
  • the sample of cells is flowed through a channel or multiple channels while an alternating electric current is applied across the fluid flow.
  • a variety of other methods is utilized within the microfluidic cartridge to effect cell lysis/extraction, including, e.g., subjecting cells to ultrasonic agitation or forcing cells through microgeometry apertures, thereby subjecting the cells to high shear stress resulting in rupture.
  • nucleic acids are extracted from the cells, viruses, and/or cell lysate prior to continuing with other various sample preparation operations. Accordingly, embodiments provide for the separation of nucleic acids from other elements of a sample or crude extract, e.g., denatured proteins, cell membrane particles, and the like, after cell lysis. Removal of particulate matter is generally accomplished by filtration, flocculation, or the like.
  • the nucleic acid is subjected to one or more preparative reactions (e.g., following sample collection, lysis, and/or nucleic acid extraction).
  • exemplary embodiments are associated with preparative reactions that include in vitro transcription, labeling, fragmentation, amplification, and other reactions.
  • Nucleic acid amplification increases the number of copies of the target nucleic acid sequence of interest.
  • a variety of amplification methods are suitable for use in the methods and device described herein, including, e.g., the polymerase chain reaction (PCR), the ligase chain reaction (LCR), self-sustained sequence replication (3SR), and nucleic acid based sequence amplification (NASBA).
  • nucleic acids are labeled, e.g., to facilitate subsequent steps and/or to provide for detection of the nucleic acids.
  • labeling is performed during amplification.
  • amplification incorporates a label into the amplified nucleic acid either through the use of labeled primers or the incorporation of labeled dNTPs into the amplified nucleic acid.
  • the nucleic acids are labeled following amplification.
  • Post-amplification labeling typically involves the covalent attachment of a particular detectable group to the amplified nucleic acids.
  • Suitable labels or detectable groups include a variety of fluorescent or radioactive labeling groups well known in the art. These labels are coupled in various embodiments to the sequences using methods that are well known in the art. See, e.g., Sambrook and Russell, Molecular Cloning, A Laboratory Manual, third edition.
  • the label comprises, but is not limited to, one or more fluorescent labels (including, but not limited to, FITC, PE, Texas Red, SYBR Green, JOE, FAM, HEX, TAMRA, ROX, Alexa 488, Alexa 532, Alexa 546, Alexa 405, or other fluorescent moieties and dyes), radioactive labels (including, but not limited to, ³²P, ³H, or ¹⁴C), fluorescent proteins (including, but not limited to, GFP, RFP, or YFP), bioluminescent proteins (e.g., luciferase), quantum dots, gold particles, silver particles, biotin, beads
  • nucleic acids are subjected to other treatments.
  • embodiments are related to fragmenting nucleic acids prior to subsequent steps. Fragmentation of nucleic acids may generally be carried out by physical, chemical, or enzymatic methods that are known in the art. These treatments may be performed within the amplification chamber, or alternatively, may be carried out in a separate chamber.
  • physical fragmentation methods include but are not limited to moving the sample containing the nucleic acid over pits or spikes in the surface of a reaction chamber or fluid channel. The motion of the fluid sample, in combination with the surface
  • nucleic acid is purified and size selected e.g., to prepare a collection of nucleic acid fragments or a nucleic acid sequencing library (e.g., for NGS, e.g., an NGS library).
  • a binding buffer and beads are added to the nucleic acid to bind nucleic acids having a specified range of fragment sizes.
  • the binding buffer is formulated to promote the binding of nucleic acids having a specified range of fragment sizes to the beads.
  • the sample comprises nucleic acids having a number of different types of ends, e.g., blunt ends, 3' overhangs, 5' overhangs, and incorrect 3' and/or 5' phosphorylation states. Accordingly, for efficient adaptor ligation in subsequent steps, the different types of nucleic acid ends are converted to a similar end type with 3' overhangs removed, 5' overhangs filled in, and the appropriate phosphorylation state produced on both the 3' and 5' ends. In some embodiments, an untemplated A is added, e.g., for a T-A ligation scheme.
  • Ligating an adaptor to a nucleic acid fragment is performed using a ligation reaction, which is a biochemical reaction in which an enzyme (e.g., a ligase) forms a chemical link between a nucleic acid fragment and an adaptor.
  • Ligation methods are known in the art and utilize standard methods (Sambrook and Russell, Molecular Cloning, A Laboratory Manual, third edition). Such methods utilize ligase enzymes such as DNA ligase to effect or catalyze joining of the ends of two polynucleotide strands by forming a covalent linkage.
  • an adaptor comprises a 5'-phosphate moiety to facilitate ligation to the nucleic acid fragment 3'-OH.
  • a nucleic acid fragment comprises a 5'-phosphate moiety, either residually from the shearing process, or added using an enzymatic treatment step.
  • a nucleic acid fragment has been end repaired and, optionally, extended to produce an overhanging base or bases, to give a 3'-OH suitable for ligation.
  • joining means covalent linkage of polynucleotide strands which were not previously covalently linked.
  • joining comprises formation of a phosphodiester linkage between the two polynucleotide strands, but other means of covalent linkage (e.g. non-phosphodiester backbone linkages) may be used.
  • Many ligation methods utilize either a blunt or TA-mediated ligation.
  • the ligase is a T4 ligase, a variant of T4 DNA ligase, an E. coli ligase, etc.
  • DNA ligase refers to a family of enzymes that catalyze the formation of a covalent phosphodiester bond between two distinct DNA strands, e.g., in a "ligation reaction". While in some embodiments T4 DNA ligase (isolated from the T4 phage) and DNA ligase from E. coli find use in the technology described herein, the technology is not limited by the ligase that is used to perform the ligation reaction. Any enzyme with DNA ligase activity is contemplated by the technology.
  • an oligonucleotide adapter is ligated onto a nucleic acid fragment as a step to prepare a sequencing library for sequencing (e.g., NGS).
  • an end polishing step is performed on nucleic acid fragments to create blunt ends on the nucleic acid fragments.
  • an enzymatic reaction (e.g., a PCR, terminal transferase reaction, or Klenow exo minus polymerase reaction) adds an untemplated A to the ends of nucleic acids.
  • ligation reactions comprise ligating adaptors with blunt ends and some embodiments of ligation reactions comprise ligating adaptors with overhangs, e.g., with T overhangs that are complementary to the A overhang on the nucleic acid (A-T mediated ligation).
  • the nucleic acids are fragmented nucleic acids and in some embodiments the nucleic acids are size selected nucleic acids.
  • the adaptors comprise an index, barcode, or key that serves to identify the sequencing library after sequencing.
  • the adaptors comprise PCR and/or sequencing priming sites; in some embodiments, the adaptors comprise blunt 3' ends and 5' overhangs; in some embodiments, an adaptor of a pair of adaptors comprises a biotin on the 5' end.
  • each ligated product comprises in various embodiments one or more of (e.g., in various combinations), e.g., a PCR priming site; a sequencing primer site; an index, barcode, or key; a nucleic acid to be sequenced (e.g., a nucleic acid fragment produced from the sample); and a second end comprising a PCR priming site; a sequencing primer site; and an index, barcode, or key.
  • Nucleic acid sequencing can be by any method.
  • high throughput next-generation sequencing is employed to sequence, in parallel, all smRNA from a sample.
  • a sequencing library is generated prior to sequencing. Exemplary sequencing methods are described below.
  • nucleic acid sequencing techniques include, but are not limited to, chain terminator (Sanger) sequencing and dye terminator sequencing.
  • Chain terminator sequencing uses sequence-specific termination of a DNA synthesis reaction using modified nucleotide substrates. Extension is initiated at a specific site on the template DNA by using a short radioactive, or other labeled, oligonucleotide primer complementary to the template at that region.
  • the oligonucleotide primer is extended using a DNA polymerase, standard four deoxynucleotide bases, and a low concentration of one chain terminating nucleotide, most commonly a di-deoxynucleotide.
  • This reaction is repeated in four separate tubes with each of the bases taking turns as the di-deoxynucleotide. Limited incorporation of the chain terminating nucleotide by the DNA polymerase results in a series of related DNA fragments that are terminated only at positions where that particular di-deoxynucleotide is used.
  • the fragments are size-separated by electrophoresis in a slab polyacrylamide gel or a capillary tube filled with a viscous polymer. The sequence is determined by reading which lane produces a visualized mark from the labeled primer as you scan from the top of the gel to the bottom.
  • Dye terminator sequencing alternatively labels the terminators. Complete sequencing can be performed in a single reaction by labeling each of the di-deoxynucleotide chain-terminators with a separate fluorescent dye, which fluoresces at a different wavelength.
  • a set of methods referred to as "next-generation sequencing" techniques have emerged as alternatives to Sanger and dye-terminator sequencing methods (Voelkerding et al., Clinical Chem., 55: 641-658, 2009; MacLean et al., Nature Rev. Microbiol., 7: 287-296; each herein incorporated by reference in their entirety). Most current methods describe the use of next-generation sequencing technology for de novo sequencing of whole genomes to determine the primary nucleic acid sequence of an organism.
  • NGS technology produces large amounts of sequencing data points.
  • a typical run can easily generate tens to hundreds of megabases per run, with a potential daily output reaching into the gigabase range. This translates to several orders of magnitude greater than a standard 96-well plate, which can generate several hundred data points in a typical multiplex run.
  • Target amplicons that differ by as little as one nucleotide can easily be distinguished, even when multiple targets from related species are present. This greatly enhances the ability to do accurate genotyping.
  • NGS alignment software programs used to produce consensus sequences can easily identify novel point mutations, which could result in new strains with associated drug resistance.
  • the use of primer bar coding also allows multiplexing of different patient samples within a single sequencing run.
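As a minimal sketch of how primer bar coding supports multiplexing, the following Python groups reads by a leading barcode. The barcode sequences, the sample names, and the assumption that the barcode occupies the first bases of each read are hypothetical and serve only for illustration.

```python
from collections import defaultdict

def demultiplex(reads, barcode_to_sample, barcode_length):
    """Group reads by the barcode found at the start of each read.

    The barcode is trimmed from assigned reads. Reads whose prefix matches
    no known barcode are kept untrimmed under the key "unassigned".
    """
    by_sample = defaultdict(list)
    for read in reads:
        barcode = read[:barcode_length]
        sample = barcode_to_sample.get(barcode)
        if sample is None:
            by_sample["unassigned"].append(read)
        else:
            by_sample[sample].append(read[barcode_length:])
    return dict(by_sample)

# Hypothetical barcodes and reads, for illustration only.
barcodes = {"ACGT": "patient_1", "TGCA": "patient_2"}
reads = ["ACGTTTGACC", "TGCAGGATCA", "ACGTCCATGA", "GGGGAAAAAA"]
demultiplexed = demultiplex(reads, barcodes, barcode_length=4)
```

Exact prefix matching is the simplest possible rule; a production demultiplexer would usually tolerate a mismatch in the barcode, which this sketch does not attempt.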
  • NGS methods share the common feature of massively parallel, high-throughput strategies, with the goal of lower costs in comparison to older sequencing methods. NGS methods can be broadly divided into those that require template amplification and those that do not. Amplification-requiring methods include pyrosequencing commercialized by Roche as the 454 technology platforms and the Solexa platform commercialized by Illumina. Non-amplification approaches, also known as single-molecule sequencing, include the HeliScope platform commercialized by Helicos BioSciences and the single molecule real time sequencing (also known as SMRT) technologies developed by Pacific Biosciences.
  • template DNA is fragmented, end-repaired, ligated to adaptors, and clonally amplified in-situ by capturing single template molecules with beads bearing oligonucleotides complementary to the adaptors.
  • Each bead bearing a single template type is compartmentalized into a water-in-oil microvesicle, and the template is clonally amplified using a technique referred to as emulsion PCR.
  • the emulsion is disrupted after amplification and beads are deposited into individual wells of a picotitre plate functioning as a flow cell during the sequencing reactions. Ordered, iterative
  • each of the four dNTP reagents occurs in the flow cell in the presence of sequencing enzymes and luminescent reporter such as luciferase.
  • the resulting production of ATP causes a burst of luminescence within the well, which is recorded using a CCD camera. It is possible to achieve read lengths greater than or equal to 400 bases, and 1 × 10⁶ sequence reads can be achieved, resulting in up to 500 million base pairs (Mb) of sequence.
  • sequencing data are produced in the form of shorter-length reads.
  • single-stranded fragmented DNA is end-repaired to generate 5'-phosphorylated blunt ends, followed by Klenow-mediated addition of a single A base to the 3' end of the fragments.
  • A-addition facilitates addition of T-overhang adaptor oligonucleotides, which are subsequently used to capture the template-adaptor molecules on the surface of a flow cell that is studded with oligonucleotide anchors.
  • the anchor is used as a PCR primer, but because of the length of the template and its proximity to other nearby anchor oligonucleotides, extension by PCR results in the "arching over" of the molecule to hybridize with an adjacent anchor oligonucleotide to form a bridge structure on the surface of the flow cell.
  • These loops of DNA are denatured and cleaved. Forward strands are then sequenced with reversible dye terminators.
  • sequence of incorporated nucleotides is determined by detection of post-incorporation fluorescence, with each fluor and block removed prior to the next cycle of dNTP addition. Sequence read length ranges from 36 nucleotides to over 50 nucleotides, with overall output exceeding 1 billion nucleotide pairs per analytical run.
  • Sequencing reactions are performed using immobilized template, modified phi29 DNA polymerase, and high local concentrations of fluorescently labeled dNTPs. High local concentrations and continuous reaction conditions allow incorporation events to be captured in real time by fluor signal detection using laser excitation, an optical waveguide, and a CCD camera.
  • single molecule real time (SMRT) DNA sequencing methods using zero-mode waveguides (ZMWs) developed by Pacific Biosciences, or similar methods are employed. With this technology, DNA sequencing is performed on SMRT chips, each containing thousands of zero-mode waveguides (ZMWs).
  • a ZMW is a hole, tens of nanometers in diameter, fabricated in a 100 nm metal film deposited on a silicon dioxide substrate.
  • Each ZMW becomes a nanophotonic visualization chamber providing a detection volume of just 20 zeptoliters (20 × 10⁻²¹ liters). At this volume, the activity of a single molecule can be detected amongst a background of thousands of labeled nucleotides.
  • the ZMW provides a window for watching DNA polymerase as it performs sequencing by synthesis.
  • a single DNA polymerase molecule is attached to the bottom surface such that it permanently resides within the detection volume.
  • Phospholinked nucleotides, each type labeled with a different colored fluorophore, are then introduced into the reaction solution at high concentrations which promote enzyme speed, accuracy, and processivity. Due to the small size of the ZMW, even at these high,
  • the detection volume is occupied by nucleotides only a small fraction of the time.
  • visits to the detection volume are fast, lasting only a few microseconds, due to the very small distance that diffusion has to carry the nucleotides.
  • the result is a very low background.
  • each base is held within the detection volume for tens of milliseconds, which is orders of magnitude longer than the amount of time it takes a nucleotide to diffuse in and out of the detection volume.
  • the engaged fluorophore emits fluorescent light whose color corresponds to the base identity.
  • the polymerase cleaves the bond holding the fluorophore in place and the dye diffuses out of the detection volume. Following incorporation, the signal immediately returns to baseline and the process repeats. Unhampered and uninterrupted, the DNA polymerase continues incorporating bases at a speed of tens per second. In this way, a completely natural long chain of DNA is produced in minutes. Simultaneous and continuous detection occurs across all of the thousands of ZMWs on the SMRT chip in real time.
  • PacBio have demonstrated this approach has the capability to produce reads thousands of nucleotides in length.
  • nanopore sequencing is employed (see, e.g., Astier et al., J. Am. Chem. Soc. 2006 Feb. 8; 128(5): 1705-10, herein incorporated by reference).
  • the theory behind nanopore sequencing has to do with what occurs when the nanopore is immersed in a conducting fluid and a potential (voltage) is applied across it - under these conditions a slight electric current due to conduction of ions through the nanopore can be observed, and the amount of current is exceedingly sensitive to the size of the nanopore. If DNA molecules pass (or part of the DNA molecule passes) through the nanopore, this can create a change in the magnitude of the current through the nanopore, thereby allowing the sequences of the DNA molecule to be determined.
  • the nanopore may be a solid-state pore fabricated on a metal and/or nonmetal surface, or a protein-based nanopore, such as alpha-hemolysin (Clarke et al., Nat. Nanotech., 4, Feb. 22, 2009: 265-270).
  • HeliScope by Helicos Biosciences (Voelkerding et al., Clinical Chem., 55: 641-658, 2009; MacLean et al., Nature Rev. Microbiol., 7: 287-296; U.S. Pat. No. 7,169,560; U.S. Pat. No. 7,282,337; U.S. Pat. No. 7,482,120; U.S. Pat. No. 7,501,245; U.S. Pat. No. 6,818,395; U.S. Pat. No. 6,911,345; each herein incorporated by reference in their entirety) is the first commercialized single-molecule sequencing platform.
  • Template DNA is fragmented and polyadenylated at the 3' end, with the final adenosine bearing a fluorescent label.
  • Denatured polyadenylated template fragments are ligated to poly(dT) oligonucleotides on the surface of a flow cell.
  • Initial physical locations of captured template molecules are recorded by a CCD camera, and then label is cleaved and washed away.
  • Sequencing is achieved by addition of polymerase and serial addition of fluorescently-labeled dNTP reagents. Incorporation events result in fluor signal corresponding to the dNTP, and signal is captured by a CCD camera before each round of dNTP addition.
  • Sequence read length ranges from 25-50 nucleotides, with overall output exceeding 1 billion nucleotide pairs per analytical run.
  • Other single molecule sequencing methods include real-time sequencing by synthesis using a VisiGen platform (Voelkerding et al., Clinical Chem., 55: 641-658, 2009; U.S. Pat. No. 7,329,492; U.S. patent application Ser. No. 11/671,956; U.S. patent application Ser. No.
  • sequencing data are produced. Following the production of sequencing data, the sequencing data are reported to a data analysis operation in some embodiments. To facilitate data analysis in some embodiments, the sequencing data are analyzed by a digital computer. In some embodiments, the computer is appropriately programmed for receipt and storage of the sequencing data and for analysis and reporting of the sequencing data gathered, e.g., to provide a nucleic acid sequence in a human or machine readable format.
  • the methods and systems described herein are associated with a programmable machine designed to perform a sequence of arithmetic or logical operations as provided by the methods described herein.
  • some embodiments of the technology are associated with (e.g., implemented in) computer software and/or computer hardware.
  • the technology relates to a computer comprising a form of memory, an element for performing arithmetic and logical operations, and a processing element (e.g., a processor or a microprocessor) for executing a series of instructions (e.g., a method as provided herein) to read, manipulate, and store data.
  • Some embodiments comprise one or more processors.
  • a microprocessor is part of a system comprising one or more of a CPU, a graphics card, a user interface (e.g., comprising an output device such as a display and an input device such as a keyboard), a storage medium, and memory components.
  • Memory components (e.g., volatile and/or nonvolatile memory) find use in storing
  • Programmable machines associated with the technology comprise conventional extant technologies and technologies in development or yet to be developed (e.g., a quantum computer, a chemical computer, a DNA computer, an optical computer, a spintronics based computer, etc.).
  • Some embodiments provide a computer that includes a computer-readable medium.
  • the embodiment includes a random access memory (RAM) coupled to a processor.
  • the processor executes computer-executable program instructions stored in memory.
  • processors may include a microprocessor, an ASIC, a state machine, or other processor, and can be any of a number of computer processors, such as processors from Intel Corporation of Santa Clara, California and Motorola Corporation of Schaumburg, Illinois.
  • Such processors include, or may be in communication with, media, for example computer-readable media, which store instructions that, when executed by the processor, cause the processor to perform the steps described herein.
  • Embodiments of computer-readable media include, but are not limited to, an electronic, optical, magnetic, or other storage or transmission device capable of providing a processor, such as the processor of client, with computer-readable instructions.
  • suitable media include, but are not limited to, a floppy disk, CD-ROM, DVD, magnetic disk, memory chip, ROM, RAM, an ASIC, a configured processor, all optical media, all magnetic tape or other magnetic media, or any other medium from which a computer processor can read instructions.
  • various other forms of computer-readable media may transmit or carry instructions to a computer, including a router, private or public network, or other transmission device or channel, both wired and wireless.
  • the instructions may comprise code from any suitable computer-programming language, including, for example, C, C++, C#, Visual Basic, Java, Python, Perl, Swift, Ruby, Unix, and JavaScript.
  • Computers are connected in some embodiments to a network or, in some
  • Computers may also include a number of external or internal devices such as a mouse, a CD-ROM, DVD, a keyboard, a display, or other input or output devices.
  • Examples of computers are personal computers, digital assistants, personal digital assistants, cellular phones, mobile phones, smart phones, pagers, digital tablets, laptop computers, internet appliances, and other processor-based devices.
  • the computer related to aspects of the technology provided herein may be any type of processor-based platform that operates on any operating system, such as Microsoft Windows, Linux, UNIX, Mac OS X, etc., capable of supporting one or more programs comprising the technology provided herein. All such components, computers, and systems described herein as associated with the technology may be logical or virtual.
  • a system comprises one or more or all of: a sample collection component, a sample lysis component, a nucleic acid isolation component, a nucleic acid amplification component, a nucleic acid sequencing component, and a data analysis and reporting component.
  • the systems and methods predict amplified smRNA peaks de novo from size-selected high-throughput sequencing data by applying successive statistical models to identify coherently overexpressed peaks relative to background.
  • this process comprises or consists of three steps that identify, call, and filter putative smRNA loci, which are discussed separately in detail. All of these steps are linked together to allow for separate processes to easily communicate with each other, and for downstream computation to begin before the previous steps have completely finished.
  Identification
  • a sequence data file is input into a computer processor of the system.
  • the data file comprises sorted smRNA short-read alignment data from size selected sequencing libraries.
  • an input to the system is a file of sorted smRNA short-read alignment data in SAM format from size selected sequencing libraries.
  • the SAM Format is a text format for storing sequence data in a series of tab delimited ASCII columns. Most often it is generated as a human readable version of its sister BAM format, which stores the same data in a compressed, indexed, binary form.
  • most SAM format data is output from aligners that read FASTQ files and assign the sequences to a position with respect to a known reference genome.
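To make the tab-delimited layout concrete, the following is a minimal Python sketch that reads one SAM alignment line into the coordinates used downstream. The `Alignment` tuple and the `parse_sam_line` helper are names introduced here for illustration, and the example record is invented. The field order, the FLAG bits (0x4 unmapped, 0x10 reverse strand), and the reference-consuming CIGAR operations follow the SAM specification.

```python
import re
from collections import namedtuple

# Hypothetical container for the fields needed by later steps.
Alignment = namedtuple("Alignment", "name flag chrom start end strand")

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference

def parse_sam_line(line):
    """Parse one SAM line; return None for header lines and unmapped reads."""
    if line.startswith("@"):
        return None
    fields = line.rstrip("\n").split("\t")
    name, flag, chrom = fields[0], int(fields[1]), fields[2]
    pos, cigar = int(fields[3]), fields[5]
    if flag & 0x4 or cigar == "*":
        return None
    ref_len = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)
    strand = "-" if flag & 0x10 else "+"
    # POS is 1-based; the end coordinate is inclusive.
    return Alignment(name, flag, chrom, pos, pos + ref_len - 1, strand)

# Invented 22-nt read aligned to the reverse strand of chr1.
example = "read1\t16\tchr1\t1000\t255\t22M\t*\t0\t0\t" + "A" * 22 + "\t" + "I" * 22
aln = parse_sam_line(example)
```

For BAM input, a library reader would replace the text parsing; the coordinate logic stays the same.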
  • resource utilization has been optimized to handle datasets using less than 8 GB of memory, which is typically in the range of 10 to 30 million (M) aligned reads per
  • the system streams in the input to identify small putative regions on the order of 200 bp. Upon identification of these smRNA loci, they are streamed off to the calling step to be scanned for mature amplified smRNAs.
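The streaming hand-off between steps can be sketched with Python generators, in which each stage consumes items as soon as the previous stage produces them. This is only an illustration of the linkage, not the described implementation: the fixed 200 bp windows, the function names, and the placeholder logic inside the calling and filtering stages are all assumptions.

```python
def identify(alignments, window=200):
    """Yield (window_start, reads) as soon as each window is complete.

    Alignments are (start, end) tuples sorted by start, as in a sorted input file.
    """
    current_start, bucket = None, []
    for start, end in alignments:
        window_start = (start // window) * window
        if current_start is not None and window_start != current_start:
            yield current_start, bucket
            bucket = []
        current_start = window_start
        bucket.append((start, end))
    if bucket:
        yield current_start, bucket

def call(loci):
    """Placeholder calling stage: report the span covered by each locus's reads."""
    for _window_start, reads in loci:
        yield min(s for s, _ in reads), max(e for _, e in reads), len(reads)

def keep(calls, min_reads=2):
    """Placeholder filtering stage: drop calls with too few supporting reads."""
    for start, end, count in calls:
        if count >= min_reads:
            yield start, end, count

# Invented, position-sorted alignments.
alignments = [(105, 126), (106, 127), (110, 131), (450, 471), (820, 841), (822, 843)]
results = list(keep(call(identify(alignments))))
```

Because every stage is lazy, the first locus reaches the filter before the last alignment has been read, which mirrors the idea of downstream computation starting early. A multi-threaded design would use queues between stages instead.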
  • Putative smRNA loci identification proceeds by extracting regions that have statistically overrepresented numbers of smRNA alignments against a null model in which all of the sequenced smRNAs are uniformly distributed across the genome.
  • the test statistic is the total count of aligned reads within a given interval and is binomially distributed. Hence, the corresponding p-value can be unambiguously calculated from the binomial cumulative distribution function.
  • a user settable cutoff is then applied to filter out regions of low interest, preventing them from being analyzed during the more computationally expensive calling stage. If the locus passes the cutoff, reads aligning to that region (e.g., all of the reads aligning to that region) are forwarded on to the calling step. In some embodiments, if the locus passes the cutoff, reads aligning upstream and/or downstream of that region (e.g., 50 base pairs, 100 base pairs, 200 base pairs, 300 base pairs, etc.) are forwarded on to the calling step.
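The identification test can be sketched in Python as follows. It assumes that, under the uniform null model, each of the N aligned reads falls in a window of length w on a genome of length G with probability w/G, so the window count is Binomial(N, w/G) and the p-value is the upper tail at the observed count. The function names, the example numbers, the cutoff value, and the 100 bp flank are hypothetical.

```python
import math

def log_binom_pmf(k, n, p):
    """Log of the binomial probability mass function, via log-gamma (0 < p < 1)."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))

def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed directly over the upper tail."""
    if k <= 0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        term = math.exp(log_binom_pmf(i, n, p))
        total += term
        # Past the mean the terms only shrink, so stop once they are negligible.
        if i > n * p and term <= total * 1e-16:
            break
    return min(total, 1.0)

def window_p_value(reads_in_window, total_reads, window_length, genome_length):
    """P-value of a window's read count under the uniform null model."""
    return binom_upper_tail(reads_in_window, total_reads, window_length / genome_length)

def flanked_region(start, end, flank, chrom_length):
    """Extend a passing region by `flank` bases per side, clipped to the chromosome."""
    return max(1, start - flank), min(chrom_length, end + flank)

# Invented numbers: 5 reads in a 200 bp window, 1000 reads on a 1 Mb genome.
p_value = window_p_value(5, 1000, 200, 1_000_000)
cutoff = 1e-3  # stands in for the user-settable cutoff
passes = p_value <= cutoff
region = flanked_region(10_000, 10_199, flank=100, chrom_length=50_000)
```

Summing the upper tail directly, rather than subtracting the lower tail from one, keeps very small p-values from being lost to rounding.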
  • a system comprises a calling algorithm that fits a simple model of smRNA amplification to the observed distribution of smRNA alignments to predict putative mature smRNA loci. Algorithmically, this is done via Maximum Likelihood Estimation, which optimizes a family of parameterized models to identify the parameters that best explain the given observations (e.g., predicts the most likely model).
  • M and R are m×2 and n×2 dimensional matrices of model and read start and stop positions, respectively, with m and n being the number of model mature smRNA loci and the total number of reads, respectively.
  • Epsilon is a constant probability used during derivation of the likelihood function to model single base loss or gain due to sequencing error. Being a constant term in the final log likelihood it can be dropped from the model during
  • the log likelihood function is also monotonically increasing as a function of the dimension of the parameter space, because adding an additional locus will always increase the likelihood by reducing the distance of one or more reads to a model locus. This problem is solved by iteratively increasing the number of model loci to fit.
  • This process stops when an optimal number has been identified via a heuristic.
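The likelihood function itself is not reproduced in this text, so the following Python sketch uses a hypothetical distance-based stand-in: each read is scored against its closest model locus by the summed start and stop offsets. It is meant only to illustrate why adding a locus can never lower such an objective, and therefore why the number of loci has to be chosen by a separate rule. The function name, the penalty, and the example coordinates are assumptions.

```python
def toy_log_likelihood(reads, loci, penalty=1.0):
    """Sum over reads of the score against the closest model locus.

    reads and loci are lists of (start, stop) pairs, i.e. the rows of R and M.
    This is a stand-in objective, not the likelihood of the described method.
    """
    total = 0.0
    for r_start, r_stop in reads:
        total += max(-penalty * (abs(r_start - m_start) + abs(r_stop - m_stop))
                     for m_start, m_stop in loci)
    return total

# Invented reads forming two clusters.
reads = [(100, 121), (101, 122), (100, 122), (160, 181), (161, 181)]
one_locus = [(100, 122)]
two_loci = [(100, 122), (160, 181)]
ll_one = toy_log_likelihood(reads, one_locus)
ll_two = toy_log_likelihood(reads, two_loci)
```

Since each read takes the maximum over all loci, enlarging the locus set can only keep or raise every read's score, so `ll_two` is at least `ll_one` for any added locus.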
  • a score is assigned to each locus quantifying how amplified the read support for that locus is compared to the surrounding region. As before, this is done using a binomially distributed test statistic under a null model that assumes all of the reads across the locus are uniformly distributed.
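Under the stated null model, the locus score can be written as a binomial tail probability. In the sketch below, n (reads aligned across the surrounding region), k (reads supporting the locus), ℓ (locus length), and L (region length) are symbols introduced here for illustration. Taking the per-read probability as ℓ/L is an assumption that follows from the uniform null but is not spelled out in the text.

```latex
p_{\text{locus}} = \Pr(X \ge k)
  = \sum_{i=k}^{n} \binom{n}{i} \left(\frac{\ell}{L}\right)^{i} \left(1 - \frac{\ell}{L}\right)^{n-i},
  \qquad X \sim \mathrm{Binomial}\!\left(n, \frac{\ell}{L}\right)
```

A small value indicates that the locus holds more of the region's reads than uniform scatter would explain.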
  • The optimal number of smRNA model loci is identified when the maximum locus p-value increases for two successive iterations, which means that adding additional loci to the model does not identify additional high-quality mature smRNAs.
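The stopping heuristic can be sketched in Python as follows. The callable `max_p_for_k` is hypothetical: it stands for fitting a model with k loci and returning the largest locus p-value. Returning the last k before the two successive increases, and the `k_max` safety bound, are assumptions; the text states only when the search stops.

```python
from typing import Callable


def choose_num_loci(max_p_for_k: Callable[[int], float], k_max: int = 50) -> int:
    """Grow the model one locus at a time and stop after two successive rises.

    Returns the most recent k at which the maximum locus p-value did not
    rise (assumed interpretation of the "optimal number").
    """
    prev = None
    rises = 0
    best_k = 1
    for k in range(1, k_max + 1):
        cur = max_p_for_k(k)
        if prev is not None and cur > prev:
            rises += 1
            if rises == 2:
                # Two successive increases: extra loci are not helping.
                return best_k
        else:
            rises = 0
            best_k = k
        prev = cur
    return best_k
```

For example, if the maximum p-values for k = 1 to 5 are 0.2, 0.05, 0.01, 0.03, and 0.08, the search stops at k = 5 and reports 3 loci.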
  • Filtering. Filtering is performed using the amplification p-value assigned to each model locus in the calling stage. Combined with other criteria, such as minimum read count or length, this allows low-quality loci to be discarded from the results. Additionally, the user can provide a fixed cutoff, which can be used to filter down to only the most amplified loci. Finally, the filtered results are processed to determine the strand and written to a GFF3 (General Feature Format) file.
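A minimal Python sketch of the filtering and GFF3 output step is given below. The threshold defaults, the `Locus` fields, the source label `smRNA_caller`, and the feature type `ncRNA` are assumptions for illustration. Strand is taken as an input here, whereas the described system determines it during this stage. The nine tab-separated columns and the `##gff-version 3` header follow the GFF3 format.

```python
from typing import List, NamedTuple


class Locus(NamedTuple):
    """Called locus (hypothetical record); coordinates are 1-based, inclusive."""
    seqid: str
    start: int
    end: int
    strand: str
    p_value: float
    read_count: int


def filter_loci(
    loci: List[Locus], max_p: float = 0.01, min_reads: int = 10, min_len: int = 18
) -> List[Locus]:
    """Keep loci that pass the p-value, read-count, and length criteria."""
    return [
        l for l in loci
        if l.p_value <= max_p
        and l.read_count >= min_reads
        and (l.end - l.start + 1) >= min_len
    ]


def to_gff3(loci: List[Locus], source: str = "smRNA_caller") -> str:
    """Render loci as GFF3 text with the p-value in the score column."""
    lines = ["##gff-version 3"]
    for i, l in enumerate(loci, 1):
        attrs = f"ID=locus{i};reads={l.read_count}"
        lines.append("\t".join([
            l.seqid, source, "ncRNA", str(l.start), str(l.end),
            f"{l.p_value:.3g}", l.strand, ".", attrs,
        ]))
    return "\n".join(lines) + "\n"
```

Passing a stricter `max_p` corresponds to the fixed user cutoff that narrows the output to the most amplified loci.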
  • The system is implemented as an open source, multi-threaded application developed using JDK 7 (Java Development Kit) and standard open source libraries, including Apache Commons 3.3 and Picard 1.120. It can be run on any platform with JRE 7 (Java Runtime Environment) or higher. For typically sized datasets of approximately 10-30 M reads, ideal performance is found on systems with a minimum of 8 GB of RAM.
  • smRNAs identified by the systems and methods may be further characterized.
  • smRNAs are isolated, chemically synthesized, placed into vectors, or otherwise prepared for further analysis. In some embodiments, such smRNAs are provided to cells or in vitro environments and their function is analyzed.
  • Binding assays are performed to determine nucleic acids, proteins, or other molecules that specifically bind to such smRNAs.
  • Endogenous versions of such smRNAs are monitored in vitro or in vivo to identify their expression levels, tissue specificity, and response to biological or environmental cues.
  • Exogenous smRNAs are administered to subjects (e.g., animals such as humans) for research or therapeutic indications.
  • Such smRNAs from a variety of cells, individuals, animals, etc. are sequenced to identify variants, homologues, and the like.
  • piRNA as an example. It should be understood that these techniques may be employed with other smRNAs.
  • Nucleic acid was isolated from mouse testes and sequenced using Illumina high throughput sequencing of total smRNA to generate a sequencing data set.
  • piRNA clusters were identified and known piRNAs were characterized.
  • A piRNA cluster on chromosome 2 was identified that contained the known piRNA mmu_piR_034537 (Figure 1). Consistent with piRNA clusters, uneven basal expression is found across the entire piRNA cluster, and there is a sharp amplification of the piRNA mmu_piR_034537.
  • The systems and methods can deconvolute smRNA peaks.
  • The exact definition of an smRNA is obscured by spurious reads or true biological variation. At other times, several smRNAs originate from the same locus.
  • The systems and methods identify the reads that are associated with sub-peaks (Figure 3). This feature of the systems and methods is important for distinguishing the boundaries of smRNAs, which provides information as to their biogenesis and downstream targets.
  • Brownstein MJ Kuramochi-Miyagawa S, Nakano T, Chien M, Russo JJ, Ju J, Sheridan R, Sander C, Zavolan M, Tuschl T.
  • a novel class of small RNAs bind to MILI protein in mouse testes. Nature. 2006 Jul 13;442(7099):203-7.
  • Girard A Sachidanandam R, Hannon GJ, Carmell MA.
  • a germline-specific class of small RNAs binds mammalian Piwi proteins. Nature. 2006 Jul 13 ;442(7099): 199-202.
  • Grivna ST Beyret E, Wang Z, Lin H.
  • Peng JC Lin H. Beyond transposons: the epigenetic and somatic functions of the Piwi- piRNA mechanism. Curr Opin Cell Biol. 2013 Apr;25(2): 190-4.

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Analytical Chemistry (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Organic Chemistry (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Bioethics (AREA)
  • Microbiology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Immunology (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Software Systems (AREA)
  • Public Health (AREA)
  • Biochemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The present invention relates to systems and methods for identifying small RNAs. In particular, the invention provides systems and methods for identifying small RNAs by de novo identification of amplified small RNAs from high-throughput sequencing information.
PCT/US2016/018680 2015-02-20 2016-02-19 Systems and methods for the identification and use of small RNAs WO2016134258A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201562118767P 2015-02-20 2015-02-20
US62/118,767 2015-02-20

Publications (1)

Publication Number Publication Date
WO2016134258A1 (fr) 2016-08-25

Family

ID=56689134

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2016/018680 WO2016134258A1 (fr) 2015-02-20 2016-02-19 Systems and methods for the identification and use of small RNAs

Country Status (1)

Country Link
WO (1) WO2016134258A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
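KR102141312B1 (ko) * 2019-04-19 2020-08-04 주식회사 제노헬릭스 Short RNA detection technique based on short RNA-primed xeno sensor module amplification
CN113539367A (zh) * 2021-01-22 2021-10-22 南京集思慧远生物科技有限公司 Mammalian piRNA data analysis method based on second-generation high-throughput sequencing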

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110294699A1 (en) * 2007-08-29 2011-12-01 Sequenom, Inc. Methods and compositions for universal size-specific pcr
US20130184999A1 (en) * 2012-01-05 2013-07-18 Yan Ding Systems and methods for cancer-specific drug targets and biomarkers discovery
WO2014008434A2 (fr) * 2012-07-06 2014-01-09 Nant Holdings Ip, Llc Healthcare analysis stream management
US20140275216A1 (en) * 2013-03-14 2014-09-18 Ibis Biosciences, Inc. ALTERATION OF NEURONAL GENE EXPRESSION BY SYNTHETIC piRNAs AND BY ALTERATION OF piRNA FUNCTION
WO2014179765A2 (fr) * 2013-05-02 2014-11-06 Thomas Jefferson University Novel human miRNAs for use in the diagnosis, prognosis, and therapy of human diseases and conditions
WO2014197835A2 (fr) * 2013-06-06 2014-12-11 The General Hospital Corporation Methods and compositions for the treatment of cancer

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LI, H ET AL.: "The Sequence Alignment/Map Format and SAMtools.", BIOINFORMATICS, vol. 25, no. 16, 2009, pages 2078 - 2079, XP055229864, DOI: 10.1093/bioinformatics/btp352 *
SEVERINO, P ET AL.: "Small RNAs in Metastatic and Non-metastatic Oral Squamous Cell Carcinoma.", BMC MEDICAL GENOMICS, vol. 8, no. 31, 24 June 2015 (2015-06-24) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102141312B1 (ko) * 2019-04-19 2020-08-04 주식회사 제노헬릭스 Short RNA detection technique based on short RNA-primed xeno sensor module amplification
WO2020213800A1 (fr) * 2019-04-19 2020-10-22 주식회사 제노헬릭스 Short RNA detection technique based on short RNA-primed xeno sensor module amplification
US11591646B2 (en) 2019-04-19 2023-02-28 Xenoheltx Co., Ltd Small RNA detection method based on small RNA primed xenosensor module amplification
CN113539367A (zh) * 2021-01-22 2021-10-22 南京集思慧远生物科技有限公司 Mammalian piRNA data analysis method based on second-generation high-throughput sequencing
CN113539367B (zh) * 2021-01-22 2023-11-03 南京集思慧远生物科技有限公司 Mammalian piRNA data analysis method based on second-generation high-throughput sequencing

Similar Documents

Publication Publication Date Title
AU2019250200B2 (en) Error Suppression In Sequenced DNA Fragments Using Redundant Reads With Unique Molecular Indices (UMIs)
AU2018331434B2 (en) Universal short adapters with variable length non-random unique molecular identifiers
US20150051088A1 (en) Next-generation sequencing libraries
CA3220983A1 (fr) Optimal index sequences for massively parallel multiplex sequencing
EP3211100A1 (fr) Amplification primers and methods
US20210108263A1 (en) Methods and Compositions for Preparing Sequencing Libraries
WO2018170660A1 (fr) Method for correcting amplification bias in amplicon sequencing
WO2016134258A1 (fr) Systems and methods for the identification and use of small RNAs
US20230235394A1 (en) Chimeric amplicon array sequencing
WO2021081235A1 (fr) De novo k-mer associations between molecular states

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16753146

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16753146

Country of ref document: EP

Kind code of ref document: A1