WO2017070498A1 - Methods for assessing the quality of gene expression libraries - Google Patents

Methods for assessing the quality of gene expression libraries

Info

Publication number
WO2017070498A1
WO2017070498A1 PCT/US2016/058165 US2016058165W WO2017070498A1 WO 2017070498 A1 WO2017070498 A1 WO 2017070498A1 US 2016058165 W US2016058165 W US 2016058165W WO 2017070498 A1 WO2017070498 A1 WO 2017070498A1
Authority
WO
WIPO (PCT)
Prior art keywords
gene expression
predetermined
libraries
expression libraries
markers
Prior art date
Application number
PCT/US2016/058165
Other languages
English (en)
Inventor
Craig E. Nelson
Original Assignee
Smpl Bio, Llc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Smpl Bio, Llc
Publication of WO2017070498A1

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6809 Methods for determination or identification of nucleic acids involving differential detection
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00 Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/48 Biological material, e.g. blood, urine; Haemocytometers
    • G01N33/50 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
    • G01N33/68 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing involving proteins, peptides or amino acids
    • G01N33/6803 General methods of protein analysis not limited to specific proteins or families of proteins
    • G01N33/6845 Methods of identifying protein-protein interactions in protein mixtures

Definitions

  • the present invention relates to quality control methods for gene expression libraries.
  • High-throughput gene expression profiling technologies, such as next-generation sequencing and proteomics methods, have enabled the rapid and efficient generation of large quantities of gene expression data. Improvements in techniques for producing libraries suitable for gene expression profiling allow analysis of increasingly granular samples. For example, libraries may now be generated from mRNA obtained from single cells. Such processes may be multiplexed, thereby permitting the generation of large numbers of single-cell libraries in parallel at high throughput.
  • the invention features methods for assessing the quality of gene expression libraries and methods for identifying gene expression libraries meeting a quality threshold.
  • the methods generally involve surveying a set of gene expression data (e.g., gene expression data corresponding to one or more gene expression libraries) for the presence and/or expression level of a plurality of predetermined markers.
  • the predetermined markers correspond to a set of coordinately regulated genes, such as ribosomal protein genes.
  • the invention features a method for assessing the quality of a gene expression library.
  • the method involves: (a) providing gene expression data corresponding to the gene expression library; and (b) surveying the gene expression data for the presence of a plurality of coordinately regulated predetermined markers; in which detection of the presence of at least a predetermined threshold number or percentage of the predetermined markers indicates that the gene expression library meets a quality threshold.
  • the invention features a method for assessing the quality of a gene expression library.
  • the method involves: (a) providing a plurality of polynucleotides and/or polypeptides from a biological sample, (b) producing a gene expression library from the plurality of polynucleotides and/or polypeptides; (c) determining the quantity of each distinct polynucleotide and/or polypeptide in the gene expression library, thereby generating gene expression data corresponding to the gene expression library; and (d) surveying the gene expression data for the presence of a plurality of coordinately regulated predetermined markers; in which detection of the presence of at least a predetermined threshold number or percentage of the predetermined markers indicates that the gene expression library meets a quality threshold.
  • the predetermined threshold number is at least 10 (e.g., at least about 10, 15, 20, 25, 30, 35, 40, 50, 60, 70, 80, 90, 100, or more). In some embodiments, the predetermined threshold percentage is at least 50% (e.g., at least about 50%, 60%, 70%, 75%, 80%, 90%, 95%, 97%, 98%, 99%, or 100%).
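The single-library test described above can be sketched in a few lines of Python. This is a minimal illustration, not the claimed implementation: the function name `library_meets_quality`, the `detection_floor` parameter, and the dictionary layout (gene symbol mapped to a measured level) are all assumptions made for the example.

```python
from typing import Iterable, Mapping, Optional


def library_meets_quality(
    expression: Mapping[str, float],
    markers: Iterable[str],
    min_count: Optional[int] = 10,
    min_fraction: Optional[float] = None,
    detection_floor: float = 0.0,
) -> bool:
    """Return True if enough predetermined markers are detected in one library.

    A marker counts as detected when its level exceeds `detection_floor`
    (a hypothetical stand-in for whatever detection rule is actually used).
    Whichever of `min_count` / `min_fraction` is not None is enforced.
    """
    marker_list = list(markers)
    detected = sum(
        1 for m in marker_list if expression.get(m, 0.0) > detection_floor
    )
    if min_count is not None and detected < min_count:
        return False
    if min_fraction is not None and marker_list:
        if detected / len(marker_list) < min_fraction:
            return False
    return True
```

For example, with a four-gene marker panel of which two genes are detected, the library passes a threshold number of 2 or a threshold percentage of 50%, but fails a threshold number of 3.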
  • the gene expression library is obtained from a cell.
  • the cell is a prokaryotic cell or a eukaryotic cell.
  • the prokaryotic cell is a bacterial cell (e.g., an E. coli cell).
  • the eukaryotic cell is a mammalian, fungal, or insect cell.
  • the fungal cell is a yeast cell (e.g., a Saccharomyces cerevisiae or Schizosaccharomyces pombe cell).
  • the mammalian cell is a human cell.
  • the gene expression library is obtained from a tissue (e.g., a human tissue).
  • the tissue is obtained from a tumor.
  • the invention features a method for identifying gene expression libraries meeting a quality threshold from among a plurality of gene expression libraries.
  • the method involves: (a) providing gene expression data corresponding to each of a plurality of gene expression libraries; and (b) surveying the gene expression data corresponding to each of the gene expression libraries for the presence of a plurality of coordinately regulated predetermined markers; in which detection of the presence of at least a predetermined threshold number or percentage of the predetermined markers in the gene expression data corresponding to a particular gene expression library indicates that the particular gene expression library meets a quality threshold.
  • the predetermined threshold number is at least 10 (e.g., at least about 10, 15, 20, 25, 30, 35, 40, 50, 60, 70, 80, 90, 100, or more).
  • the predetermined threshold percentage is at least 50% (e.g., at least about 50%, 60%, 70%, 75%, 80%, 90%, 95%, 97%, 98%, 99%, or 100%).
  • the invention features a method for assessing the quality of a plurality of gene expression libraries.
  • the method involves: (a) providing gene expression data corresponding to each of the gene expression libraries; (b) surveying the gene expression data corresponding to each of the gene expression libraries for the presence of a plurality of coordinately regulated predetermined markers; and (c) determining a number of the predetermined markers detected for each of the gene expression libraries; in which the mean of the numbers being above a predetermined threshold indicates that the plurality of gene expression libraries meets a quality threshold.
  • the predetermined threshold number is at least 10 (e.g., at least about 10, 15, 20, 25, 30, 35, 40, 50, 60, 70, 80, 90, 100, or more).
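The mean-based criterion for a plurality of libraries might look like the following sketch. The helper names `count_detected` and `population_mean_passes` are hypothetical, and detection is simplified to "level above a floor", which is an assumption rather than a rule stated in the text.

```python
from statistics import mean
from typing import Iterable, Mapping, Sequence


def count_detected(
    expression: Mapping[str, float],
    markers: Iterable[str],
    detection_floor: float = 0.0,
) -> int:
    """Number of predetermined markers detected in one library."""
    return sum(1 for m in markers if expression.get(m, 0.0) > detection_floor)


def population_mean_passes(
    libraries: Sequence[Mapping[str, float]],
    markers: Sequence[str],
    mean_threshold: float = 10.0,
) -> bool:
    """True if the mean per-library marker count is above the threshold."""
    counts = [count_detected(lib, markers) for lib in libraries]
    # An empty population cannot meet the threshold.
    return bool(counts) and mean(counts) > mean_threshold
```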
  • the invention features a method for assessing the quality of a plurality of gene expression libraries.
  • the method involves: (a) providing gene expression data corresponding to each of the gene expression libraries; and (b) surveying the gene expression data corresponding to each of the gene expression libraries for the presence of a plurality of coordinately regulated predetermined markers; in which the percentage of the gene expression libraries expressing at least a predetermined threshold number or percentage of the predetermined markers being above a predetermined threshold indicates that the plurality of gene expression libraries meets a quality threshold.
  • the predetermined threshold number is at least 10 (e.g., at least about 10, 15, 20, 25, 30, 35, 40, 50, 60, 70, 80, 90, 100, or more).
  • the predetermined threshold percentage is at least 50% (e.g., at least about 50%, 60%, 70%, 75%, 80%, 90%, 95%, 97%, 98%, 99%, or 100%).
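A sketch of the percentage-of-libraries criterion follows. It assumes, for illustration only, that each library is a mapping from gene symbol to level and that a level above `detection_floor` counts as detection; `fraction_passing` and `population_passes` are hypothetical names.

```python
from typing import Mapping, Sequence


def fraction_passing(
    libraries: Sequence[Mapping[str, float]],
    markers: Sequence[str],
    min_markers: int = 10,
    detection_floor: float = 0.0,
) -> float:
    """Fraction of libraries in which at least `min_markers` markers are detected."""
    if not libraries:
        return 0.0
    passing = 0
    for lib in libraries:
        detected = sum(1 for m in markers if lib.get(m, 0.0) > detection_floor)
        if detected >= min_markers:
            passing += 1
    return passing / len(libraries)


def population_passes(
    libraries: Sequence[Mapping[str, float]],
    markers: Sequence[str],
    min_markers: int = 10,
    min_library_fraction: float = 0.5,
) -> bool:
    """True if the passing fraction is above the population-level threshold."""
    return fraction_passing(libraries, markers, min_markers) > min_library_fraction
```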
  • the invention features a method for identifying gene expression libraries meeting a quality threshold from among a plurality of gene expression libraries.
  • the method involves: (a) providing gene expression data corresponding to each of a plurality of gene expression libraries; (b) surveying the gene expression data corresponding to each of the gene expression libraries for the presence of a plurality of coordinately regulated predetermined markers, thereby determining, for each of the gene expression libraries, a corresponding number or percentage of detected predetermined markers; and (c) identifying a subset of the gene expression libraries for which the corresponding number or percentage of detected predetermined markers is greater than the corresponding numbers or percentages of detected predetermined markers of a predetermined threshold percentage of the plurality of gene expression libraries, thereby identifying the subset as meeting a quality threshold.
  • the predetermined threshold percentage is at least 50% (e.g., at least about 50%, 60%, 70%, 75%, 80%, 90%, 95%, 97%, 98%, 99%, or 100%). In one embodiment, the predetermined threshold percentage of the plurality of gene expression libraries is 75%.
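One way to read the relative criterion above is as a percentile cut: a library is selected when its marker count is greater than the counts of at least the threshold percentage of all libraries. The sketch below follows that reading, which is an interpretation; `top_subset` and its inputs are hypothetical.

```python
from typing import List, Mapping


def top_subset(
    counts: Mapping[str, int],
    threshold_fraction: float = 0.75,
) -> List[str]:
    """Select libraries whose marker count exceeds that of at least
    `threshold_fraction` of all libraries in the population."""
    n = len(counts)
    selected = []
    for name, count in counts.items():
        # Number of libraries with a strictly lower marker count.
        below = sum(1 for other in counts.values() if other < count)
        if n and below / n >= threshold_fraction:
            selected.append(name)
    return selected
```

With four libraries having 5, 10, 20, and 40 detected markers and a 75% threshold, only the library with 40 markers is selected, because it is the only one that outranks three of the four libraries.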
  • the invention features a method for assessing the quality of a plurality of gene expression libraries.
  • the method involves: (a) providing gene expression data corresponding to each of the gene expression libraries; (b) surveying the gene expression data corresponding to each of the gene expression libraries for the expression levels of a plurality of coordinately regulated
  • the predetermined threshold number is at least 10 (e.g., at least about 10, 15, 20, 25, 30, 35, 40, 50, 60, 70, 80, 90, 100, or more).
  • the invention features a method for assessing the quality of a plurality of gene expression libraries.
  • the method involves: (a) providing gene expression data corresponding to each of the gene expression libraries; and (b) surveying the gene expression data corresponding to each of the gene expression libraries for the expression levels of a plurality of coordinately regulated predetermined markers, in which each of the predetermined markers has a corresponding threshold expression level; in which the percentage of the gene expression libraries having at least a predetermined number or percentage of the predetermined markers having an expression level above the corresponding threshold expression level being above a predetermined threshold indicates that the plurality of gene expression libraries meets a quality threshold.
  • the predetermined threshold percentage is at least 50% (e.g., at least about 50%, 60%, 70%, 75%, 80%, 90%, 95%, 97%, 98%, 99%, or 100%).
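When each marker has its own threshold expression level, the count of "markers above threshold" replaces the simple presence count. The following is a minimal sketch under that assumption; the function names and the example cutoffs used in any call are illustrative, not values from the source.

```python
from typing import Mapping, Sequence


def markers_above_threshold(
    expression: Mapping[str, float],
    thresholds: Mapping[str, float],
) -> int:
    """Count markers whose level exceeds that marker's own threshold level."""
    return sum(
        1
        for marker, cutoff in thresholds.items()
        if expression.get(marker, 0.0) > cutoff
    )


def fraction_of_libraries_passing(
    libraries: Sequence[Mapping[str, float]],
    thresholds: Mapping[str, float],
    min_markers: int,
) -> float:
    """Fraction of libraries with at least `min_markers` markers above threshold."""
    if not libraries:
        return 0.0
    passing = sum(
        1
        for lib in libraries
        if markers_above_threshold(lib, thresholds) >= min_markers
    )
    return passing / len(libraries)
```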
  • the invention features a method for assessing the quality of a plurality of gene expression libraries.
  • the method involves: (a) providing gene expression data corresponding to each of the gene expression libraries; (b) surveying the gene expression data corresponding to each of the gene expression libraries for the presence of a plurality of predetermined markers, in which the predetermined markers are coordinately regulated; and (c) determining the distribution of the quantity of the predetermined markers detected in each of the gene expression libraries; in which the distribution substantially aligning to a predetermined distribution indicates that the plurality of gene expression libraries meets a quality threshold.
  • the invention features a method for assessing the quality of a plurality of gene expression libraries.
  • the method involves: (a) providing gene expression data corresponding to each of the gene expression libraries; (b) surveying the gene expression data corresponding to each of the gene expression libraries for the expression levels of a plurality of coordinately regulated predetermined markers, in which each of the predetermined markers has a corresponding threshold expression level; and (c) determining, for each of the gene expression libraries, a number of the predetermined markers having an expression level above the corresponding threshold expression level; in which the distribution of the numbers substantially aligning to a predetermined distribution indicates that the plurality of gene expression libraries meets a quality threshold.
  • the predetermined distribution is a Gaussian distribution.
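The text does not say how "substantially aligning" to a Gaussian distribution is measured. One possible check, shown here purely as an assumption, is the Kolmogorov-Smirnov distance between the empirical distribution of per-library marker counts and a normal distribution fitted to the same data. The tolerance `max_distance` is a hypothetical parameter with an arbitrary default.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def normal_cdf(x: float, mu: float, sigma: float) -> float:
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def ks_distance_to_gaussian(values: Sequence[float]) -> float:
    """Largest gap between the empirical CDF and a fitted normal CDF."""
    xs = sorted(values)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two values")
    mu, sigma = mean(xs), stdev(xs)
    if sigma == 0:
        raise ValueError("values have zero variance")
    distance = 0.0
    for i, x in enumerate(xs):
        f = normal_cdf(x, mu, sigma)
        # Compare against the empirical CDF just before and just after x.
        distance = max(distance, abs(f - i / n), abs((i + 1) / n - f))
    return distance


def aligns_with_gaussian(values: Sequence[float], max_distance: float = 0.15) -> bool:
    """Assumed rule: the counts align if the KS distance is within tolerance."""
    return ks_distance_to_gaussian(values) <= max_distance
```

A roughly bell-shaped set of counts yields a small distance, whereas a population split into two clusters (for example, failed and successful libraries) yields a large one.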
  • each of the gene expression libraries is obtained from a cell or cells.
  • each of the cells is a prokaryotic cell or a eukaryotic cell.
  • each of the prokaryotic cells is a bacterial cell (e.g., an E. coli cell).
  • each of the eukaryotic cells is a mammalian, fungal, or insect cell.
  • each of the fungal cells is a yeast cell (e.g., a Saccharomyces cerevisiae or Schizosaccharomyces pombe cell).
  • each of the mammalian cells is a human cell.
  • the gene expression libraries are obtained from a plurality of cell types (e.g., any cells as described herein). In alternate embodiments of the fourth through eleventh aspects, each of the gene expression libraries is obtained from a tissue. In certain embodiments, each of the gene expression libraries is obtained from a human tissue. In particular embodiments, each of the gene expression libraries is obtained from a tissue obtained from a tumor (e.g., a human tumor). In some embodiments, the gene expression libraries are obtained from a plurality of cell and/or tissue types (e.g., any cell or tissue type as described herein).
  • the providing step includes, for each of the gene expression libraries: (i) providing a plurality of polynucleotides and/or polypeptides from a biological sample, (ii) producing a gene expression library from the plurality of polynucleotides and/or polypeptides; and (iii) determining the quantity of each distinct polynucleotide and/or polypeptide in the gene expression library, thereby generating gene expression data corresponding to the gene expression library.
  • the gene expression data corresponding to each of the gene expression libraries each includes a plurality of values, each of the values corresponding to the expression level of a gene.
  • the plurality of gene expression libraries includes one or more bulk libraries. In other embodiments, each of the plurality of gene expression libraries includes a single-cell library.
  • the gene expression data includes sequencing data, microarray data, or proteomics data. In certain embodiments, the sequencing data is obtained by next-generation sequencing. In particular embodiments, the sequencing data includes genomic sequences or transcriptomic sequences. In one embodiment, the transcriptomic sequences are obtained by RNA-Seq. In certain embodiments, the gene expression data includes proteomics data and the plurality of predetermined markers comprises a plurality of ribosomal protein genes.
  • the plurality of ribosomal protein genes includes one or more of RPL22, RPL11, RPS8, RPL5, RPS27, RPS24, RPLP2, RPL27A, RPS13, RPS3, RPS25, RPS26, RPL6, RPLP0, RPL21, RPS29, RPL36AL, RPL4, RPLP1, RPS17, RPS2, RPS15A, RPL13, RPL26, RPL23A, RPL23, RPL19, RPL27, RPL38, RPL17, RPS15, RPL36, RPS28, RPL18A, RPS16, RPS19, RPL18, RPL13A, RPS11, RPS9,
  • the plurality of predetermined markers includes a plurality of ribosomal protein genes.
  • the plurality of ribosomal protein genes includes one or more of RPL22, RPL11, RPS8, RPL5, RPS27, RPS24, RPLP2, RPL27A, RPS13, RPS3, RPS25, RPS26, RPL6, RPLP0, RPL21, RPS29, RPL36AL, RPL4, RPLP1, RPS17, RPS2, RPS15A, RPL13, RPL26, RPL23A, RPL23, RPL19, RPL27, RPL38, RPL17, RPS15, RPL36, RPS28, RPL18A, RPS16, RPS19, RPL18, RPL13A, RPS11, RPS9, RPL28, RPS5, RPS7, RPS27A, RPL31, RPL37A, RPS21, RPL3,
  • each of the predetermined markers corresponds to a gene required for cellular viability.
  • each of the genes is expressed at high levels.
  • each of the genes is expressed at levels among the top half (e.g., top half, top third, top quarter, top 10%, top 5%, top 1%, top 0.1%, or higher) of the top half (e.g., top half, top third, top quarter, top 10%, top 5%, top 1%, top 0.1%, or higher) of the
  • each of the genes required for cellular viability comprises a polynucleotide encoding a protein and/or a noncoding RNA.
  • quality refers to the level of correspondence between a measured dataset and the sample (e.g., a gene expression library) from which the dataset was obtained.
  • quality may refer to how accurately the measured gene expression of a data set represents the gene expression (e.g., transcriptional or translational activity, e.g., transcript or protein levels) in a corresponding library.
  • the quality of gene expression in a library may be reflected by characteristics, which include, but are not limited to, the quantity of certain gene expression products in a sample, the presence or absence of certain gene expression products, and/or the distribution or relative amounts of different gene expression products as measured in a data set.
  • Quality may also include a determination of the gene expression levels in a library relative to one or more additional libraries.
  • RNA (e.g., mRNA, small RNA, siRNA, miRNA, rRNA, tRNA, or snRNA)
  • DNA
  • proteins
  • modifications of RNA or protein, such as splicing, phosphorylation, acetylation, or methylation
  • gene expression refers to gene products produced by the cell, tissue, or other biological sample.
  • library refers to a population of variants (e.g., nucleic acid variants or polypeptide variants).
  • a library may be a population of nucleic acid sequences that are contained on expression constructs and/or in one or more host cells.
  • a library may also include a plurality of expression constructs each encoding a library member peptide to provide a plurality of different library members.
  • a library may also refer to a population of peptide sequences (e.g., peptide sequences expressed from a library of nucleic acid molecules).
  • Libraries may include, but are not limited to, a DNA (e.g., cDNA, a DNA representation of mRNA), RNA, a PolyA, or protein library.
  • each construct includes a DNA sequence encoding a peptide to be expressed as a library member peptide and each contains appropriate promoter, translation start and stop signals.
  • the methods described herein may employ libraries having about 10^2, 10^3, 10^4, 10^5, 10^6, 10^7, 10^8, 10^9, 10^10, 10^11, and 10^12 different nucleic acid sequences.
  • Presence refers to detectable levels of gene expression within a library.
  • the presence of a gene or gene product may be considered to be detected if it is distinguishable from (e.g., significantly greater than) background noise.
  • the presence of one or more genes or gene expression products may be determined by any methods known in the art, such as by a qualitative measure and/or a quantitative measure of gene expression (e.g., as described herein).
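The text leaves open how "distinguishable from background noise" is decided. As one hedged illustration, a signal can be called detected when it exceeds the background mean by a chosen number of standard deviations. The rule, the function name `is_detected`, and the default `k = 3.0` are assumptions for the example, not values from the source.

```python
from statistics import mean, pstdev
from typing import Sequence


def is_detected(signal: float, background: Sequence[float], k: float = 3.0) -> bool:
    """Assumed detection rule: signal > mean(background) + k * sd(background)."""
    cutoff = mean(background) + k * pstdev(background)
    return signal > cutoff
```

For background readings of 0, 2, 0, and 2 (mean 1, standard deviation 1), the cutoff is 4 with the default `k`, so a signal of 5 is detected and a signal of 4 is not.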
  • coordinately regulated refers to two or more genes that are regulated in concert. Such genes may be regulated by transcriptional regulation, translational regulation, and/or the relative stability of gene products (e.g., half-life of gene products; poly(A) tail length), or any combination of the above. Coordinately regulated genes may be tightly co-regulated or loosely co-regulated. The coordinated regulation of gene expression may be determined by any methods for determining gene expression levels known in the art (e.g., microarray, quantitative RT-PCR, RNA sequencing, protein quantification, protein sequencing, mass spectrometry (MS)).
  • values refers to the measured expression level of one or more genes determined for a cell and/or library. Values may include both the determined measures of gene expression and also the conversion of the determined gene expression values into other numerical representations of gene expression (e.g., Pearson correlation coefficient).
  • threshold expression level refers to an expression level above which a library is considered to meet a quality minimum.
  • a threshold expression level may be a preselected value (e.g., a value selected prior to performing a quality control method as described herein), or may be determined according to gene expression data derived from one or more (e.g., a population) of gene expression libraries. In some instances, a threshold is determined statistically.
  • cellular viability refers to a gene required for a cell to be characterized as alive and capable of living, developing, and/or reproducing.
  • high levels refers to the level of gene expression that is relatively greater, for example, than the average (e.g., mean or median or an approximation thereof) of detectable genes expressed in a library.
  • a gene expressed at "high levels" is generally expressed at levels among the top half (e.g., top half, top third, top quarter, top 10%, top 5%, top 1%, top 0.1%, or higher) of the transcriptome.
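The "top half of the transcriptome" definition can be illustrated by ranking detected genes by expression level and keeping a chosen top fraction. This is a sketch under the assumption that a level greater than zero means "detectable"; `highly_expressed` is a hypothetical name, and ties at the cut are broken arbitrarily by sort order.

```python
from typing import Mapping, Set


def highly_expressed(
    expression: Mapping[str, float],
    top_fraction: float = 0.5,
) -> Set[str]:
    """Return the genes ranked within the top fraction of detectable genes."""
    detected = {gene: level for gene, level in expression.items() if level > 0}
    ranked = sorted(detected, key=detected.get, reverse=True)
    if not ranked:
        return set()
    # Always keep at least one gene when anything is detected.
    keep = max(1, int(len(ranked) * top_fraction))
    return set(ranked[:keep])
```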
  • marker refers generally to a molecule, such as a transcript or protein, the expression of which in a library (e.g., a cell) or a plurality of libraries can be detected by standard methods (or methods disclosed herein).
  • the term “marker” may also generically refer to a particular nucleic acid sequence or polypeptide sequence corresponding to a particular gene or gene product of interest (e.g., as a predetermined marker for use in the methods of the invention).
  • the skilled person has the ability to label the polypeptides or oligonucleotides encompassed by the present invention.
  • hybridization probes for use in detecting can be labeled and visualized according to standard methods known in the art.
  • Non-limiting examples of commonly used systems include the use of radiolabels, enzyme labels, fluorescent tags, biotin-avidin complexes, chemiluminescence, and the like.
  • Fig. 1 is a series of diagrams showing a set of ribosomal protein subunits for the large subunit of Haloarcula marismortui and the small subunit of Thermus aquaticus, which may be used in the methods described herein as predetermined markers.
  • Figs. 2A and 2B are a series of graphs showing detection of ribosomal protein genes (RPGs) in a set of single cell gene expression libraries. These data demonstrate how low outliers can be identified as technical failures or biologically unique cells by determining the number of RPGs expressed in each library.
  • Fig. 2A shows single-cell libraries graphed according to the total number of detected genes per cell (x-axis) and the number of RPGs detected per cell (y-axis).
  • a quality control method that selects for libraries expressing at least a minimum number of total genes may unnecessarily exclude some libraries that are not technical failures, but that instead represent biologically unique cells (black dots).
  • Fig. 2B shows the per cell variance of ribosomal gene expression versus the number of ribosomal genes detected. As the number of ribosomal genes detected increases, the variance in RPG expression is reduced. The overlapping region including cells excluded (black dots) and cells included (gray dots) on the basis of total gene expression shows that there are cells showing sufficient RPG expression that variance is minimized, but which would ordinarily be excluded as not expressing a sufficient number of total genes.
  • the predetermined markers are generally characterized as being coordinately regulated, such that their expression levels vary in synchrony.
  • the predetermined markers may also be genes required for cellular viability and/or genes that are expressed at high levels in the cell type(s) of interest.
  • the methods described herein may be used, for example, to measure the quality of one or more libraries, to correct for measured imperfections in a library, and/or to distinguish between biological and technical outliers.
  • technical outliers may be characterized as having low expression of the predetermined markers (e.g., in combination with low expression of total genes), whereas biological outliers may show low expression of total genes, but also show higher expression of the predetermined markers.
  • Such biological outliers may represent biological failures or biologically unique cells.
  • a biological outlier in a population of single cell samples may represent a rare cell type of interest (e.g., a disease cell, such as a cancer cell).
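The distinction between technical and biological outliers can be expressed as a small decision rule over two per-library numbers: total genes detected and predetermined markers detected. The cutoffs and labels below are hypothetical; the text does not prescribe specific values.

```python
def classify_library(
    total_genes: int,
    marker_count: int,
    min_total_genes: int,
    min_markers: int,
) -> str:
    """Label a library using total genes detected and markers detected.

    - few markers detected: candidate technical outlier (failed library)
    - enough markers but few total genes: candidate biological outlier
    - otherwise: typical library
    """
    if marker_count < min_markers:
        return "technical_outlier"
    if total_genes < min_total_genes:
        return "biological_outlier"
    return "typical"
```

A filter based on total gene count alone would discard both outlier classes; adding the marker count allows the second class to be retained for closer inspection.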
  • quality control methods for one or more gene expression libraries involve detecting the presence of and/or measuring the expression level of predetermined markers in gene expression data derived from the gene expression libraries.
  • Expression of a sufficient quantity (e.g., total amount detected, total number detected, or percentage detected) of the predetermined markers may indicate that a particular gene expression library meets a quality threshold.
  • Determination that a sufficient quantity of predetermined markers is expressed in a population of gene expression libraries may indicate that the population meets a quality threshold.
  • libraries that cluster around maximal quality may be distinguished from those that cluster around a lower level of quality using statistical methods known in the art.
  • a population of gene expression libraries may be ranked, for example, by the number of predetermined markers detected in the libraries and/or by the quantity of each of and/or all of the predetermined markers.
  • a quality score may be calculated for each library according to statistical methods known in the art to determine the rankings of the libraries.
  • the subset of the libraries showing the greatest quality score, the greatest number of predetermined markers detected, and/or the greatest quantity of each of and/or all of the predetermined markers may be selected and, optionally, separated from the remainder of the libraries. This selected subset may represent the highest quality libraries from the population.
  • the subset may include a predetermined quantity (e.g., number or percentage) of the libraries from the population.
  • the subset may include, for example, approximately the top 0.001%, 0.01%, 0.1%, 1%, 2%, 3%, 4%, 5%, 10%, 15%, 20%, 25%, 30%, 40%, 50%, 60%, 70%, 75%, 80%, 90%, 95%, or more of the libraries from the population.
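Ranking and subset selection can be sketched as follows, assuming each library already has a numeric quality score (for instance, its count of detected markers). The function name `select_top_libraries` and the rounding-up of the subset size are illustrative choices, not details from the source.

```python
import math
from typing import List, Mapping


def select_top_libraries(
    scores: Mapping[str, float],
    top_fraction: float = 0.25,
) -> List[str]:
    """Rank libraries by score (highest first) and keep the top fraction."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Round up so that a non-empty population with top_fraction > 0
    # always yields at least one library.
    keep = math.ceil(len(ranked) * top_fraction)
    return ranked[:keep]
```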
  • the methods described herein may be used to assess the quality of gene expression data corresponding to one or more samples (e.g., libraries).
  • Such gene expression libraries may be used to generate data pertaining to expression of nucleic acids or polypeptides.
  • Nucleic acids for which levels can be determined include, for example, mRNA and noncoding RNA (e.g., miRNA, tRNA, rRNA, or any other form of noncoding RNAs known in the art).
  • Polypeptides that may be assayed include proteins, peptides, or any other polypeptide known in the art.
  • the samples are single-cell samples that yield, for example, single-cell gene expression profiles.
  • a cell may be isolated by any method known in the art (e.g., fluorescence activated cell sorting), and nucleic acid or polypeptides may then be purified from the cell and used to produce a single cell library.
  • a plurality of cells is individually sorted and lysed, and a library (e.g., a uniquely tagged library) is generated from each of the cells.
  • a sample may include material obtained from multiple cells (e.g., multiple cells in suspension or a tissue sample).
  • a bulk library may be obtained by collecting nucleic acids or proteins from multiple cells (e.g., multiple cells lysed simultaneously), and then gene expression data may be generated from the bulk library.
  • a library may be obtained from a tissue sample of interest (e.g., a tumor sample). Samples may include material obtained from, e.g., eukaryotic or prokaryotic cells.
  • the eukaryotic cells may be mammalian cells (e.g., human cells or mouse cells).
  • the prokaryotic cells may be bacterial cells.
  • a sample may include material generated in vitro.
  • markers that may be used as predetermined markers in these methods are generally genes that are coordinately regulated.
  • the markers may include a plurality of genes under tight transcriptional control, such that if one of the genes is upregulated, the other gene(s) are also upregulated. Conversely, if one of the genes is downregulated, the other genes may be downregulated. In some instances, the expression levels of the markers may be maintained in approximate synchrony.
  • the expression level of two markers may be maintained at a particular ratio (e.g., approximately a 1:1, 1:2, 1:3, 1:4, 1:5, 1:6, 1:7, 1:8, 1:9, 1:10, 1:20, 1:50, 1:100, 1:1000, 1:10,000 ratio, or more).
  • the markers may be maintained in synchrony according to their biological function (e.g., markers for which their relative levels are held at a particular stoichiometric ratio).
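If two markers are expected to hold a fixed ratio, the observed ratio in a library can be compared with the expected one. The sketch below measures the deviation on a log2 scale and accepts it within a fold tolerance; both the tolerance and the function name `within_expected_ratio` are assumptions introduced for illustration.

```python
import math


def within_expected_ratio(
    level_a: float,
    level_b: float,
    expected_ratio: float = 1.0,
    fold_tolerance: float = 2.0,
) -> bool:
    """True if level_a / level_b is within `fold_tolerance`-fold of the expected ratio."""
    if level_a <= 0 or level_b <= 0:
        # A ratio is undefined when either marker is undetected.
        return False
    observed = level_a / level_b
    return abs(math.log2(observed / expected_ratio)) <= math.log2(fold_tolerance)
```

For example, levels of 10 and 12 are consistent with an expected 1:1 ratio at a two-fold tolerance, whereas levels of 10 and 50 are not.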
  • Markers that may be useful in the quality control methods described herein may include any plurality of markers known in the art.
  • the markers may be coordinately regulated as described herein.
  • the markers may be required for cellular viability.
  • the markers may normally be expressed at high levels in a cell.
  • the markers may include housekeeping genes commonly used, for example, in quantitative real-time PCR, as well known in the art (e.g., GAPDH or beta-actin).
  • the markers may include a panel of ribosomal protein genes.
  • Ribosomes are large complexes of polypeptides and nucleic acids present in all cells, which operate as the sites of protein translation from messenger RNA (mRNA). Each ribosome is composed of a number of protein components and one or more ribosomal RNA (rRNA) molecules. A given ribosome includes two primary subunits (Fig. 1), each made up of a large number of ribosomal proteins: a large subunit that binds to tRNAs and amino acids, and a small subunit that binds to the mRNA. Due to their crucial role in translation, ribosomal proteins may be required for cellular viability and their expression may be coordinately regulated.
  • the transcription of RPGs may be tightly and coordinately regulated, for example, such that there is close to a 1:1 ratio in expression level between each RPG transcript.
  • RPGs may be utilized as predetermined markers in the methods described herein.
  • RPG transcript expression level may be used as a metric for the quality of one or more libraries (e.g., single cell libraries).
  • detection of the presence of transcripts corresponding to at least a threshold number of RPGs may be indicative of a library meeting a minimum quality threshold.
  • detection of the presence of transcripts corresponding to at least a threshold percentage of a set of RPGs (e.g., at least about 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 60%, or 70% of the RPGs)
  • RPGs that may be used as predetermined markers in the methods described herein include, without limitation, RPL22, RPL11, RPS8, RPL5, RPS27, RPS24, RPLP2, RPL27A, RPS13, RPS3, RPS25, RPS26, RPL6, RPLP0, RPL21, RPS29, RPL36AL, RPL4, RPLP1, RPS17, RPS2, RPS15A, RPL13, RPL26, RPL23A, RPL23, RPL19, RPL27, RPL38, RPL17, RPS15, RPL36, RPS28, RPL18A, RPS16, RPS19, RPL18, RPL13A, RPS11, RPS9, RPL28, RPS5, RPS7, RPS27A, RPL31, RPL37A, RPS21, RPL3, RPL32, RPL15, RPSA, RPL14, RPL29, RPL24, RPL35A, R
  • the methods described herein may be used for quality control of gene expression data (e.g., single-cell expression data) from one or multiple samples (e.g., single cells).
  • Expression of markers in a sample can be analyzed by a number of methodologies, many of which are known in the art and understood by the skilled artisan, including but not limited to, nucleic acid sequencing, microarray analysis, proteomics, in situ hybridization (e.g., fluorescence in situ hybridization), amplification-based assays, fluorescence activated cell sorting (FACS), northern analysis and/or PCR analysis of mRNAs.
  • Nucleic acid-based datasets suitable for analysis according to the methods of the invention include gene expression profiles. Such profiles may include whole transcriptome sequencing data (e.g., RNA-Seq data), panels of mRNAs, noncoding RNAs, or any other nucleic acid sequence that may be expressed from genomic DNA. Other nucleic acid datasets suitable for use in the methods of the invention may include expression data collected by imaging-based techniques (e.g., Northern blotting or Southern blotting). Northern blot analysis is a conventional technique well known in the art and is described, for example, in Molecular Cloning, a Laboratory Manual, second edition, 1989, Sambrook, Fritsch, Maniatis, Cold Spring Harbor Press, 10 Skyline Drive, Plainview, NY 11803-2500.
  • Gene expression profiles to be analyzed according to the methods described herein may alternatively include, for example, microarray data or nucleic acid sequencing data produced by any sequencing method known in the art (e.g., Sanger sequencing and next-generation sequencing methods, also known as high-throughput sequencing or deep sequencing).
  • Exemplary next-generation sequencing technologies include, without limitation, Illumina sequencing, Ion Torrent sequencing, 454 sequencing, SOLiD sequencing, and nanopore sequencing platforms. Additional methods of sequencing known in the art can also be used.
  • mRNA expression levels may be determined using RNA-Seq (e.g., as described in Mortazavi et al., Nat. Methods 5:621-628, 2008, hereby incorporated by reference).
  • RNA-Seq is a robust technology for monitoring expression by directly sequencing the RNA molecules in a sample.
  • this methodology may involve fragmentation of RNA to an average length of 200 nucleotides, conversion to cDNA by random priming, and synthesis of double-stranded cDNA (e.g., using the Just cDNA DoubleStranded cDNA Synthesis Kit from Agilent Technology). Then, the cDNA is converted into a molecular library for sequencing by addition of sequence adapters for each library (e.g., from Illumina®/Solexa), and the resulting 50-100 nucleotide reads are mapped onto the genome.
  • Expression levels may also be determined using microarray-based platforms (e.g., single-nucleotide polymorphism (SNP) arrays), as microarray technology offers high resolution. Details of various microarray methods can be found in the literature. See, for example, U.S. Pat. No. 6,232,068 and Pollack et al., Nat. Genet. 23:41-46, 1999.
  • Hybridization of a labeled probe with a particular array member indicates that the sample from which the probe was derived expresses that gene. Expression level may be quantified according to the amount of signal detected from hybridized probe-sample complexes.
  • a typical microarray experiment involves the following steps: 1) preparation of fluorescently labeled target from RNA isolated from the sample, 2) hybridization of the labeled target to the microarray, 3) washing, staining, and scanning of the array, 4) analysis of the scanned image, and 5) generation of gene expression profiles.
  • One example of a microarray processor is the Affymetrix GENECHIP® system, which is commercially available and comprises arrays fabricated by direct synthesis of oligonucleotides on a glass surface. Other systems may be used as known to one skilled in the art.
  • Amplification-based assays also can be used to measure the expression level of one or more markers (e.g., genes).
  • the nucleic acid sequences of the gene act as a template in an amplification reaction (for example, a polymerase chain reaction (PCR) or quantitative PCR).
  • the amount of amplification product will be proportional to the amount of template in the original sample.
  • Comparison to appropriate controls provides a measure of the expression level of the gene, corresponding to the specific probe used, according to the principles discussed above.
  • Methods of real-time quantitative PCR using TaqMan probes are well known in the art. Detailed protocols for real-time quantitative PCR are provided, for example, in Gibson et al., Genome Res.
  • Probes used for PCR may be labeled with a detectable marker, such as, for example, a radioisotope, fluorescent compound, bioluminescent compound, chemiluminescent compound, metal chelator, or enzyme.

Proteins

  • protein expression data can be assessed according to the methods described herein.
  • protein expression analyses that generate data suitable for use in the methods described herein include, without limitation, proteomics approaches, immunohistochemical and/or western blot analysis, immunoprecipitation, molecular binding assays, ELISA, ELIFA, mass spectrometry, mass spectrometric immunoassay, and biochemical enzymatic activity assays.
  • proteomics methods can be used to generate large-scale protein expression datasets in multiplex.
  • Proteomics methods may utilize mass spectrometry to detect and quantify polypeptides (e.g., proteins) and/or peptide microarrays utilizing capture reagents (e.g., antibodies) specific to a panel of target proteins to identify and measure expression levels of proteins expressed in a sample (e.g., a single cell sample).
  • the resultant datasets can be assessed according to the methods of the invention, for example, to identify samples meeting a quality threshold or to determine whether the dataset as a whole is of sufficient quality.
  • Exemplary peptide microarrays have a substrate-bound plurality of polypeptides, the binding of an oligonucleotide, a peptide, or a protein to each of the plurality of bound polypeptides being separately detectable.
  • the peptide microarray may include a plurality of binders, including but not limited to monoclonal antibodies, polyclonal antibodies, phage display binders, yeast two-hybrid binders, and aptamers, which can specifically detect the binding of specific oligonucleotides, peptides, or proteins. Examples of peptide arrays may be found in U.S. Patent Nos. 6,268,210, 5,766,960, and 5,143,854, the disclosures of which are incorporated herein by reference in their entireties.
  • Mass spectrometry (MS) may be used in methods described herein to identify and characterize the protein composition of complex samples, e.g., a library. Any method of MS known in the art may be used to determine, detect, and/or measure a peptide or peptides of interest, e.g., LC-MS, ESI-MS, ESI-MS/MS, MALDI-TOF-MS, MALDI-TOF/TOF-MS, tandem MS, and the like.
  • Mass spectrometers generally consist of an ion source and optics, mass analyzer, and data processing electronics.
  • Mass analyzers, including scanning and ion-beam mass spectrometers, such as time-of-flight (TOF) and quadrupole (Q), and trapping mass spectrometers, such as ion trap (IT), Orbitrap, and Fourier transform ion cyclotron resonance (FT-ICR), may be used in the methods described herein. Details of various MS methods can be found in the literature. See, for example, Yates et al., Annu. Rev. Biomed. Eng. 11:49-79, 2009.
  • Prior to MS analysis, proteins in a sample are first digested into smaller peptides by chemical or enzymatic (e.g., trypsin) digestion. Complex peptide samples also benefit from the use of front-end separation techniques, e.g., 2D-PAGE, HPLC, RPLC, and affinity chromatography. The digested, and optionally separated, sample is then ionized using an ion source to create charged molecules for further analysis.
  • Ionization of the sample may be performed, e.g., by electrospray ionization (ESI), atmospheric pressure chemical ionization (APCI), photoionization, electron ionization, fast atom bombardment (FAB)/liquid secondary ionization (LSIMS), matrix assisted laser desorption/ionization (MALDI), field ionization, field desorption, thermospray/plasmaspray ionization, and particle beam ionization. Additional information relating to the choice of ionization method is known to those of skill in the art.
  • Tandem MS (also known as MS/MS)
  • Tandem MS may be particularly useful for methods described herein, allowing for ionization followed by fragmentation of a complex peptide sample, such as a library described herein.
  • Tandem MS involves multiple steps of MS selection, with some form of ion fragmentation occurring in between the stages, which may be accomplished with individual mass spectrometer elements separated in space or using a single mass spectrometer with the MS steps separated in time.
  • In spatially separated tandem MS, the elements are physically separated and distinct, with a physical connection between the elements to maintain high vacuum.
  • In temporally separated tandem MS, separation is accomplished with ions trapped in the same place, with multiple separation steps taking place over time.
  • Signature MS/MS spectra may then be compared against a peptide sequence database (e.g., using SEQUEST).
  • Post-translational modifications to peptides may also be determined, for example, by searching spectra against a database while allowing for specific peptide modifications.
  • the present invention features methods for assessing the quality of gene expression libraries (e.g., nucleic acid expression libraries and/or polypeptide expression libraries) based on the expression of a set of predetermined markers.
  • a library is considered to be of sufficient quality (i.e., meet a quality minimum or quality threshold) if a threshold number of the predetermined markers is found to be expressed by the library.
  • a library may meet a quality minimum if at least 5 (e.g., at least about 5, 10, 15, 20, 25, 30, 35, 40, 50, 60, 70, 80, 90 or 100) of the predetermined markers are present.
  • a library is considered to meet a quality minimum if a threshold percentage of the predetermined markers is found to be expressed by the library.
  • a library may meet a quality minimum if at least 50% (e.g., at least about 50%, 60%, 70%, 80%, 90%, 95%, 97%, 99% or 100%) of the predetermined markers are present.
  • a threshold number or percentage may be selected according to gold-standard data in the literature or may be determined experimentally, e.g., according to methods well understood in the art.
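
The count and percentage criteria above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method as claimed: the function name `meets_quality_minimum`, the representation of a library as a mapping from marker name to detected quantity, and the rule that a marker is "present" when its quantity is greater than zero are all hypothetical choices made for the example.

```python
from typing import Mapping, Optional, Sequence


def meets_quality_minimum(
    library: Mapping[str, float],
    markers: Sequence[str],
    min_count: Optional[int] = None,
    min_fraction: Optional[float] = None,
) -> bool:
    """Return True if enough predetermined markers are detected in a library.

    Assumption: a marker counts as "present" when its quantity is > 0.
    """
    if min_count is None and min_fraction is None:
        raise ValueError("set min_count and/or min_fraction")
    if not markers:
        raise ValueError("marker panel is empty")
    # Count the panel markers detected in this library.
    detected = sum(1 for m in markers if library.get(m, 0) > 0)
    if min_count is not None and detected < min_count:
        return False
    if min_fraction is not None and detected / len(markers) < min_fraction:
        return False
    return True


# Hypothetical four-marker panel and one library with three markers detected.
panel = ["RPL22", "RPL11", "RPS8", "RPL5"]
library = {"RPL22": 12, "RPL11": 7, "RPS8": 0, "RPL5": 3, "GAPDH": 40}
passes_fraction = meets_quality_minimum(library, panel, min_fraction=0.5)
passes_count = meets_quality_minimum(library, panel, min_count=4)
print(passes_fraction, passes_count)
```

In practice the threshold values would come from reference data or from experiment, as the text notes; the values used here are placeholders.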
  • a library may meet a quality minimum if it aligns substantially with a preselected distribution.
  • a sufficient quantity of predetermined markers e.g., ribosomal protein genes
  • gene expression libraries may be
  • a population of gene expression libraries may be ranked, for example, by the number of predetermined markers detected in each or all of the libraries and/or by the quantity of each of and/or all of the predetermined markers in each or all of the libraries.
  • a quality score may be calculated for each library according to statistical methods known in the art, e.g., to determine the rankings of the libraries.
  • Gene expression libraries having a quality score greater than or equal to a preselected minimum quality score may be considered to meet a quality threshold.
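
As a rough sketch of the ranking step, the Python below scores each library by the fraction of predetermined markers detected and breaks ties by the total quantity of those markers. The scoring rule, the name `rank_libraries`, the sample data, and the minimum score of 0.75 are assumptions chosen for illustration; the text leaves the statistical method open.

```python
from typing import List, Mapping, Sequence, Tuple


def rank_libraries(
    libraries: Mapping[str, Mapping[str, float]],
    markers: Sequence[str],
) -> List[Tuple[str, float, float]]:
    """Rank libraries by (fraction of markers detected, total marker quantity)."""
    if not markers:
        raise ValueError("marker panel is empty")
    rows = []
    for name, counts in libraries.items():
        detected = sum(1 for m in markers if counts.get(m, 0) > 0)
        total = sum(counts.get(m, 0) for m in markers)
        rows.append((name, detected / len(markers), total))
    # Highest score first; total marker quantity breaks ties.
    return sorted(rows, key=lambda r: (r[1], r[2]), reverse=True)


# Hypothetical single-cell libraries keyed by an arbitrary cell label.
panel = ["RPL22", "RPL11", "RPS8", "RPL5"]
libs = {
    "cell_A": {"RPL22": 5, "RPL11": 2, "RPS8": 1, "RPL5": 4},
    "cell_B": {"RPL22": 9, "RPL11": 0, "RPS8": 0, "RPL5": 1},
    "cell_C": {"RPL22": 1, "RPL11": 1, "RPS8": 0, "RPL5": 0},
}
ranked = rank_libraries(libs, panel)
min_score = 0.75  # hypothetical preselected minimum quality score
passing = [name for name, score, _ in ranked if score >= min_score]
print(ranked)
print(passing)
```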
  • the subset of the libraries showing the greatest quality score, the greatest number of predetermined markers detected, and/or the greatest quantity of each of and/or all of the predetermined markers may be selected and, optionally, separated from the remainder of the libraries.

Distributions

  • the distribution of the number of predetermined markers expressed by each library and/or the distribution of the quantity of each predetermined marker expressed may be indicative of the quality of the multiple gene expression libraries.
  • a distribution that aligns substantially with a predetermined distribution e.g., a Gaussian distribution
  • Ribosomal protein gene expression can be used, for example, as an effective way to measure the quality of genomic data from single cells (e.g., gene expression profiling by, for example, RNA-Seq). This approach may be more effective than using the global number of genes detected per cell.
  • Figs. 2A and 2B illustrate a non-linear and imperfect relationship between the number of genes detected in a given cell and the number of RPGs detected in the same cell. Many libraries that would be excluded by a conventional threshold on the number of genes detected (e.g., 1,000 genes detected) nonetheless capture as large a percentage of the RPGs as libraries over the threshold. These are likely to be high-quality libraries that are mistakenly removed from further analysis. Similarly, many cells over an arbitrary threshold for the number of genes detected show poor representation of the RPGs, indicating that the library does not accurately represent the underlying transcriptional phenotype of the cell; these cells would nonetheless be inappropriately included for further analysis.
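
The disagreement between the two filters described above can be made concrete with a small Python sketch. It sorts libraries into four groups according to whether they pass a gene-count threshold, an RPG-fraction threshold, both, or neither. Everything here is assumed for illustration: the function name `compare_filters`, the toy thresholds, and the sample counts, including the placeholder gene names `GENE_X`, `GENE_Y`, and `GENE_Z`.

```python
from typing import Dict, Mapping, Sequence, Set


def compare_filters(
    libraries: Mapping[str, Mapping[str, float]],
    rpg_panel: Sequence[str],
    min_genes: int,
    min_rpg_fraction: float,
) -> Dict[str, Set[str]]:
    """Group libraries by which of the two quality filters they pass."""
    if not rpg_panel:
        raise ValueError("RPG panel is empty")
    groups: Dict[str, Set[str]] = {
        "both": set(), "rpg_only": set(), "genes_only": set(), "neither": set()
    }
    for name, counts in libraries.items():
        # Assumption: a gene is "detected" when its count is > 0.
        genes_detected = sum(1 for v in counts.values() if v > 0)
        rpgs_detected = sum(1 for g in rpg_panel if counts.get(g, 0) > 0)
        pass_genes = genes_detected >= min_genes
        pass_rpg = rpgs_detected / len(rpg_panel) >= min_rpg_fraction
        if pass_genes and pass_rpg:
            groups["both"].add(name)
        elif pass_rpg:
            groups["rpg_only"].add(name)    # rescued by the RPG filter
        elif pass_genes:
            groups["genes_only"].add(name)  # flagged by the RPG filter
        else:
            groups["neither"].add(name)
    return groups


panel = ["RPL22", "RPL11", "RPS8", "RPL5"]
cells = {
    "cell_1": {"RPL22": 3, "RPL11": 2, "RPS8": 1, "RPL5": 2, "GAPDH": 9},
    "cell_2": {"RPL22": 1, "RPL11": 1, "RPS8": 1, "RPL5": 0, "GAPDH": 0},
    "cell_3": {"RPL22": 4, "GAPDH": 8, "ACTB": 6,
               "GENE_X": 1, "GENE_Y": 2, "GENE_Z": 1},
}
groups = compare_filters(cells, panel, min_genes=5, min_rpg_fraction=0.75)
print(groups)
```

Libraries in the `rpg_only` group correspond to those a gene-count threshold alone would discard, and those in `genes_only` to libraries it would retain despite poor RPG representation.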

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Immunology (AREA)
  • Physics & Mathematics (AREA)
  • Organic Chemistry (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Microbiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hematology (AREA)
  • Biotechnology (AREA)
  • Biomedical Technology (AREA)
  • Urology & Nephrology (AREA)
  • Biophysics (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biochemistry (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Physics & Mathematics (AREA)
  • Pathology (AREA)
  • Cell Biology (AREA)
  • Medicinal Chemistry (AREA)
  • Food Science & Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention relates to methods for assessing the quality of gene expression libraries and methods for identifying gene expression libraries that meet a quality threshold. The methods generally involve examining a gene expression dataset (e.g., gene expression data corresponding to one or more gene expression libraries) to determine the presence and/or expression level of a plurality of predetermined markers. In some instances, the predetermined markers correspond to a set of coordinately regulated genes, such as ribosomal protein genes.
PCT/US2016/058165 2015-10-21 2016-10-21 Methods for assessing the quality of gene expression libraries WO2017070498A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201562244614P 2015-10-21 2015-10-21
US62/244,614 2015-10-21

Publications (1)

Publication Number Publication Date
WO2017070498A1 (fr) 2017-04-27

Family

ID=58557890

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2016/058165 WO2017070498A1 (fr) 2015-10-21 2016-10-21 Methods for assessing the quality of gene expression libraries

Country Status (1)

Country Link
WO (1) WO2017070498A1 (fr)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6303297B1 (en) * 1992-07-17 2001-10-16 Incyte Pharmaceuticals, Inc. Database for storage and analysis of full-length sequences
US20080318801A1 (en) * 2005-10-28 2008-12-25 Leung Conrad L Method and kit for evaluating rna quality
US20120310537A1 (en) * 2011-06-03 2012-12-06 Paul Kenneth Wolber Identification of aberrant microarray features

Similar Documents

Publication Publication Date Title
CN110475864B Methods and compositions for identifying or quantifying targets in a biological sample
Burke et al. Spliceosome profiling visualizes operations of a dynamic RNP at nucleotide resolution
JP2008521383A Methods, systems, and arrays for classifying cancer, predicting prognosis, and diagnosing based on the association between p53 status and gene expression profiles
EP1885890A2 (fr) Quantification of nucleic acids and proteins using oligonucleotide mass tags
Jakob et al. Interrelationships between yeast ribosomal protein assembly events and transient ribosome biogenesis factors interactions in early pre-ribosomes
US20180057859A1 (en) Method for identifying rare cell types by single cell assisted deconvolution of population gene expression data
Yap et al. Molecular diagnostics in oral cancer and oral potentially malignant disorders—A clinician’s guide
US20150184223A1 (en) Method for improved quantification of mirnas
US20170362641A1 (en) Dual polarity analysis of nucleic acids
David et al. Functional Genomics meets neurodegenerative disorders: part I: transcriptomic and proteomic technology
US20100035265A1 (en) Biomarkers for Drug-Induced Liver Injury
Khanna et al. A systematic characterization of Cwc21, the yeast ortholog of the human spliceosomal protein SRm300
WO2016178236A1 (fr) Methods and kits for breast cancer prognosis
US20230236189A1 (en) Microbial Identification and Quantitation Using MS Cleavable Tags
WO2017070498A1 (fr) Methods for assessing the quality of gene expression libraries
Anjum et al. Understanding Stress‐Responsive Mechanisms in Plants: An Overview of Transcriptomics and Proteomics Approaches
WO2006119996A1 (fr) Method for normalizing gene expression data
Widłak High-throughput technologies in molecular biology
US20210164020A1 (en) Degradable carrier nucleic acid for use in the extraction, precipitation and/or purification of nucleic acids
Dodel et al. TREX reveals proteins that bind to specific RNA regions in living cells
Thibivilliers et al. Plant Single-Cell/Nucleus RNA-seq Workflow
US20130023429A1 (en) Transcription chip
US20230138328A1 (en) Methods and use of chimeric proteins
Kowalski et al. Accelerating discoveries in the proteome and genome with MALDI TOF MS
US20150105270A1 (en) Biomarkers for increased risk of drug-induced liver injury from exome sequencing studies

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16858312

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16858312

Country of ref document: EP

Kind code of ref document: A1