WO2020068731A1 - Methods for integrating cell gene expression data from multiple single-cell data sets and associated uses - Google Patents
- Publication number
- WO2020068731A1 (PCT/US2019/052570)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- cell
- data
- cells
- data sets
- populations
Classifications
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6876—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
- C12Q1/6881—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for tissue or cell typing, e.g. human leukocyte antigen [HLA] probes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/02—Knowledge representation; Symbolic representation
- G06N5/022—Knowledge engineering; Knowledge acquisition
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/30—Unsupervised data analysis
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/30—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q2600/00—Oligonucleotides characterized by their use
- C12Q2600/158—Expression markers
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01N—INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
- G01N2800/00—Detection or diagnosis of diseases
- G01N2800/52—Predicting or monitoring the response to treatment, e.g. for selection of therapy based on assay results in personalised medicine; Prognosis
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01N—INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
- G01N33/00—Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
- G01N33/48—Biological material, e.g. blood, urine; Haemocytometers
- G01N33/50—Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
- G01N33/68—Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing involving proteins, peptides or amino acids
- G01N33/6893—Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing involving proteins, peptides or amino acids related to diseases not provided for elsewhere
- G01N33/6896—Neurological disorders, e.g. Alzheimer's disease
Definitions
- the disclosure relates to methods for integrating cell gene expression data from multiple single-cell data sets. More particularly, the disclosure relates to methods of using a dimensionality reduction approach to obtain a lower rank representation of cell gene expression in two or more populations of cells to identify shared maximum factor values that may be used to cluster cells in the two or more populations into one or more subpopulations.
- the present disclosure provides a method of identifying subpopulations within groups of populations.
- the techniques herein allow one or more subpopulations of cells within two or more populations of cells to be identified by obtaining, via a dimensionality reduction approach, a lower rank representation of cell gene expression in the two or more populations of cells and then representing each cell within the two or more populations of cells as a vector of maximum factor values of each cell’s k nearest neighbors, where k ranges from about 1 to about 100 so that a number of shared maximum factor values of k nearest neighbors for each cell within the two or more populations of cells may be identified and compared, thereby making it possible to cluster cells into the one or more subpopulations based on the number of shared maximum factor values.
- the disclosure provides a method for identifying one or more subpopulations of cells within two or more populations of cells, that includes the steps of: obtaining, by a dimensionality reduction approach, a lower rank representation of cell gene expression in the two or more populations of cells; representing each cell within the two or more populations of cells as a vector of maximum factor values of each cell’s k nearest neighbors, where k ranges from about 1 to about 100; identifying and comparing a number of shared maximum factor values of k nearest neighbors for each cell within the two or more populations of cells; and clustering cells into the one or more subpopulations based on the number of shared maximum factor values.
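The representation step above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the disclosed implementation: it reads "vector of maximum factor values of each cell's k nearest neighbors" as a count, per factor, of how many of a cell's k neighbors load most heavily on that factor. The function names, the Euclidean distance, and the toy loadings are all hypothetical.

```python
from collections import Counter
import math

def max_factor(loadings):
    """Index of the factor with the largest loading for one cell."""
    return max(range(len(loadings)), key=lambda j: loadings[j])

def k_nearest(cells, i, k):
    """Indices of the k cells closest to cell i (Euclidean distance in factor space)."""
    dists = [(math.dist(cells[i], c), j) for j, c in enumerate(cells) if j != i]
    return [j for _, j in sorted(dists)[:k]]

def factor_neighborhood(cells, k):
    """For each cell, count how often each factor is the maximum factor among its k neighbors."""
    top = [max_factor(c) for c in cells]
    n_factors = len(cells[0])
    vectors = []
    for i in range(len(cells)):
        counts = Counter(top[j] for j in k_nearest(cells, i, k))
        vectors.append([counts.get(f, 0) for f in range(n_factors)])
    return vectors

# Toy factor loadings (illustrative values only): 6 cells x 2 factors
loadings = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.05],
            [0.1, 0.9], [0.2, 0.8], [0.05, 0.95]]
vectors = factor_neighborhood(loadings, k=2)
print(vectors)
```

With these toy values the first three cells share one neighborhood profile and the last three share another, which is the property a later clustering step can exploit.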
- the cell gene expression is selected from the group consisting of single-cell RNA expression data, single-cell methylation data (optionally methylomics data), ATAC-seq data, CITE-seq data, CYTOF data, CODEX data, MERFISH data, SeqFISH data, osmFISH data, STARmap data and spatial transcriptomics data, optionally wherein the single-cell RNA expression data are selected from the group consisting of Drop-seq data, inDrop data, Chromium (10x) data, Split-seq data, MARS-seq data and sci-RNA-seq data, optionally wherein the Drop-seq data are Slide-seq data.
- the two or more populations of cells are from distinct species.
- the dimensionality reduction approach is selected from the group consisting of principal component analysis, independent component analysis, and non-negative matrix factorization (NMF).
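As an illustration of one of the listed options, the sketch below factors a small non-negative cells-by-genes matrix with plain NMF using the standard Lee-Seung multiplicative updates. It is not the integrated NMF variant, and the names `nmf`, `C` (cell loadings), `G` (gene loadings) and the toy matrix are assumptions made for the example.

```python
import random

def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frobenius_error(X, C, G):
    """Squared Frobenius norm of X - C @ G."""
    CG = matmul(C, G)
    return sum((x - y) ** 2 for rx, ry in zip(X, CG) for x, y in zip(rx, ry))

def nmf(X, rank, iterations=200, seed=0, eps=1e-9):
    """Approximate X (cells x genes) as C (cells x rank) @ G (rank x genes), all non-negative."""
    rng = random.Random(seed)
    m, n = len(X), len(X[0])
    C = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(m)]
    G = [[rng.random() + 0.1 for _ in range(n)] for _ in range(rank)]
    for _ in range(iterations):
        # G <- G * (C^T X) / (C^T C G)
        Ct = transpose(C)
        num, den = matmul(Ct, X), matmul(matmul(Ct, C), G)
        G = [[g * a / (b + eps) for g, a, b in zip(rg, ra, rb)]
             for rg, ra, rb in zip(G, num, den)]
        # C <- C * (X G^T) / (C G G^T)
        Gt = transpose(G)
        num, den = matmul(X, Gt), matmul(C, matmul(G, Gt))
        C = [[c * a / (b + eps) for c, a, b in zip(rc, ra, rb)]
             for rc, ra, rb in zip(C, num, den)]
    return C, G

# Toy expression matrix (illustrative values only): 4 cells x 3 genes
X = [[5, 4, 0], [4, 5, 0], [0, 1, 5], [0, 0, 4]]
C, G = nmf(X, rank=2)
print(frobenius_error(X, C, G))
```

Each row of `C` is the lower rank representation of one cell; the index of its largest entry is that cell's maximum factor.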
- NMF non-negative matrix factorization
- the NMF is integrated NMF (INMF).
- k ranges from about 5 to about 50.
- data for the individual cells comprises a positional/location value that is employed in performing the comparing step that identifies the one or more subpopulations of cells within the two or more populations of cells, optionally wherein the positional/location value allows for inclusion of “miscalled” individual cells within an otherwise homogenous subpopulation of cells.
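One way a positional value could pull a "miscalled" cell back into an otherwise homogeneous subpopulation is a spatial neighbor vote. The sketch below is a hypothetical illustration only: the text does not specify how the positional value is used, and the rule here (relabel a cell when all k spatially nearest cells agree on a different label) is an assumption.

```python
from collections import Counter
import math

def smooth_labels(positions, labels, k):
    """Reassign a cell's label when all k spatially nearest cells agree on a different label."""
    smoothed = list(labels)
    for i, p in enumerate(positions):
        nearest = sorted((math.dist(p, q), j)
                         for j, q in enumerate(positions) if j != i)[:k]
        # Votes use the original labels so the result does not depend on visiting order
        votes = Counter(labels[j] for _, j in nearest)
        label, count = votes.most_common(1)[0]
        if count == k and label != labels[i]:
            smoothed[i] = label
    return smoothed

# Toy layout (illustrative values only): one "B" call sits inside a patch of "A" cells
positions = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5),
             (10, 10), (11, 10), (10, 11), (11, 11)]
labels = ["A", "A", "A", "A", "B", "B", "B", "B", "B"]
smoothed = smooth_labels(positions, labels, k=3)
print(smoothed)
```

Requiring unanimity among the neighbors keeps the rule conservative: cells at a genuine boundary between two subpopulations are left unchanged.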
- the one or more populations of cells are glial cells, optionally wherein a first subpopulation of cells comprises substantia nigra cells.
- the disclosure provides a method for treating and/or modulating treatment of a subject having or at risk of developing a neurodegenerative disease or disorder, that includes the steps of: obtaining a test tissue sample from the subject having or at risk of developing a neurodegenerative disease or disorder; obtaining, by a dimensionality reduction approach, a lower rank representation of cell gene expression in two or more populations of cells within the test tissue sample; representing each cell within the two or more populations of cells as a vector of maximum factor values of each cell’s k nearest neighbors, where k can range from about 1 to about 100; identifying and comparing a number of shared maximum factor values of k nearest neighbors for each cell within the two or more populations of cells; and clustering cells into the one or more subpopulations based on the number of shared maximum factor values; identifying a first subpopulation of cells within the test sample as corresponding to a first subpopulation of cells within a control sample, wherein identifying the first subpopulation of cells within the test sample detects a neurodegenerative disease or disorder and
- the therapeutic agent is selected from the group consisting of donepezil, galantamine, rivastigmine, memantine, and combinations thereof.
- the disclosure provides a method of integrating cell gene expression data from multiple single-cell data sets, that includes the steps of: hosting a dimensionality reduction element (DRE) at a capable node in a computer network; receiving, at the capable node, two or more single-cell data sets; processing, at the capable node with the DRE, the two or more single-cell data sets to obtain a lower rank representation of cell gene expression in the two or more single-cell data sets; representing each cell within the two or more single-cell data sets as a vector of maximum factor values of each cell’s k nearest neighbors, wherein k ranges from about 1 to about 100; computing, at the capable node, a number of shared maximum factor values of k nearest neighbors for each cell within the two or more single-cell data sets; and clustering cells into one or more subpopulation data sets based on the number of shared maximum factor values.
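The final two steps (counting shared maximum factor values and clustering on that count) can be sketched as follows. This assumes each cell has already been summarized as a vector counting, per factor, how many of its k nearest neighbors have that factor as their maximum. The histogram-intersection count, the threshold, and the connected-components grouping are assumptions chosen for brevity; the text does not name a specific clustering algorithm.

```python
def shared_count(u, v):
    """Histogram intersection: neighbor maximum-factor assignments two cells have in common."""
    return sum(min(a, b) for a, b in zip(u, v))

def cluster_by_shared(vectors, threshold):
    """Label connected components of the graph linking cells with shared_count >= threshold."""
    labels = [None] * len(vectors)
    current = 0
    for start in range(len(vectors)):
        if labels[start] is not None:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j, v in enumerate(vectors):
                if labels[j] is None and shared_count(vectors[i], v) >= threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

# Toy neighborhood vectors (illustrative values only): 5 cells x 3 factors, k = 5
neighborhoods = [[5, 0, 0], [4, 1, 0], [0, 5, 0], [1, 4, 0], [0, 0, 5]]
clusters = cluster_by_shared(neighborhoods, threshold=3)
print(clusters)
```

Because the vectors are built from factor assignments rather than raw expression values, cells from different data sets can be compared on the same footing.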
- DRE dimensionality reduction element
- the one or more single-cell data sets are selected from the group consisting of single-cell RNA expression data, single-cell methylation data, and single-cell chromatin accessibility data.
- one or more single-cell data sets are from distinct species.
- the DRE is selected from the group consisting of principal component analysis, independent component analysis, and non-negative matrix factorization (NMF).
- the NMF is integrated NMF (INMF).
- k ranges from about 5 to about 50.
- the disclosure provides an apparatus, that includes: one or more network interfaces to communicate with a computer network; a processor coupled to the network interfaces and adapted to execute one or more processes; and a memory configured to store a process executable by the processor, the process when executed operable to: host a dimensionality reduction element (DRE) at a capable node in a computer network; receive, at the capable node, two or more single-cell data sets; process, at the capable node with the DRE, the two or more single-cell data sets to obtain a lower rank representation of cell gene expression in the two or more single-cell data sets; represent each cell within the two or more single-cell data sets as a vector of maximum factor values of each cell’s k nearest neighbors, where k ranges from about 1 to about 100; compute, at the capable node, a number of shared maximum factor values of k nearest neighbors for each cell within the two or more single-cell data sets; and cluster cells into one or more subpopulation data sets based on the number of shared maximum factor values.
- the DRE is selected from the group consisting of principal component analysis, independent component analysis, and non-negative matrix factorization (NMF).
- the NMF is integrated NMF (INMF).
- the term “about” is understood as within a range of normal tolerance in the art, for example within 2 standard deviations of the mean. About can be understood as within 10%, 9%, 8%, 7%, 6%, 5%, 4%, 3%, 2%, 1%, 0.5%, 0.1%, 0.05%, or 0.01% of the stated value. Unless otherwise clear from the context, all numerical values provided herein are modified by the term about.
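As a worked illustration of the percentage reading of "about", the check below tests whether a value falls within a chosen fraction of a stated value. The function name and the default of 10% are assumptions for the example; the definition permits any of the listed tolerances.

```python
def is_about(value, stated, tolerance=0.10):
    """True when value lies within tolerance (a fraction) of the stated value."""
    return abs(value - stated) <= tolerance * abs(stated)

print(is_about(54, 50))          # 54 is within 10% of 50
print(is_about(56, 50))          # 56 is not
print(is_about(50.4, 50, 0.01))  # 50.4 is within 1% of 50
```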
- By “biological sample” is meant any tissue, cell, fluid, or other material derived from an organism or collected from the environment.
- By “an effective amount” is meant the amount of an agent required to ameliorate the symptoms of a disease relative to an untreated patient.
- the effective amount of active agent(s) used to practice the present disclosure for therapeutic treatment of a disease varies depending upon the manner of administration, the age, body weight, and general health of the subject. Ultimately, the attending physician or veterinarian will decide the appropriate amount and dosage regimen. Such amount is referred to as an “effective” amount.
- Ranges provided herein are understood to be shorthand for all of the values within the range.
- a range of 1 to 50 is understood to include any number, combination of numbers, or sub-range from the group consisting 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
- a nested sub-range of an exemplary range of 1 to 50 may comprise 1 to 10, 1 to 20, 1 to 30, and 1 to 40 in one direction, or 50 to 40, 50 to 30, 50 to 20, and 50 to 10 in the other direction.
- By “subject” is meant a mammal, including, but not limited to, a human or non-human mammal, such as a bovine, canine, equine, feline, ovine, or primate.
- A “therapeutically effective amount” is an amount sufficient to effect beneficial or desired results, including clinical results.
- An effective amount can be administered in one or more administrations.
- the terms “treat,” “treating,” “treatment,” and the like refer to reducing or ameliorating a disorder and/or symptoms (e.g., PPARG activated cancer, bladder cancer, and the like) associated therewith. It will be appreciated that, although not precluded, treating a disorder or condition does not require that the disorder, condition or symptoms associated therewith be completely eliminated. Where applicable or not specifically disclaimed, any one of the embodiments described herein is contemplated to be able to combine with any other one or more embodiments, even though the embodiments are described under different aspects of the disclosure. These and other embodiments are disclosed in, and/or encompassed by, the following Detailed Description.
- FIGS. 1A to 1C demonstrate that data integration requires reconciling remarkable variation in dataset scale and distribution.
- FIG. 1A shows a plot that identifies the heterogeneity of data distributions confronted in attempting to compare between frontal cortex single-cell methylation (scMe) data and single-cell RNAseq (transcriptome) data sets, with the distributions of average values (left hand pair of distributions) and variance of such values (right hand pair of distributions) observed across 21,490 assessed genes shown to differ widely between scMe (left-hand, more bunched distributions) and RNAseq (right-hand, more spread distributions) data sets.
- scMe frontal cortex single-cell methylation
- RNAseq transcriptome
- FIG. 1B shows that data obtained from frontal cortex using StarMap also differs widely in its distribution from that obtained from frontal cortex using scRNAseq.
- the scRNAseq approach Drop-seq provided high density of coverage for a more limited set of total transcripts
- StarMap provided less density of coverage but a rate level of gene detection, on average.
- FIG. 1C shows a scatter plot of substantia nigra mouse versus human data, which exhibited a rho value of 0.59.
- FIGS. 2A to 2C provide further demonstration that data integration requires reconciling remarkable variation in dataset scale and distribution.
- FIG. 2A shows a comparison of frontal cortex scMe versus scRNAseq data sets.
- FIG. 2B shows StarMap versus scRNAseq data sets.
- FIG. 2C shows substantia nigra mouse versus human data set comparisons.
- FIGS. 3A and 3B demonstrate that the LIGER (Linked Inference of Genomic Experimental Relationships) methods of the instant disclosure provide a new approach for finding shared and unique patterns across datasets.
- FIG. 3A shows that LIGER initially employed a dimensionality reduction approach, such as the integrated non-negative matrix factorization (iNMF) approach of Yang et al. Bioinformatics 32: 1-8, to obtain a lower rank representation of cell gene expression in the two or more populations of cells, which provided shared and dataset-specific metagenes.
- FIG. 3B shows that such single-cell dataset-specific metagenes were then used to define clusters of similar/related cells by “factor neighborhoods.”
- FIGS. 4A and 4B demonstrate that LIGER integrated scRNAseq datasets while also preserving cell diversity.
- FIG. 4A shows comparisons between LIGER and Seurat, which demonstrated increased levels of alignment and agreement by LIGER when performed using PBMC datasets that comprised seqwell and tenx data.
- FIG. 4B shows comparisons between LIGER and Seurat, which demonstrated decreased levels of alignment but increased levels of agreement by LIGER when performed using respective oligodendrocytes and interneurons data.
- FIGS. 5A and 5B demonstrate LIGER-mediated integration of data across modalities - here, methylation + RNA.
- FIG. 5A shows mouse frontal cortex neurons, where a data set of 50,000 cells was obtained by Drop-seq, while methylation results for 3,000 single cells were obtained from Luo et al. (Science 357: 600-604).
- FIG. 5B shows the connections of alterations in DNA methylation at several loci— including genes also implicated by GWAS— with the earliest onset of AD.
- FIGS. 6A to 6E demonstrate further LIGER-mediated integration of data across modalities - here, methylation + RNA.
- FIG. 6A shows a representation of clusters.
- FIG. 6B shows a representation of transcriptome data.
- FIG. 6C shows a plot of excitatory and inhibitory classes, plotted for observed Mecp2 expression levels (y-axis) and observed global methylation levels (x-axis) - Mecp2 levels were observed to be generally higher in inhibitory neurons than in excitatory neurons.
- FIG. 6D shows a plot of excitatory and inhibitory classes, plotted for observed Tet3 expression levels (y-axis) and observed global methylation levels (x-axis) - Tet3 levels were observed to be generally higher in excitatory neurons than in inhibitory neurons.
- FIG. 6E shows a plot of excitatory and inhibitory classes, plotted for observed Dnmt3a expression levels (y-axis) and observed global methylation levels (x-axis) - Dnmt3a levels were observed to be roughly equal in both classes.
- FIGS. 7A to 7C demonstrate LIGER-mediated integration of multiplexed in situ data (STARmap) with single-cell analysis. Initially, approximately 1,000 genes were measured using a targeted in situ RNA sequencing approach, using SNAIL probes, hydrogel-tissue chemistry, SEDAL sequencing with error-reduction and mapping over six cycles, as reported in Wang et al. (Science 361: 2018).
- FIG. 7A shows Drop-seq (light/pink) and StarMap (dark/purple) results at left, while at right, clusters are shown.
- FIG. 7B shows clustering of glial types.
- FIG. 7C shows a further representation of clustering.
- FIGS. 8A and 8B demonstrate LIGER-mediated cross-species and cross-individual analysis of substantia nigra.
- FIG. 8A shows human-mouse comparisons and cross-individual comparisons.
- FIG. 8B shows a chart of clusters and their relative prevalence in the analysis.
- FIGS. 9A and 9B demonstrate that “Eccentric” SPNs were spread evenly across the striatum.
- FIG. 9A shows localization and in situ hybridization results. Multiple multiplex in situ hybridization experiments were performed, which confirmed their presence in the striatum, and characterized their spatial distribution. Two genes— Otof and Cacng5— were widely expressed in the eccentric SPN population, but were never co-expressed elsewhere.
- FIG. 9B shows quantification of cells that were triply positive for these markers and for the pan-SPN marker Ppp1r1b. As shown, relatively even distribution was observed across the dorsal striatum in slide scans that were analyzed.
- FIG. 10 demonstrates that the eccentric identity may be a state change.
- these SPNs could also be divided further by traditional markers of the direct and indirect pathways, shown here with Drd1 and Adora2a expression.
- A subset of the indirect-like eccentrics express the dopamine synthesis gene tyrosine hydroxylase.
- upregulation of Th expression has been previously observed in spiny projection neurons when the striatum was deprived of dopamine input (see Darmopil et al., Eur J Neurosci., 27: 580-592). Accordingly, it is likely that this “eccentric” identity constituted a state response to changes in circuit dynamics.
- FIG. 11 illustrates an example simplified procedure for data set integration.
- the present disclosure is based, at least in part, on the discovery that one or more subpopulations of cells within two or more populations of cells may be identified by obtaining, via a dimensionality reduction approach, a lower rank representation of cell gene expression in the two or more populations of cells and then representing each cell within the two or more populations of cells as a vector of maximum factor values of each cell’s k nearest neighbors, where k ranges from about 1 to about 100 so that a number of shared maximum factor values of k nearest neighbors for each cell within the two or more populations of cells may be identified and compared, thereby making it possible to cluster cells into the one or more subpopulations based on the number of shared maximum factor values.
- the comparison methods of the instant disclosure - in many aspects directed to comparison/analysis of deep sets of single-cell data (e.g., transcriptome data for each of a majority of cells within a population, optionally carrying localization data for each such cell) - are contemplated to help address: (1) how neural pathology of AD might be connected with glia-localized genetic results; (2) what the most veridical models of the AD neurodegenerative process actually are; and (3) which cell states are most relevant to AD progression.
- Certain aspects of the instant disclosure address a critical need in the field for a computational tool that allows for reliable data integration, using single-cell datasets, to examine, e.g., individual variation across such data sets, and cross-species variation across such data sets.
- Many data modalities are being obtained at the single-cell genomic level, including epigenetic modification (methylation, acylation, etc.), genetic/genomic variation, and RNA expression.
- the instant disclosure provides techniques that have initially allowed high throughput, droplet-based single-nucleus RNA-seq to be used to characterize specific genetic regions within cell populations and to understand variation across 20 (and optionally more) individual human donors, further relating these profiles to those generated from the cognate mouse regions, and then, for a select number of conserved types in these regions, to connect physiological data to the molecular data gathered using Patch-Seq.
- RNAseq RNA expression assessment
- Drop-seq see e.g., US Patent Publication No. 2018/0030515
- inDrop Klein et al. (2015) Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell 161: 1187-1201
- Chromium see e.g., 10x Genomics
- Split-seq Rosenberg et al.
- In addition to single-cell RNA expression data, other modalities, such as methylation, can also now be detected at the single-cell level with genomic coverage.
- Approaches for identifying other modalities at genomic scale include methylomics (Yu et al. 2017 Genome-wide, Single-Cell DNA Methylomics Reveals Increased Non-CpG Methylation during Human Oocyte Maturation. Stem Cell Reports 9(1):397-407), ATAC-seq (Buenrostro et al. 2015 ATAC-seq: A Method for Assaying Chromatin Accessibility Genome-Wide. Current Protocols Mol. Biol. 109:21.29.1-21.29.9), CITE-seq (Stoeckius et al. 2017 Simultaneous epitope and transcriptome measurement in single cells. Nature Methods 14:865-68) and CYTOF (Kay et al. 2017 Application of Mass Cytometry (CyTOF) for Functional and Phenotypic Analysis of Natural Killer Cells. Methods Mol. Biol. 1441:13-26).
- Multiplexed in situ data sets can also be examined using the methods of the instant disclosure, with such multiplexed in situ data sets including, for example, CODEX, MERFISH (Chen et al. 2015 Spatially resolved, highly multiplexed RNA profiling in single cells. Science 348), SeqFISH (Shah et al. 2017 seqFISH Accurately Detects Transcripts in Single Cells and Reveals Robust Spatial Organization in the Hippocampus. Neuron 94(4):752-58), osmFISH, STARmap (Wang et al. 2018 Three-dimensional intact-tissue sequencing of single-cell transcriptional states. Science 361), and spatial transcriptomics.
- FIGS. 1A-1C demonstrate that data integration requires reconciling remarkable variation in dataset scale and distribution.
- FIG. 1A shows the distributions of average values (left hand pair of distributions) and variance of such values (right hand pair of distributions) observed across 21,490 assessed genes shown to differ widely between scMe (left-hand, more bunched distributions) and RNAseq (right-hand, more spread distributions) data sets.
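The per-gene averages and variances compared in FIG. 1A can be computed as in the sketch below. The two small matrices are made-up illustrative values, not data from the disclosure, and the function name is hypothetical; the point is only that a bounded methylation-style measurement and a count-style expression measurement sit on very different scales before any integration is attempted.

```python
from statistics import mean, pvariance

def gene_summaries(matrix):
    """Return (mean, variance) per gene for a cells x genes matrix."""
    return [(mean(col), pvariance(col)) for col in zip(*matrix)]

# Made-up values for illustration: 3 cells x 2 genes in each modality
methylation = [[0.80, 0.75], [0.82, 0.70], [0.78, 0.74]]  # fractions, tightly bunched
expression = [[0, 12], [7, 0], [2, 30]]                    # counts, widely spread

for name, data in (("scMe", methylation), ("RNAseq", expression)):
    for gene, (avg, var) in enumerate(gene_summaries(data)):
        print(name, gene, round(avg, 3), round(var, 3))
```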
- FIG. 1B shows that data obtained from frontal cortex using StarMap also differs widely in its distribution from that obtained from frontal cortex using scRNAseq.
- the scRNAseq approach Drop-seq provided high density of coverage for a more limited set of total transcripts
- StarMap provided less density of coverage but a rate level of gene detection, on average. Similar heterogeneity issues are also encountered in cross species data sets.
- FIG. 1C shows a scatter plot of substantia nigra mouse versus human data, which exhibited a rho value of 0.59.
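The text reports a rho value without naming the statistic. Assuming it refers to Spearman's rank correlation (an assumption, since Pearson's coefficient is also sometimes written as rho), the calculation can be sketched as below. The function names and the toy values are illustrative only and are unrelated to the mouse and human data.

```python
def ranks(values):
    """Average ranks (1-based), with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            result[order[t]] = avg
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))

# Toy per-gene values for two species (illustrative only)
species_a = [1, 2, 3, 4, 5]
species_b = [2, 1, 4, 3, 5]
rho = spearman_rho(species_a, species_b)
print(round(rho, 2))
```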
- FIGS. 2A-2C provide further demonstration that data integration requires reconciling remarkable variation in dataset scale and distribution.
- FIG. 2A shows a comparison of frontal cortex scMe versus scRNAseq data sets.
- FIG. 2B shows StarMap versus scRNAseq data sets.
- FIG. 2C shows substantia nigra mouse versus human data set comparisons.
- the techniques herein provide methods of identifying subpopulations within groups of populations by integrating cell gene expression data (e.g., single-cell RNA expression data, single-cell methylation data, single-cell chromatin accessibility data, and the like) from multiple single-cell data sets.
- the techniques herein allow one or more subpopulations of cells within two or more populations of cells to be identified by obtaining, via a dimensionality reduction approach, a lower rank representation of cell gene expression in the two or more populations of cells, and then representing each cell within the two or more populations of cells as a vector of maximum factor values of each cell’s k nearest neighbors, wherein k ranges from about 1 to about 100 so that a number of shared maximum factor values of k nearest neighbors for each cell within the two or more populations of cells may be identified and compared, thereby making it possible to cluster cells into the one or more subpopulations based on the number of shared maximum factor values.
- LIGER Linked Inference of Genomic Experimental Relationships
- FIGS. 3A and 3B demonstrate that the LIGER data set integration methods of the instant disclosure provide a new approach for finding shared and unique patterns across heterogeneous and/or disparate datasets.
- FIG. 3A shows that LIGER initially employs a dimensionality reduction approach, such as the integrated non-negative matrix factorization (iNMF) approach of Yang et al. Bioinformatics 32: 1-8, to obtain a lower rank representation of cell gene expression in the two or more populations of cells, which provides shared and dataset-specific metagenes.
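For orientation, the iNMF objective is commonly written as below in the published iNMF and LIGER literature. The formula is not stated in this text, and the symbols are notation assumed for the illustration: E_i is the cells-by-genes matrix of data set i, H_i holds the cell factor loadings, W holds the shared metagenes, V_i holds the dataset-specific metagenes, and lambda is a tuning parameter that penalizes the dataset-specific component.

```latex
\min_{W,\,V_i,\,H_i \,\ge\, 0} \;
\sum_{i=1}^{d} \left\lVert E_i - H_i \,(W + V_i) \right\rVert_F^2
\;+\; \lambda \sum_{i=1}^{d} \left\lVert H_i V_i \right\rVert_F^2
```

A larger lambda pushes more of the signal into the shared metagenes W, while a smaller lambda leaves more room for dataset-specific structure in each V_i.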
- FIG. 3B shows that such single-cell dataset-specific metagenes may then be used to define clusters of similar/related cells by “factor neighborhoods.”
- FIG. 4A shows comparisons between LIGER and Seurat, which demonstrate increased levels of alignment and agreement by LIGER when performed using PBMC datasets that comprised seqwell and tenx data.
- FIG. 4B shows comparisons between LIGER and Seurat that demonstrate decreased levels of alignment but increased levels of agreement by LIGER when performed using respective oligodendrocytes and interneurons data.
- FIGS. 5A and 5B demonstrate LIGER-mediated integration of data across modalities such as, for example, methylation and RNA levels.
- FIG. 5A shows mouse frontal cortex neurons, where a data set of 50,000 cells was obtained by Drop-seq, while methylation results for 3,000 single cells were obtained from Luo et al. (Science 357: 600-604).
- FIG. 5B shows the connections of alterations in DNA methylation at several loci— including genes also implicated by GWAS— with the earliest onset of AD.
- FIGS. 6A to 6E demonstrate further LIGER-mediated integration of data across methylation and RNA modalities.
- FIG. 6A shows a representation of clusters.
- FIG. 6B shows a representation of transcriptome data in which the color coding correlates to the color coding shown in FIG. 6A.
- FIG. 6B illustrates how disparate data can be integrated together, such that a single cell clustering (FIG. 6A) corresponds to the appropriate spatial layout in the tissue (FIG. 6B).
- FIG. 6C shows a plot of excitatory and inhibitory classes, plotted for observed Mecp2 expression levels (y-axis) and observed global methylation levels (x-axis) - Mecp2 levels were observed to be generally higher in inhibitory neurons than in excitatory neurons.
- FIG. 6D shows a plot of excitatory and inhibitory classes, plotted for observed Tet3 expression levels (y-axis) and observed global methylation levels (x-axis); Tet3 levels were observed to be generally higher in excitatory neurons than in inhibitory neurons.
- FIG. 6E shows a plot of excitatory and inhibitory classes, plotted for observed Dnmt3a expression levels (y-axis) and observed global methylation levels (x-axis); Dnmt3a levels were observed to be roughly equal in both classes.
- FIGS. 7A to 7C demonstrate LIGER-mediated integration of multiplexed in situ data (STARmap) with single-cell analysis. Initially, approximately 1,000 genes were measured using a targeted in situ RNA sequencing approach, using SNAIL probes, hydrogel-tissue chemistry, and SEDAL sequencing with error reduction and mapping over six cycles, as reported in Wang et al. (Science 361: 2018).
- FIG. 7A shows Drop-seq (light/pink) and STARmap (dark/purple) results at left, while at right, clusters are shown.
- FIG. 7B shows clustering of glial types.
- FIG. 7C shows a further representation of clustering that is a subset of the clustering shown in FIG. 7B.
- FIGS. 8A and 8B demonstrate LIGER-mediated cross-species and cross-individual analysis of substantia nigra.
- FIG. 8A shows human-mouse comparisons and cross-individual comparisons.
- FIG. 8B shows a chart of clusters and their relative prevalence in the analysis.
- FIG. 9A shows localization and in situ hybridization results. Multiple multiplex in situ hybridization experiments were performed, which confirmed the presence of SPNs in the striatum, and characterized their spatial distribution.
- FIG. 9B shows quantification of cells that were triply positive for these markers and for the pan-SPN marker Ppp1r1b. As shown, a relatively even distribution was observed across the dorsal striatum in the slide scans that were analyzed.
- The techniques herein made it possible to show that the eccentric SPN identity may represent a state change.
- FIG. 10 shows that, when the eccentric SPN cluster was examined for substructure, these SPNs could be divided further by traditional markers of the direct and indirect pathways, shown here with Drd1 and Adora2a expression.
- A subset of the indirect-like eccentric SPNs expresses the dopamine synthesis gene tyrosine hydroxylase (Th).
- Upregulation of Th expression has been previously observed in spiny projection neurons when the striatum was deprived of dopamine input (see Darmopil et al., Eur J Neurosci., 27: 580-592). Accordingly, it is likely that this “eccentric” identity represents a state response to changes in circuit dynamics.
- FIG. 11 illustrates an example simplified procedure for integrating disparate data sets in a computer network in accordance with one or more embodiments described herein.
- The procedure 100 may start at step 110 and continue to step 120, where, as described in greater detail above, a capable node in a computer network may host a dimensionality reduction element (DRE).
- The procedure 100 may then continue to step 130, in which the capable node may receive two or more single-cell data sets.
- The DRE may then process the two or more single-cell data sets to obtain a lower-rank representation of cell gene expression in the two or more single-cell data sets.
- The procedure 100 may then proceed to step 150, in which the capable node may represent each cell within the two or more single-cell data sets as a vector of maximum factor values of each cell’s k nearest neighbors, wherein k ranges from about 1 to about 100.
- The procedure 100 may then proceed to step 160, in which the capable node may compute a number of shared maximum factor values of k nearest neighbors for each cell within the two or more single-cell data sets.
- The procedure 100 illustratively proceeds to the final step 170, in which the capable node clusters cells into one or more subpopulation data sets based on the number of shared maximum factor values.
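By way of non-limiting illustration, steps 150 to 170 above might be sketched as follows in Python with NumPy. This is a minimal sketch under stated assumptions, not the disclosed implementation: the cell-by-factor loading matrix `H` is assumed to have already been produced by the dimensionality reduction element; cells from all data sets are pooled before the neighbor search; neighbors are found by brute-force Euclidean distance; and the final grouping uses connected components with a hypothetical threshold `min_shared` as a simple stand-in for whatever clustering method is actually employed. All function and parameter names are hypothetical.

```python
import numpy as np


def knn_indices(H, k):
    """Indices of each cell's k nearest neighbors (assumed: Euclidean, brute force)."""
    H = np.asarray(H, dtype=float)
    d = ((H[:, None, :] - H[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d, np.inf)  # a cell is not counted as its own neighbor
    return np.argsort(d, axis=1)[:, :k]


def factor_neighborhoods(H, k):
    """Step 150: per cell, count the maximum-factor labels of its k neighbors."""
    H = np.asarray(H, dtype=float)
    labels = H.argmax(axis=1)                    # factor with the largest loading
    neighbor_labels = labels[knn_indices(H, k)]  # shape: (cells, k)
    n_factors = H.shape[1]
    return np.stack(
        [(neighbor_labels == f).sum(axis=1) for f in range(n_factors)], axis=1
    )


def shared_counts(hist):
    """Step 160: number of shared maximum-factor values for every pair of cells."""
    return np.minimum(hist[:, None, :], hist[None, :, :]).sum(axis=-1)


def cluster_cells(shared, min_shared):
    """Step 170 (stand-in): link cells sharing at least min_shared values,
    then label the connected components of the resulting graph."""
    n = shared.shape[0]
    cluster = np.full(n, -1)
    next_id = 0
    for start in range(n):
        if cluster[start] != -1:
            continue
        cluster[start] = next_id
        stack = [start]
        while stack:
            i = stack.pop()
            for j in np.flatnonzero(shared[i] >= min_shared):
                if cluster[j] == -1:
                    cluster[j] = next_id
                    stack.append(j)
        next_id += 1
    return cluster


# Toy loadings: three cells dominated by factor 0, three by factor 1.
H_demo = np.array([[1.0, 0.1], [0.9, 0.2], [0.8, 0.1],
                   [0.1, 1.0], [0.2, 0.9], [0.1, 0.8]])
hist_demo = factor_neighborhoods(H_demo, k=2)
print(cluster_cells(shared_counts(hist_demo), min_shared=1))
```

In this sketch, each cell's "vector" is a histogram over factors, so two cells share many values only when their neighborhoods are dominated by the same factors. The brute-force distance matrix is quadratic in the number of cells, so an approximate nearest-neighbor index would likely be needed for data sets of the size described above.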
Landscapes
- Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- Chemical & Material Sciences (AREA)
- Medical Informatics (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- General Health & Medical Sciences (AREA)
- Biotechnology (AREA)
- Public Health (AREA)
- General Engineering & Computer Science (AREA)
- Analytical Chemistry (AREA)
- Biophysics (AREA)
- Organic Chemistry (AREA)
- Bioinformatics & Computational Biology (AREA)
- Software Systems (AREA)
- Evolutionary Computation (AREA)
- Epidemiology (AREA)
- Immunology (AREA)
- Evolutionary Biology (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Zoology (AREA)
- Databases & Information Systems (AREA)
- Wood Science & Technology (AREA)
- Genetics & Genomics (AREA)
- Molecular Biology (AREA)
- Artificial Intelligence (AREA)
- Primary Health Care (AREA)
- Cell Biology (AREA)
- Microbiology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioethics (AREA)
- Pathology (AREA)
- Biochemistry (AREA)
- Biomedical Technology (AREA)
- Computational Linguistics (AREA)
- Computing Systems (AREA)
Abstract
The invention relates to methods for integrating cell gene expression data from multiple single-cell data sets. In particular, the invention relates to methods that use a dimensionality reduction approach to obtain a lower-rank representation of cell gene expression in at least two populations of cells, in order to identify shared maximum factor values that can be used to cluster the cells in the at least two populations into one or more subpopulations.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201862736158P | 2018-09-25 | 2018-09-25 | |
US62/736,158 | 2018-09-25 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020068731A1 (fr) | 2020-04-02 |
Family
ID=69950838
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2019/052570 WO2020068731A1 (fr) | Methods for integrating cell gene expression data from multiple single-cell data sets and uses thereof | 2018-09-25 | 2019-09-24 |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2020068731A1 (fr) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113114697A (zh) * | 2021-04-21 | 2021-07-13 | Hefei University of Technology | Online encapsulation method for whole-vehicle cloud test data based on feature self-dimensionality-reduction marking |
EP3901956A1 (fr) * | 2020-04-21 | 2021-10-27 | ETH Zürich | Methods for determining correspondences between biological properties of cells |
CN114944193A (zh) * | 2022-05-20 | 2022-08-26 | Nankai University | Analysis method and system for integrating single-cell transcriptome and spatial transcriptome data |
CN114958996A (zh) * | 2021-05-12 | 2022-08-30 | Zhejiang University | Ultra-high-throughput single-cell sequencing reagent combination |
WO2023142041A1 (fr) * | 2022-01-29 | 2023-08-03 | Cstone Pharmaceuticals, Vistra (Cayman) Limited | Methods for processing sequencing data and uses thereof |
CN116564418A (zh) * | 2023-04-20 | 2023-08-08 | Shenzhen Bay Laboratory | Method and apparatus, device, and storage medium for constructing a cell population correlation network |
WO2024138170A1 (fr) * | 2022-12-23 | 2024-06-27 | Illumina, Inc. | Spatial mRNA/protein co-assays using aptamers |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8229876B2 (en) * | 2009-09-01 | 2012-07-24 | Oracle International Corporation | Expediting K-means cluster analysis data mining using subsample elimination preprocessing |
US20150213375A1 (en) * | 2014-01-24 | 2015-07-30 | Facebook, Inc. | Neighbor determination and estimation |
US20160041149A1 (en) * | 2013-03-15 | 2016-02-11 | Whitehead Institute For Biomedical Research | Cellular discovery platform for neurodegenerative diseases |
US20180169082A1 (en) * | 2014-02-04 | 2018-06-21 | Forest Laboratories Holdings Ltd. | Donepezil compositions and methods of treating alzheimers disease |
- 2019-09-24: WO application PCT/US2019/052570, published as WO2020068731A1 (fr), status active, Application Filing
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8229876B2 (en) * | 2009-09-01 | 2012-07-24 | Oracle International Corporation | Expediting K-means cluster analysis data mining using subsample elimination preprocessing |
US20160041149A1 (en) * | 2013-03-15 | 2016-02-11 | Whitehead Institute For Biomedical Research | Cellular discovery platform for neurodegenerative diseases |
US20150213375A1 (en) * | 2014-01-24 | 2015-07-30 | Facebook, Inc. | Neighbor determination and estimation |
US20180169082A1 (en) * | 2014-02-04 | 2018-06-21 | Forest Laboratories Holdings Ltd. | Donepezil compositions and methods of treating alzheimers disease |
Non-Patent Citations (2)
Title |
---|
BUTLER ET AL.: "Integrated analysis of single cell transcriptomic data across conditions, technologies, and species", NATURE BIOTECHNOLOGY, 2 April 2018 (2018-04-02), XP055619959, Retrieved from the Internet <URL:https://www.nature.com/articles/nbt.4096> [retrieved on 20191117] * |
HAGHVERDI ET AL.: "Batch effects in single- cell RNA sequencing data are corrected by matching mutual nearest neighbours", NAT BIOTECHNOL., vol. 36, no. 5, 2 April 2018 (2018-04-02), XP055593057, Retrieved from the Internet <URL:https://www.nature.com/articles/nbt.4091> [retrieved on 20191117] * |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3901956A1 (fr) * | 2020-04-21 | 2021-10-27 | ETH Zürich | Methods for determining correspondences between biological properties of cells |
WO2021214102A1 (fr) * | 2020-04-21 | 2021-10-28 | ETH Zürich | Method of determining a correspondence between biological properties of cells |
CN113114697A (zh) * | 2021-04-21 | 2021-07-13 | Hefei University of Technology | Online encapsulation method for whole-vehicle cloud test data based on feature self-dimensionality-reduction marking |
CN113114697B (zh) * | 2021-04-21 | 2022-03-11 | Hefei University of Technology | Online encapsulation method for whole-vehicle cloud test data based on feature self-dimensionality-reduction marking |
CN114958996A (zh) * | 2021-05-12 | 2022-08-30 | Zhejiang University | Ultra-high-throughput single-cell sequencing reagent combination |
CN114958996B (zh) * | 2021-05-12 | 2022-12-20 | Zhejiang University | Ultra-high-throughput single-cell sequencing reagent combination |
WO2023142041A1 (fr) * | 2022-01-29 | 2023-08-03 | Cstone Pharmaceuticals, Vistra (Cayman) Limited | Methods for processing sequencing data and uses thereof |
CN114944193A (zh) * | 2022-05-20 | 2022-08-26 | Nankai University | Analysis method and system for integrating single-cell transcriptome and spatial transcriptome data |
WO2024138170A1 (fr) * | 2022-12-23 | 2024-06-27 | Illumina, Inc. | Spatial mRNA/protein co-assays using aptamers |
CN116564418A (zh) * | 2023-04-20 | 2023-08-08 | Shenzhen Bay Laboratory | Method and apparatus, device, and storage medium for constructing a cell population correlation network |
CN116564418B (zh) * | 2023-04-20 | 2024-06-11 | Shenzhen Bay Laboratory | Method and apparatus, device, and storage medium for constructing a cell population correlation network |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2020068731A1 (fr) | Methods for integrating cell gene expression data from multiple single-cell data sets and uses thereof | |
Corces et al. | Single-cell epigenomic analyses implicate candidate causal variants at inherited risk loci for Alzheimer’s and Parkinson’s diseases | |
Arnatkeviciute et al. | Imaging transcriptomics of brain disorders | |
Bakken et al. | Single-cell and single-nucleus RNA-seq uncovers shared and distinct axes of variation in dorsal LGN neurons in mice, non-human primates, and humans | |
CN103797129B (zh) | Resolving genome fractions using polymorphism counts | |
Chen et al. | Guided exploration of genomic risk for gray matter abnormalities in schizophrenia using parallel independent component analysis with reference | |
JP5060945B2 (ja) | Oligonucleotides for cancer diagnosis | |
AU2013211850A1 (en) | Methods for profiling and quantitating cell-free RNA | |
Dong et al. | Population-level variation in enhancer expression identifies disease mechanisms in the human brain | |
WO2012104764A2 (fr) | Method for estimating information flow in biological networks | |
Bild et al. | Application of a priori established gene sets to discover biologically important differential expression in microarray data | |
Clark et al. | Lymphocyte DNA methylation mediates genetic risk at shared immune-mediated disease loci | |
US20230348980A1 (en) | Systems and methods of detecting a risk of alzheimer's disease using a circulating-free mrna profiling assay | |
KR20130048217A (ko) | Gene expression profiling using a small number of transcript measurements | |
Theofilatos et al. | Discovery of stroke-related blood biomarkers from gene expression network models | |
Mahoney et al. | 2017 WONOEP appraisal: Studying epilepsy as a network disease using systems biology approaches | |
KR102137029B1 (ko) | Sample data analysis method based on a genome module network composed of filtered data | |
Aune et al. | Profiles of gene expression in human autoimmune disease | |
Altschuler et al. | Pathprinting: An integrative approach to understand the functional basis of disease | |
CN112877419A (zh) | DNA methylation markers for predicting the risk of schizophrenia, and screening method and application thereof | |
Bakken et al. | Single-cell RNA-seq uncovers shared and distinct axes of variation in dorsal LGN neurons in mice, non-human primates and humans | |
Papiez et al. | Integrating Expression Data from Different Microarray Platforms in Search of Biomarkers of Radiosensitivity. | |
JP2002528095A (ja) | Method for improving the detection and classification of gene expression patterns using co-regulated gene sets | |
Polioudakis et al. | A single cell transcriptomic analysis of human neocortical development | |
WO2021095040A1 (fr) | Methods and biomarkers for diagnosis, disease monitoring, personalized drug discovery, and targeted therapy of autoimmune conditions |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 19866020; Country of ref document: EP; Kind code of ref document: A1 |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | 122 | Ep: PCT application non-entry in European phase | Ref document number: 19866020; Country of ref document: EP; Kind code of ref document: A1 |