WO2022109330A1 - Cellular clustering analysis in sequencing datasets - Google Patents

Cellular clustering analysis in sequencing datasets

Info

Publication number
WO2022109330A1
Authority
WO
WIPO (PCT)
Prior art keywords
cells
variants
cell
pseudo
subpopulations
Prior art date
Application number
PCT/US2021/060186
Other languages
French (fr)
Inventor
Saurabh GULATI
Shu Wang
Saurabh PARIKH
Manimozhi MANIVANNAN
Original Assignee
Mission Bio, Inc.
Priority date
Filing date
Publication date
Application filed by Mission Bio, Inc.
Publication of WO2022109330A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/30 Unsupervised data analysis

Definitions

  • SUMMARY: Disclosed herein, in various embodiments, are methods and computer readable media for enhanced cellular clustering in sequencing datasets by 1) using ground truth information (e.g., ground truth information from bulk sequencing) and 2) incorporating pseudo cells into the workflow for improved dimensionality reduction and clustering.
  • the use of ground truth information and pseudo cells enables enhanced detection of minor cellular subpopulations (e.g., to limits of detection of 1%, 0.1%, 0.01%, or lower) within a larger cell population.
  • a method for identifying one or more subpopulations within a cell population comprising: (a) obtaining a first dataset comprising cell sequencing data from single cells in a cell population; (b) filtering the cell sequencing data of the first dataset against at least a ground truth dataset derived from a bulk sequencing analysis; (c) incorporating one or more pseudo cells comprising one or more known variants; (d) generating one or more clusters of cells comprising the pseudo cells; and (e) annotating the one or more clusters of cells based on one or more pseudo cells in the clusters.
  • filtering the cell sequencing data of the first dataset against at least a ground truth dataset comprises (i) identifying a first set of variants within the first dataset, and a second set of variants within the ground truth dataset; and (ii) generating a filtered set of variants by removing a subset of variants within the first set, wherein the subset of variants does not appear in the second set of variants.
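As a non-limiting illustration, steps (i) and (ii) of the filtering can be sketched in Python as a set-membership test. The function name and the variant identifiers below are hypothetical placeholders and are not part of the disclosure; variants are assumed to be represented as comparable string keys.

```python
def filter_against_ground_truth(single_cell_variants, ground_truth_variants):
    """Keep only single-cell variants that also appear in the ground truth set."""
    truth = set(ground_truth_variants)
    # Variants absent from the ground truth (e.g., bulk sequencing) are removed.
    return [variant for variant in single_cell_variants if variant in truth]


# Hypothetical variant keys for illustration only.
single_cell_variants = ["variant_1", "variant_2", "variant_3"]
bulk_variants = ["variant_1", "variant_3", "variant_4"]

filtered_variants = filter_against_ground_truth(single_cell_variants, bulk_variants)
print(filtered_variants)
```

In this sketch, a variant seen only in the single-cell data is dropped, while a variant seen only in the bulk data has no effect on the filtered set.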
  • step (c) of the method occurs before step (a).
  • step (c) of the method occurs after step (b).
  • the method further comprises removing the one or more pseudo cells from the one or more clusters prior to subsequent analysis.
  • the one or more known variants of the pseudo cells are determined through a bulk sequencing analysis.
  • annotating the clusters of cells based on presence of one or more pseudo cells comprises annotating a cluster comprising a pseudo cell as a cell population with one or more known variants of the pseudo cell.
  • annotating the clusters of cells based on presence of one or more pseudo cells comprises annotating a cluster that is devoid of a pseudo cell as a mixed cell population.
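The annotation rule, together with the removal of pseudo cells prior to subsequent analysis, can be sketched as follows. This is a minimal illustration under stated assumptions: cluster assignments are given as a mapping from cell identifier to cluster label, pseudo cells are identified by their keys in a second mapping, and all names and variant identifiers are hypothetical.

```python
def annotate_clusters(cluster_of, pseudo_variants):
    """Label each cluster with the known variants of its pseudo cells, or 'mixed'."""
    members = {}
    for cell_id, cluster in cluster_of.items():
        members.setdefault(cluster, []).append(cell_id)

    annotations = {}
    for cluster, cell_ids in members.items():
        # Collect known variants from every pseudo cell that landed in this cluster.
        known = sorted(
            {variant for cell_id in cell_ids for variant in pseudo_variants.get(cell_id, ())}
        )
        # A cluster devoid of pseudo cells is annotated as a mixed cell population.
        annotations[cluster] = known if known else "mixed"
    return annotations


def drop_pseudo_cells(cluster_of, pseudo_variants):
    """Remove pseudo cells so that only real cells reach subsequent analysis."""
    return {cell: cluster for cell, cluster in cluster_of.items() if cell not in pseudo_variants}


cluster_of = {"cell_1": 0, "cell_2": 0, "pseudo_A": 0, "cell_3": 1, "cell_4": 1}
pseudo_variants = {"pseudo_A": ("variant_1", "variant_2")}

annotations = annotate_clusters(cluster_of, pseudo_variants)
real_cells = drop_pseudo_cells(cluster_of, pseudo_variants)
```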
  • the method further comprises performing a dimensionality reduction analysis prior to generating clusters of cells.
  • the dimensionality reduction analysis is UMAP or PCA.
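For illustration only, a PCA-style reduction can be sketched in pure Python by finding the first principal axis with power iteration and projecting each cell onto it. This is a simplified stand-in: it computes a single component, it does not implement UMAP, and the clustering step is not shown. The 0/1/2 genotype encoding (alternate allele count per variant) and all names are assumptions.

```python
import math


def center_columns(matrix):
    """Subtract the per-variant mean from every cell."""
    n_rows = len(matrix)
    means = [sum(column) / n_rows for column in zip(*matrix)]
    return [[value - mean for value, mean in zip(row, means)] for row in matrix]


def first_principal_axis(matrix, iterations=100):
    """Return the centered data and its leading principal axis (power iteration)."""
    centered = center_columns(matrix)
    n_features = len(centered[0])
    # Scatter matrix: unnormalized covariance between variants.
    scatter = [
        [sum(row[i] * row[j] for row in centered) for j in range(n_features)]
        for i in range(n_features)
    ]
    axis = [1.0] * n_features  # arbitrary start; assumed not orthogonal to the leading axis
    for _ in range(iterations):
        product = [sum(scatter[i][j] * axis[j] for j in range(n_features)) for i in range(n_features)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            break
        axis = [x / norm for x in product]
    return centered, axis


def project(centered, axis):
    """One-dimensional coordinate of each cell along the principal axis."""
    return [sum(value * weight for value, weight in zip(row, axis)) for row in centered]


genotypes = [
    [2, 2, 0],  # cell_1
    [2, 2, 0],  # pseudo cell carrying the same two known variants
    [0, 0, 0],  # cell_2
    [0, 0, 0],  # cell_3
]
centered, axis = first_principal_axis(genotypes)
scores = project(centered, axis)
```

In this toy input the pseudo cell and the cell sharing its genotype receive the same coordinate, which is the property a downstream clustering step relies on when pseudo cells are used to label clusters.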
  • generating clusters of cells comprises implementing an HDBSCAN method.
  • the variants are filtered based on thresholds of allele frequency, read depth, and/or genotyping quality. In some embodiments, the variants are filtered based on thresholds of allele frequency. In some embodiments, the variants are filtered based on thresholds of read depth. In some embodiments, the variants are filtered based on thresholds of genotyping quality.
  • the one or more subpopulations are detected with a lower limit below a lower limit of detection where no pseudo cells are incorporated.
  • the one or more subpopulations are detected with a lower limit below 1%, below 0.5%, below 0.1%, or below 0.01% of the total population. In some embodiments, the one or more subpopulations are detected with a lower limit below 1% of the total population. In some embodiments, the one or more subpopulations are detected with a lower limit below 0.5% of the total population. In some embodiments, the one or more subpopulations are detected with a lower limit below 0.1% of the total population. In some embodiments, the one or more subpopulations are detected with a lower limit below 0.01% of the total population.
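To make the percentages concrete, the following sketch converts each limit of detection into a cell count. The run size of 10,000 cells is a hypothetical figure chosen for illustration and does not come from the disclosure; exact fractions are used to avoid floating-point rounding.

```python
import math
from fractions import Fraction


def cells_at_limit(total_cells, limit_percent):
    """Number of cells that a subpopulation at the given percentage represents."""
    return math.ceil(total_cells * Fraction(limit_percent) / 100)


# Hypothetical run of 10,000 cells; percentages are passed as strings for exactness.
cell_counts = {percent: cells_at_limit(10_000, percent) for percent in ("1", "0.5", "0.1", "0.01")}
print(cell_counts)
```

Under this assumption, a 0.01% limit of detection corresponds to resolving a subpopulation consisting of a single cell among 10,000.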
  • a method for identifying one or more subpopulations within a cell population comprising: (a) obtaining a first dataset comprising cell sequencing data from single cells in a cell population; (b) filtering the cell sequencing data of the first dataset against at least a ground truth dataset derived from a bulk sequencing analysis; and (c) identifying one or more subpopulations within the cell population through a combined unsupervised and supervised clustering of cells, wherein the supervised clustering involves one or more pseudo cells comprising one or more known variants.
  • filtering the cell sequencing data of the first dataset against at least a ground truth dataset comprises: (a) identifying a first set of variants within the first dataset and a second set of variants within the ground truth dataset; and (b) generating a filtered set of variants by removing a subset of variants in the first set, wherein the subset of variants does not appear in the second set of variants.
  • the method further comprises removing the one or more pseudo cells from the one or more clusters prior to subsequent analysis.
  • the one or more known variants of the pseudo cells are determined through a bulk sequencing analysis.
  • the method further comprises performing a dimensionality reduction analysis prior to generating clusters of cells.
  • the dimensionality reduction analysis is UMAP or PCA.
  • the combined unsupervised and supervised clustering of cells comprises implementing an HDBSCAN method.
  • the variants are filtered based on thresholds of allele frequency, read depth, and/or genotyping quality. In some embodiments, the variants are filtered based on thresholds of allele frequency. In some embodiments, the variants are filtered based on thresholds of read depth. In some embodiments, the variants are filtered based on thresholds of genotyping quality.
  • the one or more subpopulations are detected with a lower limit below a lower limit of detection where no pseudo cells are incorporated. In some embodiments, the one or more subpopulations are detected with a lower limit below 1%, below 0.5%, below 0.1%, or below 0.01% of the total population. In some embodiments, the one or more subpopulations are detected with a lower limit below 1% of the total population. In some embodiments, the one or more subpopulations are detected with a lower limit below 0.5% of the total population. In some embodiments, the one or more subpopulations are detected with a lower limit below 0.1% of the total population.
  • the one or more subpopulations are detected with a lower limit below 0.01% of the total population.
  • the non-transitory computer readable medium comprising instructions that, when executed by a processor, cause the processor to: (a) filter the cell sequencing data of a first dataset against at least a ground truth dataset derived from a bulk sequencing analysis; (b) incorporate one or more pseudo cells comprising one or more known variants; (c) generate one or more clusters of cells comprising the pseudo cells according to at least the filtered set of variants; and (d) annotate the one or more clusters of cells based on one or more pseudo cells in the clusters, wherein the first dataset comprises cell sequencing data from single cells in a cell population.
  • filtering the cell sequencing data of the first dataset against at least a ground truth dataset comprises: (i) identifying a first set of variants within the first dataset and a second set of variants within the ground truth dataset; and (ii) generating a filtered set of variants by removing a subset of variants in the first set, wherein the subset of variants does not appear in the second set of variants.
  • the instructions further cause the processor to remove the one or more pseudo cells from the one or more clusters prior to subsequent analysis.
  • the one or more known variants of the pseudo cells are determined through a bulk sequencing analysis.
  • annotating the clusters of cells based on presence of one or more pseudo cells comprises annotating a cluster comprising a pseudo cell as a cell population with one or more known variants of the pseudo cell. In some embodiments, annotating the clusters of cells based on presence of one or more pseudo cells comprises annotating a cluster that is devoid of a pseudo cell as a mixed cell population. In some embodiments, the instructions further cause the processor to perform a dimensionality reduction analysis prior to generating clusters of cells. In some embodiments, the dimensionality reduction analysis is UMAP or PCA. In some embodiments, generating clusters of cells comprises implementing an HDBSCAN method.
  • the variants are filtered based on thresholds of allele frequency, read depth, and/or genotyping quality. In some embodiments, the variants are filtered based on thresholds of allele frequency. In some embodiments, the variants are filtered based on thresholds of read depth. In some embodiments, the variants are filtered based on thresholds of genotyping quality. In some embodiments, the one or more subpopulations are detected with a lower limit below a lower limit of detection where no pseudo cells are incorporated. In some embodiments, the one or more subpopulations are detected with a lower limit below 1%, below 0.5%, below 0.1%, or below 0.01% of the total population.
  • the one or more subpopulations are detected with a lower limit below 1% of the total population. In some embodiments, the one or more subpopulations are detected with a lower limit below 0.5% of the total population. In some embodiments, the one or more subpopulations are detected with a lower limit below 0.1% of the total population. In some embodiments, the one or more subpopulations are detected with a lower limit below 0.01% of the total population.
  • non-transitory computer readable medium for identifying one or more subpopulations within a cell population
  • the non-transitory computer readable medium comprising instructions that, when executed by a processor, cause the processor to: filter cell sequencing data of a first dataset against at least a ground truth dataset derived from a bulk sequencing analysis; and identify one or more subpopulations within a cell population through a combined unsupervised and supervised clustering of cells, wherein the supervised clustering involves one or more pseudo cells comprising one or more known variants, wherein the first dataset comprises cell sequencing data obtained from single cells in the cell population.
  • filtering the cell sequencing data of the first dataset against at least a ground truth dataset comprises: (i) identifying a first set of variants within the first dataset and a second set of variants within the ground truth dataset; and (ii) generating a filtered set of variants by removing a subset of variants in the first set, wherein the subset of variants does not appear in the second set of variants.
  • the instructions further cause the processor to remove the one or more pseudo cells from the one or more clusters prior to subsequent analysis.
  • the one or more known variants of the pseudo cells are determined through a bulk sequencing analysis.
  • the instructions further cause the processor to perform a dimensionality reduction analysis prior to generating clusters of cells.
  • the dimensionality reduction analysis is UMAP or PCA.
  • the combined unsupervised and supervised clustering of cells comprises implementing an HDBSCAN method.
  • the variants are filtered based on thresholds of allele frequency, read depth, and/or genotyping quality. In some embodiments, the variants are filtered based on thresholds of allele frequency. In some embodiments, the variants are filtered based on thresholds of read depth. In some embodiments, the variants are filtered based on thresholds of genotyping quality.
  • the one or more subpopulations are detected with a lower limit below a lower limit of detection where no pseudo cells are incorporated. In some embodiments, the one or more subpopulations are detected with a lower limit below 1%, below 0.5%, below 0.1%, or below 0.01% of the total population. In some embodiments, the one or more subpopulations are detected with a lower limit below 1% of the total population. In some embodiments, the one or more subpopulations are detected with a lower limit below 0.5% of the total population. In some embodiments, the one or more subpopulations are detected with a lower limit below 0.1% of the total population.
  • FIG. 1A depicts an overall system environment including a cell analysis workflow device and a computing device for identifying cellular subpopulations, in accordance with an embodiment.
  • FIG. 1B depicts a block diagram of separate modules of a computing device, in accordance with an embodiment.
  • FIG. 2 is a flow diagram of the clustering analysis, in accordance with an embodiment.
  • FIG. 3 depicts an example computing device for implementing systems and methods described in reference to FIGs. 1A, 1B, and 2.
  • FIG. 4 depicts results of a clustering analysis that does not use pseudo cells representing a truth dataset and does not filter variants using a ground truth dataset (e.g., bulk sequencing dataset).
  • FIG. 5 depicts results of a clustering analysis that implements pseudo cells representing a truth dataset and filters variants using a ground truth dataset (e.g., bulk sequencing dataset).
  • FIG. 6A depicts results of a clustering analysis that implements pseudo cells representing a truth dataset and filters variants using a ground truth dataset (e.g., bulk sequencing dataset) using a 100-plex targeted sequencing panel.
  • FIG. 6B depicts a heatmap of variants across different populations of cells in accordance with the results of the targeted sequencing panel described in FIG. 6A.
  • FIG. 7A depicts results of a clustering analysis that implements pseudo cells representing a truth dataset and filters variants using a ground truth dataset (e.g., bulk sequencing dataset) using a 600-plex targeted sequencing panel.
  • FIG. 7B depicts a heatmap of variants across different populations of cells in accordance with the results of the targeted sequencing panel described in FIG. 7A.
  • FIG. 8A depicts results of a clustering analysis that implements pseudo cells representing a truth dataset using an AML_V2 (128-plex) targeted sequencing panel.
  • FIG. 8B depicts a heatmap of variants across different populations of cells in accordance with the results of the targeted sequencing panel described in FIG. 8A.
  • FIG. 9A depicts results of a clustering analysis that implements pseudo cells representing a truth dataset using an AML_V2 (128-plex) targeted sequencing panel.
  • FIG. 9B depicts a heatmap of variants across different populations of cells in accordance with the results of the targeted sequencing panel described in FIG. 9A.
  • FIG. 10A depicts results of a clustering analysis of the same run as FIG. 8A that does not implement pseudo cells.
  • FIG. 10B depicts results of a clustering analysis of the same run as FIG. 9A that does not implement pseudo cells.
  • subject encompasses a cell, tissue, or organism, human or non-human, whether in vivo, ex vivo, or in vitro, male or female.
  • sample can include a single cell or multiple cells or fragments of cells or an aliquot of body fluid, such as a blood sample, taken from a subject, by means including venipuncture, excretion, ejaculation, massage, biopsy, needle aspirate, lavage sample, scraping, surgical incision, or intervention or other means known in the art.
  • “mismatched base” and “alternate base” are used interchangeably and refer to a base at a position that differs from a known reference base at the same position.
  • a mismatched base is erroneously identified (e.g., erroneously identified during sequencing).
  • An erroneous identification of a base can arise from various sources such as PCR errors, sequencing errors, sequencing alignment errors, and/or correction errors.
  • a known base at a reference position can be adenine (A).
  • a mismatched base or alternate base refers to a base other than adenine (A) at the same position (e.g., the base is any one of guanine (G), cytosine (C), or thymine (T)).
  • the phrase “reference base” refers to a base at a given position whose nucleotide identity is known. In one embodiment, the reference base is determined from a reference genome sequence. In one embodiment, the reference base is determined from one or more sequence reads obtained from a control cell.
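The relationship between a reference base and a mismatched (alternate) base can be illustrated with a short sketch. It assumes an ungapped read aligned to the reference at a known start offset; the sequences and the function name are hypothetical.

```python
def find_mismatched_bases(reference, read, start=0):
    """Return (position, reference_base, alternate_base) for each differing position."""
    mismatches = []
    for offset, base in enumerate(read):
        position = start + offset
        reference_base = reference[position]
        if base != reference_base:
            mismatches.append((position, reference_base, base))
    return mismatches


reference = "ACGTACGT"
read = "ACGTGCGT"  # differs from the reference at position 4 (A in reference, G in read)

mismatches = find_mismatched_bases(reference, read)
print(mismatches)
```

A mismatch reported this way may be a true variant or an erroneous identification (e.g., from PCR or sequencing errors); the sketch does not distinguish between the two.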
  • subpopulation refers to a discrete grouping of cells within a larger population of cells that share a common genotype as identified by sequencing analysis.
  • variant encompasses mutations of a cell including polymorphisms, single nucleotide polymorphisms (SNPs), single nucleotide variants (SNVs), insertions, deletions, knock-ins, knock-outs, copy number variations (CNVs), duplications, translocations, and loss of heterozygosity (LOH).
  • a population of cells (e.g., mixed cells) can undergo single cell analysis to generate DNA sequencing reads of individual cells.
  • Methods for identifying cellular subpopulations, also referred to herein as caller pipelines, involve 1) filtering DNA sequencing reads of the individual cells against a ground truth dataset to remove low quality data and 2) incorporating pseudo cells as truth data such that cellular subpopulations are more accurately clustered and identified.
  • the methods disclosed herein achieve improved limit of detection in comparison to conventional methodologies. In some scenarios, the methods disclosed herein achieve at least a 1% limit of detection, at least a 0.1% limit of detection, or at least a 0.01% limit of detection.
  • the methods disclosed herein involve performing a dimensionality reduction of the sequencing data obtained from the single cells.
  • the methods disclosed herein involve a combination of unsupervised and supervised techniques for clustering and improved identification of cellular subpopulations.
  • the unsupervised technique includes unsupervised clustering of cells based on sequence reads or dimensionally reduced sequence reads derived from the cells.
  • the supervised technique includes using pseudo cells representing known ground truth labels. Such pseudo cells are useful for labeling clusters generated through unsupervised clustering techniques based on known genotypes of the pseudo cells.
  • the genotypes of the pseudo cells are determined through sequencing methods, such as bulk sequencing methods.
  • FIG. 1A depicts an overall system environment 100 including a cell analysis workflow device 120 and a computing device 130 for variant calling, in accordance with an embodiment.
  • a cell population 110 is obtained.
  • the cell population 110 can be isolated from a test sample obtained from a subject or a patient.
  • the cell population 110 includes healthy cells taken from a healthy subject.
  • the cell population 110 includes diseased cells taken from a subject.
  • the cell population 110 includes cancer cells taken from a subject previously diagnosed with cancer.
  • cancer cells can be tumor cells available in the bloodstream of the subject diagnosed with cancer.
  • cancer cells can be cells obtained through a tumor biopsy.
  • the cell population 110 may be a mixed cell population. In various embodiments, the cell population 110 may include at least 2 cell lines. In various embodiments, the cell population 110 may include at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, at least 18, at least 19, at least 20, at least 25, at least 30, at least 40, at least 50, at least 60, at least 70, at least 80, at least 90, or at least 100 cell lines. In various embodiments, the cell population 110 may represent a sample of cells pooled from a plurality of subjects.
  • test samples may be obtained from X number of subjects and the test samples are pooled to generate the cell population 110.
  • test samples may be obtained and pooled from at least 2 subjects.
  • test samples may be obtained and pooled from at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, at least 18, at least 19, at least 20, at least 25, at least 30, at least 40, at least 50, at least 60, at least 70, at least 80, at least 90, at least 100, at least 200, at least 300, at least 400, at least 500, at least 600, at least 700, at least 800, at least 900, or at least 1000 subjects.
  • the cell population 110 undergoes bulk sequencing 112 to generate sequence reads of cells of the cell population 110.
  • the pseudo cells 115 are incorporated into the cell population 110 and therefore, bulk sequencing 112 is performed on cells of the cell population as well as the incorporated pseudo cells 115.
  • Bulk sequencing can involve lysing cells in bulk and extracting nucleic acids from the cell population. The nucleic acids are then sequenced using sequencing techniques described herein (e.g., next generation sequencing (NGS) platforms, including platforms that perform any of sequencing by synthesis, sequencing by ligation, pyrosequencing, using reversible terminator chemistry, using phospholinked fluorescent nucleotides, or real-time sequencing).
  • the bulk sequencing 112 may generate sequencing reads that are informative for determining cellular genotypes across the full cell population (e.g., average cellular genotype) as opposed to information at the single-cell level.
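The distinction between a population-average genotype and single-cell information can be sketched numerically. The example assumes diploid cells whose genotype at one variant is encoded as an alternate allele count (0, 1, or 2); the population composition is hypothetical.

```python
def population_allele_frequency(alt_allele_counts):
    """Average alternate allele frequency across diploid cells at one variant."""
    return sum(alt_allele_counts) / (2 * len(alt_allele_counts))


# Hypothetical population: 98 wild-type cells and 2 heterozygous cells.
alt_allele_counts = [0] * 98 + [1] * 2

average_frequency = population_allele_frequency(alt_allele_counts)
carrier_fraction = sum(1 for count in alt_allele_counts if count > 0) / len(alt_allele_counts)
```

Here the average allele frequency is 1% while 2% of cells carry the variant; a bulk-style average reports the former without indicating which cells are carriers.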
  • the bulk sequencing 112 methods are simpler (e.g., using fewer reagents and chemical reactions) than the single-cell processing methods performed by the cell analysis workflow device 120 and therefore, the sequencing reads generated via bulk sequencing 112 may be less prone to chemistry-based errors in comparison to the sequencing reads generated via the cell analysis workflow device 120.
  • the cell population 110 is split such that a first portion of the cell population 110 is provided to the cell analysis workflow device 120 and a second portion of the cell population 110 is provided to the bulk sequencing 112. Therefore, sequencing reads derived from the cell population 110 that are generated via the bulk sequencing 112 can be used to filter sequencing reads and variants that are generated via the cell analysis workflow device 120, as is described in further detail herein.
  • the cell analysis workflow device 120 refers to a device that processes cells and generates nucleic acids for sequencing.
  • the cell analysis workflow device 120 refers to a system comprising one or more devices that process cells and generate nucleic acids for sequencing.
  • the cell analysis workflow device 120 is a workflow device that generates nucleic acids from single cells, thereby enabling the subsequent identification of sequence reads and individual cells from which the sequence reads originated. Further details of sequencing and/or read alignment are described herein. In various embodiments, the cell analysis workflow device 120 can perform single-cell processing by encapsulating individual cells into emulsions, lysing cells within emulsions, performing cell barcoding of cell lysate in emulsions, and performing a nucleic acid amplification reaction in emulsions. Thus, amplified nucleic acids can be collected and sequenced. In various embodiments, the single-cell processing involves amplifying nucleic acids derived from genomic DNA of the cells.
  • the single-cell processing involves amplifying nucleic acids derived from RNA of the cells. Therefore, obtained sequence reads provide information about the gene expression of the cells.
  • the single-cell processing involves amplifying nucleic acids derived from genomic DNA of the cells and further amplifying nucleic acids derived from RNA of the cells. Therefore, obtained sequence reads provide information about both genomic DNA and gene expression of the cells. Further description of example embodiments of single-cell workflow processes for analyzing genomic DNA or RNA of single cells is found in US Application No. 14/420,646 and WO2020206184, each of which is hereby incorporated by reference in its entirety.
  • the cell analysis workflow device 120 can be any of the Tapestri™ Platform, inDrop™ system, Nadia™ instrument, or the Chromium™ instrument.
  • the cell analysis workflow device 120 includes a sequencer for sequencing the nucleic acids to generate sequence reads.
  • the method for identifying cellular subpopulations involves incorporating pseudo cells 115.
  • pseudo cells 115 are incorporated as physical cells in the cell population 110.
  • pseudo cells 115 are incorporated into the cell population 110 at a concentration of less than 0.1% of cells in the total cell population.
  • pseudo cells 115 are incorporated into the cell population 110 at a concentration of less than 0.5% of cells in the total cell population. In some embodiments, pseudo cells 115 are incorporated into the cell population 110 at a concentration of less than 1% of cells in the total cell population. In some embodiments, pseudo cells 115 are incorporated into the cell population 110 at a concentration of less than 5% of cells in the total cell population. In some embodiments, pseudo cells 115 are incorporated into the cell population 110 at a concentration of less than 10% of cells in the total cell population.
  • pseudo cells 115 are incorporated into the cell population 110 at a concentration of less than 0.2%, less than 0.3%, less than 0.4%, less than 0.5%, less than 0.6%, less than 0.7%, less than 0.8%, less than 0.9%, less than 1%, less than 2%, less than 3%, less than 4%, less than 5%, less than 6%, less than 7%, less than 8%, less than 9%, or less than 10% of cells in the total cell population 110. Further details regarding pseudo cells are described herein. In some embodiments, pseudo cells are not physically incorporated into the cell population 110. As shown in FIG. 1A, pseudo cells 115 can be incorporated as data with the sequence reads from the cell analysis workflow device 120.
  • sequence reads derived from the pseudo cells 115 can be obtained and incorporated along with sequence reads generated by the cell analysis workflow device 120.
  • the sequence reads derived from the pseudo cells 115 are generated via a single-cell analysis (e.g., any of the Tapestri™ Platform, inDrop™ system, Nadia™ instrument, or the Chromium™ instrument) or via a bulk sequencing analysis.
  • pseudo cells 115 undergo bulk sequencing to generate sequence reads.
  • the sequence reads derived from the pseudo cells 115 undergo subsequent processing with the sequence reads generated by the cell analysis workflow device 120, which leads to the improved clustering and subpopulation identification.
  • the sequence reads derived from the pseudo cells 115 can be obtained from a third party (e.g., a third party who processes the pseudo cells 115).
  • the third party may operate a single-cell analysis workflow device (e.g., any of the Tapestri™ Platform, inDrop™ system, Nadia™ instrument, or the Chromium™ instrument) to generate sequence reads of the pseudo cells 115.
  • the third party may perform bulk sequencing on at least the pseudo cells 115 to generate the sequence reads of the pseudo cells 115.
  • a quantity of sequence reads of pseudo cells 115 that are incorporated represents less than 0.01% of total sequence reads generated by the cell analysis workflow device 120 from the cell population 110.
  • a quantity of sequence reads of pseudo cells 115 that are incorporated represents less than 0.1% of total sequence reads generated by the cell analysis workflow device 120 from the cell population 110. In some embodiments, a quantity of sequence reads of pseudo cells 115 that are incorporated represents less than 0.5% of total sequence reads generated by the cell analysis workflow device 120 from the cell population 110. In some embodiments, a quantity of sequence reads of pseudo cells 115 that are incorporated represents less than 1% of total sequence reads generated by the cell analysis workflow device 120 from the cell population 110. In some embodiments, a quantity of sequence reads of pseudo cells 115 that are incorporated represents less than 5% of total sequence reads generated by the cell analysis workflow device 120 from the cell population 110.
  • a quantity of sequence reads of pseudo cells 115 that are incorporated represents less than 10% of total sequence reads generated by the cell analysis workflow device 120 from the cell population 110.
  • a quantity of sequence reads of pseudo cells 115 that are incorporated represents less than 0.2%, less than 0.3%, less than 0.4%, less than 0.5%, less than 0.6%, less than 0.7%, less than 0.8%, less than 0.9%, less than 1%, less than 2%, less than 3%, less than 4%, less than 5%, less than 6%, less than 7%, less than 8%, less than 9%, or less than 10% of total sequence reads generated by the cell analysis workflow device 120 from the cell population 110.
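As a worked example of these proportions, the sketch below computes the largest number of pseudo cell sequence reads that stays strictly below a chosen percentage of the reads generated by the cell analysis workflow device. The read counts and the function name are hypothetical; exact fractions are used so that the strict inequality is not affected by floating-point rounding.

```python
import math
from fractions import Fraction


def max_pseudo_reads(device_reads, max_percent):
    """Largest pseudo-read count strictly below max_percent of device_reads."""
    ceiling = device_reads * Fraction(max_percent) / 100
    # Step just under the ceiling; never return a negative count.
    return max(0, math.ceil(ceiling) - 1)


# Hypothetical run producing 1,000,000 sequence reads.
budget_point_one_percent = max_pseudo_reads(1_000_000, "0.1")
budget_point_zero_one_percent = max_pseudo_reads(1_000_000, "0.01")
```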
  • the computing device 130 is configured to receive the sequence reads from the cell analysis workflow device 120 and to process the sequence reads to identify one or more cellular subpopulations 140.
  • the computing device 130 is communicatively coupled to the cell analysis workflow device 120, and therefore, directly receives the sequence reads from the cell analysis workflow device 120.
  • the computing device 130 filters sequence reads and/or variants (e.g., by thresholding variants and/or by filtering against a ground truth dataset), performs dimensionality reduction of sequencing data of cells and pseudo cells 115, and further performs clustering of cells and pseudo cells 115.
  • FIG.1B details modules within the computing device 130. Specifically, FIG.1B introduces a filter module 132, a dimensionality reduction module 134, a clustering module 136, and an annotation module 138. In various embodiments, the computing device 130 may include additional or fewer modules.
  • the filter module 132 performs filtering of sequence reads of cells that are obtained via the cell analysis workflow device 120 (see FIG.1A). In various embodiments, the filter module 132 may filter sequence reads of cells against threshold values of any of allele frequency, read depth, and/or genotyping quality.
  • the filter module 132 may remove variants that do not meet a threshold value of allele frequency, read depth, and/or genotyping quality.
  • the filter module 132 may filter sequence reads of cells against a ground truth dataset, such as a ground truth dataset derived from bulk sequencing of the cells.
  • the filter module 132 removes variants of the sequence reads of the cells detected via the single-cell analysis (e.g., by the cell analysis workflow device 120) that do not appear in the sequence reads of the cells that underwent bulk sequencing 112 (see FIG.1A). Further details of filtering the sequence reads of the cells are described herein.
  • the dimensionality reduction module 134 performs dimensionality reduction on the sequencing reads of the cells that are obtained via the cell analysis workflow device 120.
  • the dimensionality reduction module 134 further incorporates sequence reads from one or more pseudo cells (e.g., sequence reads of pseudo cells obtained via bulk sequencing). Therefore, the dimensionality reduction module 134 performs dimensionality reduction of sequence reads of cells obtained via the cell analysis workflow device 120 and sequence reads of pseudo cells.
  • the dimensionality reduction module 134 performs dimensionality reduction on filtered sequencing reads of the cells, wherein the filtering has been performed by the filter module 132.
  • the clustering module 136 clusters cells according to the sequence reads and/or detected variants of the cells. In various embodiments, the clustering module 136 clusters cells according to the filtered sequence reads and/or filtered variants of the cells. In various embodiments, the clustering module 136 clusters cells according to the dimensionally reduced filtered sequence reads and/or dimensionally reduced filtered variants of the cells.
  • the clustering module 136 clusters both cells and pseudo cells (e.g., according to the filtered sequence reads and/or filtered variants of the cells and the sequence reads and/or variants of the pseudo cells).
  • the clustering module 136 generates one or more clusters of cells.
  • each cluster of cells represents a subpopulation of cells.
  • the annotation module 138 annotates clusters as subpopulations based on the presence or absence of one or more pseudo cells within the clusters.
  • the pseudo cell may have a known genotype and is known to originate from a source. Therefore, the annotation module 138 may annotate a cluster in which the pseudo cell is located with the source from which the pseudo cell originated.
  • the pseudo cell may be obtained from a subject (e.g., a human subject) and is located in a cluster of cells. Therefore, the annotation module 138 may annotate the cluster of cells as originating from the subject that the pseudo cell was obtained from. In such an example, the annotation module 138 may identify subjects that clusters of cells are obtained from, thereby de-multiplexing cells and enabling subject-specific analysis.
  • FIG.2 is a flow diagram of the clustering analysis, in accordance with an embodiment.
  • a filter is applied to remove low quality reads and/or low quality variants. In this step, variants are filtered using standard variant filters included in Tapestri Insights.
  • at step 220, variants that do not appear in a ground truth dataset are removed.
  • ground truth information can be used for variant filtering, where sequence reads with variants that are not found in the ground truth dataset can be removed.
  • sequence reads with variants which were also found in ground truth dataset are kept whereas sequence reads with variants that do not appear in the ground truth dataset can be removed.
  • pseudo cells are added. Pseudo cells represent cells that have genotypes derived from bulk truth. The addition of pseudo cells improves the clustering and further improves the cluster annotation (e.g., step 260).
  • dimensionality reduction is performed on the cell sequencing data which includes sequencing data of single cells of a cell population (e.g., cell population 110 shown in FIG.1A) as well as sequencing data of pseudo cells that were added at step 230.
  • dimensionality reduction is one of principal component analysis (PCA) or uniform manifold approximation and projection (UMAP) analysis.
  • clusters are generated.
  • generating clusters involves implementing a hierarchical density-based spatial clustering of applications with noise (HDBSCAN) method.
  • the clusters include both cells of the cell population (e.g., cell population 110 shown in FIG.1A) as well as pseudo cells that were added at step 230.
  • the generated clusters are annotated. Specifically, the clusters are annotated using the truth variants of pseudo cells. In various embodiments, clusters can contain a pure cell line, or can contain a mixture of cell lines.
  • pseudo cells are removed from the analysis. In this step, pseudo cells are removed before calculating the ratios of different cell types. This prevents the pseudo cells from altering the subsequent analysis.
  • Filtering of Variants
  • [0078] Embodiments of the method disclosed herein involve filtering sequencing data corresponding to single cells. Filtering of sequencing data can comprise removing low-quality data. Low-quality data can arise, for example, from chemistry-based errors occurring during sample preparation or processing.
  • data are filtered based on thresholds of allele frequency, read depth, genotyping quality, frequency of genotypes per cell, frequency of genotypes present in cells, and/or variant mutation frequency per cell.
  • variants are filtered based on thresholds of allele frequency. For example, if a variant across the sequence reads is observed at a frequency that is below an allele frequency threshold, then the variant is removed.
  • the allele frequency threshold is 25%. In various embodiments, the allele frequency threshold is 20%.
  • the allele frequency threshold is 15%. In various embodiments, the allele frequency threshold is 10%. In various embodiments, the allele frequency threshold is 5%. In various embodiments, the allele frequency threshold is 1%. In various embodiments, the allele frequency threshold is 0.5%. In various embodiments, the allele frequency threshold is 0.1%. In various embodiments, the allele frequency threshold is 0.05%. In various embodiments, the allele frequency threshold is 0.01%.
  • the allele frequency threshold is any of 25%, 20%, 15%, 10%, 5%, 4%, 3%, 2%, 1%, 0.9%, 0.8%, 0.7%, 0.6%, 0.5%, 0.4%, 0.3%, 0.2%, 0.1%, 0.09%, 0.08%, 0.07%, 0.06%, 0.05%, 0.04%, 0.03%, 0.02%, or 0.01%.
  • [0080] In various embodiments, data are filtered based on thresholds of read depth. For example, if a variant across the sequence reads is observed at a depth that is below a read depth threshold, then the variant is removed. Read depth can be expressed in terms of number of reads per cell per amplicon.
  • the read depth threshold is 25 reads per cell per amplicon. In various embodiments, the read depth threshold is 20 reads per cell per amplicon. In various embodiments, the read depth threshold is 20,000 reads per cell. In various embodiments, the read depth threshold is 15 reads per cell per amplicon. In various embodiments, the read depth threshold is 10 reads per cell per amplicon. In various embodiments, the read depth threshold is 5 reads per cell per amplicon. In various embodiments, the read depth threshold is any of 25 reads per cell per amplicon, 20 reads per cell per amplicon, 15 reads per cell per amplicon, 10 reads per cell per amplicon, or 5 reads per cell per amplicon.
  • data are filtered based on thresholds of genotyping quality. For example, if a variant across the sequence reads is observed at a quality that is below a genotyping quality threshold, then the variant is removed.
  • the genotyping quality threshold is a Phred score of 20. In various embodiments, the genotyping quality threshold is a Phred score of 21. In various embodiments, the genotyping quality threshold is a Phred score of 22. In various embodiments, the genotyping quality threshold is a Phred score of 23. In various embodiments, the genotyping quality threshold is a Phred score of 24.
  • the genotyping quality threshold is a Phred score of 25. In various embodiments, the genotyping quality threshold is a Phred score of 26. In various embodiments, the genotyping quality threshold is a Phred score of 27. In various embodiments, the genotyping quality threshold is a Phred score of 28. In various embodiments, the genotyping quality threshold is a Phred score of 29. In various embodiments, the genotyping quality threshold is a Phred score of 30. In various embodiments, the genotyping quality threshold is a Phred score of 31. In various embodiments, the genotyping quality threshold is a Phred score of 32. In various embodiments, the genotyping quality threshold is a Phred score of 33. In various embodiments, the genotyping quality threshold is a Phred score of 34.
  • the genotyping quality threshold is a Phred score of 35. In various embodiments, the genotyping quality threshold is any of a Phred score of 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, or 35.
  • [0082] In various embodiments, data are filtered based on thresholds of frequency of variants genotyped per cell. For example, if a variant across the sequence reads is observed at a frequency that is below a threshold of variants genotyped per cell, then the variant is removed. In various embodiments, the threshold of variants genotyped per cell is 65%. In various embodiments, the threshold of variants genotyped per cell is 60%. In various embodiments, the threshold of variants genotyped per cell is 55%.
  • the threshold of variants genotyped per cell is 50%. In various embodiments, the threshold of variants genotyped per cell is 45%. In various embodiments, the threshold of variants genotyped per cell is 40%. In various embodiments, the threshold of variants genotyped per cell is 65%, 60%, 55%, 50%, 45%, or 40%.
  • [0083] In various embodiments, data are filtered based on thresholds of frequency of genotypes present in cells. For example, if a cell is observed at a frequency that is below a threshold of genotypes present, then the cell is removed (e.g., sequence reads including the cell are removed). In various embodiments, the threshold of frequency of genotypes present in cells is 65%.
  • the threshold of frequency of genotypes present in cells is 60%. In various embodiments, the threshold of frequency of genotypes present in cells is 55%. In various embodiments, the threshold of frequency of genotypes present in cells is 50%. In various embodiments, the threshold of frequency of genotypes present in cells is 45%. In various embodiments, the threshold of frequency of genotypes present in cells is 40%. In various embodiments, the threshold of frequency of genotypes present in cells is 65%, 60%, 55%, 50%, 45%, or 40%.
  • [0084] In various embodiments, data are filtered based on thresholds of variant mutation frequency per cell.
  • the threshold of variant mutation frequency per cell is 5%. In various embodiments, the threshold of variant mutation frequency per cell is 4%. In various embodiments, the threshold of variant mutation frequency per cell is 3%. In various embodiments, the threshold of variant mutation frequency per cell is 2%. In various embodiments, the threshold of variant mutation frequency per cell is 1%. In various embodiments, the threshold of variant mutation frequency per cell is 0.75%. In various embodiments, the threshold of variant mutation frequency per cell is 0.5%. In various embodiments, the threshold of variant mutation frequency per cell is 0.25%.
  • the threshold of variant mutation frequency per cell is 5%, 4%, 3%, 2%, 1%, 0.75%, 0.5%, or 0.25%.
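The threshold-based filtering described above can be illustrated with a minimal sketch. The `VariantCall` record, its field names, and the helper functions are hypothetical and are not taken from any particular software package. The default cut-offs (20% allele frequency, 10 reads per cell per amplicon, Phred score of 30) are simply three of the values listed above, chosen here for illustration.

```python
from dataclasses import dataclass


@dataclass
class VariantCall:
    """One variant observed in one cell (hypothetical record layout)."""
    cell_id: str
    variant_id: str
    allele_frequency: float  # percent of reads supporting the variant (0-100)
    read_depth: int          # reads per cell per amplicon
    genotype_quality: int    # Phred-scaled genotyping quality


def passes_thresholds(call, min_af=20.0, min_depth=10, min_gq=30):
    """Return True if the call meets every threshold (assumed cut-offs)."""
    return (call.allele_frequency >= min_af
            and call.read_depth >= min_depth
            and call.genotype_quality >= min_gq)


def filter_calls(calls, **thresholds):
    """Remove calls that fall below any of the thresholds."""
    return [c for c in calls if passes_thresholds(c, **thresholds)]


calls = [
    VariantCall("cell1", "chr1:100A>T", 48.0, 35, 60),  # passes all filters
    VariantCall("cell1", "chr2:200G>C", 4.0, 40, 55),   # allele frequency too low
    VariantCall("cell2", "chr1:100A>T", 52.0, 6, 45),   # read depth too low
    VariantCall("cell2", "chr3:300C>T", 30.0, 22, 18),  # genotyping quality too low
]
kept = filter_calls(calls)
```

Different thresholds can be supplied as keyword arguments, for example `filter_calls(calls, min_af=5.0, min_gq=20)`.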
  • sequencing data of single cells are filtered against a ground truth dataset.
  • a ground truth dataset can include sequencing reads informative for determining a genotype of the single cells.
  • the ground truth dataset includes sequencing reads that were obtained via a sequencing method, such as a bulk sequencing method. For example, referring again to FIG.1A, a portion of the cell population 110 may undergo bulk sequencing 112, thereby generating sequencing reads of the ground truth dataset.
  • the ground truth dataset may include one or more variants that are present in the cells. Additionally, the ground truth dataset may include less low-quality data arising from chemistry-based errors in comparison to the sequencing data acquired from the cell analysis workflow device 120. Therefore, the ground truth dataset may include more accurate, higher quality sequencing reads, but does not have the single-cell resolution of the sequencing reads obtained via the cell analysis workflow device 120.
  • [0086] In various embodiments, sequencing data of the single cells (e.g., obtained via the cell analysis workflow device 120) are filtered against the ground truth dataset to remove sequencing reads and/or variants identified through the single cell sequencing that do not appear in the ground truth dataset.
  • sequencing data of the single cells are filtered against the ground truth dataset on a per-sequence read basis. For example, for each sequence read, a variant in the sequence read is identified as present. In various embodiments, presence of a variant in the sequence read can be conducted by aligning the sequence read to a reference genome, as is described in further detail herein. For each sequence read, the variant in the sequence read is queried against the ground truth dataset to determine whether the ground truth dataset includes the variant at the position of the reference genome.
  • sequencing data of the single cells are filtered against the ground truth dataset on a per-variant basis.
  • sequence reads of the single cells are aligned to a reference genome and collapsed.
  • a variant is identified as being present in at least one sequence read of the single cells.
  • the variant is queried against the ground truth dataset to determine whether the ground truth dataset also includes the variant at the position of the genome. If the variant was not identified in the ground truth dataset, the variant is removed from further consideration.
  • by filtering the sequencing data of the single cells (e.g., obtained via the cell analysis workflow device 120) against the ground truth dataset, at least 2 variants are removed. In various embodiments, at least 5, at least 10, at least 50, at least 100, at least 500, or at least 1000 variants are removed.
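The per-variant filtering against a ground truth dataset can be sketched as a set intersection. In this minimal illustration, each variant is represented as a `(chromosome, position, reference allele, alternate allele)` tuple. That representation, the function name, and the example data are assumptions made for the sketch.

```python
def filter_against_ground_truth(cell_variants, ground_truth):
    """Keep only single-cell variants that also appear in the ground truth.

    cell_variants: dict mapping cell id -> set of variant tuples
    ground_truth:  set of variant tuples (e.g., from bulk sequencing)
    Returns (filtered dict, set of variants that were removed).
    """
    observed = set().union(*cell_variants.values())
    removed = observed - ground_truth
    filtered = {cell: variants & ground_truth
                for cell, variants in cell_variants.items()}
    return filtered, removed


# Hypothetical example data.
bulk_variants = {("chr1", 100, "A", "T"), ("chr2", 200, "G", "C")}
single_cell_variants = {
    "cell1": {("chr1", 100, "A", "T"), ("chr5", 555, "C", "G")},
    "cell2": {("chr2", 200, "G", "C")},
}
filtered, removed = filter_against_ground_truth(single_cell_variants, bulk_variants)
```

Here the variant at `chr5:555` is seen in one cell but not in the bulk data, so it is the only variant removed from further consideration.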
  • Pseudo Cells
  • [0090] In various embodiments, methods disclosed herein involve incorporation of pseudo cells.
  • “Pseudo cell” (or “true cell,” used interchangeably) refers to a cell with a known genotype.
  • a “pseudo cell” can refer to a cell with one or more known variants.
  • the known genotype (e.g., one or more known variants) of the pseudo cell is determined through bulk sequencing techniques.
  • pseudo cells are removed from the one or more clusters prior to subsequent analysis.
  • [0091] In some embodiments, one or more pseudo cells are introduced as a physical cell in a mixed population of cells.
  • one or more pseudo cells are introduced as sequencing data into a sequencing analysis. In various embodiments, the one or more pseudo cells are removed from the clusters prior to subsequent analysis.
  • Pseudo cells can be any suitable cell type comprising distinguishing variants to distinguish pseudo cells from other cells in the analysis. In some embodiments, pseudo cells have at least 4 variants distinguishing them from other cells in the analysis. In some embodiments, pseudo cells have at least 5 variants distinguishing them from other cells in the analysis. In some embodiments, pseudo cells have at least 6 variants distinguishing them from other cells in the analysis. In some embodiments, pseudo cells have at least 7 variants distinguishing them from other cells in the analysis.
  • pseudo cells have at least 8 variants distinguishing them from other cells in the analysis. In some embodiments, pseudo cells have at least 9 variants distinguishing them from other cells in the analysis. In some embodiments, pseudo cells have at least 10 variants distinguishing them from other cells in the analysis. In some embodiments, pseudo cells have at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, or at least 10 variants distinguishing them from other cells in the analysis.
  • pseudo cells are immune cells such as peripheral blood mononuclear cells (e.g., including any of lymphocytes such as T-cells, B-cells, monocytes, NK-cells or monocytes such as erythrocytes, platelets, neutrophils, basophils, eosinophils).
  • pseudo cells are lymphocytes.
  • pseudo cells are T-cells.
  • pseudo cells are B-cells.
  • pseudo cells are NK-cells.
  • pseudo cells are attached cells (e.g., epithelial cells, connective tissue cells, endothelial cells, muscle cells, or neural cells). In various embodiments, pseudo cells are epithelial cells. In various embodiments, pseudo cells are connective tissue cells. In various embodiments, pseudo cells are endothelial cells. In various embodiments, pseudo cells are muscle cells. In various embodiments, pseudo cells are nerve cells such as neurons, glia, etc.
  • [0094] In various embodiments, pseudo cells are cells associated with a disease state (e.g., cancer cells, infected cells, activated immune cells, etc.). In some embodiments, pseudo cells are cancer cells. In some embodiments, pseudo cells are infected cells.
  • pseudo cells are activated immune cells.
  • pseudo cells are obtained from a subject.
  • pseudo cells are isolated from a test sample obtained from a subject.
  • pseudo cells are obtained from a test sample in which a cell population (e.g., cell population 110) is also obtained.
  • a test sample can be obtained from a subject.
  • the test sample can be processed to separately obtain a first portion comprising pseudo cells and a second portion comprising the cell population (e.g., cell population 110).
  • the pseudo cells can be analyzed (e.g., through bulk sequencing) whereas the cell population can undergo single cell analysis.
  • the pseudo cells are allogeneic with respect to the cell population that undergoes single cell analysis.
  • the cell population can be pooled with other cell populations from other subjects prior to single-cell analysis.
  • the pseudo cells and/or the analysis of the pseudo cells can be used to later de-multiplex the single-cell analysis of the mixed cell population.
  • pseudo cells are allogeneic to other cells in the analysis.
  • pseudo cells are heterologous to other cells in the analysis.
  • pseudo cells are allogeneic to some cells in the analysis and heterologous to other cells in the analysis.
  • pseudo cells can be established cell lines.
  • dimensionality reduction is performed on sequencing data (e.g., sequencing data of single cells and/or sequencing data of incorporated pseudo cells).
  • Dimensionality reduction is a process for translating a data set having many dimensions to a data set having fewer dimensions.
  • data points that are closely spaced in the high-dimensionality data set can be closely spaced in the low-dimensionality data set.
  • data points that are widely spaced in the high-dimensionality data set can be widely spaced in the low-dimensionality data set.
  • the dimensionality reduction analysis can be performed on a variant (e.g., mutations of a cell including polymorphisms, single nucleotide polymorphisms (SNPs), single nucleotide variants (SNVs), insertions, deletions, knock-ins, knock-outs, copy number variations (CNVs), duplications, translocations, and loss of heterozygosity (LOH)) that is present in the sequencing data of single cells and/or pseudo cells.
  • the dimensionality reduction analysis can be performed on two or more variants (e.g., mutations of a cell including polymorphisms, single nucleotide polymorphisms (SNPs), single nucleotide variants (SNVs), insertions, deletions, knock-ins, knock-outs, copy number variations (CNVs), duplications, translocations, and loss of heterozygosity (LOH)).
  • the dimensionality reduction analysis can be performed on the combination of single nucleotide variants (SNVs) and copy number variations (CNVs).
  • the dimensionality reduction analysis can be performed on the combination of insertions and deletions.
  • the dimensionally reduced dataset can retain the information relevant for the combination of features while eliminating redundancy/correlation across other features.
  • Examples of dimensionality reduction analysis include principal component analysis (PCA), kernel PCA, graph-based kernel PCA, linear discriminant analysis, generalized discriminant analysis, autoencoder, non-negative matrix factorization, t-distributed stochastic neighbor embedding (t-SNE), uniform manifold approximation and projection (UMAP), and dens-UMAP.
  • the dimensionality reduction is principal component analysis (PCA).
  • the dimensionality reduction is performed via a non-linear algorithm.
  • the dimensionality reduction is Uniform Manifold Approximation and Projection (UMAP).
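As an illustration of the linear case, the sketch below implements PCA with only the Python standard library, using power iteration with deflation on the covariance matrix. The genotype encoding (0 = wild type, 1 = heterozygous, 2 = homozygous), the example matrix, and the function name are assumptions. A practical implementation would more likely rely on an established numerical library, and a non-linear method such as UMAP is not shown here.

```python
import math
import random


def pca_project(matrix, n_components=2, iterations=200, seed=0):
    """Project the rows of `matrix` (cells x variants) onto the top
    principal components, found by power iteration with deflation."""
    n, d = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in matrix]
    # Sample covariance matrix of the variant columns.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    rng = random.Random(seed)
    components = []
    for _ in range(n_components):
        v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        start_norm = math.sqrt(sum(x * x for x in v))
        v = [x / start_norm for x in v]
        for _ in range(iterations):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break  # no variance left to explain
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * cov[i][j] * v[j]
                         for i in range(d) for j in range(d))
        components.append(v)
        # Deflate: remove the variance explained by this component.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return [[sum(r[j] * c[j] for j in range(d)) for c in components]
            for r in centered]


# Hypothetical genotype matrix: rows are cells, columns are variants.
genotypes = [
    [1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 0, 0],  # cells resembling one clone
    [0, 0, 2, 1], [0, 0, 2, 1], [0, 0, 1, 1],  # cells resembling another clone
]
projected = pca_project(genotypes, n_components=2)
```

In this toy example the first principal component separates the two groups of cells, which is the property the subsequent clustering step relies on.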
  • sequencing data corresponding to single cells are clustered.
  • Clustering is a process of sorting data points into groups within a feature space.
  • Clustering methods can be supervised, unsupervised, or combined supervised and unsupervised.
  • Clustering algorithms often rely on machine learning to sort data points into discrete groups, or “clusters.”
  • the method described herein detects cellular subpopulations by identifying clusters of data points corresponding to single cells.
  • the clustering is combined supervised and unsupervised.
  • unsupervised cluster analysis examples include hierarchical clustering, k-means clustering, clustering using mixture models, density-based spatial clustering of applications with noise (DBSCAN), ordering points to identify the clustering structure (OPTICS), or combinations thereof.
  • the clustering algorithm is the hierarchical density-based spatial clustering of applications with noise (HDBSCAN) method.
  • the step of clustering involves clustering cells according to one or more variants in the sequencing data of the cells.
  • the sequencing data of the cells may have previously undergone filtering to remove low quality data, as is described in further detail herein.
  • the clustering of the cells involves clustering according to one or more variants in the sequencing data that remains after the filtering of low quality data.
  • the step of clustering involves clustering cells and additionally pseudo cells according to one or more variants in the sequencing data of the cells and pseudo cells.
  • the pseudo cells can be grouped into one or more of the clusters.
  • the step of clustering involves both 1) clustering cells according to one or more variants in sequencing data that has previously undergone filtering to remove low quality data and 2) clustering pseudo cells according to one or more variants in the sequencing data of the pseudo cells.
  • clusters of cells are generated according to detected variants for two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen, fourteen, fifteen, sixteen, seventeen, eighteen, nineteen, twenty, twenty five, thirty, forty, fifty, sixty, seventy, eighty, ninety, or one hundred genes or more.
  • clusters of cells are generated according to two or more detected variants for one or more genes.
  • clusters of cells are generated according to two or more detected variants for two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen, fourteen, fifteen, sixteen, seventeen, eighteen, nineteen, twenty, twenty five, thirty, forty, fifty, sixty, seventy, eighty, ninety, or one hundred genes or more and detected structural variants for two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen, fourteen, fifteen, sixteen, seventeen, eighteen, nineteen, twenty, twenty five, thirty, forty, fifty, sixty, seventy, eighty, ninety, or one hundred genes or more.
  • methods disclosed herein for identifying subpopulations of cells include first performing a dimensionality reduction on sequencing data from cells (e.g., single cells and/or pseudo cells) followed by performing clustering of cells and/or pseudo cells according to one or more variants present in the dimensionally reduced sequencing data set.
  • the clustering of the cells involves performing unsupervised clustering of the cells according to one or more variants present in the dimensionally reduced sequencing data set.
  • the clustering of the cells involves performing unsupervised clustering of the cells according to information representing dimensionally reduced values of one or more variants present in the dimensionally reduced sequencing data set.
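Unsupervised density-based clustering of dimensionally reduced data can be sketched as follows. This example uses plain DBSCAN, one of the unsupervised methods listed above, as a simpler stand-in for HDBSCAN. It is not the HDBSCAN method itself. The function names, the `eps` and `min_samples` values, and the example coordinates are assumptions.

```python
import math


def region_query(points, idx, eps):
    """Indices of all points within distance eps of points[idx]."""
    return [j for j, q in enumerate(points)
            if math.dist(points[idx], q) <= eps]


def dbscan(points, eps, min_samples):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    unvisited, noise = None, -1
    labels = [unvisited] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not unvisited:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


# Hypothetical 2-D embedding of cells (and pseudo cells) after
# dimensionality reduction.
embedding = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),  # first dense group
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),  # second dense group
    (10.0, 0.0),                                     # isolated point
]
labels = dbscan(embedding, eps=0.5, min_samples=3)
```

With these inputs the two dense groups receive labels 0 and 1, and the isolated point is labeled as noise (-1).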
  • the inclusion of pseudo cells enables detection of one or more subpopulations with a limit of detection that is lower than the limit of detection of an analysis in which no pseudo cells are incorporated.
  • the inclusion of pseudo cells in the analysis enables detection of one or more subpopulations at a limit of detection below 1%, below 0.5%, below 0.1%, or below 0.01% of the total population.
  • the one or more subpopulations are detected with a limit of detection below 1% of the total population.
  • the one or more subpopulations are detected with a limit of detection below 0.5% of the total population.
  • the one or more subpopulations are detected with a limit of detection below 0.1% of the total population.
  • the one or more subpopulations are detected with a limit of detection below 0.01% of the total population.
  • the inclusion of pseudo cells as a ground truth allows annotation of identified clusters as specific cellular subpopulations.
  • the clusters of cells are annotated based on presence of one or more pseudo cells.
  • annotating the clusters of cells based on presence of one or more pseudo cells comprises annotating a cluster comprising a pseudo cell as a cell population with one or more known variants of the pseudo cell.
  • annotating the clusters of cells based on presence of one or more pseudo cells comprises annotating a cluster that is devoid of a pseudo cell as a mixed cell population.
  • the original cell population (e.g., cell population 110 shown in FIG.1A) includes a mixture of cells obtained from various subjects.
  • different cells from the different subjects can be grouped in different clusters.
  • a first cluster can identify a subpopulation of cells that were obtained from a first subject.
  • a second cluster can identify a subpopulation of cells from a second subject, and so on.
  • One or more pseudo cells of known genotypes can be incorporated in the pipeline, and therefore, they are grouped into the clusters. Such pseudo cells may be obtained from the subjects and therefore, can serve as positive labels for annotating the clusters.
  • cells are obtained from 3 subjects and mixed together to generate a cell population (e.g., cell population 110 shown in FIG.1A).
  • the cells can be analyzed using a single-cell analysis workflow device to generate sequencing data of the cells.
  • Pseudo cells also obtained from the 3 subjects can be incorporated, and the sequencing data of the cells and pseudo cells can be filtered, dimensionally reduced, and clustered.
  • the presence of the pseudo cells enables grouping of cells into at least 3 distinct clusters. Therefore, given that a pseudo cell is known to originate from one of the 3 subjects, cells in the cluster in which the pseudo cell is located can be annotated and identified as also originating from that subject.
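The annotation step can be sketched as a lookup from each cluster to the known source of any pseudo cell it contains. The function and identifier names below are hypothetical. Clusters devoid of pseudo cells are labeled as mixed, as described above. Labeling a cluster that contains pseudo cells from more than one source as mixed is an additional assumption of this sketch.

```python
from collections import defaultdict


def annotate_clusters(labels, pseudo_cell_sources):
    """Annotate clusters using the pseudo cells they contain.

    labels: dict mapping cell id (including pseudo cells) -> cluster label
    pseudo_cell_sources: dict mapping pseudo cell id -> known source
    Returns a dict mapping cluster label -> annotation.
    """
    sources_by_cluster = defaultdict(set)
    for pseudo_id, source in pseudo_cell_sources.items():
        sources_by_cluster[labels[pseudo_id]].add(source)

    annotations = {}
    for cluster in set(labels.values()):
        sources = sources_by_cluster.get(cluster, set())
        if len(sources) == 1:
            annotations[cluster] = next(iter(sources))
        else:
            # No pseudo cell, or pseudo cells from several sources (assumed).
            annotations[cluster] = "mixed"
    return annotations


# Hypothetical clustering result for cells pooled from several subjects.
cluster_labels = {
    "cell1": 0, "cell2": 0, "pseudo_A": 0,
    "cell3": 1, "pseudo_B": 1,
    "cell4": 2,
}
pseudo_sources = {"pseudo_A": "subject_A", "pseudo_B": "subject_B"}
annotations = annotate_clusters(cluster_labels, pseudo_sources)
```

Cells in cluster 0 can then be attributed to `subject_A` and cells in cluster 1 to `subject_B`, which is the de-multiplexing use described above.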
  • pseudo cells are removed from the clusters prior to subsequent analysis.
  • removal of pseudo cells from the clusters refers to the removal of the corresponding sequencing data of the pseudo cells from a library of sequencing reads.
  • subsequent analysis can involve quantifying proportions of subpopulations of cells in the full cell population. Therefore, removing pseudo cells enables an accurate quantification of the proportions of subpopulations of cells that do not include the pseudo cells.
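The effect of removing pseudo cells before quantification can be shown with a short sketch. The function name and the example labels are hypothetical.

```python
from collections import Counter


def subpopulation_proportions(labels, pseudo_ids):
    """Fraction of real cells in each cluster, ignoring pseudo cells.

    labels: dict mapping cell id -> cluster label
    pseudo_ids: set of ids that belong to pseudo cells
    """
    real = {cell: cluster for cell, cluster in labels.items()
            if cell not in pseudo_ids}
    counts = Counter(real.values())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cluster: n / total for cluster, n in counts.items()}


# Hypothetical clustering result: three real cells and one pseudo cell in
# cluster 0, one real cell and one pseudo cell in cluster 1.
cluster_labels = {
    "cell1": 0, "cell2": 0, "cell3": 0, "pseudo_A": 0,
    "cell4": 1, "pseudo_B": 1,
}
proportions = subpopulation_proportions(cluster_labels, {"pseudo_A", "pseudo_B"})
```

With the pseudo cells excluded, the clusters account for 75% and 25% of the real cells. Leaving them in would have given 4/6 and 2/6 instead, overstating the smaller subpopulation.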
  • Methods for Sequencing and Read Alignment
  • [00109] Embodiments of the invention disclosed herein involve the sequencing of nucleic acids and the alignment of the sequence reads to a reference genome.
  • sequenced and aligned sequence reads can be analyzed by the dimensionality reduction and clustering device 130 and more specifically, can be analyzed by the base identification module 210 (see FIG.2) to identify bases of interest.
  • Sequence reads can be achieved with commercially available next generation sequencing (NGS) platforms, including platforms that perform any of sequencing by synthesis, sequencing by ligation, pyrosequencing, using reversible terminator chemistry, using phospholinked fluorescent nucleotides, or real-time sequencing.
  • amplified nucleic acids may be sequenced on an Illumina MiSeq platform.
  • libraries of NGS fragments are cloned, in-situ amplified by capture of one matrix molecule using granules coated with oligonucleotides complementary to adapters.
  • Each granule containing a matrix of the same type is placed in a microbubble of the “water in oil” type and the matrix is clonally amplified using a method called emulsion PCR.
  • the emulsion is destroyed and the granules are stacked in separate wells of a titration picoplate acting as a flow cell during sequencing reactions.
  • each of the four dNTP reagents into the flow cell occurs in the presence of sequencing enzymes and a luminescent reporter, such as luciferase.
  • the resulting ATP produces a flash of luminescence within the well, which is recorded using a CCD camera. It is possible to achieve a read length of more than or equal to 400 bases, and it is possible to obtain 10^6 reads of the sequence, resulting in up to 500 million base pairs (Mb) of sequence.
  • An anchor molecule is used as a PCR primer, but due to the length of the matrix and its proximity to other nearby anchor oligonucleotides, elongation by PCR leads to the formation of a “vault” of the molecule with its hybridization with the neighboring anchor oligonucleotide and the formation of a bridging structure on the surface of the flow cell.
  • These DNA loops are denatured and cleaved.
  • Straight chains are then sequenced using reversibly stained terminators. The nucleotides included in the sequence are determined by detecting fluorescence after inclusion, where each fluorescent and blocking agent is removed prior to the next dNTP addition cycle.
  • Sequencing of nucleic acid molecules using SOLiD technology includes clonal amplification of the library of NGS fragments using emulsion PCR.
  • test probes have 16 possible combinations of two bases at the 3' end of each probe and one of four fluorescent dyes at the 5' end. The color of the fluorescent dye and, thus, the identity of each probe, corresponds to a certain color space coding scheme.
  • HeliScope from Helicos BioSciences is used. Sequencing is achieved by the addition of polymerase and serial additions of fluorescently-labeled dNTP reagents. Switching on leads to the appearance of a fluorescent signal corresponding to dNTP, and the specified signal is captured by the CCD camera before each dNTP addition cycle. The reading length of the sequence varies from 25-50 nucleotides with a total yield exceeding 1 billion nucleotide pairs per analytical work cycle. Additional details for performing sequencing using HeliScope are found in Voelkerding et al., Clinical Chem., 55: 641-658, 2009; MacLean et al., Nature Rev.
  • a Roche 454 sequencing system is used. 454 sequencing involves two steps. In the first step, DNA is cut into fragments of approximately 300-800 base pairs, and these fragments have blunt ends. Oligonucleotide adapters are then ligated to the ends of the fragments. The adapters serve as primers for amplification and sequencing of fragments.
  • Fragments can be attached to DNA-capture beads, for example, streptavidin-coated beads, using, for example, an adapter that contains a 5'-biotin tag. Fragments attached to the granules are amplified by PCR within the droplets of an oil-water emulsion. The result is multiple copies of clonally amplified DNA fragments on each bead. In the second step, the granules are captured in wells (several picoliters in volume). Pyrosequencing is carried out on each DNA fragment in parallel. Adding one or more nucleotides leads to the generation of a light signal, which is recorded on the CCD camera of the sequencing instrument. The signal intensity is proportional to the number of nucleotides included.
  • Luciferase uses ATP to convert luciferin to oxyluciferin, and as a result of this reaction, light is generated that is detected and analyzed. Additional details for performing 454 sequencing are found in Margulies et al. (2005) Nature 437: 376-380, which is hereby incorporated by reference in its entirety.
  • Ion Torrent technology is a DNA sequencing method based on the detection of hydrogen ions that are released during DNA polymerization.
  • the microwell contains a fragment of a library of NGS fragments to be sequenced. Under the microwell layer is the hypersensitive ion sensor ISFET. All layers are contained within a semiconductor CMOS chip, similar to the chip used in the electronics industry.
  • sequencing reads obtained from the NGS methods can be filtered by quality and grouped by barcode sequence using any algorithms known in the art, e.g., Python script barcodeCleanup.py.
  • a given sequencing read may be discarded if more than about 20% of its bases have a quality score (Q-score) less than Q20, indicating a base call accuracy of less than about 99%.
  • a given sequencing read may be discarded if more than about 5%, about 10%, about 15%, about 20%, about 25%, or about 30% of its bases have a Q-score less than Q10, Q20, Q30, Q40, Q50, Q60, or more, indicating a base call accuracy of less than about 90%, less than about 99%, less than about 99.9%, less than about 99.99%, less than about 99.999%, less than about 99.9999%, or more, respectively.
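The quality rule above can be illustrated with a short sketch. It relies on the standard Phred relationship, accuracy = 1 - 10^(-Q/10), which reproduces the Q-score and accuracy pairs listed above. The function names and the handling of empty reads are assumptions for illustration; this is not the barcodeCleanup.py script.

```python
def phred_accuracy(q):
    """Base-call accuracy implied by a Phred quality score Q."""
    return 1.0 - 10 ** (-q / 10.0)

def keep_read(qualities, min_q=20, max_low_fraction=0.20):
    """Return False when more than max_low_fraction of bases fall below min_q."""
    if not qualities:
        return False  # assumption: an empty read is discarded
    low = sum(1 for q in qualities if q < min_q)
    return low / len(qualities) <= max_low_fraction

good = [30] * 8 + [10] * 2   # 20% of bases below Q20, so the read is kept
bad = [30] * 7 + [10] * 3    # 30% of bases below Q20, so the read is discarded
```

Both thresholds are parameters, so the alternative cutoffs listed above (for example 10% of bases below Q30) can be applied with the same function.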
  • all sequencing reads associated with a barcode containing less than 50 reads may be discarded to ensure that all barcode groups, representing single cells, contain a sufficient number of high-quality reads.
  • all sequencing reads associated with a barcode containing less than 30, less than 40, less than 50, less than 60, less than 70, less than 80, less than 90, less than 100 or more reads may be discarded to ensure the quality of the barcode groups representing single cells.
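A minimal sketch of the barcode-level filter described above, assuming reads arrive as (barcode, sequence) pairs; the function name and the input layout are hypothetical.

```python
from collections import defaultdict

def group_by_barcode(reads, min_reads=50):
    """reads: iterable of (barcode, sequence) pairs.
    Returns barcode -> list of sequences, keeping only barcodes
    with at least min_reads reads."""
    groups = defaultdict(list)
    for barcode, sequence in reads:
        groups[barcode].append(sequence)
    return {bc: seqs for bc, seqs in groups.items() if len(seqs) >= min_reads}

# Toy input: one barcode with 60 reads, one with only 12.
reads = [("AAAC", "ACGT")] * 60 + [("GGTT", "TTGA")] * 12
cells = group_by_barcode(reads)
```

With the default cutoff of 50, only the barcode with 60 reads is retained as a putative single cell.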
  • Sequence reads with common barcode sequences e.g., meaning that sequence reads originated from the same cell
  • the alignment position information may indicate a beginning position and an end position of a region in the reference genome that corresponds to a beginning nucleotide base and end nucleotide base of a given sequence read.
  • a region in the reference genome may be associated with a target gene or a segment of a gene.
  • Example aligner algorithms include BWA, Bowtie, Spliced Transcripts Alignment to a Reference (STAR), Tophat, or HISAT2. Further details for aligning sequence reads to reference sequences are described in US Application No. 16/279,315, which is hereby incorporated by reference in its entirety.
  • an output file having SAM (sequence alignment map) format or BAM (binary alignment map) format may be generated and output for subsequent analysis.
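As an illustration of the beginning and end positions mentioned above, the following sketch derives the reference span of one aligned read from a SAM record. It uses the POS and CIGAR fields defined by the SAM format and assumes a mapped read; the function name and the example record are made up for illustration.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference

def reference_span(sam_line):
    """Return (reference name, 1-based start, 1-based inclusive end)."""
    fields = sam_line.rstrip("\n").split("\t")
    rname, pos, cigar = fields[2], int(fields[3]), fields[5]
    ref_len = sum(int(n) for n, op in CIGAR_OP.findall(cigar)
                  if op in REF_CONSUMING)
    return rname, pos, pos + ref_len - 1

# Hypothetical record: 5 soft-clipped bases, 40 matches, a 2-base deletion, 10 matches.
line = ("read1\t0\tchr5\t100\t60\t5S40M2D10M\t*\t0\t0\t"
        + "A" * 55 + "\t" + "I" * 55)
span = reference_span(line)
```

Soft clips and insertions do not consume reference bases, so the read above covers 52 reference positions starting at position 100.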
  • Computer-Readable Media [00120] Also disclosed herein, in various embodiments, is a non-transitory computer readable medium for performing the methods disclosed herein.
  • FIG.3 depicts an example computing device for implementing system and methods described in reference to FIGs.1A, 1B, 2.
  • the example computing device 300 serves as the computing device 130 shown in FIG.1A for performing the methods described in FIG.1B in relation to the filter module 132, dimensionality reduction module 134, clustering module 136, and annotation module 138.
  • Examples of a computing device can include a personal computer, desktop computer, laptop, server computer, a computing node within a cluster, message processors, hand-held devices, multi-processor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, mobile telephones, PDAs, tablets, pagers, routers, switches, and the like.
  • the computing device 300 includes at least one processor 302 coupled to a chipset 304.
  • the chipset 304 includes a memory controller hub 320 and an input/output (I/O) controller hub 322.
  • a memory 306 and a graphics adapter 312 are coupled to the memory controller hub 320, and a display 318 is coupled to the graphics adapter 312.
  • a storage device 308, an input interface 314, and network adapter 316 are coupled to the I/O controller hub 322.
  • Other embodiments of the computing device 300 have different architectures.
  • the storage device 308 is a non-transitory computer-readable storage medium such as a hard drive, compact disk read-only memory (CD-ROM), DVD, or a solid-state memory device.
  • the memory 306 holds instructions and data used by the processor 302.
  • the input interface 314 is a touch-screen interface, a mouse, track ball, or other type of input interface, a keyboard, or some combination thereof, and is used to input data into the computing device 300.
  • the computing device 300 may be configured to receive input (e.g., commands) from the input interface 314 via gestures from the user.
  • the graphics adapter 312 displays images and other information on the display 318.
  • the network adapter 316 couples the computing device 300 to one or more computer networks.
  • the computing device 300 is adapted to execute computer program modules for providing functionality described herein.
  • the term “module” refers to computer program logic used to provide the specified functionality.
  • a module can be implemented in hardware, firmware, and/or software.
  • program modules are stored on the storage device 308, loaded into the memory 306, and executed by the processor 302.
  • the types of computing devices 300 can vary from the embodiments described herein.
  • the computing device 300 can lack some of the components described above, such as graphics adapters 312, input interface 314, and displays 318.
  • a computing device 300 can include a processor 302 for executing instructions stored on a memory 306.
  • the methods disclosed herein can be implemented in hardware or software, or a combination of both.
  • a non-transitory machine-readable storage medium such as one described above, is provided, the medium comprising a data storage material encoded with machine readable data which, when using a machine programmed with instructions for using said data, is capable of executing instructions for performing the methods disclosed herein.
  • Embodiments of the methods described above can be implemented in computer programs executing on programmable computers, comprising a processor, a data storage system (including volatile and non-volatile memory and/or storage elements), a graphics adapter, an input interface, a network adapter, at least one input device, and at least one output device.
  • a display is coupled to the graphics adapter.
  • Program code is applied to input data to perform the functions described above and generate output information.
  • the output information is applied to one or more output devices, in known fashion.
  • the computer can be, for example, a personal computer, microcomputer, or workstation of conventional design.
  • Each program can be implemented in a high-level procedural or object-oriented programming language to communicate with a computer system. However, the programs can be implemented in assembly or machine language, if desired. In any case, the language can be a compiled or interpreted language.
  • Each such computer program is preferably stored on a storage medium or device (e.g., ROM or magnetic diskette) readable by a general or special purpose programmable computer, for configuring and operating the computer when the storage medium or device is read by the computer to perform the procedures described herein.
  • the system can also be implemented as a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner to perform the functions described herein.
  • the signature patterns and databases thereof can be provided in a variety of media to facilitate their use. “Media” refers to a manufacture that contains the signature pattern information of the present invention.
  • the databases of the present invention can be recorded on computer readable media, e.g. any medium that can be read and accessed directly by a computer. “Recorded” refers to a process for storing information on computer readable medium, using any such methods as known in the art. Any convenient data storage structure can be chosen, based on the means used to access the stored information.
  • FIG.4 depicts results of the first caller pipeline that did not use pseudo cells and furthermore did not filter low quality variants using a ground truth dataset.
  • the clustering analysis resulted in two clusters labeled as “A” and “B.” This method without using the ground truth dataset for filtering and without incorporating the pseudo cells yielded incorrect results as it failed to identify the 3 different cell populations.
  • FIG.5 depicts results of the second calling pipeline that both incorporated pseudo cells and filtered low quality variants using a ground truth dataset (e.g., ground truth dataset from bulk sequencing). Specifically, the ground truth dataset was used to filter out variants (e.g., filter out variants at step 220 in FIG.2). Pseudo cells were incorporated into the analysis and improved the clustering of the different cell populations.
  • Example 2: Clustering Analysis Using Pseudo Cell Truth Datasets Distinguishes Cell Subpopulations with Differing Genotypes
  • a mixture of different cells (HG001, HG002, and HG005) was processed using the Tapestri single cell platform and genomic DNA sequencing data was generated from the mixture of cells.
  • the implemented calling pipeline 1) incorporated pseudo cells and 2) filtered low quality variants using a ground truth dataset.
  • Sequencing data derived from single cells were processed using standard Tapestri variant calling pipeline and analyzed.
  • This resulted in a multi-sample VCF where each sample was a cell in the experiment.
  • the method began by filtering the VCF file to remove low quality data based on thresholds of allele frequency, read depth, genotyping quality etc. These filters aided in removing a significant amount of noise caused by chemistry-based errors.
  • the result was a matrix of cells vs variants with the values being either the genotypes or allele frequencies for each cell/variant combination.
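The filtering and matrix construction steps above can be sketched in plain Python. The threshold values, the record layout, and the choice to encode failed or missing calls as 0.0 are all assumptions made for illustration; the source does not state the thresholds used.

```python
def build_matrix(calls, min_dp=10, min_gq=30, min_af=0.2):
    """calls: list of dicts with keys cell, variant, af, dp, gq (hypothetical layout).
    Returns (cells, variants, matrix) where matrix[i][j] is the allele
    frequency of variant j in cell i, or 0.0 for failed or missing calls."""
    # Keep only calls passing the read depth, genotype quality, and AF thresholds.
    passed = [c for c in calls
              if c["dp"] >= min_dp and c["gq"] >= min_gq and c["af"] >= min_af]
    cells = sorted({c["cell"] for c in calls})
    variants = sorted({c["variant"] for c in passed})
    row = {cell: i for i, cell in enumerate(cells)}
    col = {v: j for j, v in enumerate(variants)}
    matrix = [[0.0] * len(variants) for _ in cells]
    for c in passed:
        matrix[row[c["cell"]]][col[c["variant"]]] = c["af"]
    return cells, variants, matrix

calls = [
    {"cell": "cell1", "variant": "chr5_67569391_A_C", "af": 0.5, "dp": 40, "gq": 99},
    {"cell": "cell2", "variant": "chr5_67569391_A_C", "af": 0.5, "dp": 4, "gq": 99},
    {"cell": "cell2", "variant": "chr4_55968053_A_C", "af": 1.0, "dp": 35, "gq": 80},
]
cells, variants, matrix = build_matrix(calls)
```

The second call fails the depth threshold, so cell2 receives 0.0 for that variant. A matrix of this shape is what a dimensionality reduction step would take as input.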
  • the method worked with both genotypes and allele frequencies. This matrix was then reduced to 2 dimensions using UMAP, followed by HDBSCAN clustering. The observed clusters were annotated by comparing their genotype signature with known genotypes of pseudo cells representing a ground truth dataset.
  • FIG.6A depicts results of a clustering analysis that implemented pseudo cells using a 100 plex targeting sequencing panel. Here, four clusters were identified (HG001, HG002, HG005, and mixed cell population).
  • FIG.6B depicts a heatmap of variants across different populations of cells in accordance with the results of the targeting sequencing panel described in FIG.6A.
  • the heat map of variants shows that different cell lines (e.g., HG001, HG002, and HG005) are characterized by the presence and/or frequency of different variants.
  • HG001 and HG002 were characterized by at least the presence of variants of chr5_67569391_A_C, chr3_182681740_C_T, and chr3_182681853_T_C, whereas HG005 did not have these variants.
  • HG005 was characterized by at least the presence of variants of chr9_139396690_C_T, chr20_39792538_C_T, chr4_55968053_A_C, chr20_40979147_C_T, and chr4_55129831_C_T, whereas neither HG001 nor HG002 exhibited these variants.
  • this demonstrates that the 1) filtering of variants using a ground truth dataset and 2) inclusion of pseudo cells enables accurate clustering of the HG001, HG002, and HG005 cell lines according to presence/absence/frequency of the respective variants.
  • the workflow achieves 1% limit of detection as it successfully identified a cluster corresponding to HG001.
  • FIG.7A depicts results of a clustering analysis that implemented pseudo cells using a 600 plex targeting sequencing panel. Notably, 4 distinct clusters were identified when both the 1) filtering of variants using a ground truth dataset and 2) inclusion of pseudo cells are performed.
  • FIG.7B depicts a heatmap of variants across different populations of cells in accordance with the results of the targeting sequencing panel described in FIG.7A. In particular, the heat map of variants shows that different cell lines (e.g., HG001, HG002, and HG005) were successfully identified and characterized by the presence and/or frequency of different variants. Again, this demonstrates that the workflow achieves a 1% limit of detection as it successfully identified a cluster corresponding to HG001.
  • Example 3: Clustering Analysis Using Pseudo Cells Allows Enhanced Detection of Minor Cellular Subpopulations
  • a mixture of different cells (Raji, PC3, DU145, and SKMEL28) was processed using the Tapestri single cell platform and genomic DNA sequencing data was generated from the mixture of cells.
  • Two different calling pipelines were implemented: 1) a first caller pipeline that both incorporated pseudo cells and filtered low quality variants using a ground truth dataset and 2) a second pipeline that did not filter low quality variants using a ground truth dataset (but did use pseudo cells).
  • FIGs.8A, 8B, 9A, and 9B show the results from the first caller pipeline, whereas FIGs.10A and 10B show the results from the second caller pipeline (which did not use a ground truth dataset).
  • Sequencing data derived from single cells were processed using the standard Tapestri variant calling pipeline and analyzed. This resulted in a multi-sample VCF where each sample is a cell in the experiment. The method began by filtering the VCF file to remove low quality data based on thresholds of allele frequency, read depth, genotyping quality, etc. These filters aided in removing a significant amount of noise caused by chemistry-based errors.
  • the result was a matrix of cells vs variants with the values being either the genotypes or allele frequencies for each cell/variant combination.
  • the method worked with both genotypes and allele frequencies. This matrix was then reduced to 2 dimensions using UMAP, followed by HDBSCAN clustering. The observed clusters were annotated by comparing their genotype signature with known genotypes of pseudo cells.
  • This method was tested using the AML_V2 panel (128 amplicons), created for mixtures of 4 known cell lines (Raji, PC3, DU145, and SKMEL28) at 49/49/0.5/0.1% ratios, respectively. The method successfully identified all pseudo cell line populations along with mixed cells, which were caused by the mixing of any 2 or more cell lines.
  • FIGs.8A and 9A depict results of clustering analyses that implemented pseudo cells using the AML_V2 128 plex targeting sequencing panel. In each analysis, five clusters were identified (Raji, PC3, DU145, SKMEL28, and mixed cell population).
  • FIGs.8B and 9B depict heatmaps of variants across different populations of cells in accordance with the results of the targeting sequencing panel described in FIGs.8A and 9A, respectively.
  • FIG.8B and FIG.9B show the variants in columns and cells of different subpopulations as rows.
  • the values in each heatmap represent allele frequencies.
  • the allele frequency of variants differed across the different cell lines. This method successfully identified all cell lines, including the 0.5% and 0.1% spike-in cell lines in these runs.
  • Table 2 Characteristics of cellular subpopulations identified using clustering analysis
  • the second caller pipeline, which does not filter variants using ground truth data, exhibits a poorer limit of detection (worse than 0.1%). Additionally, since this method does not use bulk information, it cannot automatically label the cell populations as cell types or cell lines. Further characteristics of cellular subpopulations that were identified using the clustering analysis are described below in Table 3. Table 3: Characteristics of cellular subpopulations identified using clustering analysis. Example 4: Clustering Analysis Using Pseudo Cells Allows Enhanced Detection of Minor Cellular Subpopulations [00146] Sequencing data derived from single cells is processed using the standard Tapestri variant calling pipeline and analyzed. This results in a multi-sample VCF where each sample is a cell in the experiment.
  • the method begins by filtering the VCF file to remove low quality data based on thresholds of allele frequency, read depth, genotyping quality, etc. These filters aid in removing a significant amount of noise caused by chemistry-based errors.
  • the result is a matrix of cells vs variants with the values being either the genotypes or allele frequencies for each cell/variant combination. The method works with both genotypes and allele frequencies. This matrix is then reduced to 2 dimensions using UMAP, followed by HDBSCAN clustering. The observed clusters are annotated by comparing their genotype signature with known genotypes of the pseudo cell dataset.
  • This method is tested using, for example, the AML_V2 panel of 128 amplicons each, created for mixtures of 4 known cell lines (for example: Raji, PC3, DU145, and SKMEL28) at 49/49/0.5/0.01% ratios, respectively.
  • the method can successfully identify all pseudo cell line populations, including the 0.5% and 0.01% spike-in cell lines, along with mixed cells, which are caused by the mixing of any 2 or more cell lines.

Abstract

Described herein are methods for improved detection of cellular subpopulations following single-cell analysis. Generally, the method involves removing low quality reads and incorporating pseudo cells as ground truth information, followed by dimensionality reduction and clustering. The incorporation of pseudo cells allows for identification of cellular subpopulations with an improved limit of detection.

Description

CELLULAR CLUSTERING ANALYSIS IN SEQUENCING DATASETS
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] This application claims the benefit of and priority to U.S. Provisional Patent Application No. 63/115,713 filed November 19, 2020, the entire disclosure of which is hereby incorporated by reference in its entirety for all purposes.
BACKGROUND
[0002] Single-cell sequencing has the potential to provide unique insights on the cellular and genetic composition, drivers, and signatures of cancer at unparalleled sensitivity. A challenge in reaching such sensitivities is to have a subclone identification method that can accurately identify small populations of cells which have a distinct genotype. Various chemistry-based errors lead to the creation of many small subpopulations of cells, which makes identification of subclones challenging. The methods and computer readable media disclosed herein are directed to addressing this issue.
SUMMARY
[0003] Disclosed herein, in various embodiments, are methods and computer readable media for enhanced cellular clustering in sequencing datasets by 1) using ground truth information (e.g., ground truth information from bulk sequencing) and 2) incorporating pseudo cells into the workflow for improved dimensionality reduction and clustering. The use of ground truth information and pseudo cells enables enhanced detection of minor cellular subpopulations (e.g., to limits of detection of 1%, 0.1%, 0.01% or lower) within a larger cell population.
[0004] Disclosed herein, in various embodiments, is a method for identifying one or more subpopulations within a cell population, the method comprising: (a) obtaining a first dataset comprising cell sequencing data from single cells in a cell population; (b) filtering the cell sequencing data of the first dataset against at least a ground truth dataset derived from a bulk sequencing analysis; (c) incorporating one or more pseudo cells comprising one or more known variants; (d) generating one or more clusters of cells comprising the pseudo cells; and (e) annotating the one or more clusters of cells based on one or more pseudo cells in the clusters. In some embodiments, filtering the cell sequencing data of the first dataset against at least a ground truth dataset comprises (i) identifying a first set of variants within the first dataset, and a second set of variants within the ground truth dataset; and (ii) generating a filtered set of variants by removing a subset of variants within the first set, wherein the subset of variants does not appear in the second set of variants.
[0005] In some embodiments, step (c) of the method occurs before step (a). In some embodiments, step (c) of the method occurs after step (b).
[0006] In some embodiments, the method further comprises removing the one or more pseudo cells from the one or more clusters prior to subsequent analysis. In some embodiments, the one or more known variants of the pseudo cells are determined through a bulk sequencing analysis.
[0007] In some embodiments, annotating the clusters of cells based on presence of one or more pseudo cells comprises annotating a cluster comprising a pseudo cell as a cell population with one or more known variants of the pseudo cell. In some embodiments, annotating the clusters of cells based on presence of one or more pseudo cells comprises annotating a cluster that is devoid of a pseudo cell as a mixed cell population.
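The ground truth filtering described in paragraph [0004], steps (i) and (ii), amounts to keeping only those single-cell variants that also appear in the bulk-derived set. A minimal sketch, assuming variants are represented as identifier strings; the function name is hypothetical and the identifier chr1_100_G_A is a made-up artifact used only for illustration.

```python
def filter_against_ground_truth(single_cell_variants, ground_truth_variants):
    """Remove variants of the first set that do not appear in the second set,
    preserving the order of the first set."""
    truth = set(ground_truth_variants)
    return [v for v in single_cell_variants if v in truth]

# First set: variants called in the single-cell data (last one is a fabricated artifact).
first_set = ["chr5_67569391_A_C", "chr3_182681740_C_T", "chr1_100_G_A"]
# Second set: variants present in the bulk ground truth dataset.
second_set = {"chr5_67569391_A_C", "chr3_182681740_C_T", "chr20_39792538_C_T"}
filtered = filter_against_ground_truth(first_set, second_set)
```

Variants present only in the ground truth set are not added to the result; the filter only removes entries from the single-cell set.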
[0008] In some embodiments, the method further comprises performing a dimensionality reduction analysis prior to generating clusters of cells. In some embodiments, the dimensionality reduction analysis is UMAP or PCA. In some embodiments, generating clusters of cells comprises implementing a HDBSCAN method.
[0009] In some embodiments, the variants are filtered based on thresholds of allele frequency, read depth, and/or genotyping quality. In some embodiments, the variants are filtered based on thresholds of allele frequency. In some embodiments, the variants are filtered based on thresholds of read depth. In some embodiments, the variants are filtered based on thresholds of genotyping quality.
[0010] In some embodiments, the one or more subpopulations are detected with a lower limit below a lower limit of detection where no pseudo cells are incorporated. In some embodiments, the one or more subpopulations are detected with a lower limit below 1%, below 0.5%, below 0.1%, or below 0.01% of the total population. In some embodiments, the one or more subpopulations are detected with a lower limit below 1% of the total population. In some embodiments, the one or more subpopulations are detected with a lower limit below 0.5% of the total population. In some embodiments, the one or more subpopulations are detected with a lower limit below 0.1% of the total population. In some embodiments, the one or more subpopulations are detected with a lower limit below 0.01% of the total population.
[0011] Also disclosed herein, in various embodiments, is a method for identifying one or more subpopulations within a cell population, the method comprising: (a) obtaining a first dataset comprising cell sequencing data from single cells in a cell population; (b) filtering the cell sequencing data of the first dataset against at least a ground truth dataset derived from a bulk sequencing analysis; and (c) identifying one or more subpopulations within the cell population through a combined unsupervised and supervised clustering of cells, wherein the supervised clustering involves one or more pseudo cells comprising one or more known variants. In some embodiments, filtering the cell sequencing data of the first dataset against at least a ground truth dataset comprises: (a) identifying a first set of variants within the first dataset and a second set of variants within the ground truth dataset; and (b) generating a filtered set of variants by removing a subset of variants in the first set, wherein the subset of variants does not appear in the second set of variants.
[0012] In some embodiments, the method further comprises removing the one or more pseudo cells from the one or more clusters prior to subsequent analysis. In some embodiments, the one or more known variants of the pseudo cells are determined through a bulk sequencing analysis.
[0013] In some embodiments, the method further comprises performing a dimensionality reduction analysis prior to generating clusters of cells. In some embodiments, the dimensionality reduction analysis is UMAP or PCA. In some embodiments, the combined unsupervised and supervised clustering of cells comprises implementing a HDBSCAN method.
[0014] In some embodiments, the variants are filtered based on thresholds of allele frequency, read depth, and/or genotyping quality. In some embodiments, the variants are filtered based on thresholds of allele frequency.
In some embodiments, the variants are filtered based on thresholds of read depth. In some embodiments, the variants are filtered based on thresholds of genotyping quality.
[0015] In some embodiments, the one or more subpopulations are detected with a lower limit below a lower limit of detection where no pseudo cells are incorporated. In some embodiments, the one or more subpopulations are detected with a lower limit below 1%, below 0.5%, below 0.1%, or below 0.01% of the total population. In some embodiments, the one or more subpopulations are detected with a lower limit below 1% of the total population. In some embodiments, the one or more subpopulations are detected with a lower limit below 0.5% of the total population. In some embodiments, the one or more subpopulations are detected with a lower limit below 0.1% of the total population. In some embodiments, the one or more subpopulations are detected with a lower limit below 0.01% of the total population.
[0016] Also disclosed herein, in various embodiments, is a non-transitory computer readable medium for identifying one or more subpopulations within a cell population, the non-transitory computer readable medium comprising instructions that, when executed by a processor, cause the processor to: (a) filter the cell sequencing data of a first dataset against at least a ground truth dataset derived from a bulk sequencing analysis; (b) incorporate one or more pseudo cells comprising one or more known variants; (c) generate one or more clusters of cells comprising the pseudo cells according to at least the filtered set of variants; and (d) annotate the one or more clusters of cells based on one or more pseudo cells in the clusters, wherein the first dataset comprises cell sequencing data from single cells in a cell population.
In some embodiments, filtering the cell sequencing data of the first dataset against at least a ground truth dataset comprises: (i) identifying a first set of variants within the first dataset and a second set of variants within the ground truth dataset; and (ii) generating a filtered set of variants by removing a subset of variants in the first set, wherein the subset of variants does not appear in the second set of variants. In some embodiments, the instructions further cause the processor to remove the one or more pseudo cells from the one or more clusters prior to subsequent analysis.
[0017] In some embodiments, the one or more known variants of the pseudo cells are determined through a bulk sequencing analysis. In some embodiments, annotating the clusters of cells based on presence of one or more pseudo cells comprises annotating a cluster comprising a pseudo cell as a cell population with one or more known variants of the pseudo cell. In some embodiments, annotating the clusters of cells based on presence of one or more pseudo cells comprises annotating a cluster that is devoid of a pseudo cell as a mixed cell population.
[0018] In some embodiments, the instructions further cause the processor to perform a dimensionality reduction analysis prior to generating clusters of cells. In some embodiments, the dimensionality reduction analysis is UMAP or PCA. In some embodiments, generating clusters of cells comprises implementing a HDBSCAN method.
[0019] In some embodiments, the variants are filtered based on thresholds of allele frequency, read depth, and/or genotyping quality. In some embodiments, the variants are filtered based on thresholds of allele frequency. In some embodiments, the variants are filtered based on thresholds of read depth. In some embodiments, the variants are filtered based on thresholds of genotyping quality.
[0020] In some embodiments, the one or more subpopulations are detected with a lower limit below a lower limit of detection where no pseudo cells are incorporated. In some embodiments, the one or more subpopulations are detected with a lower limit below 1%, below 0.5%, below 0.1%, or below 0.01% of the total population. In some embodiments, the one or more subpopulations are detected with a lower limit below 1% of the total population. In some embodiments, the one or more subpopulations are detected with a lower limit below 0.5% of the total population. In some embodiments, the one or more subpopulations are detected with a lower limit below 0.1% of the total population. In some embodiments, the one or more subpopulations are detected with a lower limit below 0.01% of the total population.

[0021] Also disclosed herein, in various embodiments, is a non-transitory computer readable medium for identifying one or more subpopulations within a cell population, the non-transitory computer readable medium comprising instructions that, when executed by a processor, cause the processor to: filter cell sequencing data of a first dataset against at least a ground truth dataset derived from a bulk sequencing analysis; and identify one or more subpopulations within a cell population through a combined unsupervised and supervised clustering of cells, wherein the supervised clustering involves one or more pseudo cells comprising one or more known variants, wherein the first dataset comprises cell sequencing data obtained from single cells in the cell population.
In some embodiments, filtering the cell sequencing data of the first dataset against at least a ground truth dataset comprises: (i) identifying a first set of variants within the first dataset and a second set of variants within the ground truth dataset; and (ii) generating a filtered set of variants by removing a subset of variants in the first set, wherein the subset of variants does not appear in the second set of variants.

[0022] In some embodiments, the instructions further cause the processor to remove the one or more pseudo cells from the one or more clusters prior to subsequent analysis. In some embodiments, the one or more known variants of the pseudo cells are determined through a bulk sequencing analysis.

[0023] In some embodiments, the instructions further cause the processor to perform a dimensionality reduction analysis prior to generating clusters of cells. In some embodiments, the dimensionality reduction analysis is UMAP or PCA. In some embodiments, the combined unsupervised and supervised clustering of cells comprises implementing an HDBSCAN method.

[0024] In some embodiments, the variants are filtered based on thresholds of allele frequency, read depth, and/or genotyping quality. In some embodiments, the variants are filtered based on thresholds of allele frequency. In some embodiments, the variants are filtered based on thresholds of read depth. In some embodiments, the variants are filtered based on thresholds of genotyping quality.

[0025] In some embodiments, the one or more subpopulations are detected with a lower limit below a lower limit of detection where no pseudo cells are incorporated. In some embodiments, the one or more subpopulations are detected with a lower limit below 1%, below 0.5%, below 0.1%, or below 0.01% of the total population. In some embodiments, the one or more subpopulations are detected with a lower limit below 1% of the total population.
In some embodiments, the one or more subpopulations are detected with a lower limit below 0.5% of the total population. In some embodiments, the one or more subpopulations are detected with a lower limit below 0.1% of the total population. In some embodiments, the one or more subpopulations are detected with a lower limit below 0.01% of the total population.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

[0026] These and other features, aspects, and advantages of the present invention will become better understood with regard to the following description and accompanying drawings, where:

[0027] Figure (FIG.) 1A depicts an overall system environment including a cell analysis workflow device and a computing device for identifying cellular subpopulations, in accordance with an embodiment.

[0028] FIG. 1B depicts a block diagram of separate modules of a computing device, in accordance with an embodiment.

[0029] FIG. 2 is a flow diagram of the clustering analysis, in accordance with an embodiment.

[0030] FIG. 3 depicts an example computing device for implementing systems and methods described in reference to FIGs. 1A, 1B, and 2.

[0031] FIG. 4 depicts results of a clustering analysis that does not use pseudo cells representing a truth dataset and does not filter variants using a ground truth dataset (e.g., bulk sequencing dataset).

[0032] FIG. 5 depicts results of a clustering analysis that implements pseudo cells representing a truth dataset and filters variants using a ground truth dataset (e.g., bulk sequencing dataset).

[0033] FIG. 6A depicts results of a clustering analysis that implements pseudo cells representing a truth dataset and filters variants using a ground truth dataset (e.g., bulk sequencing dataset) using a 100-plex targeted sequencing panel.

[0034] FIG. 6B depicts a heatmap of variants across different populations of cells in accordance with the results of the targeted sequencing panel described in FIG. 6A.
[0035] FIG. 7A depicts results of a clustering analysis that implements pseudo cells representing a truth dataset and filters variants using a ground truth dataset (e.g., bulk sequencing dataset) using a 600-plex targeted sequencing panel.

[0036] FIG. 7B depicts a heatmap of variants across different populations of cells in accordance with the results of the targeted sequencing panel described in FIG. 7A.

[0037] FIG. 8A depicts results of a clustering analysis that implements pseudo cells representing a truth dataset using an AML_V2 (128-plex) targeted sequencing panel.

[0038] FIG. 8B depicts a heatmap of variants across different populations of cells in accordance with the results of the targeted sequencing panel described in FIG. 8A.

[0039] FIG. 9A depicts results of a clustering analysis that implements pseudo cells representing a truth dataset using an AML_V2 (128-plex) targeted sequencing panel.

[0040] FIG. 9B depicts a heatmap of variants across different populations of cells in accordance with the results of the targeted sequencing panel described in FIG. 9A.

[0041] FIG. 10A depicts results of a clustering analysis of the same run as FIG. 8A that does not implement pseudo cells.

[0042] FIG. 10B depicts results of a clustering analysis of the same run as FIG. 9A that does not implement pseudo cells.

DETAILED DESCRIPTION

Definitions

[0043] Terms used in the claims and specification are defined as set forth below unless otherwise specified.

[0044] The term “subject” encompasses a cell, tissue, or organism, human or non-human, whether in vivo, ex vivo, or in vitro, male or female.

[0045] The term “sample” can include a single cell or multiple cells or fragments of cells or an aliquot of body fluid, such as a blood sample, taken from a subject, by means including venipuncture, excretion, ejaculation, massage, biopsy, needle aspirate, lavage sample, scraping, surgical incision, or intervention or other means known in the art.
[0046] The phrases “mismatched base” and “alternate base” are used interchangeably and refer to a base at a position that differs from a known reference base at the same position. In some scenarios, a mismatched base is erroneously identified (e.g., erroneously identified during sequencing). An erroneous identification of a base can arise from various sources such as PCR errors, sequencing errors, sequencing alignment errors, and/or correction errors. To provide an example, a known base at a reference position can be adenine (A). A mismatched base or alternate base refers to a base other than adenine (A) at the same position (e.g., the base is any one of guanine (G), cytosine (C), or thymine (T)).

[0047] The phrase “reference base” refers to a base with a known nucleotide identity. In one embodiment, the reference base is determined from a reference genome sequence. In one embodiment, the reference base is determined from one or more sequence reads obtained from a control cell.

[0048] The term “subpopulation” refers to a discrete grouping of cells within a larger population of cells that share a common genotype as identified by sequencing analysis.

[0049] The term “variant” encompasses mutations of a cell including polymorphisms, single nucleotide polymorphisms (SNPs), single nucleotide variants (SNVs), insertions, deletions, knock-ins, knock-outs, copy number variations (CNVs), duplications, translocations, and loss of heterozygosity (LOH).

[0050] It must be noted that, as used in the specification, the singular forms “a,” “an” and “the” include plural referents unless the context clearly dictates otherwise.

Overview

[0051] Disclosed herein is a method for improved clustering and identification of cellular subpopulations in a mixed cell population. Generally, a population of cells (e.g., mixed cells) can undergo single cell analysis to generate DNA sequencing reads of individual cells.
Methods for identifying cellular subpopulations, also referred to herein as caller pipelines, involve 1) filtering DNA sequencing reads of the individual cells against a ground truth dataset to remove low quality data and 2) incorporating pseudo cells as truth data such that cellular subpopulations are more accurately clustered and identified. In various embodiments, the methods disclosed herein achieve an improved limit of detection in comparison to conventional methodologies. In some scenarios, the methods disclosed herein achieve at least a 1% limit of detection, at least a 0.1% limit of detection, or at least a 0.01% limit of detection.

[0052] In various embodiments, the methods disclosed herein involve performing a dimensionality reduction of the sequencing data obtained from the single cells. In various embodiments, the methods disclosed herein involve a combination of unsupervised and supervised techniques for clustering and improved identification of cellular subpopulations. For example, the unsupervised technique includes unsupervised clustering of cells based on sequence reads or dimensionally reduced sequence reads derived from the cells. The supervised technique includes using pseudo cells representing known ground truth labels. Such pseudo cells are useful for labeling clusters generated through unsupervised clustering techniques based on known genotypes of the pseudo cells. In various embodiments, the genotypes of the pseudo cells are determined through sequencing methods, such as bulk sequencing methods.

[0053] Figure (FIG.) 1A depicts an overall system environment 100 including a cell analysis workflow device 120 and a computing device 130 for variant calling, in accordance with an embodiment. A cell population 110 is obtained. In various embodiments, the cell population 110 can be isolated from a test sample obtained from a subject or a patient. In various embodiments, the cell population 110 includes healthy cells taken from a healthy subject.
In various embodiments, the cell population 110 includes diseased cells taken from a subject. In one embodiment, the cell population 110 includes cancer cells taken from a subject previously diagnosed with cancer. For example, cancer cells can be tumor cells available in the bloodstream of the subject diagnosed with cancer. As another example, cancer cells can be cells obtained through a tumor biopsy.

[0054] In various embodiments, the cell population 110 may be a mixed cell population. In various embodiments, the cell population 110 may include at least 2 cell lines. In various embodiments, the cell population 110 may include at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, at least 18, at least 19, at least 20, at least 25, at least 30, at least 40, at least 50, at least 60, at least 70, at least 80, at least 90, or at least 100 cell lines.

[0055] In various embodiments, the cell population 110 may represent a sample of cells pooled from a plurality of subjects. For example, test samples may be obtained from X number of subjects and the test samples are pooled to generate the cell population 110. In various embodiments, test samples may be obtained and pooled from at least 2 subjects. In various embodiments, test samples may be obtained and pooled from at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, at least 18, at least 19, at least 20, at least 25, at least 30, at least 40, at least 50, at least 60, at least 70, at least 80, at least 90, at least 100, at least 200, at least 300, at least 400, at least 500, at least 600, at least 700, at least 800, at least 900, or at least 1000 subjects.
[0056] In various embodiments, as shown in FIG. 1A, the cell population 110 undergoes bulk sequencing 112 to generate sequence reads of cells of the cell population 110. In various embodiments, the pseudo cells 115 are incorporated into the cell population 110 and therefore, bulk sequencing 112 is performed on cells of the cell population as well as the incorporated pseudo cells 115. Bulk sequencing can involve lysing cells in bulk and extracting nucleic acids from the cell population. The nucleic acids are then sequenced using sequencing techniques described herein (e.g., next generation sequencing (NGS) platforms, including platforms that perform any of sequencing by synthesis, sequencing by ligation, pyrosequencing, using reversible terminator chemistry, using phospholinked fluorescent nucleotides, or real-time sequencing). Further details of bulk sequencing are described herein. The bulk sequencing 112 may generate sequencing reads that are informative for determining cellular genotypes across the full cell population (e.g., average cellular genotype) as opposed to information at the single-cell level. However, in various embodiments, the bulk sequencing 112 methods are simpler (e.g., using fewer reagents and chemical reactions) than the single-cell processing methods performed by the cell analysis workflow device 120 and therefore, the sequencing reads generated via bulk sequencing 112 may be less prone to chemistry-based errors in comparison to the sequencing reads generated via the cell analysis workflow device 120.

[0057] In various embodiments, the cell population 110 is split such that a first portion of the cell population 110 is provided to the cell analysis workflow device 120 and a second portion of the cell population 110 is provided to the bulk sequencing 112.
Therefore, sequencing reads derived from the cell population 110 that are generated via the bulk sequencing 112 can be used to filter sequencing reads and variants that are generated via the cell analysis workflow device 120, as is described in further detail herein.

[0058] The cell analysis workflow device 120 refers to a device that processes cells and generates nucleic acids for sequencing. In various embodiments, the cell analysis workflow device 120 refers to a system comprising one or more devices that process cells and generate nucleic acids for sequencing. In various embodiments, the cell analysis workflow device 120 is a workflow device that generates nucleic acids from single cells, thereby enabling the subsequent identification of sequence reads and individual cells from which the sequence reads originated. Further details of sequencing and/or read alignment are described herein.

[0059] In various embodiments, the cell analysis workflow device 120 can perform single-cell processing by encapsulating individual cells into emulsions, lysing cells within emulsions, performing cell barcoding of cell lysate in emulsions, and performing a nucleic acid amplification reaction in emulsions. Thus, amplified nucleic acids can be collected and sequenced. In various embodiments, the single-cell processing involves amplifying nucleic acids derived from genomic DNA of the cells. Therefore, obtained sequence reads provide information about the genomic DNA of the cells. In various embodiments, the single-cell processing involves amplifying nucleic acids derived from RNA of the cells. Therefore, obtained sequence reads provide information about the gene expression of the cells. In various embodiments, the single-cell processing involves amplifying nucleic acids derived from genomic DNA of the cells and further amplifying nucleic acids derived from RNA of the cells.
Therefore, obtained sequence reads provide information about both genomic DNA and gene expression of the cells. Further description of example embodiments of single-cell workflow processes for analyzing genomic DNA or RNA of single cells is found in US Application No. 14/420,646 and WO2020206184, each of which is hereby incorporated by reference in its entirety.

[0060] In particular embodiments, the cell analysis workflow device 120 can be any of the Tapestri™ Platform, inDrop™ system, Nadia™ instrument, or the Chromium™ instrument. In various embodiments, the cell analysis workflow device 120 includes a sequencer for sequencing the nucleic acids to generate sequence reads.

[0061] As shown in FIG. 1A, the method for identifying cellular subpopulations involves incorporating pseudo cells 115. In some embodiments, pseudo cells 115 are incorporated as physical cells in the cell population 110. In some embodiments, pseudo cells 115 are incorporated into the cell population 110 at a concentration of less than 0.1% of cells in the total cell population. In some embodiments, pseudo cells 115 are incorporated into the cell population 110 at a concentration of less than 0.5% of cells in the total cell population. In some embodiments, pseudo cells 115 are incorporated into the cell population 110 at a concentration of less than 1% of cells in the total cell population. In some embodiments, pseudo cells 115 are incorporated into the cell population 110 at a concentration of less than 5% of cells in the total cell population. In some embodiments, pseudo cells 115 are incorporated into the cell population 110 at a concentration of less than 10% of cells in the total cell population.
In some embodiments, pseudo cells 115 are incorporated into the cell population 110 at a concentration of less than 0.2%, less than 0.3%, less than 0.4%, less than 0.5%, less than 0.6%, less than 0.7%, less than 0.8%, less than 0.9%, less than 1%, less than 2%, less than 3%, less than 4%, less than 5%, less than 6%, less than 7%, less than 8%, less than 9%, or less than 10% of cells in the total cell population 110. Further details regarding pseudo cells are described herein.

[0062] In some embodiments, pseudo cells are not physically incorporated into the cell population 110. As shown in FIG. 1A, pseudo cells 115 can be incorporated as data with the sequence reads from the cell analysis workflow device 120. For example, sequence reads derived from the pseudo cells 115 can be obtained and incorporated along with sequence reads generated by the cell analysis workflow device 120. In various embodiments, the sequence reads derived from the pseudo cells 115 are generated via a single-cell analysis (e.g., any of Tapestri™ Platform, inDrop™ system, Nadia™ instrument, or the Chromium™ instrument) or via a bulk sequencing analysis. In particular embodiments, pseudo cells 115 undergo bulk sequencing to generate sequence reads. The sequence reads derived from the pseudo cells 115 undergo subsequent processing with the sequence reads generated by the cell analysis workflow device 120, which leads to the improved clustering and subpopulation identification. In various embodiments, the sequence reads derived from the pseudo cells 115 can be obtained from a third party (e.g., a third party who processes the pseudo cells 115). For example, the third party may operate a single-cell analysis workflow device (e.g., any of Tapestri™ Platform, inDrop™ system, Nadia™ instrument, or the Chromium™ instrument) to generate sequence reads of the pseudo cells 115.
As another example, the third party may perform bulk sequencing on at least the pseudo cells 115 to generate the sequence reads of the pseudo cells 115.

[0063] In some embodiments, a quantity of sequence reads of pseudo cells 115 that are incorporated represents less than 0.01% of total sequence reads generated by the cell analysis workflow device 120 from the cell population 110. In some embodiments, a quantity of sequence reads of pseudo cells 115 that are incorporated represents less than 0.1% of total sequence reads generated by the cell analysis workflow device 120 from the cell population 110. In some embodiments, a quantity of sequence reads of pseudo cells 115 that are incorporated represents less than 0.5% of total sequence reads generated by the cell analysis workflow device 120 from the cell population 110. In some embodiments, a quantity of sequence reads of pseudo cells 115 that are incorporated represents less than 1% of total sequence reads generated by the cell analysis workflow device 120 from the cell population 110. In some embodiments, a quantity of sequence reads of pseudo cells 115 that are incorporated represents less than 5% of total sequence reads generated by the cell analysis workflow device 120 from the cell population 110. In some embodiments, a quantity of sequence reads of pseudo cells 115 that are incorporated represents less than 10% of total sequence reads generated by the cell analysis workflow device 120 from the cell population 110. In various embodiments, a quantity of sequence reads of pseudo cells 115 that are incorporated represents less than 0.2%, less than 0.3%, less than 0.4%, less than 0.5%, less than 0.6%, less than 0.7%, less than 0.8%, less than 0.9%, less than 1%, less than 2%, less than 3%, less than 4%, less than 5%, less than 6%, less than 7%, less than 8%, less than 9%, or less than 10% of total sequence reads generated by the cell analysis workflow device 120 from the cell population 110.
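As a worked illustration of the read-fraction limits recited above, the following sketch computes the percentage that pseudo-cell reads represent relative to the reads generated by the cell analysis workflow device, and lists which of the recited upper limits that percentage satisfies. The function names and the read counts are hypothetical and chosen only for illustration; the limits tuple covers only a subset of the recited values.

```python
def pseudo_read_percentage(pseudo_reads, device_reads):
    """Pseudo-cell reads as a percentage of reads generated by the workflow device."""
    if device_reads <= 0:
        raise ValueError("device_reads must be positive")
    return 100.0 * pseudo_reads / device_reads


def satisfied_limits(pseudo_reads, device_reads,
                     limits=(0.01, 0.1, 0.5, 1.0, 5.0, 10.0)):
    """Return the upper limits (in percent) that the pseudo-cell read share stays below."""
    percentage = pseudo_read_percentage(pseudo_reads, device_reads)
    return [limit for limit in limits if percentage < limit]


# Hypothetical counts: 500 pseudo-cell reads added to 1,000,000 device reads.
share = pseudo_read_percentage(500, 1_000_000)
met = satisfied_limits(500, 1_000_000)
```

With these hypothetical counts the pseudo-cell reads amount to about 0.05% of the device reads, so the "less than 0.1%" limit and every larger limit are met, while the "less than 0.01%" limit is not.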
[0064] The computing device 130 is configured to receive the sequence reads from the cell analysis workflow device 120 and to process the sequence reads to identify one or more cellular subpopulations 140. In various embodiments, the computing device 130 is communicatively coupled to the cell analysis workflow device 120, and therefore, directly receives the sequence reads from the cell analysis workflow device 120. In various embodiments, the computing device 130 filters sequence reads and/or variants (e.g., by thresholding variants and/or by filtering against a ground truth dataset), performs dimensionality reduction of sequencing data of cells and pseudo cells 115, and further performs clustering of cells and pseudo cells 115. In particular embodiments, by clustering cells and pseudo cells 115 according to their sequencing data, the computing device 130 identifies subpopulations 140 within the cell population 110.

[0065] FIG. 1B details modules within the computing device 130. Specifically, FIG. 1B introduces a filter module 132, a dimensionality reduction module 134, a clustering module 136, and an annotation module 138. In various embodiments, the computing device 130 may include additional or fewer modules.

[0066] The filter module 132 performs filtering of sequence reads of cells that are obtained via the cell analysis workflow device 120 (see FIG. 1A). In various embodiments, the filter module 132 may filter sequence reads of cells against threshold values of any of allele frequency, read depth, and/or genotyping quality. For example, the filter module 132 may remove variants that do not meet a threshold value of allele frequency, read depth, and/or genotyping quality. In various embodiments, the filter module 132 may filter sequence reads of cells against a ground truth dataset, such as a ground truth dataset derived from bulk sequencing of the cells.
Here, the filter module 132 removes variants of the sequence reads of the cells detected via the single-cell analysis (e.g., by the cell analysis workflow device 120) that do not appear in the sequence reads of the cells that underwent bulk sequencing 112 (see FIG. 1A). Further details of filtering the sequence reads of the cells are described herein.

[0067] The dimensionality reduction module 134 performs dimensionality reduction on the sequencing reads of the cells that are obtained via the cell analysis workflow device 120. In various embodiments, the dimensionality reduction module 134 further incorporates sequence reads from one or more pseudo cells (e.g., sequence reads of pseudo cells obtained via bulk sequencing). Therefore, the dimensionality reduction module 134 performs dimensionality reduction of sequence reads of cells obtained via the cell analysis workflow device 120 and sequence reads of pseudo cells. In various embodiments, the dimensionality reduction module 134 performs dimensionality reduction on filtered sequencing reads of the cells, wherein the filtering has been performed by the filter module 132. In such embodiments, low quality reads and variants have been removed and therefore, the dimensionally reduced dataset resulting from the dimensionality reduction module 134 can retain more relevant information.

[0068] The clustering module 136 clusters cells according to the sequence reads and/or detected variants of the cells. In various embodiments, the clustering module 136 clusters cells according to the filtered sequence reads and/or filtered variants of the cells. In various embodiments, the clustering module 136 clusters cells according to the dimensionally reduced filtered sequence reads and/or dimensionally reduced filtered variants of the cells.
In particular embodiments, the clustering module 136 clusters both cells and pseudo cells (e.g., according to the filtered sequence reads and/or filtered variants of the cells and the sequence reads and/or variants of the pseudo cells). The clustering module 136 generates one or more clusters of cells. Here, each cluster of cells represents a subpopulation of cells.

[0069] The annotation module 138 annotates clusters as subpopulations based on the presence or absence of one or more pseudo cells within the clusters. For example, the pseudo cell may have a known genotype and is known to originate from a source. Therefore, the annotation module 138 may annotate a cluster in which the pseudo cell is located with the source from which the pseudo cell originated. As an example, the pseudo cell may be obtained from a subject (e.g., a human subject) and is located in a cluster of cells. Therefore, the annotation module 138 may annotate the cluster of cells as originating from the subject that the pseudo cell was obtained from. In such an example, the annotation module 138 may identify subjects that clusters of cells are obtained from, thereby de-multiplexing cells and enabling subject-specific analysis.

[0070] FIG. 2 is a flow diagram of the clustering analysis, in accordance with an embodiment.

[0071] At step 210, a filter is applied to remove low quality reads and/or low quality variants. In this step, variants are filtered using standard variant filters included in Tapestri Insights. These include filters on allele frequency, read depth, and percentage of cells in which variants are found.

[0072] At step 220, variants that do not appear in a ground truth dataset are removed. Generally, ground truth information can be used for variant filtering, where sequence reads with variants that are not found in the ground truth dataset can be removed.
Here, only sequence reads with variants which were also found in the ground truth dataset are kept whereas sequence reads with variants that do not appear in the ground truth dataset can be removed.

[0073] At step 230, pseudo cells are added. Pseudo cells represent cells that have genotypes derived from bulk truth. The addition of pseudo cells improves the clustering and further improves the cluster annotation (e.g., step 260).

[0074] At step 240, dimensionality reduction is performed on the cell sequencing data which includes sequencing data of single cells of a cell population (e.g., cell population 110 shown in FIG. 1A) as well as sequencing data of pseudo cells that were added at step 230. In various embodiments, dimensionality reduction is one of principal components analysis (PCA) or uniform manifold approximation and projection (UMAP) analysis.

[0075] At step 250, clusters are generated. In various embodiments, generating clusters involves implementing a hierarchical density-based spatial clustering of applications with noise (HDBSCAN) method. Here, the clusters include both cells of the cell population (e.g., cell population 110 shown in FIG. 1A) as well as pseudo cells that were added at step 230.

[0076] At step 260, the generated clusters are annotated. Specifically, the clusters are annotated using the truth variants of pseudo cells. In various embodiments, clusters can contain a pure cell line, or can contain a mixture of cell lines.

[0077] At step 270, pseudo cells are removed from the analysis. In this step, pseudo cells are removed before calculating the ratios of different cell types. This prevents the pseudo cells from altering the subsequent analysis.

Filtering of Variants

[0078] Embodiments of the method disclosed herein involve filtering sequencing data corresponding to single cells. Filtering of sequencing data can comprise removing low-quality data.
Low quality data can arise, for example, from chemistry-based errors occurring during sample preparation or processing. The presence of low quality data can negatively impact subsequent analysis steps (e.g., dimensionality reduction and/or clustering) which further negatively impacts the ability to accurately identify cellular subpopulations. In some embodiments, data are filtered based on thresholds of allele frequency, read depth, genotyping quality, frequency of genotypes per cell, frequency of genotypes present in cells, and/or variant mutation frequency per cell.

[0079] In some embodiments, variants are filtered based on thresholds of allele frequency. For example, if a variant across the sequence reads is observed at a frequency that is below an allele frequency threshold, then the variant is removed. In various embodiments, the allele frequency threshold is 25%. In various embodiments, the allele frequency threshold is 20%. In various embodiments, the allele frequency threshold is 15%. In various embodiments, the allele frequency threshold is 10%. In various embodiments, the allele frequency threshold is 5%. In various embodiments, the allele frequency threshold is 1%. In various embodiments, the allele frequency threshold is 0.5%. In various embodiments, the allele frequency threshold is 0.1%. In various embodiments, the allele frequency threshold is 0.05%. In various embodiments, the allele frequency threshold is 0.01%. In various embodiments, the allele frequency threshold is any of 25%, 20%, 15%, 10%, 5%, 4%, 3%, 2%, 1%, 0.9%, 0.8%, 0.7%, 0.6%, 0.5%, 0.4%, 0.3%, 0.2%, 0.1%, 0.09%, 0.08%, 0.07%, 0.06%, 0.05%, 0.04%, 0.03%, 0.02%, or 0.01%.

[0080] In some embodiments, data are filtered based on thresholds of read depth. For example, if a variant across the sequence reads is observed at a depth that is below a read depth threshold, then the variant is removed. Read depth can be expressed in terms of the number of reads per cell per amplicon.
In various embodiments, the read depth threshold is 25 reads per cell per amplicon. In various embodiments, the read depth threshold is 20 reads per cell per amplicon. In various embodiments, the read depth threshold is 20,000 reads per cell. In various embodiments, the read depth threshold is 15 reads per cell per amplicon. In various embodiments, the read depth threshold is 10 reads per cell per amplicon. In various embodiments, the read depth threshold is 5 reads per cell per amplicon. In various embodiments, the read depth threshold is any of 25 reads per cell per amplicon, 20 reads per cell per amplicon, 15 reads per cell per amplicon, 10 reads per cell per amplicon, or 5 reads per cell per amplicon.

[0081] In some embodiments, data are filtered based on thresholds of genotyping quality. For example, if a variant across the sequence reads is observed at a quality that is below a genotyping quality threshold, then the variant is removed. Genotyping quality can be measured on the Phred quality scale, where Q = -10 log10(E) and E is the estimated probability that the call is erroneous. In various embodiments, the genotyping quality threshold is a Phred score of 20. In various embodiments, the genotyping quality threshold is a Phred score of 21. In various embodiments, the genotyping quality threshold is a Phred score of 22. In various embodiments, the genotyping quality threshold is a Phred score of 23. In various embodiments, the genotyping quality threshold is a Phred score of 24. In various embodiments, the genotyping quality threshold is a Phred score of 25. In various embodiments, the genotyping quality threshold is a Phred score of 26. In various embodiments, the genotyping quality threshold is a Phred score of 27. In various embodiments, the genotyping quality threshold is a Phred score of 28. In various embodiments, the genotyping quality threshold is a Phred score of 29. In various embodiments, the genotyping quality threshold is a Phred score of 30.
In various embodiments, the genotyping quality threshold is a Phred score of 31. In various embodiments, the genotyping quality threshold is a Phred score of 32. In various embodiments, the genotyping quality threshold is a Phred score of 33. In various embodiments, the genotyping quality threshold is a Phred score of 34. In various embodiments, the genotyping quality threshold is a Phred score of 35. In various embodiments, the genotyping quality threshold is any of a Phred score of 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, or 35.

[0082] In various embodiments, data are filtered based on thresholds of frequency of variants genotyped per cell. For example, if a variant across the sequence reads is observed at a frequency that is below a threshold of variants genotyped per cell, then the variant is removed. In various embodiments, the threshold of variants genotyped per cell is 65%. In various embodiments, the threshold of variants genotyped per cell is 60%. In various embodiments, the threshold of variants genotyped per cell is 55%. In various embodiments, the threshold of variants genotyped per cell is 50%. In various embodiments, the threshold of variants genotyped per cell is 45%. In various embodiments, the threshold of variants genotyped per cell is 40%. In various embodiments, the threshold of variants genotyped per cell is 65%, 60%, 55%, 50%, 45%, or 40%.

[0083] In various embodiments, data are filtered based on thresholds of frequency of genotypes present in cells. For example, if the frequency of genotypes present in a cell is below the threshold, then the cell is removed (e.g., sequence reads of the cell are removed). In various embodiments, the threshold of frequency of genotypes present in cells is 65%. In various embodiments, the threshold of frequency of genotypes present in cells is 60%. In various embodiments, the threshold of frequency of genotypes present in cells is 55%.
In various embodiments, the threshold of frequency of genotypes present in cells is 50%. In various embodiments, the threshold of frequency of genotypes present in cells is 45%. In various embodiments, the threshold of frequency of genotypes present in cells is 40%. In various embodiments, the threshold of frequency of genotypes present in cells is 65%, 60%, 55%, 50%, 45%, or 40%.

[0084] In various embodiments, data are filtered based on thresholds of variant mutation frequency per cell. For example, if a variant across the sequence reads is observed at a frequency that is below a threshold of variants mutated per cell, then the variant is removed. In various embodiments, the threshold of variant mutation frequency per cell is 5%. In various embodiments, the threshold of variant mutation frequency per cell is 4%. In various embodiments, the threshold of variant mutation frequency per cell is 3%. In various embodiments, the threshold of variant mutation frequency per cell is 2%. In various embodiments, the threshold of variant mutation frequency per cell is 1%. In various embodiments, the threshold of variant mutation frequency per cell is 0.75%. In various embodiments, the threshold of variant mutation frequency per cell is 0.5%. In various embodiments, the threshold of variant mutation frequency per cell is 0.25%. In various embodiments, the threshold of variant mutation frequency per cell is 5%, 4%, 3%, 2%, 1%, 0.75%, 0.5%, or 0.25%.

[0085] In various embodiments, sequencing data of single cells (e.g., obtained via the cell analysis workflow device 120) are filtered against a ground truth dataset. A ground truth dataset can include sequencing reads informative for determining a genotype of the single cells. In various embodiments, the ground truth dataset includes sequencing reads that were obtained via a sequencing method, such as a bulk sequencing method.
For example, referring again to FIG. 1A, a portion of the cell population 110 may undergo bulk sequencing 112, thereby generating sequencing reads of the ground truth dataset. Here, the ground truth dataset may include one or more variants that are present in the cells. Additionally, the ground truth dataset may include less low-quality data arising from chemistry-based errors in comparison to the sequencing data acquired from the cell analysis workflow device 120. Therefore, the ground truth dataset may include more accurate, higher quality sequencing reads, but does not have the single-cell resolution of the sequencing reads obtained via the cell analysis workflow device 120.

[0086] In various embodiments, sequencing data of the single cells (e.g., obtained via the cell analysis workflow device 120) are filtered against the ground truth dataset to remove sequencing reads and/or variants identified through the single cell sequencing that do not appear in the ground truth dataset. Such sequencing reads and/or variants present in the single cell sequencing, but not in the ground truth dataset, may arise due to processing or chemistry artifacts and therefore do not accurately reflect the genotype of the cells.

[0087] In various embodiments, sequencing data of the single cells are filtered against the ground truth dataset on a per-sequence read basis. For example, for each sequence read, a variant in the sequence read is identified as present. In various embodiments, the presence of a variant in the sequence read can be determined by aligning the sequence read to a reference genome, as is described in further detail herein. For each sequence read, the variant in the sequence read is queried against the ground truth dataset to determine whether the ground truth dataset includes the variant at the position of the reference genome. If the variant was not identified in the ground truth dataset, the variant is removed from further consideration.
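By way of illustration only, filtering of single-cell variant calls against a ground truth dataset can be sketched in Python as follows. The sketch assumes that variants have already been called from aligned reads and that each variant can be keyed by chromosome, position, reference allele, and alternate allele; the `Variant` type, the function name, and the example data are hypothetical and are not taken from the disclosure.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """Hypothetical variant key: position on the reference plus the allele change."""
    chrom: str
    pos: int
    ref: str
    alt: str


def filter_against_ground_truth(cell_variants, truth):
    """Keep, for every cell, only the variant calls that also appear in `truth`.

    cell_variants: dict mapping a cell identifier to a list of Variant calls.
    truth: set of Variant objects found in the ground truth (e.g., bulk) dataset.
    """
    kept = {}
    for cell_id, calls in cell_variants.items():
        # A call absent from the ground truth is treated as a likely artifact.
        kept[cell_id] = [v for v in calls if v in truth]
    return kept


# Made-up example data: one call in cell_1 is not supported by the ground truth.
truth = {Variant("chr1", 100, "A", "G"), Variant("chr2", 250, "C", "T")}
cells = {
    "cell_1": [Variant("chr1", 100, "A", "G"), Variant("chr3", 42, "G", "A")],
    "cell_2": [Variant("chr2", 250, "C", "T")],
}
filtered = filter_against_ground_truth(cells, truth)
```

Using a set for the ground truth keeps each membership query at constant expected cost, which matters when many cells and variants are queried.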
[0088] In various embodiments, sequencing data of the single cells are filtered against the ground truth dataset on a per-variant basis. Here, sequence reads of the single cells are aligned to a reference genome and collapsed. At a particular position of the genome, a variant is identified as being present in at least one sequence read of the single cells. The variant is then queried against the ground truth dataset to determine whether the ground truth dataset also includes the variant at the position of the genome. If the variant was not identified in the ground truth dataset, the variant is removed from further consideration.

[0089] In various embodiments, by filtering the sequencing data of the single cells (e.g., obtained via the cell analysis workflow device 120) against the ground truth dataset, at least 2 variants are removed. In various embodiments, by filtering the sequencing data of the single cells (e.g., obtained via the cell analysis workflow device 120) against the ground truth dataset, at least 5 variants are removed. In various embodiments, by filtering the sequencing data of the single cells (e.g., obtained via the cell analysis workflow device 120) against the ground truth dataset, at least 10 variants are removed. In various embodiments, by filtering the sequencing data of the single cells (e.g., obtained via the cell analysis workflow device 120) against the ground truth dataset, at least 50 variants are removed. In various embodiments, by filtering the sequencing data of the single cells (e.g., obtained via the cell analysis workflow device 120) against the ground truth dataset, at least 100 variants are removed. In various embodiments, by filtering the sequencing data of the single cells (e.g., obtained via the cell analysis workflow device 120) against the ground truth dataset, at least 500 variants are removed.
In various embodiments, by filtering the sequencing data of the single cells (e.g., obtained via the cell analysis workflow device 120) against the ground truth dataset, at least 1000 variants are removed. In various embodiments, by filtering the sequencing data of the single cells (e.g., obtained via the cell analysis workflow device 120) against the ground truth dataset, at least 2 variants, at least 3 variants, at least 4 variants, at least 5 variants, at least 6 variants, at least 7 variants, at least 8 variants, at least 9 variants, at least 10 variants, at least 15 variants, at least 20 variants, at least 25 variants, at least 30 variants, at least 40 variants, at least 50 variants, at least 75 variants, at least 100 variants, at least 150 variants, at least 200 variants, at least 300 variants, at least 400 variants, at least 500 variants, at least 1000 variants, at least 5000 variants, at least 10,000 variants, at least 50,000 variants, at least 100,000 variants, at least 500,000 variants, or at least 1,000,000 variants are removed.

Pseudo Cells

[0090] In various embodiments, methods disclosed herein involve incorporation of pseudo cells. “Pseudo cell” (or “true cell,” used interchangeably) refers to a cell with a known genotype. For example, a “pseudo cell” can refer to a cell with one or more known variants. In various embodiments, the known genotype (e.g., one or more known variants) of the pseudo cell is determined through bulk sequencing techniques. In various embodiments, pseudo cells are removed from the one or more clusters prior to subsequent analysis.

[0091] In some embodiments, one or more pseudo cells are introduced as a physical cell in a mixed population of cells. In some embodiments, one or more pseudo cells are introduced as sequencing data into a sequencing analysis. In various embodiments, the one or more pseudo cells are removed from the clusters prior to subsequent analysis.
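As a non-limiting sketch, introducing pseudo cells as sequencing data can be pictured as appending synthetic rows to a cell-by-variant genotype matrix. The Python below assumes a genotype encoding of 0 (reference), 1 (heterozygous), and 2 (homozygous variant), a fixed variant order shared by all rows, and a hypothetical `copies` parameter for the number of pseudo cells per known source; none of these details is specified in the disclosure.

```python
def add_pseudo_cells(cell_ids, genotypes, variants, bulk_truth, copies=3):
    """Append pseudo-cell rows built from known (e.g., bulk-derived) genotypes.

    cell_ids: list of identifiers for the real single cells.
    genotypes: list of rows, one per real cell, ordered like `variants`.
    variants: list of variant names defining the column order.
    bulk_truth: dict mapping a source name to {variant name: genotype code}.
    Returns new (ids, rows, is_pseudo) lists; the inputs are not modified.
    """
    ids = list(cell_ids)
    rows = [list(row) for row in genotypes]
    is_pseudo = [False] * len(ids)
    for source, calls in bulk_truth.items():
        # Variants not reported for this source are assumed to be reference (0).
        template = [calls.get(v, 0) for v in variants]
        for i in range(copies):
            ids.append(f"pseudo_{source}_{i}")
            rows.append(list(template))
            is_pseudo.append(True)  # flag kept so pseudo cells can be removed later
    return ids, rows, is_pseudo


# Made-up example: two real cells, one known source contributing two pseudo cells.
variants = ["v1", "v2", "v3"]
cell_ids = ["c1", "c2"]
genotypes = [[0, 1, 0], [2, 0, 0]]
bulk_truth = {"line_A": {"v1": 2, "v3": 1}}
ids, rows, is_pseudo = add_pseudo_cells(cell_ids, genotypes, variants, bulk_truth, copies=2)
```

Carrying the `is_pseudo` flag alongside the matrix makes the later removal of pseudo cells a simple mask rather than a lookup by name.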
[0092] Pseudo cells can be any suitable cell type comprising variants that distinguish the pseudo cells from other cells in the analysis. In some embodiments, pseudo cells have at least 4 variants distinguishing them from other cells in the analysis. In some embodiments, pseudo cells have at least 5 variants distinguishing them from other cells in the analysis. In some embodiments, pseudo cells have at least 6 variants distinguishing them from other cells in the analysis. In some embodiments, pseudo cells have at least 7 variants distinguishing them from other cells in the analysis. In some embodiments, pseudo cells have at least 8 variants distinguishing them from other cells in the analysis. In some embodiments, pseudo cells have at least 9 variants distinguishing them from other cells in the analysis. In some embodiments, pseudo cells have at least 10 variants distinguishing them from other cells in the analysis. In some embodiments, pseudo cells have at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, or at least 10 variants distinguishing them from other cells in the analysis.

[0093] In various embodiments, pseudo cells are immune cells such as peripheral blood mononuclear cells (e.g., including any of lymphocytes such as T-cells, B-cells, monocytes, NK-cells or monocytes such as erythrocytes, platelets, neutrophils, basophils, eosinophils). In particular embodiments, pseudo cells are lymphocytes. In particular embodiments, pseudo cells are T-cells. In particular embodiments, pseudo cells are B-cells. In particular embodiments, pseudo cells are NK-cells. In various embodiments, pseudo cells are attached cells (e.g., epithelial cells, connective tissue cells, endothelial cells, muscle cells, or neural cells). In various embodiments, pseudo cells are epithelial cells. In various embodiments, pseudo cells are connective tissue cells. In various embodiments, pseudo cells are endothelial cells.
In various embodiments, pseudo cells are muscle cells. In various embodiments, pseudo cells are nerve cells such as neurons, glia, etc.

[0094] In various embodiments, pseudo cells are cells associated with a disease state (e.g., cancer cells, infected cells, activated immune cells, etc.). In some embodiments, pseudo cells are cancer cells. In some embodiments, pseudo cells are infected cells. In some embodiments, pseudo cells are activated immune cells.

[0095] In various embodiments, pseudo cells are obtained from a subject. In various embodiments, pseudo cells are isolated from a test sample obtained from a subject. In various embodiments, pseudo cells are obtained from a test sample from which a cell population (e.g., cell population 110) is also obtained. For example, a test sample can be obtained from a subject. The test sample can then be processed to separately obtain a first portion comprising pseudo cells and a second portion comprising the cell population (e.g., cell population 110). The pseudo cells can be analyzed (e.g., through bulk sequencing), whereas the cell population can undergo single cell analysis. Thus, in such embodiments, the pseudo cells are allogeneic with respect to the cell population that undergoes single cell analysis. In various embodiments, the cell population can be pooled with other cell populations from other subjects prior to single-cell analysis. Thus, the pseudo cells and/or the analysis of the pseudo cells can be used to later de-multiplex the single-cell analysis of the mixed cell population.

[0096] In various embodiments, pseudo cells are allogeneic to other cells in the analysis. In various embodiments, pseudo cells are heterologous to other cells in the analysis. In various embodiments, pseudo cells are allogeneic to some cells in the analysis and heterologous to other cells in the analysis. In various embodiments, pseudo cells can be established cell lines.
Dimensionality Reduction and Cell Clustering

[0097] In various embodiments of the methods disclosed herein, dimensionality reduction is performed on sequencing data (e.g., sequencing data of single cells and/or sequencing data of incorporated pseudo cells). Dimensionality reduction is a process for translating a data set having many dimensions to a data set having fewer dimensions. In various embodiments, it is desirable to preserve as much as possible the difference, or “distance,” between pairs of data points. Thus, for example, data points that are closely spaced in the high-dimensionality data set can be closely spaced in the low-dimensionality data set, and data points that are widely spaced in the high-dimensionality data set can be widely spaced in the low-dimensionality data set. Such preservation of the sample pair distances can enable preservation of information contained by the data set following dimensionality reduction.

[0098] In some embodiments, the dimensionality reduction analysis can be performed on a variant (e.g., mutations of a cell including polymorphisms, single nucleotide polymorphisms (SNPs), single nucleotide variants (SNVs), insertions, deletions, knock-ins, knock-outs, copy number variations (CNVs), duplications, translocations, and loss of heterozygosity (LOH)) that is present in the sequencing data of single cells and/or pseudo cells. Thus, the dimensionally reduced dataset can retain the information relevant for the variant while eliminating redundancy/correlation across other features.

[0099] In various embodiments, the dimensionality reduction analysis can be performed on two or more variants (e.g., mutations of a cell including polymorphisms, single nucleotide polymorphisms (SNPs), single nucleotide variants (SNVs), insertions, deletions, knock-ins, knock-outs, copy number variations (CNVs), duplications, translocations, and loss of heterozygosity (LOH)).
For example, the dimensionality reduction analysis can be performed on the combination of single nucleotide variants (SNVs) and copy number variations (CNVs). As another example, the dimensionality reduction analysis can be performed on the combination of insertions and deletions. Thus, the dimensionally reduced dataset can retain the information relevant for the combination of features while eliminating redundancy/correlation across other features.

[00100] Examples of dimensionality reduction analysis include principal component analysis (PCA), kernel PCA, graph-based kernel PCA, linear discriminant analysis, generalized discriminant analysis, autoencoder, non-negative matrix factorization, t-distributed stochastic neighbor embedding (t-SNE), uniform manifold approximation and projection (UMAP), and dens-UMAP. In particular embodiments, the dimensionality reduction is principal component analysis (PCA). In particular embodiments, the dimensionality reduction is performed via a non-linear algorithm. In particular embodiments, the dimensionality reduction is Uniform Manifold Approximation and Projection (UMAP).

[00101] In various embodiments of the methods disclosed herein, sequencing data corresponding to single cells are clustered. “Clustering” is a process of sorting data points into groups within a feature space. Clustering methods can be supervised, unsupervised, or combined supervised and unsupervised. Clustering algorithms often rely on machine learning to sort data points into discrete groups, or “clusters.” In various embodiments, the method described detects cellular subpopulations by identifying clusters of data points corresponding to single cells. In some embodiments, the clustering is combined supervised and unsupervised.
Examples of unsupervised cluster analysis include hierarchical clustering, k-means clustering, clustering using mixture models, density-based spatial clustering of applications with noise (DBSCAN), ordering points to identify the clustering structure (OPTICS), or combinations thereof. In some embodiments, the clustering algorithm is the hierarchical density-based spatial clustering of applications with noise (HDBSCAN) method. In particular embodiments, the step of clustering involves clustering cells according to one or more variants in the sequencing data of the cells. In various embodiments, the sequencing data of the cells may have previously undergone filtering to remove low-quality data, as is described in further detail herein. Thus, the clustering of the cells involves clustering according to one or more variants in the sequencing data that remains after the filtering of low-quality data. In particular embodiments, the step of clustering involves clustering cells and additionally pseudo cells according to one or more variants in the sequencing data of the cells and pseudo cells. Thus, the pseudo cells can be grouped into one or more of the clusters. In particular embodiments, the step of clustering involves both 1) clustering cells according to one or more variants in sequencing data that has previously undergone filtering to remove low-quality data and 2) clustering pseudo cells according to one or more variants in the sequencing data of the pseudo cells. Altogether, the previously performed filtering step to remove low-quality data and the inclusion of pseudo cells enable improved clustering and identification of cellular subpopulations.
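For illustration, density-based clustering of the kind listed above can be sketched with a compact, pure-Python implementation of plain DBSCAN. This is the simpler, non-hierarchical method named in the text, not HDBSCAN itself, and the parameter names `eps` and `min_samples`, the quadratic-time neighbor search, and the example points are assumptions chosen for readability rather than details of the disclosure.

```python
import math

NOISE = -1  # label given to points that belong to no cluster


def dbscan(points, eps, min_samples):
    """Label each point with a cluster index, or NOISE, using plain DBSCAN.

    points: list of equal-length coordinate tuples (e.g., reduced-dimension cells).
    eps: neighborhood radius; min_samples: neighbors (self included) for a core point.
    """
    n = len(points)
    labels = [None] * n  # None means "not visited yet"

    def neighbors(i):
        # Brute-force search; adequate for a small illustrative data set.
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_samples:
                queue.extend(j_nbrs)  # core point: keep growing the cluster
        cluster += 1
    return labels


# Made-up 2-D coordinates: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_samples=3)
```

In this toy input the isolated point is labeled as noise, which mirrors how density-based methods can set aside cells that do not fit any subpopulation.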
[00102] In particular embodiments, clusters of cells are generated according to detected variants for two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen, fourteen, fifteen, sixteen, seventeen, eighteen, nineteen, twenty, twenty five, thirty, forty, fifty, sixty, seventy, eighty, ninety, or one hundred genes or more. In particular embodiments, clusters of cells are generated according to two or more detected variants for one or more genes. In particular embodiments, clusters of cells are generated according to two or more detected variants for two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen, fourteen, fifteen, sixteen, seventeen, eighteen, nineteen, twenty, twenty five, thirty, forty, fifty, sixty, seventy, eighty, ninety, or one hundred genes or more and detected structural variants for two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen, fourteen, fifteen, sixteen, seventeen, eighteen, nineteen, twenty, twenty five, thirty, forty, fifty, sixty, seventy, eighty, ninety, or one hundred genes or more.

[00103] In various embodiments, methods disclosed herein for identifying subpopulations of cells include first performing a dimensionality reduction on sequencing data from cells (e.g., single cells and/or pseudo cells) followed by performing clustering of cells and/or pseudo cells according to one or more variants present in the dimensionally reduced sequencing data set. In various embodiments, the clustering of the cells involves performing unsupervised clustering of the cells according to one or more variants present in the dimensionally reduced sequencing data set. In various embodiments, the clustering of the cells involves performing unsupervised clustering of the cells according to information representing dimensionally reduced values of one or more variants present in the dimensionally reduced sequencing data set.
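The dimensionality reduction that precedes clustering can be illustrated, under stated assumptions, with a small principal component analysis written in pure Python. The sketch assumes a cell-by-variant matrix of numeric genotype codes, extracts components by power iteration with deflation (a simple technique chosen here for self-containment, not one named in the disclosure), and uses made-up data; the resulting coordinates could then be passed to a clustering step.

```python
import math


def center(rows):
    """Subtract the per-variant mean from every cell."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(d)]
    return [[r[k] - means[k] for k in range(d)] for r in rows]


def covariance(X):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(X), len(X[0])
    return [[sum(x[a] * x[b] for x in X) / (n - 1) for b in range(d)]
            for a in range(d)]


def mat_vec(C, v):
    return [sum(c * x for c, x in zip(row, v)) for row in C]


def leading_eigenpair(C, iters=500):
    """Power iteration; the fixed start vector is an assumption and could, in
    rare cases, be orthogonal to the leading eigenvector."""
    v = [1.0 / (k + 1) for k in range(len(C))]
    for _ in range(iters):
        w = mat_vec(C, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:
            return 0.0, v  # no variance left in this direction
        v = [x / norm for x in w]
    lam = sum(a * b for a, b in zip(v, mat_vec(C, v)))
    return lam, v


def pca(rows, n_components=2):
    """Project each cell onto the leading principal components."""
    X = center(rows)
    C = covariance(X)
    d = len(C)
    components = []
    for _ in range(n_components):
        lam, v = leading_eigenpair(C)
        components.append(v)
        # Deflation: remove the variance explained by the component just found.
        C = [[C[a][b] - lam * v[a] * v[b] for b in range(d)] for a in range(d)]
    return [[sum(x * c for x, c in zip(row, comp)) for comp in components]
            for row in X]


# Made-up genotype codes: the first three cells and the last three cells differ.
rows = [[2, 2, 0, 0], [2, 1, 0, 0], [2, 2, 0, 1],
        [0, 0, 2, 2], [0, 0, 2, 1], [0, 1, 2, 2]]
coords = pca(rows, n_components=2)
first = [c[0] for c in coords]
```

For this toy matrix the first component places the two groups of cells on opposite sides of zero, which is the kind of separation a subsequent clustering step relies on.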
[00104] In various embodiments, the inclusion of pseudo cells enables detection of one or more subpopulations with a lower limit of detection than that achieved when no pseudo cells are incorporated. In some embodiments, the inclusion of pseudo cells in the analysis enables detection of one or more subpopulations at a limit of detection below 1%, below 0.5%, below 0.1%, or below 0.01% of the total population. In some embodiments, the one or more subpopulations are detected with a limit of detection below 1% of the total population. In some embodiments, the one or more subpopulations are detected with a limit of detection below 0.5% of the total population. In some embodiments, the one or more subpopulations are detected with a limit of detection below 0.1% of the total population. In some embodiments, the one or more subpopulations are detected with a limit of detection below 0.01% of the total population.

[00105] In various embodiments, the inclusion of pseudo cells as a ground truth allows annotation of identified clusters as specific cellular subpopulations. In some embodiments, the clusters of cells are annotated based on presence of one or more pseudo cells. In some embodiments, annotating the clusters of cells based on presence of one or more pseudo cells comprises annotating a cluster comprising a pseudo cell as a cell population with one or more known variants of the pseudo cell. In some embodiments, annotating the clusters of cells based on presence of one or more pseudo cells comprises annotating a cluster that is devoid of a pseudo cell as a mixed cell population.

[00106] In various embodiments, the original cell population (e.g., cell population 110 shown in FIG. 1A) includes a mixture of cells obtained from various subjects. Following implementation of the pipeline described herein, different cells from the different subjects can be grouped in different clusters.
For example, a first cluster can identify a subpopulation of cells that were obtained from a first subject, a second cluster can identify a subpopulation of cells from a second subject, and so on. One or more pseudo cells of known genotypes can be incorporated in the pipeline, and therefore, they are grouped into the clusters. Such pseudo cells may be obtained from the subjects and therefore can serve as positive labels for annotating the clusters.

[00107] To provide an example, assume that cells are obtained from 3 subjects and mixed together to generate a cell population (e.g., cell population 110 shown in FIG. 1A). The cells can be analyzed using a single-cell analysis workflow device to generate sequencing data of the cells. Pseudo cells also obtained from the 3 subjects can be incorporated, and the sequencing data of the cells and pseudo cells can be filtered, dimensionally reduced, and clustered. Here, the presence of the pseudo cells enables grouping of cells into at least 3 distinct clusters. Therefore, given that a pseudo cell is known to originate from one of the 3 subjects, the cells in the cluster containing that pseudo cell can be annotated and identified as also originating from that subject.

[00108] In various embodiments, pseudo cells are removed from the clusters prior to subsequent analysis. In various embodiments, removal of pseudo cells from the clusters refers to the removal of the corresponding sequencing data of the pseudo cells from a library of sequencing reads. For example, subsequent analysis can involve quantifying proportions of subpopulations of cells in the full cell population. Removing the pseudo cells therefore enables an accurate quantification of the proportions of subpopulations of cells, because the counts do not include the pseudo cells.
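The annotation and removal steps described above can be sketched in Python as follows. The sketch assumes that cluster labels are already available for every real cell and pseudo cell, that each pseudo cell carries the name of its known source, and that a cluster holding pseudo cells from more than one source is also reported as mixed; that last rule, the function names, and the example data are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def annotate_clusters(labels, pseudo_source):
    """Name each cluster after the pseudo cells it contains.

    labels: dict mapping every cell identifier (real or pseudo) to a cluster.
    pseudo_source: dict mapping each pseudo cell identifier to its known source.
    """
    sources = defaultdict(set)
    for cell_id, source in pseudo_source.items():
        sources[labels[cell_id]].add(source)
    annotation = {}
    for cluster in set(labels.values()):
        found = sources.get(cluster, set())
        # No pseudo cell: mixed, as in the text. Several sources: assumed mixed.
        annotation[cluster] = next(iter(found)) if len(found) == 1 else "mixed"
    return annotation


def subpopulation_proportions(labels, pseudo_source, annotation):
    """Drop pseudo cells, then report the fraction of real cells per annotation."""
    real = [cid for cid in labels if cid not in pseudo_source]
    counts = Counter(annotation[labels[cid]] for cid in real)
    return {name: count / len(real) for name, count in counts.items()}


# Made-up example: four real cells and two pseudo cells across three clusters.
labels = {"c1": 0, "c2": 0, "c3": 1, "c4": 2, "pA": 0, "pB": 1}
pseudo_source = {"pA": "subject_A", "pB": "subject_B"}
annotation = annotate_clusters(labels, pseudo_source)
proportions = subpopulation_proportions(labels, pseudo_source, annotation)
```

Because the proportions are computed only over real cells, the number of pseudo cells added per source does not bias the reported fractions.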
Methods for Sequencing and Read Alignment

[00109] Embodiments of the invention disclosed herein involve the sequencing of nucleic acids and the alignment of the sequence reads to a reference genome. In various embodiments, the steps of sequencing nucleic acids and aligning sequence reads to a reference genome are performed by a sequencer, such as a sequencer of the cell analysis workflow device 120, as described above in reference to FIG. 1A. Therefore, the sequenced and aligned sequence reads can be analyzed by the dimensionality reduction and clustering device 130 and, more specifically, can be analyzed by the base identification module 210 (see FIG. 2) to identify bases of interest.

[00110] Sequence reads can be obtained with commercially available next generation sequencing (NGS) platforms, including platforms that perform any of sequencing by synthesis, sequencing by ligation, pyrosequencing, sequencing using reversible terminator chemistry, sequencing using phospholinked fluorescent nucleotides, or real-time sequencing. As an example, amplified nucleic acids may be sequenced on an Illumina MiSeq platform.

[00111] In pyrosequencing, fragments of an NGS library are clonally amplified in situ by capturing single template molecules on beads coated with oligonucleotides complementary to the adapters. Each bead carrying a single template type is compartmentalized in a water-in-oil microdroplet, and the template is clonally amplified using a method called emulsion PCR. After amplification, the emulsion is broken and the beads are deposited into individual wells of a picotiter plate that acts as a flow cell during the sequencing reactions. Each of the four dNTP reagents is introduced into the flow cell in an ordered, iterative manner in the presence of sequencing enzymes and a luminescent reporter, such as luciferase.
When the appropriate dNTP is added to the 3′ end of the sequencing primer, the resulting ATP produces a flash of luminescence within the well, which is recorded using a CCD camera. Read lengths of 400 bases or more can be achieved, and 10^6 sequence reads can be obtained, yielding up to 500 million base pairs (Mb) of sequence. Additional details for pyrosequencing are described in Voelkerding et al., Clinical Chem., 55: 641-658, 2009; MacLean et al., Nature Rev. Microbiol., 7: 287-296; US patent No. 6,210,891; US patent No. 6,258,568; each of which is hereby incorporated by reference in its entirety.

[00112] On the Solexa/Illumina platform, sequencing data are produced in the form of short reads. In this method, fragments of an NGS library are captured on the surface of a flow cell that is coated with oligonucleotide anchor molecules. An anchor molecule is used as a PCR primer, but because of the length of the template and its proximity to other nearby anchor oligonucleotides, extension by PCR causes the molecule to arch over and hybridize with a neighboring anchor oligonucleotide, forming a bridge structure on the surface of the flow cell. These DNA loops are denatured and cleaved. Forward strands are then sequenced using reversible dye terminators. The sequence of incorporated nucleotides is determined by detecting fluorescence after incorporation, with each fluorophore and blocking group removed prior to the next dNTP addition cycle. Additional details for sequencing using the Illumina platform are found in Voelkerding et al., Clinical Chem., 55: 641-658, 2009; MacLean et al., Nature Rev. Microbiol., 7: 287-296; US patent No. 6,833,246; US patent No. 7,115,400; US patent No. 6,969,488; each of which is hereby incorporated by reference in its entirety.
[00113] Sequencing of nucleic acid molecules using SOLiD technology includes clonal amplification of the NGS fragment library using emulsion PCR. The template-bearing beads are then immobilized on the derivatized surface of a glass flow cell and annealed with a primer complementary to the adapter oligonucleotide. However, instead of being used for 3' extension, this primer provides a 5' phosphate group for ligation to interrogation probes containing two probe-specific bases followed by 6 degenerate bases and one of four fluorescent labels. In the SOLiD system, interrogation probes have 16 possible combinations of the two bases at the 3' end of each probe and one of four fluorescent dyes at the 5' end. The color of the fluorescent dye, and thus the identity of each probe, corresponds to a specified color-space coding scheme. Multiple rounds of probe annealing, ligation, and fluorescent signal detection are followed by denaturation and then a second round of sequencing using a primer that is offset by one base relative to the original primer. In this way, the template sequence can be computationally reconstructed; template bases are interrogated twice, which leads to increased accuracy. Additional details for sequencing using SOLiD technology are found in Voelkerding et al., Clinical Chem., 55: 641-658, 2009; MacLean et al., Nature Rev. Microbiol., 7: 287-296; US patent No. 5,912,148; US patent No. 6,130,073; each of which is incorporated by reference in its entirety.

[00114] In particular embodiments, HeliScope from Helicos BioSciences is used. Sequencing is achieved by the addition of polymerase and serial additions of fluorescently-labeled dNTP reagents. Incorporation events produce a fluorescent signal corresponding to the dNTP, and the signal is captured by a CCD camera before each dNTP addition cycle.
The read length ranges from 25 to 50 nucleotides, with a total yield exceeding 1 billion nucleotide pairs per analytical run. Additional details for performing sequencing using HeliScope are found in Voelkerding et al., Clinical Chem., 55: 641-658, 2009; MacLean et al., Nature Rev. Microbiol., 7: 287-296; US patent No. 7,169,560; US patent No. 7,282,337; US patent No. 7,482,120; US patent No. 7,501,245; US patent No. 6,818,395; US patent No. 6,911,345; each of which is incorporated by reference in its entirety.
[00115] In some embodiments, a Roche 454 sequencing system is used. 454 sequencing involves two steps. In the first step, DNA is cut into fragments of approximately 300-800 base pairs, and these fragments have blunt ends. Oligonucleotide adapters are then ligated to the ends of the fragments. The adapters serve as primers for amplification and sequencing of the fragments. Fragments can be attached to DNA-capture beads, for example, streptavidin-coated beads, using, for example, an adapter that contains a 5′-biotin tag. Fragments attached to the beads are amplified by PCR within the droplets of an oil-water emulsion. The result is multiple copies of clonally amplified DNA fragments on each bead. In the second step, the beads are captured in wells (several picoliters in volume). Pyrosequencing is carried out on each DNA fragment in parallel. Adding one or more nucleotides leads to the generation of a light signal, which is recorded by the CCD camera of the sequencing instrument. The signal intensity is proportional to the number of nucleotides incorporated. Pyrosequencing uses pyrophosphate (PPi), which is released upon the addition of a nucleotide. PPi is converted to ATP by ATP sulfurylase in the presence of adenosine 5′ phosphosulfate. Luciferase uses ATP to convert luciferin to oxyluciferin, and as a result of this reaction, light is generated that is detected and analyzed.
Additional details for performing 454 sequencing are found in Margulies et al. (2005) Nature 437: 376-380, which is hereby incorporated by reference in its entirety.
[00116] Ion Torrent technology is a DNA sequencing method based on the detection of hydrogen ions that are released during DNA polymerization. A microwell contains a fragment of an NGS fragment library to be sequenced. Under the microwell layer is a hypersensitive ISFET ion sensor. All layers are contained within a semiconductor CMOS chip, similar to the chips used in the electronics industry. When a dNTP is incorporated into a growing complementary strand, a hydrogen ion is released that triggers the hypersensitive ion sensor. If homopolymer repeats are present in the template sequence, multiple dNTP molecules will be incorporated in one cycle. This results in a corresponding number of hydrogen ions being released and a proportionally higher electrical signal. This technology differs from other sequencing technologies in that it does not use modified nucleotides or optical devices. Additional details for Ion Torrent technology are found in Science 327 (5970): 1190 (2010); US Patent Application Publication Nos. 20090026082, 20090127589, 20100301398, 20100197507, 20100188073, and 20100137143, each of which is incorporated by reference in its entirety.
[00117] In various embodiments, sequencing reads obtained from the NGS methods can be filtered by quality and grouped by barcode sequence using any algorithms known in the art, e.g., Python script barcodeCleanup.py. In some embodiments, a given sequencing read may be discarded if more than about 20% of its bases have a quality score (Q-score) less than Q20, indicating a base call accuracy of less than about 99%.
In some embodiments, a given sequencing read may be discarded if more than about 5%, about 10%, about 15%, about 20%, about 25%, or about 30% of its bases have a Q-score less than Q10, Q20, Q30, Q40, Q50, Q60, or more, indicating a base call accuracy of less than about 90%, less than about 99%, less than about 99.9%, less than about 99.99%, less than about 99.999%, less than about 99.9999%, or more, respectively.
[00118] In some embodiments, all sequencing reads associated with a barcode containing fewer than 50 reads may be discarded to ensure that all barcode groups, representing single cells, contain a sufficient number of high-quality reads. In some embodiments, all sequencing reads associated with a barcode containing fewer than 30, fewer than 40, fewer than 50, fewer than 60, fewer than 70, fewer than 80, fewer than 90, fewer than 100, or more reads may be discarded to ensure the quality of the barcode groups representing single cells.
[00119] Sequence reads with common barcode sequences (e.g., meaning that the sequence reads originated from the same cell) may be aligned to a reference genome using methods known in the art to determine alignment position information. The alignment position information may indicate a beginning position and an end position of a region in the reference genome that corresponds to a beginning nucleotide base and end nucleotide base of a given sequence read. A region in the reference genome may be associated with a target gene or a segment of a gene. Example aligner algorithms include BWA, Bowtie, Spliced Transcripts Alignment to a Reference (STAR), TopHat, or HISAT2. Further details for aligning sequence reads to reference sequences are described in US Application No. 16/279,315, which is hereby incorporated by reference in its entirety. In various embodiments, an output file having SAM (sequence alignment map) format or BAM (binary alignment map) format may be generated and output for subsequent analysis.
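The read-level quality filter and the per-barcode read-count filter described above can be illustrated with a minimal Python sketch. It is not the barcodeCleanup.py script named in the text. The function names, the tuple layout of a read, and the choice to count reads per barcode after quality filtering are assumptions made for illustration. The default thresholds (Q20, 20%, 50 reads) are the example values given in the description.

```python
from collections import defaultdict

def phred_to_accuracy(q):
    """Base-call accuracy implied by a Phred quality score Q: 1 - 10^(-Q/10)."""
    return 1.0 - 10.0 ** (-q / 10.0)

def passes_quality(quals, min_q=20, max_low_fraction=0.20):
    """Keep a read unless more than max_low_fraction of its bases fall below min_q."""
    if not quals:
        return False
    low = sum(1 for q in quals if q < min_q)
    return low / len(quals) <= max_low_fraction

def group_and_filter(reads, min_reads_per_barcode=50, **qc):
    """Group passing reads by barcode and drop barcodes with too few reads.

    reads: iterable of (barcode, sequence, quality_scores) tuples (hypothetical layout).
    Assumption: the per-barcode count is taken after the quality filter.
    """
    groups = defaultdict(list)
    for barcode, seq, quals in reads:
        if passes_quality(quals, **qc):
            groups[barcode].append(seq)
    return {bc: seqs for bc, seqs in groups.items()
            if len(seqs) >= min_reads_per_barcode}
```

Each surviving barcode group would then be passed to an aligner; that step is outside the scope of this sketch.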
Computer-Readable Media
[00120] Also disclosed herein, in various embodiments, is a non-transitory computer readable medium for performing the methods disclosed herein. Such media include, but are not limited to: magnetic storage media, such as floppy discs, hard disc storage medium, and magnetic tape; optical storage media such as CD-ROM; electrical storage media such as RAM and ROM; and hybrids of these categories such as magnetic/optical storage media. One of skill in the art can readily appreciate how any of the presently known computer readable media can be used to create a manufacture comprising a recording of the present database information.
Computer Embodiments
[00121] FIG.3 depicts an example computing device for implementing systems and methods described in reference to FIGs.1A, 1B, and 2. In various embodiments, the example computing device 300 serves as the computing device 130 shown in FIG.1A for performing the methods described in FIG.1B in relation to the filter module 132, dimensionality reduction module 134, clustering module 136, and annotation module 138. Examples of a computing device can include a personal computer, desktop computer, laptop, server computer, a computing node within a cluster, message processors, hand-held devices, multi-processor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, mobile telephones, PDAs, tablets, pagers, routers, switches, and the like.
[00122] As shown in FIG.3, in some embodiments, the computing device 300 includes at least one processor 302 coupled to a chipset 304. The chipset 304 includes a memory controller hub 320 and an input/output (I/O) controller hub 322. A memory 306 and a graphics adapter 312 are coupled to the memory controller hub 320, and a display 318 is coupled to the graphics adapter 312. A storage device 308, an input interface 314, and a network adapter 316 are coupled to the I/O controller hub 322.
Other embodiments of the computing device 300 have different architectures.
[00123] The storage device 308 is a non-transitory computer-readable storage medium such as a hard drive, compact disk read-only memory (CD-ROM), DVD, or a solid-state memory device. The memory 306 holds instructions and data used by the processor 302. The input interface 314 is a touch-screen interface, a mouse, track ball, or other type of input interface, a keyboard, or some combination thereof, and is used to input data into the computing device 300. In some embodiments, the computing device 300 may be configured to receive input (e.g., commands) from the input interface 314 via gestures from the user. The graphics adapter 312 displays images and other information on the display 318. The network adapter 316 couples the computing device 300 to one or more computer networks.
[00124] The computing device 300 is adapted to execute computer program modules for providing functionality described herein. As used herein, the term “module” refers to computer program logic used to provide the specified functionality. Thus, a module can be implemented in hardware, firmware, and/or software. In one embodiment, program modules are stored on the storage device 308, loaded into the memory 306, and executed by the processor 302.
[00125] The types of computing devices 300 can vary from the embodiments described herein. For example, the computing device 300 can lack some of the components described above, such as graphics adapters 312, input interface 314, and displays 318. In some embodiments, a computing device 300 can include a processor 302 for executing instructions stored on a memory 306.
[00126] The methods disclosed herein can be implemented in hardware or software, or a combination of both.
In one embodiment, a non-transitory machine-readable storage medium, such as one described above, is provided, the medium comprising a data storage material encoded with machine readable data which, when using a machine programmed with instructions for using said data, is capable of executing instructions for performing the methods disclosed herein. Embodiments of the methods described above can be implemented in computer programs executing on programmable computers, comprising a processor, a data storage system (including volatile and non-volatile memory and/or storage elements), a graphics adapter, an input interface, a network adapter, at least one input device, and at least one output device. A display is coupled to the graphics adapter. Program code is applied to input data to perform the functions described above and generate output information. The output information is applied to one or more output devices, in known fashion. The computer can be, for example, a personal computer, microcomputer, or workstation of conventional design.
[00127] Each program can be implemented in a high-level procedural or object-oriented programming language to communicate with a computer system. However, the programs can be implemented in assembly or machine language, if desired. In any case, the language can be a compiled or interpreted language. Each such computer program is preferably stored on a storage medium or device (e.g., ROM or magnetic diskette) readable by a general or special purpose programmable computer, for configuring and operating the computer when the storage medium or device is read by the computer to perform the procedures described herein. The system can also be implemented as a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner to perform the functions described herein.
[00128] The signature patterns and databases thereof can be provided in a variety of media to facilitate their use. “Media” refers to a manufacture that contains the signature pattern information of the present invention. The databases of the present invention can be recorded on computer readable media, e.g. any medium that can be read and accessed directly by a computer. "Recorded" refers to a process for storing information on computer readable medium, using any such methods as known in the art. Any convenient data storage structure can be chosen, based on the means used to access the stored information. A variety of data processor programs and formats can be used for storage, e.g. word processing text file, database format, etc.
EXAMPLES
[00129] Below are examples of specific embodiments for carrying out the present invention. The examples are offered for illustrative purposes only and are not intended to limit the scope of the present invention in any way. Efforts have been made to ensure accuracy with respect to numbers used (e.g., amounts, temperatures, etc.), but some experimental error and deviation should be allowed for.
Example 1: Comparing Clustering Analysis with and without Pseudo Cells
[00130] A mixture of different cells (HG001, HG002, and HG005) was processed using the Tapestri single cell platform and genomic DNA sequencing data was generated from the mixture of cells. Two different calling pipelines were implemented: 1) a first caller pipeline that did not incorporate pseudo cells and did not filter low quality variants using a ground truth dataset and 2) a second pipeline that incorporated both pseudo cells and filtered low quality variants using a ground truth dataset.
[00131] FIG.4 depicts results of the first caller pipeline that did not use pseudo cells and furthermore did not filter low quality variants using a ground truth dataset.
Here, the clustering analysis resulted in two clusters labeled as “A” and “B.” This method, without using the ground truth dataset for filtering and without incorporating the pseudo cells, yielded incorrect results as it failed to identify the 3 different cell populations. Specifically, dimensionality reduction was noisy, and the clustering method called 3 clusters as a single cluster (indicated as cluster “A”). Additionally, because of the lack of pseudo cell information, the clusters were merely called “A” and “B.”
[00132] In contrast, FIG.5 depicts results of the second calling pipeline that incorporated both pseudo cells and filtered low quality variants using a ground truth dataset (e.g., ground truth dataset from bulk sequencing). Specifically, the ground truth dataset was used to filter out variants (e.g., filter out variants at step 220 in FIG.2). Pseudo cells were incorporated into the analysis and improved the clustering of the different cell populations. Using this second caller pipeline, four different clusters (e.g., HG001, HG002, HG005, and mixed cells) were identified and distinctly labeled according to presence of pseudo cells.
Example 2: Clustering Analysis Using Pseudo Cell Truth Datasets Distinguishes Cell Subpopulations with Differing Genotypes
[00133] A mixture of different cells (HG001, HG002, and HG005) was processed using the Tapestri single cell platform and genomic DNA sequencing data was generated from the mixture of cells. The implemented calling pipeline 1) incorporated pseudo cells and 2) filtered low quality variants using a ground truth dataset.
[00134] Sequencing data derived from single cells were processed using the standard Tapestri variant calling pipeline and analyzed. This resulted in a multi-sample VCF where each sample was a cell in the experiment. The method began by filtering the VCF file to remove low quality data based on thresholds of allele frequency, read depth, genotyping quality, etc.
These filters aided in removing a significant amount of noise caused by chemistry-based errors. The result was a matrix of cells vs variants with the values being either the genotypes or allele frequencies for each cell/variant combination. The method worked with both genotypes and allele frequencies. This matrix was then reduced to 2 dimensions using UMAP, followed by HDBSCAN clustering. The observed clusters were annotated by comparing their genotype signature with known genotypes of pseudo cells representing a ground truth dataset.
[00135] This method was tested using different targeted sequencing panels of sizes 100 to 1000 amplicons each, created for mixtures of 3 Genome in a Bottle (GIAB) cell lines at 49.5/49.5/1% ratios (e.g., 49.5% HG002, 49.5% HG005, and 1% HG001). This method successfully identified all pseudo cell line populations along with a small population of mixed cells that arose from the mixing of any 2 cell lines.
[00136] Specifically, FIG.6A depicts results of a clustering analysis that implemented pseudo cells using a 100-plex targeted sequencing panel. Here, four clusters were identified (HG001, HG002, HG005, and mixed cell population). FIG.6B depicts a heatmap of variants across different populations of cells in accordance with the results of the targeted sequencing panel described in FIG.6A. In particular, the heatmap of variants shows that different cell lines (e.g., HG001, HG002, and HG005) are characterized by the presence and/or frequency of different variants. Namely, HG001 and HG002 were characterized by at least the presence of variants of chr5_67569391_A_C, chr3_182681740_C_T, and chr3_182681853_T_C, whereas HG005 did not have these variants. Additionally, HG005 was characterized by at least the presence of variants of chr9_139396690_C_T, chr20_39792538_C_T, chr4_55968053_A_C, chr20_40979147_C_T, and chr4_55129831_C_T, whereas neither HG001 nor HG002 exhibited these variants.
Altogether, this demonstrates that 1) filtering of variants using a ground truth dataset and 2) inclusion of pseudo cells enable accurate clustering of the HG001, HG002, and HG005 cell lines according to presence/absence/frequency of the respective variants. Furthermore, the workflow achieves a 1% limit of detection as it successfully identified a cluster corresponding to HG001.
[00137] Additionally, FIG.7A depicts results of a clustering analysis that implemented pseudo cells using a 600-plex targeted sequencing panel. Notably, 4 distinct clusters were identified when both 1) filtering of variants using a ground truth dataset and 2) inclusion of pseudo cells are performed. FIG.7B depicts a heatmap of variants across different populations of cells in accordance with the results of the targeted sequencing panel described in FIG.7A. In particular, the heatmap of variants shows that different cell lines (e.g., HG001, HG002, and HG005) were successfully identified and characterized by the presence and/or frequency of different variants. Again, this demonstrates that the workflow achieves a 1% limit of detection as it successfully identified a cluster corresponding to HG001.
[00138] Further characteristics of cellular subpopulations that were identified using the clustering analysis are described below in Table 1.
[00139] The heatmaps in FIG.6B and FIG.7B show the variants in columns and cells of different subpopulations as rows. The values in each heatmap represent allele frequencies. Here, the allele frequency of variants differed across the different cell lines. Altogether, this clustering analysis method successfully identified all cell lines, including the 1% spike-in cell line in these runs.
Table 1: Characteristics of cellular subpopulations identified using clustering analysis.
Figure imgf000036_0001
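The ground-truth filtering step used in Examples 1 and 2 (removing single-cell variants that do not appear in the bulk-derived ground truth dataset) can be sketched in a few lines of Python. This is a minimal illustration and not the Tapestri pipeline itself. The function name, the nested-dictionary layout, the allele-frequency values, and the variant chr1_100_G_A are hypothetical. The other two variant identifiers are taken from the FIG.6B discussion above.

```python
def filter_against_ground_truth(cell_calls, ground_truth_variants):
    """Drop every variant that is absent from the ground truth set.

    cell_calls: {cell_id: {variant_id: allele_frequency}} (hypothetical layout).
    ground_truth_variants: iterable of variant ids observed in bulk sequencing.
    """
    truth = set(ground_truth_variants)
    return {
        cell: {variant: af for variant, af in calls.items() if variant in truth}
        for cell, calls in cell_calls.items()
    }

# Illustrative input only; the allele frequencies are made up.
bulk_variants = ["chr5_67569391_A_C", "chr3_182681740_C_T"]
cells = {
    "cell_1": {"chr5_67569391_A_C": 50.0, "chr1_100_G_A": 3.0},  # second call is invented noise
    "cell_2": {"chr3_182681740_C_T": 100.0},
}
filtered = filter_against_ground_truth(cells, bulk_variants)
```

Using a set for the ground truth keeps each membership check constant-time, which matters when the cells-by-variants matrix is large.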
Example 3: Clustering Analysis Using Pseudo Cells Allows Enhanced Detection of Minor Cellular Subpopulations
[00140] A mixture of different cells (Raji, PC3, DU145, and SKMEL28) was processed using the Tapestri single cell platform and genomic DNA sequencing data was generated from the mixture of cells. Two different calling pipelines were implemented: 1) a first caller pipeline that incorporated both pseudo cells and filtered low quality variants using a ground truth dataset and 2) a second pipeline that did not filter low quality variants using a ground truth dataset (but did use pseudo cells). FIGs.8A, 8B, 9A, and 9B show the results from the first caller pipeline whereas FIGs.10A and 10B show the results from the second caller pipeline (which did not use a ground truth dataset).
[00141] Sequencing data derived from single cells were processed using the standard Tapestri variant calling pipeline and analyzed. This resulted in a multi-sample VCF where each sample is a cell in the experiment. The method began by filtering the VCF file to remove low quality data based on thresholds of allele frequency, read depth, genotyping quality, etc. These filters aided in removing a significant amount of noise caused by chemistry-based errors. The result was a matrix of cells vs variants with the values being either the genotypes or allele frequencies for each cell/variant combination. The method worked with both genotypes and allele frequencies. This matrix was then reduced to 2 dimensions using UMAP, followed by HDBSCAN clustering. The observed clusters were annotated by comparing their genotype signature with known genotypes of pseudo cells.
[00142] This method was tested using the AML_V2 panel of 128 amplicons, created for mixtures of 4 known cell lines (Raji, PC3, DU145, and SKMEL28) at 49/49/0.5/0.1% ratios, respectively.
The method successfully identified all pseudo cell line populations along with mixed cells that arose from the mixing of any 2 or more cell lines. Altogether, this demonstrates that the workflow can achieve a limit of detection of 0.5% (as evidenced by the detection of the 0.5% DU145 cell population) and can even achieve a limit of detection of 0.1% (as evidenced by the detection of the 0.1% SKMEL28 cell population).
[00143] Specifically, FIGs.8A and 9A depict results of clustering analyses that implemented pseudo cells using the AML_V2 128-plex targeted sequencing panel. In each analysis, five clusters were identified (Raji, PC3, DU145, SKMEL28, and mixed cell population). FIGs.8B and 9B depict heatmaps of variants across different populations of cells in accordance with the results of the targeted sequencing panel described in FIGs.8A and 9A, respectively. Further characteristics of cellular subpopulations that were identified using the clustering analysis are described below in Table 2.
[00144] The heatmaps in FIG.8B and FIG.9B show the variants in columns and cells of different subpopulations as rows. The values in each heatmap represent allele frequencies. Here, the allele frequency of variants differed across the different cell lines. This method successfully identified all cell lines, including the 0.5% and 0.1% spike-in cell lines in these runs.
Table 2: Characteristics of cellular subpopulations identified using clustering analysis
Figure imgf000037_0001
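The quality-filtering step described in this example (thresholds of allele frequency, read depth, and genotyping quality applied before building the cells-by-variants matrix) might be sketched as follows. The text does not state the threshold values, so the defaults below are placeholders, and the `Call` record and function names are hypothetical. The sketch also assumes the matrix holds only variant-supporting calls and marks failing calls as missing instead of deleting them.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Call:
    allele_frequency: float  # percent of reads supporting the variant allele
    read_depth: int          # total reads covering the position in this cell
    genotype_quality: int    # Phred-scaled genotype quality

def passes(call, min_af=20.0, min_depth=10, min_gq=30):
    """True if the call meets all three thresholds (placeholder default values)."""
    return (call.allele_frequency >= min_af
            and call.read_depth >= min_depth
            and call.genotype_quality >= min_gq)

def mask_low_quality(matrix, **thresholds):
    """Return a copy of the matrix with failing calls replaced by None.

    matrix: {cell_id: {variant_id: Call}} (hypothetical layout).
    """
    return {
        cell: {variant: (call if passes(call, **thresholds) else None)
               for variant, call in row.items()}
        for cell, row in matrix.items()
    }
```

Keeping failed calls as `None` preserves the rectangular shape of the matrix, so a later dimensionality reduction step can decide how to treat missing values.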
[00145] In contrast to the results (e.g., FIGs.8A, 8B, 9A, and 9B) which used the first caller pipeline, the results of FIGs.10A and 10B were generated using the second caller pipeline, which did not utilize ground truth data (e.g., bulk sequencing information) (FIG.10A corresponds to FIGs.8A and 8B; FIG.10B corresponds to FIGs.9A and 9B). As shown in FIG.10A, the clustering fails to identify 5 distinct populations and, based on the percentage of cells in each cluster, the 0.1% cell population is missing. Thus, the second caller pipeline, which does not filter variants using ground truth data, exhibits a poorer limit of detection (worse than 0.1%). Additionally, since this method does not use bulk information, it cannot automatically label the cell populations as cell types or cell lines. Further characteristics of cellular subpopulations that were identified using the clustering analysis are described below in Table 3.
Table 3: Characteristics of cellular subpopulations identified using clustering analysis
Figure imgf000038_0001
Example 4: Clustering Analysis Using Pseudo Cells Allows Enhanced Detection of Minor Cellular Subpopulations
[00146] Sequencing data derived from single cells is processed using the standard Tapestri variant calling pipeline and analyzed. This results in a multi-sample VCF where each sample is a cell in the experiment. The method begins by filtering the VCF file to remove low quality data based on thresholds of allele frequency, read depth, genotyping quality, etc. These filters aid in removing a significant amount of noise caused by chemistry-based errors. The result is a matrix of cells vs variants with the values being either the genotypes or allele frequencies for each cell/variant combination. The method works with both genotypes and allele frequencies. This matrix is then reduced to 2 dimensions using UMAP, followed by HDBSCAN clustering. The observed clusters are annotated by comparing their genotype signature with known genotypes of a pseudo cell dataset.
[00147] This method is tested using, for example, the AML_V2 panel of 128 amplicons, created for mixtures of 4 known cell lines (for example: Raji, PC3, DU145, and SKMEL28) at 49/49/0.5/0.01% ratios, respectively. The method can successfully identify all pseudo cell line populations along with mixed cells that arise from the mixing of any 2 or more cell lines, including the 0.5% and 0.01% spike-in cell lines.
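The annotation step that closes the workflow (labeling each cluster by the pseudo cells it contains, treating clusters without a pseudo cell as mixed, and dropping the pseudo cells before downstream analysis) can be illustrated with a short Python sketch. It assumes cluster labels have already been produced upstream, for example by UMAP followed by HDBSCAN. The function name, the dictionary layout, and the joined label used when a cluster holds several pseudo cells are hypothetical, and the sketch does not address how noise points reported by the clustering method should be handled.

```python
from collections import defaultdict

def annotate_clusters(cluster_of, pseudo_cells, unlabeled="mixed"):
    """Annotate clusters by pseudo cell membership.

    cluster_of: {cell_id: cluster_label} for real cells and pseudo cells together.
    pseudo_cells: {pseudo_cell_id: known_identity}, e.g. a cell line name.
    Returns ({cluster_label: annotation}, {real_cell_id: annotation}).
    """
    identities = defaultdict(set)
    for pseudo_id, identity in pseudo_cells.items():
        identities[cluster_of[pseudo_id]].add(identity)

    cluster_annotation = {}
    for cluster in set(cluster_of.values()):
        found = identities.get(cluster)
        # A cluster devoid of pseudo cells is annotated as a mixed population.
        cluster_annotation[cluster] = "/".join(sorted(found)) if found else unlabeled

    # Pseudo cells are removed so they do not enter subsequent analysis.
    cell_annotation = {
        cell: cluster_annotation[cluster]
        for cell, cluster in cluster_of.items()
        if cell not in pseudo_cells
    }
    return cluster_annotation, cell_annotation
```

For instance, if a pseudo cell carrying the known HG001 genotype lands in cluster 0, every real cell in cluster 0 is reported as HG001, while a cluster containing no pseudo cell is reported as mixed.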

Claims

WHAT IS CLAIMED IS: 1. A method for identifying one or more subpopulations within a cell population, the method comprising: (a) obtaining a first dataset comprising cell sequencing data from single cells in a cell population; (b) filtering the cell sequencing data of the first dataset against at least a ground truth dataset derived from a bulk sequencing analysis; (c) incorporating one or more pseudo cells comprising one or more known variants; (d) generating one or more clusters of cells comprising the pseudo cells; and (e) annotating the one or more clusters of cells based on one or more pseudo cells in the clusters.
2. The method of claim 1, wherein filtering the cell sequencing data of the first dataset against at least a ground truth dataset comprises: (i) identifying a first set of variants within the first dataset and a second set of variants within the ground truth dataset; and (ii) generating a filtered set of variants by removing a subset of variants in the first set, wherein the subset of variants does not appear in the second set of variants.
3. The method of claim 2, wherein the one or more clusters comprising the pseudo cells are generated according to at least the filtered set of variants.
4. The method of any one of claims 1-3, wherein step (c) occurs before step (a).
5. The method of any one of claims 1-3, wherein step (c) occurs after step (b).
6. The method of any one of claims 1-5, further comprising removing the one or more pseudo cells from the one or more clusters prior to subsequent analysis.
7. The method of any one of claims 1-6, wherein the one or more known variants of the pseudo cells are determined through a bulk sequencing analysis.
8. The method of any one of claims 1-7, wherein annotating the clusters of cells based on presence of one or more pseudo cells comprises annotating a cluster comprising a pseudo cell as a cell population with one or more known variants of the pseudo cell.
9. The method of any one of claims 1-8, wherein annotating the clusters of cells based on presence of one or more pseudo cells comprises annotating a cluster that is devoid of a pseudo cell as a mixed cell population.
10. The method of any one of claims 1-9, further comprising performing a dimensionality reduction analysis prior to generating clusters of cells.
11. The method of claim 10, wherein the dimensionality reduction analysis is UMAP or PCA.
12. The method of any one of claims 1-11, wherein generating clusters of cells comprises implementing a HDBSCAN method.
13. The method of any one of claims 1-12, wherein the variants are filtered based on thresholds of allele frequency, read depth, and/or genotyping quality.
14. The method of any one of claims 1-13, wherein the variants are filtered based on thresholds of allele frequency.
15. The method of any one of claims 1-13, wherein the variants are filtered based on thresholds of read depth.
16. The method of any one of claims 1-13, wherein the variants are filtered based on thresholds of genotyping quality.
17. The method of any one of claims 1-16, wherein the one or more subpopulations are detected with a lower limit below a lower limit of detection where no pseudo cells are incorporated.
18. The method of any one of claims 1-17, wherein the one or more subpopulations are detected with a lower limit below 1%, below 0.5%, below 0.1%, or below 0.01% of the total population.
19. The method of any one of claims 1-18, wherein the one or more subpopulations are detected with a lower limit below 1% of the total population.
20. The method of any one of claims 1-18, wherein the one or more subpopulations are detected with a lower limit below 0.5% of the total population.
21. The method of any one of claims 1-18, wherein the one or more subpopulations are detected with a lower limit below 0.1% of the total population.
22. The method of any one of claims 1-18, wherein the one or more subpopulations are detected with a lower limit below 0.01% of the total population.
23. A method for identifying one or more subpopulations within a cell population, the method comprising: (a) obtaining a first dataset comprising cell sequencing data from single cells in a cell population; (b) filtering the cell sequencing data of the first dataset against at least a ground truth dataset derived from a bulk sequencing analysis; (c) identifying one or more subpopulations within the cell population through a combined unsupervised and supervised clustering of cells, wherein the supervised clustering involves one or more pseudo cells comprising one or more known variants.
24. The method of claim 23, wherein filtering the cell sequencing data of the first dataset against at least a ground truth dataset comprises: (i) identifying a first set of variants within the first dataset and a second set of variants within the ground truth dataset; and (ii) generating a filtered set of variants by removing a subset of variants in the first set, wherein the subset of variants does not appear in the second set of variants.
25. The method of claim 23 or 24, further comprising removing the one or more pseudo cells from the one or more clusters prior to subsequent analysis.
26. The method of any one of claims 23-25, wherein the one or more known variants of the pseudo cells are determined through a bulk sequencing analysis.
27. The method of any one of claims 23-26, further comprising performing a dimensionality reduction analysis prior to generating clusters of cells.
28. The method of claim 27, wherein the dimensionality reduction analysis is UMAP or PCA.
29. The method of any one of claims 23-28, wherein the combined unsupervised and supervised clustering of cells comprises implementing a HDBSCAN method.
30. The method of any one of claims 23-29, wherein the variants are filtered based on thresholds of allele frequency, read depth, and/or genotyping quality.
31. The method of any one of claims 23-30, wherein the variants are filtered based on thresholds of allele frequency.
32. The method of any one of claims 23-30, wherein the variants are filtered based on thresholds of read depth.
33. The method of any one of claims 23-30, wherein the variants are filtered based on thresholds of genotyping quality.
34. The method of any one of claims 23-33, wherein the one or more subpopulations are detected with a lower limit below a lower limit of detection where no pseudo cells are incorporated.
35. The method of any one of claims 23-34, wherein the one or more subpopulations are detected with a lower limit below 1%, below 0.5%, below 0.1%, or below 0.01% of the total population.
36. The method of any one of claims 23-35, wherein the one or more subpopulations are detected with a lower limit below 1% of the total population.
37. The method of any one of claims 23-35, wherein the one or more subpopulations are detected with a lower limit below 0.5% of the total population.
38. The method of any one of claims 23-35, wherein the one or more subpopulations are detected with a lower limit below 0.1% of the total population.
39. The method of any one of claims 23-35, wherein the one or more subpopulations are detected with a lower limit below 0.01% of the total population.
40. A non-transitory computer readable medium for identifying one or more subpopulations within a cell population, the non-transitory computer readable medium comprising instructions that, when executed by a processor, cause the processor to: (a) filter the cell sequencing data of a first dataset against at least a ground truth dataset derived from a bulk sequencing analysis; (b) incorporate one or more pseudo cells comprising one or more known variants; (c) generate one or more clusters of cells comprising the pseudo cells according to at least the filtered set of variants; and (d) annotate the one or more clusters of cells based on one or more pseudo cells in the clusters, wherein the first dataset comprises cell sequencing data obtained from single cells in a cell population.
41. The non-transitory computer readable medium of claim 40, wherein filtering the cell sequencing data of the first dataset against at least a ground truth dataset comprises: (i) identifying a first set of variants within the first dataset and a second set of variants within the ground truth dataset; and (ii) generating a filtered set of variants by removing a subset of variants in the first set, wherein the subset of variants does not appear in the second set of variants.
42. The non-transitory computer readable medium of claim 40 or 41, wherein the instructions further cause the processor to remove the one or more pseudo cells from the one or more clusters prior to subsequent analysis.
43. The non-transitory computer readable medium of claim 40 or 42, wherein the one or more known variants of the pseudo cells is determined through a bulk sequencing analysis.
44. The non-transitory computer readable medium of any one of claims 40-43, wherein annotating the clusters of cells based on presence of one or more pseudo cells comprises annotating a cluster comprising a pseudo cell as a cell population with one or more known variants of the pseudo cell.
45. The non-transitory computer readable medium of any one of claims 40-44, wherein annotating the clusters of cells based on presence of one or more pseudo cells comprises annotating a cluster that is devoid of a pseudo cell as a mixed cell population.
46. The non-transitory computer readable medium of any one of claims 40-45, wherein the instructions further cause the processor to perform a dimensionality reduction analysis prior to generating clusters of cells.
47. The non-transitory computer readable medium of claim 46, wherein the dimensionality reduction analysis is UMAP or PCA.
48. The non-transitory computer readable medium of any one of claims 40-47, wherein generating clusters of cells comprises implementing a HDBSCAN method.
49. The non-transitory computer readable medium of any one of claims 40-48, wherein the variants are filtered based on thresholds of allele frequency, read depth, and/or genotyping quality.
50. The non-transitory computer readable medium of any one of claims 40-49, wherein the variants are filtered based on thresholds of allele frequency.
51. The non-transitory computer readable medium of any one of claims 40-49, wherein the variants are filtered based on thresholds of read depth.
52. The non-transitory computer readable medium of any one of claims 40-49, wherein the variants are filtered based on thresholds of genotyping quality.
53. The non-transitory computer readable medium of any one of claims 40-52, wherein the one or more subpopulations are detected with a lower limit below a lower limit of detection where no pseudo cells are incorporated.
54. The non-transitory computer readable medium of any one of claims 40-53, wherein the one or more subpopulations are detected with a lower limit below 1%, below 0.5%, below 0.1%, or below 0.01% of the total population.
55. The non-transitory computer readable medium of any one of claims 40-54, wherein the one or more subpopulations are detected with a lower limit below 1% of the total population.
56. The non-transitory computer readable medium of any one of claims 40-54, wherein the one or more subpopulations are detected with a lower limit below 0.5% of the total population.
57. The non-transitory computer readable medium of any one of claims 40-54, wherein the one or more subpopulations are detected with a lower limit below 0.1% of the total population.
58. The non-transitory computer readable medium of any one of claims 40-54, wherein the one or more subpopulations are detected with a lower limit below 0.01% of the total population.
59. A non-transitory computer readable medium for identifying one or more subpopulations within a cell population, the non-transitory computer readable medium comprising instructions that, when executed by a processor, cause the processor to: (a) filter cell sequencing data of a first dataset against at least a ground truth dataset derived from a bulk sequencing analysis; and (b) identify one or more subpopulations within a cell population through a combined unsupervised and supervised clustering of cells, wherein the supervised clustering involves one or more pseudo cells comprising one or more known variants, wherein the first dataset comprises cell sequencing data obtained from single cells in the cell population.
60. The non-transitory computer readable medium of claim 59, wherein filtering the cell sequencing data of the first dataset against at least a ground truth dataset comprises: (i) identifying a first set of variants within the first dataset and a second set of variants within the ground truth dataset; and (ii) generating a filtered set of variants by removing a subset of variants in the first set, wherein the subset of variants does not appear in the second set of variants.
61. The non-transitory computer readable medium of claim 59 or 60, wherein the instructions further cause the processor to remove the one or more pseudo cells from the one or more clusters prior to subsequent analysis.
62. The non-transitory computer readable medium of any one of claims 59-61, wherein the one or more known variants of the pseudo cells is determined through a bulk sequencing analysis.
63. The non-transitory computer readable medium of any one of claims 59-62, wherein the instructions further cause the processor to perform a dimensionality reduction analysis prior to generating clusters of cells.
64. The non-transitory computer readable medium of claim 63, wherein the dimensionality reduction analysis is UMAP or PCA.
65. The non-transitory computer readable medium of any one of claims 59-64, wherein the combined unsupervised and supervised clustering of cells comprises implementing a HDBSCAN method.
66. The non-transitory computer readable medium of any one of claims 59-65, wherein the variants are filtered based on thresholds of allele frequency, read depth, and/or genotyping quality.
67. The non-transitory computer readable medium of any one of claims 59-66, wherein the variants are filtered based on thresholds of allele frequency.
68. The non-transitory computer readable medium of any one of claims 59-66, wherein the variants are filtered based on thresholds of read depth.
69. The non-transitory computer readable medium of any one of claims 59-66, wherein the variants are filtered based on thresholds of genotyping quality.
70. The non-transitory computer readable medium of any one of claims 59-69, wherein the one or more subpopulations are detected with a lower limit below a lower limit of detection where no pseudo cells are incorporated.
71. The non-transitory computer readable medium of any one of claims 59-70, wherein the one or more subpopulations are detected with a lower limit below 1%, below 0.5%, below 0.1%, or below 0.01% of the total population.
72. The non-transitory computer readable medium of any one of claims 59-71, wherein the one or more subpopulations are detected with a lower limit below 1% of the total population.
73. The non-transitory computer readable medium of any one of claims 59-71, wherein the one or more subpopulations are detected with a lower limit below 0.5% of the total population.
74. The non-transitory computer readable medium of any one of claims 59-71, wherein the one or more subpopulations are detected with a lower limit below 0.1% of the total population.
75. The non-transitory computer readable medium of any one of claims 59-71, wherein the one or more subpopulations are detected with a lower limit below 0.01% of the total population.
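The steps recited in claims 40, 41, 42, 44 and 45 (filtering against a ground truth, incorporating pseudo cells, clustering, annotating, and removing the pseudo cells) can be illustrated with a minimal sketch. It is not the claimed implementation: all function names, cell identifiers and variant names below are hypothetical, genotypes are reduced to presence/absence of each variant, and a simple single-linkage grouping on Hamming distance is swapped in for the HDBSCAN method named in claims 48 and 65. No dimensionality reduction or threshold-based filtering is performed.

```python
from collections import defaultdict


def filter_variants(first_set, ground_truth_set):
    """Claim 41 (ii): drop variants of the first set that are absent from the ground truth."""
    return sorted(set(first_set) & set(ground_truth_set))


def genotype_vector(cell_variants, kept_variants):
    """Presence/absence vector of one cell over the filtered variants (a simplification)."""
    return tuple(int(v in cell_variants) for v in kept_variants)


def cluster(vectors, max_dist=0):
    """Single-linkage grouping by Hamming distance; a stand-in for HDBSCAN."""
    ids = list(vectors)
    parent = {i: i for i in ids}

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for pos, a in enumerate(ids):
        for b in ids[pos + 1:]:
            dist = sum(x != y for x, y in zip(vectors[a], vectors[b]))
            if dist <= max_dist:
                parent[find(a)] = find(b)

    groups = defaultdict(list)
    for i in ids:
        groups[find(i)].append(i)
    return list(groups.values())


def identify_subpopulations(cells, ground_truth, pseudo_cells, max_dist=0):
    """Return (filtered variants, list of (annotation, real cell ids)) for each cluster."""
    observed = set().union(*cells.values())
    kept = filter_variants(observed, ground_truth)

    # Incorporate the pseudo cells alongside the real cells before clustering.
    vectors = {cid: genotype_vector(v, kept) for cid, v in cells.items()}
    vectors.update({pid: genotype_vector(v, kept) for pid, v in pseudo_cells.items()})

    annotated = []
    for members in cluster(vectors, max_dist):
        pseudo = sorted(m for m in members if m in pseudo_cells)
        # A cluster with a pseudo cell takes its label; one without is "mixed".
        label = "+".join(pseudo) if pseudo else "mixed"
        # Pseudo cells are removed before any subsequent analysis.
        real = sorted(m for m in members if m not in pseudo_cells)
        if real:
            annotated.append((label, real))
    return kept, annotated


# Hypothetical toy data: two variants confirmed by bulk sequencing, two noise calls.
cells = {
    "c1": {"varA", "noise1"},
    "c2": {"varA"},
    "c3": {"varB"},
    "c4": {"varA", "varB"},
    "c5": {"noise2"},
}
ground_truth = {"varA", "varB"}
pseudo_cells = {"pseudo_A": {"varA"}, "pseudo_B": {"varB"}}

kept, annotated = identify_subpopulations(cells, ground_truth, pseudo_cells)
for label, members in sorted(annotated):
    print(label, members)
```

In this toy input, the noise calls are discarded by the ground-truth filter, so c1 and c2 share a genotype with pseudo_A and c3 with pseudo_B, while c4 and c5 fall into clusters that contain no pseudo cell and are therefore labelled as mixed.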
PCT/US2021/060186 2020-11-19 2021-11-19 Cellular clustering analysis in sequencing datasets WO2022109330A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202063115713P 2020-11-19 2020-11-19
US63/115,713 2020-11-19

Publications (1)

Publication Number Publication Date
WO2022109330A1 (en) 2022-05-27

Family

ID=81708117

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2021/060186 WO2022109330A1 (en) 2020-11-19 2021-11-19 Cellular clustering analysis in sequencing datasets

Country Status (1)

Country Link
WO (1) WO2022109330A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130294675A1 (en) * 2012-05-03 2013-11-07 General Electric Company Automatic segmentation and characterization of cellular motion
US20200065675A1 (en) * 2017-10-16 2020-02-27 Illumina, Inc. Deep Convolutional Neural Networks for Variant Classification

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHEN HUIDONG, LAREAU CALEB, ANDREANI TOMMASO, VINYARD MICHAEL E., GARCIA SARA P., CLEMENT KENDELL, ANDRADE-NAVARRO MIGUEL A., BUEN: "Assessment of computational methods for the analysis of single-cell ATAC-seq data", GENOME BIOLOGY, vol. 20, no. 1, 1 December 2019 (2019-12-01), XP055940984, DOI: 10.1186/s13059-019-1854-5 *
CHEN LIANG, ZHAI YUYAO, HE QIUYAN, WANG WEINAN, DENG MINGHUA: "Integrating Deep Supervised, Self-Supervised and Unsupervised Learning for Single-Cell RNA-seq Clustering and Annotation", GENES, vol. 11, no. 7, 14 July 2020 (2020-07-14), pages 792, XP055940983, DOI: 10.3390/genes11070792 *

Similar Documents

Publication Publication Date Title
AU2019250200B2 (en) Error Suppression In Sequenced DNA Fragments Using Redundant Reads With Unique Molecular Indices (UMIs)
AU2023219911A1 (en) Using cell-free DNA fragment size to detect tumor-associated variant
US20230343416A1 (en) Methods and systems for sequence and variant calling
CN112955958A (en) Sequence diagram-based tool for determining changes in short tandem repeat regions
Cheng et al. Methods to improve the accuracy of next-generation sequencing
US20220351804A1 (en) Improved Variant Caller Using Single-Cell Analysis
Pavlovic et al. Next-generation sequencing: The enabler and the way ahead
WO2022109330A1 (en) Cellular clustering analysis in sequencing datasets
WO2020252387A2 (en) Methods for accurate base calling using molecular barcodes
JP2021502072A (en) Correction of sequence errors induced in deamination
US20220068433A1 (en) Computational detection of copy number variation at a locus in the absence of direct measurement of the locus
US20230420080A1 (en) Split-read alignment by intelligently identifying and scoring candidate split groups
RU2766198C9 (en) Methods and systems for obtaining sets of unique molecular indices with heterogeneous length of molecules and correcting errors therein
RU2766198C2 (en) Methods and systems for obtaining sets of unique molecular indices with heterogeneous length of molecules and correcting errors therein
Barbaro Overview of NGS platforms and technological advancements for forensic applications
WO2024025831A1 (en) Sample contamination detection of contaminated fragments with cpg-snp contamination markers
Bajaj et al. MICROBIAL GENOMICS

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21895711

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21895711

Country of ref document: EP

Kind code of ref document: A1