WO2023196928A2 - True variant identification via multianalyte and multisample correlation - Google Patents

True variant identification via multianalyte and multisample correlation

Info

Publication number
WO2023196928A2
Authority
WO
WIPO (PCT)
Prior art keywords
variant
variants
cells
candidate
true
Prior art date
Application number
PCT/US2023/065473
Other languages
French (fr)
Other versions
WO2023196928A3 (en
Inventor
Adam SCIAMBI
Charles Joseph Murphy
Original Assignee
Mission Bio, Inc.
Priority date
Filing date
Publication date
Application filed by Mission Bio, Inc. filed Critical Mission Bio, Inc.
Publication of WO2023196928A2 publication Critical patent/WO2023196928A2/en
Publication of WO2023196928A3 publication Critical patent/WO2023196928A3/en

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Definitions

  • Detecting rare variants is valuable for identifying rare cells in samples that contain signatures of measurable residual disease (MRD). Detecting rare variants from single-cell sequencing data is challenging because these somatic variants are often rare and below the variant-allele frequency (VAF) background. In addition, there are many thousands of false positive variants that occur at low frequencies (typically less than 1%). Thus, distinguishing false positive variants from true rare variants is difficult. These false positives have several causes, an example of which is polymerase chain reaction (PCR) and/or next generation sequencing (NGS) base-calling errors (1-100% VAF for some loci) that arise near the ends of amplicons, near repeat regions, and during amplification of DNA from single cells. Thus, there is a need for improved methods for identifying true variants in cells that are present at low frequencies.
  • PCR: polymerase chain reaction
  • NGS: next generation sequencing
  • Described herein are embodiments for improved true variant identification via multianalyte and multisample correlation through a two-step process involving 1) detecting one or more of multianalyte and multisample correlations associated with a candidate variant and 2) performing a variant calling process to determine whether the candidate variant is a true variant or not based on the detected one or more of multianalyte and multisample correlations.
  • Embodiments described herein offer some benefits or advantages over other existing variant calling processes due to the improved variant calling performance.
  • the embodiments disclosed herein offer features and benefits including: 1) Yielding a threshold that is dynamically determined for each variant. Leveraging the background error rate enables a lower limit of detection for some variants while applying a higher threshold to variants that are more error prone, thereby reducing false positives. 2) Leveraging control samples to estimate the background error rate for individual variants. These control samples thus provide utility beyond what a single sample can provide on its own. 3) Using the Beta-Binomial distribution to statistically model the per-variant background error rate.
  • The Beta-Binomial distribution offers several distinct advantages: a) it allows the calculation of a p-value for each variant, which reflects the probability that the variant is an error; b) it is a count-based distribution, so it dynamically adjusts to the total number of cells in a sample, which is important because the number of cells in which a false positive variant is observed roughly scales with the total number of cells in the sample. Furthermore, unlike the Binomial distribution (also count-based), the Beta-Binomial distribution allows for over-dispersion, which makes it more flexible in modeling different error distribution shapes.
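The per-variant p-value described above can be illustrated with a minimal Python sketch. It assumes the p-value is the upper-tail probability of observing the variant in at least k of n cells under a Beta-Binomial background model; the function names and the alpha/beta parameterization are illustrative assumptions, not taken from the source.

```python
from math import exp, lgamma


def log_beta(a: float, b: float) -> float:
    """Natural log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def betabinom_pmf(k: int, n: int, alpha: float, beta: float) -> float:
    """P(X = k) for X ~ BetaBinomial(n, alpha, beta), computed in log space."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta))


def betabinom_pvalue(k: int, n: int, alpha: float, beta: float) -> float:
    """Upper-tail probability P(X >= k): chance that background error alone
    puts the variant in at least k of n cells (assumed p-value definition)."""
    if k <= 0:
        return 1.0
    tail = sum(betabinom_pmf(i, n, alpha, beta) for i in range(k, n + 1))
    return min(1.0, tail)
```

Because the test is count-based, the same alpha and beta can be reused for samples with different total cell counts n.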
  • the embodiments disclosed herein offer additional features and benefits including: 1) Leveraging the single-cell nature of single-cell sequencing data by looking for variants that co-occur in the same cells in a statistically significant way. The error rate of co-occurring variants is much lower than that of individual variants, so the theoretical limit of detection for co-occurring variants is lower than that of individual variants. 2) Using a binomial test to assess the statistical significance of co-occurring variants.
  • a method for identifying a subpopulation of cells from a heterogeneous population of cells comprising: obtaining a set of candidate variants determined through a single-cell analysis workflow; for a variant pair included in the set of candidate variants, determining a quantity of co-occurrence cells where both variants in the variant pair co-occur in each of the co-occurrence cells; determining a set of variant pairs based on quantities of co-occurrence cells determined for a plurality of variant pairs included in the set of candidate variants; and identifying a subset of candidate variants as true variants based on the determined set of variant pairs.
  • before determining the quantity of co-occurrence cells, the method further comprises applying a first set of variant filters to identify a set of variants from the set of candidate variants.
  • both variants included in a variant pair for determining the quantity of co-occurrence cells are from the set of variants identified after applying the first set of variant filters.
  • applying the first set of variant filters comprises applying a depth of coverage threshold regarding a depth of coverage of a variant in a cell. In various embodiments, the depth of coverage threshold is at least 6, 8, 10, 12,
  • applying the first set of variant filters comprises applying a genotype quality threshold regarding a genotype quality of a variant.
  • the genotype quality threshold is at least 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, or 80.
  • applying the first set of variant filters comprises applying a cell number threshold regarding a number of cells where a variant is present. In various embodiments, the cell number threshold is at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
  • applying the first set of variant filters comprises applying a variant allele frequency threshold regarding variant allele frequency of a variant.
  • the variant allele frequency threshold is at least 30, 35, 40, 45, 50, 55, or 60.
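A minimal Python sketch of the first set of variant filters follows. The `CellCall` record, its field names, and the default cutoffs (depth 10, genotype quality 30, VAF 30, 3 cells) are hypothetical choices for illustration, picked from the ranges listed above; VAF is assumed to be expressed in percent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellCall:
    """One variant observation in one cell (hypothetical record layout)."""
    cell_id: str
    variant: str           # e.g. "chr1:12345:A>T"
    depth: int             # reads covering the position in this cell
    genotype_quality: int
    vaf: float             # variant allele frequency, assumed in percent


def apply_first_filters(calls, min_depth=10, min_gq=30, min_vaf=30.0, min_cells=3):
    """Return {variant: set of cell ids} for variants passing all four filters."""
    cells_by_variant = {}
    for call in calls:
        passes_cell_level = (
            call.depth >= min_depth
            and call.genotype_quality >= min_gq
            and call.vaf >= min_vaf
        )
        if passes_cell_level:
            cells_by_variant.setdefault(call.variant, set()).add(call.cell_id)
    # Cell number threshold: keep variants seen in enough passing cells.
    return {v: cells for v, cells in cells_by_variant.items() if len(cells) >= min_cells}
```

The returned per-variant cell sets are the natural input for counting co-occurrence cells for a variant pair.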
  • determining the quantity of co-occurrence cells comprises determining at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20, 25, 30, 35, 40, 45, or 50 cells for a variant pair. In various embodiments, determining the quantity of co-occurrence cells comprises determining co-occurrence cells based on a statistical significance of co-occurrence of both variants of the variant pair in a same cell. In various embodiments, determining co-occurrence cells based on the statistical significance comprises determining co-occurrence cells based on a one-sided Binomial test.
  • determining co-occurrence cells based on the one-sided Binomial test comprises determining co-occurrence cells having a statistical significance of p-value of less than 0.03, 0.02, 0.01, 0.009, 0.008, 0.007, 0.006, 0.005, 0.004, 0.003, 0.002, or 0.001 based on the one-sided Binomial test.
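The one-sided Binomial test for a variant pair can be sketched in Python as below. The source does not state the null hypothesis, so this sketch assumes independence: the expected per-cell co-occurrence probability is taken as the product of the two variants' cell fractions. All function names are illustrative.

```python
from math import exp, lgamma, log


def binom_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), summing the pmf in log space."""
    if k <= 0:
        return 1.0
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (
            lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
            + i * log(p) + (n - i) * log(1.0 - p)
        )
        total += exp(log_pmf)
    return min(1.0, total)


def cooccurrence_pvalue(cells_a: set, cells_b: set, n_cells: int):
    """Count cells carrying both variants and test the count against the
    co-occurrence expected if the two variants were independent (assumption)."""
    both = len(cells_a & cells_b)
    p_null = (len(cells_a) / n_cells) * (len(cells_b) / n_cells)
    return both, binom_upper_tail(both, n_cells, p_null)
```

A pair would then be retained when the returned p-value falls below the chosen cutoff, for example 0.01.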
  • the method further comprises applying a second set of variant filters to identify a subset of variant pairs.
  • identifying the subset of candidate variants as true variants comprises identifying the subset of candidate variants based on the identified subset of variant pairs.
  • applying the second set of variant filters comprises applying an average variant allele frequency threshold regarding an average variant allele frequency for two variants of a variant pair.
  • the average variant allele frequency threshold is at least 25, 30, 35, 40, 45, 50, 55, or 60.
  • applying the second set of variant filters comprises applying a genomic distance threshold regarding a genomic distance between two variants of a variant pair.
  • the genomic distance threshold is at least 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, or 200 base pairs apart from each other.
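The second set of variant filters on a variant pair might look like the following Python sketch. The tuple layout `(chromosome, position)`, the default cutoffs (average VAF 25, 50 base pairs), and the rule that variants on different chromosomes always satisfy the distance filter are assumptions made for illustration.

```python
def pair_passes_second_filters(
    variant_a,
    variant_b,
    mean_vaf_a: float,
    mean_vaf_b: float,
    min_avg_vaf: float = 25.0,
    min_distance_bp: int = 50,
) -> bool:
    """variant_a / variant_b are (chromosome, position) tuples; VAFs in percent."""
    # Average variant allele frequency across the two variants of the pair.
    if (mean_vaf_a + mean_vaf_b) / 2.0 < min_avg_vaf:
        return False
    chrom_a, pos_a = variant_a
    chrom_b, pos_b = variant_b
    if chrom_a != chrom_b:
        # Assumption: distance is undefined across chromosomes, so the pair passes.
        return True
    # Require the two variants to be far enough apart on the same chromosome.
    return abs(pos_a - pos_b) >= min_distance_bp
```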
  • the method further comprises identifying a subpopulation of cells from the heterogeneous population of cells, the subpopulation of cells comprising one or more of the true variants.
  • the subpopulation of cells represents less than 1% of cells in the heterogeneous population of cells.
  • a sensitivity for identifying a variant as a true variant is at least 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, or 0.3.
  • a specificity for identifying a variant as a false positive variant is at most 0.998, 0.997, 0.996, 0.995, 0.994, 0.993, 0.992, or 0.991.
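For reference, sensitivity and specificity over a set of candidate variants can be computed as in this short Python sketch. It assumes a known truth set is available (for example from a spike-in experiment); the function and argument names are illustrative.

```python
def sensitivity_specificity(called: set, truth: set, candidates: set):
    """Return (sensitivity, specificity) for a set of variant calls.

    called:     variants the method reported as true variants
    truth:      variants known to be real (assumed available)
    candidates: all candidate variants that were evaluated
    """
    true_pos = len(called & truth)
    false_neg = len(truth - called)
    negatives = candidates - truth
    true_neg = len(negatives - called)
    false_pos = len(negatives & called)
    sensitivity = true_pos / (true_pos + false_neg) if truth else float("nan")
    specificity = true_neg / (true_neg + false_pos) if negatives else float("nan")
    return sensitivity, specificity
```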
  • the heterogeneous population of cells are from measurable residual disease (MRD) samples.
  • the method further comprises, for each variant included in the subset of candidate variants, generating a statistical significance by using a Beta-Binomial distribution with parameters generated from a set of control samples; determining a sub-subset of candidate variants based on the generated statistical significance for each variant included in the subset of candidate variants; for each variant of the sub-subset of candidate variants, applying a third set of variant filters; and identifying a sub-sub-subset of candidate variants as true variants based on the application of the third set of variant filters.
  • Disclosed herein is another method for identifying a subpopulation of cells from a heterogeneous population of cells, the method comprising: obtaining a set of candidate variants determined through a single-cell analysis workflow; for a variant included in the set of candidate variants, generating a statistical significance by using a Beta-Binomial distribution with parameters generated from a set of control samples; determining a set of variants based on statistical significances generated for a plurality of variants included in the set of candidate variants; and identifying a subset of candidate variants as true variants based on generated statistical significances for the plurality of variants.
  • the parameters for the Beta-Binomial distribution are generated by: acquiring a plurality of control samples; removing germline variants from each control sample; for a control sample, applying a first set of variant filters to identify a set of background variants; for a background variant, determining a quantity of cells containing the background variant in each sample; and generating parameters for the Beta-Binomial distribution for the background variant based on the determined quantity of cells containing the background variant in each sample.
  • before generating a statistical significance by using a Beta-Binomial distribution, the method further comprises applying a second set of variant filters to identify a set of variants from the set of candidate variants.
  • the variant for generating a statistical significance is a variant included in the identified set of variants.
  • generating a statistical significance by using a Beta-Binomial distribution comprises generating the statistical significance using the parameters of the Beta-Binomial distribution when a variant included in the set of candidate variants is a background variant.
  • generating a statistical significance by using a Beta-Binomial distribution comprises generating the statistical significance using averaging parameters generated from a plurality of background variants when a variant included in the set of candidate variants is not a background variant.
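The choice between variant-specific and averaged parameters can be sketched in Python as follows. The dictionary layout and the use of a simple arithmetic mean for the "averaging parameters" are assumptions; the source does not specify how the average is formed.

```python
def parameters_for_variant(variant: str, background_params: dict):
    """Pick Beta-Binomial (alpha, beta) for a candidate variant.

    background_params maps each background variant seen in the control
    samples to its fitted (alpha, beta) pair (hypothetical layout).
    """
    if variant in background_params:
        # The candidate is a known background variant: use its own parameters.
        return background_params[variant]
    # Otherwise fall back to the average over all background variants
    # (arithmetic mean is an assumption of this sketch).
    alphas = [alpha for alpha, _ in background_params.values()]
    betas = [beta for _, beta in background_params.values()]
    return sum(alphas) / len(alphas), sum(betas) / len(betas)
```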
  • the parameters for the Beta-Binomial distribution for the background variant are generated by using two vectors determined based on the determined quantity of cells containing the background variant in each sample.
  • a first vector of the two vectors comprises a quantity of cells that have the background variant in each control sample.
  • a second vector of the two vectors comprises a quantity of total cells in each control sample.
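One way to turn the two vectors into Beta-Binomial parameters is moment matching on the per-sample cell fractions, sketched below in Python. The source does not state the fitting procedure, so this estimator is an assumption chosen for brevity; a maximum-likelihood fit would be a reasonable alternative.

```python
from statistics import mean, variance


def fit_beta_binomial(variant_cells, total_cells):
    """Estimate (alpha, beta) for one background variant.

    variant_cells[i]: cells carrying the variant in control sample i
    total_cells[i]:   total cells in control sample i
    Uses moment matching on the fractions (assumed method, not from the source).
    """
    if len(variant_cells) != len(total_cells) or len(variant_cells) < 2:
        raise ValueError("need two equal-length vectors with at least 2 samples")
    fractions = [k / n for k, n in zip(variant_cells, total_cells)]
    m = mean(fractions)
    v = variance(fractions)
    if not 0.0 < m < 1.0 or v <= 0.0 or v >= m * (1.0 - m):
        raise ValueError("moment matching is undefined for these counts")
    # Beta moments: mean = a / (a + b), var = m(1 - m) / (a + b + 1).
    concentration = m * (1.0 - m) / v - 1.0
    return m * concentration, (1.0 - m) * concentration
```

The fitted mean alpha / (alpha + beta) equals the average background fraction across control samples, and a larger spread between samples yields smaller alpha and beta, which widens the modeled error distribution.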
  • applying the first set of variant filters comprises applying a depth of coverage threshold regarding a depth of coverage associated with a variant.
  • the depth of coverage threshold is at least 6, 8, 10, 12, 14, 16, 18, or 20 reads for a variant.
  • applying the first set of variant filters comprises applying a genotype quality threshold regarding a genotype quality associated with a variant.
  • the genotype quality threshold is at least 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, or 80.
  • the second set of variant filters comprises the depth of coverage threshold.
  • the second set of variant filters comprise the genotype quality threshold.
  • the second set of variant filters comprise a genotyped cell percentage threshold regarding a percentage of cells that are genotyped for a genomic position of a variant.
  • the genotyped cell percentage threshold is at least 10%, 20%, 30%, 40%, 50%, or 60%.
  • identifying a subset of candidate variants as true variants based on generated statistical significances comprises identifying one or more variants that have a p-value smaller than a p-value threshold.
  • the p-value threshold is at most 0.0005, 0.0004, 0.0003, 0.0002, 0.0001, 0.00009, 0.00008, 0.00007, 0.00006, or 0.00005.
  • the method further comprises applying a third set of variant filters.
  • applying the third set of variant filters comprises applying a cell quantity threshold regarding a quantity of cells in which a variant is present.
  • the cell quantity threshold is at least 3, 4, 5, 6, 7, 8, 9, or 10 cells.
  • applying the third set of variant filters comprises applying an average variant allele frequency threshold regarding a variant allele frequency average for a variant.
  • the average variant allele frequency threshold is at least 20, 25, 30, 35, 40, 45, 50, 55, or 60.
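Combining the p-value threshold with the third set of variant filters can be sketched in Python as below. The per-variant dictionary keys and the default cutoffs (p < 0.0005, at least 3 cells, average VAF of at least 20) are illustrative values taken from the ranges above, not a prescribed configuration.

```python
def call_true_variants(stats: dict, max_p=0.0005, min_cells=3, min_avg_vaf=20.0):
    """Return the sorted names of variants that pass all final filters.

    stats maps variant name -> {"p_value", "n_cells", "avg_vaf"}
    (hypothetical layout; avg_vaf is assumed to be in percent).
    """
    kept = []
    for name, s in stats.items():
        if (
            s["p_value"] < max_p             # statistical significance
            and s["n_cells"] >= min_cells    # cell quantity threshold
            and s["avg_vaf"] >= min_avg_vaf  # average VAF threshold
        ):
            kept.append(name)
    return sorted(kept)
```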
  • variants remaining in each control sample after applying the first set of variant filters are false positive variants.
  • there are at least 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, or 200 control samples.
  • control samples are non-cancerous samples.
  • control samples are bone marrow samples from healthy subjects.
  • the method further comprises identifying a subpopulation of cells from the heterogeneous population of cells, the subpopulation of cells comprising one or more of the true variants.
  • the subpopulation of cells represents less than 1% of cells in the heterogeneous population of cells.
  • a sensitivity for identifying a variant as a true variant is at least 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, or 0.3.
  • a specificity for identifying a variant as a false positive variant is at most 0.998, 0.997, 0.996, 0.995, 0.994, 0.993, 0.992, or 0.991.
  • the heterogeneous population of cells are from MRD samples.
  • the method further comprises: for a variant pair included in the subset of candidate variants, determining a quantity of co-occurrence cells where both variants in the variant pair co-occur in each co-occurrence cell; determining a set of variant pairs based on quantities of co-occurrence cells determined for a plurality of variant pairs included in the subset of candidate variants; applying a fourth set of variant filters to identify a subset of variant pairs; and identifying a sub-subset of candidate variants as true variants based on the determined subset of variant pairs.
  • Disclosed herein is another method for identifying a subpopulation of cells from a heterogeneous population of cells, the method comprising: obtaining a set of candidate variants; determining allele frequencies of the candidate variants for each of one or more cells, wherein the allele frequencies are determined through single-cell DNA sequencing of the heterogeneous population of cells; correlating determined allele frequencies of the candidate variants to protein expression of a plurality of proteins; and selecting a subset of candidate variants as true variants, wherein the subset of candidate variants are selected based on correlation of the allele frequencies of the subset of candidate variants and protein expression of one or more proteins of the plurality of proteins.
  • correlating determined allele frequencies of the candidate variants to protein expression comprises generating a correlation matrix.
  • the threshold number of proteins is at least two, at least three, at least four, at least five, at least six, at least seven, at least eight, at least nine, at least ten, at least eleven, at least twelve, at least thirteen, at least fourteen, at least fifteen, at least sixteen, at least seventeen, at least eighteen, at least nineteen, or at least twenty proteins.
  • the method further comprises identifying a subpopulation of cells from the heterogeneous population of cells, the subpopulation of cells comprising one or more of the true variants.
  • the subpopulation of cells represents less than 1% of cells in the heterogeneous population of cells.
  • the heterogeneous population of cells represents a pooled sample comprising a plurality of cell samples.
  • the plurality of cell samples are distinguishable by germline variants that correlate with the true variants.
  • the correlation is based on a standard deviation of correlation values for the candidate variant across the plurality of proteins.
  • the standard deviation of correlation values further encompasses correlation values for the candidate variant across a plurality of DNA sites.
  • the candidate variant is correlated if the standard deviation of correlation values is greater than a threshold value. In various embodiments, the candidate variant is not correlated if the standard deviation of correlation values is less than a threshold value.
  • the threshold value is any of 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, or 0.20.
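The correlation-based selection can be illustrated with a small Python sketch: build a variant-by-protein correlation matrix from per-cell values, then keep variants whose correlation values vary across proteins by more than a threshold. Pearson correlation, the population standard deviation, and the 0.10 default are assumptions of this sketch; the source only specifies a correlation matrix and a standard-deviation threshold.

```python
from math import sqrt
from statistics import mean, pstdev


def pearson(x, y) -> float:
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def correlation_matrix(vaf_by_variant: dict, expr_by_protein: dict) -> dict:
    """matrix[variant][protein] = correlation of per-cell VAF with per-cell expression."""
    return {
        variant: {protein: pearson(vaf, expr) for protein, expr in expr_by_protein.items()}
        for variant, vaf in vaf_by_variant.items()
    }


def correlated_variants(matrix: dict, threshold: float = 0.10):
    """Variants whose correlation values vary across proteins by more than threshold."""
    return sorted(v for v, row in matrix.items() if pstdev(list(row.values())) > threshold)
```

A variant confined to one cell type tends to correlate positively with that type's markers and negatively with others, giving a wide spread of correlation values, whereas an artifact scattered across cells gives values near zero for every protein.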
  • non-transitory computer readable medium for calling one or more variants of a cell population
  • the non-transitory computer readable medium comprising instructions that, when executed by a processor, cause the processor to: obtain a set of candidate variants; for a variant pair included in the set of candidate variants, determine a quantity of co-occurrence cells where both variants in the variant pair co-occur in each of the co-occurrence cells; determine a set of variant pairs based on quantities of co-occurrence cells determined for a plurality of variant pairs included in the set of candidate variants; and identify a subset of candidate variants as true variants based on the determined set of variant pairs.
  • Non-transitory computer readable medium for calling one or more variants of a cell population
  • the non-transitory computer readable medium comprising instructions that, when executed by a processor, cause the processor to: obtain a set of candidate variants; for a variant included in the set of candidate variants, generate a statistical significance by using a Beta-Binomial distribution with parameters generated from a set of control samples; determine a set of variants based on statistical significances generated for a plurality of variants included in the set of candidate variants; and identify a subset of candidate variants as true variants based on generated statistical significances for the plurality of variants.
  • Non-transitory computer readable medium for calling one or more variants of a cell population
  • the non-transitory computer readable medium comprising instructions that, when executed by a processor, cause the processor to: obtain a set of candidate variants; determine allele frequencies of the candidate variants for each of one or more cells, wherein the allele frequencies are determined through single-cell DNA sequencing of the heterogeneous population of cells; correlate determined allele frequencies of the candidate variants to protein expression of a plurality of proteins; and select a subset of candidate variants as true variants, wherein the subset of candidate variants are selected based on correlation of the allele frequencies of the subset of candidate variants and protein expression of one or more proteins of the plurality of proteins.
  • a system comprising: a single-cell analysis workflow device configured to generate a plurality of sequence reads for cells in a cell population; a computational device communicatively coupled to the single-cell analysis workflow device, the computational device configured to: obtain a set of candidate variants; for a variant pair included in the set of candidate variants, determine a quantity of co-occurrence cells where both variants in the variant pair co-occur in each of the co-occurrence cells; determine a set of variant pairs based on quantities of co-occurrence cells determined for a plurality of variant pairs included in the set of candidate variants; and identify a subset of candidate variants as true variants based on the determined set of variant pairs.
  • a single-cell analysis workflow device configured to generate a plurality of sequence reads for cells in a cell population
  • a computational device communicatively coupled to the single-cell analysis workflow device, the computational device configured to: obtain a set of candidate variants; for a variant included in the set of candidate variants, generate a statistical significance by using a Beta-Binomial distribution with parameters generated from a set of control samples; determine a set of variants based on statistical significances generated for a plurality of variants included in the set of candidate variants; and identify a subset of candidate variants as true variants based on generated statistical significances for the plurality of variants
  • a single-cell analysis workflow device configured to generate a plurality of sequence reads for cells in a cell population
  • a computational device communicatively coupled to the single-cell analysis workflow device, the computational device configured to: obtain a set of candidate variants; determine allele frequencies of the candidate variants for each of one or more cells, wherein the allele frequencies are determined through single-cell DNA sequencing of the heterogeneous population of cells; correlate determined allele frequencies of the candidate variants to protein expression of a plurality of proteins; and select a subset of candidate variants as true variants, wherein the subset of candidate variants are selected based on correlation of the allele frequencies of the subset of candidate variants and protein expression of one or more proteins of the plurality of proteins.
  • FIG. 1A depicts an overall system environment including a cell analysis workflow device and a variant caller device for identifying variant calls, in accordance with an embodiment.
  • FIG. 1B is a block diagram of separate modules of a variant caller device, in accordance with an embodiment.
  • FIG. 2 is a flow diagram for identifying true variants and a subpopulation of cells containing true variants based on multianalyte and multisample correlation, in accordance with an embodiment.
  • FIG. 3A depicts the implementation of a variant correlation model, in accordance with an embodiment.
  • FIG. 3B depicts the implementation of a variant caller model, in accordance with an embodiment.
  • FIGS. 4A-4E are flow diagrams for identifying true variants based on various multianalyte and multisample correlations, in accordance with an embodiment.
  • FIG. 5 depicts a flow diagram for using a cell analysis workflow device for multi-omics analysis, in accordance with an embodiment.
  • FIG. 6 depicts an example computing device for implementing the system and methods described in reference to FIGS. 1-5.
  • FIG. 7 depicts example outcomes of the statistical significance test for the co-occurrence of three pairs of variants.
  • FIG. 8 depicts example detection limits for variants with or without a background error rate estimation.
  • FIGS. 9A and 9B depict example sensitivities and specificities for different disclosed methods when compared to a control method.
  • FIGS. 10A-10D depict example correlation matrices generated based on the correlations between variants and proteins detected under different application scenarios.

DETAILED DESCRIPTION

  • subject encompasses a cell, tissue, or organism, human or non-human, whether in vivo, ex vivo, or in vitro, male or female.
  • sample can include a single cell or multiple cells or fragments of cells or an aliquot of body fluid, such as a blood sample, taken from a subject, by means including venipuncture, excretion, ejaculation, massage, biopsy, needle aspirate, lavage sample, scraping, surgical incision, or intervention or other means known in the art.
  • sample also refers to data obtained from a sample.
  • samples include analyte information for one or more cells taken from a subject.
  • analyte refers to a component of a cell.
  • Cell analytes can be informative for characterizing a cell, such as for identifying one or more variants.
  • examples of an analyte include nucleic acid (e.g., RNA, DNA, cDNA), a protein, a peptide, an antibody, an antibody fragment, a polysaccharide, a sugar, a lipid, a small molecule, or combinations thereof.
  • a single-cell analysis involves analyzing DNA analytes.
  • a single-cell analysis involves analyzing two different analytes such as DNA and protein analytes.
  • variant refers to a specific combination of chromosome, position, and base change in the DNA of a subject.
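As a small illustration of this definition, a variant can be represented as an immutable record keyed by chromosome, position, and base change. The class and field names below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """A variant as a combination of chromosome, position, and base change."""
    chromosome: str
    position: int
    ref: str  # reference base(s)
    alt: str  # observed base(s)

    def key(self) -> str:
        """Compact identifier such as 'chr2:25457242:C>T'."""
        return f"{self.chromosome}:{self.position}:{self.ref}>{self.alt}"
```

Freezing the dataclass makes instances hashable, so they can be used as dictionary keys or set members when counting the cells in which each variant is observed.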
  • a variant is detected through certain sequencing technology, during which certain errors can be introduced through the process.
  • a variant identified through a detection process can be a true variant or a false positive variant.
  • a true variant is a detected variant that reflects a real variant in a subject (i.e., a real change in DNA), while a false positive variant is a detected variant that may not reflect a real variant in the subject.
  • a false positive variant can arise from processing errors, such as PCR or sequencing errors.
  • a real variant can be a germline variant or a somatic variant.
  • a true variant specifically refers to a detected rare somatic variant that occurs in a subject at a very low (e.g., <1%) percentage.
  • correlation refers to any statistical relationship, whether causal or not, between two random variables or bivariate data. For example, two variables can be correlated if a first variable is generally elevated when the second variable is also elevated. As another example, two variables can be correlated if a first variable is decreased when the second variable is also decreased. As used herein, “correlation” also encompasses anti-correlative relationships, e.g., a first variable is elevated when the second variable is decreased, and vice versa.
  • variant caller model refers to a prediction model or a machine-learned model that is implemented to call variants of a cell population.
  • the variant caller model analyzes cell population features derived from sequence reads or protein expression profiles across a cell population or individual cell features from sequence reads and protein expression profiles from individual cells.
  • the variant caller model receives the cell population features or individual cell features as input and predicts a classification for a candidate variant.
  • the variant caller model extracts cell population features or individual cell features from sequence reads and protein expression profiles and predicts a classification for a candidate variant based on the extracted cell population features or individual cell features.
  • the classification for a candidate variant is based on the sequencing data for the candidate variants and analyte information related to the candidate variant.
  • For example, multianalyte and/or multisample correlations with other analytes and/or samples are identified for a candidate variant and are then used to classify the candidate variant.
  • candidate variant refers to a base across sequence reads of a cell population that is mismatched in comparison to a reference base.
  • a variant caller model is implemented to determine whether the candidate variant is a true variant, such as a homozygous variant or a heterozygous variant.
  • true variant refers to a genetic variant that is present in one or more cells of a cell population.
  • a true variant includes a rare variant from a somatic mutation that occurs in only a subpopulation of a cell population.
  • phrase “false positive” refers to a genetic variant, identified from sequencing data, which is falsely called due to base-calling errors, processing errors (e.g., due to PCR bias), and/or artifacts (e.g., allele dropout or sequence homology), among possible other errors.
  • the terms “measurable residual disease” and “MRD” are used interchangeably and generally refer to small sub-populations of cancer cells that may be present at rare quantities in larger populations of cells. Such cancer cells representing signals of MRD can include one or more true variants that may be associated with the cancerous nature of the cancer cells.
  • measurable residual disease refers to small numbers of cancer cells that remain following cancer treatment.
  • measurable residual disease refers to signatures of cancer cells associated with a blood cancer, an example of which is acute lymphocytic leukemia (ALL).
  • ALL: acute lymphocytic leukemia
  • Embodiments described herein refer to an improved variant caller that identifies multianalyte and multisample correlations between a candidate variant and other analytes, variants, and samples, and further performs a classification of the candidate variant based on the identified correlations.
  • these true variants represent rare variants that occur in a population of cells at a very low rate (e.g., <1%). Accurate identification of these rare variants can be useful for identifying sub-populations of cells, such as cells representing signatures of measurable residual disease (MRD).
  • multianalyte and multisample correlations involve implementing a variant correlation model and the classification of variants involves implementing a variant caller model.
  • the variant caller method described herein achieves higher accuracy in calling true variants that are present in cells in comparison to conventional variant caller methods (e.g., Genome Analysis Toolkit (GATK)) that employ hard cutoffs as opposed to a variant correlation model and/or variant caller model.
  • GATK: Genome Analysis Toolkit
  • hard filters used in the GATK are described in De Summa, S., Malerba, G., Pinto, R. et al.
  • GATK hard filtering: tunable parameters to improve variant calling for next generation sequencing targeted gene panel data.
  • FIG. 1A depicts an overall system environment 100 including a cell analysis workflow device 120 and a variant caller device 130 for variant calling, in accordance with an embodiment.
  • a cell population 110 is obtained.
  • the cell population 110 can be isolated from a test sample obtained from a subject or a patient.
  • the cell population 110 includes healthy cells taken from a healthy subject.
  • the cell population 110 includes diseased cells taken from a subject.
  • the cell population 110 includes cancer cells taken from a subject previously diagnosed with cancer.
  • cancer cells can be tumor cells available in the bloodstream of the subject diagnosed with cancer.
  • cancer cells can be cells obtained through a tumor biopsy.
  • the cell population 110 includes sub-populations of cells that are present in rare quantities in the cell population 110.
  • a sub-population of cells may be cancer cells that are present in rare quantities in the cell population 110 that is largely made up of healthy cells.
  • the sub-population of cells may be cancer cells that remain after the subject has undergone cancer treatment.
  • the presence of such cancer cells, even at low quantities in the cell population 110 may be informative for, e.g., guiding treatment for the subject.
  • the cell analysis workflow device 120 refers to a device that processes cells and generates information related to the analytes of cells.
  • the cell analysis workflow device 120 refers to a system comprising one or more devices that process cells and prepare nucleic acids for sequencing and/or proteins for expression analysis.
  • the cell analysis workflow device 120 is a workflow device that generates nucleic acids from single cells, thereby enabling the subsequent identification of sequence reads and individual cells from which the sequence reads originated. Whereas measuring one analyte - like DNA - in a cell-specific manner provides an expanded view of cancer, a single analyte may not be sufficient to resolve all the clonal populations of a tumor.
  • the cell analysis workflow device 120 is a single-cell multi-omics workflow device that simultaneously measures multiple analytes, such as DNA, RNA, protein, other biomolecules, and combinations of different molecules. This facilitates the identification of signatures of MRD at a more granular and accurate level.
  • the cell analysis workflow device 120 performs single-cell processing by encapsulating individual cells into emulsions, lysing cells within emulsions, performing cell barcoding of cell lysate in emulsions, and performing a nucleic acid amplification reaction in emulsions.
  • amplified nucleic acids can be collected and sequenced. Further description of example embodiments of single-cell workflow processes is described in U.S. Patent Application No. 14/420,646, which is hereby incorporated by reference in its entirety.
  • certain cell surface markers are first labeled, e.g., by using antibody-oligo conjugates (AOCs) that bind specifically to target cell surface proteins. Accordingly, when performing nucleic acid amplification in emulsions, not only target genes but also antibody-specific oligos are amplified with a cell-specific barcode, which then allows the construction of both mutational and protein (e.g., surface markers) profiles.
  • the cell analysis workflow device 120 can be any of the Tapestri™ Platform, inDrop™ system, Nadia™ instrument, or the Chromium™ instrument.
  • the cell analysis workflow device 120 includes a sequencer for sequencing the nucleic acids to generate sequence reads.
  • the cell analysis workflow device 120 also includes an analyzer to generate protein expression profiles based on the amplitude of amplification of antibody-specific oligos.
  • the variant caller device 130 is configured to receive the sequence reads and/or protein expression data from the cell analysis workflow device 120 and to process the sequence reads and/or protein expression information to call one or more variants 140.
  • the variant calling includes a classification of a variant as a potential true variant or likely false positive. For example, when a variant is called, it may be marked as a true variant or not.
  • the variant caller device 130 is communicatively coupled to the cell analysis workflow device 120, and therefore, directly receives the sequence reads and protein expression data from the cell analysis workflow device 120.
  • the variant caller device 130 determines the correlation of a candidate variant with one or more other analytes or variants within a same cell or same sample, or within different cells and/or different samples. Based on the correlation information, a candidate variant can then be classified as true or not during a variant calling process.
  • the variant caller device 130 identifies multianalyte and multisample information from the sequence reads and protein expression information obtained through a cell-specific workflow process and subsequently calls variants across the cell population using the correlation information for each candidate variant. Altogether, this two-step process of cell-specific correlation identification and true variant classification during the variant calling process enables more accurate variant calls 140 across the cell population 110.
  • FIG. 1B is a block diagram of a variant caller device 130, in accordance with the embodiment described in FIG. 1A.
  • the variant caller device 130 includes a variant filtration module 132, a variant correlation module 134, a variant caller module 136, and optionally a training module 138.
  • the modules of the variant caller device 130 can be arranged differently from the embodiment shown in FIG. 1B.
  • the training module 138 (as shown in dotted lines) can be implemented by a device other than the variant caller device 130 and the methods described below regarding the training module 138 can be performed by the other device.
  • the variant filtration module 132 filters one or more variants that are not likely true variants.
  • the variant filtration module 132 may filter some falsely called variants based on certain features (e.g., individual cell features or cell population features) identified from the sequence reads and/or protein expression data of various variants from one or more cells or samples.
  • some falsely called variants may show a difference in certain cell population features or individual cell features, such as genotype quality, read depth or depth coverage, variant allele frequency, the number of cells in which a variant is detected, and the like, when compared to true variants.
  • these various thresholds may be predefined and configured in the variant caller device 130, which then uses the threshold values to filter out a candidate variant that is not a true variant (e.g., a “false positive” variant).
  • the variant filtration module 132 applies one or more filters at any stage of a variant calling process. For example, one or more filters may be applied first before checking the multianalyte and multisample correlation of a candidate variant. For another example, the variant filtration module 132 applies one or more filters after checking the multianalyte and multisample correlation of a candidate variant. In yet another example, the variant filtration module 132 applies one or more filters during the process of checking the multianalyte and multisample correlation of a candidate variant.
  • the specific thresholds for each filter may be configured based on the knowledge identified from the false positive variants and/or the true variants. For example, by manually looking into the sequence reads of a large number of samples, it may be found that the true variants generally have a minimum depth of coverage of 10. The threshold for the depth of coverage may be set to about 10. Similarly, if it is found that the true variants generally have a genotype quality of 30 or more, then the threshold for the genotype quality is set to about 30. In various embodiments, the thresholds for other filters may be similarly determined.
  • the features or specific filters selected for filtering purposes may not be fixed but rather can be dynamically updated. For example, based on new findings about certain true variants or false positive variants, additional filters can be added to the variant filtration module 132.
  • the threshold values for each specific filter may also be dynamically updated if new findings support such updates.
  • the thresholds may also be adjusted. For example, when determining the co-occurrence of two candidate variants within the same cells, the threshold cell number used for determining co-occurrence may be changed depending on the disease stage of the subject from which a sample is obtained, since a disease in later stages may have more variants that co-occur within the same cells.
  • these candidate variants may be excluded from the correlation analysis, to save the computing resources required for the analysis.
  • the variant correlation module 134 determines multianalyte and multisample correlations for one or more candidate variants.
  • the variant correlation module 134 may identify any kind of correlation that facilitates the identification of a true variant and/or exclusion of a false positive variant.
  • the variant correlation module 134 may determine whether there is a cross-analyte correlation by checking the analyte features of a candidate variant with other analyte features associated with the candidate variant (e.g., analyte features from a same cell or same sample).
  • the analyte features may relate to DNA, RNA, protein, or other kinds of biomolecules.
  • the variant correlation module 134 may check whether there is a correlation between the VAF value of a candidate variant and the protein expression of cell surface markers from a same cell. For another example, the variant correlation module 134 may determine whether there is a correlation between one variant and another variant within a same cell by checking how many times (e.g., how many cells) these two variants occur in a same cell.
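The two checks described above (a VAF-versus-protein-expression correlation and a count of cells in which two variants occur together) can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function names `cooccurrence_counts` and `pearson`, and the input layout (a mapping from cell barcode to the variants called in that cell), are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(cell_variants):
    """Count, for every pair of variants, the number of cells carrying both.

    cell_variants: dict mapping a cell barcode to an iterable of variant IDs
    (hypothetical input layout chosen for this sketch).
    """
    counts = Counter()
    for variants in cell_variants.values():
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(set(variants)), 2):
            counts[pair] += 1
    return counts


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences,
    e.g., per-cell VAF values and per-cell expression of a surface marker."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant sequence carries no correlation signal
    return cov / (var_x * var_y) ** 0.5
```

A pair count from `cooccurrence_counts` is only a raw tally; whether the tally is significant still has to be judged against the number of cells carrying each variant separately.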
  • the variant correlation module 134 leverages cross-sample correlation to determine whether a variant is a true variant or not.
  • data from a set of control samples may be used for background error rate evaluation when determining whether a candidate variant is a true variant or is due to a background error (e.g., base-calling errors, processing errors, and/or artifacts, which may be considered as “background error”).
  • the control samples may not have true or rare variants (e.g., control samples obtained from healthy donors) and thus the identified variants can be considered as being from background errors (e.g., due to base-calling errors, processing errors, and/or artifacts, among possible other errors).
  • control samples may be processed in the same way as the sample that is subject to the variant call, and thus the background errors identified from the control samples may allow determining whether a candidate variant identified from a sample in the variant call is due to the background error or not (e.g., by estimating the background error rate for each candidate variant to see how likely the variant is falsely called).
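One way to picture the control-sample comparison is a one-sided binomial test of the in-evaluation sample against a background error rate pooled from the controls. The sketch below is an assumption-laden illustration: the source does not specify this test for the multisample comparison, and the names `binomial_tail`, `background_error_rate`, and `exceeds_background`, the pseudocount, and the `alpha` cutoff are all hypothetical choices.

```python
from math import exp, lgamma, log, log1p


def binomial_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p), summing terms computed in log space."""
    if k <= 0:
        return 1.0
    if k > n or p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                   + i * log(p) + (n - i) * log1p(-p))
        total += exp(log_pmf)
    return min(total, 1.0)


def background_error_rate(control_counts):
    """Pool (variant_cells, total_cells) pairs from control samples into one rate.

    A pseudocount (assumed here) keeps the rate above zero when no control
    cell shows the variant.
    """
    variant_cells = sum(v for v, _ in control_counts)
    total_cells = sum(t for _, t in control_counts)
    return (variant_cells + 1) / (total_cells + 2)


def exceeds_background(variant_cells, total_cells, control_counts, alpha=0.01):
    """True if the variant is seen in more cells than background error explains."""
    rate = background_error_rate(control_counts)
    return binomial_tail(variant_cells, total_cells, rate) < alpha
```

With two control samples of 1,000 cells each and a single erroneous call between them, a variant seen in 10 of 1,000 test cells stands well above the pooled background, whereas a variant seen in 1 of 1,000 cells does not.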
  • the variant correlation module 134 further includes a variant correlation model that allows the identification of any potential correlation between a candidate variant and another variant/analyte/cell/sample.
  • the variant correlation model may evaluate the presence of possible correlations based on the data obtained from sequence reads, protein expression from the same cells, same samples, or from different cells and/or samples.
  • the variant correlation model disclosed herein includes one or more probabilistic models that evaluate the statistical significance of one or more possible correlations.
  • the variant correlation model may also include one or more machine-learning models that can be trained to identify certain correlations based on the features identified from the variants, analytes, cells, and/or samples.
  • the variant pair co-occurrence patterns from a large number of samples can be used to train a machine-learning model to determine whether there is a correlation between co-occurring variants within same cells.
  • the variant caller module 136 applies a variant caller model to predict one or more true variants of a cell population.
  • the variant caller module 136 provides, as input, multianalyte and multisample correlation information associated with a candidate variant to the variant caller model.
  • the variant caller model analyzes the multianalyte and multisample correlation information and outputs a prediction for the candidate variant.
  • the variant caller model is a classifier that outputs a classification for the candidate variant out of multiple possible classifications.
  • the variant caller model is a classifier that outputs one of two classifications for the candidate variant.
  • the variant caller model can output a classification of a true variant or a false positive variant.
  • the variant caller model outputs a classification of an indeterminate variant.
  • An indeterminate variant can represent a low-confidence call that requires additional analysis to confirm whether the indeterminate variant is a true variant.
  • the training module 138 generally implements methods for generating one or both of the variant correlation model and the variant caller model.
  • the training module 138 is implemented by a device or system other than the variant caller device 130.
  • the training module 138 can be implemented by a third party.
  • the third party generates one or both of the variant correlation model and the variant caller model.
  • the third party can then provide one or both of the trained variant correlation model and the trained variant caller model to the variant caller device 130.
  • the training module 138 trains the variant correlation model.
  • the training module 138 can employ a machine learning-implemented method to train the variant correlation model, such as any one of a linear regression algorithm, logistic regression algorithm, decision tree algorithm, support vector machine classification, Naive Bayes classification, K-Nearest Neighbor classification, random forest algorithm, deep learning algorithm, gradient boosting algorithm, and dimensionality reduction techniques such as manifold learning, principal component analysis, factor analysis, autoencoder regularization, and independent component analysis, or combinations thereof.
  • the training module 138 employs supervised learning algorithms, unsupervised learning algorithms, semisupervised learning algorithms (e.g., partial supervision), transfer learning, multi-task learning, or any combination thereof to train the variant correlation model.
  • the training module 138 trains the variant correlation model using variant correlation training samples.
  • the variant correlation training samples include training sequence reads, protein expression, and other features derived from individual cells or samples that show certain correlation patterns. Such training samples can be expressed in a commonly used file format such as a SAM or BAM file format.
  • the training module 138 also trains the variant caller model.
  • the training module 138 can employ a machine learning-implemented method to train the variant caller model, such as any one of a linear regression algorithm, logistic regression algorithm, decision tree algorithm, support vector machine classification, Naive Bayes classification, K-Nearest Neighbor classification, random forest algorithm, deep learning algorithm, gradient boosting algorithm, and dimensionality reduction techniques such as manifold learning, principal component analysis, factor analysis, autoencoder regularization, and independent component analysis, or combinations thereof.
  • the training module 138 employs supervised learning algorithms, unsupervised learning algorithms, semisupervised learning algorithms (e.g., partial supervision), transfer learning, multi-task learning, or any combination thereof to train the variant caller model.
  • the training module 138 trains the variant caller model using variant caller training samples.
  • the variant caller training samples include training multianalyte and multisample correlations with known variant identification (e.g., known true variants or false positive variants, e.g., which can be determined through manual investigation).
  • the variant caller training samples also include sequence reads, analyte expressions, cells, and/or samples that can be used to derive multianalyte and multisample correlation information.
  • the variant caller training samples can be labeled with reference ground truths indicating a classification of variants.
  • the reference ground truths differentiate between a true variant and a false positive variant.
  • the labels of the variant caller training samples can be previously determined and/or confirmed through other sequencing methods, such as bulk sequencing methods.
  • labels of the variant caller training samples can be previously determined at least in part based on known genetic variants that are present in certain cell lines.
  • a label can be a binary value (e.g., a 0 or 1 value) that is indicative of whether the variant is a true variant or a false positive variant.
  • a label can be different integer values (e.g., 0, 1, 2, 3, etc.) depending on the number of classifications that the variant caller model is designed to predict. For example, an indeterminate variant can be labeled as 2, while a true variant or false positive variant is labeled as 0 or 1.
  • the multianalyte and multisample correlation may include, but is not limited to, 1) a multianalyte correlation between VAF values of variants and protein expression of one or more proteins, 2) a multianalyte co-occurrence between one variant and another different variant in the same cell(s), and 3) a multisample comparison of a variant from an in-evaluation sample to other control samples. Based on the findings from these different multianalyte and multisample correlations, it can be determined whether a variant is a true variant or not.
  • Step 202 Obtain a set of candidate variants from a heterogeneous population of cells.
  • the heterogeneous population of cells includes cells that contain true variants (e.g., various somatic mutations) and cells that do not contain true variants.
  • the candidate variants are determined through single-cell DNA sequencing.
  • the set of candidate variants includes a plurality of falsely called variants arising from base-calling errors, processing errors (e.g., due to PCR bias), and/or artifacts (e.g., allele dropout or sequence homology). Accordingly, it may be desirable to separate true variants from these falsely called variants so that a proper population of cells may be identified for further analysis, such as understanding the mechanisms of tumorigenesis, among others.
  • Step 204 For each candidate variant, determine a correlation between the candidate variant and one or more of other features associated with the variant.
  • the candidate variant can be compared to other analytes and/or other variants within the same or different cells or samples, to identify possible multianalyte and multisample correlations between the variant and other analytes and/or other variants within the same or different cells or samples.
  • the candidate variant can be analyzed for each possible correlation.
  • determining a correlation between the candidate variant and one or more of the other features associated with the variant comprises using a variant correlation model to determine the correlations for the candidate variant from different aspects.
  • Step 206 Select a subset of the candidate variants as true variants based on the determined correlations.
  • selecting a subset of the candidate variants as true variants comprises using a variant caller model to determine whether a candidate variant is a true variant based on the determined one or more correlations associated with the variant.
  • not every possible correlation for a candidate variant needs to be determined. For example, if one identified correlation can determine whether the candidate variant is a true variant, then no other correlations need to be further evaluated. In other embodiments, multiple correlations are determined instead, to improve the sensitivity and specificity of a variant calling process.
  • Step 208 Identify a subpopulation of cells from the heterogeneous population of cells, the subpopulation of cells comprising one or more of the true variants.
  • identifying a subpopulation of cells from the heterogeneous population of cells comprises identifying a subpopulation of cells comprising one or more true variants.
  • the one or more true variants identified through method 200 disclosed herein represent rare variants present in the subpopulation of cells.
  • the subpopulation of cells represents a rare cell population within a heterogeneous population of cells.
  • the subpopulation may be a subclone present in a tumor sample or biopsy.
  • the variant correlation model and the variant caller model are machine-learned models.
  • Each of the variant correlation model and the variant caller model may be trained using training data.
  • the variant correlation model and the variant caller model can be deployed (e.g., deployed to a variant caller device for variant call).
  • one or both of the variant correlation model and the variant caller model is any one of a regression model (e.g., linear regression, logistic regression, or polynomial regression), decision tree, random forest, support vector machine, Naive Bayes model, k-means cluster, or neural network (e.g., feed-forward networks, convolutional neural networks (CNN), deep neural networks (DNN), autoencoder neural networks, generative adversarial networks, or recurrent networks (e.g., long short-term memory networks (LSTM), bidirectional recurrent networks, or deep bi-directional recurrent networks)).
  • one or both of the variant correlation model and the variant caller model have one or more parameters, such as hyperparameters or model parameters.
  • Hyperparameters are generally established prior to training. Examples of hyperparameters include the learning rate, depth or leaves of a decision tree, number of hidden layers in a deep neural network, number of clusters in a k-means cluster, penalty in a regression model, and a regularization parameter associated with a cost function.
  • Model parameters of one or both of the variant correlation model and the variant caller model are generally adjusted during training. Examples of model parameters include weights associated with nodes in layers of a neural network, support vectors in a support vector machine, and coefficients in a regression model. The model parameters of the machine learning model are trained (e.g., adjusted) using the training data to improve the predictive power of the machine learning model.
  • one or both of the variant correlation model and the variant caller model are parametric models in which one or more parameters of the models define the dependence between the independent variables and dependent variables.
  • various parameters of parametric-type models are trained to minimize a loss function, the training being conducted through gradient-based numerical optimization algorithms, such as batch gradient algorithms, stochastic gradient algorithms, and the like.
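As a concrete instance of training a parametric model by gradient-based minimization of a loss function, the sketch below fits a logistic regression with batch gradient descent on the log loss. It is illustrative only: the feature layout, learning rate, and epoch count are assumptions, and nothing here is taken from the disclosed models beyond the general technique.

```python
from math import exp, log


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    ez = exp(z)
    return ez / (1.0 + ez)


def predict(weights, bias, x):
    """Predicted probability that feature vector x belongs to class 1."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def log_loss(weights, bias, xs, ys):
    """Mean negative log-likelihood (the loss being minimized)."""
    eps = 1e-12
    total = 0.0
    for x, y in zip(xs, ys):
        p = min(max(predict(weights, bias, x), eps), 1.0 - eps)
        total -= y * log(p) + (1 - y) * log(1.0 - p)
    return total / len(xs)


def train(xs, ys, learning_rate=0.5, epochs=2000):
    """Batch gradient descent: every update uses the gradient over all samples."""
    n_features = len(xs[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(xs, ys):
            error = predict(weights, bias, x) - y  # d(loss)/d(logit)
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
            grad_b += error
        weights = [w - learning_rate * g / len(xs)
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(xs)
    return weights, bias
```

A stochastic variant would update the parameters after each sample (or mini-batch) rather than after a full pass over the training data.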
  • one or both of the variant correlation model and the variant caller model are non-parametric models in which the model structure is determined from the training data and is not strictly based on a fixed set of parameters.
  • FIG. 3A depicts one example implementation of the variant correlation model 310, in accordance with an embodiment.
  • the variant correlation model 310 analyzes features associated with a candidate variant, where the features are generated from sequence reads and protein expressions derived from single cells in a sample or multiple samples.
  • the variant correlation model 310 analyzes these features derived from the sequence reads and protein expressions. These features may include various features such as protein expression levels, variant allele frequency values, etc., all of which can be derived from the sequence reads and protein expression profiles of the individual cells within the sample(s).
  • Based on the identified features associated with a candidate variant, the variant correlation model 310 outputs a correlation value or matrix that represents one or more correlations between the candidate variant and one or more of other variants and/or analytes from the same or different cells or samples.
  • the variant correlation model 310 is a neural network.
  • the variant correlation model 310 is a deep-learning neural network.
  • the variant correlation model 310 may be structured with two, three, four, five, six, seven, eight, nine, or ten layers. Layers of the variant correlation model 310 are comprised of one or more nodes. A node in a layer can be connected to other nodes of other layers, the connection between nodes being associated with parameters. A value at one node may be represented as a combination of the values of nodes connected to the particular node weighted by associated parameters mapped by an activation function associated with the particular node.
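The node computation described above (a weighted combination of connected node values passed through an activation function) can be written out directly. The sketch below is a generic feed-forward pass, not the architecture of the variant correlation model 310: the bias term, the ReLU and sigmoid activations, and the function names are assumptions added for illustration.

```python
from math import exp


def relu(x):
    """Rectified linear activation."""
    return x if x > 0.0 else 0.0


def sigmoid(x):
    """Logistic activation (stable for large negative inputs)."""
    if x >= 0:
        return 1.0 / (1.0 + exp(-x))
    ex = exp(x)
    return ex / (1.0 + ex)


def dense_layer(inputs, weights, biases, activation):
    """Compute one layer: each node applies its activation to the weighted
    sum of the previous layer's node values plus a bias (bias is assumed)."""
    return [
        activation(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Run inputs through a list of (weights, biases, activation) layers."""
    values = inputs
    for weights, biases, activation in layers:
        values = dense_layer(values, weights, biases, activation)
    return values
```

Stacking two to ten such layers, as the text contemplates, only changes the length of the `layers` list passed to `forward`.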
  • FIG. 3B depicts one example implementation of the variant caller model, in accordance with an embodiment.
  • the variant caller model 320 analyzes various correlations derived from features associated with a candidate variant.
  • the variant caller model 320 outputs a classification for the candidate variant.
  • the classification for the variant is one of a true variant or a false positive variant.
  • the variant caller model 320 receives as input the various correlations identified from the features associated with a candidate variant.
  • the variant caller model 320 analyzes the various correlations and predicts a variant classification for the candidate variant.
  • the variant caller model 320 is a neural network.
  • the variant caller model 320 is a deep-learning neural network.
  • the variant caller model 320 may be structured with two, three, four, five, six, seven, eight, nine, or ten layers. Layers of the variant caller model 320 are comprised of one or more nodes. A node in a layer can be connected to other nodes of other layers, the connection between nodes being associated with parameters. A value at one node may be represented as a combination of the values of nodes connected to the particular node weighted by associated parameters mapped by an activation function associated with the particular node.
  • Embodiments disclosed herein leverage the single-cell data by identifying variants that co-occur together in the same cell in a statistically significant way.
  • Prior approaches identify variants by checking one variant at a time, where an error rate for individual variants may be high. The error rate of co-occurring variants is much lower compared to individual variants, so the theoretical limit of detection for co-occurring variants is lower than that of individual variants.
  • Embodiments of methods disclosed herein use a Binomial test for testing the statistical significance of co-occurring variants. This not only identifies if two rare variants co-occur together in a significant way but also filters out cases where one variant is rare while the other is not.
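A minimal sketch of how a binomial test could score a co-occurring variant pair is given below. The null model is an assumption made for this example (the two variants are taken to occur independently, so the per-cell co-occurrence probability is the product of their individual frequencies); the source does not spell out the null hypothesis, and the names `binomial_tail` and `cooccurrence_pvalue` are hypothetical.

```python
from math import exp, lgamma, log, log1p


def binomial_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p), summing terms computed in log space."""
    if k <= 0:
        return 1.0
    if k > n or p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                   + i * log(p) + (n - i) * log1p(-p))
        total += exp(log_pmf)
    return min(total, 1.0)


def cooccurrence_pvalue(n_cells, n_a, n_b, n_both):
    """One-sided p-value for seeing at least n_both cells with both variants.

    Assumed null: variants A and B occur independently, so a cell carries
    both with probability (n_a / n_cells) * (n_b / n_cells).
    """
    p_null = (n_a / n_cells) * (n_b / n_cells)
    return binomial_tail(n_both, n_cells, p_null)
```

Under this null, two variants each present in 20 of 10,000 cells and shared by 15 cells give a vanishingly small p-value, while a rare variant (20 cells) paired with a common one (5,000 cells) and shared by 10 cells does not, which matches the stated goal of filtering out pairs where only one variant is rare.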
  • FIG. 4A depicts a flow chart of an example method 420 for co-occurrence-based true variant identification, according to some embodiments.
  • a set of candidate variants are obtained from a heterogeneous population of cells.
  • the candidate variants may be identified from a cell population (e.g., cell population 110 described in FIG. 1A), which includes data obtained from a heterogeneous population of cells through a single-cell platform (an example of which includes the Tapestri® platform).
  • a single-cell platform enables highly sensitive targeted DNA sequencing at the single-cell level.
  • this droplet-based technology measures genetic lesions (single nucleotide variants (SNVs), indels, chromosomal rearrangements) and copy number variants (CNVs) in each cell.
  • Step 424 Apply a first set of variant filters to identify a set of variants from candidate variants.
  • a set of ad hoc variant filters are first applied to remove potential false positive variants (or simply false positives).
  • the possible ad hoc variant filters include, but are not limited to, allele depth (AD) and/or depth of coverage (DP), genotype quality (GQ), the minimum number of cells in which a variant is present, and a minimum variant allele frequency for each candidate variant.
  • these different filters may be set to predefined values, to allow filtering out likely false positives.
  • the thresholds may be set as follows: requiring a minimum depth of 10 reads, requiring genotype quality of 30 or larger, requiring a variant to be present in at least 3 cells, and requiring a minimum variant allele frequency of 35.
  • different threshold values may be set for each specific filter, as further described in detail below.
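The example thresholds above translate into a simple pass/fail check per candidate variant. The sketch below is illustrative: the `CandidateVariant` record and its field names are hypothetical, and the VAF threshold of 35 is assumed to be expressed in percent, which the text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class CandidateVariant:
    """Hypothetical per-variant summary used only for this sketch."""
    name: str
    depth: int              # depth of coverage (DP), in reads
    genotype_quality: int   # GQ, Phred-scaled
    n_cells: int            # number of cells in which the variant is called
    vaf_percent: float      # variant allele frequency, assumed to be in percent


def passes_filters(variant, min_depth=10, min_gq=30, min_cells=3, min_vaf=35.0):
    """Apply the four ad hoc filters; a variant must pass all of them."""
    return (
        variant.depth >= min_depth
        and variant.genotype_quality >= min_gq
        and variant.n_cells >= min_cells
        and variant.vaf_percent >= min_vaf
    )


def filter_candidates(variants, **thresholds):
    """Keep only the candidate variants that pass every filter."""
    return [v for v in variants if passes_filters(v, **thresholds)]
```

Because the thresholds are keyword arguments, a different embodiment's values (for example, a GQ threshold of 35) can be supplied without changing the filter logic.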
  • Allele depth and depth of coverage are two complementary fields that represent two important ways of evaluating the depth of the data prepared for the variant call.
  • Allele depth refers to the unfiltered allele depth, which is the number of reads that support each of the reported alleles. All reads at a position (including reads that did not pass the variant caller’s filters) are included in this number, except for reads that are considered uninformative. Uninformative reads generally do not provide sufficient statistical evidence to support one allele over another.
  • Depth of coverage is the filtered depth, reported at the sample level like the allele depth. The depth of coverage is the number of filtered reads that support each of the reported alleles. Only reads that passed the variant caller’s depth threshold filter(s) may be included in the depth of coverage. It should be noted that, unlike in the allele depth calculation, uninformative reads are generally included in the depth of coverage calculation.
  • variants with certain allele depths may be filtered out from further analysis.
  • variants that have a depth less than the depth of coverage threshold may be filtered out so that only variants that have a number of reads equal to or greater than the threshold may be subject to further analysis in the variant call.
  • a depth of coverage threshold may be a value that can be predetermined based on the knowledge gained from certain analyses (e.g., based on the knowledge identified from ground-truth samples).
  • the predetermined depth of coverage threshold can be set to 4, 6, 8, 10, 12, 14, 16, 18, 20 reads, or another different value for a candidate variant.
  • a depth of coverage threshold of 10 may be selected for filtering purposes.
  • the genotype quality is another ad hoc variant filter that can be used to filter out certain false positive variants.
  • Genotype quality generally represents a confidence level (e.g., Phred-scaled confidence level (PL)) that a genotype assignment is correct.
  • GQ may be derived from the genotype PLs. For example, the GQ may be determined based on the difference between the PL of the second most likely genotype and the PL of the most likely genotype.
  • the values of the PLs may be normalized so that the most likely PL is 0, so the GQ ends up being equal to the second smallest PL, unless that PL is greater than 99.
  • the value of GQ is capped at 99 since larger values may be not more informative, but these values take more space in the file. So if the second most likely PL is greater than 99, a GQ of 99 may be used instead.
  • the GQ provides a sign of the difference between the likelihoods of the two most likely genotypes. If it is low, it means there is not much confidence in the genotype.
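  • The derivation of GQ from the genotype PLs described above can be sketched as follows. This is a minimal illustration, not the variant caller's implementation; the function name `genotype_quality` and the list-of-integers input are assumptions made for the example.

```python
def genotype_quality(pls, cap=99):
    """Derive GQ from Phred-scaled genotype likelihoods (PLs).

    `pls` holds one PL per possible genotype (hypothetical input layout).
    """
    if len(pls) < 2:
        raise ValueError("need PLs for at least two genotypes")
    best = min(pls)
    # Normalize so that the most likely genotype has a PL of 0.
    normalized = sorted(pl - best for pl in pls)
    # GQ is the second-smallest normalized PL, capped at `cap`.
    return min(normalized[1], cap)
```

  • For example, `genotype_quality([0, 45, 300])` returns 45, while `genotype_quality([120, 10, 250])` returns the capped value 99.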
  • variants in a variant call may be also filtered based on the GQ. For example, only variants that have a GQ higher than a threshold value may be subject to further analysis, while the variants that have a low GQ may be filtered out without further analysis.
  • the GQ threshold may be determined based on knowledge obtained from other analyses (e.g., based on the knowledge identified from ground-truth samples). In one example, the GQ threshold may be set to 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, or another different value. In a specific example, the GQ threshold may be set to 35.
  • the ad hoc variant filters further include determining the quantity of cells in which a candidate variant is present. That is, by determining the number of cells in which a variant is present, certain false positive variants can also be filtered out.
  • Variant reads are the number of independent sequence reads supporting the presence of a variant. Due to the high error rate of NGS at the per-base call level, calls supported by fewer than a certain number of variant reads are typically considered to be likely false positive calls.
  • true variants generally have a larger number of variant reads (e.g., detected in a larger number of cells). Accordingly, by using the cell quantity as a filter, the variants that occur in smaller numbers of cells can be considered potential false positives and filtered out.
  • the specific value used in the cell number-based filtering process may be also determined based on information obtained from certain analyses (e.g., based on the knowledge identified from ground-truth samples). For example, after analysis based on the studies, it is found that true variants generally occur in three or more cells in a sample. Accordingly, the cell quantity threshold may be set to 3 to filter out variants that occur in less than three cells in a sample. In various embodiments, the cell threshold may be also another different value, such as 2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20, 25, 30, 35, 40, 45, 50, or another number of cells, based on the knowledge obtained from certain studies.
  • the ad hoc variant filters further leverage the variant allele frequency (VAF) to filter out certain false positives.
  • VAF is the percentage of sequence reads observed matching a specific DNA variant divided by the overall coverage at that locus. True variants typically have higher variant allele frequencies, while false positives may have lower variant allele frequencies. Accordingly, by setting VAF as a filter, certain false positives may be also filtered out.
  • the specific values used in the VAF filtering process may be also determined based on the information from the ground-truth studies or other similar analyses. For example, a number of studies indicate that a variant that has a VAF smaller than 35 is likely a false positive, and thus the threshold value may be set to 35. In various embodiments, the threshold value for the VAF can be set to other different values, such as 20, 25, 30, 35, 40, 45, 50, or another different value.
  • a set of variants may be obtained from the candidate variants identified from a sample.
  • the set of variants can be then subject to the co-occurrence determination, for example, to determine the quantity of cells where a candidate variant co-occurs with another variant.
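  • The four ad hoc filters above can be sketched as a single pass over cell-level calls. The tuple layout `(variant, cell, depth, gq, vaf)` and the function name `filter_candidates` are hypothetical. The choice to apply the depth, GQ, and VAF thresholds per cell-level call and the cell-count threshold per variant is an assumption, since the text does not fix the level at which each filter is applied.

```python
from collections import defaultdict

MIN_DEPTH = 10    # minimum depth of coverage, in reads
MIN_GQ = 30       # minimum genotype quality
MIN_CELLS = 3     # minimum number of cells carrying the variant
MIN_VAF = 35.0    # minimum variant allele frequency, in percent

def filter_candidates(calls):
    """Return {variant: set of cells} for variants passing all four filters.

    `calls` is an iterable of (variant, cell, depth, gq, vaf) tuples,
    one per cell-level call (hypothetical layout).
    """
    cells_by_variant = defaultdict(set)
    for variant, cell, depth, gq, vaf in calls:
        # Assumption: depth, GQ, and VAF are checked on each cell-level call.
        if depth >= MIN_DEPTH and gq >= MIN_GQ and vaf >= MIN_VAF:
            cells_by_variant[variant].add(cell)
    # Keep only variants still present in at least MIN_CELLS cells.
    return {v: cells for v, cells in cells_by_variant.items()
            if len(cells) >= MIN_CELLS}
```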
  • Step 426 For a variant pair included in the candidate variants, determine a quantity of co-occurrence cells where both variants in the variant pair co-occur in each of the co-occurrence cells.
  • the threshold number of cells for determining the variant pair co-occurrence is set to a value determined based on the ground-truth studies or other similar analyses.
  • the threshold number can be set to 3, which means that when variants in a variant pair co-occur in 3 cells or more, it indicates that there is a co-occurrence between the two variants in the variant pair.
  • different numbers other than 3 (e.g., 2, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, or another different number of cells) can be used instead.
  • Step 428 Determine a set of variant pairs based on quantities of co-occurrence cells determined for a plurality of variant pairs included in the set of candidate variants.
  • the statistical significance of variant co-occurrence is tested by using a Binomial test.
  • a Binomial test is a test of the statistical significance of deviations from a theoretically expected distribution of observations into two categories using sample data.
  • the Binomial test serves as a kind of probability test based on various rules of probability, and involves the testing of the difference between a sample proportion and a given proportion.
  • a single variant occurrence probability is estimated per cell that applies to all cells in the sample. The product of two variant occurrence probabilities gives the probability of overlap by chance.
  • the threshold p-value of the Binomial test is set to 0.01, which means that when the determined p-value of the Binomial test is less than or equal to 0.01, the detected co-occurrence is statistically significant.
  • threshold p-values of the Binomial test can be set to a value other than 0.01 (e.g., 0.005, 0.006, 0.007, 0.008, 0.009, 0.011, 0.012, 0.013, 0.014, 0.015 or another value) to determine the statistical significance.
  • a one-sided Binomial test is used in evaluating the statistical significance. As described earlier, if one variant is in 1% of cells and the other is in 80%, then it would be expected most variant pairs will co-occur by chance. By using a one-sided Binomial test, it can filter out cases where one variant is rare and the other is not.
  • the Binomial test can be expressed as:
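  • A minimal sketch of a one-sided Binomial test consistent with the description above, and not necessarily the exact expression of this disclosure. It assumes that each per-cell occurrence probability is estimated as the fraction of cells carrying the variant, that the chance of overlap in a cell is the product of the two probabilities, and that the p-value is the upper tail P(X ≥ observed co-occurrences). The function names are hypothetical.

```python
from math import exp, lgamma, log

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space for stability."""
    if k <= 0:
        return 1.0
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                   + i * log(p) + (n - i) * log(1.0 - p))
        total += exp(log_pmf)
    return min(1.0, total)

def cooccurrence_p_value(n_cells, n_a, n_b, n_both):
    """One-sided p-value that variants A and B overlap in n_both cells by chance."""
    # Assumption: per-cell occurrence probabilities are the observed cell fractions.
    p_chance = (n_a / n_cells) * (n_b / n_cells)
    return binomial_upper_tail(n_both, n_cells, p_chance)
```

  • Under these assumptions, two variants each seen in 5 of 1,000 cells and co-occurring in all 5 give a p-value far below 0.01, whereas a variant in 1% of cells overlapping a variant in 80% of cells in 8 cells does not reach significance.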
  • a sample may have tens of thousands of rare variants, with the vast majority occurring in less than 10 cells. In most samples, there are tens of thousands of variants, and performing a statistical test on all possible combinations, such as a Binomial test, is very computationally inefficient. To speed up the computation, the mathematical concepts of the sparse matrices, matrix multiplication, and adjacency matrices may be used.
  • a sparse matrix is a special case of a matrix in which the number of zero elements is much higher than the number of non-zero elements. As a rule of thumb, if 2/3 of the total elements in a matrix are zeros, it can be called a sparse matrix.
  • an adjacency matrix is a square matrix used to represent a finite graph.
  • the elements of the matrix indicate whether pairs of vertices are adjacent or not in the graph.
  • a sparse matrix representation of which variants are present in each cell is generated. Multiplying this sparse matrix by its own transpose yields an adjacency matrix whose entries count the cells shared by each variant pair. Redundant memory access to non-zeros may be eliminated by decoupling multiplication from accumulation, thereby speeding up the Binomial test of all possible combinations of variant pairs.
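  • The sparse product can be illustrated without a matrix library. In the sketch below, the variant-by-cell matrix M is stored as a mapping from each variant to the set of cells carrying it (only the non-zeros), and the off-diagonal entries of M multiplied by its transpose are accumulated by visiting each cell once. The data layout and the function name `cooccurrence_counts` are assumptions made for the example; a production pipeline would more likely use a dedicated sparse-matrix library.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_counts(cells_by_variant):
    """Return {(variant_a, variant_b): number of cells carrying both}.

    Equivalent to the off-diagonal entries of M @ M.T for a binary
    variant-by-cell matrix M, computed from non-zero entries only.
    """
    # Invert the mapping: for each cell, list the variants it carries.
    variants_by_cell = defaultdict(list)
    for variant, cells in cells_by_variant.items():
        for cell in cells:
            variants_by_cell[cell].append(variant)
    counts = defaultdict(int)
    for variants in variants_by_cell.values():
        # Each cell contributes 1 to every pair of variants it carries.
        for a, b in combinations(sorted(variants), 2):
            counts[(a, b)] += 1
    return dict(counts)
```

  • Pairs that never share a cell are simply absent from the result, so the cost scales with the number of non-zero entries rather than with the number of possible variant pairs.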
  • Step 430 Apply a second set of ad hoc filters to identify a sub-subset of variant pairs.
  • the subset of variant pairs may be further filtered to remove certain false positives.
  • a first example filter relates to a genomic distance between the two variants in a variant pair. In general, co-occurring variants that are genomically very close together are more likely the result of PCR or alignment errors. By filtering out the variant pairs where two variants in a pair are too close, the likely false positives from PCR and alignment errors are excluded.
  • the threshold genomic distance between two variants in a variant pair is determined based on the knowledge identified from PCR and alignment error analyses.
  • the threshold genomic distance is set to 100 base pairs, which means that if co-occurring variants in a variant pair are less than 100 base pairs apart from each other, the variant pair can be filtered out from the true variant analysis.
  • a second example filter for a variant pair that shows significant co-occurrence relates to an average VAF.
  • whereas VAF can be used to evaluate a single variant, the average VAF can be used to evaluate the variant pair.
  • the average VAF takes the average of the VAFs for the two variants in a variant pair.
  • the specific values used in the VAF filtering process may be also determined based on the information obtained for single variant-based VAF analysis.
  • the threshold value may be set to 35, which means that if the average VAF for a variant pair is less than 35, one or both variants in a variant pair are likely false positives.
  • the threshold value for the average VAF can be set to other different values, such as 20, 25, 30, 35, 40, 45, 50, or another different value.
  • the variants included in the variant pairs that have passed the second set of ad hoc filters are considered true variants.
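  • The second set of ad hoc filters can be sketched as a predicate over a variant pair. The `(chromosome, position, vaf)` tuple layout and the function name are hypothetical. Treating variants on different chromosomes as sufficiently far apart is an assumption, since the text defines the distance filter only in base pairs.

```python
MIN_DISTANCE_BP = 100   # minimum genomic distance between paired variants
MIN_MEAN_VAF = 35.0     # minimum average VAF of the pair, in percent

def passes_pair_filters(var_a, var_b):
    """Each variant is a (chromosome, position, vaf) tuple (hypothetical layout)."""
    chrom_a, pos_a, vaf_a = var_a
    chrom_b, pos_b, vaf_b = var_b
    # Assumption: distance is only compared for variants on the same chromosome.
    if chrom_a == chrom_b and abs(pos_a - pos_b) < MIN_DISTANCE_BP:
        return False
    # The pair is evaluated by the average of the two single-variant VAFs.
    return (vaf_a + vaf_b) / 2 >= MIN_MEAN_VAF
```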
  • the methods disclosed herein for co-occurrence-based true variant identification further include identifying a subpopulation of cells with one or more of the true variants (e.g., cells with two or more true variants in tumor development emerging from a gradual accumulation of somatic alterations that together enable malignant growth).
  • one or more true variants represent rare variants present in the subpopulation of cells.
  • the subpopulation of cells represents a rare cell population within a heterogeneous population of cells.
  • the subpopulation of cells represents less than 10%, less than 5%, less than 4%, less than 3%, less than 2%, or less than 1% of the cells in the heterogeneous population of cells.
  • Embodiments disclosed herein leverage background error rates to identify true variants. Prior approaches rely on a fixed threshold for all variants, which does not consider the variability in background error rates across variants. In the methods disclosed herein, a threshold for identifying a true variant is dynamically determined for each variant. Altogether, the methods disclosed herein enable a lower limit of detection for some variants, while having a higher threshold for variants that are more error-prone, thereby reducing false positives.
  • the methods disclosed herein use control samples for estimating background error rates for individual variants.
  • These control samples can be samples that have been previously or instantly generated specifically for background error rate estimation. These individual control samples may provide additional information beyond what a single sample can provide on its own in respective studies.
  • the methods disclosed herein use Beta-Binomial distribution to statistically model the per-variant background error rate. Furthermore, the methods disclosed herein use the variant-specific background error rate to determine whether a variant is a true variant.
  • the variant-specific background error rate generated using the Beta-Binomial distribution has certain advantages. First, as described above, the methods allow the calculation of a p-value for each variant, which will be the probability that the variant call is an error. Second, Beta-Binomial is a count-based distribution, so it dynamically adjusts to the total number of cells in a sample. The number of cells where a false positive variant is observed roughly scales with the total number of cells in the sample and therefore, the Beta-Binomial distribution captures this adjustment.
  • Beta-Binomial allows for over-dispersion, which enables it to be more flexible in modeling different error distribution shapes.
  • the specific processes for the background error rate generated from the Beta-Binomial distribution in identifying true variants are further described in detail below.
  • FIG. 4B depicts a flow chart of an example method 440 for identifying true variants based on the background error rate generated based on the Beta-Binomial distribution.
  • Step 442 Obtain a set of candidate variants from a heterogeneous population of cells.
  • the candidate variants may be identified from a cell sample, as described earlier.
  • the cell sample may include variants determined from sequencing data from a number of heterogeneous cells.
  • Some of the cells may include rare variants, e.g., cancer cells with rare variants indicative of MRD.
  • Step 444 Apply a first set of variant filters to identify a set of variants from candidate variants.
  • a first set of variant filters may be also applied to filter out potential false positives before the background error rate-based true variant identification.
  • the first set of variant filters may be similar to, or different from, the filters applied earlier in the co-occurrence-based true variant identification.
  • the first set of filters for the method 440 may include a depth of coverage filter, a genotype quality filter, and a genotyped cell percentage filter regarding a percentage of cells that are genotyped for a genomic position of a variant.
  • the threshold values for each filter may be similarly configured or selected. In one example, the genotyped cell percentage threshold may be set to at least 10%, 20%, 30%, 40%, 50%, or 60%.
  • Step 446 For a variant included in the set of candidate variants, generate a statistical significance by using a beta-binomial distribution with parameters generated from a set of control samples.
  • the Beta-Binomial distribution is specifically generated for that variant based on the parameters estimated from a set of control samples. These estimated parameters can be then used to calculate the statistical significance (e.g., p-value) for that variant.
  • the specific processes for generating the parameters for the Beta-Binomial distribution for a specific variant are further described in FIG. 4C. As will be described later, the parameters generated for a specific variant are normally for a variant found in the control samples. Accordingly, when a new variant is subject to the Beta-Binomial test for statistical significance, it may first check whether the new variant can be found in a control sample.
  • if so, the parameters may have already been estimated for that new variant (or the parameters can be instantly generated from the control samples if they are not readily available).
  • if not, the Beta-Binomial distribution test may use average parameters obtained by averaging the parameters estimated for all available variants (e.g., all variants included in the control samples). In this way, the parameters for each specific variant can be obtained, which can then be used to calculate the p-value for that specific variant.
  • the detection limit can be lowered by taking into account the background error rate specific to that variant during the true variant identification.
  • Step 448 Determine a subset of variants based on statistical significances generated for a plurality of variants included in the set of candidate variants.
  • certain false positives can be further filtered out. For example, if the calculated p-value for a variant is larger than a threshold, the variant can be filtered out.
  • the p-value threshold may be set based on the knowledge obtained from ground-truth studies or other similar studies. In one example, the p-value threshold may be set to 0.0001, which means that variants having a calculated p-value larger than 0.0001 can be filtered out.
  • the p-value can be set to a value other than 0.0001, such as 0.00005, 0.00006, 0.00007, 0.00008, 0.00009, 0.0002, 0.0003, 0.0004, 0.0005, 0.0006, 0.0007, 0.0008, 0.0009, 0.001, or another different value.
  • a subset of candidate variants can be obtained after filtering out certain variants based on the calculated statistical significance.
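  • The per-variant significance test can be sketched with the standard Beta-Binomial probability mass function. The sketch assumes the p-value is the upper tail P(X ≥ k), that is, the probability that background error alone places the variant in k or more of n cells. The text does not spell out this tail, and the function names are hypothetical.

```python
from math import exp, lgamma

def _log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binomial_pmf(k, n, alpha, beta):
    """P(X = k) for X ~ BetaBinomial(n, alpha, beta)."""
    log_comb = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_comb + _log_beta(k + alpha, n - k + beta)
               - _log_beta(alpha, beta))

def background_p_value(k, n, alpha, beta):
    """Assumed p-value: P(X >= k) under the variant's background error model."""
    tail = sum(beta_binomial_pmf(i, n, alpha, beta) for i in range(k, n + 1))
    return min(1.0, tail)

def keep_variant(k, n, alpha, beta, threshold=1e-4):
    """Retain the variant when its p-value does not exceed the threshold."""
    return background_p_value(k, n, alpha, beta) <= threshold
```

  • Because the tail is computed over cell counts, the same parameters yield different p-values for samples with different total numbers of cells, which is the scaling behavior described above.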
  • Step 450 Apply a second set of filters to obtain a subset of candidate variants.
  • the candidate variants obtained based on the background error rate can be further filtered using one or more filters.
  • a filter may be further configured to require a variant to be present in at least a number of cells regardless of the calculated p-value.
  • the expected number of cells can be 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or another different value.
  • a variant needs to be present in at least 5 cells for that variant to be considered a true variant.
  • another different filter may be configured to require the variant allele frequency to be larger than a specific value to be considered a true variant.
  • the specific value may be set to 35, which means that variants with a VAF of less than 35 can be considered false positives and filtered out.
  • the specific value for the VAF can be set to a value other than 35, such as 25, 30, 40, 45, 50, 55, 60, 65, 70, or another different value.
  • the variants that have passed the second set of filters are considered true variants.
  • the methods disclosed herein for background error rate-based true variant identification further include identifying a subpopulation of cells with one or more of the true variants (e.g., cells with one or more true variants identified through the above-described processes).
  • the one or more true variants represent rare variants present in the subpopulation of cells.
  • the subpopulation of cells represents a rare cell population within a heterogeneous population of cells.
  • the subpopulation of cells represents less than 10%, less than 5%, less than 4%, less than 3%, less than 2%, or less than 1% of the cells in the heterogeneous population of cells.
  • FIG. 4C depicts an example method 460 for generating parameters for Beta-Binomial distribution for a specific variant.
  • the parameters may be generated based on a set of control samples and may be generated for a variant included in one or more of the control samples. In one embodiment, for each variant included in the set of control samples, a set of Beta-Binomial parameters may be generated.
  • Step 462 Acquire a set of control samples.
  • control samples may be collected based on the existing data already available from previous runs of single-cell sequencing (e.g., from data already available from a Tapestri® platform). Alternatively, a set of control samples may also be purposely generated to estimate parameters for Beta-Binomial distribution for different variants.
  • control samples are ideally non-cancerous samples, such as healthy bone marrow or other different tissues. This allows the control samples to contain just germline variants and have no sub-clonal mutations which make it more difficult to filter out real variants before estimating the background error rate for each variant. In various embodiments, a sufficient number of control samples are necessary to yield more robust estimates of the error rate.
  • control samples are collected for this purpose. In other examples, at least 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, or another different number of control samples are collected in this step.
  • Step 464 For each control sample, mask or remove any real variants in each sample.
  • by masking or removing real variants in each sample, only background noise variants are included in the Beta-Binomial parameter estimation. For example, the germline variants are masked or removed in this step.
  • masking refers to a process to substitute low-quality base calls with an “N”.
  • Step 466 For each control sample, apply a set of variant filters to identify a set of background variants.
  • the set of variant filters including the filter thresholds are set to be the same as the set of filters used in the background error rate-based true variant identification.
  • the set of filters may include a depth of coverage filter and a genotype quality filter. If the depth of coverage threshold used in the background error rate-based true variant identification is set to 10, the threshold for the depth of coverage in this step is also set to 10. Similarly, if the genotype quality threshold used in the background error rate-based true variant identification is set to 30, the threshold for the genotype quality in this step is also set to 30. In this way, variants remaining in each control sample can be considered background variants, or noise variants, which can thus be used to generate a background error rate for a specific variant.
  • Step 468 For each background variant, determine a quantity of cells containing the background variant in each sample.
  • the number of cells with the variant is counted in each control sample. This then leads to the generation of two vectors: (1) a first vector referring to the number of cells that have the variant in each control sample, and (2) a second vector referring to the number of total cells in each control sample. These two vectors can be then used to estimate the parameters of the Beta-Binomial distribution.
  • Step 470 Generate parameters for the beta-binomial distribution for each background variant based on the determined quantity of cells containing the background variant in each sample.
  • the parameters are generated based on the vectors generated for each specific variant. It should be noted that the parameters are only generated for the remaining variants that are false positives in Step 468. That is, the per-variant error rates are generated for remaining false positives in Step 468, and thus can be used to identify such false positives when identifying the true variants for a new sample, i.e., a sample that is not the control sample (e.g., a sample collected from a cancer patient). However, when using the background error rate to identify a true variant for a new sample, a candidate variant may not be found in the control samples. That is, a candidate may not have a corresponding per-variant error rate. In such a scenario, parameters from all the remaining false positives may be averaged, and the averaged parameters can then be used to generate a background error rate that can be applied to a new variant that is not found in the remaining false positives in Step 468.
  • the parameters for the Beta-Binomial distribution are generated by averaging all parameters from all remaining false positive variants after filtration and removal of germline variants.
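  • The text does not state which estimator is used to fit the Beta-Binomial parameters. As one possibility, the sketch below applies a method-of-moments fit to the per-sample rates derived from the two vectors (cells with the variant and total cells per control sample), and falls back to averaged parameters for a variant not seen in the controls. This simple fit ignores binomial sampling noise within each sample, so a maximum-likelihood fit may be preferable in practice. All function names are hypothetical.

```python
from statistics import mean, pvariance

def estimate_beta_binomial(cells_with_variant, total_cells):
    """Method-of-moments (alpha, beta) from per-control-sample counts.

    Returns None when the moments do not define a valid Beta distribution.
    """
    rates = [k / n for k, n in zip(cells_with_variant, total_cells)]
    m = mean(rates)
    v = pvariance(rates)
    if not 0 < m < 1 or v <= 0 or v >= m * (1 - m):
        return None
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common

def average_parameters(per_variant_params):
    """Average alpha and beta over all variants with fitted parameters."""
    fitted = [p for p in per_variant_params.values() if p is not None]
    return mean(a for a, _ in fitted), mean(b for _, b in fitted)

def parameters_for(variant, per_variant_params):
    """Use the variant's own parameters if available, else the averages."""
    params = per_variant_params.get(variant)
    if params is not None:
        return params
    return average_parameters(per_variant_params)
```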
  • the probability of whether a candidate variant is from background error or is a true variant can be then determined.
  • FIG. 4D further depicts a flow diagram for generating a per-variant background error rate.
  • the bone marrow from healthy donors may be collected to get a number of samples, e.g., 30 samples or more.
  • the single-cell sequencing is then performed, and a first set of variant filters are then applied.
  • the likely germline variants are masked or removed.
  • the remaining variants are false positives, which may be considered background errors in a variant call.
  • These false positives are then used to generate a per-variant background error rate using the Beta-Binomial distribution.
  • each variant may have a different background error rate, which can be then taken into consideration for a specific variant in the true variant identification process.
  • methods disclosed herein involve correlating an invariant filtered set of variants against orthogonal analyte data and/or all multiplex samples.
  • an invariant filtered set of genomic variants detected via single-cell DNA analysis is correlated against orthogonal analyte data.
  • although described here with respect to orthogonal protein analyte data, the same principle may be extended to other orthogonal analyte data. Both random and systematic errors should be uniform across varying analytes and samples, whereas true somatic variants can be identified by strong correlation against analytes and/or other samples.
  • rare somatic variants may appear at low frequencies (e.g., less than 1% VAF) in a limited number of cells and may be absent in other cells.
  • the cells that lack the rare somatic variant can serve as an appropriate background, such that the rare somatic variants that are present from particular cells can rise above the background level and be identified by correlating to orthogonal protein analyte data, as further described in detail below.
  • FIG. 4E is a flow chart of an example method 480 for performing de novo variant calling via multianalyte correlation, in accordance with an embodiment.
  • Step 482 Obtain a set of candidate variants determined through single-cell DNA sequencing.
  • the set of candidate variants includes true variants and a plurality of falsely called variants arising from base-calling errors, processing errors (e.g., due to PCR bias), and/or artifacts (e.g., allele dropout or sequence homology).
  • Step 484 Determine allele frequencies of the candidate variants.
  • the allele frequencies of the candidate variants can be determined from sequence reads generated via single-cell DNA sequencing.
  • Step 486 Correlate the allele frequencies of the candidate variants to the protein expression of one or more proteins.
  • this step includes determining a correlation between each candidate variant and protein expression of each protein of a plurality of proteins.
  • these proteins may be cell surface markers that represent an immunophenotype of a cell.
  • the protein expression data may be constructed based on the amplification amplitude of the oligos included in the antibody-oligo conjugates.
  • a correlation matrix may include, on the Y-axis, the plurality of candidate variants.
  • a correlation matrix may include, on the X-axis, the plurality of proteins.
  • the values of the correlation matrix may represent correlation (or anti-correlation) between each candidate variant on the Y-axis and the protein expression of each protein on the X-axis.
  • a correlation matrix may further include, on the X-axis, germline variants determined from DNA sequencing.
  • Step 488 Select a subset of candidate variants as true variants based on the correlation of allele frequencies of candidate variants and protein expression of one or more proteins.
  • selecting a subset of the candidate variants as true variants comprises identifying a candidate variant with an allele frequency that correlates with protein expression with at least a threshold number of proteins.
  • the threshold number of proteins is at least two, at least three, at least four, at least five, at least six, at least seven, at least eight, at least nine, at least ten, at least eleven, at least twelve, at least thirteen, at least fourteen, at least fifteen, at least sixteen, at least seventeen, at least eighteen, at least nineteen, or at least twenty proteins.
  • the correlation matrix described in Step 486 is used to select a subset of candidate variants as true variants.
  • the correlation values in the correlation matrix can be used to determine whether a candidate variant is correlated with the protein expression of a plurality of proteins.
  • a candidate variant is deemed to be a true variant based on its allele frequency being correlated with the protein expression of a plurality of proteins. Correlation is based on the standard deviation of correlation values for the candidate variant across the plurality of proteins.
  • a candidate variant is deemed to be a true variant if the standard deviation of correlation values across the plurality of proteins is above a threshold value.
  • a candidate variant is deemed to be a falsely called variant if the standard deviation of correlation values across the plurality of proteins is below a threshold value. For example, falsely called variants will be uncorrelated with protein expression values, and therefore, the standard deviation value of a false variant across the proteins will be below a threshold value.
  • true variants may be highly correlated with the protein expression of certain proteins and highly anti-correlated with the protein expression of other proteins.
  • a candidate variant is deemed to be correlated with protein expression of a plurality of proteins based on the standard deviation of correlation values for the candidate variant across the plurality of proteins and plurality of DNA sites.
  • a candidate variant is deemed to be a true variant if the standard deviation of correlation values across the plurality of proteins and plurality of DNA sites is above a threshold value.
  • a candidate variant is deemed to be a falsely called variant if the standard deviation of correlation values across the plurality of proteins and plurality of DNA sites is below a threshold value.
  • the threshold standard deviation value is any of 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, or 0.20.
  • the threshold standard deviation value is 0.07.
  • the threshold standard deviation value is 0.10.
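  • The selection rule in Step 488 can be sketched as follows: each candidate variant's per-cell allele frequencies are correlated against the per-cell expression of each protein, and the variant is kept when the standard deviation of its row of correlation values exceeds the threshold. The use of Pearson correlation, the dictionary layout, and the function names are assumptions made for the example; the text does not specify the correlation measure.

```python
from math import sqrt
from statistics import pstdev

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def select_true_variants(vaf_by_variant, expression_by_protein, threshold=0.07):
    """Keep variants whose correlation values vary strongly across proteins.

    Both inputs map a name to a list of per-cell values in the same cell order.
    """
    selected = []
    for variant, vafs in vaf_by_variant.items():
        # One row of the correlation matrix: this variant against every protein.
        row = [pearson(vafs, expr) for expr in expression_by_protein.values()]
        if pstdev(row) > threshold:
            selected.append(variant)
    return selected
```

  • A variant that is strongly correlated with some proteins and anti-correlated with others produces a widely spread row and is retained, while a variant uncorrelated with every protein produces a row near zero and is dropped.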
  • the subpopulation of cells represents a rare cell population within a heterogeneous population of cells. In various embodiments, the subpopulation of cells represents less than 10%, less than 5%, less than 4%, less than 3%, less than 2%, or less than 1% of the cells in the heterogeneous population of cells.
  • methods disclosed herein involve combining two or more of the above-described methods.
  • the methods disclosed herein combine a background error rate-based method with a co-occurrence-based method.
  • the identified true variants can be further subject to the co-occurrence-based method, to identify a subset of true variants.
  • the identified true variants can be further subject to the background error rate-based method, to identify a subset of true variants.
  • a subset of candidate variants may be obtained. This subset of candidate variants may be further processed via a background error rate-based method 440 through Steps 442-450.
  • some repeating steps (such as certain filtering steps that use the same set of filters with the same thresholds) in method 440 may be omitted since these steps may have already been performed in method 420.
  • a subset of candidate variants may be obtained. This subset of candidate variants may be further processed via a co-occurrence-based method 420 through Steps 422-430.
  • some repeating steps (e.g., certain filtering steps that use the same set of filters with the same thresholds) in method 420 may be omitted since these steps may have already been performed in method 440.
  • a multianalyte correlation-based method 480 can be further combined with one or more of method 420 or method 440.
  • a confidence level of identifying a true variant can be sufficiently high to make a variant call.
  • Embodiments described herein further refer to example systems and/or associated computer devices for performing the true variant identification methods described above.
  • the example systems and/or associated computer devices can be configured to implement functionalities of the cell analysis workflow device 120 and variant caller device 130, as described above in reference to FIG. 1A.
  • FIG. 5 depicts an overall system environment, in accordance with an embodiment.
  • FIG. 5 depicts a single-cell analysis workflow including the designing of a targeted panel (e.g., targeted DNA panel), sample preparation, library preparation, cell sequencing, multi-omic analysis, and software analysis.
  • the single-cell analysis workflow further includes designing a panel for an additional analyte (e.g., analyte other than DNA), such as RNA or protein analytes.
  • a protein panel can be designed and antibody-conjugated oligos are provided for performing a cell staining protocol.
  • the single-cell workflow combines both DNA and protein panels.
  • the single-cell workflow involves encapsulating and lysing cells in droplets, performing nucleic acid amplification (including target DNA and antibody-specific oligo amplification) in droplets with a cell-specific barcode, and sequencing amplicons by NGS. Cell-specific mutational and protein profiles are then reconstructed using the software. Partial details of such a single-cell workflow are described in US Patent No. 10,161,007, which is hereby incorporated by reference in its entirety.
  • the flow pipeline shown in FIG. 1A is applicable for a Tapestri ® workflow instrument.
  • FIG. 6 depicts an example computing device for implementing the system and methods described in reference to FIGS. 1A-5.
  • the example computing device 600 serves as the variant caller device 130 described in FIG. 1A for identifying true variants.
  • Examples of a computing device can include a personal computer, desktop computer, laptop, server computer, a computing node within a cluster, message processors, hand-held devices, multi-processor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, mobile telephones, PDAs, tablets, pagers, routers, switches, and the like.
  • the computing device 600 includes at least one processor 602 coupled to a chipset 604.
  • the chipset 604 includes a memory controller hub 620 and an input/output (I/O) controller hub 622.
  • a memory 606 and a graphics adapter 612 are coupled to the memory controller hub 620, and a display 618 is coupled to the graphics adapter 612.
  • a storage device 608, an input interface 614, and a network adapter 616 are coupled to the I/O controller hub 622.
  • Other embodiments of the computing device 600 have different architectures.
  • the storage device 608 is a non-transitory computer-readable storage medium such as a hard drive, compact disk read-only memory (CD-ROM), DVD, or a solid-state memory device.
  • the memory 606 holds instructions and data used by processor 602.
  • the input interface 614 is a touch-screen interface, a mouse, track ball, or other types of input interface, a keyboard, or some combination thereof, and is used to input data into the computing device 600.
  • the computing device 600 may be configured to receive input (e.g., commands) from the input interface 614 via gestures from the user.
  • the graphics adapter 612 displays images and other information on the display 618.
  • the network adapter 616 couples the computing device 600 to one or more computer networks.
  • the computing device 600 is adapted to execute computer program modules for providing the functionality described herein.
  • module refers to computer program logic configured to provide the specified functionality.
  • program modules are stored on the storage device 608, loaded into memory 606, and executed by the processor 602.
  • the types of computing devices 600 can vary from the embodiments described herein.
  • the computing device 600 can lack some of the components described above, such as graphics adapters 612, input interface 614, and displays 618.
  • a computing device 600 can include a processor 602 for executing instructions stored on a memory 606.
  • a non-transitory machine-readable storage medium, such as one described above, is provided, the medium comprising a data storage material encoded with machine-readable data which, when using a machine programmed with instructions for using said data, is capable of executing instructions for performing true variant identification methods disclosed herein.
  • Embodiments of the methods described above can be implemented in computer programs executing on programmable computers, comprising a processor, a data storage system (including volatile and non-volatile memory and/or storage elements), a graphics adapter, an input interface, a network adapter, at least one input device, and at least one output device.
  • a display is coupled to the graphics adapter.
  • Program code is applied to input data to perform the functions described above and generate output information.
  • the output information is applied to one or more output devices, in known fashion.
  • the computer can be, for example, a personal computer, microcomputer, or workstation of conventional design.
  • Each program can be implemented in a high-level procedural or object-oriented programming language to communicate with a computer system.
  • the programs can be implemented in assembly or machine language, if desired. In any case, the language can be a compiled or interpreted language.
  • Each such computer program is preferably stored on a storage media or device (e.g., ROM or magnetic diskette) readable by a general or special purpose programmable computer, for configuring and operating the computer when the storage media or device is read by the computer to perform the procedures described herein.
  • the system can also be considered to be implemented as a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner to perform the functions described herein.
  • the signature patterns and databases thereof can be provided in a variety of media to facilitate their use.
  • Media refers to a manufacture that contains the signature pattern information of the present invention.
  • the databases of the present invention can be recorded on computer readable media, e.g., any medium that can be read and accessed directly by a computer.
  • Such media include, but are not limited to: magnetic storage media, such as floppy discs, hard disc storage medium, and magnetic tape; optical storage media such as CD-ROM; electrical storage media such as RAM and ROM; and hybrids of these categories such as magnetic/optical storage media.
  • Recorded refers to a process for storing information on computer readable medium, using any such methods as known in the art. Any convenient data storage structure can be chosen, based on the means used to access the stored information. A variety of data processor programs and formats can be used for storage, e.g., word processing text file, database format, etc.
  • Disclosed herein is a method for performing de novo variant calling via multianalyte and multisample correlation.
  • methods disclosed herein are useful for distinguishing true somatic variants in a sample from more-numerous false positives by evaluating variant correlation with other analytes, such as protein expression, and/or correlation with the germline variants of multiplexed samples.
  • a method for identifying a subpopulation of cells from a heterogeneous population of cells comprising: obtaining a set of candidate variants; determining allele frequencies of the candidate variants for each of one or more cells, wherein the allele frequencies are determined through single-cell DNA sequencing of the heterogeneous population of cells; correlating determined allele frequencies of the candidate variants to protein expression of a plurality of proteins; selecting a subset of the candidate variants as true variants, wherein the subset of candidate variants are selected based on correlation of the allele frequencies of the subset of candidate variants and protein expression of one or more proteins of the plurality of proteins.
  • correlating determined allele frequencies of the candidate variants to protein expression comprises generating a correlation matrix.
  • the threshold number of proteins is at least two, at least three, at least four, at least five, at least six, at least seven, at least eight, at least nine, at least ten, at least eleven, at least twelve, at least thirteen, at least fourteen, at least fifteen, at least sixteen, at least seventeen, at least eighteen, at least nineteen, or at least twenty proteins.
  • methods disclosed herein further comprise identifying a subpopulation of cells from the heterogeneous population of cells, the subpopulation of cells comprising one or more of the true variants.
  • the subpopulation of cells represents less than 1% of cells in the heterogeneous population of cells.
  • the heterogeneous population of cells represents a pooled sample comprising a plurality of cell samples.
  • the plurality of cell samples are distinguishable by germline variants that correlate with the true variants.
  • the correlation is based on a standard deviation of correlation values for the candidate variant across the plurality of proteins.
  • the standard deviation of correlation values further encompasses correlation values for the candidate variant across a plurality of DNA sites.
  • the candidate variant is correlated if the standard deviation of correlation values is greater than a threshold value. In various embodiments, the candidate variant is not correlated if the standard deviation of correlation values is less than a threshold value.
  • the threshold value is any of 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, or 0.20.
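A minimal sketch of the standard-deviation criterion described above, assuming per-cell allele frequencies and per-cell protein expression values are available as equal-length lists. The function names, the protein labels, and the use of Pearson correlation are illustrative assumptions and are not stated in the disclosure; the default threshold of 0.05 is one of the listed values.

```python
import math
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def correlation_spread(vaf, protein_expression):
    """Standard deviation of the variant's correlation values across proteins."""
    corrs = [pearson(vaf, expr) for expr in protein_expression.values()]
    return statistics.pstdev(corrs)


def is_correlated(vaf, protein_expression, threshold=0.05):
    """A candidate is treated as correlated when the spread exceeds the threshold."""
    return correlation_spread(vaf, protein_expression) > threshold


# Hypothetical data: 8 cells, 3 proteins (labels are placeholders).
proteins = {
    "protein_A": [9, 8, 9, 1, 2, 1, 1, 2],
    "protein_B": [1, 2, 1, 8, 9, 9, 8, 9],
    "protein_C": [5, 5, 4, 5, 5, 4, 5, 5],
}
clonal_vaf = [50, 50, 50, 0, 0, 0, 0, 0]   # tracks a protein-defined subpopulation
flat_vaf = [2, 2, 2, 2, 2, 2, 2, 2]        # uniform background signal
```

A variant confined to a protein-defined subpopulation correlates strongly with some proteins, anti-correlates with others, and is near zero for the rest, so its correlation values are widely spread. A candidate whose correlations are all near zero has a small spread and falls below the threshold.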
  • Example 1 Example Method of Identifying True Variants Based on Co-occurrence
  • Example method 1 relates to a method of identifying true variants based on the co-occurrence of two variants in a same cell in a number of cells.
  • a sample had tens of thousands of rare variants, with the vast majority occurring in less than 10 cells.
  • the sample without filtering and multianalyte and multisample correlation-based variant calls was anticipated to potentially yield too many false positives.
  • the following optimizations were implemented to minimize the false positive rate and improve computational speed.
  • ad hoc filters were imposed, an example of which included that the variants in a pair of co-occurring variants are to be at least 100 base pairs from each other.
  • the Binomial test was optimized and performed on all possible combinations of variants. In most samples, there are tens of thousands of variants, and performing a statistical test on all possible combinations is very computationally inefficient. To speed up the computation, the mathematical concepts of the sparse matrices and multiplication of adjacency matrices were leveraged to achieve a dramatic speed up in computation. The computation included the testing of the statistical significance of the co-occurrence of variant pairs using the Binomial test, the results of which were then used to determine whether a variant pair was true variant pair or not a true variant based on the tested statistical significance.
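The sparse counting idea can be sketched as follows. For a binary cell-by-variant matrix X, the off-diagonal entries of the product of X transposed with X give the number of cells in which each variant pair co-occurs; the sketch below computes the same counts while visiting only the nonzero entries. The input layout (a mapping from cell barcode to the set of variants called in that cell) and all names are assumptions made for illustration, not the implementation used in the example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(cell_variants):
    """Count, for every variant pair, the cells in which both variants are present.

    cell_variants maps a cell identifier to the set of variants called in that cell.
    Only pairs that actually co-occur in at least one cell are stored, which keeps
    the work proportional to the nonzero entries rather than to all possible pairs.
    """
    pair_counts = Counter()
    for variants in cell_variants.values():
        for a, b in combinations(sorted(variants), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


# Hypothetical input: three cells and three variants.
cells = {
    "cell_1": {"v1", "v2"},
    "cell_2": {"v1", "v2", "v3"},
    "cell_3": {"v3"},
}
counts = cooccurrence_counts(cells)
```

Because rare variants are present in only a few cells each, most variant pairs never share a cell and never enter the counter, so only the observed pairs need to be passed on to the statistical test.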
  • FIG. 7 depicts example outcomes of the statistical significance test for different variant pairs. Three pairs of variants were tested. In the left panel (two vertical columns with 10 rows, each row representing data from one cell and each column representing data from one variant) shown in FIG. 7, in the 10 cells tested, the pair of variants co-occurred in three cells and neither of them occurred alone in any other cell. The Binomial test indicated that there was statistical significance, and thus the two variants in this variant pair were classified as true variants. In the middle panel shown in FIG. 7, each variant in the evaluated variant pair occurred in three of ten cells. However, these two variants did not co-occur in any of the ten cells, and thus the Binomial test indicated there was no statistical significance for the co-occurrence test.
  • the two variants are not likely true variants.
  • one variant occurred in three cells out of the ten tested cells, while the other variant in the variant pair occurred in all ten cells. While the variant pair co-occurred in three cells out of the ten tested cells, the variant pair still did not show statistical significance based on the Binomial test, which means that the variant pairs are not likely true variants.
  • the Binomial test can filter out cases where one variant is rare and the other is not (e.g., if one variant is in 1% of cells and the other is in 80%, then co-occurrence most likely happens by chance).
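A minimal sketch of a one-sided Binomial co-occurrence test applied to the three cases of FIG. 7. The null model here is an assumption, since the text does not spell one out: among the cells carrying the rarer variant, each cell is taken to carry the other variant independently with probability equal to that variant's overall cell frequency. Function and parameter names are hypothetical.

```python
from math import comb


def binomial_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def cooccurrence_pvalue(n_cells, n_a, n_b, n_both):
    """One-sided p-value for observing at least n_both co-occurrence cells.

    Assumed null: each of the cells carrying the rarer variant also carries the
    more common variant with probability equal to its overall frequency.
    """
    n_rare, n_common = min(n_a, n_b), max(n_a, n_b)
    return binomial_sf(n_both, n_rare, n_common / n_cells)


# The three variant pairs of FIG. 7, each evaluated over 10 cells.
left = cooccurrence_pvalue(10, 3, 3, 3)     # both in the same 3 cells
middle = cooccurrence_pvalue(10, 3, 3, 0)   # each in 3 cells, never together
right = cooccurrence_pvalue(10, 3, 10, 3)   # one in 3 cells, the other in all 10
```

Under this assumed null, only the left pair yields a small p-value. The right pair cannot be significant because a variant present in every cell co-occurs with any other variant by construction, which mirrors the filtering behavior described above.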
  • Example 2 Example Method of Identifying True Variants Based on Background Error Rate
  • Example method 2 relates to a method of identifying true variants based on the background error rate.
  • the example method started with the identification of the probability distribution that best fits the background error distribution.
  • Many different methods were attempted for modeling error distributions, such as the Binomial distribution, Normal distribution, percentile-based approach, and others.
  • the Beta-Binomial approach was found to perform the best in modeling the per-variant background error distribution and minimizing potential false positives.
  • the control samples included at least 500 cells. Including too few cells in a sample (e.g., fewer than 500 cells) led to artificially inflated background error rates for the variants. In addition, it was found that the estimated number of control samples achieved a robust per-variant error estimation. Further, additional ad hoc filters were implemented after de novo variant identification based on the background error rate. These ad hoc filters included that a variant is present in at least 5 cells and that the average variant allele frequency is at least 35.
  • FIG. 8 depicts an example limit of detection profiles with or without using the background error rate.
  • the limit of detection is the minimum percentage of cells a variant must be present in.
  • the existing thresholds have a LOD of 1%.
  • the straight dotted line indicates the LOD is 1% when the background error rate-based variant evaluation is not applied (e.g., in an existing method, which is also referred to as the “control method”).
  • the black line for the method disclosed herein indicates the LOD greatly decreased when the background error rate-based true variant identification was applied. As can be seen, the LOD was greatly reduced for most variants, with over 60,000 tested variants having a LOD of 0.2% or less.
  • a variant that has a mutation rate of 0.2% or above can be effectively identified (e.g., correctly classified as a true variant) after the background error rate was estimated for the variant.
  • the higher detection limit on the right part of the black line corresponds to error-prone variants that generally show more mutations within cells.
  • FIG. 9A depicts the sensitivity of the aforementioned two methods, i.e., background error rate-based method (or simply “background error rate method”) or co-occurrence-based method (or simply “co-occurrence method”).
  • Three sets of data were tested using one method (e.g., background error rate method) or using both methods.
  • the median sensitivity of the background error rate method alone and the median sensitivity of using two methods together increased from about 63% to about 80% when there are a large number of samples (e.g., 128 samples).
  • the sensitivity for the other two data sets that have a limited number of samples also showed improved sensitivity (e.g., around 42%-47%) when compared to the control method (e.g., around 26%).
  • Data set 2 included rare variants in AML minimum residual disease samples.
  • FIG. 9A shows that the control method could not detect these rare variants, and had a sensitivity close to 0.
  • the background error rate method + co-occurrence method showed a median sensitivity of 40%, a clear performance improvement when compared to the control method. Background error rate method alone also showed some sensitivity. All of these indicate that the above-described two methods (used independently or jointly) greatly improve the sensitivity in the variant call (e.g., detecting rare variants).
  • FIG. 9B further depicts the specificity of the aforementioned two methods.
  • the same three sets of data were tested for specificity.
  • the results shown in FIG. 9B show that the specificity for one method (e.g., background error rate method) or for two methods used together (e.g., background error rate method + co-occurrence method) were comparable to the control sample. Namely, each of the methodologies achieved greater than 99.5% specificity.
  • Example 3 Example Method of Identifying True Variants Based on Correlation of DNA and Proteins
  • Example method 3 relates to a method of identifying true variants based on the correlation between the DNA and protein expressions. The specific processes of determining DNA and protein expression correlation are described earlier. Some example outcomes are further described below.
  • FIG. 10A depicts one example of correlating DNA and protein information for identifying the presence of true variants.
  • FIG. 10B depicts another example of correlating DNA and protein information for identifying the presence of true variants.
  • the correlation between each candidate somatic variant (y-axis) and each protein expression (x-axis, left), and each other DNA variant (x-axis, right) was represented with small square blocks with darkness/brightness indicating the correlation value. The darker or the brighter the block (when compared to gray blocks where most other blocks show), the greater correlation. In the figure, the dark blocks indicate the anti-correlation while the bright blocks indicate the correlation. From FIGS.
  • FIG. 10C depicts an additional example of correlating DNA and protein information for identifying the presence of true variants.
  • the correlation matrix was generated by performing single-cell DNA sequencing and single-cell proteomics on a pooled population of cells. Analysis of a pooled population of cells is valuable because rare somatic variants likely only occur in 1 sample and not across all samples. Therefore, candidate variants that appear across multiple samples are likely falsely called variants. In this sense, the other samples, which do not include a rare somatic variant, are present in the analysis as controls.
  • the presence of multiplexed samples leads to true variants with a strong correlation with both protein (left) and supposed germline variants (right). False positives did not correlate with many variants or any proteins.
  • FIG. 10D depicts an additional example of correlating DNA and protein information for identifying the presence of true variants at variant allele frequencies (VAF) below 1%.

Abstract

Described herein are true variant identification methods via multianalyte and multisample correlations. Generally, the first step of true variant identification involves applying a first machine learned model to identify one or more correlations of a candidate variant with other analytes or variants within the same or different cells or samples. The second step of true variant identification involves applying a second machine learned model to classify a variant to be a true variant or false positive based on the identified correlations. Such improved true variant identification methods facilitate the identification of signatures of measurable residual diseases at a more granular and accurate level.

Description

TRUE VARIANT IDENTIFICATION VIA MULTIANALYTE AND MULTISAMPLE CORRELATION
CROSS REFERENCE
[0001] This application claims the benefit of and priority to U.S. Provisional Patent Application No. 63/327,966 filed April 6, 2022, the entire disclosure of which is hereby incorporated by reference in its entirety for all purposes.
BACKGROUND
[0002] Detecting rare variants is valuable for identifying rare cells in samples that contain signatures of the measurable residual disease (MRD). Detecting rare variants from single-cell sequencing data is challenging because these somatic variants are often rare and below the variant-allele frequency (VAF) background. In addition, there are many thousands of false positive variants that occur at low frequencies (typically less than 1%). Thus, distinguishing false positive variants from true rare variants is difficult. These false positives have several causes, an example of which stems from polymerase chain reaction (PCR) and/or next generation sequencing (NGS) base-calling errors (1-100% VAF for some loci) that arise near the end of amplicons, near repeat regions, and amplification of DNA from single cells. Thus, there is a need for improved methods for identifying true variants in cells that are present at low frequencies.
SUMMARY
[0003] Described herein are embodiments for improved true variant identification via multianalyte and multisample correlation through a two-step process involving 1) detecting one or more of multianalyte and multisample correlations associated with a candidate variant and 2) performing a variant calling process to determine whether the candidate variant is a true variant or not based on the detected one or more of multianalyte and multisample correlations. Embodiments described herein offer some benefits or advantages over other existing variant calling processes due to the improved variant calling performance.
[0004] For example, in one multianalyte and multisample correlation detection, by modeling the background error rate through a set of control samples, the embodiments disclosed herein offer features and benefits including: 1) Yield a threshold that is dynamically determined for each variant. Leveraging the background error rate enables a lower limit of detection for some variants, while keeping a higher threshold for variants that are more error prone, thereby reducing false positives. 2) Leverage control samples for estimating background error rate for individual variants. Thus, these individual control samples provide an additional utility beyond what a single sample can provide on its own. 3) Use the Beta-Binomial distribution to statistically model the per-variant background error rate. The Beta-Binomial distribution offers several distinct advantages: a) it allows the calculation of a p-value for each variant, which reflects the probability that the variant is an error; b) Beta-Binomial is a count-based distribution, so it dynamically adjusts to the total number of cells in a sample, which is important because the number of cells a false positive variant is observed in roughly scales with the total number of cells in the sample. Furthermore, unlike the Binomial distribution (also count-based), the Beta-Binomial allows for over-dispersion, which enables it to be more flexible in modeling different error distribution shapes.
[0005] For another example, in another multianalyte and multisample correlation detection, by testing the statistical significance of co-occurrence for two variants that co-occur in the same cells, the embodiments disclosed herein offer additional features and benefits including: 1) Leverage the single-cell nature of single-cell sequencing data by looking for variants that co-occur in the same cells in a statistically significant way. The error rate of co-occurring variants is much lower compared to individual variants, so the theoretical limit of detection for co-occurring variants is lower than that of individual variants. 2) Use a binomial test for testing the statistical significance of co-occurring variants. This not only identifies whether two rare variants co-occur in a significant way, but also filters out cases where one variant is rare and the other is not (e.g., if one variant is in 1% of cells and the other is in 80%, then most co-occurrences would be expected by chance).
[0006] It should be noted that the features and benefits described herein are not all-inclusive, and many additional features and benefits will be apparent to one of ordinary skill in the art in view of the following descriptions of specific embodiments.
[0007] Disclosed herein is a method for identifying a subpopulation of cells from a heterogeneous population of cells, the method comprising: obtaining a set of candidate variants determined through a single-cell analysis workflow; for a variant pair included in the set of candidate variants, determining a quantity of co-occurrence cells where both variants in the variant pair co-occur in each of the co-occurrence cells; determining a set of variant pairs based on quantities of co-occurrence cells determined for a plurality of variant pairs included in the set of candidate variants; and identifying a subset of candidate variants as true variants based on the determined set of variant pairs.
[0008] In various embodiments, before determining the quantity of co-occurrence cells, the method further comprises applying a first set of variant filters to identify a set of variants from the set of candidate variants. In various embodiments, both variants included in a variant pair, for determining the quantity of co-occurrence cells, are from the set of variants identified after applying the first set of variant filters. In various embodiments, applying the first set of variant filters comprises applying a depth of coverage threshold regarding a depth of coverage of a variant in a cell. In various embodiments, the depth of coverage threshold is at least 6, 8, 10, 12, 14, 16, 18, or 20 reads for a cell. In various embodiments, applying the first set of variant filters comprises applying a genotype quality threshold regarding a genotype quality of a variant. In various embodiments, the genotype quality threshold is at least 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, or 80. In various embodiments, applying the first set of variant filters comprises applying a cell number threshold regarding a number of cells where a variant is present. In various embodiments, the cell number threshold is at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20, 25, 30, 35, 40, 45, or 50 cells. In various embodiments, applying the first set of variant filters comprises applying a variant allele frequency threshold regarding variant allele frequency of a variant. In various embodiments, the variant allele frequency threshold is at least 30, 35, 40, 45, 50, 55, or 60.
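The first set of variant filters can be sketched as below. This is one possible reading, stated as an assumption: depth of coverage, genotype quality, and variant allele frequency are checked per cell-level call, and a variant is kept when enough cells carry a passing call. The record layout, the names, and the default thresholds (each chosen from the listed values) are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """One variant call in one cell (hypothetical record layout)."""
    variant: str
    cell: str
    depth: int      # reads covering the site in this cell
    gq: int         # genotype quality of the call
    vaf: float      # variant allele frequency in this cell, in percent


def filter_variants(calls, min_depth=10, min_gq=30, min_vaf=35.0, min_cells=3):
    """Return the variants supported by at least min_cells passing calls."""
    cells_per_variant = {}
    for c in calls:
        if c.depth >= min_depth and c.gq >= min_gq and c.vaf >= min_vaf:
            cells_per_variant.setdefault(c.variant, set()).add(c.cell)
    return {v for v, cells in cells_per_variant.items() if len(cells) >= min_cells}


# Hypothetical calls: v1 passes in three cells, v2 has only low-depth support.
calls = [
    Call("v1", "c1", 25, 60, 48.0),
    Call("v1", "c2", 18, 45, 52.0),
    Call("v1", "c3", 12, 35, 40.0),
    Call("v2", "c1", 4, 50, 50.0),
    Call("v2", "c2", 5, 50, 45.0),
    Call("v2", "c3", 3, 50, 55.0),
]
kept = filter_variants(calls)
```

Only variants that survive this step would be paired and passed to the co-occurrence test, which also reduces the number of pairs to evaluate.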
[0009] In various embodiments, determining the quantity of co-occurrence cells comprises determining at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20, 25, 30, 35, 40, 45, or 50 cells for a variant pair. In various embodiments, determining the quantity of co-occurrence cells comprises determining co-occurrence cells based on a statistical significance of co-occurrence of both variants of the variant pair in a same cell. In various embodiments, determining co-occurrence cells based on the statistical significance comprises determining co-occurrence cells based on a one-sided Binomial test. In various embodiments, determining co-occurrence cells based on the one-sided Binomial test comprises determining co-occurrence cells having a statistical significance of p-value of less than 0.03, 0.02, 0.01, 0.009, 0.008, 0.007, 0.006, 0.005, 0.004, 0.003, 0.002, or 0.001 based on the one-sided Binomial test.
[0010] In various embodiments, after determining the set of variant pairs, the method further comprises applying a second set of variant filters to identify a subset of variant pairs. In various embodiments, identifying the subset of candidate variants as true variants comprises identifying the subset of candidate variants based on the identified subset of variant pairs. In various embodiments, applying the second set of variant filters comprises applying an average variant allele frequency threshold regarding an average variant allele frequency for two variants of a variant pair. In various embodiments, the average variant allele frequency threshold is at least 25, 30, 35, 40, 45, 50, 55, or 60. In various embodiments, applying the second set of variant filters comprises applying a genomic distance threshold regarding a genomic distance between two variants of a variant pair. In various embodiments, the genomic distance threshold is at least 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, or 200 base pairs apart from each other.
[0011] In various embodiments, the method further comprises identifying a subpopulation of cells from the heterogeneous population of cells, the subpopulation of cells comprising one or more of the true variants. In various embodiments, the subpopulation of cells represents less than 1% of cells in the heterogeneous population of cells. In various embodiments, a sensitivity for identifying a variant as a true variant is at least 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, or 0.3. In various embodiments, a specificity for identifying a variant as a false positive variant is at most 0.998, 0.997, 0.996, 0.995, 0.994, 0.993, 0.992, or 0.991. In various embodiments, the heterogeneous population of cells are from measurable residual disease (MRD) samples.
[0012] In various embodiments, the method further comprises, for each variant included in the subset of candidate variants, generating a statistical significance by using a Beta-Binomial distribution with parameters generated from a set of control samples; determining a sub-subset of candidate variants based on the generated statistical significance for each variant included in the subset of candidate variants; for each variant of the sub-subset of candidate variants, applying a third set of variant filters; and identifying a sub-sub-subset of candidate variants as true variants based on the application of the third set of variant filters.
[0013] Disclosed herein is another method for identifying a subpopulation of cells from a heterogeneous population of cells, the method comprising: obtaining a set of candidate variants determined through a single-cell analysis workflow; for a variant included in the set of candidate variants, generating a statistical significance by using a Beta-Binomial distribution with parameters generated from a set of control samples; determining a set of variants based on statistical significances generated for a plurality of variants included in the set of candidate variants; and identifying a subset of candidate variants as true variants based on generated statistical significances for the plurality of variants.
[0014] In various embodiments, the parameters for the Beta-Binomial distribution are generated by: acquiring a plurality of control samples; removing germline variants from each control sample; for a control sample, applying a first set of variant filters to identify a set of background variants; for a background variant, determining a quantity of cells containing the background variant in each sample; and generating parameters for the Beta-Binomial distribution for the background variant based on the determined quantity of cells containing the background variant in each sample.
[0015] In various embodiments, before generating a statistical significance by using a Beta-Binomial distribution, the method further comprises applying a second set of variant filters to identify a set of variants from the set of candidate variants. In various embodiments, the variant for generating a statistical significance is a variant included in the identified set of variants. In various embodiments, generating a statistical significance by using a Beta-Binomial distribution comprises generating the statistical significance using the parameters of the Beta-Binomial distribution when a variant included in the set of candidate variants is a background variant. In various embodiments, generating a statistical significance by using a Beta-Binomial distribution comprises generating the statistical significance using averaging parameters generated from a plurality of background variants when a variant included in the set of candidate variants is not a background variant. In various embodiments, the parameters for the Beta-Binomial distribution for the background variant are generated by using two vectors determined based on the determined quantity of cells containing the background variant in each sample. In various embodiments, a first vector of the two vectors comprises a quantity of cells that have the background variant in each control sample. In various embodiments, a second vector of the two vectors comprises a quantity of total cells in each control sample.
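The following is a minimal Python sketch of how the two vectors described above could yield Beta-Binomial parameters and a statistical significance. It is illustrative only and rests on assumptions not stated in the text: the parameters are estimated by simple moment matching on per-sample rates (the disclosure does not name an estimator), the significance is taken as the upper-tail probability of observing at least the detected number of variant-containing cells, and the function names and the variant identifier are hypothetical.

```python
import math


def fit_beta_binomial(variant_cells, total_cells):
    """Moment-matching estimate of (alpha, beta) for one background variant.

    variant_cells[i]: cells carrying the variant in control sample i (first vector).
    total_cells[i]:   total cells in control sample i (second vector).
    Assumption: per-sample rates are treated as draws from a Beta distribution.
    """
    rates = [k / n for k, n in zip(variant_cells, total_cells)]
    mean = sum(rates) / len(rates)
    var = sum((r - mean) ** 2 for r in rates) / (len(rates) - 1)
    if not 0 < var < mean * (1 - mean):
        raise ValueError("rates are too uniform or too dispersed for moment matching")
    common = mean * (1 - mean) / var - 1
    return mean * common, (1 - mean) * common


def _log_beta(x, y):
    return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)


def beta_binomial_pmf(k, n, alpha, beta):
    """P(X = k) for X ~ Beta-Binomial(n, alpha, beta), computed in log space."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_choose + _log_beta(k + alpha, n - k + beta) - _log_beta(alpha, beta))


def p_value(k, n, alpha, beta):
    """Upper-tail probability P(X >= k): k variant cells out of n total cells."""
    return min(1.0, sum(beta_binomial_pmf(i, n, alpha, beta) for i in range(k, n + 1)))


def params_for(variant, background_params):
    """Use the variant's own parameters if it is a background variant;
    otherwise average the parameters over all background variants."""
    if variant in background_params:
        return background_params[variant]
    alphas, betas = zip(*background_params.values())
    return sum(alphas) / len(alphas), sum(betas) / len(betas)


# Hypothetical control data: one background variant seen across five control samples.
background_params = {
    "chr1:100:A>G": fit_beta_binomial([2, 1, 3, 0, 2], [1000, 1200, 900, 1100, 1000])
}
alpha, beta = params_for("chr1:100:A>G", background_params)
p = p_value(50, 1000, alpha, beta)  # 50 variant cells among 1000 cells in a test sample
```

A maximum-likelihood fit could replace the moment-matching step without changing the rest of the sketch; moment matching is used here only because it needs no external libraries.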
[0016] In various embodiments, applying the first set of variant filters comprises applying a depth of coverage threshold regarding a depth of coverage associated with a variant. In various embodiments, the depth of coverage threshold is at least 6, 8, 10, 12, 14, 16, 18, or 20 reads for a variant. In various embodiments, applying the first set of variant filters comprises applying a genotype quality threshold regarding a genotype quality associated with a variant. In various embodiments, the genotype quality threshold is at least 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, or 80. In various embodiments, the second set of variant filters comprises the depth of coverage threshold. In various embodiments, the second set of variant filters comprises the genotype quality threshold. In various embodiments, the second set of variant filters comprises a genotyped cell percentage threshold regarding a percentage of cells that are genotyped for a genomic position of a variant. In various embodiments, the genotyped cell percentage threshold is at least 10%, 20%, 30%, 40%, 50%, or 60%. In various embodiments, identifying a subset of candidate variants as true variants based on generated statistical significances comprises identifying one or more variants that have a p-value smaller than a p-value threshold. In various embodiments, the p-value threshold is at most 0.0005, 0.0004, 0.0003, 0.0002, 0.0001, 0.00009, 0.00008, 0.00007, 0.00006, or 0.00005.
[0017] In various embodiments, after determining the set of variants based on the generated statistical significances, the method further comprises applying a third set of variant filters. In various embodiments, applying the third set of variant filters comprises applying a cell quantity threshold regarding a quantity of cells in which a variant is present. In various embodiments, the cell quantity threshold is at least 3, 4, 5, 6, 7, 8, 9, or 10 cells. In various embodiments, applying the third set of variant filters comprises applying an average variant allele frequency threshold regarding a variant allele frequency average for a variant. In various embodiments, the average variant allele frequency threshold is at least 20, 25, 30, 35, 40, 45, 50, 55, or 60. In various embodiments, variants remaining in each control sample after applying the first set of variant filters are false positive variants. In various embodiments, there are at least 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, or 200 control samples. In various embodiments, the control samples are non-cancerous samples. In various embodiments, the control samples are bone marrow samples from healthy subjects.
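As an illustration of how the second set of filters, the p-value threshold, and the third set of filters could be chained, a minimal Python sketch follows. The record type, field names, and function name are hypothetical. The default thresholds are single example values picked from the ranges listed above, the average variant allele frequency is assumed to be expressed in percent, and the p-value is assumed to have been computed beforehand.

```python
from dataclasses import dataclass


@dataclass
class VariantStats:
    name: str
    depth: int                  # reads covering the variant
    genotype_quality: float     # genotype quality score
    pct_cells_genotyped: float  # percent of cells genotyped at the variant position
    p_value: float              # significance from the Beta-Binomial test
    n_cells: int                # cells in which the variant is present
    mean_vaf: float             # average variant allele frequency (assumed percent)


def call_true_variants(variants, min_depth=10, min_gq=30, min_pct_genotyped=50.0,
                       max_p=0.0001, min_cells=3, min_mean_vaf=20.0):
    # Second set of filters: depth of coverage, genotype quality, genotyped cell percentage.
    second = [v for v in variants
              if v.depth >= min_depth
              and v.genotype_quality >= min_gq
              and v.pct_cells_genotyped >= min_pct_genotyped]
    # Statistical significance: keep variants with a p-value smaller than the threshold.
    significant = [v for v in second if v.p_value < max_p]
    # Third set of filters: cell quantity and average variant allele frequency.
    return [v for v in significant
            if v.n_cells >= min_cells and v.mean_vaf >= min_mean_vaf]


# Hypothetical candidates: only "v1" passes every stage.
candidates = [
    VariantStats("v1", 25, 60, 80.0, 1e-6, 12, 45.0),
    VariantStats("v2", 4, 60, 80.0, 1e-6, 12, 45.0),    # fails depth of coverage
    VariantStats("v3", 25, 60, 80.0, 0.01, 12, 45.0),   # fails p-value threshold
    VariantStats("v4", 25, 60, 80.0, 1e-6, 2, 45.0),    # fails cell quantity
]
true_variants = [v.name for v in call_true_variants(candidates)]
```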
[0018] In various embodiments, the method further comprises identifying a subpopulation of cells from the heterogeneous population of cells, the subpopulation of cells comprising one or more of the true variants. In various embodiments, the subpopulation of cells represents less than 1% of cells in the heterogeneous population of cells. In various embodiments, a sensitivity for identifying a variant as a true variant is at least 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, or 0.3. In various embodiments, a specificity for identifying a variant as a false positive variant is at most 0.998, 0.997, 0.996, 0.995, 0.994, 0.993, 0.992, or 0.991. In various embodiments, the heterogeneous population of cells are from MRD samples.
[0019] In various embodiments, the method further comprises: for a variant pair included in the subset of candidate variants, determining a quantity of co-occurrence cells where both variants in the variant pair co-occur in each co-occurrence cell; determining a set of variant pairs based on quantities of co-occurrence cells determined for a plurality of variant pairs included in the subset of candidate variants; applying a fourth set of variant filters to identify a subset of variant pairs; and identifying a sub-subset of candidate variants as true variants based on the determined subset of variant pairs.
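The co-occurrence counting step described above can be sketched in Python as follows. This is a simplified illustration: the input layout (a mapping from cell barcode to the set of variants detected in that cell), the function names, and the `min_cells` value are assumptions, and a plain count threshold stands in for whatever selection criterion or fourth set of variant filters a given embodiment applies.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrence(cell_variants):
    """Count, for every variant pair, the cells in which both variants occur."""
    counts = Counter()
    for variants in cell_variants.values():
        # Sort so that each unordered pair maps to a single key.
        for pair in combinations(sorted(variants), 2):
            counts[pair] += 1
    return counts


def variants_supported_by_pairs(counts, min_cells=3):
    """Return variants belonging to at least one pair seen in min_cells or more cells."""
    supported = set()
    for (a, b), n in counts.items():
        if n >= min_cells:
            supported.update((a, b))
    return supported


# Hypothetical per-cell calls: variants A and B co-occur in three cells.
cells = {
    "cell1": {"A", "B"},
    "cell2": {"A", "B", "C"},
    "cell3": {"A", "B"},
    "cell4": {"C"},
    "cell5": {"A"},
}
pair_counts = count_cooccurrence(cells)
supported = variants_supported_by_pairs(pair_counts)
```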
[0020] Disclosed herein is another method for identifying a subpopulation of cells from a heterogeneous population of cells, the method comprising: obtaining a set of candidate variants; determining allele frequencies of the candidate variants for each of one or more cells, wherein the allele frequencies are determined through single-cell DNA sequencing of the heterogeneous population of cells; correlating determined allele frequencies of the candidate variants to protein expression of a plurality of proteins; and selecting a subset of candidate variants as true variants, wherein the subset of candidate variants are selected based on correlation of the allele frequencies of the subset of candidate variants and protein expression of one or more proteins of the plurality of proteins.
[0021] In various embodiments, correlating determined allele frequencies of the candidate variants to protein expression comprises generating a correlation matrix. In various embodiments, the threshold number of proteins is at least two, at least three, at least four, at least five, at least six, at least seven, at least eight, at least nine, at least ten, at least eleven, at least twelve, at least thirteen, at least fourteen, at least fifteen, at least sixteen, at least seventeen, at least eighteen, at least nineteen, or at least twenty proteins. In various embodiments, the method further comprises identifying a subpopulation of cells from the heterogeneous population of cells, the subpopulation of cells comprising one or more of the true variants. In various embodiments, the subpopulation of cells represents less than 1% of cells in the heterogeneous population of cells. In various embodiments, the heterogeneous population of cells represents a pooled sample comprising a plurality of cell samples.
[0022] In various embodiments, the plurality of cell samples are distinguishable by germline variants that correlate with the true variants. In various embodiments, for a candidate variant, the correlation is based on a standard deviation of correlation values for the candidate variant across the plurality of proteins. In various embodiments, the standard deviation of correlation values further encompasses correlation values for the candidate variant across a plurality of DNA sites. In various embodiments, the candidate variant is correlated if the standard deviation of correlation values is greater than a threshold value. In various embodiments, the candidate variant is not correlated if the standard deviation of correlation values is less than a threshold value. In various embodiments, the threshold value is any of 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, or 0.20.
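A minimal Python sketch of the correlation-based selection described above follows. It assumes Pearson correlation between per-cell allele frequencies and per-cell protein expression, uses the population standard deviation of each variant's correlation values across proteins, and applies a threshold of 0.10 taken from the listed values. The choice of correlation measure, the data layout, the function names, and the example variant names and values are hypothetical; only the protein names are ordinary cell surface markers.

```python
import math
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def correlation_matrix(vaf_by_variant, expr_by_protein):
    """Rows are candidate variants, columns are proteins; values are correlations across cells."""
    return {v: {p: pearson(vaf, expr) for p, expr in expr_by_protein.items()}
            for v, vaf in vaf_by_variant.items()}


def correlated_variants(matrix, threshold=0.10):
    """Keep variants whose correlation values vary across proteins by more than the threshold."""
    return [v for v, row in matrix.items()
            if statistics.pstdev(list(row.values())) > threshold]


# Hypothetical data for six cells.
vaf = {
    "candidate_1": [0, 0, 0, 50, 50, 50],  # tracks a protein-defined cell group
    "candidate_2": [3, 0, 0, 3, 0, 0],     # unrelated to protein expression
}
expr = {
    "CD34": [1, 1, 1, 9, 9, 9],
    "CD3": [9, 9, 9, 1, 1, 1],
    "CD19": [5, 4, 6, 5, 4, 6],
}
matrix = correlation_matrix(vaf, expr)
selected = correlated_variants(matrix)
```

The intuition behind the standard deviation criterion is that a variant confined to a particular cell type correlates strongly with some proteins and weakly or negatively with others, giving a wide spread of correlation values, whereas a variant arising from random error correlates weakly with all proteins.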
[0023] Additionally disclosed herein is a non-transitory computer readable medium for calling one or more variants of a cell population, the non-transitory computer readable medium comprising instructions that, when executed by a processor, cause the processor to: obtain a set of candidate variants; for a variant pair included in the set of candidate variants, determine a quantity of co-occurrence cells where both variants in the variant pair co-occur in each of the co-occurrence cells; determine a set of variant pairs based on quantities of co-occurrence cells determined for a plurality of variant pairs included in the set of candidate variants; and identify a subset of candidate variants as true variants based on the determined set of variant pairs.
[0024] Disclosed herein is another non-transitory computer readable medium for calling one or more variants of a cell population, the non-transitory computer readable medium comprising instructions that, when executed by a processor, cause the processor to: obtain a set of candidate variants; for a variant included in the set of candidate variants, generate a statistical significance by using a Beta-Binomial distribution with parameters generated from a set of control samples; determine a set of variants based on statistical significances generated for a plurality of variants included in the set of candidate variants; and identify a subset of candidate variants as true variants based on generated statistical significances for the plurality of variants.
[0025] Disclosed herein is another non-transitory computer readable medium for calling one or more variants of a cell population, the non-transitory computer readable medium comprising instructions that, when executed by a processor, cause the processor to: obtain a set of candidate variants; determine allele frequencies of the candidate variants for each of one or more cells, wherein the allele frequencies are determined through single-cell DNA sequencing of the heterogeneous population of cells; correlate determined allele frequencies of the candidate variants to protein expression of a plurality of proteins; and select a subset of candidate variants as true variants, wherein the subset of candidate variants are selected based on correlation of the allele frequencies of the subset of candidate variants and protein expression of one or more proteins of the plurality of proteins.
[0026] Additionally disclosed herein is a system comprising: a single-cell analysis workflow device configured to generate a plurality of sequence reads for cells in a cell population; a computational device communicatively coupled to the single-cell analysis workflow device, the computational device configured to: obtain a set of candidate variants; for a variant pair included in the set of candidate variants, determine a quantity of co-occurrence cells where both variants in the variant pair co-occur in each of the co-occurrence cells; determine a set of variant pairs based on quantities of co-occurrence cells determined for a plurality of variant pairs included in the set of candidate variants; and identify a subset of candidate variants as true variants based on the determined set of variant pairs.
[0027] Disclosed herein is another system comprising: a single-cell analysis workflow device configured to generate a plurality of sequence reads for cells in a cell population; and a computational device communicatively coupled to the single-cell analysis workflow device, the computational device configured to: obtain a set of candidate variants; for a variant included in the set of candidate variants, generate a statistical significance by using a Beta-Binomial distribution with parameters generated from a set of control samples; determine a set of variants based on statistical significances generated for a plurality of variants included in the set of candidate variants; and identify a subset of candidate variants as true variants based on generated statistical significances for the plurality of variants.
[0028] Disclosed herein is another system comprising: a single-cell analysis workflow device configured to generate a plurality of sequence reads for cells in a cell population; and a computational device communicatively coupled to the single-cell analysis workflow device, the computational device configured to: obtain a set of candidate variants; determine allele frequencies of the candidate variants for each of one or more cells, wherein the allele frequencies are determined through single-cell DNA sequencing of the heterogeneous population of cells; correlate determined allele frequencies of the candidate variants to protein expression of a plurality of proteins; and select a subset of candidate variants as true variants, wherein the subset of candidate variants are selected based on correlation of the allele frequencies of the subset of candidate variants and protein expression of one or more proteins of the plurality of proteins.
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS
[0029] These and other features, aspects, and advantages of the present invention will become better understood with regard to the following description, and accompanying drawings, where:
[0030] Figure (FIG.) 1A depicts an overall system environment including a cell analysis workflow device and a variant caller device for identifying variant calls, in accordance with an embodiment.
[0031] FIG. 1B is a block diagram of separate modules of a variant caller device, in accordance with an embodiment.
[0032] FIG. 2 is a flow diagram for identifying true variants and a subpopulation of cells containing true variants based on multianalyte and multisample correlation, in accordance with an embodiment.
[0033] FIG. 3A depicts the implementation of a variant correlation model, in accordance with an embodiment.
[0034] FIG. 3B depicts the implementation of a variant caller model, in accordance with an embodiment.
[0035] FIGS. 4A-4E are flow diagrams for identifying true variants based on various multianalyte and multisample correlations, in accordance with an embodiment.
[0036] FIG. 5 depicts a flow diagram for using a cell analysis workflow device for multi-omics analysis, in accordance with an embodiment.
[0037] FIG. 6 depicts an example computing device for implementing the system and methods described in reference to FIGS. 1-5.
[0038] FIG. 7 depicts example outcomes of the statistical significance test for the co-occurrence of three pairs of variants.
[0039] FIG. 8 depicts example detection limits for variants with or without a background error rate estimation.
[0040] FIGS. 9A and 9B depict example sensitivities and specificities for different disclosed methods when compared to a control method.
[0041] FIGS. 10A-10D depict example correlation matrices generated based on the correlations between variants and proteins detected under different application scenarios.
DETAILED DESCRIPTION
Definitions
[0042] Terms used in the claims and specification are defined as set forth below unless otherwise specified.
[0043] The term “subject” encompasses a cell, tissue, or organism, human or non-human, whether in vivo, ex vivo, or in vitro, male or female.
[0044] The term “sample” can include a single cell or multiple cells or fragments of cells or an aliquot of body fluid, such as a blood sample, taken from a subject, by means including venipuncture, excretion, ejaculation, massage, biopsy, needle aspirate, lavage sample, scraping, surgical incision, or intervention or other means known in the art. In various embodiments, the term “sample” also refers to data obtained from a sample. In particular embodiments, samples include analyte information for one or more cells taken from a subject.
[0045] The term “analyte” refers to a component of a cell. Cell analytes can be informative for characterizing a cell, such as for identifying one or more variants. Examples of an analyte include nucleic acid (e.g., RNA, DNA, cDNA), a protein, a peptide, an antibody, an antibody fragment, a polysaccharide, a sugar, a lipid, a small molecule, or combinations thereof. In particular embodiments, a single-cell analysis involves analyzing DNA analytes. In particular embodiments, a single-cell analysis involves analyzing two different analytes such as DNA and protein analytes.
[0046] The term “variant” refers to a specific combination of chromosome, position, and base change in the DNA of a subject. In various embodiments, a variant is detected through certain sequencing technology, during which certain errors can be introduced through the process. Accordingly, a variant identified through a detection process can be a true variant or a false positive variant. A true variant is a detected variant that reflects a real variant in a subject (i.e., a real change in DNA), while a false positive variant is a detected variant that may not reflect a real variant in the subject. A false positive variant can arise from processing errors, such as PCR or sequencing errors. A real variant can be a germline variant or a somatic variant. In some embodiments disclosed herein, a true variant specifically refers to a detected rare somatic variant that occurs in a subject at a very low (e.g., <1%) percentage.
[0047] The term “correlation” refers to any statistical relationship, whether causal or not, between two random variables or bivariate data. For example, two variables can be correlated if a first variable is generally elevated when the second variable is also elevated. As another example, two variables can be correlated if a first variable is decreased when the second variable is also decreased. As used herein, “correlation” also encompasses anti-correlative relationships, e.g., a first variable is elevated when the second variable is decreased, and vice versa.
[0048] The phrase “variant caller model” refers to a prediction model or a machine-learned model that is implemented to call variants of a cell population. The variant caller model analyzes cell population features derived from sequence reads or protein expression profiles across a cell population or individual cell features from sequence reads and protein expression profiles from individual cells. In one embodiment, the variant caller model receives the cell population features or individual cell features as input and predicts a classification for a candidate variant. In one embodiment, the variant caller model extracts cell population features or individual cell features from sequence reads and protein expression profiles and predicts a classification for a candidate variant based on the extracted cell population features or individual cell features. In particular embodiments, the classification for a candidate variant is based on the sequencing data for the candidate variants and analyte information related to the candidate variant. As an example, multianalyte and/or multisample correlations with other analytes and/or samples are identified for a candidate variant, which is then used to classify the candidate variant.
[0049] The phrase “candidate variant” refers to a base across sequence reads of a cell population that is mismatched in comparison to a reference base. Generally, a variant caller model is implemented to determine whether the candidate variant is a true variant, such as a homozygous variant or a heterozygous variant.
[0050] The phrase “true variant” refers to a genetic variant that is present in one or more cells of a cell population. In various embodiments, a true variant includes a rare variant from a somatic mutation that occurs in only a subpopulation of a cell population.
[0051] The phrase “false positive” refers to a genetic variant, identified from sequencing data, which is falsely called due to base-calling errors, processing errors (e.g., due to PCR bias), and/or artifacts (e.g., allele dropout or sequence homology), among possible other errors.
[0052] The phrases “measurable residual disease,” “minimal residual disease,” and “MRD” are used interchangeably and generally refer to small sub-populations of cancer cells that may be present at rare quantities in larger populations of cells. Such cancer cells representing signals of MRD can include one or more true variants that may be associated with the cancerous nature of the cancer cells. In various embodiments, “measurable residual disease,” “minimal residual disease,” and “MRD” refer to small numbers of cancer cells that remain following cancer treatment. In various embodiments, “measurable residual disease,” “minimal residual disease,” and “MRD” refer to signatures of cancer cells associated with a blood cancer, an example of which is acute lymphocytic leukemia (ALL).
Overview
[0053] Embodiments described herein refer to an improved variant caller that identifies multianalyte and multisample correlations between a candidate variant and other analytes, variants, and samples, and further performs a classification of the candidate variant based on the identified correlations. In various embodiments, these true variants represent rare variants that occur in a population of cells at a very low rate (e.g., < 1%). Accurate identification of these rare variants can be useful for identifying sub-populations of cells, such as cells representing signatures of measurable residual disease (MRD). In various embodiments, multianalyte and multisample correlations involve implementing a variant correlation model and the classification of variants involves implementing a variant caller model. Altogether, the variant caller method described herein achieves higher accuracy in calling true variants that are present in cells in comparison to conventional variant caller methods (e.g., Genome Analysis Toolkit (GATK)) that employ hard cutoffs as opposed to a variant correlation model and/or variant caller model. Further description regarding hard filters used in the GATK is found in De Summa, S., Malerba, G., Pinto, R. et al. GATK hard filtering: tunable parameters to improve variant calling for next generation sequencing targeted gene panel data. BMC Bioinformatics 18, 119 (2017), which is incorporated by reference in its entirety.
[0054] Reference is made to FIG. 1A, which depicts an overall system environment 100 including a cell analysis workflow device 120 and a variant caller device 130 for variant calling, in accordance with an embodiment. At the beginning of variant calling, a cell population 110 is obtained. In various embodiments, the cell population 110 can be isolated from a test sample obtained from a subject or a patient. In various embodiments, the cell population 110 includes healthy cells taken from a healthy subject. In various embodiments, the cell population 110 includes diseased cells taken from a subject. In one embodiment, the cell population 110 includes cancer cells taken from a subject previously diagnosed with cancer. For example, cancer cells can be tumor cells available in the bloodstream of the subject diagnosed with cancer. As another example, cancer cells can be cells obtained through a tumor biopsy. In various embodiments, the cell population 110 includes sub-populations of cells that are present in rare quantities in the cell population 110. For example, a sub-population of cells may be cancer cells that are present in rare quantities in the cell population 110 that is largely made up of healthy cells. In particular embodiments, the sub-population of cells may be cancer cells that remain after the subject has undergone cancer treatment. Thus, the presence of such cancer cells, even at low quantities in the cell population 110, may be informative for, e.g., guiding treatment for the subject.
[0055] The cell analysis workflow device 120 refers to a device that processes cells and generates information related to the analytes of cells. In various embodiments, the cell analysis workflow device 120 refers to a system comprising one or more devices that process cells and prepare nucleic acids for sequencing and/or proteins for expression analysis. In various embodiments, the cell analysis workflow device 120 is a workflow device that generates nucleic acids from single cells, thereby enabling the subsequent identification of sequence reads and individual cells from which the sequence reads originated. Whereas measuring one analyte - like DNA - in a cell-specific manner provides an expanded view of cancer, a single analyte may not be sufficient to resolve all the clonal populations of a tumor. In various embodiments, the cell analysis workflow device 120 is a single-cell multi-omics workflow device that simultaneously measures multiple analytes, such as DNA, RNA, protein, other biomolecules, and combinations of different molecules. This facilitates the identification of signatures of MRD at a more granular and accurate level.
[0056] In various embodiments, the cell analysis workflow device 120 performs single-cell processing by encapsulating individual cells into emulsions, lysing cells within emulsions, performing cell barcoding of cell lysate in emulsions, and performing a nucleic amplification reaction in emulsions. Thus, amplified nucleic acids can be collected and sequenced. Further description of example embodiments of single-cell workflow processes is described in U.S. Patent Application No. 14/420,646, which is hereby incorporated by reference in its entirety. In various embodiments, before performing single-cell processing by encapsulating individual cells into emulsions, certain cell surface markers are first labeled, e.g., by using antibody-oligo conjugates (AOCs) that bind specifically to target cell surface proteins. Accordingly, when performing nucleic amplification in emulsions, not only target genes but also antibody-specific oligos are amplified with a cell-specific barcode, which then allows the construction of both mutational and protein (e.g., surface markers) profiles.
[0057] In particular embodiments, the cell analysis workflow device 120 can be any of the Tapestri™ Platform, inDrop™ system, Nadia™ instrument, or the Chromium™ instrument. In various embodiments, the cell analysis workflow device 120 includes a sequencer for sequencing the nucleic acids to generate sequence reads. In various embodiments, the cell analysis workflow device 120 also includes an analyzer to generate protein expression profiles based on the amplitude of amplification of antibody-specific oligos.
[0058] The variant caller device 130 is configured to receive the sequence reads and/or protein expression data from the cell analysis workflow device 120 and to process the sequence reads and/or protein expression information to call one or more variants 140. In various embodiments, the variant calling includes a classification of a variant as a potential true variant or likely false positive. For example, when a variant is called, it may be marked as a true variant or not.
[0059] In various embodiments, the variant caller device 130 is communicatively coupled to the cell analysis workflow device 120, and therefore, directly receives the sequence reads and protein expression data from the cell analysis workflow device 120. The variant caller device 130 determines the correlation of a candidate variant with one or more other analytes or variants with a same cell, same sample, or with different cells and/or different samples. Based on the correlation information, a candidate variant can be then classified as true or not during a variant calling process. In particular embodiments, the variant caller device 130 identifies multianalyte and multisample information from the sequence reads and protein expression information obtained through a cell-specific workflow process and subsequently calls variants across the cell population using the correlation information for each candidate variant. Altogether, this two-step process of cell-specific correlation identification and true variant classification during the variant calling process enables more accurate variant calls 140 across the cell population 110.
Variant Caller Device
[0060] FIG. 1B is a block diagram of a variant caller device 130, in accordance with the embodiment described in FIG. 1A. As shown in FIG. 1B, the variant caller device 130 includes a variant filtration module 132, a variant correlation module 134, a variant caller module 136, and optionally a training module 138. In some embodiments, the modules of the variant caller device 130 can be arranged differently from the embodiment shown in FIG. 1B. For example, the training module 138 (as shown in dotted lines) can be implemented by a device other than the variant caller device 130 and the methods described below regarding the training module 138 can be performed by the other device.
[0061] Generally, the variant filtration module 132 filters one or more variants that are not likely true variants. For example, the variant filtration module 132 may filter some falsely called variants based on certain features (e.g., individual cell features or cell population features) identified from the sequence reads and/or protein expression data of various variants from one or more cells or samples. For example, some falsely called variants may show a difference in certain cell population features or individual cell features, such as genotype quality, read depth or depth of coverage, variant allele frequency, the number of cells in which a variant is detected, and the like, when compared to true variants. Accordingly, setting proper thresholds for one or more specific features allows the removal of certain candidate variants that are not considered potential true variants. In various embodiments, these various thresholds may be predefined and configured in the variant caller device 130, which then uses the threshold values to filter out a candidate variant that is not a true variant (e.g., a “false positive” variant).
[0062] In various embodiments, the variant filtration module 132 applies one or more filters at any stage of a variant calling process. For example, one or more filters may be applied first before checking the multianalyte and multisample correlation of a candidate variant. For another example, the variant filtration module 132 applies one or more filters after checking the multianalyte and multisample correlation of a candidate variant. In yet another example, the variant filtration module 132 applies one or more filters during the process of checking the multianalyte and multisample correlation of a candidate variant.
[0063] In various embodiments, the specific thresholds for each filter may be configured based on the knowledge identified from the false positive variants and/or the true variants. For example, by manually looking into the sequence reads of a large number of samples, it may be found that the true variants generally have a minimum depth of coverage of 10. The threshold for the depth of coverage may be set to about 10. Similarly, if it is found that the true variants generally have a genotype quality of 30 or more, then the threshold for the genotype quality is set to about 30. In various embodiments, the thresholds for other filters may be similarly determined.
[0064] In various embodiments, the features or specific filters selected for filtering purposes may not be fixed but rather can be dynamically updated. For example, based on new findings about certain true variants or false positive variants, additional filters can be added to the variant filtration module 132. Similarly, the threshold values for each specific filter may also be dynamically updated if new findings support such updates. In addition, in some embodiments, the thresholds may also be adjusted depending on the samples. For example, when determining the co-occurrence of two candidate variants within the same cells, the threshold cell number used for determining co-occurrence may be changed depending on a disease stage of a subject from which a sample is obtained, since a disease in later stages may have more variants that co-occur within the same cells.
[0065] In various embodiments, if one or more candidate variants are filtered out by filter thresholds before the multianalyte and multisample correlation analysis, these candidate variants may be excluded from the correlation analysis to save the computing resources required for the analysis.
[0066] The variant correlation module 134 determines multianalyte and multisample correlations for one or more candidate variants. In various embodiments, the variant correlation module 134 may identify any kind of correlation that facilitates the identification of a true variant and/or exclusion of a false positive variant. In one example, the variant correlation module 134 may determine whether there is a cross-analyte correlation by checking the analyte features of a candidate variant with other analyte features associated with the candidate variant (e.g., analyte features from a same cell or same sample). The analyte features may relate to DNA, RNA, protein, or other kinds of biomolecules. For example, the variant correlation module 134 may check whether there is a correlation between the VAF value of a candidate variant and the protein expression of cell surface markers from a same cell. For another example, the variant correlation module 134 may determine whether there is a correlation between one variant and another variant within a same cell by checking how many times (e.g., how many cells) these two variants occur in a same cell.
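As a non-limiting illustration of the cross-analyte check described above, the following sketch computes a correlation coefficient between the per-cell VAF values of a candidate variant and the per-cell expression of a cell surface marker. The use of the Pearson coefficient, the function name pearson_correlation, and the example values are assumptions made for illustration only; the disclosure does not prescribe a particular correlation statistic.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # No variation in one analyte: the coefficient is undefined.
        # Returning 0.0 here is an illustrative convention, not a requirement.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-cell measurements (one entry per cell, same cell order).
vaf_per_cell = [0.0, 5.0, 48.0, 52.0, 50.0, 2.0]      # VAF in percent
marker_expression = [1.0, 1.5, 9.0, 10.0, 9.5, 1.2]   # arbitrary units

r = pearson_correlation(vaf_per_cell, marker_expression)
print(f"correlation between VAF and marker expression: {r:.3f}")
```

A coefficient near 1 in such a sketch would suggest that cells carrying the candidate variant also tend to express the marker, which is the kind of multianalyte evidence the variant correlation module 134 may take into account.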
[0067] In another example, the variant correlation module 134 leverages cross-sample correlation to determine whether a variant is a true variant or not. For example, data from a set of control samples may be used for background error rate evaluation when determining whether a candidate variant is a true variant or is due to a background error (e.g., base-calling errors, processing errors, and/or artifacts, which may be considered as “background error”). The control samples may not have true or rare variants (e.g., control samples obtained from healthy donors), and thus the variants identified in them can be considered as arising from background errors (e.g., due to base-calling errors, processing errors, and/or artifacts, among possible other errors). The control samples may be processed in the same manner as the sample that is subject to the variant call, and thus the background errors identified from the control samples may allow determining whether a candidate variant identified from a sample in the variant call is due to the background error or not (e.g., by estimating the background error rate for each candidate variant to see how likely the variant is falsely called). In various embodiments, there are additional multianalyte and multisample correlations that can be evaluated to determine whether a candidate variant is a true variant or not.
[0068] In various embodiments, the variant correlation module 134 further includes a variant correlation model that allows the identification of any potential correlation between a candidate variant and another variant/analyte/cell/sample. The variant correlation model may evaluate the presence of possible correlations based on the data obtained from sequence reads, protein expression from the same cells, same samples, or from different cells and/or samples. In one example, the variant correlation model disclosed herein includes one or more probabilistic models that evaluate the statistical significance of one or more possible correlations. In another example, the variant correlation model may also include one or more machine-learning models that can be trained to identify certain correlations based on the features identified from the variants, analytes, cells, and/or samples. As an example, the variant pair co-occurrence patterns from a large number of samples can be used to train a machine-learning model to determine whether there is a correlation between co-occurring variants within the same cells.
[0069] The variant caller module 136 applies a variant caller model to predict one or more true variants of a cell population. In various embodiments, the variant caller module 136 provides, as input, multianalyte and multisample correlation information associated with a candidate variant to the variant caller model. The variant caller model analyzes the multianalyte and multisample correlation information and outputs a prediction for the candidate variant.
[0070] In various embodiments, the variant caller model is a classifier that outputs a classification for the candidate variant out of multiple possible classifications. In some embodiments, the variant caller model is a classifier that outputs one of two classifications for the candidate variant. As an example, the variant caller model can output a classification of a true variant or a false positive variant. In some embodiments, the variant caller model outputs a classification of an indeterminate variant. An indeterminate variant can represent a low-confidence call that requires additional analysis to confirm whether the indeterminate variant is a true variant.
[0071] The training module 138 generally implements methods for generating one or both of the variant correlation model and the variant caller model. In various embodiments, the training module 138 is implemented by a device or system other than the variant caller device 130. For example, the training module 138 can be implemented by a third party. In such a scenario, the third party generates one or both of the variant correlation model and the variant caller model. The third party can then provide one or both of the trained variant correlation model and the trained variant caller model to the variant caller device 130.
[0072] In various embodiments, the training module 138 trains the variant correlation model. The training module 138 can employ a machine learning-implemented method to train the variant correlation model, such as any one of a linear regression algorithm, logistic regression algorithm, decision tree algorithm, support vector machine classification, Naive Bayes classification, K-Nearest Neighbor classification, random forest algorithm, deep learning algorithm, gradient boosting algorithm, and dimensionality reduction techniques such as manifold learning, principal component analysis, factor analysis, autoencoder regularization, and independent component analysis, or combinations thereof. In various embodiments, the training module 138 employs supervised learning algorithms, unsupervised learning algorithms, semisupervised learning algorithms (e.g., partial supervision), transfer learning, multi-task learning, or any combination thereof to train the variant correlation model.
[0073] The training module 138 trains the variant correlation model using variant correlation training samples. In various embodiments, the variant correlation training samples include training sequence reads, protein expression, and other features derived from individual cells or samples that show certain correlation patterns. Such training samples can be expressed in a commonly used file format such as a SAM or BAM file format.
[0074] In various embodiments, the training module 138 also trains the variant caller model. The training module 138 can employ a machine learning-implemented method to train the variant caller model, such as any one of a linear regression algorithm, logistic regression algorithm, decision tree algorithm, support vector machine classification, Naive Bayes classification, K-Nearest Neighbor classification, random forest algorithm, deep learning algorithm, gradient boosting algorithm, and dimensionality reduction techniques such as manifold learning, principal component analysis, factor analysis, autoencoder regularization, and independent component analysis, or combinations thereof. In various embodiments, the training module 138 employs supervised learning algorithms, unsupervised learning algorithms, semisupervised learning algorithms (e.g., partial supervision), transfer learning, multi-task learning, or any combination thereof to train the variant caller model.
[0075] The training module 138 trains the variant caller model using variant caller training samples. In various embodiments, the variant caller training samples include training multianalyte and multisample correlations with known variant identification (e.g., known true variants or false positive variants, e.g., which can be determined through manual investigation). In various embodiments, the variant caller training samples also include sequence reads, analyte expressions, cells, and/or samples that can be used to derive multianalyte and multisample correlation information.
[0076] In various embodiments, the variant caller training samples can be labeled with reference ground truths indicating a classification of variants. In one embodiment, the reference ground truths differentiate between a true variant and a false positive variant. In various embodiments, the labels of the variant caller training samples can be previously determined and/or confirmed through other sequencing methods, such as bulk sequencing methods. In various embodiments, labels of the variant caller training samples can be previously determined at least in part based on known genetic variants that are present in certain cell lines. In various embodiments, a label can be a binary value (e.g., a 0 or 1 value) that is indicative of whether the variant is a true variant or a false positive variant. In some embodiments, a label can be different integer values (e.g., 0, 1, 2, 3, etc.) depending on the number of classifications that the variant caller model is designed to predict. For example, an indeterminate variant can be labeled as 2, while a true variant or false positive variant is labeled as 0 or 1.
Methods for Calling Variants of a Cell Population
[0077] Reference is now made to a flow chart 200 shown in FIG. 2, which describes an example method for identifying true variants and a subpopulation containing true variants based on multianalyte and multisample correlation. The multianalyte and multisample correlation may include, but is not limited to, 1) a multianalyte correlation between VAF values of variants and protein expression of one or more proteins, 2) a multianalyte co-occurrence between one variant and another different variant in the same cell(s), and 3) a multisample comparison of a variant from an in-evaluation sample to other control samples. Based on the findings from these different multianalyte and multisample correlations, it can be determined whether a variant is a true variant or not.
[0078] Step 202: Obtain a set of candidate variants from a heterogeneous population of cells.
[0079] In various embodiments, the heterogeneous population of cells includes cells that contain true variants (e.g., various somatic mutations) and cells that do not contain true variants. In various embodiments, the candidate variants are determined through single-cell DNA sequencing. In various embodiments, the set of candidate variants includes a plurality of falsely called variants arising from base-calling errors, processing errors (e.g., due to PCR bias), and/or artifacts (e.g., allele dropout or sequence homology). Accordingly, it may be desirable to separate true variants from these falsely called variants so that a proper population of cells may be identified for further analysis, such as understanding the mechanisms of tumorigenesis, among others.
[0080] Step 204: For each candidate variant, determine a correlation between the candidate variant and one or more of other features associated with the variant.
[0081] In various embodiments, the candidate variant can be compared to other analytes and/or other variants within the same or different cells or samples, to identify possible multianalyte and multisample correlations between the variant and other analytes and/or other variants within the same or different cells or samples. For example, the candidate variant can be analyzed for each possible correlation. In various embodiments, determining a correlation between the candidate variant and one or more of the other features associated with the variant comprises using a variant correlation model to determine the correlations for the candidate variant from different aspects.
[0082] Step 206: Select a subset of the candidate variants as true variants based on the determined correlations.
[0083] In various embodiments, selecting a subset of the candidate variants as true variants comprises using a variant caller model to determine whether a candidate variant is a true variant based on the determined one or more correlations associated with the variant. In various embodiments, not every possible correlation for a candidate variant needs to be determined. For example, if one identified correlation can determine whether the candidate variant is a true variant, then no other correlations need to be further evaluated. In other embodiments, multiple correlations are determined instead, to improve the sensitivity and specificity of a variant calling process.
[0084] Step 208: Identify a subpopulation of cells from the heterogeneous population of cells, the subpopulation of cells comprising one or more of the true variants.
[0085] The heterogeneity and dynamism of cancer present formidable challenges to understanding and treating the disease. Evolution often leads to considerable genetic variation across clones. Accordingly, identifying subpopulations of cells is very important in understanding the mechanisms of tumorigenesis. In various embodiments, identifying a subpopulation of cells from the heterogeneous population of cells comprises identifying a subpopulation of cells comprising one or more true variants. In various embodiments, the one or more true variants identified through method 200 disclosed herein represent rare variants present in the subpopulation of cells. In various embodiments, the subpopulation of cells represents a rare cell population within a heterogeneous population of cells. For example, the subpopulation may be a subclone present in a tumor sample or biopsy.
Embodiments of Variant Correlation model and Variant Caller Model
[0086] In particular embodiments, the variant correlation model and the variant caller model are machine-learned models. Each of the variant correlation model and the variant caller model may be trained using training data. Following the training, the variant correlation model and the variant caller model can be deployed (e.g., deployed to a variant caller device for variant call).
[0087] In various embodiments, one or both of the variant correlation model and the variant caller model is any one of a regression model (e.g., linear regression, logistic regression, or polynomial regression), decision tree, random forest, support vector machine, Naive Bayes model, k-means cluster, or neural network (e.g., feed-forward networks, convolutional neural networks (CNN), deep neural networks (DNN), autoencoder neural networks, generative adversarial networks, or recurrent networks (e.g., long short-term memory networks (LSTM), bidirectional recurrent networks, or deep bi-directional recurrent networks)).
[0088] In various embodiments, one or both of the variant correlation model and the variant caller model have one or more parameters, such as hyperparameters or model parameters. Hyperparameters are generally established prior to training. Examples of hyperparameters include the learning rate, depth or leaves of a decision tree, number of hidden layers in a deep neural network, number of clusters in a k-means cluster, penalty in a regression model, and a regularization parameter associated with a cost function. Model parameters of one or both of the variant correlation model and the variant caller model are generally adjusted during training. Examples of model parameters include weights associated with nodes in layers of a neural network, support vectors in a support vector machine, and coefficients in a regression model. The model parameters of the machine learning model are trained (e.g., adjusted) using the training data to improve the predictive power of the machine learning model.
[0089] In some embodiments, one or both of the variant correlation model and the variant caller model are parametric models in which one or more parameters of the models define the dependence between the independent variables and dependent variables. In various embodiments, various parameters of parametric-type models are trained to minimize a loss function, the training being conducted through gradient-based numerical optimization algorithms, such as batch gradient algorithms, stochastic gradient algorithms, and the like. In some embodiments, one or both of the variant correlation model and the variant caller model are non-parametric models in which the model structure is determined from the training data and is not strictly based on a fixed set of parameters.
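As a non-limiting illustration of training a parametric model by gradient-based minimization of a loss function, the following sketch fits a logistic regression classifier with stochastic gradient descent. The feature layout (two hypothetical correlation scores per candidate variant), the label convention (1 for a true variant, 0 for a false positive), the learning rate, and all function names are assumptions made for illustration; they are not taken from the disclosure.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(x, weights, bias):
    """Predicted probability that a feature vector belongs to class 1."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_logistic_sgd(samples, labels, learning_rate=0.5, epochs=200, seed=0):
    """Minimize the log loss with per-sample (stochastic) gradient steps."""
    rng = random.Random(seed)
    weights = [0.0] * len(samples[0])   # model parameters adjusted in training
    bias = 0.0
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for idx in order:
            x, y = samples[idx], labels[idx]
            # Gradient of the log loss with respect to the pre-activation.
            error = predict_probability(x, weights, bias) - y
            weights = [w - learning_rate * error * xi
                       for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


# Hypothetical training data: [co-occurrence score, cross-analyte score].
train_x = [[0.9, 0.8], [0.8, 0.95], [0.1, 0.2], [0.2, 0.05]]
train_y = [1, 1, 0, 0]

weights, bias = train_logistic_sgd(train_x, train_y)
for x, y in zip(train_x, train_y):
    print(x, y, round(predict_probability(x, weights, bias), 3))
```

Here the learning rate and the number of epochs play the role of hyperparameters fixed before training, while the weights and bias are the model parameters adjusted during training.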
[0090] FIG. 3A depicts one example implementation of the variant correlation model 310, in accordance with an embodiment. In this implementation, the variant correlation model 310 analyzes features associated with a candidate variant, where the features are generated from sequence reads and protein expressions derived from single cells in a sample or multiple samples. These features may include protein expression levels, variant allele frequency values, etc., all of which can be derived from the sequence reads and protein expression profiles of the individual cells within the sample(s). Based on the identified features associated with a candidate variant, the variant correlation model 310 outputs a correlation value or matrix that represents one or more correlations between the candidate variant and one or more other variants and/or analytes from the same or different cells or samples.
[0091] In particular embodiments, the variant correlation model 310 is a neural network. In some embodiments, the variant correlation model 310 is a deep-learning neural network. The variant correlation model 310 may be structured with two, three, four, five, six, seven, eight, nine, or ten layers. Layers of the variant correlation model 310 are comprised of one or more nodes. A node in a layer can be connected to other nodes of other layers, the connection between nodes being associated with parameters. A value at one node may be represented as a combination of the values of nodes connected to the particular node weighted by associated parameters mapped by an activation function associated with the particular node.
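As a non-limiting illustration of how a value at one node may be computed from the values of connected nodes, the following sketch evaluates a small feed-forward network. The bias terms, the choice of ReLU and sigmoid activation functions, the layer sizes, and all numeric parameters are assumptions made for illustration and do not describe the variant correlation model 310 itself.

```python
import math


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def node_value(inputs, weights, bias, activation):
    """Weighted combination of connected node values, mapped by an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


def layer_forward(inputs, weight_rows, biases, activation):
    """Compute every node of one layer from the previous layer's values."""
    return [node_value(inputs, w, b, activation)
            for w, b in zip(weight_rows, biases)]


# Hypothetical input features for one candidate variant.
features = [0.5, -1.0, 2.0]

# Hidden layer with two nodes, then an output layer with one node.
hidden = layer_forward(features,
                       [[0.2, 0.4, 0.1], [-0.3, 0.1, 0.5]],
                       [0.0, 0.1],
                       relu)
output = layer_forward(hidden, [[1.0, -1.0]], [0.0], sigmoid)[0]
print(hidden, output)
```

Stacking further calls to layer_forward would give a deeper network of the kind contemplated above, with the weights and biases being the parameters associated with the connections between nodes.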
[0092] FIG. 3B depicts one example implementation of the variant caller model, in accordance with an embodiment. In the implementation shown in FIG. 3B, the variant caller model 320 analyzes various correlations derived from features associated with a candidate variant. The variant caller model 320 outputs a classification for the candidate variant. In some embodiments, the classification for the variant is one of a true variant or a false positive variant.
[0093] In some embodiments, the variant caller model 320 receives as input the various correlations identified from the features associated with a candidate variant. The variant caller model 320 analyzes the various correlations and predicts a variant classification for the candidate variant.
[0094] In particular embodiments, the variant caller model 320 is a neural network. In some embodiments, the variant caller model 320 is a deep-learning neural network. The variant caller model 320 may be structured with two, three, four, five, six, seven, eight, nine, or ten layers. Layers of the variant caller model 320 are comprised of one or more nodes. A node in a layer can be connected to other nodes of other layers, the connection between nodes being associated with parameters. A value at one node may be represented as a combination of the values of nodes connected to the particular node weighted by associated parameters mapped by an activation function associated with the particular node.
Methods for Co-occurrence Based True Variant Identification
[0095] Embodiments disclosed herein leverage the single-cell data by identifying variants that co-occur together in the same cell in a statistically significant way. Prior approaches identify variants by checking one variant at a time, where an error rate for individual variants may be high. The error rate of co-occurring variants is much lower compared to individual variants, so the theoretical limit of detection for co-occurring variants is lower than that of individual variants. Embodiments of methods disclosed herein use a Binomial test for testing the statistical significance of co-occurring variants. This not only identifies if two rare variants co-occur together in a significant way but also filters out cases where one variant is rare while the other is not. For example, if one variant is in 1% of cells and the other is in 80%, then it can be expected that the two variants will co-occur by chance. Prior approaches do not apply this Binomial test to combinations of candidate rare variants observed in a sample. The specific processes for co-occurrence-based true variant identification are further described in detail below.
[0096] FIG. 4A depicts a flow chart of an example method 420 for co-occurrence-based true variant identification, according to some embodiments. Specifically, in step 422, a set of candidate variants is obtained from a heterogeneous population of cells. The candidate variants may be identified from a cell population (e.g., cell population 110 described in FIG. 1A), which includes data obtained from a heterogeneous population of cells through a single-cell platform (an example of which includes the Tapestri® platform). A single-cell platform enables highly sensitive targeted DNA sequencing at the single-cell level. By incorporating a cell-specific barcode into the amplified target sequences, this droplet-based technology measures genetic lesions (single nucleotide variants (SNVs), indels, chromosomal rearrangements) and copy number variants (CNVs) in each cell. Other structural characteristics like zygosity, mutational co-occurrence, and the presence of rare cell populations can also be identified, as further described below.
[0097] Step 424: Apply a first set of variant filters to identify a set of variants from candidate variants.
[0098] In various embodiments, before determining the co-occurrence of a pair of variants within one or more cells, a set of ad hoc variant filters is first applied to remove potential false positive variants (or simply false positives). The possible ad hoc variant filters include, but are not limited to, allele depth (AD) and/or depth of coverage (DP), genotype quality (GQ), the minimum number of cells in which a variant is present, and a minimum variant allele frequency of each candidate variant. In applications, these different filters may be set to predefined values, to allow filtering out likely false positives. In one example application, the thresholds may be set as follows: requiring a minimum depth of 10 reads, requiring genotype quality of 30 or larger, requiring a variant to be present in at least 3 cells, and requiring a minimum variant allele frequency of 35%. In other example applications, different threshold values may be set for each specific filter, as further described in detail below.
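As a non-limiting illustration, the example thresholds above (depth of at least 10 reads, genotype quality of at least 30, presence in at least 3 cells, and a variant allele frequency of at least 35%) can be applied as follows. The record layout, in which each candidate variant carries one summarized value per feature, and the names CandidateVariant, FilterThresholds, and passes_filters are hypothetical and chosen only for this sketch.

```python
from dataclasses import dataclass


@dataclass
class CandidateVariant:
    name: str
    depth: int               # depth of coverage (DP), in reads
    genotype_quality: int    # genotype quality (GQ)
    n_cells: int             # number of cells in which the variant is present
    vaf_percent: float       # variant allele frequency, in percent


@dataclass
class FilterThresholds:
    # Defaults follow the example application in the text; all are adjustable.
    min_depth: int = 10
    min_gq: int = 30
    min_cells: int = 3
    min_vaf_percent: float = 35.0


def passes_filters(variant, thresholds):
    """Return True only if the variant meets every threshold."""
    return (variant.depth >= thresholds.min_depth
            and variant.genotype_quality >= thresholds.min_gq
            and variant.n_cells >= thresholds.min_cells
            and variant.vaf_percent >= thresholds.min_vaf_percent)


# Hypothetical candidate variants.
candidates = [
    CandidateVariant("variant_a", 25, 60, 12, 48.0),
    CandidateVariant("variant_b", 6, 45, 9, 50.0),    # fails depth
    CandidateVariant("variant_c", 30, 80, 2, 41.0),   # fails cell count
    CandidateVariant("variant_d", 18, 30, 3, 35.0),   # exactly at thresholds
]

kept = [v.name for v in candidates if passes_filters(v, FilterThresholds())]
print(kept)
```

Keeping the thresholds in a separate object makes it straightforward to substitute other values, such as those listed for each filter in the paragraphs that follow.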
[0099] Allele depth and depth of coverage are two complementary fields that represent two important ways of evaluating the depth of the data prepared for the variant call. Allele depth refers to the unfiltered allele depth, which is the number of reads that support each of the reported alleles. All reads at a position (including reads that did not pass the variant caller’s filters) are included in this number, except for reads that are considered uninformative. Uninformative reads generally do not provide sufficient statistical evidence to support one allele over another. Depth of coverage, on the other hand, is the filtered depth, reported at the sample level like the allele depth. The depth of coverage is the number of filtered reads that support each of the reported alleles. Only reads that passed the variant caller’s depth threshold filter(s) may be included in the depth of coverage. It should be noted that, unlike the allele depth calculation, uninformative reads are generally included in the depth of coverage calculation.
[00100] In various embodiments, by using a depth filter, variants with certain allele depths may be excluded from further analysis. For example, variants that have a depth less than the depth of coverage threshold may be filtered out so that only variants that have a number of reads equal to or greater than the threshold may be subject to further analysis in the variant call. In various embodiments, a depth of coverage threshold may be a value that can be predetermined based on the knowledge gained from certain analyses (e.g., based on the knowledge identified from ground-truth samples). In applications, the predetermined depth of coverage threshold can be set to 4, 6, 8, 10, 12, 14, 16, 18, 20 reads, or another different value for a candidate variant. In one specific example, a depth of coverage threshold of 10 may be selected for filtering purposes.
[00101] In various embodiments, the genotype quality is another ad hoc variant filter that can be used to filter out certain false positive variants. Genotype quality (GQ) generally represents a confidence level (e.g., Phred-scaled confidence level (PL)) that a genotype assignment is correct. GQ may be derived from the genotype PLs. For example, the GQ may be determined based on the difference between the PL of the second most likely genotype and the PL of the most likely genotype. In applications, the values of the PLs may be normalized so that the most likely PL is 0, so the GQ ends up being equal to the second smallest PL, unless that PL is greater than 99. In various embodiments, the value of GQ is capped at 99 since larger values may not be more informative, but these values take more space in the file. So if the second most likely PL is greater than 99, a GQ of 99 may be used instead. In general, the GQ indicates the difference between the likelihoods of the two most likely genotypes. If it is low, it means there is not much confidence in the genotype.
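As a non-limiting illustration of the GQ derivation described above, the following sketch normalizes a list of genotype PLs so that the most likely genotype has a PL of 0, takes the second smallest normalized PL, and caps the result at 99. The function name genotype_quality and the example PL values are hypothetical.

```python
def genotype_quality(pls, cap=99):
    """Derive GQ from Phred-scaled genotype values (PLs).

    The PLs are shifted so the smallest becomes 0; GQ is then the second
    smallest shifted value, capped at `cap`.
    """
    if len(pls) < 2:
        raise ValueError("at least two genotype PLs are required")
    best = min(pls)
    normalized = sorted(pl - best for pl in pls)
    return min(normalized[1], cap)


# Hypothetical PLs for three genotypes (e.g., 0/0, 0/1, 1/1).
print(genotype_quality([0, 45, 300]))    # already normalized
print(genotype_quality([30, 0, 12]))     # second smallest PL is 12
print(genotype_quality([120, 10, 400]))  # difference of 110, capped at 99
```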
[00102] In various embodiments, variants in a variant call may also be filtered based on the GQ. For example, only variants that have a GQ higher than a threshold value may be subject to further analysis, while the variants that have a low GQ may be filtered out without further analysis. In applications, the GQ threshold may be determined based on knowledge obtained from other analyses (e.g., based on the knowledge identified from ground-truth samples). In one example, the GQ threshold may be set to 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, or another different value. In a specific example, the GQ threshold may be set to 35.
[00103] In various embodiments, the ad hoc variant filters further include determining the quantity of cells in which a candidate variant is present. That is, by determining the number of cells in which a variant is present, certain false positive variants can also be filtered out. Variant reads are the number of independent sequence reads supporting the presence of a variant. Due to the high error rate of NGS at the per-base call level, calls supported by fewer than a certain number of variant reads are typically considered to be likely false positive calls. On the other hand, true variants generally have a larger number of variant reads (e.g., detected in a larger number of cells). Accordingly, by using the cell quantity as a filter, the variants that occur in smaller numbers of cells can be considered potential false positives and filtered out.
[00104] The specific value used in the cell number-based filtering process may also be determined based on information obtained from certain analyses (e.g., based on the knowledge identified from ground-truth samples). For example, such analyses may show that true variants generally occur in three or more cells in a sample. Accordingly, the cell quantity threshold may be set to 3 to filter out variants that occur in fewer than three cells in a sample. In various embodiments, the cell threshold may also be another different value, such as 2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20, 25, 30, 35, 40, 45, 50, or another number of cells, based on the knowledge obtained from certain studies.
[00105] In various embodiments, the ad hoc variant filters further leverage the variant allele frequency (VAF) to filter out certain false positives. VAF is the number of sequence reads observed matching a specific DNA variant divided by the overall coverage at that locus, expressed as a percentage. True variants typically have higher variant allele frequencies, while false positives may have lower variant allele frequencies. Accordingly, by setting VAF as a filter, certain false positives may also be filtered out.
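As a non-limiting illustration, the VAF at a locus can be computed from read counts as shown below. The function name variant_allele_frequency and the read counts are hypothetical.

```python
def variant_allele_frequency(variant_reads, total_reads):
    """VAF in percent: reads matching the variant over all reads at the locus."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= variant_reads <= total_reads:
        raise ValueError("variant_reads must be between 0 and total_reads")
    return 100.0 * variant_reads / total_reads


# Hypothetical locus covered by 20 reads, 7 of which match the variant.
vaf = variant_allele_frequency(7, 20)
print(vaf)  # 35.0
```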
[00106] The specific values used in the VAF filtering process may also be determined based on the information from the ground-truth studies or other similar analyses. For example, a number of studies indicate that a variant that has a VAF smaller than 35% is likely a false positive, and thus the threshold value may be set to 35%. In various embodiments, the threshold value for the VAF can be set to other different values, such as 20%, 25%, 30%, 35%, 40%, 45%, 50%, or another different value.
[00107] In various embodiments, after applying the first set of filters, a set of variants may be obtained from the candidate variants identified from a sample. The set of variants can be then subject to the co-occurrence determination, for example, to determine the quantity of cells where a candidate variant co-occurs with another variant.
[00108] Step 426: For a variant pair included in the candidate variants, determine a quantity of co-occurrence cells where both variants in the variant pair co-occur in each of the co-occurrence cells.
[00109] Tumor development emerges from a gradual accumulation of somatic alterations that together enable malignant growth. Accordingly, when two variants occur together in a cell, they are more likely to be true variants than false positives. However, for any pair of variants co-occurring in the same cell, a correlation between the two variants in the variant pair is considered likely only when they co-occur in a certain number of cells (and not in just one cell). In the methods disclosed herein, the threshold number of cells for determining the variant pair co-occurrence is set to a value determined based on the ground-truth studies or other similar analyses. In one example, the threshold number can be set to 3, which means that when variants in a variant pair co-occur in 3 cells or more, it indicates that there is a co-occurrence between the two variants in the variant pair. In other examples, different numbers other than 3 (e.g., 2, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, or another different number of cells) can be used instead.
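As a non-limiting illustration of determining the quantity of co-occurrence cells for each variant pair, the following sketch counts, for every pair of variants, the number of cells in which both are present and keeps the pairs that reach the example threshold of 3 cells. The input layout (a mapping from a cell barcode to the set of variants called in that cell) and the function names are assumptions made for illustration.

```python
from itertools import combinations


def cooccurrence_cell_counts(variants_by_cell):
    """Count, for each variant pair, the cells in which both variants occur."""
    counts = {}
    for variants in variants_by_cell.values():
        # Sorting gives each pair a single canonical (a, b) key.
        for pair in combinations(sorted(set(variants)), 2):
            counts[pair] = counts.get(pair, 0) + 1
    return counts


def pairs_meeting_threshold(counts, min_cells=3):
    """Keep pairs that co-occur in at least `min_cells` cells."""
    return {pair: n for pair, n in counts.items() if n >= min_cells}


# Hypothetical single-cell calls.
variants_by_cell = {
    "cell_1": {"A", "B"},
    "cell_2": {"A", "B"},
    "cell_3": {"A", "B", "C"},
    "cell_4": {"C"},
    "cell_5": {"A"},
}

counts = cooccurrence_cell_counts(variants_by_cell)
passing = pairs_meeting_threshold(counts, min_cells=3)
print(counts)
print(passing)
```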
[00110] Step 428: Determine a set of variant pairs based on quantities of co-occurrence cells determined for a plurality of variant pairs included in the set of candidate variants.
[00111] In various embodiments, the statistical significance of variant co-occurrence is tested using a Binomial test. A Binomial test is a test of the statistical significance of deviations from a theoretically expected distribution of observations into two categories using sample data. In various embodiments, the Binomial test tests the difference between a sample proportion and a given proportion. When using the Binomial test to test co-occurrence, a single occurrence probability is estimated for each variant and applied to all cells in the sample. The product of the two variant occurrence probabilities gives the probability that the two variants overlap in a cell by chance. In one example, the threshold p-value of the Binomial test is set to 0.01, which means that when the determined p-value of the Binomial test is less than or equal to 0.01, the detected co-occurrence is statistically significant. In other examples, the threshold p-value of the Binomial test can be set to a value other than 0.01 (e.g., 0.005, 0.006, 0.007, 0.008, 0.009, 0.011, 0.012, 0.013, 0.014, 0.015 or another value) to determine the statistical significance.
[00112] In various embodiments, a one-sided Binomial test is used in evaluating the statistical significance. As described earlier, if one variant is in 1% of cells and the other is in 80%, most cells carrying the rare variant would be expected to also carry the common variant by chance. A one-sided Binomial test can filter out such cases, where one variant is rare and the other is not. In various embodiments, the Binomial test can be expressed as:
[Equation image: Figure imgf000031_0001]
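The expression in the figure is not reproduced in this text. As a rough illustration only, a one-sided (upper-tail) Binomial test of co-occurrence could be sketched as follows, assuming the chance probability is the product of the two per-cell occurrence fractions as described in paragraph [00111]. The function names and the example counts are hypothetical and are not taken from the source.

```python
from math import exp, lgamma, log


def binomial_sf(k: int, n: int, p: float) -> float:
    """Upper-tail probability P(X >= k) for X ~ Binomial(n, p).

    Terms are computed in log space so that large cell counts do not overflow.
    """
    if k <= 0:
        return 1.0
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (
            lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
            + i * log(p) + (n - i) * log(1.0 - p)
        )
        total += exp(log_pmf)
    return min(1.0, total)


def cooccurrence_pvalue(n_cells: int, cells_a: int, cells_b: int, cells_both: int) -> float:
    """One-sided p-value that two variants share `cells_both` or more cells by chance.

    Assumption: each variant has one occurrence probability for all cells, and
    the chance of overlap in a cell is the product of the two probabilities.
    """
    p_chance = (cells_a / n_cells) * (cells_b / n_cells)
    return binomial_sf(cells_both, n_cells, p_chance)


# Two rare variants sharing 5 cells: far more overlap than chance predicts.
p_rare_pair = cooccurrence_pvalue(n_cells=1000, cells_a=10, cells_b=10, cells_both=5)
# One rare (1%) and one common (80%) variant sharing 8 cells: expected by chance.
p_rare_common = cooccurrence_pvalue(n_cells=1000, cells_a=10, cells_b=800, cells_both=8)
```

With a threshold of 0.01, the first pair would be retained and the second would not, which matches the behavior described for the one-sided test.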
[00113] In various embodiments, a sample may have tens of thousands of rare variants, with the vast majority occurring in fewer than 10 cells. Performing a statistical test, such as a Binomial test, on all possible pairwise combinations of these variants is computationally very inefficient. To speed up the computation, the mathematical concepts of sparse matrices, matrix multiplication, and adjacency matrices may be used. A sparse matrix is a matrix in which the number of zero elements is much higher than the number of non-zero elements. As a rule of thumb, if 2/3 of the total elements in a matrix are zeros, it can be called a sparse matrix. In graph theory and computer science, an adjacency matrix is a square matrix used to represent a finite graph. The elements of the matrix indicate whether pairs of vertices are adjacent or not in the graph. In various embodiments, a sparse matrix representation of which variants are present in each cell is generated, and this sparse matrix is multiplied by its own transpose to yield an adjacency matrix whose entries give the number of cells in which each pair of variants co-occurs. Redundant memory access to non-zeros may be eliminated by decoupling multiplication from accumulation, thereby speeding up the Binomial test of all possible combinations of variant pairs.
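As a minimal sketch of the idea, the co-occurrence counts can be accumulated directly from the non-zero entries of the cell-by-variant matrix. The pure-Python version below is a stand-in for a sparse matrix product, not the implementation used in the source; for a binary cell-by-variant matrix X, the counts it returns correspond to the off-diagonal entries of the product of the transpose of X with X. The function names, cell identifiers, and variant identifiers are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_counts(cell_variants):
    """Count, for every variant pair, the number of cells containing both.

    cell_variants: dict mapping a cell id to the set of variant ids called in
    that cell (only the non-zero entries of the matrix are stored).
    Returns a dict mapping (variant_i, variant_j), with variant_i < variant_j,
    to the number of co-occurrence cells.
    """
    counts = defaultdict(int)
    for variants in cell_variants.values():
        # Only pairs that actually share a cell are ever visited.
        for pair in combinations(sorted(variants), 2):
            counts[pair] += 1
    return dict(counts)


def pairs_meeting_threshold(counts, min_cells=3):
    """Keep pairs that co-occur in at least `min_cells` cells (3 in one example)."""
    return {pair: n for pair, n in counts.items() if n >= min_cells}


cells = {
    "cell_1": {"v1", "v2"},
    "cell_2": {"v1", "v2", "v3"},
    "cell_3": {"v1", "v2"},
    "cell_4": {"v3"},
}
pair_counts = cooccurrence_counts(cells)
kept_pairs = pairs_meeting_threshold(pair_counts, min_cells=3)
```

Because most rare variants appear in only a handful of cells, the inner loop touches far fewer pairs than an exhaustive scan over all variant combinations, and only the surviving pairs need to be passed to the Binomial test.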
[00114] In various embodiments, based on the Binomial test of all possible combinations of variant pairs, a subset of variant pairs that have shown statistical significance in the Binomial test may be obtained.
[00115] Step 430: Apply a second set of ad hoc filters to identify a sub-subset of variant pairs.
[00116] In various embodiments, after identifying a subset of variant pairs that show significance in the co-occurrence test, the subset of variant pairs may be further filtered to remove certain false positives. A first example filter relates to a genomic distance between the two variants in a variant pair. In general, co-occurring variants that are genomically very close together are more likely the result of PCR or alignment errors. By filtering out the variant pairs where the two variants in a pair are too close, the likely false positives from PCR and alignment errors are excluded. In applications, the threshold genomic distance between two variants in a variant pair is determined based on the knowledge identified from PCR and alignment error analyses. In one example, the threshold genomic distance is set to 100 base pairs, which means that if co-occurring variants in a variant pair are less than 100 base pairs apart from each other, the variant pair can be filtered out from the true variant analysis.
[00117] A second example filter for a variant pair that shows significant co-occurrence relates to an average VAF. Just as VAF can be used to evaluate a single variant, the average VAF can be used to evaluate the variant pair. Here, the average VAF takes the average of the VAFs for the two variants in a variant pair. The specific values used in the VAF filtering process may be also determined based on the information obtained for single variant-based VAF analysis. In one example, the threshold value may be set to 35, which means that if the average VAF for a variant pair is less than 35, one or both variants in a variant pair are likely false positives. In other examples, the threshold value for the average VAF can be set to other different values, such as 20, 25, 30, 35, 40, 45, 50, or another different value.
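The two example pair filters can be sketched as a single predicate. This is an illustration under stated assumptions rather than the source's implementation: the `Variant` record and its field names are hypothetical, VAF is assumed to be expressed in the same units as the threshold of 35, and variants on different chromosomes are assumed to pass the distance filter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    chrom: str   # chromosome name (hypothetical field)
    pos: int     # genomic position in base pairs (hypothetical field)
    vaf: float   # variant allele frequency, same units as the threshold


def passes_pair_filters(v1: Variant, v2: Variant,
                        min_distance_bp: int = 100,
                        min_mean_vaf: float = 35.0) -> bool:
    """Apply the genomic distance filter and the average VAF filter to a pair."""
    # Distance filter: pairs closer than the threshold are likely PCR or
    # alignment artifacts. Assumption: only same-chromosome pairs are compared.
    if v1.chrom == v2.chrom and abs(v1.pos - v2.pos) < min_distance_bp:
        return False
    # Average VAF filter: the mean of the two VAFs must reach the threshold.
    mean_vaf = (v1.vaf + v2.vaf) / 2.0
    return mean_vaf >= min_mean_vaf


too_close = passes_pair_filters(Variant("chr1", 100, 40.0), Variant("chr1", 150, 40.0))
far_apart = passes_pair_filters(Variant("chr1", 100, 40.0), Variant("chr1", 500, 40.0))
low_vaf = passes_pair_filters(Variant("chr1", 100, 20.0), Variant("chr2", 120, 40.0))
```

Here `too_close` and `low_vaf` evaluate to False and `far_apart` to True.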
[00118] In various embodiments, the variants included in the variant pairs that have passed the second set of ad hoc filters are considered true variants.
[00119] In various embodiments, the methods disclosed herein for co-occurrence-based true variant identification further include identifying a subpopulation of cells with one or more of the true variants (e.g., cells with two or more true variants in tumor development emerging from a gradual accumulation of somatic alterations that together enable malignant growth). In various embodiments, one or more true variants represent rare variants present in the subpopulation of cells. In various embodiments, the subpopulation of cells represents a rare cell population within a heterogeneous population of cells. In various embodiments, the subpopulation of cells represents less than 10%, less than 5%, less than 4%, less than 3%, less than 2%, or less than 1% of the cells in the heterogeneous population of cells.
Methods for Background Error Rate-Based True Variant Identification
[00120] Embodiments disclosed herein leverage background error rates to identify true variants. Prior approaches rely on a fixed threshold for all variants, which does not consider the variability in background error rates across variants. In the methods disclosed herein, a threshold for identifying a true variant is dynamically determined for each variant. Altogether, the methods disclosed herein enable a lower limit of detection for some variants, while having a higher threshold for variants that are more error-prone, thereby reducing false positives.
[00121] In various embodiments, the methods disclosed herein leverage control samples for estimating background error rates for individual variants. These control samples can be samples that have been previously or instantly generated specifically for background error rate estimation. These individual control samples may provide additional information beyond what a single sample can provide on its own in respective studies.
[00122] In various embodiments, the methods disclosed herein use Beta-Binomial distribution to statistically model the per-variant background error rate. Furthermore, the methods disclosed herein use the variant-specific background error rate to determine whether a variant is a true variant. The variant-specific background error rate generated using the Beta-Binomial distribution has certain advantages. First, as described above, the methods allow the calculation of a p-value for each variant, which will be the probability that the variant call is an error. Second, Beta-Binomial is a count-based distribution, so it dynamically adjusts to the total number of cells in a sample. The number of cells where a false positive variant is observed roughly scales with the total number of cells in the sample and therefore, the Beta-Binomial distribution captures this adjustment. Furthermore, unlike count-based Binomial distribution or other similar Binomial distributions, the Beta-Binomial allows for over-dispersion, which enables it to be more flexible in modeling different error distribution shapes. The specific processes for the background error rate generated from the Beta-Binomial distribution in identifying true variants are further described in detail below.
[00123] FIG. 4B depicts a flow chart of an example method 440 for identifying true variants based on the background error rate generated based on the Beta-Binomial distribution.
[00124] Step 442: Obtain a set of candidate variants from a heterogeneous population of cells.
[00125] In various embodiments, the candidate variants may be identified from a cell sample, as described earlier. For example, the cell sample may include variants determined from sequencing data from a number of heterogeneous cells. Some of the cells may include rare variants, e.g., cancer cells with rare variants indicative of MRD.
[00126] Step 444: Apply a first set of variant filters to identify a set of variants from candidate variants.
[00127] Similar to the methods described earlier for co-occurrence-based true variant identification, a first set of variant filters may also be applied to filter out potential false positives before the background error rate-based true variant identification. The first set of variant filters may be similar to, or different from, the filters applied earlier in the co-occurrence-based true variant identification. For example, the first set of filters for the method 440 may include a depth of coverage filter, a genotype quality filter, and a genotyped cell percentage filter regarding a percentage of cells that are genotyped for a genomic position of a variant. The threshold values for each filter may be similarly configured or selected. In one example, the genotyped cell percentage threshold may be set to at least 10%, 20%, 30%, 40%, 50%, or 60%. In various embodiments, after filtering using the first set of filters, many potential false positives may be filtered out, to obtain a filtered set of candidate variants for the background error rate-based true variant identification.
[00128] Step 446: For a variant included in the set of candidate variants, generate a statistical significance by using a beta-binomial distribution with parameters generated from a set of control samples.
[00129] In various embodiments, for each unique variant that has passed the first set of filters, the Beta-Binomial distribution is specifically generated for that variant based on the parameters estimated from a set of control samples. These estimated parameters can be then used to calculate the statistical significance (e.g., p-value) for that variant. The specific processes for generating the parameters for the Beta-Binomial distribution for a specific variant are further described in FIG. 4C. As will be described later, the parameters generated for a specific variant are normally for a variant found in the control samples. Accordingly, when a new variant is subject to the Beta-Binomial test for statistical significance, it may first check whether the new variant can be found in a control sample. If the new variant is found in the control sample, the parameters may have been already estimated for that new variant (or the parameters can be instantly generated from the control samples if they are not readily available). On the other hand, if the new variant is not found in any control sample used to generate parameters, the parameters for the Beta-Binomial distribution test may use the average parameters that are obtained by averaging the parameters estimated for all readily available parameters (e.g., parameters for all variants included in the control samples). In this way, the parameters for each specific variant can be obtained, which can be then used to calculate the p-value for that specific variant.
[00130] In various embodiments, since the parameters are specifically generated for each candidate variant for the Beta-Binomial test, the detection limit can be lowered by taking into account the background error rate specific to that variant during the true variant identification.
[00131] Step 448: Determine a subset of variants based on statistical significances generated for a plurality of variants included in the set of candidate variants.
[00132] In various embodiments, based on the calculated statistical significance, certain false positives can be further filtered out. For example, if the calculated p-value for a variant is larger than a threshold, the variant can be filtered out. In applications, the p-value threshold may be set based on the knowledge obtained from ground-truth studies or other similar studies. In one example, the p-value threshold may be set to 0.0001, which means that variants having a calculated p-value larger than 0.0001 can be filtered out. In other examples, the p-value threshold can be set to a value other than 0.0001, such as 0.00005, 0.00006, 0.00007, 0.00008, 0.00009, 0.0002, 0.0003, 0.0004, 0.0005, 0.0006, 0.0007, 0.0008, 0.0009, 0.001, or another value. In various embodiments, after filtering out certain variants based on the calculated statistical significance, a subset of candidate variants can be obtained.
[00133] Step 450: Apply a second set of filters to obtain a subset of candidate variants.
[00134] In various embodiments, the candidate variants obtained based on the background error rate can be further filtered using one or more filters. For example, a filter may be further configured to require a variant to be present in at least a number of cells regardless of the calculated p-value. The expected number of cells can be 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or another different value. In a specific application, a variant needs to be present in at least 5 cells for that variant to be considered a true variant.
[00135] In various embodiments, another filter may be configured to require the variant allele frequency (VAF) to be larger than a specific value for the variant to be considered a true variant. For example, the specific value may be set to 35, which means that variants with a VAF of less than 35 can be considered false positives and filtered out. In other examples, the specific value for the VAF can be set to a value other than 35, such as 25, 30, 40, 45, 50, 55, 60, 65, 70, or another value.
[00136] In various embodiments, the variants that have passed the second set of filters are considered true variants.
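Steps 448 and 450 can be summarized as a single decision rule. The following is a minimal sketch using the example thresholds quoted above (p-value of 0.0001, at least 5 cells, VAF of 35); the function name and argument names are hypothetical, and the p-value is assumed to have been computed beforehand from the variant-specific Beta-Binomial distribution.

```python
def passes_background_filters(p_value: float,
                              n_cells_with_variant: int,
                              vaf: float,
                              max_p_value: float = 1e-4,
                              min_cells: int = 5,
                              min_vaf: float = 35.0) -> bool:
    """Return True if a candidate variant survives Steps 448 and 450."""
    # Step 448: drop variants whose p-value is larger than the threshold.
    if p_value > max_p_value:
        return False
    # Step 450, first filter: require a minimum number of cells with the
    # variant, regardless of the calculated p-value.
    if n_cells_with_variant < min_cells:
        return False
    # Step 450, second filter: drop variants whose VAF is below the threshold.
    return vaf >= min_vaf


kept = passes_background_filters(p_value=1e-6, n_cells_with_variant=12, vaf=48.0)
too_few_cells = passes_background_filters(p_value=1e-6, n_cells_with_variant=3, vaf=48.0)
```

The second call returns False even though its p-value is small, which illustrates why the cell-count filter is applied independently of the statistical test.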
[00137] In various embodiments, the methods disclosed herein for background error rate-based true variant identification further include identifying a subpopulation of cells with one or more of the true variants (e.g., cells with one or more true variants identified through the above-described processes). In various embodiments, the one or more true variants represent rare variants present in the subpopulation of cells. In various embodiments, the subpopulation of cells represents a rare cell population within a heterogeneous population of cells. In various embodiments, the subpopulation of cells represents less than 10%, less than 5%, less than 4%, less than 3%, less than 2%, or less than 1% of the cells in the heterogeneous population of cells.
[00138] FIG. 4C depicts an example method 460 for generating parameters for Beta-Binomial distribution for a specific variant.
[00139] In various embodiments, the parameters may be generated based on a set of control samples and may be generated for a variant included in one or more of the control samples. In one embodiment, for each variant included in the set of control samples, a set of Beta-Binomial parameters may be generated.
[00140] Step 462: Acquire a set of control samples.
[00141] In various embodiments, the control samples may be collected based on the existing data already available from previous runs of single-cell sequencing (e.g., from data already available from a Tapestri® platform). Alternatively, a set of control samples may also be purposely generated to estimate parameters for the Beta-Binomial distribution for different variants. In various embodiments, the control samples are ideally non-cancerous samples, such as healthy bone marrow or other tissues. This allows the control samples to contain just germline variants and no sub-clonal mutations, which would make it more difficult to filter out real variants before estimating the background error rate for each variant. In various embodiments, a sufficient number of control samples is necessary to yield robust estimates of the error rate. In one example, at least 20 control samples are collected for this purpose. In other examples, at least 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, or another number of control samples are collected in this step.
[00142] Step 464: For each control sample, mask or remove any real variants in each sample.
[00143] In various embodiments, by masking or removing real variants in each sample, only background noise variants are included in the Beta-Binomial parameter estimation. For example, the germline variants are masked or removed in this step. Here, masking refers to a process to substitute low-quality base calls with an “N”.
[00144] Step 466: For each control sample, apply a set of variant filters to identify a set of background variants.
[00145] The set of variant filters, including the filter thresholds, is set to be the same as the set of filters used in the background error rate-based true variant identification. For example, the set of filters may include a depth of coverage filter and a genotype quality filter. If the depth of coverage threshold used in the background error rate-based true variant identification is set to 10, the threshold for the depth of coverage in this step is also set to 10. Similarly, if the genotype quality threshold used in the background error rate-based true variant identification is set to 30, the threshold for the genotype quality in this step is also set to 30. In this way, variants remaining in each control sample can be considered background variants, or noise variants, which can thus be used to generate a background error rate for a specific variant.
[00146] Step 468: For each background variant, determine a quantity of cells containing the background variant in each sample.
[00147] In generating parameters for specific variants, the number of cells with the variant is counted in each control sample. This then leads to the generation of two vectors: (1) a first vector referring to the number of cells that have the variant in each control sample, and (2) a second vector referring to the number of total cells in each control sample. These two vectors can be then used to estimate the parameters of the Beta-Binomial distribution.
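The source does not state how the two vectors are turned into Beta-Binomial parameters. One simple possibility, shown here purely as an assumption, is to match the mean and variance of the per-sample fractions of cells carrying the variant to a Beta distribution. This crude moment-matching ignores Binomial sampling noise, and a maximum-likelihood fit would be a reasonable alternative. The function name and the example counts are hypothetical.

```python
from statistics import mean, variance


def estimate_beta_binomial_params(cells_with_variant, total_cells):
    """Moment-matching estimate of (a, b) from per-control-sample counts.

    cells_with_variant: number of cells with the variant in each control sample.
    total_cells: total number of cells in each control sample.
    """
    fractions = [k / n for k, n in zip(cells_with_variant, total_cells)]
    m = mean(fractions)
    v = variance(fractions)  # sample variance, needs at least two samples
    if not 0.0 < m < 1.0 or v <= 0.0:
        raise ValueError("fractions must vary and have a mean strictly between 0 and 1")
    common = m * (1.0 - m) / v - 1.0
    if common <= 0.0:
        raise ValueError("variance too large for a Beta fit by moment matching")
    # For a Beta(a, b) distribution: mean = a / (a + b) and a + b = common.
    return m * common, (1.0 - m) * common


# Three hypothetical control samples of 1000 cells each.
a_est, b_est = estimate_beta_binomial_params([10, 20, 30], [1000, 1000, 1000])
```

In this example the fitted mean a / (a + b) equals the average observed fraction of 0.02, and a larger spread between control samples would yield smaller parameters, reflecting more over-dispersion.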
[00148] Step 470: Generate parameters for the beta-binomial distribution for each background variant based on the determined quantity of cells containing the background variant in each sample.
[00149] In various embodiments, the parameters are generated based on the vectors generated for each specific variant. It should be noted that the parameters are only generated for the remaining variants that are false positives in Step 468. That is, the per-variant error rates are generated for the remaining false positives in Step 468, and thus can be used to identify such false positives when identifying the true variants for a new sample, i.e., a sample that is not a control sample (e.g., a sample collected from a cancer patient). However, when using the background error rate to identify a true variant for a new sample, a candidate variant may not be found in the control samples. That is, a candidate variant may not have a corresponding per-variant error rate. In such a scenario, parameters from all the remaining false positives may be averaged, and the averaged parameters can then be used to generate a background error rate that can be applied to a new variant that is not found among the remaining false positives in Step 468.
[00150] In various embodiments, for a variant not found in the control samples, the parameters for the Beta-Binomial distribution are generated by averaging all parameters from all remaining false positive variants after filtration and removal of germline variants.
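The lookup with an averaged fallback can be sketched as follows. The dictionary layout, function name, and variant identifiers are hypothetical; the only behavior taken from the source is that a variant absent from the control samples receives the average of the parameters estimated for all background variants.

```python
def params_for_variant(variant_id, per_variant_params):
    """Return Beta-Binomial parameters (a, b) for a candidate variant.

    per_variant_params: dict mapping a background variant id to its (a, b)
    parameters estimated from the control samples.
    """
    if variant_id in per_variant_params:
        return per_variant_params[variant_id]
    if not per_variant_params:
        raise ValueError("no background variants available to average")
    # Fallback: average each parameter over all background variants.
    a_values = [a for a, _ in per_variant_params.values()]
    b_values = [b for _, b in per_variant_params.values()]
    return sum(a_values) / len(a_values), sum(b_values) / len(b_values)


control_params = {"variant_1": (1.0, 100.0), "variant_2": (3.0, 300.0)}
known = params_for_variant("variant_1", control_params)
unseen = params_for_variant("variant_new", control_params)
```

Here `known` is the stored pair for `variant_1`, while `unseen` is the element-wise average (2.0, 200.0).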
[00151] In various embodiments, after the parameters (including parameters for the remaining false positives and new variants not found in the control samples as described above) are estimated from the control samples, the probability of whether a candidate variant is from background error or is a true variant can then be determined. The following is an example probabilistic model used for modeling the variant background error rate, according to various embodiments:
e ~ BetaBinomial(k, n, a, b)
e = probability that a variant is from a background error
k = number of cells with that variant
n = total number of cells
a, b = parameters estimated from control samples
As can be seen from the probabilistic model, based on the parameters estimated from control samples, it can be determined whether a candidate variant is a true variant or a false positive due to the background error.
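One way to turn this model into a p-value is to take the upper tail of the Beta-Binomial distribution, that is, the probability of observing k or more cells with the variant out of n cells under background error alone. The upper-tail choice and the function names below are assumptions made for illustration; the probability mass function itself is the standard Beta-Binomial form, evaluated in log space for numerical stability.

```python
from math import exp, lgamma


def log_beta(x: float, y: float) -> float:
    """Natural log of the Beta function B(x, y)."""
    return lgamma(x) + lgamma(y) - lgamma(x + y)


def beta_binomial_log_pmf(k: int, n: int, a: float, b: float) -> float:
    """log P(K = k) for K ~ BetaBinomial(n, a, b)."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + log_beta(k + a, n - k + b) - log_beta(a, b)


def background_error_pvalue(k: int, n: int, a: float, b: float) -> float:
    """P(K >= k): chance that background error alone yields k or more cells."""
    tail = sum(exp(beta_binomial_log_pmf(i, n, a, b)) for i in range(k, n + 1))
    return min(1.0, tail)


# With a = 1 and b = 999 the expected background rate is 1 cell per 1000.
p_few_cells = background_error_pvalue(k=2, n=1000, a=1.0, b=999.0)
p_many_cells = background_error_pvalue(k=20, n=1000, a=1.0, b=999.0)
```

A variant seen in many more cells than its background rate predicts receives a small p-value (`p_many_cells` is much smaller than `p_few_cells`), and because the tail is computed relative to n, the same parameters can be reused for samples with different total cell counts.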
[00152] FIG. 4D further depicts a flow diagram for generating a per-variant background error rate. As illustrated, the bone marrow from healthy donors may be collected to get a number of samples, e.g., 30 samples or more. The single-cell sequencing is then performed, and a first set of variant filters are then applied. In addition, the likely germline variants are masked or removed. The remaining variants are false positives, which may be considered background errors in a variant call. These false positives are then used to generate a per-variant background error rate using the Beta-Binomial distribution. As can be seen from FIG. 4D, each variant may have a different background error rate, which can be then taken into consideration for a specific variant in the true variant identification process.
Methods for Multianalyte Correlation-Based True Variant Identification
[00153] In particular embodiments, methods disclosed herein involve correlating an invariant filtered set of variants against orthogonal analyte data and/or all multiplex samples. For example, an invariant filtered set of genomic variants detected via single-cell DNA analysis is correlated against orthogonal analyte data. Although the subsequent description refers to orthogonal protein analyte data, the same principle may be extended to other orthogonal analyte data. Both random and systematic errors should be uniform across varying analytes and samples, whereas true somatic variants can be identified by strong correlation against analytes and/or other samples. Put another way, true somatic variants that are present in the genome will exhibit correlations with another analyte other than DNA (e.g., protein analyte expression data) whereas falsely called variants are randomly present, but will not exhibit correlations with another analyte other than DNA (e.g., protein analyte expression data).
[00154] Furthermore, particularly rare somatic variants may appear at low frequencies (e.g., less than 1% VAF) in a limited number of cells and may be absent in other cells. Thus, by combining multiple cells, the cells that lack the rare somatic variant can serve as an appropriate background, such that the rare somatic variants that are present from particular cells can rise above the background level and be identified by correlating to orthogonal protein analyte data, as further described in detail below.
[00155] FIG. 4E is a flow chart of an example method 480 for performing de novo variant calling via multianalyte correlation, in accordance with an embodiment.
[00156] Step 482: Obtain a set of candidate variants determined through single-cell DNA sequencing.
[00157] In various embodiments, the set of candidate variants includes true variants and a plurality of falsely called variants arising from base-calling errors, processing errors (e.g., due to PCR bias), and/or artifacts (e.g., allele dropout or sequence homology).
[00158] Step 484: Determine allele frequencies of the candidate variants.
[00159] For example, the allele frequencies of the candidate variants can be determined from sequence reads generated via single-cell DNA sequencing.
[00160] Step 486: Correlate the allele frequencies of the candidate variants to the protein expression of one or more proteins.
[00161] In various embodiments, this step includes determining a correlation between each candidate variant and protein expression of each protein of a plurality of proteins. In various embodiments, there may be at least ten, at least fifty, at least a hundred, or at least a thousand candidate variants. In a specific example, these proteins may be cell surface markers that represent an immunophenotype of a cell. In various embodiments, there may be protein expression data for at least one, at least two, at least three, at least four, at least five, at least six, at least seven, at least eight, at least nine, at least ten, at least eleven, at least twelve, at least thirteen, at least fourteen, at least fifteen, at least sixteen, at least seventeen, at least eighteen, at least nineteen, or at least twenty proteins. The protein expression data may be constructed based on the amplification amplitude of the oligos included in the antibody-oligo conjugates.
[00162] In various embodiments, correlating the allele frequencies of the candidate variants to the protein expression of proteins involves constructing a correlation matrix. Example correlation matrices are described and shown below in the Examples section. In various embodiments, a correlation matrix may include, on the Y-axis, the plurality of candidate variants. In various embodiments, a correlation matrix may include, on the X-axis, the plurality of proteins. The values of the correlation matrix may represent correlation (or anti-correlation) between each candidate variant on the Y-axis and the protein expression of each protein on the X-axis. In various embodiments, a correlation matrix may further include, on the X-axis, germline variants determined from DNA sequencing. Including at least the proteins on the X-axis improves visualization, because the correlation values in the matrix help distinguish true variants from falsely called variants.
[00163] Step 488: Select a subset of candidate variants as true variants based on the correlation of allele frequencies of candidate variants and protein expression of one or more proteins.
[00164] In various embodiments, selecting a subset of the candidate variants as true variants comprises identifying a candidate variant with an allele frequency that correlates with protein expression with at least a threshold number of proteins. In various embodiments, the threshold number of proteins is at least two, at least three, at least four, at least five, at least six, at least seven, at least eight, at least nine, at least ten, at least eleven, at least twelve, at least thirteen, at least fourteen, at least fifteen, at least sixteen, at least seventeen, at least eighteen, at least nineteen, or at least twenty proteins.
[00165] In various embodiments, the correlation matrix described in Step 486 is used to select a subset of candidate variants as true variants. For example, the correlation values in the correlation matrix can be used to determine whether a candidate variant is correlated with the protein expression of a plurality of proteins.
[00166] In various embodiments, a candidate variant is deemed to be a true variant based on its allele frequency being correlated with the protein expression of a plurality of proteins. Correlation is assessed using the standard deviation of correlation values for the candidate variant across the plurality of proteins. In various embodiments, a candidate variant is deemed to be a true variant if the standard deviation of correlation values across the plurality of proteins is above a threshold value. In various embodiments, a candidate variant is deemed to be a falsely called variant if the standard deviation of correlation values across the plurality of proteins is below a threshold value. For example, falsely called variants will be uncorrelated with protein expression values, and therefore, the standard deviation value of a false variant across the proteins will be below a threshold value. Conversely, true variants may be highly correlated with the protein expression of certain proteins and highly anti-correlated with the protein expression of other proteins.
[00167] In various embodiments, a candidate variant is deemed to be correlated with protein expression of a plurality of proteins based on the standard deviation of correlation values for the candidate variant across the plurality of proteins and plurality of DNA sites. In various embodiments, a candidate variant is deemed to be a true variant if the standard deviation of correlation values across the plurality of proteins and plurality of DNA sites is above a threshold value. In various embodiments, a candidate variant is deemed to be a falsely called variant if the standard deviation of correlation values across the plurality of proteins and plurality of DNA sites is below a threshold value.
[00168] Therefore, the standard deviation of correlation values for a true variant across the proteins will be above a threshold value. In various embodiments, the threshold standard deviation value is any of 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, or 0.20. In particular embodiments, the threshold standard deviation value is 0.07. In particular embodiments, the threshold standard deviation value is 0.10.
[00169] Step 490: Identify a subpopulation of cells with one or more of the true variants. In various embodiments, the one or more true variants represent rare variants present in the subpopulation of cells. In various embodiments, the subpopulation of cells represents a rare cell population within a heterogeneous population of cells. In various embodiments, the subpopulation of cells represents less than 10%, less than 5%, less than 4%, less than 3%, less than 2%, or less than 1% of the cells in the heterogeneous population of cells.
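The standard-deviation criterion of Step 488 can be sketched for a single candidate variant as follows. This is an illustration under assumptions: Pearson correlation and the population standard deviation are used because the source does not specify either choice, and the function names, protein labels, and per-cell values are hypothetical.

```python
from math import sqrt
from statistics import pstdev


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def correlation_spread(variant_af, protein_expression):
    """Standard deviation of the variant's correlations across all proteins.

    variant_af: per-cell allele frequency of one candidate variant.
    protein_expression: dict mapping a protein name to per-cell expression.
    """
    correlations = [pearson(variant_af, expr) for expr in protein_expression.values()]
    return pstdev(correlations)


def is_true_variant(variant_af, protein_expression, threshold=0.07):
    """Deem the variant true if the spread of correlations exceeds the threshold."""
    return correlation_spread(variant_af, protein_expression) > threshold


proteins = {
    "protein_A": [1, 1, 1, 1, 9, 9, 9, 9],
    "protein_B": [9, 9, 9, 9, 1, 1, 1, 1],
    "protein_C": [5, 6, 5, 6, 5, 6, 5, 6],
}
clonal_af = [0, 0, 0, 0, 50, 50, 50, 50]  # tracks protein_A, opposes protein_B
noise_af = [0, 0, 50, 50, 0, 0, 50, 50]   # unrelated to every protein
clonal_call = is_true_variant(clonal_af, proteins)
noise_call = is_true_variant(noise_af, proteins)
```

The first variant is strongly correlated with one protein and anti-correlated with another, so its correlations are widely spread and it is called true. The second variant has a correlation of zero with every protein, so the spread is zero and it is treated as a falsely called variant.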
Methods Combining Background Error Rate and Co-occurrence
[00170] In various embodiments, methods disclosed herein involve combining two or more of the above-described methods. In particular embodiments, the methods disclosed herein combine a background error rate-based method with a co-occurrence-based method. Specifically, after identifying some variants as potential true variants using the background error rate-based method, the identified true variants can be further subject to the co-occurrence-based method, to identify a subset of true variants. Alternatively, after identifying some true variants using the co-occurrence-based method, the identified true variants can be further subject to the background error rate-based method, to identify a subset of true variants.
[00171] In one example method, when a set of candidate variants from a sample are processed via a co-occurrence-based method 420 through Steps 422-430, a subset of candidate variants may be obtained. This subset of candidate variants may be further processed via a background error rate-based method 440 through Steps 442-450. In some embodiments, some repeating steps (such as certain filtering steps that use the same set of filters with the same thresholds) in method 440 may be omitted since these steps may have already been performed in method 420.
[00172] In another example method, when a set of candidate variants from a sample are processed via a background error rate-based method 440 through Steps 442-450, a subset of candidate variants may be obtained. This subset of candidate variants may be further processed via a co-occurrence-based method 420 through Steps 422-430. In some embodiments, some repeating steps (e.g., certain filtering steps that use the same set of filters with the same thresholds) in method 420 may be omitted since these steps may have already been performed in method 440.
[00173] In various embodiments, a multianalyte correlation-based method 480 can be further combined with one or more of method 420 or method 440. In various embodiments, by checking as many likely correlations associated with a candidate variant as possible, the confidence level of identifying a true variant can be sufficiently high to make a variant call.
Systems and Computer Devices
[00174] Embodiments described herein further refer to example systems and/or associated computer devices for performing the true variant identification methods described above. The example systems and/or associated computer devices can be configured to implement functionalities of the cell analysis workflow device 120 and variant caller device 130, as described above in reference to FIG. 1A.
[00175] FIG. 5 depicts an overall system environment, in accordance with an embodiment. Specifically, FIG. 5 depicts a single-cell analysis workflow including the designing of a targeted panel (e.g., targeted DNA panel), sample preparation, library preparation, cell sequencing, multi-omic analysis, and software analysis. In various embodiments, the single-cell analysis workflow further includes designing a panel for an additional analyte (e.g., analyte other than DNA), such as RNA or protein analytes. For example, as shown in FIG. 1A, a protein panel can be designed and antibody-conjugated oligos are provided for performing a cell staining protocol. Thus, the single-cell workflow combines both DNA and protein panels.
[00176] In various embodiments, the single-cell workflow involves encapsulating and lysing cells in droplets, performing nucleic acid amplification (including target DNA and antibody-specific oligo amplification) in droplets with a cell-specific barcode, and sequencing amplicons by NGS. Cell-specific mutational and protein profiles are then reconstructed using the software. Partial details of such a single-cell workflow are described in US Patent No. 10,161,007, which is hereby incorporated by reference in its entirety. In various embodiments, the flow pipeline shown in FIG. 1A is applicable for a Tapestri® workflow instrument.
[00177] FIG. 6 depicts an example computing device for implementing the system and methods described in reference to FIGS. 1A-5. In various embodiments, the example computing device 600 serves as the variant caller device 130 described in FIG. 1A for identifying true variants. Examples of a computing device can include a personal computer, desktop computer, laptop, server computer, a computing node within a cluster, message processors, hand-held devices, multi-processor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, mobile telephones, PDAs, tablets, pagers, routers, switches, and the like.
[00178] As shown in FIG. 6, in some embodiments, the computing device 600 includes at least one processor 602 coupled to a chipset 604. The chipset 604 includes a memory controller hub 620 and an input/output (I/O) controller hub 622. A memory 606 and a graphics adapter 612 are coupled to the memory controller hub 620, and a display 618 is coupled to the graphics adapter 612. A storage device 608, an input interface 614, and a network adapter 616 are coupled to the I/O controller hub 622. Other embodiments of the computing device 600 have different architectures.
[00179] The storage device 608 is a non-transitory computer-readable storage medium such as a hard drive, compact disk read-only memory (CD-ROM), DVD, or a solid-state memory device. The memory 606 holds instructions and data used by processor 602. The input interface 614 is a touch-screen interface, a mouse, track ball, or other types of input interface, a keyboard, or some combination thereof, and is used to input data into the computing device 600. In some embodiments, the computing device 600 may be configured to receive input (e.g., commands) from the input interface 614 via gestures from the user. The graphics adapter 612 displays images and other information on the display 618. The network adapter 616 couples the computing device 600 to one or more computer networks.
[00180] The computing device 600 is adapted to execute computer program modules for providing the functionality described herein. As used herein, the term “module” refers to computer program logic configured to provide the specified functionality. Thus, a module can be implemented in hardware, firmware, and/or software. In one embodiment, program modules are stored on the storage device 608, loaded into memory 606, and executed by the processor 602.
[00181] The types of computing devices 600 can vary from the embodiments described herein. For example, the computing device 600 can lack some of the components described above, such as graphics adapters 612, input interface 614, and displays 618. In some embodiments, a computing device 600 can include a processor 602 for executing instructions stored on a memory 606.
[00182] The methods of performing true variant identification can be implemented in hardware or software, or a combination of both. In one embodiment, a non-transitory machine-readable storage medium, such as one described above, is provided, the medium comprising a data storage material encoded with machine-readable data which, when using a machine programmed with instructions for using said data, is capable of executing instructions for performing true variant identification methods disclosed herein. Embodiments of the methods described above can be implemented in computer programs executing on programmable computers, comprising a processor, a data storage system (including volatile and non-volatile memory and/or storage elements), a graphics adapter, an input interface, a network adapter, at least one input device, and at least one output device. A display is coupled to the graphics adapter. Program code is applied to input data to perform the functions described above and generate output information. The output information is applied to one or more output devices, in known fashion. The computer can be, for example, a personal computer, microcomputer, or workstation of conventional design.
[00183] Each program can be implemented in a high-level procedural or object-oriented programming language to communicate with a computer system. However, the programs can be implemented in assembly or machine language, if desired. In any case, the language can be a compiled or interpreted language. Each such computer program is preferably stored on a storage medium or device (e.g., ROM or magnetic diskette) readable by a general or special purpose programmable computer, for configuring and operating the computer when the storage medium or device is read by the computer to perform the procedures described herein. The system can also be considered to be implemented as a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner to perform the functions described herein.
[00184] The signature patterns and databases thereof can be provided in a variety of media to facilitate their use. “Media” refers to a manufacture that contains the signature pattern information of the present invention. The databases of the present invention can be recorded on computer readable media, e.g., any medium that can be read and accessed directly by a computer. Such media include, but are not limited to: magnetic storage media, such as floppy discs, hard disc storage medium, and magnetic tape; optical storage media such as CD-ROM; electrical storage media such as RAM and ROM; and hybrids of these categories such as magnetic/optical storage media. One of skill in the art can readily appreciate how any of the presently known computer readable mediums can be used to create a manufacture comprising a recording of the present database information. “Recorded” refers to a process for storing information on computer readable medium, using any such methods as known in the art. Any convenient data storage structure can be chosen, based on the means used to access the stored information. A variety of data processor programs and formats can be used for storage, e.g., word processing text file, database format, etc.
Additional Embodiments
[00185] Disclosed herein is a method for performing de novo variant calling via multianalyte and multisample correlation. Here, methods disclosed herein are useful for distinguishing true somatic variants in a sample from more-numerous false positives by evaluating variant correlation with other analytes, such as protein expression, and/or correlation with the germline variants of multiplexed samples.
[00186] Disclosed herein is a method for identifying a subpopulation of cells from a heterogeneous population of cells, the method comprising: obtaining a set of candidate variants; determining allele frequencies of the candidate variants for each of one or more cells, wherein the allele frequencies are determined through single-cell DNA sequencing of the heterogeneous population of cells; correlating determined allele frequencies of the candidate variants to protein expression of a plurality of proteins; selecting a subset of the candidate variants as true variants, wherein the subset of candidate variants are selected based on correlation of the allele frequencies of the subset of candidate variants and protein expression of one or more proteins of the plurality of proteins.
[00187] In various embodiments, correlating determined allele frequencies of the candidate variants to protein expression comprises generating a correlation matrix. In various embodiments, the threshold number of proteins is at least two, at least three, at least four, at least five, at least six, at least seven, at least eight, at least nine, at least ten, at least eleven, at least twelve, at least thirteen, at least fourteen, at least fifteen, at least sixteen, at least seventeen, at least eighteen, at least nineteen, or at least twenty proteins. In various embodiments, methods disclosed herein further comprise identifying a subpopulation of cells from the heterogeneous population of cells, the subpopulation of cells comprising one or more of the true variants. In various embodiments, the subpopulation of cells represents less than 1% of cells in the heterogeneous population of cells. In various embodiments, the heterogeneous population of cells represents a pooled sample comprising a plurality of cell samples. In various embodiments, the plurality of cell samples are distinguishable by germline variants that correlate with the true variants.
[00188] In various embodiments, for a candidate variant, the correlation is based on a standard deviation of correlation values for the candidate variant across the plurality of proteins. In various embodiments, the standard deviation of correlation values further encompasses correlation values for the candidate variant across a plurality of DNA sites. In various embodiments, the candidate variant is correlated if the standard deviation of correlation values is greater than a threshold value. In various embodiments, the candidate variant is not correlated if the standard deviation of correlation values is less than a threshold value. In various embodiments, the threshold value is any of 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, or 0.20.
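By way of a non-limiting illustration, the standard-deviation criterion above can be sketched in Python as follows. The sketch assumes Pearson correlation and a population standard deviation, neither of which is specified in this passage; the function names, the feature names, and the eight-cell data are hypothetical and are not taken from the disclosure. The default threshold of 0.07 is one of the values listed above.

```python
import math
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def classify_by_correlation_sd(vaf_by_variant, features, threshold=0.07):
    """Map each variant to (SD of its row of correlation values, SD > threshold)."""
    result = {}
    for name, vaf in vaf_by_variant.items():
        # One row of the correlation matrix: this variant against every feature
        # (protein expression and, optionally, other DNA sites).
        row = [pearson(vaf, values) for values in features.values()]
        sd = statistics.pstdev(row)  # population SD is an assumption
        result[name] = (sd, sd > threshold)
    return result


# Hypothetical per-cell measurements for eight cells (illustration only).
features = {
    "protein_A": [9, 9, 9, 1, 1, 1, 1, 1],
    "protein_B": [1, 1, 1, 9, 9, 9, 9, 9],
    "protein_C": [5, 5, 5, 4, 6, 4, 6, 5],
}
vaf_by_variant = {
    "variant_1": [50, 48, 52, 0, 0, 0, 0, 0],  # tracks the protein_A-high cells
    "variant_2": [2, 0, 1, 1, 1, 1, 1, 1],     # sporadic low-level signal
}
calls = classify_by_correlation_sd(vaf_by_variant, features)
for name, (sd, is_true) in calls.items():
    print(name, round(sd, 3), "true variant" if is_true else "not correlated")
```

In this sketch a variant whose correlation values are spread widely across the features (strongly positive with some, strongly negative with others) receives a large standard deviation, whereas a variant that is uniformly uncorrelated receives a value near zero.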
Examples
[00189] Below are examples of specific embodiments for carrying out the present invention. The examples are offered for illustrative purposes only and are not intended to limit the scope of the present invention in any way. Efforts have been made to ensure accuracy with respect to numbers used (e.g., amounts, temperatures, etc.), but some experimental error and deviation should be allowed for.
Example 1: Example Method of Identifying True Variants Based on Co-occurrence
[00190] Example method 1 relates to a method of identifying true variants based on the co-occurrence of two variants in a same cell in a number of cells. In the example method, a sample had tens of thousands of rare variants, with the vast majority occurring in fewer than 10 cells. Without filtering and multianalyte and multisample correlation-based variant calls, the sample was anticipated to yield too many false positives. Hence, the following optimizations were implemented to minimize the false positive rate and improve computational speed. First, ad hoc filters were imposed, an example of which required that the variants in a pair of co-occurring variants be at least 100 base pairs from each other. Generally, co-occurring variants that are genomically very close together are more likely the result of PCR or alignment errors. Second, the Binomial test was optimized and performed on all possible combinations of variants. In most samples, there are tens of thousands of variants, and performing a statistical test on all possible combinations is computationally very inefficient. To speed up the computation, the mathematical concepts of sparse matrices and multiplication of adjacency matrices were leveraged to achieve a dramatic speedup. The computation included testing the statistical significance of the co-occurrence of variant pairs using the Binomial test, the results of which were then used to determine whether or not a variant pair was a true variant pair.
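As a non-limiting sketch of the sparse co-occurrence counting described above, each cell can be stored as the set of variants it carries (a sparse row of a binary cell-by-variant matrix X), so that only pairs that actually co-occur are visited. The resulting counts equal the off-diagonal entries of the product of the transpose of X with X. The function names, the variant identifiers, the positions, the minimum of two co-occurrence cells, and the treatment of variants on different chromosomes as sufficiently distant are all assumptions made for illustration; only the 100 base pair distance is taken from the example above.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(cells):
    """Count, for every variant pair, the cells in which both variants are present.

    `cells` is an iterable of sets of variant identifiers, one set per cell.
    """
    counts = Counter()
    for variants in cells:
        # Only pairs present in this cell are touched, so the cost depends on
        # the variants per cell rather than on all possible variant pairs.
        for pair in combinations(sorted(variants), 2):
            counts[pair] += 1
    return counts


def far_enough_apart(pair, positions, min_distance=100):
    """Ad hoc filter: variants of a pair must be at least `min_distance` bp apart."""
    (chrom_a, pos_a), (chrom_b, pos_b) = positions[pair[0]], positions[pair[1]]
    # Assumption: variants on different chromosomes count as far apart.
    return chrom_a != chrom_b or abs(pos_a - pos_b) >= min_distance


# Hypothetical data (illustration only).
cells = [
    {"v1", "v2"}, {"v1", "v2"}, {"v1", "v2", "v3"},
    {"v3"}, {"v4", "v5"}, {"v4", "v5"}, set(),
]
positions = {
    "v1": ("chr1", 1000), "v2": ("chr2", 500), "v3": ("chr1", 5000),
    "v4": ("chr3", 100), "v5": ("chr3", 130),
}
counts = cooccurrence_counts(cells)
kept = {
    pair: n for pair, n in counts.items()
    if n >= 2 and far_enough_apart(pair, positions)
}
print(kept)
```

Here the pair (v4, v5) co-occurs in two cells but is removed because the two variants lie 30 base pairs apart, consistent with the rationale that closely spaced co-occurring variants more likely reflect PCR or alignment errors.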
[00191] FIG. 7 depicts example outcomes of the statistical significance test for different variant pairs. Three pairs of variants were tested. In the left panel (two vertical columns with 10 rows, each row representing data from one cell and each column representing data from one variant) shown in FIG. 7, in the 10 cells tested, the pair of variants co-occurred in three cells and neither of them occurred alone in any other cell. The Binomial test indicated that there was statistical significance, and thus the two variants in this variant pair were classified as true variants. In the middle panel shown in FIG. 7, each variant in the evaluated variant pair occurred in three of ten cells. However, these two variants did not co-occur in any of the ten cells, and thus the Binomial test indicated there was no statistical significance for the co-occurrence test. The two variants are not likely true variants. In the right panel in FIG. 7, one variant occurred in three cells out of the ten tested cells, while the other variant in the variant pair occurred in all ten cells. While the variant pair co-occurred in three cells out of the ten tested cells, the variant pair still did not show statistical significance based on the Binomial test, which means that the variant pairs are not likely true variants. As described earlier, the Binomial test can filter out cases where one variant is rare and the other is not (e.g., if one variant is in 1% of cells and the other is in 80%, then co-occurrence most likely happens by chance).
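The three cases of FIG. 7 can be reproduced with a short, non-limiting sketch of a one-sided Binomial test. The null hypothesis used here, namely that the two variants occur independently so that the per-cell probability of co-occurrence is the product of their individual cell frequencies, is an assumption of this sketch; the passage above names the test but does not spell out its parameters. The function name is hypothetical.

```python
from math import comb


def cooccurrence_p_value(n_cells, n_a, n_b, n_both):
    """One-sided Binomial p-value: P(X >= n_both) under independent occurrence."""
    # Assumed null: each cell carries both variants with probability f_a * f_b.
    p_null = (n_a / n_cells) * (n_b / n_cells)
    tail = sum(
        comb(n_cells, k) * p_null**k * (1 - p_null) ** (n_cells - k)
        for k in range(n_both, n_cells + 1)
    )
    return min(1.0, tail)  # guard against floating-point overshoot


# Ten cells per panel, as in FIG. 7.
p_left = cooccurrence_p_value(10, 3, 3, 3)     # both variants in the same 3 cells
p_middle = cooccurrence_p_value(10, 3, 3, 0)   # 3 cells each, never together
p_right = cooccurrence_p_value(10, 3, 10, 3)   # one rare variant, one in all cells
print(p_left, p_middle, p_right)
```

With only ten cells the absolute p-values carry little weight; what the sketch illustrates is the ordering, with the left panel yielding the smallest p-value, the right panel a much larger one because a ubiquitous variant co-occurs with anything by chance, and the middle panel the largest.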
Example 2: Example Method of Identifying True Variants Based on Background Error Rate
[00192] Example method 2 relates to a method of identifying true variants based on the background error rate. The example method started with the identification of the probability distribution that best fits the background error distribution. Many different methods were attempted for modeling error distributions, such as the Binomial distribution, Normal distribution, percentile-based approach, and others. The Beta-Binomial approach was found to perform the best in modeling the per-variant background error distribution and minimizing potential false positives. The control samples included at least 500 cells. Too few cells in a sample (e.g., fewer than 500 cells) led to artificially inflated background error rates for the variants. In addition, it was found that the estimated number of control samples achieved a robust per-variant error estimation. Further, additional ad hoc filters were implemented after de novo variant identification based on the background error rate. These ad hoc filters required that a variant be present in at least 5 cells and that the average variant allele frequency be at least 35.
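A minimal, non-limiting sketch of the background error rate approach is given below. It takes the two per-control-sample vectors (cells containing the background variant and total cells), fits the Beta parameters, and computes the upper-tail Beta-Binomial probability of the observed cell count in a test sample. The method-of-moments fit on per-sample fractions is a simplification chosen for this sketch and is not stated in the disclosure to be the fitting procedure used; the function names and all counts are hypothetical. The ad hoc filter defaults (at least 5 cells, average variant allele frequency of at least 35) are the values given above.

```python
import math
import statistics


def fit_beta_moments(variant_cells, total_cells):
    """Method-of-moments Beta fit to the per-control-sample variant fractions."""
    fractions = [k / n for k, n in zip(variant_cells, total_cells)]
    mean = statistics.fmean(fractions)
    var = statistics.variance(fractions)
    if not 0 < mean < 1 or var <= 0:
        raise ValueError("controls carry no usable spread for this variant")
    common = mean * (1 - mean) / var - 1
    if common <= 0:
        raise ValueError("variance too large for a Beta fit")
    return mean * common, (1 - mean) * common


def log_beta(x, y):
    """Natural log of the Beta function."""
    return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)


def betabinom_upper_tail(k, n, alpha, beta):
    """P(X >= k) for X ~ Beta-Binomial(n, alpha, beta)."""
    def log_pmf(i):
        log_choose = math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
        return log_choose + log_beta(i + alpha, n - i + beta) - log_beta(alpha, beta)
    return min(1.0, sum(math.exp(log_pmf(i)) for i in range(k, n + 1)))


def passes_ad_hoc_filters(n_variant_cells, mean_vaf, min_cells=5, min_vaf=35):
    """Post-hoc filters applied after the background error rate test."""
    return n_variant_cells >= min_cells and mean_vaf >= min_vaf


# Hypothetical control samples for one background variant (illustration only).
control_variant_cells = [1, 2, 0, 3, 1, 2]
control_total_cells = [1000, 1200, 800, 1500, 900, 1100]
alpha, beta = fit_beta_moments(control_variant_cells, control_total_cells)

# Hypothetical test sample of 2000 cells.
p_background = betabinom_upper_tail(2, 2000, alpha, beta)   # 2 cells: noise-like
p_candidate = betabinom_upper_tail(40, 2000, alpha, beta)   # 40 cells: far above noise
print(p_background, p_candidate, passes_ad_hoc_filters(40, 42.0))
```

A maximum-likelihood fit, or a library implementation of the Beta-Binomial distribution, could be substituted for the hand-written functions; the standard library is used here only to keep the sketch self-contained.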
[00193] FIG. 8 depicts example limit of detection profiles with or without using the background error rate. The limit of detection (LOD) is the minimum percentage of cells in which a variant must be present. The existing thresholds have a LOD of 1%. In the figure, the straight dotted line indicates the LOD is 1% when the background error rate-based variant evaluation is not applied (e.g., in an existing method, which is also referred to as the “control method”). The black line for the method disclosed herein indicates the LOD greatly decreased when the background error rate-based true variant identification was applied. As can be seen, the LOD was greatly reduced for most variants, with over 60,000 tested variants having a LOD of 0.2% or less. That is, a variant that has a mutation rate of 0.2% or above can be effectively identified (e.g., correctly classified as a true variant) after the background error rate was estimated for the variant. The higher detection limit on the right part of the black line corresponds to error-prone variants that generally show more mutations within cells.
[00194] FIG. 9A depicts the sensitivity of the aforementioned two methods, i.e., background error rate-based method (or simply “background error rate method”) or co-occurrence-based method (or simply “co-occurrence method”). Three sets of data were tested using one method (e.g., background error rate method) or using both methods. As shown in FIG. 9A, when compared to the control method (e.g., mutated in >1% of cells), the median sensitivity of the background error rate method alone and the median sensitivity of using two methods together increased from about 63% to about 80% when there is a large number of samples (e.g., 128 samples). The other two data sets, which have a limited number of samples (e.g., n=12 or n=6), also showed improved sensitivity (e.g., around 42%-47%) when compared to the control method (e.g., around 26%). Data set 2 included rare variants in AML minimal residual disease samples. FIG. 9A shows that the control method could not detect these rare variants, and had a sensitivity close to 0. However, the background error rate method + co-occurrence method showed a median sensitivity of 40%, a clear performance improvement when compared to the control method. Background error rate method alone also showed some sensitivity. All of these indicate that the above-described two methods (used independently or jointly) greatly improve the sensitivity in the variant call (e.g., detecting rare variants).
[00195] FIG. 9B further depicts the specificity of the aforementioned two methods. The same three sets of data were tested for specificity. The results shown in FIG. 9B show that the specificity for one method (e.g., background error rate method) or for two methods used together (e.g., background error rate method + co-occurrence method) were comparable to the control sample. Namely, each of the methodologies achieved greater than 99.5% specificity.
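For reference, sensitivity and specificity as used above can be computed from a set of called variants and a set of known true variants. The following non-limiting sketch uses the standard definitions (true positives over all true variants, and true negatives over all non-true variants); the function name and the ten-variant data are hypothetical and do not correspond to the data sets of FIGS. 9A and 9B.

```python
def sensitivity_specificity(called, truth, evaluated):
    """Return (sensitivity, specificity) for a set of variant calls.

    called:    variants classified as true variants by the method
    truth:     variants known to be true
    evaluated: all variants that were evaluated
    """
    tp = len(called & truth)              # true variants that were called
    fn = len(truth - called)              # true variants that were missed
    fp = len(called - truth)              # false variants that were called
    tn = len(evaluated - called - truth)  # false variants correctly rejected
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one true and one false variant")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical evaluation over ten candidate variants (illustration only).
evaluated = {f"v{i}" for i in range(1, 11)}
truth = {"v1", "v2", "v3", "v4"}
called = {"v1", "v2", "v3", "v9"}
sens, spec = sensitivity_specificity(called, truth, evaluated)
print(sens, spec)
```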
Example 3: Example Method of Identifying True Variants Based on Correlation of DNA and Proteins
[00196] Example method 3 relates to a method of identifying true variants based on the correlation between the DNA and protein expressions. The specific processes of determining DNA and protein expression correlation are described earlier. Some example outcomes are further described below.
[00197] FIG. 10A depicts one example of correlating DNA and protein information for identifying the presence of true variants. FIG. 10B depicts another example of correlating DNA and protein information for identifying the presence of true variants. On each correlation matrix shown in FIG. 10A and FIG. 10B, the correlation between each candidate somatic variant (y-axis) and each protein expression (x-axis, left), and each other DNA variant (x-axis, right) was represented with small square blocks with darkness/brightness indicating the correlation value. The darker or brighter a block (compared with the gray shown by most other blocks), the greater the magnitude of the correlation. In the figure, the dark blocks indicate anti-correlation while the bright blocks indicate positive correlation. From FIGS. 10A and 10B, it can be seen that two variants in each figure (indicated by lines) showed strong correlations with different proteins, which indicates that these two variants are likely true variants. Traditional false positives, due to systematic errors, showed up as strongly correlated DNA-DNA points (indicated by arrows) but are excluded due to their lack of correlation with proteins.
[00198] FIG. 10C depicts an additional example of correlating DNA and protein information for identifying the presence of true variants. Here, the correlation matrix was generated by performing single-cell DNA sequencing and single-cell proteomics on a pooled population of cells. Analysis of a pooled population of cells is valuable because rare somatic variants likely only occur in 1 sample and not across all samples. Therefore, candidate variants that appear across multiple samples are likely falsely called variants. In this sense, the other samples, which do not include a rare somatic variant, are present in the analysis as controls. As shown in FIG. 10C, the presence of multiplexed samples leads to true variants with a strong correlation with both protein (left) and supposed germline variants (right). False positives did not correlate with many variants or any proteins.
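The multisample rationale above (a rare somatic variant is expected in only one of the pooled samples, whereas a falsely called variant tends to appear across samples) can be illustrated with the following non-limiting sketch. It assumes that each cell has already been assigned to a sample, for example by its germline variants; the function name, the sample labels, and the data are hypothetical, and the disclosure does not prescribe this particular statistic.

```python
from collections import Counter


def dominant_sample_fraction(sample_labels, has_variant):
    """Return (sample, fraction): the sample holding most variant-carrying cells
    and the fraction of all variant-carrying cells that belong to it."""
    carriers = Counter(
        label for label, present in zip(sample_labels, has_variant) if present
    )
    total = sum(carriers.values())
    if total == 0:
        return None, 0.0
    sample, n = carriers.most_common(1)[0]
    return sample, n / total


# Hypothetical pooled run: three samples of four cells each (illustration only).
sample_labels = ["S1"] * 4 + ["S2"] * 4 + ["S3"] * 4
somatic_like = [False] * 4 + [True, True, True, False] + [False] * 4
artifact_like = [True, False, False, False,
                 False, True, False, False,
                 False, True, False, False]

print(dominant_sample_fraction(sample_labels, somatic_like))
print(dominant_sample_fraction(sample_labels, artifact_like))
```

A candidate confined to a single sample yields a fraction of 1.0, while a candidate spread evenly over the pooled samples yields a fraction near the reciprocal of the number of samples; any cutoff applied to this fraction would be a further assumption.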
[00199] FIG. 10D depicts an additional example of correlating DNA and protein information for identifying the presence of true variants at variant allele frequencies (VAF) below 1%. This is evidence that the methodology disclosed herein is highly sensitive. Specifically, using correlation and a fixed threshold, the methodology identifies variants below 1% with no variant filtering. For example, investigating the row-wise standard deviation of the correlation values of the correlation matrix, the true variants can be distinguished from the background. Note that much of the signal comes in DNA, where multiplexing gives a large germline variant correlation. Specifically, using a threshold standard deviation value of 0.07, three of the candidate variants (with standard deviation values greater than 0.07) were identified as true variants whereas the other three candidate variants (with standard deviation values less than 0.07) were identified as falsely called variants.
[00200] Altogether, these results demonstrate that the application of different multianalyte and multisample correlation-based methods achieves a significant improvement in variant calling.
[00201] While the invention has been particularly shown and described with reference to a preferred embodiment and various alternate embodiments, it will be understood by persons skilled in the relevant art that various changes in form and details can be made therein without departing from the spirit and scope of the invention.
[00202] All references, issued patents, and patent applications cited within the body of the instant specification are hereby incorporated by reference in their entirety, for all purposes.

Claims

CLAIMS What is claimed is:
1. A method for identifying a subpopulation of cells from a heterogeneous population of cells, the method comprising: obtaining a set of candidate variants determined through a single-cell analysis workflow; for a variant pair included in the set of candidate variants, determining a quantity of co-occurrence cells where both variants in the variant pair co-occur in each of the co-occurrence cells; determining a set of variant pairs based on quantities of co-occurrence cells determined for a plurality of variant pairs included in the set of candidate variants; and identifying a subset of candidate variants as true variants based on the determined set of variant pairs.
2. The method of claim 1, wherein, before determining the quantity of co-occurrence cells, the method further comprises applying a first set of variant filters to identify a set of variants from the set of candidate variants.
3. The method of claim 2, wherein both variants included in a variant pair, for determining the quantity of co-occurrence cells, are from the set of variants identified after applying the first set of variant filters.
4. The method of claim 2 or 3, wherein applying the first set of variant filters comprises applying a depth of coverage threshold regarding a depth of coverage of a variant in a cell.
5. The method of claim 2, wherein the depth of coverage threshold is at least 6, 8, 10, 12, 14, 16, 18, or 20 reads for a cell.
6. The method of any one of claims 2-5, wherein applying the first set of variant filters comprises applying a genotype quality threshold regarding a genotype quality of a variant.
7. The method of claim 6, wherein the genotype quality threshold is at least 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, or 80.
8. The method of any one of claims 2-7, wherein applying the first set of variant filters comprises applying a cell number threshold regarding a number of cells where a variant is present.
9. The method of claim 8, wherein the cell number threshold is at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20, 25, 30, 35, 40, 45, or 50 cells.
10. The method of any one of claims 2-9, wherein applying the first set of variant filters comprises applying a variant allele frequency threshold regarding variant allele frequency of a variant.
11. The method of claim 10, wherein the variant allele frequency threshold is at least 30, 35, 40, 45, 50, 55, or 60.
12. The method of any one of claims 1-11, wherein determining the quantity of co-occurrence cells comprises determining at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20, 25, 30, 35, 40, 45, or 50 cells for a variant pair.
13. The method of any one of claims 1-12, wherein determining the quantity of co-occurrence cells comprises determining co-occurrence cells based on a statistical significance of co-occurrence of both variants of the variant pair in a same cell.
14. The method of claim 13, wherein determining co-occurrence cells based on the statistical significance comprises determining co-occurrence cells based on a one-sided Binomial test.
15. The method of claim 14, wherein determining co-occurrence cells based on the one-sided Binomial test comprises determining co-occurrence cells having a statistical significance of p-value of less than 0.03, 0.02, 0.01, 0.009, 0.008, 0.007, 0.006, 0.005, 0.004, 0.003, 0.002, or 0.001 based on the one-sided Binomial test.
16. The method of any one of claims 1-15, wherein, after determining the set of variant pairs, the method further comprises applying a second set of variant filters to identify a subset of variant pairs.
17. The method of claim 16, wherein identifying the subset of candidate variants as true variants comprises identifying the subset of candidate variants based on the identified subset of variant pairs.
18. The method of claim 16 or 17, wherein applying the second set of variant filters comprises applying an average variant allele frequency threshold regarding an average variant allele frequency for two variants of a variant pair.
19. The method of claim 18, wherein the average variant allele frequency threshold is at least 25, 30, 35, 40, 45, 50, 55, or 60.
20. The method of any one of claims 16-19, wherein applying the second set of variant filters comprises applying a genomic distance threshold regarding a genomic distance between two variants of a variant pair.
21. The method of claim 20, wherein the genomic distance threshold is at least 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, or 200 base pairs apart from each other.
22. The method of any one of claims 1-21, further comprising identifying a subpopulation of cells from the heterogeneous population of cells, the subpopulation of cells comprising one or more of the true variants.
23. The method of claim 22, wherein the subpopulation of cells represents less than 1% of cells in the heterogeneous population of cells.
24. The method of any one of claims 1-23, wherein a sensitivity for identifying a variant as a true variant is at least 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, or 0.3.
25. The method of any one of claims 1-24, wherein a specificity for identifying a variant as a false positive variant is at most 0.998, 0.997, 0.996, 0.995, 0.994, 0.993, 0.992, or 0.991.
26. The method of any one of claims 1-25, wherein the heterogeneous population of cells are from measurable residual disease (MRD) samples.
27. The method of any one of claims 1-26, further comprising: for each variant included in the subset of candidate variants, generating a statistical significance by using a Beta-Binomial distribution with parameters generated from a set of control samples; determining a sub-subset of candidate variants based on the generated statistical significance for each variant included in the subset of candidate variants; for each variant of the sub-subset of candidate variants, applying a third set of variant filters; and identifying a sub-sub-subset of candidate variants as true variants based on the application of the third set of variant filters.
28. A method for identifying a subpopulation of cells from a heterogeneous population of cells, the method comprising: obtaining a set of candidate variants determined through a single-cell analysis workflow; for a variant included in the set of candidate variants, generating a statistical significance by using a Beta-Binomial distribution with parameters generated from a set of control samples; determining a set of variants based on statistical significances generated for a plurality of variants included in the set of candidate variants; and identifying a subset of candidate variants as true variants based on generated statistical significances for the plurality of variants.
29. The method of claim 28, wherein the parameters for the Beta-Binomial distribution are generated by: acquiring a plurality of control samples; removing germline variants from each control sample; for a control sample, applying a first set of variant filters to identify a set of background variants; for a background variant, determining a quantity of cells containing the background variant in each sample; and generating parameters for the Beta-Binomial distribution for the background variant based on the determined quantity of cells containing the background variant in each sample.
30. The method of claim 28 or 29, wherein, before generating a statistical significance by using a Beta-Binomial distribution, the method further comprises applying a second set of variant filters to identify a set of variants from the set of candidate variants.
31. The method of claim 30, wherein the variant for generating a statistical significance is a variant included in the identified set of variants.
32. The method of any of claims 28-31, wherein generating a statistical significance by using a Beta-Binomial distribution comprises generating the statistical significance using the parameters of the Beta-Binomial distribution when a variant included in the set of candidate variants is a background variant.
33. The method of any of claims 28-31, wherein generating a statistical significance by using a Beta-Binomial distribution comprises generating the statistical significance using averaging parameters generated from a plurality of background variants when a variant included in the set of candidate variants is not a background variant.
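As a non-authoritative illustration of claims 28, 32, and 33, the following minimal sketch computes an upper-tail p-value under a Beta-Binomial distribution and falls back to averaged parameters when a candidate variant is not a background variant. The function names, the choice of an upper-tail test, and the `background_params` mapping are assumptions made for illustration; the claims do not recite them.

```python
from math import exp, lgamma


def log_beta(a: float, b: float) -> float:
    """Natural log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def betabinom_pmf(k: int, n: int, alpha: float, beta: float) -> float:
    """Probability of k variant cells out of n under Beta-Binomial(n, alpha, beta)."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta))


def variant_p_value(k: int, n: int, alpha: float, beta: float) -> float:
    """Assumed test: probability of seeing k or more variant cells by background noise."""
    tail = sum(betabinom_pmf(i, n, alpha, beta) for i in range(k, n + 1))
    return min(1.0, tail)


def parameters_for(variant: str, background_params: dict) -> tuple:
    """Use the variant's own (alpha, beta) if it is a background variant,
    otherwise the average over all background variants (hypothetical helper)."""
    if variant in background_params:
        return background_params[variant]
    alphas = [a for a, _ in background_params.values()]
    betas = [b for _, b in background_params.values()]
    return sum(alphas) / len(alphas), sum(betas) / len(betas)
```

A candidate variant observed in `k` of `n` cells would then be scored with `variant_p_value(k, n, *parameters_for(variant, background_params))`.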
34. The method of any of claims 29-33, wherein the parameters for the Beta-Binomial distribution for the background variant are generated by using two vectors determined based on the determined quantity of cells containing the background variant in each sample.
35. The method of claim 34, wherein a first vector of the two vectors comprises a quantity of cells that have the background variant in each control sample.
36. The method of claim 34 or 35, wherein a second vector of the two vectors comprises a quantity of total cells in each control sample.
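The two vectors of claims 34-36 can be turned into Beta-Binomial parameters in several ways; the claims do not specify the estimator. The sketch below assumes a simple method-of-moments fit on the per-sample cell fractions, which ignores binomial sampling noise and is offered only as one possible approach. The function name `fit_beta_binomial` is hypothetical.

```python
from statistics import mean, pvariance


def fit_beta_binomial(variant_cells: list, total_cells: list) -> tuple:
    """Estimate (alpha, beta) for one background variant.

    variant_cells[i]: cells carrying the variant in control sample i (first vector).
    total_cells[i]:   total cells in control sample i (second vector).
    """
    if len(variant_cells) != len(total_cells) or len(variant_cells) < 2:
        raise ValueError("need two equal-length vectors with at least two samples")
    fractions = [k / n for k, n in zip(variant_cells, total_cells)]
    m = mean(fractions)
    v = pvariance(fractions)
    # The moment estimate only exists when 0 < v < m * (1 - m).
    if not 0 < m < 1 or v <= 0 or v >= m * (1 - m):
        raise ValueError("moment estimate undefined for these counts")
    common = m * (1 - m) / v - 1  # equals alpha + beta
    return m * common, (1 - m) * common
```

A maximum-likelihood fit would weight samples by their cell counts and may be preferable when control samples differ widely in size.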
37. The method of any one of claims 29-36, wherein applying the first set of variant filters comprises applying a depth of coverage threshold regarding a depth of coverage associated with a variant.
38. The method of claim 37, wherein the depth of coverage threshold is at least 6, 8, 10, 12, 14, 16, 18, or 20 reads for a variant.
39. The method of any one of claims 29-38, wherein applying the first set of variant filters comprises applying a genotype quality threshold regarding a genotype quality associated with a variant.
40. The method of claim 39, wherein the genotype quality threshold is at least 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, or 80.
41. The method of any one of claims 37-40, wherein the second set of variant filters comprises the depth of coverage threshold.
42. The method of any one of claims 39-41, wherein the second set of variant filters comprises the genotype quality threshold.
43. The method of any one of claims 30-42, wherein the second set of variant filters comprises a genotyped cell percentage threshold regarding a percentage of cells that are genotyped for a genomic position of a variant.
44. The method of claim 43, wherein the genotyped cell percentage threshold is at least 10%, 20%, 30%, 40%, 50%, or 60%.
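One way the depth of coverage, genotype quality, and genotyped cell percentage thresholds of claims 37-44 might be combined is sketched below. It assumes that a cell counts as genotyped at a position when its call meets both the depth and genotype quality thresholds; that interpretation, the `CellCall` type, and the default values (each taken from the ranges recited in claims 38, 40, and 44) are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class CellCall:
    """Per-cell call at one genomic position (hypothetical container)."""
    depth: int             # reads covering the position in this cell
    genotype_quality: int  # genotype quality reported for this cell


def genotyped_percentage(calls: list, min_depth: int = 10, min_gq: int = 30) -> float:
    """Percentage of cells whose call passes both per-cell thresholds."""
    if not calls:
        return 0.0
    passing = sum(
        1 for c in calls
        if c.depth >= min_depth and c.genotype_quality >= min_gq
    )
    return 100.0 * passing / len(calls)


def passes_second_filters(calls: list, min_depth: int = 10, min_gq: int = 30,
                          min_genotyped_pct: float = 50.0) -> bool:
    """Keep a variant only if enough cells are confidently genotyped at its position."""
    return genotyped_percentage(calls, min_depth, min_gq) >= min_genotyped_pct
```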
45. The method of any one of claims 30-44, wherein identifying a subset of candidate variants as true variants based on generated statistical significances comprises identifying one or more variants that have a p-value smaller than a p-value threshold.
46. The method of claim 45, wherein the p-value threshold is at most 0.0005, 0.0004, 0.0003, 0.0002, 0.0001, 0.00009, 0.00008, 0.00007, 0.00006, or 0.00005.
47. The method of any one of claims 29-46, wherein, after determining the set of variants based on the generated statistical significances, the method further comprises applying a third set of variant filters.
48. The method of claim 47, wherein applying the third set of variant filters comprises applying a cell quantity threshold regarding a quantity of cells in which a variant is present.
49. The method of claim 48, wherein the cell quantity threshold is at least 3, 4, 5, 6, 7, 8, 9, or 10 cells.
50. The method of any one of claims 47-49, wherein applying the third set of variant filters comprises applying an average variant allele frequency threshold regarding a variant allele frequency average for a variant.
51. The method of claim 50, wherein the average variant allele frequency threshold is at least 20, 25, 30, 35, 40, 45, 50, 55, or 60.
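The third set of variant filters in claims 47-51 can be sketched as a check on the number of cells carrying a variant and on the mean variant allele frequency across those cells. The sketch assumes allele frequencies are expressed in percent (the claims do not state a unit), and the function name and defaults (3 cells, 20) are illustrative picks from the recited ranges.

```python
from statistics import mean


def passes_third_filters(cell_vafs: list, min_cells: int = 3,
                         min_mean_vaf: float = 20.0) -> bool:
    """cell_vafs holds the variant allele frequency (assumed percent) in each
    cell where the variant is present."""
    if len(cell_vafs) < min_cells:
        return False  # variant seen in too few cells
    return mean(cell_vafs) >= min_mean_vaf
```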
52. The method of any one of claims 29-51, wherein variants remaining in each control sample after applying the first set of variant filters are false positive variants.
53. The method of any one of claims 29-52, wherein there are at least 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, or 200 control samples.
54. The method of any one of claims 29-53, wherein the control samples are non-cancerous samples.
55. The method of any one of claims 29-54, wherein the control samples are bone marrow samples from healthy subjects.
56. The method of any one of claims 29-55, further comprising identifying a subpopulation of cells from the heterogeneous population of cells, the subpopulation of cells comprising one or more of the true variants.
57. The method of any one of claims 29-56, wherein the subpopulation of cells represents less than 1% of cells in the heterogeneous population of cells.
58. The method of any one of claims 29-57, wherein a sensitivity for identifying a variant as a true variant is at least 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, or 0.3.
59. The method of any one of claims 29-58, wherein a specificity for identifying a variant as a false positive variant is at most 0.998, 0.997, 0.996, 0.995, 0.994, 0.993, 0.992, or 0.991.
60. The method of any one of claims 29-59, wherein the heterogeneous population of cells are from MRD samples.
61. The method of any one of claims 29-60, further comprising: for a variant pair included in the subset of candidate variants, determining a quantity of co-occurrence cells where both variants in the variant pair co-occur in each co-occurrence cell; determining a set of variant pairs based on quantities of co-occurrence cells determined for a plurality of variant pairs included in the subset of candidate variants; applying a fourth set of variant filters to identify a subset of variant pairs; and identifying a sub-subset of candidate variants as true variants based on the determined subset of variant pairs.
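The co-occurrence step of claim 61 can be illustrated with a short sketch that counts, for every variant pair, the cells in which both variants are present, then keeps variants belonging to at least one sufficiently supported pair. The representation of each cell as a set of variant identifiers, the function names, and the `min_cells` value are assumptions; the claim does not recite a specific pair threshold.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(cell_variants: list) -> Counter:
    """Count cells in which both variants of each pair are present.

    cell_variants: one collection of variant identifiers per cell.
    """
    counts = Counter()
    for variants in cell_variants:
        # Sorting gives each pair a single canonical key.
        for pair in combinations(sorted(set(variants)), 2):
            counts[pair] += 1
    return counts


def variants_in_supported_pairs(counts: Counter, min_cells: int = 3) -> set:
    """Variants that co-occur with another variant in at least min_cells cells."""
    kept = set()
    for (first, second), n_cells in counts.items():
        if n_cells >= min_cells:
            kept.update((first, second))
    return kept
```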
62. A method for identifying a subpopulation of cells from a heterogeneous population of cells, the method comprising: obtaining a set of candidate variants; determining allele frequencies of the candidate variants for each of one or more cells, wherein the allele frequencies are determined through single-cell DNA sequencing of the heterogeneous population of cells; correlating determined allele frequencies of the candidate variants to protein expression of a plurality of proteins; and selecting a subset of candidate variants as true variants, wherein the subset of candidate variants are selected based on correlation of the allele frequencies of the subset of candidate variants and protein expression of one or more proteins of the plurality of proteins.
63. The method of claim 62, wherein correlating determined allele frequencies of the candidate variants to protein expression comprises generating a correlation matrix.
64. The method of claim 62 or 63, wherein the threshold number of proteins is at least two, at least three, at least four, at least five, at least six, at least seven, at least eight, at least nine, at least ten, at least eleven, at least twelve, at least thirteen, at least fourteen, at least fifteen, at least sixteen, at least seventeen, at least eighteen, at least nineteen, or at least twenty proteins.
65. The method of any one of claims 62-64, further comprising identifying a subpopulation of cells from the heterogeneous population of cells, the subpopulation of cells comprising one or more of the true variants.
66. The method of claim 65, wherein the subpopulation of cells represents less than 1% of cells in the heterogeneous population of cells.
67. The method of any one of claims 62-66, wherein the heterogeneous population of cells represents a pooled sample comprising a plurality of cell samples.
68. The method of claim 67, wherein the plurality of cell samples are distinguishable by germline variants that correlate with the true variants.
69. The method of any one of claims 62-68, wherein for a candidate variant, the correlation is based on a standard deviation of correlation values for the candidate variant across the plurality of proteins.
70. The method of claim 69, wherein the standard deviation of correlation values further encompasses correlation values for the candidate variant across a plurality of DNA sites.
71. The method of claim 69 or 70, wherein the candidate variant is correlated if the standard deviation of correlation values is greater than a threshold value.
72. The method of claim 69 or 70, wherein the candidate variant is not correlated if the standard deviation of correlation values is less than a threshold value.
73. The method of claim 71 or 72, wherein the threshold value is any of 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, or 0.20.
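Claims 62, 63, and 69-73 describe correlating per-cell allele frequencies with protein expression and using the standard deviation of the correlation values as the selection criterion. The sketch below is one possible reading: it assumes Pearson correlation, a population standard deviation, and dictionary inputs keyed by variant and protein name, none of which are specified in the claims. The default threshold of 0.05 is one of the values recited in claim 73.

```python
from math import sqrt
from statistics import pstdev


def pearson(x: list, y: list) -> float:
    """Pearson correlation of two equal-length per-cell vectors (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)


def correlation_matrix(allele_freqs: dict, protein_expr: dict) -> dict:
    """Rows are candidate variants, columns are proteins; values are correlations."""
    return {
        variant: {protein: pearson(af, expr) for protein, expr in protein_expr.items()}
        for variant, af in allele_freqs.items()
    }


def correlated_variants(matrix: dict, threshold: float = 0.05) -> list:
    """Variants whose correlation values vary across proteins by more than the threshold."""
    return [
        variant for variant, row in matrix.items()
        if pstdev(list(row.values())) > threshold
    ]
```

The intuition behind the standard deviation criterion, under this reading, is that a true variant confined to a cell subpopulation should correlate strongly with some proteins and weakly or negatively with others, while a noise variant should show uniformly weak correlations.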
74. A non-transitory computer readable medium for calling one or more variants of a cell population, the non-transitory computer readable medium comprising instructions that, when executed by a processor, cause the processor to: obtain a set of candidate variants; for a variant pair included in the set of candidate variants, determine a quantity of co-occurrence cells where both variants in the variant pair co-occur in each of the co-occurrence cells; determine a set of variant pairs based on quantities of co-occurrence cells determined for a plurality of variant pairs included in the set of candidate variants; and identify a subset of candidate variants as true variants based on the determined set of variant pairs.
75. A non-transitory computer readable medium for calling one or more variants of a cell population, the non-transitory computer readable medium comprising instructions that, when executed by a processor, cause the processor to: obtain a set of candidate variants; for a variant included in the set of candidate variants, generate a statistical significance by using a Beta-Binomial distribution with parameters generated from a set of control samples; determine a set of variants based on statistical significances generated for a plurality of variants included in the set of candidate variants; and identify a subset of candidate variants as true variants based on generated statistical significances for the plurality of variants.
76. A non-transitory computer readable medium for calling one or more variants of a cell population, the non-transitory computer readable medium comprising instructions that, when executed by a processor, cause the processor to: obtain a set of candidate variants; determine allele frequencies of the candidate variants for each of one or more cells, wherein the allele frequencies are determined through single-cell DNA sequencing of the heterogeneous population of cells; correlate determined allele frequencies of the candidate variants to protein expression of a plurality of proteins; and select a subset of candidate variants as true variants, wherein the subset of candidate variants are selected based on correlation of the allele frequencies of the subset of candidate variants and protein expression of one or more proteins of the plurality of proteins.
77. A system comprising: a single-cell analysis workflow device configured to generate a plurality of sequence reads for cells in a cell population; and a computational device communicatively coupled to the single-cell analysis workflow device, the computational device configured to: obtain a set of candidate variants; for a variant pair included in the set of candidate variants, determine a quantity of co-occurrence cells where both variants in the variant pair co-occur in each of the co-occurrence cells; determine a set of variant pairs based on quantities of co-occurrence cells determined for a plurality of variant pairs included in the set of candidate variants; and identify a subset of candidate variants as true variants based on the determined set of variant pairs.
78. A system comprising: a single-cell analysis workflow device configured to generate a plurality of sequence reads for cells in a cell population; and a computational device communicatively coupled to the single-cell analysis workflow device, the computational device configured to: obtain a set of candidate variants; for a variant included in the set of candidate variants, generate a statistical significance by using a Beta-Binomial distribution with parameters generated from a set of control samples; determine a set of variants based on statistical significances generated for a plurality of variants included in the set of candidate variants; and identify a subset of candidate variants as true variants based on generated statistical significances for the plurality of variants.
79. A system comprising: a single-cell analysis workflow device configured to generate a plurality of sequence reads for cells in a cell population; and a computational device communicatively coupled to the single-cell analysis workflow device, the computational device configured to: obtain a set of candidate variants; determine allele frequencies of the candidate variants for each of one or more cells, wherein the allele frequencies are determined through single-cell DNA sequencing of the heterogeneous population of cells; correlate determined allele frequencies of the candidate variants to protein expression of a plurality of proteins; and select a subset of candidate variants as true variants, wherein the subset of candidate variants are selected based on correlation of the allele frequencies of the subset of candidate variants and protein expression of one or more proteins of the plurality of proteins.
PCT/US2023/065473 2022-04-06 2023-04-06 True variant identification via multianalyte and multisample correlation WO2023196928A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202263327966P 2022-04-06 2022-04-06
US63/327,966 2022-04-06

Publications (2)

Publication Number Publication Date
WO2023196928A2 (en) 2023-10-12
WO2023196928A3 (en) 2023-12-07

Family

ID=88243691

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2023/065473 WO2023196928A2 (en) 2022-04-06 2023-04-06 True variant identification via multianalyte and multisample correlation

Country Status (1)

Country Link
WO (1) WO2023196928A2 (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018160999A1 (en) * 2017-03-03 2018-09-07 Yale University Mapping a functional cancer genome atlas of tumor suppressors using aav-crispr mediated direct in vivo screening
WO2021067721A1 (en) * 2019-10-02 2021-04-08 Mission Bio, Inc. Improved variant caller using single-cell analysis

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117423382A (en) * 2023-10-21 2024-01-19 云准医药科技(广州)有限公司 Single-cell barcode identity recognition method based on SNP polymorphism
CN117423382B (en) * 2023-10-21 2024-05-10 云准医药科技(广州)有限公司 Single-cell barcode identity recognition method based on SNP polymorphism

Also Published As

Publication number Publication date
WO2023196928A3 (en) 2023-12-07

Similar Documents

Publication Publication Date Title
US20240079092A1 (en) Systems and methods for deriving and optimizing classifiers from multiple datasets
CN103201744B (en) For estimating the method that full-length genome copies number variation
Hamid et al. Data integration in genetics and genomics: methods and challenges
CN110870020A (en) Aberrant splicing detection using Convolutional Neural Network (CNNS)
US20240013921A1 (en) Generalized computational framework and system for integrative prediction of biomarkers
KR20210127798A (en) Semi-supervised learning for training an ensemble of deep convolutional neural networks
WO2020077232A1 (en) Methods and systems for nucleic acid variant detection and analysis
US20220130488A1 (en) Methods for detecting copy-number variations in next-generation sequencing
US20190164627A1 (en) Models for Targeted Sequencing
JP2008511058A (en) Data quality and / or partial aneuploid chromosome determination using computer systems
Ruan et al. Differential analysis of biological networks
CA2481485A1 (en) Apparatus and method for analyzing data
KR20220069943A (en) Single-cell RNA-SEQ data processing
Eichner et al. Support vector machines-based identification of alternative splicing in Arabidopsis thaliana from whole-genome tiling arrays
CN114373547A (en) Method and system for predicting disease risk
WO2023196928A2 (en) True variant identification via multianalyte and multisample correlation
JP2022550841A (en) Improved variant calling using single-cell analysis
KR20140090296A (en) Method and apparatus for analyzing genetic information
US20220101135A1 (en) Systems and methods for using a convolutional neural network to detect contamination
US20190108311A1 (en) Site-specific noise model for targeted sequencing
KR20210044400A (en) Method and apparatus for discovering biomarker for predicting cancer prognosis using heterogeneous platform of DNA methylation data
Mayrink et al. Bayesian factor models for the detection of coherent patterns in gene expression data
Otto Distance-based methods for the analysis of Next-Generation sequencing data
Khan et al. AI and Genomes for Decisions Regarding the Expression of Genes
AlRefaai et al. Gene Expression Dataset Classification Using Machine Learning Methods: A Survey

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23785641

Country of ref document: EP

Kind code of ref document: A2