US20220367006A1 - Methods and systems for dynamic variant thresholding in a liquid biopsy assay - Google Patents

Publication number
US20220367006A1
Authority
US
United States
Prior art keywords
variant
sequence
sequencing
locus
cancer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
US17/858,872
Inventor
Robert Tell
Wei Zhu
Justin David Finkle
Christine Lo
Terri M. Driessen
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tempus Labs Inc
Original Assignee
Tempus Labs Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US202062978130P
Application filed by Tempus Labs Inc
Priority to US17/858,872
Publication of US20220367006A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/40 Population genetics; Linkage disequilibrium

Abstract

Methods, systems, and software are provided for validating a somatic sequence variant in a subject having a cancer condition. Sequence reads are obtained from sequencing cell-free DNA fragments in a liquid biopsy sample of the subject. Sequence reads are aligned to a reference sequence. A variant allele fragment count and locus fragment count are identified for a candidate variant that maps to a locus in the reference sequence. The variant allele fragment count is compared against a dynamic variant count threshold for the locus. The threshold is based on a pre-test odds of a positive variant call for the locus, based on the prevalence of variants in a genomic region including the locus in a cohort of subjects having the cancer condition. The somatic sequence variant in the subject is validated, or rejected, when the variant allele fragment count for the candidate variant satisfies, or does not satisfy, the threshold.

Description

    CROSS REFERENCE TO RELATED APPLICATIONS
  • This application is a continuation of U.S. patent application Ser. No. 17/179,086, filed Feb. 18, 2021, which claims priority to U.S. Provisional Patent Application No. 63/041,293, filed Jun. 19, 2020, and U.S. Provisional Patent Application No. 62/978,130, filed Feb. 18, 2020, the contents of which are hereby incorporated by reference, in their entireties, for all purposes.
  • FIELD OF THE INVENTION
  • The present disclosure relates generally to the use of cell-free DNA sequencing data to provide clinical support for personalized treatment of cancer.
  • BACKGROUND
  • Precision oncology is the practice of tailoring cancer therapy to the unique genomic, epigenetic, and/or transcriptomic profile of an individual's cancer. Personalized cancer treatment builds upon conventional therapeutic regimens used to treat cancer based only on the gross classification of the cancer, e.g., treating all breast cancer patients with a first therapy and all lung cancer patients with a second therapy. This field was born out of many observations that different patients diagnosed with the same type of cancer, e.g., breast cancer, responded very differently to common treatment regimens. Over time, researchers have identified genomic, epigenetic, and transcriptomic markers that improve predictions as to how an individual cancer will respond to a particular treatment modality.
  • There is growing evidence that cancer patients who receive therapy guided by their genetics have better outcomes. For example, studies have shown that targeted therapies result in significantly improved progression-free cancer survival. See, e.g., Radovich M. et al., Oncotarget, 7(35):56491-500 (2016). Similarly, reports from the IMPACT trial—a large (n=1307) retrospective analysis of consecutive, prospectively molecularly profiled patients with advanced cancer who participated in a large, personalized medicine trial—indicate that patients receiving targeted therapies matched to their tumor biology had a response rate of 16.2%, as opposed to a response rate of 5.2% for patients receiving non-matched therapy. Tsimberidou A M et al., ASCO 2018, Abstract LBA2553 (2018).
  • In fact, therapy targeted to specific genomic alterations is already the standard of care in several tumor types, e.g., as suggested in the National Comprehensive Cancer Network (NCCN) guidelines for melanoma, colorectal cancer, and non-small cell lung cancer. In practice, implementation of these targeted therapies requires determining the status of the diagnostic marker in each eligible cancer patient. While this can be accomplished for the few, well known mutations associated with treatment recommendations in the NCCN guidelines using individual assays or small next generation sequencing (NGS) panels, the growing number of actionable genomic alterations and increasing complexity of diagnostic classifiers necessitates a more comprehensive evaluation of each patient's cancer genome, epigenome, and/or transcriptome.
  • For instance, some evidence suggests that use of combination therapies where each component is matched to an actionable genomic alteration holds the greatest potential for treating individual cancers. To this point, a retrospective study of cancer patients treated with one or more therapeutic regimens revealed that patients who received therapies matched to a higher percentage of their genomic alterations experienced a greater frequency of stable disease (e.g., a longer time to recurrence), longer time to treatment failure, and greater overall survival. Wheler J J et al., Cancer Res., 76:3690-701 (2016). Thus, comprehensive evaluation of each cancer patient's genome, epigenome, and/or transcriptome should maximize the benefits provided by precision oncology, by facilitating more fine-tuned combination therapies, use of novel off-label drug indications, and/or tissue agnostic immunotherapy. See, for example, Schwaederle M. et al., J Clin Oncol., 33(32):3817-25 (2015); Schwaederle M. et al., JAMA Oncol., 2(11):1452-59 (2016); and Wheler J J et al., Cancer Res., 76(13):3690-701 (2016). Further, the use of comprehensive next generation sequencing analysis of cancer genomes facilitates better access and a larger patient pool for clinical trial enrollment. Coyne G O et al., Curr. Probl. Cancer, 41(3):182-93 (2017); and Markman M., Oncology, 31(3):158, 168.
  • The use of large NGS genomic analysis is growing in order to address the need for more comprehensive characterization of an individual's cancer genome. See, for example, Fernandes G S et al., Clinics, 72(10):588-94. Recent studies indicate that, of the patients for whom large NGS genomic analysis is performed, 30-40% then receive clinical care based on the assay results, a proportion that is limited by at least the identification of actionable genomic alterations, the availability of medication for treatment of identified actionable genomic alterations, and the clinical condition of the subject. See, Ross J S et al., JAMA Oncol., 1(1):40-49 (2015); Ross J S et al., Arch. Pathol. Lab Med., 139:642-49 (2015); Hirshfield K M et al., Oncologist, 21(11):1315-25 (2016); and Groisberg R. et al., Oncotarget, 8:39254-67 (2017).
  • However, these large NGS genomic analyses are conventionally performed on solid tumor samples. For instance, each of the studies referenced in the paragraph above performed NGS analysis of FFPE tumor blocks from patients. Solid tissue biopsies remain the gold standard for diagnosis and identification of predictive biomarkers because they represent well-known and validated methodologies that provide a high degree of accuracy. Nevertheless, there are significant limitations to the use of solid tissue material for large NGS genomic analyses of cancers. For example, tumor biopsies are subject to sampling bias caused by spatial and/or temporal genetic heterogeneity, e.g., between two regions of a single tumor and/or between different cancerous tissues (such as between primary and metastatic tumor sites or between two different primary tumor sites). Such intertumor or intratumor heterogeneity can cause sub-clonal or emerging mutations to be overlooked when using localized tissue biopsies, with the potential for sampling bias to be exacerbated over time as sub-clonal populations further evolve and/or shift in predominance.
  • Additionally, the acquisition of solid tissue biopsies often requires invasive surgical procedures, e.g., when the primary tumor site is located at an internal organ. These procedures can be expensive, time consuming, and carry a significant risk to the patient, e.g., when the patient is in poor health and may not be able to tolerate invasive medical procedures and/or the tumor is located in a particularly sensitive or inoperable location, such as in the brain or heart. Further, the amount of tissue, if any, that can be procured depends on multiple factors, including the location of the tumor, the size of the tumor, the fragility of the patient, and the risk of comorbidities related to biopsies, such as bleeding and infections. For instance, recent studies report that tissue samples in a majority of advanced non-small cell lung cancer patients are limited to small biopsies and cannot be obtained at all in up to 31% of patients. Ilie and Hofman, Transl. Lung Cancer Res., 5(4):420-23 (2016). Even when a tissue biopsy is obtained, the sample may be too scant for comprehensive testing.
  • Further, the method of tissue collection, preservation (e.g., formalin fixation), and/or storage of tissue biopsies can result in sample degradation and variable quality DNA. This, in turn, leads to inaccuracies in downstream assays and analysis, including next-generation sequencing (NGS) for the identification of biomarkers. Ilie and Hofman, Transl. Lung Cancer Res., 5(4):420-23 (2016).
  • In addition, the invasive nature of the biopsy procedure, the time and cost associated with obtaining the sample, and the compromised state of cancer patients receiving therapy render repeat testing of cancerous tissues impracticable, if not impossible. As a result, solid tissue biopsy analysis is not amenable to many monitoring schemes that would benefit cancer patients, such as disease progression analysis, treatment efficacy evaluation, disease recurrence monitoring, and other techniques that require data from several time points.
  • Cell-free DNA (cfDNA) has been identified in various bodily fluids, e.g., blood serum, plasma, urine, etc. Chan et al., Ann. Clin. Biochem., 40 (Pt 2):122-30 (2003). This cfDNA originates from necrotic or apoptotic cells of all types, including germline cells, hematopoietic cells, and diseased (e.g., cancerous) cells. Advantageously, genomic alterations in cancerous tissues can be identified from cfDNA isolated from cancer patients. See, e.g., Stroun et al., Oncology, 46(5):318-22 (1989); Goessl et al., Cancer Res., 60(21):5941-45 (2000); and Frenel et al., Clin. Cancer Res. 21(20):4586-96 (2015). Thus, one approach to overcoming the problems presented by the use of solid tissue biopsies described above is to analyze cell-free nucleic acids (e.g., cfDNA) and/or nucleic acids in circulating tumor cells present in biological fluids, e.g., via a liquid biopsy.
  • Specifically, liquid biopsies offer several advantages over conventional solid tissue biopsy analysis. For instance, because bodily fluids can be collected in a minimally invasive or non-invasive fashion, sample collection is simpler, faster, safer, and less expensive than solid tumor biopsies. Such methods require only small amounts of sample (e.g., 10 mL or less of whole blood per biopsy) and reduce the discomfort and risk of complications experienced by patients during conventional tissue biopsies. In fact, liquid biopsy samples can be collected with limited or no assistance from medical professionals and can be performed at almost any location. Further, liquid biopsy samples can be collected from any patient, regardless of the location of their cancer, their overall health, and any previous biopsy collection. This allows for analysis of the cancer genome of patients from whom a solid tumor sample cannot be easily and/or safely obtained. In addition, because cell-free DNA in the bodily fluids arises from many different types of tissues in the patient, the genomic alterations present in the pool of cell-free DNA are representative of various different clonal sub-populations of the cancerous tissue of the subject, facilitating a more comprehensive analysis of the cancerous genome of the subject than is possible from one or more sections of a single solid tumor sample.
  • Liquid biopsies also enable serial genetic testing prior to cancer detection, during the early stages of cancer progression, throughout the course of treatment, and during remission, e.g., to monitor for disease recurrence. The ability to conduct serial testing via non-invasive liquid biopsies throughout the course of disease could prove beneficial for many patients, e.g., through monitoring patient response to therapies, the emergence of new actionable genomic alterations, and/or drug-resistance alterations. These types of information allow medical professionals to more quickly tailor and update therapeutic regimens, e.g., facilitating more timely intervention in the case of disease progression. See, e.g., Ilie and Hofman, Transl. Lung Cancer Res., 5(4):420-23 (2016).
  • Nevertheless, while liquid biopsies are promising tools for improving outcomes using precision oncology, there are significant challenges specific to the use of cell-free DNA for evaluation of a subject's cancer genome. For instance, there is a highly variable signal-to-noise ratio from one liquid biopsy sample to the next. This occurs because cfDNA originates from a variety of different cells in a subject, both healthy and diseased. Depending on the stage and type of cancer in any particular subject, the fraction of cfDNA fragments originating from cancerous cells (the “tumor fraction” or “ctDNA fraction” of the sample/subject) can range from almost 0% to well over 50%. Other factors, including tumor type and mutation profile, can also impact the amount of DNA released from cancerous tissues. For instance, cfDNA clearance through the liver and kidneys is affected by a variety of factors, including renal dysfunction or other tissue damaging factors (e.g., chemotherapy, surgery, and/or radiotherapy).
  • This, in turn, leads to problems detecting and/or validating cancer-specific genomic alterations in a liquid sample. This is particularly true during early stages of the disease—when cancer therapies have much higher success rates—because the tumor fraction in the patient is lowest at this point. Thus, early stage cancer patients can have ctDNA fractions below the limit of detection (LOD) for one or more informative genomic alterations, limiting clinical utility because of the risk of false negatives and/or providing an incomplete picture of the cancer genome of the patient. Further, because cancers, and even individual tumors, can be clonally diverse, actionable genomic alterations that arise in only a subset of clonal populations are diluted below the overall tumor fraction of the sample, further frustrating attempts to tailor combination therapies to the various actionable mutations in the patient's cancer genome. Consequently, most studies using liquid biopsy samples to date have focused on late stage patients for assay validation and research.
  • Another challenge associated with liquid biopsies is the accurate determination of tumor fraction in a sample. This difficulty arises from at least the heterogeneity of cancers and the increased frequency of large chromosomal duplications and deletions found in cancers. As a result, the frequency of genomic alterations from cancerous tissues varies from locus to locus based on at least (i) their prevalence in different sub-clonal populations of the subject's cancer, and (ii) their location within the genome, relative to large chromosomal copy number variations. The difficulty in accurately determining the tumor fraction of liquid biopsy samples affects accurate measurement of various cancer features shown to have diagnostic value for the analysis of solid tumor biopsies. These include allelic ratios, copy number variations, overall mutational burden, frequency of abnormal methylation patterns, etc., all of which are correlated with the percentage of DNA fragments that arise from cancerous tissue, as opposed to healthy tissue.
  • Altogether, these factors result in highly variable concentrations of ctDNA—from patient to patient and possibly from locus to locus—that confound accurate measurement of disease indicators and actionable genomic alterations. Further, the quantity and quality of cfDNA obtained from liquid biopsy samples are highly dependent on the particular methodology for collecting the samples, storing the samples, sequencing the samples, and standardizing the sequencing data.
  • While validation studies of existing liquid biopsy assays have shown high sensitivity and specificity, few studies have corroborated results with orthogonal methods, or between particular testing platforms, e.g., different NGS technologies and/or targeted panel sequencing versus whole genome/exome sequencing. Reports of liquid biopsy-based studies are limited by comparison to non-comprehensive tissue testing algorithms including Sanger sequencing, small NGS hotspot panels, polymerase chain reaction (PCR), and fluorescent in situ hybridization (FISH), which may not contain all NCCN guideline genes in their reportable range, thus suffering in comparison to a more comprehensive liquid biopsy assay.
  • As an example, conventional liquid biopsy assays do not provide a method for accurately detecting variants (e.g., variant alleles) in ctDNA NGS assays. As described above, many patients may not have abundant ctDNA in early stage disease and may shed variants below the limit of detection (LOD) for ctDNA assays, resulting in false negatives. Detecting these variants at low circulating fractions is also technically challenging due to constraints of sequencing by synthesis. Additionally, differentiating between germline and somatic variants in ctDNA is difficult, as is differentiating between mutations derived from clonal hematopoiesis (CH) and the solid tumor being assayed. In such cases, mutations in hematopoietic lineage cells may be mistaken for tumor-derived mutations. Indeed, researchers have identified several genes frequently mutated in CH with potential importance in cancer, such as JAK2, TP53, GNAS, IDH2, and KRAS. Mayrhofer et al., 2018, “Cell-free DNA profiling of metastatic prostate cancer reveals microsatellite instability, structural rearrangements and clonal hematopoiesis,” Genome Med, (10), pg. 85; Hu et al., 2018, “False-Positive Plasma Genotyping Due to Clonal Hematopoiesis,” Clin Cancer Res, (24), pg. 4437.
  • The information disclosed in this Background section is only for enhancement of understanding of the general background of the invention and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person skilled in the art.
  • SUMMARY
  • Given the above background, there is a need in the art for improved methods and systems for supporting clinical decisions in precision oncology using liquid biopsy assays. In particular, there is a need in the art for improved methods and systems for identifying somatic tumor mutations in cell-free DNA, particularly where the sample has a low tumor fraction. Advantageously, the present disclosure solves this and other needs in the art by providing improved somatic variant identification methodology that better accounts for locus-specific and/or sample-specific considerations to more accurately identify true somatic mutations in a liquid biopsy sample. For example, by using an application of Bayes' theorem to account for one or more of (i) the prevalence of variants at a specific locus in a specific cancer type, (ii) the variant allele fraction for the variant being evaluated, (iii) the prevalence of sequencing errors at a particular locus, and (iv) the actual sequencing error rate of a particular reaction, the variant filter methodologies described herein tune the specificity and sensitivity of variant count thresholds in a locus-specific fashion to achieve higher accuracy of true somatic variant calling in a liquid biopsy assay.
  • For example, in one aspect, the present disclosure provides a method of validating a somatic sequence variant in a test subject having a cancer condition. The method is performed at a computer system having one or more processors, and memory storing one or more programs for execution by the one or more processors.
  • The method includes obtaining, from a first sequencing reaction, a corresponding sequence of each cell-free DNA fragment in a first plurality of cell-free DNA fragments in a liquid biopsy sample of the test subject, thus obtaining a first plurality of sequence reads. Each respective sequence read in the first plurality of sequence reads is aligned to a reference sequence for the species of the subject, thus identifying a variant allele fragment count for a candidate variant that maps to a locus in the reference sequence, and a locus fragment count for the locus encompassing the candidate variant.
  • The method further includes comparing the variant allele fragment count for the candidate variant against a dynamic variant count threshold for the locus in the reference sequence that the candidate variant maps to. The dynamic variant count threshold is based upon a pre-test odds of a positive variant call for the locus based on the prevalence of variants in a genomic region that includes the locus from a first set of nucleic acids obtained from a cohort of subjects having the cancer condition.
  • The method then includes rejecting or validating the variant as a true somatic variant based upon the dynamic variant count threshold. For instance, when the variant allele fragment count for the candidate variant satisfies the dynamic variant count threshold for the locus, the presence of the somatic sequence variant in the test subject is validated. And when the variant allele fragment count for the candidate variant does not satisfy the dynamic variant count threshold for the locus, the presence of the somatic sequence variant in the test subject is rejected.
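  • The counting and decision steps summarized above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the names LocusCounts, count_fragments, and call_variant are hypothetical, the per-fragment alleles are assumed to have already been extracted from the alignment, and "satisfies" is assumed to mean "greater than or equal to".

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class LocusCounts:
    """Fragment counts for one candidate variant (hypothetical container)."""
    variant_allele_fragments: int  # fragments supporting the candidate variant
    locus_fragments: int           # all fragments covering the locus


def count_fragments(alleles_at_locus: Iterable[str], variant_allele: str) -> LocusCounts:
    """Tally fragment support from the allele each aligned fragment shows at the locus."""
    alleles = list(alleles_at_locus)
    variant = sum(1 for allele in alleles if allele == variant_allele)
    return LocusCounts(variant_allele_fragments=variant, locus_fragments=len(alleles))


def call_variant(counts: LocusCounts, dynamic_threshold: int) -> str:
    """Validate the candidate when its count meets the locus-specific threshold.

    Assumption: the threshold is satisfied when the count is >= the threshold.
    """
    if counts.variant_allele_fragments >= dynamic_threshold:
        return "validated"
    return "rejected"


# Toy locus: 995 reference-allele fragments and 5 variant-allele fragments.
counts = count_fragments(["A"] * 995 + ["T"] * 5, variant_allele="T")
print(counts, call_variant(counts, dynamic_threshold=4))
```

  • In this sketch the threshold is passed in as a plain integer; how that value is derived for a given locus is left outside the example.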
  • Additional aspects and advantages of the present disclosure will become readily apparent to those skilled in this art from the following detailed description, wherein only illustrative embodiments of the present disclosure are shown and described. As will be realized, the present disclosure is capable of other and different embodiments, and its several details are capable of modifications in various obvious respects, all without departing from the disclosure. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIGS. 1A, 1B, 1C, 1D, 1E, and 1F collectively illustrate a block diagram of an example computing device for validating a somatic sequence variant in a test subject having a cancer condition, in accordance with some embodiments of the present disclosure.
  • FIG. 2A illustrates an example workflow for generating a clinical report based on information generated from analysis of one or more patient specimens, in accordance with some embodiments of the present disclosure.
  • FIG. 2B illustrates an example of a distributed diagnostic environment for collecting and evaluating patient data for the purpose of precision oncology, in accordance with some embodiments of the present disclosure.
  • FIG. 3 provides an example flow chart of processes and features for liquid biopsy sample collection and analysis for use in precision oncology, in accordance with some embodiments of the present disclosure.
  • FIGS. 4A, 4B, 4C, 4D, 4E, and 4F collectively illustrate an example bioinformatics pipeline for precision oncology, in accordance with various embodiments of the present disclosure. FIG. 4A provides an overview flow chart of processes and features in a bioinformatics pipeline, in accordance with some embodiments of the present disclosure. FIG. 4B provides an overview of a bioinformatics pipeline executed with either a liquid biopsy sample alone or a liquid biopsy sample and a matched normal sample. FIG. 4C illustrates that paired end reads from tumor and normal isolates are zipped and stored separately under the same order identifier, in accordance with some embodiments of the present disclosure. FIG. 4D illustrates quality correction for FASTQ files, in accordance with some embodiments of the present disclosure. FIG. 4E illustrates processes for obtaining tumor and normal BAM alignment files, in accordance with some embodiments of the present disclosure. FIG. 4F provides a flow chart of a method for validating a somatic sequence variant in a test subject having a cancer condition, in which dashed boxes represent optional portions of the method, in accordance with some embodiments of the present disclosure.
  • FIGS. 4G1, 4G2, and 4G3 collectively illustrate an example bioinformatics pipeline for precision oncology, in accordance with various embodiments of the present disclosure. Specifically, these figures provide a flow chart of a method for validating a somatic sequence variant in a test subject having a cancer condition, in which dashed boxes represent optional portions of the method, in accordance with some embodiments of the present disclosure.
  • FIGS. 5A and 5B collectively provide a flow chart of processes and features for validating a somatic sequence variant in a test subject having a cancer condition, in which dashed boxes represent optional portions of the method, in accordance with some embodiments of the present disclosure.
  • FIG. 6 illustrates a flow chart of a method for obtaining a distribution of variant detection sensitivities as a function of circulating variant allele fraction from a cohort of subjects, in accordance with some embodiments of the present disclosure.
  • FIGS. 7A and 7B collectively illustrate a method of inferring an effect of a sequence variant as a gain-of-function or a loss-of-function of a gene, in accordance with some embodiments of the present disclosure.
  • FIGS. 8A, 8B, 8C, and 8D collectively illustrate results of an inter-assay comparison between a liquid biopsy assay, a digital droplet polymerase chain reaction (ddPCR), and a solid-tumor biopsy assay, in accordance with various embodiments of the present disclosure.
  • FIGS. 9A, 9B, 9C, 9D, 9E, 9F, 9G, and 9H collectively illustrate results of a comparison between circulating tumor fraction estimate (ctFE) and variant allele fraction (VAF) using an Off-Target Tumor Estimation Routine (OTTER) method, in accordance with various embodiments of the present disclosure.
  • FIGS. 10A and 10B collectively illustrate results of evaluating ctFE and mutational landscape according to cancer type, in accordance with various embodiments of the present disclosure.
  • FIGS. 11A, 11B, and 11C collectively illustrate results of evaluating associations between ctFE and advanced disease states, in accordance with various embodiments of the present disclosure.
  • FIGS. 12A, 12B, and 12C collectively illustrate results of comparing ctFE with recent clinical response outcomes, in accordance with various embodiments of the present disclosure.
  • FIG. 13 illustrates a first table describing sensitivity for all SNVs, indels, CNVs, and rearrangements targeted in reference samples, in accordance with various embodiments of the present disclosure.
  • FIG. 14 illustrates a second table describing sensitivity for all SNVs, indels, CNVs, and rearrangements targeted in reference samples, in accordance with various embodiments of the present disclosure.
  • FIG. 15 illustrates a third table describing comparisons between the presently disclosed liquid biopsy assay and a commercial liquid biopsy kit, in accordance with various embodiments of the present disclosure.
  • FIGS. 16A, 16B, and 16C collectively illustrate a fourth table describing variants detected by a liquid biopsy assay, in accordance with various embodiments of the present disclosure.
  • FIG. 17 illustrates a fifth table describing dynamic filtering methodology to further reduce discordance, in accordance with various embodiments of the present disclosure.
  • FIG. 18 illustrates a sixth table describing cancer groups included in clinical profiling analysis, in accordance with various embodiments of the present disclosure.
  • Like reference numerals refer to corresponding parts throughout the several views of the drawings.
  • DETAILED DESCRIPTION
  • Introduction
  • As described above, conventional liquid biopsy assays do not provide accurate determination of variants (e.g., somatic variants), particularly at low circulating variant fractions. This is due, in large part, to the use of static variant count filters that require a common amount of support to call a variant positively as a somatic variant in sequencing data, regardless of the identity of the variant and its position within the genome. That is, conventional methods require that at least X number of unique sequence reads (e.g., 8 sequence reads) provide support for (e.g., encompass) a particular variant in order for that variant to be confirmed as a true somatic variant. While this may be fine for liquid biopsy samples having a high tumor fraction, where more copies of each somatic variant would be expected to be found, it results in a high number of false negatives when samples with lower tumor fractions are analyzed. On the other hand, simply lowering the threshold to allow calling of variants with less support will increase the number of false positives, that is, the number of positive somatic variant calls that are actually sequencing errors.
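  • The trade-off of a static filter can be illustrated numerically. The sketch below assumes a simple binomial model in which each fragment covering a locus independently shows the variant allele, either at the true variant allele fraction or at a sequencing error rate. The model, the helper prob_at_least, and the depth, allele fraction, and error rate values are illustrative assumptions and are not taken from the disclosure.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    below = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - below


DEPTH = 5000         # fragments covering the locus (hypothetical)
TRUE_VAF = 0.001     # 0.1% variant allele fraction (hypothetical)
ERROR_RATE = 0.0001  # per-fragment error rate at the locus (hypothetical)

for threshold in (8, 3):
    # Chance a real variant reaches the threshold (sensitivity at this locus).
    sensitivity = prob_at_least(threshold, DEPTH, TRUE_VAF)
    # Chance that errors alone reach the threshold (a false positive).
    false_positive = prob_at_least(threshold, DEPTH, ERROR_RATE)
    print(f"threshold={threshold}: sensitivity={sensitivity:.3f}, "
          f"false-positive probability={false_positive:.2e}")
```

  • Under these assumed inputs, lowering the static threshold from 8 to 3 raises the chance of detecting the true variant and also raises the chance that errors alone pass the filter, which is the tension described above.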
  • While there are many methods of performing noise suppression on ultra-high depth sequencing data commonly generated for liquid biopsy assays, there remains the fundamental fidelity boundary of sequencing by synthesis that cannot be overcome. Along with this, there are a variety of complexities and non-linearities within the ability to map reads across complex sets of genomic features and from these data, successfully call a variant. While it is possible to filter very stringently, one of the goals of liquid biopsy assays is to detect alterations at very low circulating fractions. This requires that low levels of support be sufficient to make a positive alteration call given that at 0.1% circulating fraction and an average depth of 5000×, only 5 reads containing alternate alleles will be present. Because of this, it is impossible to have a consistent set of thresholds that will be used to filter variants as any filter will either be too stringent or too permissive depending on the variant context and local sequence specific error generation models.
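  • The arithmetic behind the 0.1% example above can be checked with a short sketch. It assumes an idealized model in which every read at a locus independently carries the alternate allele with probability equal to the circulating fraction (a binomial model); the function names are hypothetical and the model is only an illustration, not the disclosed method.

```python
from math import comb


def expected_alt_reads(depth: int, vaf: float) -> float:
    """Expected number of reads carrying the alternate allele."""
    return depth * vaf


def prob_at_least(k: int, depth: int, vaf: float) -> float:
    """P(X >= k) for X ~ Binomial(depth, vaf), under the idealized model."""
    below = sum(
        comb(depth, i) * vaf**i * (1.0 - vaf) ** (depth - i) for i in range(k)
    )
    return 1.0 - below


# 0.1% circulating fraction at 5000x depth: about 5 supporting reads expected.
mean_support = expected_alt_reads(5000, 0.001)

# Chance that a true variant at that fraction reaches a static 8-read filter.
p_pass_static_filter = prob_at_least(8, 5000, 0.001)
```

  • Under this simplified model, a true variant at a 0.1% circulating fraction would clear a static 8-read filter only a minority of the time (on the order of 13%), which illustrates why a single fixed threshold tends to be too stringent at low circulating fractions.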
  • Advantageously, the present disclosure provides methods and systems that more accurately call somatic variants by adjusting the variant count threshold in a locus-by-locus fashion, e.g., by lowering the variant count threshold when there is an increased likelihood (orthogonal to the variant count in the sequencing reaction) that a variant at a particular locus is a true somatic variant and/or by raising the variant count threshold when there is an increased likelihood (orthogonal to the variant count in the sequencing reaction) that a variant at a particular locus is a result of a sequencing error, rather than a true somatic variant.
  • For example, in some embodiments, the methods and systems described herein employ a generalized application of Bayes' Theorem through the likelihood ratio test that allows dynamic calibration of filtering threshold for diagnostic assays. These thresholds are based on one or more of a sample-specific error rate, a methodology-specific sequencing error rate (e.g., from a pool of process matched healthy control samples), an estimate of the variant allele fraction for the variant being evaluated, and a historical likelihood that a variant would be present at a particular locus in a particular cancer (e.g., derived from an extensive cohort of human solid tumor tissue samples to inform probability models). This results in high sensitivity and specificity in variant detection, allowing identification of actionable oncologic targets, as well as determination of a precise limit of detection to reduce the occurrence of false negatives.
  • For instance, in some embodiments, the dynamic variant filtering methodology described herein uses an application of Bayes' theorem to dynamically tune a variant count threshold for calling a somatic variant at a particular genomic region based on the prevalence of similar mutations within that genomic region in similar cancers. For instance, where there is a high prevalence of a somatic variant in a given gene for a particular cancer (e.g., BRCA1 mutations are common in breast cancers), the dynamic filtering method accounts for this prior (e.g., the prior knowledge that BRCA1 mutations are commonly found in breast cancers) by setting a lower variant count threshold to call somatic variants in the BRCA1 gene for a breast cancer. That is, the dynamic filtering methodology requires less evidence in order to call a variant in the BRCA1 gene when the subject has breast cancer than when the subject has a different cancer that is not associated with a high prevalence of BRCA1 mutations.
  • In some embodiments, the dynamic variant filtering methodology described herein uses an application of Bayes' theorem to dynamically tune a variant count threshold for calling a somatic variant based on an estimated variant allele fraction for the variant being evaluated. That is, the dynamic filtering methodology takes into account the fact that in a sample having a lower tumor fraction, and therefore a lower variant allele fraction, fewer sequences encompassing a somatic variant would be expected than in a sample having a higher tumor fraction, and therefore a higher variant allele fraction. Accordingly, the sensitivity and specificity of the dynamic filter are tuned to account for the expectation that a higher percentage of variant sequences with low sequence counts (e.g., lower support) represent true somatic variants in a sample with a low tumor fraction than in a sample with a high tumor fraction, for which a higher percentage of variant sequences with low sequence counts represent sequencing errors.
  • In some embodiments, the dynamic variant filtering methodology described herein uses an application of Bayes' theorem to dynamically tune a variant count threshold for calling a somatic variant at a particular genomic locus based on a historical sequencing error rate for the locus. That is, the dynamic filtering methodology takes into account the fact that at genomic loci that are more prone to sequencing errors, such as loci with short nucleotide repeat sequences (e.g., di-nucleotide or tri-nucleotide repeats), there is a higher likelihood that a particular variant is a product of a sequencing error, rather than a true somatic mutation, than at a locus that is not prone to sequencing errors.
  • Similarly, in some embodiments, the dynamic variant filtering methodology described herein uses an application of Bayes' theorem to dynamically tune a variant count threshold for calling a somatic variant at a particular genomic locus based on a reaction-specific sequencing error rate. That is, the dynamic filtering methodology takes into account the fact that in reactions with higher sequencing error rates there is a higher likelihood that a particular variant is a product of a sequencing error, rather than a true somatic mutation.
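  • The way these inputs could combine into a locus-specific threshold can be illustrated with a minimal sketch. It is not the formula of the present disclosure: it assumes a simple binomial model in which supporting reads arise at rate `vaf` if the variant is real and at rate `error_rate` if it is a sequencing artifact, and it treats the historical prevalence as a prior probability. All function names, parameter names, and the posterior-odds cutoff are hypothetical.

```python
import math


def log_likelihood_ratio(k: int, depth: int, vaf: float, error_rate: float) -> float:
    """log[P(k alt reads | true variant) / P(k alt reads | sequencing error)].

    Both hypotheses are modeled as binomial, so the binomial coefficient cancels.
    """
    return k * math.log(vaf / error_rate) + (depth - k) * math.log(
        (1.0 - vaf) / (1.0 - error_rate)
    )


def min_supporting_reads(depth, vaf, error_rate, prior, odds_cutoff=1.0):
    """Smallest alt-read count whose posterior odds reach `odds_cutoff`.

    Posterior odds = prior odds * likelihood ratio (Bayes' theorem in odds form).
    Returns None if no count up to `depth` is sufficient.
    """
    if not 0.0 < error_rate < vaf < 1.0:
        raise ValueError("this sketch assumes 0 < error_rate < vaf < 1")
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    prior_log_odds = math.log(prior / (1.0 - prior))
    target = math.log(odds_cutoff)
    for k in range(depth + 1):
        if prior_log_odds + log_likelihood_ratio(k, depth, vaf, error_rate) >= target:
            return k
    return None


# Same locus and depth; only the prior or the error rate changes.
common_gene = min_supporting_reads(5000, vaf=0.002, error_rate=0.0002, prior=0.1)
rare_gene = min_supporting_reads(5000, vaf=0.002, error_rate=0.0002, prior=0.001)
noisy_locus = min_supporting_reads(5000, vaf=0.002, error_rate=0.001, prior=0.1)
```

  • In this sketch, a higher prior (a gene commonly mutated in the subject's cancer) lowers the required support, while a higher locus-specific or reaction-specific error rate raises it, which mirrors the qualitative behavior described in the preceding paragraphs.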
  • The present disclosure provides improved systems and methods for precision oncology based on improved variant calling in liquid biopsy data. The various improvements described herein, e.g., improved variant detection at low circulating fractions, are embodied in an example liquid biopsy workflow described in Examples 2 and 3. These examples describe an example liquid biopsy assay employing a 105-gene hybrid-capture next-generation sequencing (NGS) panel spanning 270 kb of the human genome, configured to detect targets in four variant classes, including single nucleotide variants (SNVs), insertions and/or deletions (indels), copy number variants (CNVs), and gene rearrangements. To establish robust clinical performance, extensive validation studies were conducted that demonstrated high sensitivity and specificity. Accordingly, the example liquid biopsy assay detected actionable variants with high accuracy in comparison to a commercial ctDNA NGS kit, commercial solid tumor biopsy-based assays, such as a solid tumor biopsy NGS tissue assay, and digital droplet PCR (ddPCR). As shown in the results of FIG. 17, the methods and systems disclosed herein reduced false positive variant calling by 11.45% compared to conventional variant detection methods.
  • The identification of actionable genomic alterations in a patient's cancer genome is a difficult and computationally demanding problem. For instance, the determination of various prognostic metrics useful for precision oncology, such as variant allelic ratio, copy number variation, tumor mutational burden, microsatellite instability status, etc., requires analysis of hundreds of millions to billions of sequenced nucleic acid bases. An example of a typical bioinformatics pipeline established for this purpose includes at least five stages of analysis: assessment of the quality of raw next generation sequencing data, generation of collapsed nucleic acid fragment sequences and alignment of such sequences to a reference genome, detection of structural variants in the aligned sequence data, annotation of identified variants, and visualization of the data. See, Wadapurkar and Vyas, Informatics in Medicine Unlocked, 11:75-82 (2018), the content of which is hereby incorporated by reference, in its entirety, for all purposes. Each one of these procedures is computationally taxing in its own right.
  • For instance, the overall temporal and spatial computational complexities of simple global and local pairwise sequence alignment algorithms are quadratic in nature (e.g., second-order problems) and increase rapidly as a function of the sizes of the nucleic acid sequences (n and m) being compared. Specifically, the temporal and spatial complexities of these sequence alignment algorithms can be estimated as O(mn), where O is the upper bound on the asymptotic growth rate of the algorithm, n is the number of bases in the first nucleic acid sequence, and m is the number of bases in the second nucleic acid sequence. See, Baichoo and Ouzounis, BioSystems, 156-157:72-85 (2017), the content of which is hereby incorporated by reference, in its entirety, for all purposes. Given that the human genome contains more than 3 billion bases, these alignment algorithms are extremely computationally taxing, especially when used to analyze next generation sequencing (NGS) data, which can generate more than 3 billion sequence reads per reaction.
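  • The O(mn) cost is visible in a textbook global alignment (Needleman-Wunsch style) score computation, sketched below. The scoring values are arbitrary illustrative defaults, not parameters taken from the present disclosure, and production aligners use far more elaborate heuristics.

```python
def global_alignment_score(a: str, b: str, match=1, mismatch=-1, gap=-1) -> int:
    """Score of the best global alignment of sequences a and b.

    Fills an (n + 1) x (m + 1) table, so both time and memory grow as O(mn).
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]

    # Aligning a prefix against an empty sequence costs one gap per base.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # Each cell depends on its upper, left, and upper-left neighbors.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch
            )
            score[i][j] = max(
                diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap
            )
    return score[n][m]
```

  • The nested loops visit every pair of positions once, which is why doubling both sequence lengths roughly quadruples the work.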
  • This is particularly true when performed in the context of a liquid biopsy assay, because liquid biopsy samples contain a complex mixture of short DNA fragments originating from many different germline (e.g., healthy) and diseased (e.g., cancerous) tissues. Thus, the cellular origins of the sequence reads are unknown, and the sequence signals originating from cancerous cells, which may constitute multiple sub-clonal populations, must be computationally deconvoluted from signals originating from germline and hematopoietic origins, in order to provide relevant information about the subject's cancer. Thus, in addition to the computationally taxing processes required to align sequence reads to a human genome, there is a computation problem of determining whether a particular abnormal signal, e.g., one or more sequence reads corresponding to a genomic alteration, (i) is not an artifact, and (ii) originated from a cancerous source in the subject. This is increasingly difficult during the early stages of cancer—when treatment is presumably most effective—when only small amounts of ctDNA are diluted by germline and hematopoietic DNA.
  • Advantageously, the present disclosure provides various systems and methods that improve the computational elucidation of actionable genomic alterations from a liquid biopsy sample of a cancer patient. Specifically, the present disclosure improves a method for identifying variants in ctDNA using a dynamic thresholding approach. As described above, the disclosed methods and systems are necessarily computer-implemented due to their complexity and heavy computational requirements, and thus solve a problem in the computing art.
  • Advantageously, the methods and systems described herein provide an improvement to the abovementioned technical problem (e.g., performing complex computer-implemented methods for identifying variants in ctDNA using a dynamic thresholding approach). The methods described herein therefore solve a problem in the computing art by improving upon conventional methods for identifying variants (e.g., actionable oncologic targets) for cancer diagnosis and treatment. For example, the application of Bayes' Theorem through the likelihood ratio test provides a means for improving detection of true positive variants and reducing detection of false positive variants for clinically relevant biomarkers, thus improving the accuracy and precision of genomic alteration detection in precision oncology.
  • The methods and systems described herein also improve precision oncology methods for assigning and/or administering treatment because of the improved accuracy of variation detection. The identification of therapeutically actionable variants that can be included in a clinical report for patient and/or clinician review, and/or matched with appropriate therapies and/or clinical trials for treatment and/or monitoring, allows for more accurate assignment of treatments. Furthermore, the removal of false positive variant detection reduces the risk of patients undergoing unnecessary or potentially harmful regimens due to misdiagnoses.
  • Definitions
  • As used herein, the term “subject” refers to any living or non-living organism including, but not limited to, a human (e.g., a male human, female human, fetus, pregnant female, child, or the like), a non-human mammal, or a non-human animal. Any human or non-human animal can serve as a subject, including but not limited to mammal, reptile, avian, amphibian, fish, ungulate, ruminant, bovine (e.g., cattle), equine (e.g., horse), caprine and ovine (e.g., sheep, goat), swine (e.g., pig), camelid (e.g., camel, llama, alpaca), monkey, ape (e.g., gorilla, chimpanzee), ursid (e.g., bear), poultry, dog, cat, mouse, rat, dolphin, whale and shark. In some embodiments, a subject is a male or female of any age (e.g., a man, a woman, or a child).
  • As used herein, the terms “control,” “control sample,” “reference,” “reference sample,” “normal,” and “normal sample” describe a sample from a non-diseased tissue. In some embodiments, such a sample is from a subject that does not have a particular condition (e.g., cancer). In other embodiments, such a sample is an internal control from a subject, e.g., who may or may not have the particular disease (e.g., cancer), but is from a healthy tissue of the subject. For example, where a liquid or solid tumor sample is obtained from a subject with cancer, an internal control sample may be obtained from a healthy tissue of the subject, e.g., a white blood cell sample from a subject without a blood cancer or a solid germline tissue sample from the subject. Accordingly, a reference sample can be obtained from the subject or from a database, e.g., from a second subject who does not have the particular disease (e.g., cancer).
  • As used herein, the terms “cancer,” “cancerous tissue,” and “tumor” refer to an abnormal mass of tissue in which the growth of the mass surpasses, and is not coordinated with, the growth of normal tissue, including both solid masses (e.g., as in a solid tumor) and fluid masses (e.g., as in a hematological cancer). A cancer or tumor can be defined as “benign” or “malignant” depending on the following characteristics: degree of cellular differentiation including morphology and functionality, rate of growth, local invasion and metastasis. A “benign” tumor can be well differentiated, have characteristically slower growth than a malignant tumor and remain localized to the site of origin. In addition, in some cases a benign tumor does not have the capacity to infiltrate, invade or metastasize to distant sites. A “malignant” tumor can be poorly differentiated (anaplasia) and have characteristically rapid growth accompanied by progressive infiltration, invasion, and destruction of the surrounding tissue. Furthermore, a malignant tumor can have the capacity to metastasize to distant sites. Accordingly, a cancer cell is a cell found within the abnormal mass of tissue whose growth is not coordinated with the growth of normal tissue. Accordingly, a “tumor sample” refers to a biological sample obtained or derived from a tumor of a subject, as described herein.
  • Non-limiting examples of cancer types include ovarian cancer, cervical cancer, uveal melanoma, colorectal cancer, chromophobe renal cell carcinoma, liver cancer, endocrine tumor, oropharyngeal cancer, retinoblastoma, biliary cancer, adrenal cancer, neural cancer, neuroblastoma, basal cell carcinoma, brain cancer, breast cancer, non-clear cell renal cell carcinoma, glioblastoma, glioma, kidney cancer, gastrointestinal stromal tumor, medulloblastoma, bladder cancer, gastric cancer, bone cancer, non-small cell lung cancer, thymoma, prostate cancer, clear cell renal cell carcinoma, skin cancer, thyroid cancer, sarcoma, testicular cancer, head and neck cancer (e.g., head and neck squamous cell carcinoma), meningioma, peritoneal cancer, endometrial cancer, pancreatic cancer, mesothelioma, esophageal cancer, small cell lung cancer, Her2 negative breast cancer, ovarian serous carcinoma, HR+ breast cancer, uterine serous carcinoma, uterine corpus endometrial carcinoma, gastroesophageal junction adenocarcinoma, gallbladder cancer, chordoma, and papillary renal cell carcinoma.
  • As used herein, the terms “cancer state” or “cancer condition” refer to a characteristic of a cancer patient's condition, e.g., a diagnostic status, a type of cancer, a location of cancer, a primary origin of a cancer, a cancer stage, a cancer prognosis, and/or one or more additional characteristics of a cancer (e.g., tumor characteristics such as morphology, heterogeneity, size, etc.). In some embodiments, one or more additional personal characteristics of the subject are used to further describe the cancer state or cancer condition of the subject, e.g., age, gender, weight, race, personal habits (e.g., smoking, drinking, diet), other pertinent medical conditions (e.g., high blood pressure, dry skin, other diseases), current medications, allergies, pertinent medical history, current side effects of cancer treatments and other medications, etc.
  • As used herein, the term “liquid biopsy” sample refers to a liquid sample obtained from a subject that includes cell-free DNA. Examples of liquid biopsy samples include, but are not limited to, blood, whole blood, plasma, serum, urine, cerebrospinal fluid, fecal material, saliva, sweat, tears, pleural fluid, pericardial fluid, or peritoneal fluid of the subject. In some embodiments, a liquid biopsy sample is a cell-free sample, e.g., a cell free blood sample. In some embodiments, a liquid biopsy sample is obtained from a subject with cancer. In some embodiments, a liquid biopsy sample is collected from a subject with an unknown cancer status, e.g., for use in determining a cancer status of the subject. Likewise, in some embodiments, a liquid biopsy is collected from a subject with a non-cancerous disorder, e.g., a cardiovascular disease. In some embodiments, a liquid biopsy is collected from a subject with an unknown status for a non-cancerous disorder, e.g., for use in determining a non-cancerous disorder status of the subject.
  • As used herein, the terms “cell-free DNA” and “cfDNA” interchangeably refer to DNA fragments that circulate in a subject's body (e.g., bloodstream) and originate from one or more healthy cells and/or from one or more cancer cells. These DNA molecules are found outside cells, in bodily fluids such as blood, whole blood, plasma, serum, urine, cerebrospinal fluid, fecal material, saliva, sweat, tears, pleural fluid, pericardial fluid, or peritoneal fluid of a subject, and are believed to be fragments of genomic DNA expelled from healthy and/or cancerous cells, e.g., upon apoptosis and lysis of the cellular envelope.
  • As used herein, the term “locus” refers to a position (e.g., a site) within a genome, e.g., on a particular chromosome. In some embodiments, a locus refers to a single nucleotide position, on a particular chromosome, within a genome. In some embodiments, a locus refers to a group of nucleotide positions within a genome. In some instances, a locus is defined by a mutation (e.g., substitution, insertion, deletion, inversion, or translocation) of consecutive nucleotides within a cancer genome. In some instances, a locus is defined by a gene, a sub-genic structure (e.g., a regulatory element, exon, intron, or combination thereof), or a predefined span of a chromosome. Because normal mammalian cells have diploid genomes, a normal mammalian genome (e.g., a human genome) will generally have two copies of every locus in the genome, or at least two copies of every locus located on the autosomal chromosomes, e.g., one copy on the maternal autosomal chromosome and one copy on the paternal autosomal chromosome.
  • As used herein, the term “allele” refers to a particular sequence of one or more nucleotides at a chromosomal locus. In a haploid organism, the subject has one allele at every chromosomal locus. In a diploid organism, the subject has two alleles at every chromosomal locus.
  • As used herein, the term “base pair” or “bp” refers to a unit consisting of two nucleobases bound to each other by hydrogen bonds. Generally, the size of an organism's genome is measured in base pairs because DNA is typically double stranded. However, some viruses have single-stranded DNA or RNA genomes.
  • As used herein, the terms “genomic alteration,” “mutation,” and “variant” refer to a detectable change in the genetic material of one or more cells. A genomic alteration, mutation, or variant can refer to various types of changes in the genetic material of a cell, including changes in the primary genome sequence at single or multiple nucleotide positions, e.g., a single nucleotide variant (SNV), a multi-nucleotide variant (MNV), an indel (e.g., an insertion or deletion of nucleotides), a DNA rearrangement (e.g., an inversion or translocation of a portion of a chromosome or chromosomes), a variation in the copy number of a locus (e.g., an exon, gene, or a large span of a chromosome) (CNV), a partial or complete change in the ploidy of the cell, as well as changes in the epigenetic information of a genome, such as altered DNA methylation patterns. In some embodiments, a mutation is a change in the genetic information of the cell relative to a particular reference genome, or one or more ‘normal’ alleles found in the population of the species of the subject. For instance, mutations can be found in both germline cells (e.g., non-cancerous, ‘normal’ cells) of a subject and in abnormal cells (e.g., pre-cancerous or cancerous cells) of the subject. As such, a mutation in a germline of the subject (e.g., which is found in substantially all ‘normal cells’ in the subject) is identified relative to a reference genome for the species of the subject. However, many loci of a reference genome of a species are associated with several variant alleles that are significantly represented in the population of the subject and are not associated with a diseased state, e.g., such that they would not be considered ‘mutations.’ By contrast, in some embodiments, a mutation in a cancerous cell of a subject can be identified relative to either a reference genome of the subject or to the subject's own germline genome. 
In certain instances, identification of both types of variants can be informative. For instance, in some instances, a mutation that is present in both the cancer genome of the subject and the germline of the subject is informative for precision oncology when the mutation is a so-called ‘driver mutation,’ which contributes to the initiation and/or development of a cancer. However, in other instances, a mutation that is present in both the cancer genome of the subject and the germline of the subject is not informative for precision oncology, e.g., when the mutation is a so-called ‘passenger mutation,’ which does not contribute to the initiation and/or development of the cancer. Likewise, in some instances, a mutation that is present in the cancer genome of the subject but not the germline of the subject is informative for precision oncology, e.g., where the mutation is a driver mutation and/or the mutation facilitates a therapeutic approach, e.g., by differentiating cancer cells from normal cells in a therapeutically actionable way. However, in some instances, a mutation that is present in the cancer genome but not the germline of a subject is not informative for precision oncology, e.g., where the mutation is a passenger mutation and/or where the mutation fails to differentiate the cancer cell from a germline cell in a therapeutically actionable way.
  • As used herein, the term “reference allele” refers to the sequence of one or more nucleotides at a chromosomal locus that is either the predominant allele represented at that chromosomal locus within the population of the species (e.g., the “wild-type” sequence), or an allele that is predefined within a reference genome for the species.
  • As used herein, the term “variant allele” refers to a sequence of one or more nucleotides at a chromosomal locus that is either not the predominant allele represented at that chromosomal locus within the population of the species (e.g., not the “wild-type” sequence), or not an allele that is predefined within a reference sequence construct (e.g., a reference genome or set of reference genomes) for the species. In some instances, sequence isoforms found within the population of a species that do not affect a change in a protein encoded by the genome, or that result in an amino acid substitution that does not substantially affect the function of an encoded protein, are not variant alleles.
  • As used herein, the term “variant allele fraction,” “VAF,” “allelic fraction,” or “AF” refers to the number of times a variant or mutant allele was observed (e.g., a number of reads supporting a candidate variant allele) divided by the total number of times the position was sequenced (e.g., a total number of reads covering a candidate locus).
  • As used herein, the terms “variant fragment count” and “variant allele fragment count” interchangeably refer to a quantification, e.g., a raw or normalized count, of the number of sequences representing unique cell-free DNA fragments encompassing a variant allele in a sequencing reaction. That is, a variant fragment count represents a count of sequence reads representing unique molecules in the liquid biopsy sample, after duplicate sequence reads in the raw sequencing data have been collapsed, e.g., through the use of unique molecular indices (UMI) and bagging, etc. as described herein.
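  • The relationship between raw reads, collapsed fragments, and the variant allele fraction can be sketched as follows. This is a simplified illustration: it assumes reads are grouped into families by UMI and alignment start position and that a family supports the variant when a strict majority of its reads do. The grouping key, the majority rule, and the function names are assumptions for illustration, and the bagging step mentioned above is not modeled.

```python
from collections import defaultdict


def variant_fragment_count(reads):
    """Collapse duplicate reads and count unique fragments.

    `reads` is an iterable of (umi, start_position, has_variant) tuples.
    Returns (variant_fragments, total_fragments).
    """
    families = defaultdict(list)
    for umi, start, has_variant in reads:
        # Reads sharing a UMI and start position are treated as PCR duplicates.
        families[(umi, start)].append(bool(has_variant))

    total = len(families)
    # A family counts toward the variant if most of its reads carry it.
    variant = sum(
        1 for calls in families.values() if 2 * sum(calls) > len(calls)
    )
    return variant, total


def variant_allele_fraction(variant_fragments: int, total_fragments: int) -> float:
    """Supporting fragments divided by all fragments covering the locus."""
    return variant_fragments / total_fragments if total_fragments else 0.0


reads = [
    ("AAC", 100, True), ("AAC", 100, True),   # one variant-bearing fragment
    ("GGT", 100, False),                      # one reference fragment
    ("TTA", 105, True), ("TTA", 105, False), ("TTA", 105, False),  # likely error
]
variant, total = variant_fragment_count(reads)
vaf = variant_allele_fraction(variant, total)
```

  • Here six raw reads collapse to three unique fragments, only one of which supports the variant, so counting raw reads instead of fragments would overstate the support.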
  • As used herein, the term “germline variants” refers to genetic variants inherited from maternal and paternal DNA. Germline variants may be determined through a matched tumor-normal calling pipeline.
  • As used herein, the term “somatic variants” refers to variants arising as a result of dysregulated cellular processes associated with neoplastic cells, e.g., a mutation. Somatic variants may be detected via subtraction from a matched normal sample.
  • As used herein, the term “single nucleotide variant” or “SNV” refers to a substitution of one nucleotide to a different nucleotide at a position (e.g., site) of a nucleotide sequence, e.g., a sequence read from an individual. A substitution from a first nucleobase X to a second nucleobase Y may be denoted as “X>Y.” For example, a cytosine to thymine SNV may be denoted as “C>T.”
  • As used herein, the term “insertions and deletions” or “indels” refers to a variant resulting from the gain or loss of DNA base pairs within an analyzed region.
  • As used herein, the term “copy number variation” or “CNV” refers to the process by which large structural changes in a genome associated with tumor aneuploidy and other dysregulated repair systems are detected. These processes are used to detect large scale insertions or deletions of entire genomic regions. CNV is defined as structural insertions or deletions greater than a certain base pair (“bp”) in size, such as 500 bp.
  • As used herein, the term “gene fusion” refers to the product of large-scale chromosomal aberrations resulting in the creation of a chimeric protein. These expressed products can be non-functional, or they can be highly over or underactive. This can cause deleterious effects in cancer such as hyper-proliferative or anti-apoptotic phenotypes.
  • As used herein, the term “loss of heterozygosity” refers to the loss of one copy of a segment (e.g., including part or all of one or more genes) of the genome of a diploid subject (e.g., a human) or loss of one copy of a sequence encoding a functional gene product in the genome of the diploid subject, in a tissue, e.g., a cancerous tissue, of the subject. As used herein, when referring to a metric representing loss of heterozygosity across the entire genome of the subject, loss of heterozygosity is caused by the loss of one copy of various segments in the genome of the subject. Loss of heterozygosity across the entire genome may be estimated without sequencing the entire genome of a subject, and such methods for such estimations based on gene panel targeting-based sequencing methodologies are described in the art. Accordingly, in some embodiments, a metric representing loss of heterozygosity across the entire genome of a tissue of a subject is represented as a single value, e.g., a percentage or fraction of the genome. In some cases, a tumor is composed of various sub-clonal populations, each of which may have a different degree of loss of heterozygosity across their respective genomes. Accordingly, in some embodiments, loss of heterozygosity across the entire genome of a cancerous tissue refers to an average loss of heterozygosity across a heterogeneous tumor population. As used herein, when referring to a metric for loss of heterozygosity in a particular gene, e.g., a DNA repair protein such as a protein involved in the homologous DNA recombination pathway (e.g., BRCA1 or BRCA2), loss of heterozygosity refers to complete or partial loss of one copy of the gene encoding the protein in the genome of the tissue and/or a mutation in one copy of the gene that prevents translation of a full-length gene product, e.g., a frameshift or truncating (creating a premature stop codon in the gene) mutation in the gene of interest. 
In some cases, a tumor is composed of various sub-clonal populations, each of which may have a different mutational status in a gene of interest. Accordingly, in some embodiments, loss of heterozygosity for a particular gene of interest is represented by an average value for loss of heterozygosity for the gene across all sequenced sub-clonal populations of the cancerous tissue. In other embodiments, loss of heterozygosity for a particular gene of interest is represented by a count of the number of unique incidences of loss of heterozygosity in the gene of interest across all sequenced sub-clonal populations of the cancerous tissue (e.g., the number of unique frame-shift and/or truncating mutations in the gene identified in the sequencing data).
  • As used herein, the term “microsatellites” refers to short, repeated sequences of DNA. The smallest nucleotide repeated unit of a microsatellite is referred to as the “repeated unit” or “repeat unit.” In some embodiments, the stability of a microsatellite locus is evaluated by comparing some metric of the distribution of the number of repeated units at a microsatellite locus to a reference number or distribution.
  • As used herein, the term “microsatellite instability” or “MSI” refers to a genetic hypermutability condition associated with various cancers that results from impaired DNA mismatch repair (MMR) in a subject. Among other phenotypes, MSI causes changes in the size of microsatellite loci, e.g., a change in the number of repeated units at microsatellite loci, during DNA replication. Accordingly, the size of microsatellite repeats is varied in MSI cancers as compared to the size of the corresponding microsatellite repeats in the germline of a cancer subject. The term “Microsatellite Instability-High” or “MSI-H” refers to a state of a cancer (e.g., a tumor) that has a significant MMR defect, resulting in microsatellite loci with significantly different lengths than the corresponding microsatellite loci in normal cells of the same individual. The term “Microsatellite Stable” or “MSS” refers to a state of a cancer (e.g., a tumor) without significant MMR defects, such that there is no significant difference between the lengths of the microsatellite loci in cancerous cells and the lengths of the corresponding microsatellite loci in normal (e.g., non-cancerous) cells in the same individual. The term “Microsatellite Equivocal” or “MSE” refers to a state of a cancer (e.g., a tumor) having an intermediate microsatellite length phenotype, that cannot be clearly classified as MSI-H or MSS based on statistical cutoffs used to define those two categories.
  • As used herein, the term “gene product” refers to an RNA (e.g., mRNA or miRNA) or protein molecule transcribed or translated from a particular genomic locus, e.g., a particular gene. The genomic locus can be identified using a gene name, a chromosomal location, or any other genetic mapping metric.
  • As used herein, the term “ratio” refers to any comparison of a first metric X, or a first mathematical transformation thereof X′ (e.g., measurement of a number of units of a genomic sequence in a first one or more biological samples or a first mathematical transformation thereof) to another metric Y or a second mathematical transformation thereof Y′ (e.g., the number of units of a respective genomic sequence in a second one or more biological samples or a second mathematical transformation thereof) expressed as X/Y, Y/X, log_N(X/Y), log_N(Y/X), X′/Y, Y/X′, log_N(X′/Y), log_N(Y/X′), X/Y′, Y′/X, log_N(X/Y′), log_N(Y′/X), X′/Y′, Y′/X′, log_N(X′/Y′), or log_N(Y′/X′), where N is any real number greater than 1 and where example mathematical transformations of X and Y include, but are not limited to, raising X or Y to a power Z, multiplying X or Y by a constant Q, where Z and Q are any real numbers, and/or taking an M based logarithm of X and/or Y, where M is a real number greater than 1. In one non-limiting example, X is transformed to X′ prior to ratio calculation by raising X to the power of two (X^2) and Y is transformed to Y′ prior to ratio calculation by raising Y to the power of 3.2 (Y^3.2) and the ratio of X and Y is computed as log_2(X′/Y′).
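  • The non-limiting example above can be sketched in Python as follows. This is a minimal illustration only; the function name `transformed_log_ratio` and the input values are assumptions made for this sketch and do not appear in the disclosure.

```python
import math


def transformed_log_ratio(x, y, x_power=2.0, y_power=3.2, base=2.0):
    """Return log_base(X'/Y'), where X' = x**x_power and Y' = y**y_power.

    The defaults mirror the non-limiting example in the text
    (X' = X^2, Y' = Y^3.2, base-2 logarithm).
    """
    if x <= 0 or y <= 0:
        raise ValueError("x and y must be positive to take a log ratio")
    x_prime = x ** x_power
    y_prime = y ** y_power
    return math.log(x_prime / y_prime, base)


# Hypothetical inputs: X = 8 units in the first sample, Y = 4 in the second.
# log2(8^2) = 6 and log2(4^3.2) = 6.4, so the ratio is about -0.4.
example = transformed_log_ratio(8.0, 4.0)
```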
  • As used herein, the terms “expression level,” “abundance level,” or simply “abundance” refer to an amount of a gene product (an RNA species, e.g., mRNA or miRNA, or protein molecule) transcribed or translated by a cell, or an average amount of a gene product transcribed or translated across multiple cells. When referring to mRNA or protein expression, the term generally refers to the amount of any RNA or protein species corresponding to a particular genomic locus, e.g., a particular gene. However, in some embodiments, an expression level can refer to the amount of a particular isoform of an mRNA or protein corresponding to a particular gene that gives rise to multiple mRNA or protein isoforms. The genomic locus can be identified using a gene name, a chromosomal location, or any other genetic mapping metric.
  • As used herein, the term “relative abundance” refers to a ratio of a first amount of a compound measured in a sample, e.g., a gene product (an RNA species, e.g., mRNA or miRNA, or protein molecule) or nucleic acid fragments having a particular characteristic (e.g., aligning to a particular locus or encompassing a particular allele), to a second amount of a compound measured in a second sample. In some embodiments, relative abundance refers to a ratio of an amount of a species of a compound to a total amount of the compound in the same sample. One instance is a ratio of the amount of mRNA transcripts encoding a particular gene in a sample (e.g., aligning to a particular region of the exome) to the total amount of mRNA transcripts in the sample. In other embodiments, relative abundance refers to a ratio of an amount of a compound or species of a compound in a first sample to an amount of the compound or the species of the compound in a second sample. One instance is a ratio of a normalized amount of mRNA transcripts encoding a particular gene in a first sample to a normalized amount of mRNA transcripts encoding the particular gene in a second and/or reference sample.
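  • The two senses of relative abundance described above can be illustrated with a short Python sketch. The function names, gene labels, and counts below are hypothetical and are used only for illustration.

```python
def relative_abundance_within_sample(counts_by_gene, gene):
    """Ratio of one species' amount to the total amount in the same sample."""
    total = sum(counts_by_gene.values())
    if total == 0:
        raise ValueError("sample contains no counted transcripts")
    return counts_by_gene[gene] / total


def relative_abundance_between_samples(normalized_first, normalized_second):
    """Ratio of a normalized amount in a first sample to that in a second sample."""
    if normalized_second == 0:
        raise ValueError("second (reference) amount must be non-zero")
    return normalized_first / normalized_second


# Hypothetical transcript counts for a single sample.
sample_counts = {"GENE_A": 25, "GENE_B": 75}
within = relative_abundance_within_sample(sample_counts, "GENE_A")

# Hypothetical normalized amounts of the same gene in two samples.
between = relative_abundance_between_samples(12.0, 4.0)
```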
  • As used herein, the terms “sequencing,” “sequence determination,” and the like refer to any biochemical processes that may be used to determine the order of biological macromolecules such as nucleic acids or proteins. For example, sequencing data can include all or a portion of the nucleotide bases in a nucleic acid molecule such as an mRNA transcript or a genomic locus.
  • As used herein, the term “genetic sequence” refers to a recordation of a series of nucleotides present in a subject's RNA or DNA as determined by sequencing of nucleic acids from the subject.
  • As used herein, the term “sequence reads” or “reads” refers to nucleotide sequences produced by any nucleic acid sequencing process described herein or known in the art. Reads can be generated from one end of nucleic acid fragments (“single-end reads”) or from both ends of nucleic acid fragments (e.g., paired-end reads, double-end reads). The length of the sequence read is often associated with the particular sequencing technology. High-throughput methods, for example, provide sequence reads that can vary in size from tens to hundreds of base pairs (bp). In some embodiments, the sequence reads are of a mean, median or average length of about 15 bp to 900 bp long (e.g., about 20 bp, about 25 bp, about 30 bp, about 35 bp, about 40 bp, about 45 bp, about 50 bp, about 55 bp, about 60 bp, about 65 bp, about 70 bp, about 75 bp, about 80 bp, about 85 bp, about 90 bp, about 95 bp, about 100 bp, about 110 bp, about 120 bp, about 130 bp, about 140 bp, about 150 bp, about 200 bp, about 250 bp, about 300 bp, about 350 bp, about 400 bp, about 450 bp, or about 500 bp). In some embodiments, the sequence reads are of a mean, median or average length of about 1000 bp, 2000 bp, 5000 bp, 10,000 bp, or 50,000 bp or more. Nanopore® sequencing, for example, can provide sequence reads that can vary in size from tens to hundreds to thousands of base pairs. Illumina® parallel sequencing, for example, can provide sequence reads that do not vary as much, for example, most of the sequence reads can be smaller than 200 bp. A sequence read (or sequencing read) can refer to sequence information corresponding to a nucleic acid molecule (e.g., a string of nucleotides). For example, a sequence read can correspond to a string of nucleotides (e.g., about 20 to about 150) from part of a nucleic acid fragment, can correspond to a string of nucleotides at one or both ends of a nucleic acid fragment, or can correspond to nucleotides of the entire nucleic acid fragment. A sequence read can be obtained in a variety of ways, e.g., using sequencing techniques or using probes, e.g., in hybridization arrays or capture probes, or amplification techniques, such as the polymerase chain reaction (PCR) or linear amplification using a single primer or isothermal amplification.
  • As used herein, the term “read segment” refers to any form of nucleotide sequence read including the raw sequence reads obtained directly from a nucleic acid sequencing technique or from a sequence derived therefrom, e.g., an aligned sequence read, a collapsed sequence read, or a stitched sequence read.
  • As used herein, the term “read count” refers to the total number of nucleic acid reads generated, which may or may not be equivalent to the number of nucleic acid molecules generated, during a nucleic acid sequencing reaction.
  • As used herein, the term “read-depth,” “sequencing depth,” or “depth” can refer to a total number of unique nucleic acid fragments encompassing a particular locus or region of the genome of a subject that are sequenced in a particular sequencing reaction. Sequencing depth can be expressed as “Yx”, e.g., 50×, 100×, etc., where “Y” refers to the number of unique nucleic acid fragments encompassing a particular locus that are sequenced in a sequencing reaction. In such a case, Y is necessarily an integer, because it represents the actual sequencing depth for a particular locus. Alternatively, read-depth, sequencing depth, or depth can refer to a measure of central tendency (e.g., a mean or mode) of the number of unique nucleic acid fragments that encompass one of a plurality of loci or regions of the genome of a subject that are sequenced in a particular sequencing reaction. For example, in some embodiments, sequencing depth refers to the average depth of every locus across an arm of a chromosome, a targeted sequencing panel, an exome, or an entire genome. In such case, Y may be expressed as a fraction or a decimal, because it refers to an average coverage across a plurality of loci. When a mean depth is recited, the actual depth for any particular locus may be different than the overall recited depth. Metrics can be determined that provide a range of sequencing depths in which a defined percentage of the total number of loci fall. For instance, a range of sequencing depths within which 90% or 95%, or 99% of the loci fall. As understood by the skilled artisan, different sequencing technologies provide different sequencing depths. For instance, low-pass whole genome sequencing can refer to technologies that provide a sequencing depth of less than 5×, less than 4×, less than 3×, or less than 2×, e.g., from about 0.5× to about 3×.
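  • As an illustration of the distinction drawn above between per-locus depth (an integer), mean depth across loci (possibly fractional), and a depth range containing a defined percentage of loci, a minimal Python sketch follows. The function name `depth_summary`, the locus labels, and the depth values are assumptions for this sketch; the range is taken by trimming the most extreme loci from both ends of the sorted depths, which is only one of several reasonable conventions.

```python
from statistics import mean


def depth_summary(depth_by_locus, coverage_pct=90):
    """Return (mean depth, (low, high)) where at least coverage_pct percent
    of loci have a depth between low and high, inclusive."""
    depths = sorted(depth_by_locus.values())
    if not depths:
        raise ValueError("at least one locus is required")
    n = len(depths)
    keep = (coverage_pct * n + 99) // 100  # integer ceiling of pct * n / 100
    drop = n - keep                        # loci allowed outside the range
    low_index = drop // 2
    high_index = n - 1 - (drop - low_index)
    return mean(depths), (depths[low_index], depths[high_index])


# Hypothetical unique-fragment depths at ten loci of a targeted panel.
depth_by_locus = {
    "locus_1": 100, "locus_2": 120, "locus_3": 80, "locus_4": 95,
    "locus_5": 110, "locus_6": 105, "locus_7": 90, "locus_8": 300,
    "locus_9": 10, "locus_10": 100,
}
mean_depth, depth_range = depth_summary(depth_by_locus, coverage_pct=90)
```

  • In this hypothetical data set the mean depth is 111×, while nine of the ten loci (90%) fall between 10× and 120×, which shows how the actual depth at a particular locus can differ from the recited mean.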
  • As used herein, the term “sequencing breadth” refers to what fraction of a particular reference exome (e.g., human reference exome), a particular reference genome (e.g., human reference genome), or part of the exome or genome has been analyzed. Sequencing breadth can be expressed as a fraction, a decimal, or a percentage, and is generally calculated as (the number of loci analyzed/the total number of loci in a reference exome or reference genome). The denominator of the fraction can be a repeat-masked genome, and thus 100% can correspond to all of the reference genome minus the masked parts. A repeat-masked exome or genome can refer to an exome or genome in which sequence repeats are masked (e.g., sequence reads align to unmasked portions of the exome or genome). In some embodiments, any part of an exome or genome can be masked and, thus, sequencing breadth can be evaluated for any desired portion of a reference exome or genome. In some embodiments, “broad sequencing” refers to sequencing/analysis of at least 0.1% of an exome or genome.
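  • The breadth calculation described above, including the optional repeat-masked denominator, can be sketched as follows. The function name and the integer locus identifiers are hypothetical and serve only to illustrate the arithmetic.

```python
def sequencing_breadth(analyzed_loci, reference_loci, masked_loci=frozenset()):
    """Fraction of unmasked reference loci that were analyzed.

    With a masked reference, 100% corresponds to all reference loci
    minus the masked loci.
    """
    denominator = set(reference_loci) - set(masked_loci)
    if not denominator:
        raise ValueError("no unmasked reference loci to evaluate")
    covered = set(analyzed_loci) & denominator
    return len(covered) / len(denominator)


# Hypothetical reference of 1000 loci, 200 of which are repeat-masked.
reference = range(1000)
masked = range(200)
analyzed = range(150, 600)
breadth = sequencing_breadth(analyzed, reference, masked)
```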
  • As used herein, the term “sequencing probe” refers to a molecule that binds to a nucleic acid with affinity that is based on the expected nucleotide sequence of the RNA or DNA present at that locus.
  • As used herein, the term “targeted panel” or “targeted gene panel” refers to a combination of probes for sequencing (e.g., by next-generation sequencing) nucleic acids present in a biological sample from a subject (e.g., a tumor sample, liquid biopsy sample, germline tissue sample, white blood cell sample, or tumor or tissue organoid sample), selected to map to one or more loci of interest on one or more chromosomes. An example set of loci/genes useful for precision oncology, e.g., via solid or liquid biopsy assay, that can be analyzed using a targeted panel is described in Table 1. In some embodiments, in addition to loci that are informative for precision oncology, a targeted panel includes one or more probes for sequencing one or more of a locus associated with a different medical condition, a locus used for internal control purposes, or a locus from a pathogenic organism (e.g., an oncogenic pathogen).
  • As used herein, the term, “reference exome” refers to any sequenced or otherwise characterized exome, whether partial or complete, of any tissue from any organism or pathogen that may be used to reference identified sequences from a subject. Typically, a reference exome will be derived from a subject of the same species as the subject whose sequences are being evaluated. Example reference exomes used for human subjects as well as many other organisms are provided in the on-line genome browser hosted by the National Center for Biotechnology Information (“NCBI”). An “exome” refers to the complete transcriptional profile of an organism or pathogen, expressed in nucleic acid sequences. As used herein, a reference sequence or reference exome often is an assembled or partially assembled exomic sequence from an individual or multiple individuals. In some embodiments, a reference exome is an assembled or partially assembled exomic sequence from one or more human individuals. The reference exome can be viewed as a representative example of a species' set of expressed genes. In some embodiments, a reference exome comprises sequences assigned to chromosomes.
  • As used herein, the term “reference genome” refers to any sequenced or otherwise characterized genome, whether partial or complete, of any organism or pathogen that may be used to reference identified sequences from a subject. Typically, a reference genome will be derived from a subject of the same species as the subject whose sequences are being evaluated. Exemplary reference genomes used for human subjects as well as many other organisms are provided in the on-line genome browser hosted by the National Center for Biotechnology Information (“NCBI”) or the University of California, Santa Cruz (UCSC). A “genome” refers to the complete genetic information of an organism or pathogen, expressed in nucleic acid sequences. As used herein, a reference sequence or reference genome often is an assembled or partially assembled genomic sequence from an individual or multiple individuals. In some embodiments, a reference genome is an assembled or partially assembled genomic sequence from one or more human individuals. The reference genome can be viewed as a representative example of a species' set of genes. In some embodiments, a reference genome comprises sequences assigned to chromosomes. Exemplary human reference genomes include but are not limited to NCBI build 34 (UCSC equivalent: hg16), NCBI build 35 (UCSC equivalent: hg17), NCBI build 36.1 (UCSC equivalent: hg18), GRCh37 (UCSC equivalent: hg19), and GRCh38 (UCSC equivalent: hg38). For a haploid genome, there can be only one nucleotide at each locus. For a diploid genome, heterozygous loci can be identified; each heterozygous locus can have two alleles, where either allele can allow a match for alignment to the locus.
  • As used herein, the term “bioinformatics pipeline” refers to a series of processing stages used to determine characteristics of a subject's genome or exome based on sequencing data of the subject's genome or exome. A bioinformatics pipeline may be used to determine characteristics of a germline genome or exome of a subject and/or a cancer genome or exome of a subject. In some embodiments, the pipeline extracts information related to genomic alterations in the cancer genome of a subject, which is useful for guiding clinical decisions for precision oncology, from sequencing results of a biological sample, e.g., a tumor sample, liquid biopsy sample, reference normal sample, etc., from the subject. Certain processing stages in a bioinformatics pipeline may be ‘connected,’ meaning that the results of a first respective processing stage are informative and/or essential for execution of a second, downstream processing stage. For instance, in some embodiments, a bioinformatics pipeline includes a first respective processing stage for identifying genomic alterations that are unique to the cancer genome of a subject and a second respective processing stage that uses the quantity and/or identity of the identified genomic alterations to determine a metric that is informative for precision oncology, e.g., a tumor mutational burden. In some embodiments, the bioinformatics pipeline includes a reporting stage that generates a report of relevant and/or actionable information identified by upstream stages of the pipeline, which may or may not further include recommendations for aiding clinical therapy decisions.
  • As used herein, the term “limit of detection” or “LOD” refers to the minimal quantity of a feature that can be identified with a particular level of confidence. Accordingly, limit of detection can be used to describe an amount of a substance that must be present in order for a particular assay to reliably detect the substance. A limit of detection can also be used to describe a level of support needed for an algorithm to reliably identify a genomic alteration based on sequencing data, for example, a minimal number of unique sequence reads needed to support identification of a sequence variant such as a SNV.
  • As used herein, the term “BAM File” or “Binary file containing Alignment Maps” refers to a file storing sequencing data aligned to a reference sequence (e.g., a reference genome or exome). In some embodiments, a BAM file is a compressed binary version of a SAM (Sequence Alignment Map) file that includes, for each of a plurality of unique sequence reads, an identifier for the sequence read, information about the nucleotide sequence, information about the alignment of the sequence to a reference sequence, and optionally metrics relating to the quality of the sequence read and/or the quality of the sequence alignment. While BAM files generally relate to files having a particular format, for simplicity they are used herein to simply refer to a file, of any format, containing information about a sequence alignment, unless specifically stated otherwise.
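  • For orientation, the per-read information listed above (read identifier, nucleotide sequence, alignment, and quality metrics) corresponds to the eleven mandatory tab-separated fields of a SAM record, of which a BAM file is the compressed binary form. The following minimal Python sketch parses one such text record; the read name, position, and sequence in the example line are invented for illustration, and the sketch does not read binary BAM data or optional tags.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # identifier for the sequence read
    flag: int    # bitwise alignment flags
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost alignment position
    mapq: int    # mapping (alignment) quality
    cigar: str   # alignment description
    rnext: str   # reference name of the mate/next read
    pnext: int   # position of the mate/next read
    tlen: int    # observed template length
    seq: str     # nucleotide sequence of the read
    qual: str    # per-base quality string


def parse_sam_line(line):
    """Parse the mandatory fields of one tab-separated SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has at least 11 fields")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10],
    )


# Hypothetical alignment line (values invented for this sketch).
example_line = "read1\t0\tchr7\t55191822\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII"
record = parse_sam_line(example_line)
```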
  • As used herein, the term “measure of central tendency” refers to a central or representative value for a distribution of values. Non-limiting examples of measures of central tendency include an arithmetic mean, weighted mean, midrange, midhinge, trimean, geometric mean, geometric median, Winsorized mean, median, and mode of the distribution of values.
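  • Several of the measures listed above can be computed with the Python standard library, as in the following minimal sketch. The helper name `central_tendencies` and the input values are assumptions for illustration; only a subset of the listed measures is shown, and `statistics.geometric_mean` requires Python 3.8 or later.

```python
import statistics


def central_tendencies(values):
    """Compute a few measures of central tendency for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "geometric_mean": statistics.geometric_mean(values),
        "midrange": (min(values) + max(values)) / 2,
    }


# Hypothetical distribution of values.
summary = central_tendencies([2, 4, 4, 8])
```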
  • As used herein, the term “Positive Predictive Value” or “PPV” means the likelihood that a variant is properly called given that a variant has been called by an assay. PPV can be expressed as (number of true positives)/(number of false positives+number of true positives).
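  • The PPV expression above can be written directly in Python. The function name and the counts in the example are hypothetical.

```python
def positive_predictive_value(true_positives, false_positives):
    """PPV = TP / (FP + TP): the fraction of called variants that are real."""
    called = true_positives + false_positives
    if called == 0:
        raise ValueError("PPV is undefined when no variants were called")
    return true_positives / called


# Hypothetical assay result: 90 correct calls and 10 false calls.
ppv = positive_predictive_value(90, 10)
```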
  • As used herein, the term “assay” refers to a technique for determining a property of a substance, e.g., a nucleic acid, a protein, a cell, a tissue, or an organ. An assay (e.g., a first assay or a second assay) can comprise a technique for determining the copy number variation of nucleic acids in a sample, the methylation status of nucleic acids in a sample, the fragment size distribution of nucleic acids in a sample, the mutational status of nucleic acids in a sample, or the fragmentation pattern of nucleic acids in a sample. Any assay known to a person having ordinary skill in the art can be used to detect any of the properties of nucleic acids mentioned herein. Properties of nucleic acids can include a sequence, genomic identity, copy number, methylation state at one or more nucleotide positions, size of the nucleic acid, presence or absence of a mutation in the nucleic acid at one or more nucleotide positions, and pattern of fragmentation of a nucleic acid (e.g., the nucleotide position(s) at which a nucleic acid fragments). An assay or method can have a particular sensitivity and/or specificity, and their relative usefulness as a diagnostic tool can be measured using ROC-AUC statistics.
  • As used herein, the term “classification” can refer to any number(s) or other character(s) that are associated with a particular property of a sample. For example, in some embodiments, the term “classification” can refer to a type of cancer in a subject, a stage of cancer in a subject, a prognosis for a cancer in a subject, a tumor load, a presence of tumor metastasis in a subject, and the like. The classification can be binary (e.g., positive or negative) or have more levels of classification (e.g., a scale from 1 to 10 or 0 to 1). The terms “cutoff” and “threshold” can refer to predetermined numbers used in an operation. For example, a cutoff size can refer to a size above which fragments are excluded. A threshold value can be a value above or below which a particular classification applies. Either of these terms can be used in either of these contexts.
  • As used herein, the term “sensitivity” or “true positive rate” (TPR) refers to the number of true positives divided by the sum of the number of true positives and false negatives. Sensitivity can characterize the ability of an assay or method to correctly identify a proportion of the population that truly has a condition. For example, sensitivity can characterize the ability of a method to correctly identify the number of subjects within a population having cancer. In another example, sensitivity can characterize the ability of a method to correctly identify the one or more markers indicative of cancer.
  • As used herein, the term “specificity” or “true negative rate” (TNR) refers to the number of true negatives divided by the sum of the number of true negatives and false positives. Specificity can characterize the ability of an assay or method to correctly identify a proportion of the population that truly does not have a condition. For example, specificity can characterize the ability of a method to correctly identify the number of subjects within a population not having cancer. In another example, specificity characterizes the ability of a method to correctly identify one or more markers indicative of cancer.
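  • The sensitivity and specificity definitions above can be illustrated by tallying a confusion matrix from paired truth and call labels, as in the following minimal Python sketch. The function names and the ten example labels are assumptions made for this illustration.

```python
def confusion_counts(truth, called):
    """Return (TP, FN, TN, FP) from paired boolean truth and call labels."""
    tp = fn = tn = fp = 0
    for is_real, is_called in zip(truth, called):
        if is_real and is_called:
            tp += 1
        elif is_real:
            fn += 1
        elif is_called:
            fp += 1
        else:
            tn += 1
    return tp, fn, tn, fp


def sensitivity(tp, fn):
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)


# Hypothetical evaluation of ten candidate sites (True = variant present/called).
truth = [True, True, True, True, False, False, False, False, False, False]
called = [True, True, True, False, True, False, False, False, False, False]
tp, fn, tn, fp = confusion_counts(truth, called)
tpr = sensitivity(tp, fn)
tnr = specificity(tn, fp)
```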
  • As used herein, an “actionable genomic alteration” or “actionable variant” refers to a genomic alteration (e.g., a SNV, MNV, indel, rearrangement, copy number variation, or ploidy variation), or value of another cancer metric derived from nucleic acid sequencing data (e.g., a tumor mutational burden, MSI status, or tumor fraction), that is known or believed to be associated with a therapeutic course of action that is more likely to produce a positive effect in a cancer patient that has the actionable variant than in a similarly situated cancer patient that does not have the actionable variant. For instance, administration of EGFR inhibitors (e.g., afatinib, erlotinib, gefitinib) is more effective for treating non-small cell lung cancer in patients with an EGFR mutation in exons 19/21 than for treating non-small cell lung cancer in patients that do not have an EGFR mutation in exons 19/21. Accordingly, an EGFR mutation in exon 19/21 is an actionable variant. In some instances, an actionable variant is only associated with an improved treatment outcome in one or a group of specific cancer types. In other instances, an actionable variant is associated with an improved treatment outcome in substantially all cancer types.
  • As used herein, a “variant of uncertain significance” or “VUS” refers to a genomic alteration (e.g., a SNV, MNV, indel, rearrangement, copy number variation, or ploidy variation), or value of another cancer metric derived from nucleic acid sequencing data (e.g., a tumor mutational burden, MSI status, or tumor fraction), whose impact on disease development/progression is unknown.
  • As used herein, a “benign variant” or “likely benign variant” refers to a genomic alteration (e.g., a SNV, MNV, indel, rearrangement, copy number variation, or ploidy variation), or value of another cancer metric derived from nucleic acid sequencing data (e.g., a tumor mutational burden, MSI status, or tumor fraction), that is known or believed to not contribute to disease development/progression.
  • As used herein, a “pathogenic variant” or “likely pathogenic variant” refers to a genomic alteration (e.g., a SNV, MNV, indel, rearrangement, copy number variation, or ploidy variation), or value of another cancer metric derived from nucleic acid sequencing data (e.g., a tumor mutational burden, MSI status, or tumor fraction), that is known or believed to contribute to disease development/progression.
  • As used herein, an “effective amount” or “therapeutically effective amount” is an amount sufficient to effect a beneficial or desired clinical result upon treatment. An effective amount can be administered to a subject in one or more doses. In terms of treatment, an effective amount is an amount that is sufficient to palliate, ameliorate, stabilize, reverse or slow the progression of the disease, or otherwise reduce the pathological consequences of the disease. The effective amount is generally determined by the physician on a case-by-case basis and is within the skill of one in the art. Several factors are typically taken into account when determining an appropriate dosage to achieve an effective amount. These factors include age, sex and weight of the subject, the condition being treated, the severity of the condition and the form and effective concentration of the therapeutic agent being administered.
  • The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the description of the invention and the appended claims, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Furthermore, to the extent that the terms “including,” “includes,” “having,” “has,” “with,” or variants thereof are used in either the detailed description and/or the claims, such terms are intended to be inclusive in a manner similar to the term “comprising.”
  • As used herein, the term “if” may be construed to mean “when” or “upon” or “in response to determining” or “in response to detecting,” depending on the context. Similarly, the phrase “if it is determined” or “if [a stated condition or event] is detected” may be construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event],” depending on the context.
  • It will also be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first subject could be termed a second subject, and, similarly, a second subject could be termed a first subject, without departing from the scope of the present disclosure. The first subject and the second subject are both subjects, but they are not the same subject. Furthermore, the terms “subject,” “user,” and “patient” are used interchangeably herein.
  • Reference will now be made in detail to embodiments, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, including example systems, methods, techniques, instruction sequences, and computing machine program products that embody illustrative implementations. However, the illustrative discussions below are not intended to be exhaustive or to limit the implementations to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The features described herein are not limited by the illustrated ordering of acts or events, as some acts can occur in different orders and/or concurrently with other acts or events.
  • The implementations provided herein are chosen and described in order to best explain the principles and their practical applications, to thereby enable others skilled in the art to best utilize the various embodiments with various modifications as are suited to the particular use contemplated. In some instances, well-known methods, procedures, components, circuits, and networks have not been described in detail so as not to unnecessarily obscure aspects of the embodiments. In other instances, it will be apparent to one of ordinary skill in the art that the present disclosure may be practiced without one or more of the specific details.
  • It will be appreciated that, in the development of any such actual implementation, numerous implementation-specific decisions are made in order to achieve the designer's specific goals, such as compliance with use case- and business-related constraints, and that these specific goals will vary from one implementation to another and from one designer to another. Moreover, it will be appreciated that though such a design effort might be complex and time-consuming, it will nevertheless be a routine undertaking of engineering for those of ordinary skill in the art having the benefit of the present disclosure.
  • Example System Embodiments
  • Now that an overview of some aspects of the present disclosure and some definitions used in the present disclosure have been provided, details of an exemplary system for providing clinical support for personalized cancer therapy using a liquid biopsy assay are now described in conjunction with FIGS. 1A-1D. FIGS. 1A-1D collectively illustrate the topology of an example system for providing clinical support for personalized cancer therapy using a liquid biopsy assay, in accordance with some embodiments of the present disclosure. Advantageously, the example system illustrated in FIGS. 1A-1D improves upon conventional methods for providing clinical support for personalized cancer therapy by validating a somatic sequence variant in a test subject having a cancer condition.
  • FIG. 1A is a block diagram illustrating a system in accordance with some implementations. The device 100 in some implementations includes one or more processing units CPU(s) 102 (also referred to as processors), one or more network interfaces 104, a user interface 106, e.g., including a display 108 and/or an input 110 (e.g., a mouse, touchpad, keyboard, etc.), a non-persistent memory 111, a persistent memory 112, and one or more communication buses 114 for interconnecting these components. The one or more communication buses 114 optionally include circuitry (sometimes called a chipset) that interconnects and controls communications between system components. The non-persistent memory 111 typically includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, ROM, EEPROM, flash memory, whereas the persistent memory 112 typically includes CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, magnetic disk storage devices, optical disk storage devices, flash memory devices, or other non-volatile solid state storage devices. The persistent memory 112 optionally includes one or more storage devices remotely located from the CPU(s) 102. The persistent memory 112, and the non-volatile memory device(s) within the non-persistent memory 111, comprise non-transitory computer readable storage medium. In some implementations, the non-persistent memory 111 or alternatively the non-transitory computer readable storage medium stores the following programs, modules and data structures, or a subset thereof, sometimes in conjunction with the persistent memory 112:
      • an operating system 116, which includes procedures for handling various basic system services and for performing hardware dependent tasks;
      • a network communication module (or instructions) 118 for connecting the system 100 with other devices and/or a communication network 105;
      • a test patient data store 120 for storing one or more collections of features from patients (e.g., subjects);
      • a bioinformatics module 140 for processing sequencing data and extracting features from sequencing data, e.g., from liquid biopsy sequencing assays;
      • a feature analysis module 160 for evaluating patient features, e.g., genomic alterations, compound genomic features, and clinical features; and
      • a reporting module 180 for generating and transmitting reports that provide clinical support for personalized cancer therapy.
  • Although FIGS. 1A-1D depict a “system 100,” the figures are intended more as a functional description of the various features that may be present in computer systems than as a structural schematic of the implementations described herein. In practice, and as recognized by those of ordinary skill in the art, items shown separately could be combined and some items could be separated. Moreover, although FIG. 1A depicts certain data and modules in non-persistent memory 111, some or all of these data and modules may be in persistent memory 112. For example, in various implementations, one or more of the above identified elements are stored in one or more of the previously mentioned memory devices and correspond to a set of instructions for performing a function described above. The above identified modules, data, or programs (e.g., sets of instructions) need not be implemented as separate software programs, procedures, datasets, or modules, and thus various subsets of these modules and data may be combined or otherwise re-arranged in various implementations.
  • In some implementations, the non-persistent memory 111 optionally stores a subset of the modules and data structures identified above. Furthermore, in some embodiments, the memory stores additional modules and data structures not described above. In some embodiments, one or more of the above-identified elements is stored in a computer system, other than that of system 100, that is addressable by system 100 so that system 100 may retrieve all or a portion of such data when needed.
  • For purposes of illustration in FIG. 1A, system 100 is represented as a single computer that includes all of the functionality for providing clinical support for personalized cancer therapy. However, while a single machine is illustrated, the term “system” shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.
  • For example, in some embodiments, system 100 includes one or more computers. In some embodiments, the functionality for providing clinical support for personalized cancer therapy is spread across any number of networked computers and/or resides on each of several networked computers and/or is hosted on one or more virtual machines at a remote location accessible across the communications network 105. For example, different portions of the various modules and data stores illustrated in FIGS. 1A-1D can be stored and/or executed on the various instances of a processing device and/or processing server/database in the distributed diagnostic environment 210 illustrated in FIG. 2B (e.g., processing devices 224, 234, 244, and 254, processing server 262, and database 264).
  • The system may operate in the capacity of a server or a client machine in a client-server network environment, as a peer machine in a peer-to-peer (or distributed) network environment, or as a server or a client machine in a cloud computing infrastructure or environment. The system may be a personal computer (PC), a tablet PC, a set-top box (STB), a Personal Digital Assistant (PDA), a cellular telephone, a web appliance, a server, a network router, a switch or bridge, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine.
  • In another implementation, the system comprises a virtual machine that includes a module for executing instructions for performing any one or more of the methodologies disclosed herein. In computing, a virtual machine (VM) is an emulation of a computer system that is based on computer architectures and provides functionality of a physical computer. Some such implementations may involve specialized hardware, software, or a combination of hardware and software.
  • One of skill in the art will appreciate that any of a wide array of different computer topologies may be used for the application, and all such topologies are within the scope of the present disclosure.
  • Test Patient Data Store (120)
  • Referring to FIG. 1B, in some embodiments, the system (e.g., system 100) includes a patient data store 120 that stores data for patients 121-1 to 121-M (e.g., cancer patients or patients being tested for cancer) including one or more sequencing data 122, feature data 125, and clinical assessments 139. These data are used and/or generated by the various processes stored in the bioinformatics module 140 and feature analysis module 160 of system 100, to ultimately generate a report providing clinical support for personalized cancer therapy of a patient. While the feature scope of patient data 121 across all patients may be informationally dense, an individual patient's feature set may be sparsely populated across the entirety of the collective feature scope of all features across all patients. That is to say, the data stored for one patient may include a different set of features than the data stored for another patient. Further, while illustrated as a single data construct in FIG. 1B, different sets of patient data may be stored in different databases or modules spread across one or more system memories.
  • In some embodiments, sequencing data 122 from one or more sequencing reactions 122-i, including a plurality of sequence reads 123-i-1 to 123-i-K, is stored in the test patient data store 120. The data store may include different sets of sequencing data from a single subject, corresponding to different samples from the patient, e.g., a tumor sample, liquid biopsy sample, tumor organoid derived from a patient tumor, and/or a normal sample, and/or to samples acquired at different times, e.g., while monitoring the progression, regression, remission, and/or recurrence of a cancer in a subject. The sequence reads may be in any suitable file format, e.g., BCL, FASTA, FASTQ, etc. In some embodiments, sequencing data 122 is accessed by a sequencing data processing module 141, which performs various pre-processing, genome alignment, and demultiplexing operations, as described in detail below with reference to bioinformatics module 140. In some embodiments, sequence data that has been aligned to a reference construct, e.g., BAM file 124, is stored in test patient data store 120.
  • In some embodiments, the test patient data store 120 includes feature data 125, e.g., that is useful for identifying clinical support for personalized cancer therapy. In some embodiments, the feature data 125 includes personal characteristics 126 of the patient, such as patient name, date of birth, gender, ethnicity, physical address, smoking status, alcohol consumption characteristic, anthropometric data, etc.
  • In some embodiments, the feature data 125 includes medical history data 127 for the patient, such as cancer diagnosis information (e.g., date of initial diagnosis, date of metastatic diagnosis, cancer staging, tumor characterization, tissue of origin, previous treatments and outcomes, adverse effects of therapy, therapy group history, clinical trial history, previous and current medications, surgical history, etc.), previous or current symptoms, previous or current therapies, previous treatment outcomes, previous disease diagnoses, diabetes status, diagnoses of depression, diagnoses of other physical or mental maladies, and family medical history. In some embodiments, the feature data 125 includes clinical features 128, such as pathology data 128-1, medical imaging data 128-2, and tissue culture and/or tissue organoid culture data 128-3.
  • In some embodiments, yet other clinical features, such as previous laboratory testing results, are stored in the test patient data store 120. Medical history data 127 and clinical features may be collected from various sources, including at intake directly from the patient, from an electronic medical record (EMR) or electronic health record (EHR) for the patient, or curated from other sources, such as fields from various testing records (e.g., genetic sequencing reports).
  • In some embodiments, the feature data 125 includes genomic features 131 for the patient. Non-limiting examples of genomic features include allelic states 132 (e.g., the identity of alleles at one or more loci, support for wild type or variant alleles at one or more loci, support for SNVs/MNVs at one or more loci, support for indels at one or more loci, and/or support for gene rearrangements at one or more loci), allelic fractions 133 (e.g., ratios of variant to reference alleles (or vice versa)), methylation states 134 (e.g., a distribution of methylation patterns at one or more loci and/or support for aberrant methylation patterns at one or more loci), genomic copy numbers 135 (e.g., a copy number value at one or more loci and/or support for an aberrant (increased or decreased) copy number at one or more loci), tumor mutational burden 136 (e.g., a measure of the number of mutations in the cancer genome of the subject), and microsatellite instability status 137 (e.g., a measure of the repeated unit length at one or more microsatellite loci and/or a classification of the MSI status for the patient's cancer). In some embodiments, one or more of the genomic features 131 are determined by a nucleic acid bioinformatics pipeline, e.g., as described in detail below with reference to FIGS. 4A-4F. In particular, in some embodiments, the feature data 125 include variant allele fractions 133, as determined using the improved methods for validating somatic sequence variants and described in further detail below with reference to FIGS. 1C, 1D, and 4F. In some embodiments, one or more of the genomic features 131 are obtained from an external testing source, e.g., not connected to the bioinformatics pipeline as described below.
  • In some embodiments, the feature data 125 further includes data 138 from other -omics fields of study. Non-limiting examples of -omics fields of study that may yield feature data useful for providing clinical support for personalized cancer therapy include transcriptomics, epigenomics, proteomics, metabolomics, metabonomics, microbiomics, lipidomics, glycomics, cellomics, and organoidomics.
  • In some embodiments, yet other features may include features derived from machine learning approaches, e.g., based at least in part on evaluation of any relevant molecular or clinical features, considered alone or in combination, not limited to those listed above. For instance, in some embodiments, one or more latent features learned from evaluation of cancer patient training datasets improve the diagnostic and prognostic power of the various analysis algorithms in the feature analysis module 160.
  • The skilled artisan will know of other types of features useful for providing clinical support for personalized cancer therapy. The listing of features above is merely representative and should not be construed to be limiting.
  • In some embodiments, a test patient data store 120 includes clinical assessment data 139 for patients, e.g., based on the feature data 125 collected for the subject. In some embodiments, the clinical assessment data 139 includes a catalogue of actionable variants and characteristics 139-1 (e.g., genomic alterations and compound metrics based on genomic features known or believed to be targetable by one or more specific cancer therapies), matched therapies 139-2 (e.g., the therapies known or believed to be particularly beneficial for treatment of subjects having actionable variants), and/or clinical reports 139-3 generated for the subject, e.g., based on identified actionable variants and characteristics 139-1 and/or matched therapies 139-2.
  • In some embodiments, clinical assessment data 139 is generated by analysis of feature data 125 using the various algorithms of feature analysis module 160, as described in further detail below. In some embodiments, clinical assessment data 139 is generated, modified, and/or validated by evaluation of feature data 125 by a clinician, e.g., an oncologist. For instance, in some embodiments, a clinician (e.g., at clinical environment 220) uses feature analysis module 160, or accesses test patient data store 120 directly, to evaluate feature data 125 to make recommendations for personalized cancer treatment of a patient. Similarly, in some embodiments, a clinician (e.g., at clinical environment 220) reviews recommendations determined using feature analysis module 160 and approves, rejects, or modifies the recommendations, e.g., prior to the recommendations being sent to a medical professional treating the cancer patient.
  • Bioinformatics Module (140)
  • Referring again to FIG. 1A, the system (e.g., system 100) includes a bioinformatics module 140 that includes a feature extraction module 145 and optional ancillary data processing constructs, such as a sequence data processing module 141 and/or one or more reference sequence constructs 158 (e.g., a reference genome, exome, or targeted-panel construct that includes reference sequences for a plurality of loci targeted by a sequencing panel).
  • In some embodiments, bioinformatics module 140 includes a sequence data processing module 141 that includes instructions for processing sequence reads, e.g., raw sequence reads 123 from one or more sequencing reactions 122, prior to analysis by the various feature extraction algorithms, as described in detail below. In some embodiments, sequence data processing module 141 includes one or more pre-processing algorithms 142 that prepare the data for analysis. In some embodiments, the pre-processing algorithms 142 include instructions for converting the file format of the sequence reads from the output of the sequencer (e.g., a BCL file format) into a file format compatible with downstream analysis of the sequences (e.g., a FASTQ or FASTA file format). In some embodiments, the pre-processing algorithms 142 include instructions for evaluating the quality of the sequence reads (e.g., by interrogating quality metrics like Phred score, base-calling error probabilities, Quality (Q) scores, and the like) and/or removing sequence reads that do not satisfy a threshold quality (e.g., an inferred base call accuracy of at least 80%, at least 90%, at least 95%, at least 99%, at least 99.5%, at least 99.9%, or higher). In some embodiments, the pre-processing algorithms 142 include instructions for filtering the sequence reads for one or more properties, e.g., removing sequences failing to satisfy a lower or upper size threshold or removing duplicate sequence reads.
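The quality filter described above can be sketched as follows. This is a minimal illustration, not the pre-processing algorithms 142 themselves: it assumes FASTQ-style quality strings with the Phred+33 offset, and the function names, the example reads, and the choice of a mean-error criterion are all hypothetical. It relies only on the standard relationship between a Phred score Q and the base-call error probability, P = 10^(-Q/10).

```python
def phred_to_accuracy(q: float) -> float:
    """Convert a Phred score Q to inferred base-call accuracy, 1 - 10^(-Q/10)."""
    return 1.0 - 10.0 ** (-q / 10.0)


def decode_phred33(quality_string: str) -> list[int]:
    """Decode a FASTQ quality string that uses the Phred+33 ASCII offset."""
    return [ord(ch) - 33 for ch in quality_string]


def passes_quality(quality_string: str, min_accuracy: float = 0.99) -> bool:
    """Keep a read when its mean inferred base-call accuracy meets the threshold.

    Averaging error probabilities over the read is an assumption made for this
    sketch; other criteria (e.g., per-base minimums) are equally plausible.
    """
    scores = decode_phred33(quality_string)
    if not scores:
        return False
    mean_error = sum(10.0 ** (-q / 10.0) for q in scores) / len(scores)
    return (1.0 - mean_error) >= min_accuracy


# Hypothetical reads: (name, sequence, quality string)
reads = [
    ("read_1", "ACGT", "IIII"),  # 'I' encodes Q40 (accuracy 99.99%)
    ("read_2", "ACGT", "++++"),  # '+' encodes Q10 (accuracy 90%)
]
kept = [name for name, _seq, qual in reads if passes_quality(qual)]
```

With the 99% threshold used here, only the Q40 read is retained; lowering `min_accuracy` to 0.90 would retain both.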
  • In some embodiments, sequence data processing module 141 includes one or more alignment algorithms 143, for aligning pre-processed sequence reads 123 to a reference sequence construct 158, e.g., a reference genome, exome, or targeted-panel construct. Many algorithms for aligning sequencing data to a reference construct are known in the art, for example, BWA, Blat, SHRiMP, LastZ, and MAQ. One example of a sequence read alignment package is the Burrows-Wheeler Alignment tool (BWA), which uses a Burrows-Wheeler Transform (BWT) to align short sequence reads against a large reference construct, allowing for mismatches and gaps. Li and Durbin, Bioinformatics, 25(14):1754-60 (2009), the content of which is incorporated herein by reference, in its entirety, for all purposes. Sequence read alignment packages import raw or pre-processed sequence reads 122, e.g., in BCL, FASTA, or FASTQ file formats, and output aligned sequence reads 124, e.g., in SAM or BAM file formats.
  • In some embodiments, sequence data processing module 141 includes one or more demultiplexing algorithms 144, for dividing sequence read or sequence alignment files generated from sequencing reactions of pooled nucleic acids into separate sequence read or sequence alignment files, each of which corresponds to a different source of nucleic acids in the nucleic acid sequencing pool. For instance, because of the cost of sequencing reactions, it is common practice to pool nucleic acids from a plurality of samples into a single sequencing reaction. The nucleic acids from each sample are tagged with a sample-specific and/or molecule-specific sequence tag (e.g., a UMI), which is sequenced along with the molecule. In some embodiments, demultiplexing algorithms 144 sort these sequence tags in the sequence read or sequence alignment files to demultiplex the sequencing data into separate files for each of the samples included in the sequencing reaction.
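As a rough illustration of the demultiplexing step, the sketch below groups pooled reads by a sample-specific tag. It assumes, purely for illustration, that the tag is a fixed-length prefix of each read and that tags must match exactly; the tag sequences, sample names, and the `demultiplex` function are hypothetical and are not taken from the demultiplexing algorithms 144.

```python
from collections import defaultdict

# Hypothetical mapping from sample-specific tag to sample name.
SAMPLE_TAGS = {"ACGT": "sample_A", "TTAG": "sample_B"}
TAG_LENGTH = 4


def demultiplex(reads):
    """Group (name, sequence) reads by their leading sample tag.

    The tag is trimmed from each read; reads whose tag is not recognized
    are collected under "undetermined".
    """
    by_sample = defaultdict(list)
    for name, sequence in reads:
        tag, insert = sequence[:TAG_LENGTH], sequence[TAG_LENGTH:]
        sample = SAMPLE_TAGS.get(tag, "undetermined")
        by_sample[sample].append((name, insert))
    return dict(by_sample)


pooled = [
    ("r1", "ACGTGGGCCC"),
    ("r2", "TTAGAAATTT"),
    ("r3", "ACGTTTTAAA"),
    ("r4", "NNNNCCCGGG"),
]
demuxed = demultiplex(pooled)
```

In practice each group would be written to its own sequence read or alignment file; the in-memory dictionary stands in for those files here.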
  • Bioinformatics module 140 includes a feature extraction module 145, which includes instructions for identifying diagnostic features, e.g., genomic features 131, from sequencing data 122 of biological samples from a subject, e.g., one or more of a solid tumor sample, a liquid biopsy sample, or a normal tissue (e.g., control) sample. For instance, in some embodiments, a feature extraction algorithm compares the identity of one or more nucleotides at a locus from the sequencing data 122 to the identity of the nucleotides at that locus in a reference sequence construct (e.g., a reference genome, exome, or targeted-panel construct) to determine whether the subject has a variant at that locus. In some embodiments, a feature extraction algorithm evaluates data other than the raw sequence, to identify a genomic alteration in the subject, e.g., an allelic ratio, a relative copy number, a repeat unit distribution, etc.
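The comparison of observed nucleotides against a reference at a locus can be sketched as a simple base tally. The example assumes ungapped alignments given as 0-based start positions, which is a simplification; the reference string, reads, and function name are hypothetical, and a real pipeline would read alignments from SAM or BAM records.

```python
from collections import Counter


def base_counts_at(aligned_reads, position):
    """Count the bases that aligned reads place at a reference position.

    aligned_reads: iterable of (start, sequence) pairs, where start is the
    0-based reference coordinate of the first base of an ungapped alignment.
    """
    counts = Counter()
    for start, sequence in aligned_reads:
        offset = position - start
        if 0 <= offset < len(sequence):
            counts[sequence[offset]] += 1
    return counts


reference = "ACGTACGT"
aligned = [(0, "ACGTAC"), (2, "GTTCGT"), (3, "TTCG"), (4, "ACGT")]
position = 4

counts = base_counts_at(aligned, position)
# Bases that differ from the reference base at this locus are candidate variants.
non_reference = {base: n for base, n in counts.items() if base != reference[position]}
```

Here two reads support the reference base A and two support T, so T is the only non-reference allele observed at the locus.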
  • For instance, in some embodiments, feature extraction module 145 includes one or more variant identification modules that include instructions for various variant calling processes. In some embodiments, variants in the germline of the subject are identified, e.g., using a germline variant identification module 146. In some embodiments, variants in the cancer genome, e.g., somatic variants, are identified, e.g., using a somatic variant identification module 150. While separate germline and somatic variant identification modules are illustrated in FIG. 1A, in some embodiments they are integrated into a single module. In some embodiments, the variant identification module includes instructions for identifying one or more of nucleotide variants (e.g., single nucleotide variants (SNV) and multi-nucleotide variants (MNV)) using one or more SNV/MNV calling algorithms (e.g., algorithms 147 and/or 151), indels (e.g., insertions or deletions of nucleotides) using one or more indel calling algorithms (e.g., algorithms 148 and/or 152), and genomic rearrangements (e.g., inversions, translocation, and fusions of nucleotide sequences) using one or more genomic rearrangement calling algorithms (e.g., algorithms 149 and/or 153).
  • For example, referring to FIGS. 1C and 1D, in some embodiments, feature extraction module 145 comprises, in the variant identification module 146, a variant thresholding module 146-a, a sequence variant data store 146-r, and a variant validation module 146-o. In some such embodiments, the sequence variant data store 146-r comprises one or more candidate variants for a test subject identified by aligning to a reference sequence a plurality of sequence reads obtained from sequencing a liquid biopsy sample of the test subject, the one or more candidate variants corresponding to a respective one or more loci in the reference sequence. The plurality of sequence reads aligned to the reference sequence is used to identify a variant allele fragment count for each candidate variant. The sequence variant data store 146-r further comprises, in some embodiments, a plurality of variants from a first set of nucleic acids obtained from a cohort of subjects (e.g., from a tumor tissue biopsy for each subject in a baseline cohort of subjects). The variant thresholding module 146-a performs a function for each candidate variant in the one or more candidate variants where, for each corresponding locus 146-b (e.g., 146-b-1, . . . , 146-b-P), a dynamic variant count threshold 146-d (e.g., 146-d-1) is obtained based on a pre-test odds of a positive variant call for the locus, based on the prevalence of variants in the genomic region that includes the locus, using the plurality of variants for the baseline cohort. The variant thresholding module 146-a compares the variant allele fragment count 146-c (e.g., 146-c-1) for the candidate variant against the dynamic variant count threshold 146-d for the locus corresponding to the candidate variant. In some embodiments, the variant validation module 146-o determines whether the candidate variant is validated or rejected as a somatic sequence variant based on the comparison. 
  • For example, when the variant allele fragment count for the candidate variant satisfies the dynamic variant count threshold for the locus, the somatic sequence variant is validated, and when the variant allele fragment count for the candidate variant does not satisfy the dynamic variant count threshold for the locus, the somatic sequence variant is rejected.
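The comparison step can be sketched as follows. This example takes the per-locus dynamic thresholds as already computed, since their derivation from pre-test odds is described elsewhere, and it assumes that "satisfies" means "greater than or equal to". The class, function, locus labels, and numeric values are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateVariant:
    locus: str                   # hypothetical locus label
    variant_fragment_count: int  # fragments supporting the variant allele


def validate_candidates(candidates, thresholds, default_threshold):
    """Compare each candidate's fragment count to its locus-specific threshold.

    thresholds maps locus -> dynamic variant count threshold (assumed to be
    precomputed); loci without an entry fall back to default_threshold.
    Returns a list of (locus, "validated" | "rejected") pairs.
    """
    results = []
    for candidate in candidates:
        threshold = thresholds.get(candidate.locus, default_threshold)
        # Assumption: a count "satisfies" the threshold when it is >= threshold.
        status = "validated" if candidate.variant_fragment_count >= threshold else "rejected"
        results.append((candidate.locus, status))
    return results


candidates = [
    CandidateVariant("locus_1", 5),
    CandidateVariant("locus_2", 2),
    CandidateVariant("locus_3", 4),
]
thresholds = {"locus_1": 3, "locus_2": 4}  # placeholder values
calls = validate_candidates(candidates, thresholds, default_threshold=5)
```

Because the threshold is looked up per locus, the same fragment count can be validated at one locus and rejected at another, which is the point of a dynamic rather than fixed threshold.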
  • In some embodiments, the dynamic variant count threshold is determined based on a distribution of variant detection sensitivities as a function of circulating variant allele fraction from the cohort of subjects (e.g., the baseline cohort). For example, referring to FIG. 1C, in some such embodiments, the variant thresholding module 146-a takes as input one or more variant allele fractions 133 from the genomic features module 131. In some such embodiments, the variant allele fractions 133 comprise a plurality of variant allele fractions obtained from tumor tissue biopsies 133-t (e.g., 133-t-1, 133-t-2 . . . , 133-t-O) for the cohort of subjects. In some embodiments, the variant allele fractions comprise a plurality of variant allele fractions obtained from liquid biopsy samples 133-cf (e.g., 133-cf-1, 133-cf-2 . . . , 133-cf-N) for the cohort of subjects. In some embodiments, the circulating variant allele fraction is obtained by comparing the liquid biopsy variant allele fractions 133-cf to the tumor biopsy variant allele fractions 133-t.
  • Additional embodiments for using variant allele fractions (e.g., variant allele frequencies) to identify somatic variants are detailed below (see, Example Methods: Variant Identification).
  • A SNV/MNV algorithm 147 may identify a substitution of a single nucleotide that occurs at a specific position in the genome. For example, at a specific base position, or locus, in the human genome, the C nucleotide may appear in most individuals, but in a minority of individuals, the position is occupied by an A. This means that there is a SNP at this specific position and the two possible nucleotide variations, C or A, are said to be alleles for this position. SNPs underlie differences in human susceptibility to a wide range of diseases (e.g., sickle-cell anemia, β-thalassemia, and cystic fibrosis result from SNPs). The severity of illness and the way the body responds to treatments are also manifestations of genetic variations. For example, a single-base mutation in the APOE (apolipoprotein E) gene is associated with a lower risk for Alzheimer's disease. A single-nucleotide variant (SNV) is a variation in a single nucleotide without any limitations of frequency and may arise in somatic cells. A somatic single-nucleotide variation (e.g., caused by cancer) may also be called a single-nucleotide alteration. An MNP (multiple-nucleotide polymorphism) module may identify the substitution of consecutive nucleotides at a specific position in the genome.
  • An indel calling algorithm 148 may identify an insertion or deletion of bases in the genome of an organism classified among small genetic variations. While indels usually measure from 1 to 10,000 base pairs in length, a microindel is defined as an indel that results in a net change of 1 to 50 nucleotides. Indels can be contrasted with a SNP or point mutation. An indel inserts and/or deletes nucleotides from a sequence, while a point mutation is a form of substitution that replaces one of the nucleotides without changing the overall number in the DNA. Indels, being insertions and/or deletions, can be used as genetic markers in natural populations, especially in phylogenetic studies. Indel frequency tends to be markedly lower than that of single nucleotide polymorphisms (SNP), except near highly repetitive regions, including homopolymers and microsatellites.
  • A genomic rearrangement algorithm 149 may identify hybrid genes formed from two previously separate genes. Such a gene fusion can occur as a result of translocation, interstitial deletion, or chromosomal inversion. Gene fusion can play an important role in tumorigenesis. Fusion genes can contribute to tumor formation because fusion genes can produce much more active abnormal protein than non-fusion genes. Often, fusion genes are oncogenes that cause cancer; these include BCR-ABL, TEL-AML1 (ALL with t(12; 21)), AML1-ETO (M2 AML with t(8; 21)), and TMPRSS2-ERG with an interstitial deletion on chromosome 21, often occurring in prostate cancer. In the case of TMPRSS2-ERG, by disrupting androgen receptor (AR) signaling and inhibiting AR expression by oncogenic ETS transcription factor, the fusion product regulates prostate cancer. Most fusion genes are found in hematological cancers, sarcomas, and prostate cancer. BCAM-AKT2 is a fusion gene that is specific and unique to high-grade serous ovarian cancer. Oncogenic fusion genes may lead to a gene product with a new or different function from the two fusion partners. Alternatively, a proto-oncogene is fused to a strong promoter, and the oncogenic function is thereby activated through upregulation driven by the strong promoter of the upstream fusion partner. The latter is common in lymphomas, where oncogenes are juxtaposed to the promoters of the immunoglobulin genes. Oncogenic fusion transcripts may also be caused by trans-splicing or read-through events. Since chromosomal translocations play such a significant role in neoplasia, a specialized database of chromosomal aberrations and gene fusions in cancer has been created, called the Mitelman Database of Chromosome Aberrations and Gene Fusions in Cancer.
  • In some embodiments, feature extraction module 145 includes instructions for identifying one or more complex genomic alterations (e.g., features that incorporate more than a change in the primary sequence of the genome) in the cancer genome of the subject. For instance, in some embodiments, feature extraction module 145 includes modules for identifying one or more of copy number variation (e.g., copy number variation analysis module 153), microsatellite instability status (e.g., microsatellite instability analysis module 154), tumor mutational burden (e.g., tumor mutational burden analysis module 155), tumor ploidy (e.g., tumor ploidy analysis module 156), and homologous recombination pathway deficiencies (e.g., homologous recombination pathway analysis module 157).
  • Feature Analysis Module (160)
  • Referring again to FIG. 1A, the system (e.g., system 100) includes a feature analysis module 160 that includes one or more genomic alteration interpretation algorithms 161, one or more optional clinical data analysis algorithms 165, an optional therapeutic curation algorithm 165, and an optional recommendation validation module 167. In some embodiments, feature analysis module 160 identifies actionable variants and characteristics 139-1 and corresponding matched therapies 139-2 and/or clinical trials using one or more analysis algorithms (e.g., algorithms 162, 163, 164, and 165) to evaluate feature data 125. The identified actionable variants and characteristics 139-1 and corresponding matched therapies 139-2, which are optionally stored in test patient data store 120, are then curated by feature analysis module 160 to generate a clinical report 139-3, which is optionally validated by a user, e.g., a clinician, before being transmitted to a medical professional, e.g., an oncologist, treating the patient.
  • In some embodiments, the genomic alteration interpretation algorithms 161 include instructions for evaluating the effect that one or more genomic features 131 of the subject, e.g., as identified by feature extraction module 145, have on the characteristics of the patient's cancer and/or whether one or more targeted cancer therapies may improve the clinical outcome for the patient. For example, in some embodiments, one or more genomic variant analysis algorithms 163 evaluate various genomic features 131 by querying a database, e.g., a look-up-table (“LUT”) of actionable genomic alterations, targeted therapies associated with the actionable genomic alterations, and any other conditions that should be met before administering the targeted therapy to a subject having the actionable genomic alteration. For instance, evidence suggests that depatuxizumab mafodotin (an anti-EGFR mAb conjugated to monomethyl auristatin F) has improved efficacy for the treatment of recurrent glioblastomas having EGFR focal amplifications. van den Bent M. et al., Cancer Chemother Pharmacol., 80(6):1209-17 (2017). Accordingly, the actionable genomic alteration LUT would have an entry for the focal amplification of the EGFR gene indicating that depatuxizumab mafodotin is a targeted therapy for glioblastomas (e.g., recurrent glioblastomas) having a focal gene amplification. In some instances, the LUT may also include counter indications for the associated targeted therapy, e.g., adverse drug interactions or personal characteristics that are counter-indicated for administration of the particular targeted therapy.
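A look-up of this kind might be sketched as below. The single table entry mirrors the EGFR focal amplification example given above; the dictionary layout, key format, and `match_therapies` function are hypothetical and are shown only to make the query pattern concrete, not to represent the actual LUT or its contents.

```python
# Hypothetical LUT keyed by (gene, alteration type). The one entry reflects the
# EGFR focal amplification example from the text; it is illustrative only.
ACTIONABLE_LUT = {
    ("EGFR", "focal amplification"): [
        {
            "therapy": "depatuxizumab mafodotin",
            "cancer_types": {"glioblastoma"},
        },
    ],
}


def match_therapies(gene: str, alteration: str, cancer_type: str) -> list[str]:
    """Return therapies whose LUT entry matches the alteration and cancer type."""
    entries = ACTIONABLE_LUT.get((gene, alteration), [])
    return [e["therapy"] for e in entries if cancer_type in e["cancer_types"]]


matched = match_therapies("EGFR", "focal amplification", "glioblastoma")
unmatched = match_therapies("EGFR", "focal amplification", "melanoma")
```

Additional conditions, such as counter indications, could be carried as further fields on each entry and checked in the same list comprehension.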
  • In some embodiments, a genomic alteration interpretation algorithm 161 determines whether a particular genomic feature 131 should be reported to a medical professional treating the cancer patient. In some embodiments, genomic features 131 (e.g., genomic alterations and compound features) are reported when there is clinical evidence that the feature significantly impacts the biology of the cancer, impacts the prognosis for the cancer, and/or impacts pharmacogenomics, e.g., by indicating or counter-indicating particular therapeutic approaches. For instance, a genomic alteration interpretation algorithm 161 may classify a particular CNV feature 135 as “Reportable,” e.g., meaning that the CNV has been identified as influencing the character of the cancer, the overall disease state, and/or pharmacogenomics, as “Not Reportable,” e.g., meaning that the CNV has not been identified as influencing the character of the cancer, the overall disease state, and/or pharmacogenomics, as “No Evidence,” e.g., meaning that no evidence exists supporting that the CNV is “Reportable” or “Not Reportable,” or as “Conflicting Evidence,” e.g., meaning that evidence exists supporting both that the CNV is “Reportable” and that the CNV is “Not Reportable.”
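The four reporting categories can be expressed as a small decision function. The two boolean inputs are a simplification assumed for this sketch (whether any evidence supports "Reportable" and whether any supports "Not Reportable"); how such evidence is gathered and weighed is outside the scope of the example, and the function name is hypothetical.

```python
def classify_reportability(supports_reportable: bool, supports_not_reportable: bool) -> str:
    """Map the presence of supporting evidence to one of the four categories."""
    if supports_reportable and supports_not_reportable:
        return "Conflicting Evidence"
    if supports_reportable:
        return "Reportable"
    if supports_not_reportable:
        return "Not Reportable"
    return "No Evidence"


labels = [
    classify_reportability(True, False),
    classify_reportability(False, True),
    classify_reportability(False, False),
    classify_reportability(True, True),
]
```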
  • In some embodiments, the genomic alteration interpretation algorithms 161 include one or more pathogenic variant analysis algorithms 162, which evaluate various genomic features to identify the presence of an oncogenic pathogen associated with the patient's cancer and/or targeted therapies associated with an oncogenic pathogen infection in the cancer. For instance, RNA expression patterns of some cancers are associated with the presence of an oncogenic pathogen that is helping to drive the cancer. See, for example, U.S. patent application Ser. No. 16/802,126, filed Feb. 26, 2020, the content of which is hereby incorporated by reference, in its entirety, for all purposes. In some instances, the recommended therapy for the cancer is different when the cancer is associated with the oncogenic pathogen infection than when it is not. Accordingly, in some embodiments, e.g., where feature data 125 includes RNA abundance data for the cancer of the patient, one or more pathogenic variant analysis algorithms 162 evaluate the RNA abundance data for the patient's cancer to determine whether a signature exists in the data that indicates the presence of the oncogenic pathogen in the cancer. Similarly, in some embodiments, bioinformatics module 140 includes an algorithm that searches for the presence of pathogenic nucleic acid sequences in sequencing data 122. See, for example, U.S. Provisional Patent Application Ser. No. 62/978,067, filed Feb. 18, 2020, the content of which is hereby incorporated by reference, in its entirety, for all purposes. Accordingly, in some embodiments, one or more pathogenic variant analysis algorithms 162 evaluates whether the presence of an oncogenic pathogen in a subject is associated with an actionable therapy for the infection. 
  • In some embodiments, system 100 queries a database, e.g., a look-up-table (“LUT”), of actionable oncogenic pathogen infections, targeted therapies associated with the actionable infections, and any other conditions that should be met before administering the targeted therapy to a subject that is infected with the oncogenic pathogen. In some instances, the LUT may also include counter indications for the associated targeted therapy, e.g., adverse drug interactions or personal characteristics that are counter-indicated for administration of the particular targeted therapy.
  • In some embodiments, the genomic alteration interpretation algorithms 161 include one or more multi-feature analysis algorithms 164 that evaluate a plurality of features to classify a cancer with respect to the effects of one or more targeted therapies. For instance, in some embodiments, feature analysis module 160 includes one or more classifiers trained against feature data, one or more clinical therapies, and their associated clinical outcomes for a plurality of training subjects to classify cancers based on their predicted clinical outcomes following one or more therapies.
  • In some embodiments, the classifier is implemented as an artificial intelligence engine and may include gradient boosting models, random forest models, neural networks (NN), regression models, Naive Bayes models, and/or machine learning algorithms (MLA). An MLA or a NN may be trained from a training data set that includes one or more features 125, including personal characteristics 126, medical history 127, clinical features 128, genomic features 131, and/or other -omic features 138. MLAs include supervised algorithms (such as algorithms where the features/classifications in the data set are annotated) using linear regression, logistic regression, decision trees, classification and regression trees, naïve Bayes, nearest neighbor clustering; unsupervised algorithms (such as algorithms where no features/classification in the data set are annotated) using Apriori, means clustering, principal component analysis, random forest, adaptive boosting; and semi-supervised algorithms (such as algorithms where an incomplete number of features/classifications in the data set are annotated) using generative approach (such as a mixture of Gaussian distributions, mixture of multinomial distributions, hidden Markov models), low density separation, graph-based approaches (such as mincut, harmonic function, manifold regularization), heuristic approaches, or support vector machines.
  • NNs include conditional random fields, convolutional neural networks, attention based neural networks, deep learning, long short term memory networks, or other neural models where the training data set includes a plurality of tumor samples, RNA expression data for each sample, and pathology reports covering imaging data for each sample.
  • While MLA and neural networks identify distinct approaches to machine learning, the terms may be used interchangeably herein. Thus, a mention of MLA may include a corresponding NN or a mention of NN may include a corresponding MLA unless explicitly stated otherwise. Training may include providing optimized datasets, labeling these traits as they occur in patient records, and training the MLA to predict or classify based on new inputs. Artificial NNs are efficient computing models which have shown their strengths in solving hard problems in artificial intelligence. They have also been shown to be universal approximators, that is, they can represent a wide variety of functions when given appropriate parameters.
  • In some embodiments, system 100 includes a classifier training module that includes instructions for training one or more untrained or partially trained classifiers based on feature data from a training dataset. In some embodiments, system 100 also includes a database of training data for use in training the one or more classifiers. In other embodiments, the classifier training module accesses a remote storage device hosting training data. In some embodiments, the training data includes a set of training features, including but not limited to, various types of the feature data 125 illustrated in FIG. 1B. In some embodiments, the classifier training module uses patient data 121, e.g., when test patient data store 120 also stores a record of treatments administered to the patient and patient outcomes following therapy.
  • In some embodiments, feature analysis module 160 includes one or more clinical data analysis algorithms 165, which evaluate clinical features 128 of a cancer to identify targeted therapies which may benefit the subject. For example, in some embodiments, e.g., where feature data 125 includes pathology data 128-1, one or more clinical data analysis algorithms 165 evaluate the data to determine whether an actionable therapy is indicated based on the histopathology of a tumor biopsy from the subject, e.g., which is indicative of a particular cancer type and/or stage of cancer. In some embodiments, system 100 queries a database, e.g., a look-up-table (“LUT”), of actionable clinical features (e.g., pathology features), targeted therapies associated with the actionable features, and any other conditions that should be met before administering the targeted therapy to a subject associated with the actionable clinical features 128 (e.g., pathology features 128-1). In some embodiments, system 100 evaluates the clinical features 128 (e.g., pathology features 128-1) directly to determine whether the patient's cancer is sensitive to a particular therapeutic agent. Further details on example methods, systems, and algorithms for classifying cancer and identifying targeted therapies based on clinical data, such as pathology data 128-1, imaging data 128-2, and/or tissue culture/organoid data 128-3 are discussed, for example, in U.S. patent application Ser. No. 16/830,186, filed on Mar. 25, 2020, U.S. patent application Ser. No. 16/789,363, filed on Feb. 12, 2020, and U.S. Provisional Application No. 63/007,874, filed on Apr. 9, 2020, the contents of which are hereby incorporated by reference, in their entireties, for all purposes.
  • In some embodiments, feature analysis module 160 includes a clinical trials module that evaluates test patient data 121 to determine whether the patient is eligible for inclusion in a clinical trial for a cancer therapy, e.g., a clinical trial that is currently recruiting patients, a clinical trial that has not yet begun recruiting patients, and/or an ongoing clinical trial that may recruit additional patients in the future. In some embodiments, a clinical trial module evaluates test patient data 121 to determine whether the results of a clinical trial are relevant for the patient, e.g., the results of an ongoing clinical trial and/or the results of a completed clinical trial. For instance, in some embodiments, system 100 queries a database, e.g., a look-up-table (“LUT”) of clinical trials, e.g., active and/or completed clinical trials, and compares patient data 121 with inclusion criteria for the clinical trials, stored in the database, to identify clinical trials with inclusion criteria that closely match and/or exactly match the patient's data 121. In some embodiments, a record of matching clinical trials, e.g., those clinical trials that the patient may be eligible for and/or that may inform personalized treatment decisions for the patient, are stored in clinical assessment database 139.
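  • As a non-limiting illustration of comparing patient data with stored inclusion criteria, the following sketch scores each trial by the fraction of its criteria the patient satisfies. The representation of criteria as a mapping from field name to a set of allowed values, the scoring rule, and the names `match_score` and `matching_trials` are assumptions made for this example only.

```python
def match_score(patient, criteria):
    """Fraction of a trial's inclusion criteria satisfied by the patient record."""
    if not criteria:
        return 1.0
    met = sum(1 for key, allowed in criteria.items() if patient.get(key) in allowed)
    return met / len(criteria)


def matching_trials(patient, trials, min_score=1.0):
    """Return (trial_id, score) pairs at or above min_score, best match first.

    min_score=1.0 keeps only exact matches; a lower value also keeps close matches.
    """
    scored = [(trial_id, match_score(patient, crit)) for trial_id, crit in trials.items()]
    kept = [(trial_id, score) for trial_id, score in scored if score >= min_score]
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical patient record and trial table, for illustration only.
patient = {"cancer_type": "lung", "stage": "IV"}
trials = {
    "trial_1": {"cancer_type": {"lung"}, "stage": {"III", "IV"}},
    "trial_2": {"cancer_type": {"lung"}, "stage": {"I"}},
}
exact = matching_trials(patient, trials)
close = matching_trials(patient, trials, min_score=0.5)
```

Here `exact` retains only the trial whose criteria are all met, while `close` also retains the trial for which half of the criteria are met.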
  • In some embodiments, feature analysis module 160 includes a therapeutic curation algorithm 166 that assembles actionable variants and characteristics 139-1, matched therapies 139-2, and/or relevant clinical trials identified for the patient, as described above. In some embodiments, a therapeutic curation algorithm 166 evaluates certain criteria related to which actionable variants and characteristics 139-1, matched therapies 139-2, and/or relevant clinical trials should be reported and/or whether certain matched therapies, considered alone or in combination, may be counter-indicated for the patient, e.g., based on personal characteristics 126 of the patient and/or known drug-drug interactions. In some embodiments, the therapeutic curation algorithm then generates one or more clinical reports 139-3 for the patient. In some embodiments, the therapeutic curation algorithm generates a first clinical report 139-3-1 that is to be reported to a medical professional treating the patient and a second clinical report 139-3-2 that will not be communicated to the medical professional, but may be used to improve various algorithms within the system.
  • In some embodiments, feature analysis module 160 includes a recommendation validation module 167, that includes an interface allowing a clinician to review, modify, and approve a clinical report 139-3 prior to the report being sent to a medical professional, e.g., an oncologist, treating the patient.
  • In some embodiments, each of the one or more feature collections, sequencing modules, bioinformatics modules (including, e.g., alteration module(s), structural variant calling and data processing modules), classification modules and outcome modules are communicatively coupled to a data bus to transfer data between each module for processing and/or storage. In some alternative embodiments, each of the feature collection, alteration module(s), structural variant and feature store are communicatively coupled to each other for independent communication without sharing the data bus.
  • Further details on systems and exemplary embodiments of modules and feature collections are discussed in PCT Application PCT/US19/69149, titled “A METHOD AND PROCESS FOR PREDICTING AND ANALYZING PATIENT COHORT RESPONSE, PROGRESSION, AND SURVIVAL,” filed Dec. 31, 2019, which is hereby incorporated herein by reference in its entirety.
  • Example Methods
  • Now that details of a system 100 for providing clinical support for personalized cancer therapy, e.g., with improved validation of somatic sequence variants, have been disclosed, details regarding processes and features of the system, in accordance with various embodiments of the present disclosure, are disclosed below. Specifically, example processes are described below with reference to FIGS. 2A, 3, 4A-4F, 5A-5B, 6, and 7. In some embodiments, such processes and features of the system are carried out by modules 118, 120, 140, 160, and/or 170, as illustrated in FIG. 1A. Referring to these methods, the systems described herein (e.g., system 100) include instructions for validating somatic variants that are improved compared to conventional methods for somatic variant detection.
  • FIG. 2B: Distributed Diagnostic and Clinical Environment
  • In some aspects, the methods described herein for providing clinical support for personalized cancer therapy are performed across a distributed diagnostic/clinical environment, e.g., as illustrated in FIG. 2B. However, in some embodiments, the improved methods described herein for validating somatic sequence variants are performed at a single location, e.g., at a single computing system or environment, although ancillary procedures supporting the methods described herein, and/or procedures that make further use of the results of the methods described herein, may be performed across a distributed diagnostic/clinical environment.
  • FIG. 2B illustrates an example of a distributed diagnostic/clinical environment 210. In some embodiments, the distributed diagnostic/clinical environment is connected via communication network 105. In some embodiments, one or more biological samples, e.g., one or more liquid biopsy samples, solid tumor biopsy, normal tissue samples, and/or control samples, are collected from a subject in clinical environment 220, e.g., a doctor's office, hospital, or medical clinic, or at a home health care environment (not depicted). Advantageously, while solid tumor samples should be collected within a clinical setting, liquid biopsy samples can be acquired in a less invasive fashion and are more easily collected outside of a traditional clinical setting. In some embodiments, one or more biological samples, or portions thereof, are processed within the clinical environment 220 where collection occurred, using a processing device 224, e.g., a nucleic acid sequencer for obtaining sequencing data, a microscope for obtaining pathology data, a mass spectrometer for obtaining proteomic data, etc. In some embodiments, one or more biological samples, or portions thereof are sent to one or more external environments, e.g., sequencing lab 230, pathology lab 240, and/or molecular biology lab 250, each of which includes a processing device 234, 244, and 254, respectively, to generate biological data 121 for the subject. Each environment includes a communications device 222, 232, 242, and 252, respectively, for communicating biological data 121 about the subject to a processing server 262 and/or database 264, which may be located in yet another environment, e.g., processing/storage center 260. Thus, in some embodiments, different portions of the systems and methods described herein are fulfilled by different processing devices located in different physical environments.
  • Accordingly, in some embodiments, a method for providing clinical support for personalized cancer therapy, e.g., with improved validation of somatic sequence variants, is performed across one or more environments, as illustrated in FIG. 2B. For instance, in some such embodiments, a liquid biopsy sample is collected at clinical environment 220 or in a home healthcare environment. The sample, or a portion thereof, is sent to sequencing lab 230 where raw sequence reads 123 of nucleic acids in the sample are generated by sequencer 234. The raw sequencing data 123 is communicated, e.g., from communications device 232, to database 264 at processing/storage center 260, where processing server 262 extracts features from the sequence reads by executing one or more of the processes in bioinformatics module 140, thereby generating genomic features 131 for the sample. Processing server 262 may then analyze the identified features by executing one or more of the processes in feature analysis module 160, thereby generating clinical assessment 139, including a clinical report 139-3. A clinician may access clinical report 139-3, e.g., at processing/storage center 260 or through communications network 105, via recommendation validation module 167. After final approval, clinical report 139-3 is transmitted to a medical professional, e.g., an oncologist, at clinical environment 220, who uses the report to support clinical decision making for personalized treatment of the patient's cancer.
  • FIG. 2A: Example Workflow for Precision Oncology
  • FIG. 2A is a flowchart of an example workflow 200 for collecting and analyzing data in order to generate a clinical report 139 to support clinical decision making in precision oncology. Advantageously, the methods described herein improve this process, for example, by improving various stages within feature extraction 206, including validation of somatic sequence variants.
  • Briefly, the workflow begins with patient intake and sample collection 201, where one or more liquid biopsy samples, one or more tumor biopsies, and one or more normal and/or control tissue samples are collected from the patient (e.g., at a clinical environment 220 or home healthcare environment, as illustrated in FIG. 2B). In some embodiments, personal data 126 corresponding to the patient and a record of the one or more biological samples obtained (e.g., patient identifiers, patient clinical data, sample type, sample identifiers, cancer conditions, etc.) are entered into a data analysis platform, e.g., test patient data store 120. Accordingly, in some embodiments, the methods disclosed herein include obtaining one or more biological samples from one or more subjects, e.g., cancer patients. In some embodiments, the subject is a human, e.g., a human cancer patient.
  • In some embodiments, one or more of the biological samples obtained from the patient are a biological liquid sample, also referred to as a liquid biopsy sample. In some embodiments, one or more of the biological samples obtained from the patient are selected from blood, plasma, serum, urine, vaginal fluid, fluid from a hydrocele (e.g., of the testis), vaginal flushing fluids, pleural fluid, ascitic fluid, cerebrospinal fluid, saliva, sweat, tears, sputum, bronchoalveolar lavage fluid, discharge fluid from the nipple, aspiration fluid from different parts of the body (e.g., thyroid, breast), etc. In some embodiments, the liquid biopsy sample includes blood and/or saliva. In some embodiments, the liquid biopsy sample is peripheral blood. In some embodiments, blood samples are collected from patients in commercial blood collection containers, e.g., using PAXgene® Blood DNA Tubes. In some embodiments, saliva samples are collected from patients in commercial saliva collection containers, e.g., using an Oragene® DNA Saliva Kit.
  • In some embodiments, the liquid biopsy sample has a volume of from about 1 mL to about 50 mL. For example, in some embodiments, the liquid biopsy sample has a volume of about 1 mL, about 2 mL, about 3 mL, about 4 mL, about 5 mL, about 6 mL, about 7 mL, about 8 mL, about 9 mL, about 10 mL, about 11 mL, about 12 mL, about 13 mL, about 14 mL, about 15 mL, about 16 mL, about 17 mL, about 18 mL, about 19 mL, about 20 mL, or greater.
  • Liquid biopsy samples include cell free nucleic acids, including cell-free DNA (cfDNA). As described above, cfDNA isolated from cancer patients includes DNA originating from cancerous cells, also referred to as circulating tumor DNA (ctDNA), cfDNA originating from germline (e.g., healthy or non-cancerous) cells, and cfDNA originating from hematopoietic cells (e.g., white blood cells). The relative proportions of cancerous and non-cancerous cfDNA present in a liquid biopsy sample vary depending on the characteristics (e.g., the type, stage, lineage, genomic profile, etc.) of the patient's cancer. As used herein, the ‘tumor burden’ of the subject refers to the percentage of cfDNA that originated from cancerous cells.
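  • The definition of tumor burden given above can be illustrated with a short calculation. The sketch below assumes, purely for illustration, that the amount of cfDNA from each origin is already known (in any consistent unit); in practice these quantities are not observed directly and would have to be estimated. The function name `tumor_burden_percent` is hypothetical.

```python
def tumor_burden_percent(ctdna, germline_cfdna, hematopoietic_cfdna):
    """Percentage of total cfDNA that originated from cancerous cells.

    All three arguments are amounts in the same unit (e.g., fragment counts).
    """
    total = ctdna + germline_cfdna + hematopoietic_cfdna
    if total <= 0:
        raise ValueError("total cfDNA must be positive")
    return 100.0 * ctdna / total


# Illustrative numbers only: 5 units of ctDNA out of 100 units of total cfDNA.
example_burden = tumor_burden_percent(5, 90, 5)
```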
  • As described herein, cfDNA is a particularly useful source of biological data for various implementations of the methods and systems described herein, because it is readily obtained from various body fluids. Advantageously, use of bodily fluids facilitates serial monitoring because of the ease of collection, as these fluids are collectable by non-invasive or minimally invasive methodologies. This is in contrast to methods that rely upon solid tissue samples, such as biopsies, which often times require invasive surgical procedures. Further, because bodily fluids, such as blood, circulate throughout the body, the cfDNA population represents a sampling of many different tissue types from many different locations.
  • In some embodiments, a liquid biopsy sample is separated into two different samples. For example, in some embodiments, a blood sample is separated into a blood plasma sample, containing cfDNA, and a buffy coat preparation, containing white blood cells.
  • In some embodiments, a plurality of liquid biopsy samples is obtained from a respective subject at intervals over a period of time (e.g., using serial testing). For example, in some such embodiments, the time between obtaining liquid biopsy samples from a respective subject is at least 1 day, at least 2 days, at least 1 week, at least 2 weeks, at least 1 month, at least 2 months, at least 3 months, at least 4 months, at least 6 months, or at least 1 year.
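  • For illustration, a check that serially collected samples respect a minimum spacing could be sketched as below. The one-week default and the function name `intervals_ok` are assumptions chosen for this example; any of the intervals recited above could be substituted.

```python
from datetime import date, timedelta


def intervals_ok(collection_dates, min_interval=timedelta(days=7)):
    """True if every pair of consecutive collection dates is at least min_interval apart."""
    ordered = sorted(collection_dates)
    return all(later - earlier >= min_interval
               for earlier, later in zip(ordered, ordered[1:]))


# Hypothetical collection dates for one subject.
weekly = [date(2022, 1, 1), date(2022, 1, 8), date(2022, 1, 15)]
too_close = [date(2022, 1, 1), date(2022, 1, 5)]
```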
  • In some embodiments, one or more biological samples collected from the patient is a solid tissue sample, e.g., a solid tumor sample or a solid normal tissue sample. Methods for obtaining solid tissue samples, e.g., of cancerous and/or normal tissue are known in the art and are dependent upon the type of tissue being sampled. For example, bone marrow biopsies and isolation of circulating tumor cells can be used to obtain samples of blood cancers; endoscopic biopsies can be used to obtain samples of cancers of the digestive tract, bladder, and lungs; needle biopsies (e.g., fine-needle aspiration, core needle aspiration, vacuum-assisted biopsy, and image-guided biopsy) can be used to obtain samples of subdermal tumors; skin biopsies (e.g., shave biopsy, punch biopsy, incisional biopsy, and excisional biopsy) can be used to obtain samples of dermal cancers; and surgical biopsies can be used to obtain samples of cancers affecting internal organs of a patient. In some embodiments, a solid tissue sample is a formalin-fixed tissue (FFT). In some embodiments, a solid tissue sample is a macro-dissected formalin fixed paraffin embedded (FFPE) tissue. In some embodiments, a solid tissue sample is a fresh frozen tissue sample.
  • In some embodiments, a dedicated normal sample is collected from the patient, for co-processing with a liquid biopsy sample. Generally, the normal sample is of a non-cancerous tissue, and can be collected using any tissue collection means described above. In some embodiments, buccal cells collected from the inside of a patient's cheeks are used as a normal sample. Buccal cells can be collected by placing an absorbent material, e.g., a swab, in the subject's mouth and rubbing it against their cheek, e.g., for at least 15 seconds or for at least 30 seconds. The swab is then removed from the patient's mouth and inserted into a tube, such that the tip of the swab is submerged into a liquid that serves to extract the buccal cells off of the absorbent material. An example of buccal cell recovery and collection devices is provided in U.S. Pat. No. 9,138,205, the content of which is hereby incorporated by reference, in its entirety, for all purposes. In some embodiments, the buccal swab DNA is used as a source of normal DNA in circulating heme malignancies.
  • The biological samples collected from the patient are, optionally, sent to various analytical environments (e.g., sequencing lab 230, pathology lab 240, and/or molecular biology lab 250) for processing (e.g., data collection) and/or analysis (e.g., feature extraction). Wet lab processing 204 may include cataloguing samples (e.g., accessioning), examining clinical features of one or more samples (e.g., pathology review), and nucleic acid sequence analysis (e.g., extraction, library prep, capture+hybridize, pooling, and sequencing). In some embodiments, the workflow includes clinical analysis of one or more biological samples collected from the subject, e.g., at a pathology lab 240 and/or a molecular and cellular biology lab 250, to generate clinical features such as pathology features 128-1, imaging data 128-2, and/or tissue culture/organoid data 128-3.
  • In some embodiments, the pathology data 128-1 collected during clinical evaluation includes visual features identified by a pathologist's inspection of a specimen (e.g., a solid tumor biopsy), e.g., of stained H&E or IHC slides. In some embodiments, the sample is a solid tissue biopsy sample. In some embodiments, the tissue biopsy sample is a formalin-fixed tissue (FFT), e.g., a formalin-fixed paraffin-embedded (FFPE) tissue. In some embodiments, the tissue biopsy sample is an FFPE or FFT block. In some embodiments, the tissue biopsy sample is a fresh-frozen tissue biopsy. The tissue biopsy sample can be prepared in thin sections (e.g., by cutting and/or affixing to a slide), to facilitate pathology review (e.g., by staining with immunohistochemistry stain for IHC review and/or with hematoxylin and eosin stain for H&E pathology review). For instance, analysis of slides for H&E staining or IHC staining may reveal features such as tumor infiltration, programmed death-ligand 1 (PD-L1) status, human leukocyte antigen (HLA) status, or other immunological features.
  • In some embodiments, a liquid sample (e.g., blood) collected from the patient (e.g., in EDTA-containing collection tubes) is prepared on a slide (e.g., by smearing) for pathology review. In some embodiments, macrodissected FFPE tissue sections, which may be mounted on a histopathology slide, from solid tissue samples (e.g., tumor or normal tissue) are analyzed by pathologists. In some embodiments, tumor samples are evaluated to determine, e.g., the tumor purity of the sample, the percent tumor cellularity as a ratio of tumor to normal nuclei, etc. For each section, background tissue may be excluded or removed such that the section meets a tumor purity threshold, e.g., where at least 20% of the nuclei in the section are tumor nuclei, or where at least 25%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or more of the nuclei in the section are tumor nuclei.
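  • The tumor purity threshold described above can be illustrated as follows. This sketch assumes that tumor and normal nuclei have already been counted for a section and computes purity as tumor nuclei divided by all counted nuclei, consistent with the "at least 20% of the nuclei" criterion; the function names are hypothetical.

```python
def tumor_purity(tumor_nuclei, normal_nuclei):
    """Fraction of counted nuclei in a section that are tumor nuclei."""
    total = tumor_nuclei + normal_nuclei
    if total == 0:
        raise ValueError("no nuclei counted")
    return tumor_nuclei / total


def meets_purity_threshold(tumor_nuclei, normal_nuclei, threshold=0.20):
    """True if the section meets the purity threshold (default: at least 20%)."""
    return tumor_purity(tumor_nuclei, normal_nuclei) >= threshold


# Illustrative counts only.
passes_default = meets_purity_threshold(20, 80)
passes_strict = meets_purity_threshold(20, 80, threshold=0.50)
```

A section that fails the check could have background tissue excluded or removed and then be re-evaluated with the updated counts.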
  • In some embodiments, pathology data 128-1 is extracted, in addition to or instead of visual inspection, using computational approaches to digital pathology, e.g., providing morphometric features extracted from digital images of stained tissue samples. A review of digital pathology methods is provided in Bera, K. et al., Nat. Rev. Clin. Oncol., 16:703-15 (2019), the content of which is hereby incorporated by reference, in its entirety, for all purposes. In some embodiments, pathology data 128-1 includes features determined using machine learning algorithms to evaluate pathology data collected as described above.
  • Further details on methods, systems, and algorithms for using pathology data to classify cancer and identify targeted therapies are discussed, for example, in U.S. patent application Ser. No. 16/830,186, filed on Mar. 25, 2020, and U.S. Provisional Application No. 63/007,874, filed on Apr. 9, 2020, the contents of which are hereby incorporated by reference, in their entireties, for all purposes.
  • In some embodiments, imaging data 128-2 collected during clinical evaluation includes features identified by review of in-vitro and/or in-vivo imaging results (e.g., of a tumor site), for example a size of a tumor, tumor size differentials over time (such as during treatment or during other periods of change). In some embodiments, imaging data 128-2 includes features determined using machine learning algorithms to evaluate imaging data collected as described above.
  • Further details on methods, systems, and algorithms for using medical imaging to classify cancer and identify targeted therapies are discussed, for example, in U.S. patent application Ser. No. 16/830,186, filed on Mar. 25, 2020, and U.S. Provisional Application No. 63/007,874, filed on Apr. 9, 2020, the contents of which are hereby incorporated by reference, in their entireties, for all purposes.
  • In some embodiments, tissue culture/organoid data 128-3 collected during clinical evaluation includes features identified by evaluation of cultured tissue from the subject. For instance, in some embodiments, tissue samples obtained from the patients (e.g., tumor tissue, normal tissue, or both) are cultured (e.g., in liquid culture, solid-phase culture, and/or organoid culture) and various features, such as cell morphology, growth characteristics, genomic alterations, and/or drug sensitivity, are evaluated. In some embodiments, tissue culture/organoid data 128-3 includes features determined using machine learning algorithms to evaluate tissue culture/organoid data collected as described above. Examples of tissue organoid (e.g., personal tumor organoid) culturing and feature extractions thereof are described in U.S. Provisional Application Ser. No. 62/924,621, filed on Oct. 22, 2019, and U.S. patent application Ser. No. 16/693,117, filed on Nov. 22, 2019, the contents of which are hereby incorporated by reference, in their entireties, for all purposes.
  • Nucleic acid sequencing of one or more samples collected from the subject is performed, e.g., at sequencing lab 230, during wet lab processing 204. An example workflow for nucleic acid sequencing is illustrated in FIG. 3. In some embodiments, the one or more biological samples obtained at the sequencing lab 230 are accessioned (302), to track the sample and data through the sequencing process.
  • Next, nucleic acids, e.g., RNA and/or DNA are extracted (304) from the one or more biological samples. Methods for isolating nucleic acids from biological samples are known in the art, and are dependent upon the type of nucleic acid being isolated (e.g., cfDNA, DNA, and/or RNA) and the type of sample from which the nucleic acids are being isolated (e.g., liquid biopsy samples, white blood cell buffy coat preparations, formalin-fixed paraffin-embedded (FFPE) solid tissue samples, and fresh frozen solid tissue samples). The selection of any particular nucleic acid isolation technique for use in conjunction with the embodiments described herein is well within the skill of the person having ordinary skill in the art, who will consider the sample type, the state of the sample, the type of nucleic acid being sequenced and the sequencing technology being used.
  • For instance, many techniques for DNA isolation, e.g., genomic DNA isolation, from a tissue sample are known in the art, such as organic extraction, silica adsorption, and anion exchange chromatography. Likewise, many techniques for RNA isolation, e.g., mRNA isolation, from a tissue sample are known in the art. For example, acid guanidinium thiocyanate-phenol-chloroform extraction (see, for example, Chomczynski and Sacchi, 2006, Nat Protoc, 1(2):581-85, which is hereby incorporated by reference herein), and silica bead/glass fiber adsorption (see, for example, Poeckh, T. et al., 2008, Anal Biochem., 373(2):253-62, which is hereby incorporated by reference herein). The selection of any particular DNA or RNA isolation technique for use in conjunction with the embodiments described herein is well within the skill of the person having ordinary skill in the art, who will consider the