AU2021224670A1 - Methods and systems for a liquid biopsy assay - Google Patents

Publication number
AU2021224670A1
Authority
AU
Australia
Prior art keywords
bin
sequence
segment
level
measure
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
AU2021224670A
Inventor
Terri M. Driessen
Justin David Finkle
Christine LO
Robert Tell
Wei Zhu
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tempus Labs Inc
Original Assignee
Tempus Labs Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority to US 62/978,130 (US202062978130P)
Priority to US 63/041,272 (US202063041272P)
Priority to US 63/041,293 (US202063041293P)
Priority to US 63/041,424 (US202063041424P)
Application filed by Tempus Labs Inc
Priority to PCT/US2021/018622 (WO2021168146A1)
Publication of AU2021224670A1
Legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/10 Ploidy or copy number detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems

Abstract

Methods, systems, and software are provided for validating a copy number variation, validating a somatic sequence variant, and/or determining circulating tumor fraction estimates using on-target and off-target sequence reads in a test subject. A copy number status annotation for a genomic segment is validated by applying a first dataset to a plurality of filters comprising a measure of central tendency bin-level sequence ratio filter, a confidence filter, and a measure of central tendency-plus-deviation bin-level sequence ratio filter. A somatic sequence variant is validated by comparing a variant allele fragment count for a candidate somatic sequence variant for a respective locus, against a dynamic variant count threshold for the locus in a respective reference sequence. A circulating tumor fraction is estimated based on a measure of fit between genomic segment-level coverage ratios and integer copy states across a plurality of simulated circulated tumor fractions.
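
By way of a non-limiting illustration, the tumor fraction estimate summarized in the last sentence above can be pictured as a grid search: for each simulated tumor fraction, every segment-level coverage ratio is compared with the ratio expected for its nearest integer copy state, and the fraction giving the best overall fit is kept. The sketch below is only an assumption-laden example. The function names, the diploid baseline, the squared-error measure of fit, and the copy state range of 0 to 6 are hypothetical choices made for illustration and are not taken from the disclosure.

```python
def expected_ratio(copy_state, tumor_fraction):
    """Coverage ratio expected for an integer copy state (assumes a diploid baseline)."""
    return (tumor_fraction * copy_state + (1.0 - tumor_fraction) * 2.0) / 2.0


def fit_error(segment_ratios, tumor_fraction, copy_states=range(0, 7)):
    """Sum of squared distances from each segment ratio to its nearest expected ratio."""
    total = 0.0
    for ratio in segment_ratios:
        total += min((ratio - expected_ratio(c, tumor_fraction)) ** 2 for c in copy_states)
    return total


def estimate_tumor_fraction(segment_ratios, candidate_fractions):
    """Return the simulated tumor fraction with the smallest fit error."""
    return min(candidate_fractions, key=lambda tf: fit_error(segment_ratios, tf))


# Hypothetical segment-level ratios built from copy states 2, 3 and 0 at a tumor fraction of 0.30.
observed = [1.0, 1.15, 0.7]
grid = [i / 100 for i in range(1, 100)]  # simulated tumor fractions 0.01 .. 0.99
best = estimate_tumor_fraction(observed, grid)
print(f"best-fitting simulated tumor fraction: {best:.2f}")
```

Bounding the allowed copy states matters in this sketch: with an unbounded range, very small tumor fractions would fit almost any ratio, because the expected ratios for neighboring copy states crowd together.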

Description

METHODS AND SYSTEMS FOR A LIQUID BIOPSY ASSAY
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] This application claims priority to U.S. Provisional Patent Application No. 63/041,272, filed June 19, 2020, U.S. Provisional Patent Application No. 63/041,293, filed June 19, 2020, U.S. Provisional Patent Application No. 63/041,424, filed June 19, 2020, and U.S. Provisional Patent Application No. 62/978,130, filed February 18, 2020, the contents of which are hereby incorporated by reference, in their entireties, for all purposes.
FIELD OF THE INVENTION
[0002] The present disclosure relates generally to the use of cell-free DNA sequencing data to provide clinical support for personalized treatment of cancer.
BACKGROUND
[0003] Precision oncology is the practice of tailoring cancer therapy to the unique genomic, epigenetic, and/or transcriptomic profile of an individual’s cancer. Personalized cancer treatment builds upon conventional therapeutic regimens used to treat cancer based only on the gross classification of the cancer, e.g., treating all breast cancer patients with a first therapy and all lung cancer patients with a second therapy. This field was born out of many observations that different patients diagnosed with the same type of cancer, e.g., breast cancer, responded very differently to common treatment regimens. Over time, researchers have identified genomic, epigenetic, and transcriptomic markers that improve predictions as to how an individual cancer will respond to a particular treatment modality.
[0004] There is growing evidence that cancer patients who receive therapy guided by their genetics have better outcomes. For example, studies have shown that targeted therapies result in significantly improved progression-free cancer survival. See, e.g., Radovich M. et al, Oncotarget, 7(35):56491-500 (2016). Similarly, reports from the IMPACT trial, a large (n = 1307) retrospective analysis of consecutive, prospectively molecularly profiled patients with advanced cancer who participated in a large, personalized medicine trial, indicate that patients receiving targeted therapies matched to their tumor biology had a response rate of 16.2%, as opposed to a response rate of 5.2% for patients receiving non-matched therapy. Tsimberidou AM et al, ASCO 2018, Abstract LBA2553 (2018).
[0005] In fact, therapy targeted to specific genomic alterations is already the standard of care in several tumor types, e.g., as suggested in the National Comprehensive Cancer Network (NCCN) guidelines for melanoma, colorectal cancer, and non-small cell lung cancer. In practice, implementation of these targeted therapies requires determining the status of the diagnostic marker in each eligible cancer patient. While this can be accomplished for the few, well known mutations associated with treatment recommendations in the NCCN guidelines using individual assays or small next generation sequencing (NGS) panels, the growing number of actionable genomic alterations and increasing complexity of diagnostic classifiers necessitates a more comprehensive evaluation of each patient’s cancer genome, epigenome, and/or transcriptome.
[0006] For instance, some evidence suggests that use of combination therapies where each component is matched to an actionable genomic alteration holds the greatest potential for treating individual cancers. To this point, a retrospective study of cancer patients treated with one or more therapeutic regimens revealed that patients who received therapies matched to a higher percentage of their genomic alterations experienced a greater frequency of stable disease (e.g., a longer time to recurrence), longer time to treatment failure, and greater overall survival. Wheler JJ et al, Cancer Res., 76:3690-701 (2016). Thus, comprehensive evaluation of each cancer patient’s genome, epigenome, and/or transcriptome should maximize the benefits provided by precision oncology, by facilitating more fine-tuned combination therapies, use of novel off-label drug indications, and/or tissue agnostic immunotherapy. See, for example, Schwaederle M. et al, J Clin Oncol., 33(32):3817-25 (2015); Schwaederle M. et al, JAMA Oncol., 2(11):1452-59 (2016); and Wheler JJ et al, Cancer Res., 76(13):3690-701 (2016). Further, the use of comprehensive next generation sequencing analysis of cancer genomes facilitates better access and a larger patient pool for clinical trial enrollment. Coyne GO et al, Curr. Probl. Cancer, 41(3):182-93 (2017); and Markman M., Oncology, 31(3):158, 168.
[0007] The use of large NGS genomic analysis is growing in order to address the need for more comprehensive characterization of an individual’s cancer genome. See, for example, Fernandes GS et al, Clinics, 72(10):588-94. Recent studies indicate that of the patients for which large NGS genomic analysis is performed, 30-40% then receive clinical care based on the assay results, which is limited by at least the identification of actionable genomic alterations, the availability of medication for treatment of identified actionable genomic alterations, and the clinical condition of the subject. See, Ross JS et al, JAMA Oncol., 1(1):40-49 (2015); Ross JS et al, Arch. Pathol. Lab Med., 139:642-49 (2015); Hirshfield KM et al, Oncologist, 21(11):1315-25 (2016); and Groisberg R. et al., Oncotarget, 8:39254-67 (2017).
[0008] However, these large NGS genomic analyses are conventionally performed on solid tumor samples. For instance, each of the studies referenced in the paragraph above performed NGS analysis of FFPE tumor blocks from patients. Solid tissue biopsies remain the gold standard for diagnosis and identification of predictive biomarkers because they represent well-known and validated methodologies that provide a high degree of accuracy. Nevertheless, there are significant limitations to the use of solid tissue material for large NGS genomic analyses of cancers. For example, tumor biopsies are subject to sampling bias caused by spatial and/or temporal genetic heterogeneity, e.g., between two regions of a single tumor and/or between different cancerous tissues (such as between primary and metastatic tumor sites or between two different primary tumor sites). Such intertumor or intratumor heterogeneity can cause sub-clonal or emerging mutations to be overlooked when using localized tissue biopsies, with the potential for sampling bias to be exacerbated over time as sub-clonal populations further evolve and/or shift in predominance.
[0009] Additionally, the acquisition of solid tissue biopsies often requires invasive surgical procedures, e.g., when the primary tumor site is located at an internal organ. These procedures can be expensive, time consuming, and carry a significant risk to the patient, e.g., when the patient’s health is poor and may not be able to tolerate invasive medical procedures and/or the tumor is located in a particularly sensitive or inoperable location, such as in the brain or heart. Further, the amount of tissue, if any, that can be procured depends on multiple factors, including the location of the tumor, the size of the tumor, the fragility of the patient, and the risk of comorbidities related to biopsies, such as bleeding and infections. For instance, recent studies report that tissue samples in a majority of advanced non-small cell lung cancer patients are limited to small biopsies and cannot be obtained at all in up to 31% of patients. Ilie and Hofman, Transl. Lung Cancer Res., 5(4):420-23 (2016). Even when a tissue biopsy is obtained, the sample may be too scant for comprehensive testing.
[0010] Further, the method of tissue collection, preservation (e.g., formalin fixation), and/or storage of tissue biopsies can result in sample degradation and variable quality DNA. This, in turn, leads to inaccuracies in downstream assays and analysis, including next-generation sequencing (NGS) for the identification of biomarkers. Ilie and Hofman, Transl Lung Cancer Res., 5(4):420-23 (2016).
[0011] In addition, the invasive nature of the biopsy procedure, the time and cost associated with obtaining the sample, and the compromised state of cancer patients receiving therapy render repeat testing of cancerous tissues impracticable, if not impossible. As a result, solid tissue biopsy analysis is not amenable to many monitoring schemes that would benefit cancer patients, such as disease progression analysis, treatment efficacy evaluation, disease recurrence monitoring, and other techniques that require data from several time points.
[0012] Cell-free DNA (cfDNA) has been identified in various bodily fluids, e.g., blood serum, plasma, urine, etc. Chan et al., Ann. Clin. Biochem., 40(Pt 2):122-30 (2003). This cfDNA originates from necrotic or apoptotic cells of all types, including germline cells, hematopoietic cells, and diseased (e.g., cancerous) cells. Advantageously, genomic alterations in cancerous tissues can be identified from cfDNA isolated from cancer patients. See, e.g., Stroun et al, Oncology, 46(5):318-22 (1989); Goessl et al, Cancer Res., 60(21):5941-45 (2000); and Frenel et al, Clin. Cancer Res. 21(20):4586-96 (2015). Thus, one approach to overcoming the problems presented by the use of solid tissue biopsies described above is to analyze cell-free nucleic acids (e.g., cfDNA) and/or nucleic acids in circulating tumor cells present in biological fluids, e.g., via a liquid biopsy.
[0013] Specifically, liquid biopsies offer several advantages over conventional solid tissue biopsy analysis. For instance, because bodily fluids can be collected in a minimally invasive or non-invasive fashion, sample collection is simpler, faster, safer, and less expensive than solid tumor biopsies. Such methods require only small amounts of sample (e.g., 10 mL or less of whole blood per biopsy) and reduce the discomfort and risk of complications experienced by patients during conventional tissue biopsies. In fact, liquid biopsy samples can be collected with limited or no assistance from medical professionals and can be performed at almost any location. Further, liquid biopsy samples can be collected from any patient, regardless of the location of their cancer, their overall health, and any previous biopsy collection. This allows for analysis of the cancer genome of patients from which a solid tumor sample cannot be easily and/or safely obtained. In addition, because cell-free DNA in the bodily fluids arises from many different types of tissues in the patient, the genomic alterations present in the pool of cell-free DNA are representative of various different clonal sub-populations of the cancerous tissue of the subject, facilitating a more comprehensive analysis of the cancerous genome of the subject than is possible from one or more sections of a single solid tumor sample.
[0014] Liquid biopsies also enable serial genetic testing prior to cancer detection, during the early stages of cancer progression, throughout the course of treatment, and during remission, e.g., to monitor for disease recurrence. The ability to conduct serial testing via non-invasive liquid biopsies throughout the course of disease could prove beneficial for many patients, e.g., through monitoring patient response to therapies, the emergence of new actionable genomic alterations, and/or drug-resistance alterations. These types of information allow medical professionals to more quickly tailor and update therapeutic regimens, e.g., facilitating more timely intervention in the case of disease progression. See, e.g., Ilie and Hofman, Transl. Lung Cancer Res., 5(4):420-23 (2016).
[0015] Nevertheless, while liquid biopsies are promising tools for improving outcomes using precision oncology, there are significant challenges specific to the use of cell-free DNA for evaluation of a subject’s cancer genome. For instance, there is a highly variable signal-to-noise ratio from one liquid biopsy sample to the next. This occurs because cfDNA originates from a variety of different cells in a subject, both healthy and diseased. Depending on the stage and type of cancer in any particular subject, the fraction of cfDNA fragments originating from cancerous cells (the “tumor fraction” or “ctDNA fraction” of the sample/subject) can range from almost 0% to well over 50%. Other factors, including tumor type and mutation profile, can also impact the amount of DNA released from cancerous tissues. For instance, cfDNA clearance through the liver and kidneys is affected by a variety of factors, including renal dysfunction or other tissue damaging factors (e.g., chemotherapy, surgery, and/or radiotherapy).
[0016] This, in turn, leads to problems detecting and/or validating cancer-specific genomic alterations in a liquid sample. This is particularly true during early stages of the disease, when cancer therapies have much higher success rates, because the tumor fraction in the patient is lowest at this point. Thus, early stage cancer patients can have ctDNA fractions below the limit of detection (LOD) for one or more informative genomic alterations, limiting clinical utility because of the risk of false negatives and/or providing an incomplete picture of the cancer genome of the patient. Further, because cancers, and even individual tumors, can be clonally diverse, actionable genomic alterations that arise in only a subset of clonal populations are diluted below the overall tumor fraction of the sample, further frustrating attempts to tailor combination therapies to the various actionable mutations in the patient’s cancer genome. Consequently, most studies using liquid biopsy samples to date have focused on late stage patients for assay validation and research.
[0017] Another challenge associated with liquid biopsies is the accurate determination of tumor fraction in a sample. This difficulty arises from at least the heterogeneity of cancers and the increased frequency of large chromosomal duplications and deletions found in cancers. As a result, the frequency of genomic alterations from cancerous tissues varies from locus to locus based on at least (i) their prevalence in different sub-clonal populations of the subject’s cancer, and (ii) their location within the genome, relative to large chromosomal copy number variations. The difficulty in accurately determining the tumor fraction of liquid biopsy samples affects accurate measurement of various cancer features shown to have diagnostic value for the analysis of solid tumor biopsies. These include allelic ratios, copy number variations, overall mutational burden, frequency of abnormal methylation patterns, etc., all of which are correlated with the percentage of DNA fragments that arise from cancerous tissue, as opposed to healthy tissue.
[0018] Altogether, these factors result in highly variable concentrations of ctDNA — from patient to patient and possibly from locus to locus — that confound accurate measurement of disease indicators and actionable genomic alterations. Further, the quantity and quality of cfDNA obtained from liquid biopsy samples are highly dependent on the particular methodology for collecting the samples, storing the samples, sequencing the samples, and standardizing the sequencing data.
[0019] While validation studies of existing liquid biopsy assays have shown high sensitivity and specificity, few studies have corroborated results with orthogonal methods, or between particular testing platforms, e.g., different NGS technologies and/or targeted panel sequencing versus whole genome/exome sequence. Reports of liquid biopsy-based studies are limited by comparison to non-comprehensive tissue testing algorithms including Sanger sequencing, small NGS hotspot panels, polymerase chain reaction (PCR), and fluorescent in situ hybridization (FISH), which may not contain all NCCN guideline genes in their reportable range, thus suffering in comparison to a more comprehensive liquid biopsy assay.
[0020] As an example, conventional liquid biopsy assays do not provide accurate classifications of copy number variations (CNVs) for genomic targets (e.g., biomarkers), where CNVs are a form of genomic alteration with known relevance to cancer. Conventional methodologies typically assign a genomic target to an integer copy number and/or one of three copy number states (e.g., amplified, neutral, or deleted) using a copy ratio cutoff above or below which an amplified or deleted status is called, respectively, or in which a neutral status is otherwise called. Such methodologies make these assignments based on the fact that at a given tumor fraction and a known ploidy, the copy number in a segment is positively correlated with its copy ratio and thus the copy ratio can be mathematically converted to an integer copy number. For example, one conventional method, ichorCNA, uses software that estimates tumor fraction in circulating cfDNA from ultra-low-pass whole genome sequencing, which is then used to determine genomic alterations such as copy number alterations. See, Adalsteinsson et al, Nat Commun., 8:1324 (2017).
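
For purposes of illustration only, the conversion between copy ratio and copy number referred to above can be sketched with a commonly used mixture model, in which the observed signal is a tumor-fraction-weighted blend of tumor and normal DNA. The function names, the default ploidy values, and the model itself are assumptions introduced for this example; they are not asserted to be the formula used by any particular conventional method.

```python
def expected_copy_ratio(copy_number, tumor_fraction, tumor_ploidy=2.0, normal_ploidy=2.0):
    """Copy ratio expected for a segment with the given tumor copy number."""
    tumor_signal = tumor_fraction * copy_number
    normal_signal = (1.0 - tumor_fraction) * normal_ploidy
    baseline = tumor_fraction * tumor_ploidy + normal_signal
    return (tumor_signal + normal_signal) / baseline


def copy_number_from_ratio(copy_ratio, tumor_fraction, tumor_ploidy=2.0, normal_ploidy=2.0):
    """Invert the mixture model: tumor copy number implied by an observed copy ratio."""
    if not 0.0 < tumor_fraction <= 1.0:
        raise ValueError("tumor_fraction must be in (0, 1]")
    normal_signal = (1.0 - tumor_fraction) * normal_ploidy
    baseline = tumor_fraction * tumor_ploidy + normal_signal
    return (copy_ratio * baseline - normal_signal) / tumor_fraction


# The same observed ratio implies different copy numbers at different tumor fractions.
for tf in (0.10, 0.05):
    cn = copy_number_from_ratio(1.05, tf)
    print(f"tumor fraction {tf:.2f}: implied copy number is about {round(cn)}")
```

Under these assumptions, an observed copy ratio of 1.05 corresponds to roughly three copies at a tumor fraction of 0.10 but roughly four copies at 0.05, which shows why the integer assignment depends so strongly on the tumor fraction estimate.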
[0021] However, this approach can be problematic due to the current challenges in accurately determining tumor fraction in liquid biopsy samples. For example, estimating the ctDNA fraction of total cell-free DNA in plasma can be difficult due to highly variable tumor fractions that can range from 0 to approximately 90%, and in many cases can be below 1% and/or below the limit of detection. See, Shigematsu and Koyama, Nihon Jinzo Gakkai Shi., 30(9):1115-22 (1988). Methods based on mean, median, maximum or other point estimates of somatic variant allele fractions (VAFs) require the difficult task of accurate quantification and classification of somatic and germline variants in liquid biopsy samples, which can be further complicated by the absence of a matched normal sample or the presence of artifactual variants and/or clonal heterogeneity. In addition to the reliance on potentially inaccurate tumor fraction estimations, methods that utilize ultra-low-pass whole genome sequencing assays may be inappropriate for analyzing copy number variations from capture-based deep sequencing assays.
[0022] Additional challenges arise in cases where non-focal copy number variations are identified (e.g., where an entire chromosome or a large portion of a chromosome is amplified or deleted). Non-focal copy number variations are often difficult to interpret, as these large-scale copy number changes may represent real copy number variations or may be artifacts resulting from incorrect normalization due to low sample quality, capture failures, or other unknown issues during library preparation or sequencing. Because such large-scale copy number changes are unlikely to be associated with therapeutically actionable genomic alterations, the ability to differentiate between real and artifactual copy number variations is an important and unmet need in precision oncology applications. For example, two conventional methods that are insufficient to distinguish focal copy number variations from non-focal copy number variations include CNVkit and AVENIO. See, for example, Talevich et al, PLoS Comput Biol, 12:1004873 (2016), and Roche, “AVENIO ctDNA Expanded Kit,” (2018), the contents of which are incorporated herein by reference, in their entireties, for all purposes.
[0023] As another example, conventional liquid biopsy assays do not provide a method for accurately detecting variants (e.g., variant alleles) in ctDNA NGS assays. As described above, many patients may not have abundant ctDNA in early stage disease and may shed variants below the limit of detection (LOD) for ctDNA assays, resulting in false negatives. Detecting these variants at low circulating fractions is also technically challenging due to constraints of sequencing by synthesis. Additionally, differentiating between germline and somatic variants in ctDNA is difficult, as is differentiating between mutations derived from clonal hematopoiesis (CH) and the solid tumor being assayed. In such cases, mutations in hematopoietic lineage cells may be mistaken for tumor-derived mutations. Indeed, researchers have identified several genes frequently mutated in CH with potential importance in cancer, such as JAK2, TP53, GNAS, IDH2, and KRAS. Mayrhofer et al, 2018, “Cell-free DNA profiling of metastatic prostate cancer reveals microsatellite instability, structural rearrangements and clonal hematopoiesis,” Genome Med, (10), pg. 85; Hu et al., 2018, “False-Positive Plasma Genotyping Due to Clonal Hematopoiesis,” Clin Cancer Res, (24), pg. 4437.
[0024] Additionally, conventional liquid biopsy assays do not provide accurate circulating tumor fraction estimates (ctFEs). Accurate ctFEs provide several benefits to liquid biopsy applications, including classification of variants as somatic or germline, detection of clinically relevant copy number variations, and/or use of ctFEs as biomarkers.
[0025] For example, because up to 30% of breast cancer patients and up to 55% of lung cancer patients relapse after initial treatment, as well as a significant portion of patients in other cancer cohorts, the ability to detect metastasis and disease recurrence earlier in these patients could significantly improve patient outcomes. See, Colleoni et al., 2016, “Annual Hazard Rates of Recurrence for Breast Cancer During 24 Years of Follow-Up: Results From the International Breast Cancer Study Group Trials I to V,” J Clin Oncol, (34), pg. 927; Yates et al., 2017, “Genomic Evolution of Breast Cancer Metastasis and Relapse,” Cancer Cell, (32), pg. 169; Uramoto et al., 2014, “Recurrence after surgery in patients with NSCLC,” Transl Lung Cancer Res, (3), pg. 242; Taunk et al., 2017, “Immunotherapy and radiation therapy for operable early stage and locally advanced non-small cell lung cancer,” Transl Lung Cancer Res, (6), pg. 178. Indeed, recent retrospective and prospective studies have shown ctDNA after completion of treatment or surgery can act as a biomarker for disease recurrence in many cancer types, including breast cancer, lung cancer, melanoma, bladder cancer, and colon cancer. See, Coombes et al, 2019, “Personalized Detection of Circulating Tumor DNA Antedates Breast Cancer Metastatic Recurrence,” Clin Cancer Res, (25), pg. 4255; Tie et al, 2019, “Circulating Tumor DNA Analyses as Markers of Recurrence Risk and Benefit of Adjuvant Therapy for Stage III Colon Cancer,” JAMA Oncol, print; McEvoy et al, 2019, “Monitoring melanoma recurrence with circulating tumor DNA: a proof of concept from three case studies,” Oncotarget, (10), pg. 113; Christensen et al, 2019, “Early Detection of Metastatic Relapse and Monitoring of Therapeutic Efficacy by Ultra-Deep Sequencing of Plasma Cell-Free DNA in Patients With Urothelial Bladder Carcinoma,” J Clin Oncol, (37), pg. 1547; Isaksson et al, 2019, “Pre-operative plasma cell-free circulating tumor DNA and serum protein tumor markers as predictors of lung adenocarcinoma recurrence,” Acta Oncol, (58), pg. 1079. Higher ctFEs are associated with disease progression at radiographic evaluation and an increased metastatic lesion count.
[0026] Furthermore, ctFEs correlate with important clinical outcomes, and provide a minimally invasive method to monitor patients for response to therapy, disease relapse, and disease progression. However, conventional methodologies used for determining ctFEs in liquid biopsy samples rely on low-pass, whole-genome sequencing, which cannot also be used for variant detection (see, for example, Adalsteinsson et al, “Scalable whole-exome sequencing of cell-free DNA reveals high concordance with metastatic tumors,” (2017) Nature Communications Nov 6;8(1):1324, doi:10.1038/s41467-017-00965-y; and ichorCNA, the Broad Institute, available on the internet at github.com/broadinstitute/ichorCNA). Other traditional approaches use variant allele fractions (VAFs) to estimate tumor fraction, but such approaches are confounded by variant tissue source and capture bias resulting in high levels of noise. Additionally, conventional methodologies for determining tumor purity estimates in solid tumor biopsy samples rely solely on on-target probe regions, which cannot be used in conjunction with targeted gene panels containing small numbers of genes.
[0027] The information disclosed in this Background section is only for enhancement of understanding of the general background of the invention and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person skilled in the art.
SUMMARY
[0028] Given the above background, there is a need in the art for improved methods and systems for supporting clinical decisions in precision oncology using liquid biopsy assays. In particular, there is a need in the art for improved methods and systems for identifying focal copy number variations in liquid biopsy assays. The present disclosure solves this and other needs in the art by providing improvements in validating copy number variation annotations, thus identifying focal copy number variations in genomic segments obtained from liquid biopsy assays. For example, by applying a plurality of amplification and/or deletion filters to a dataset comprising bin-level copy ratios, segment-level copy ratios, and segment-level confidence intervals for a plurality of bins and segments, respectively, the systems and methods described herein reject or validate a focal copy number status annotation at a locus that is potentially actionable using precision oncology.
[0029] For example, in one aspect, the present disclosure provides a method of validating a copy number variation in a test subject, at a computer system having one or more processors, and memory storing one or more programs for execution by the one or more processors. The method comprises obtaining a first dataset that comprises a plurality of bin-level sequence ratios, each respective bin-level sequence ratio in the plurality of bin-level sequence ratios corresponding to a respective bin in a plurality of bins. Each respective bin in the plurality of bins represents a corresponding region of a human reference genome, and each respective bin-level sequence ratio in the plurality of bin-level sequence ratios is determined from a sequencing of a plurality of cell-free nucleic acids in a first liquid biopsy sample of the test subject and one or more reference samples.
[0030] The first dataset also comprises a plurality of segment-level sequence ratios, each respective segment-level sequence ratio in the plurality of segment-level sequence ratios corresponding to a segment in a plurality of segments. Each respective segment in the plurality of segments represents a corresponding region of the human reference genome encompassing a subset of adjacent bins in the plurality of bins, and each respective segment-level sequence ratio in the plurality of segment-level sequence ratios is determined from a measure of central tendency of the plurality of bin-level sequence ratios corresponding to the subset of adjacent bins encompassed by the respective segment.
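
As a non-limiting sketch of the relationship just described, each segment-level sequence ratio can be computed as a measure of central tendency over the adjacent bins that the segment encompasses. The example below assumes the median as that measure and represents each segment as a half-open index range into a position-ordered list of bin-level ratios; the function name, the data layout, and the numeric values are hypothetical.

```python
from statistics import median


def segment_level_ratios(bin_ratios, segments):
    """Return one ratio per segment: the median of its adjacent bins' ratios.

    bin_ratios: bin-level sequence ratios ordered by genomic position.
    segments:   (start, end) half-open index ranges into bin_ratios.
    """
    result = []
    for start, end in segments:
        if not 0 <= start < end <= len(bin_ratios):
            raise ValueError(f"invalid segment bounds: ({start}, {end})")
        result.append(median(bin_ratios[start:end]))
    return result


# Seven hypothetical bins split into two segments of adjacent bins.
bins = [1.0, 1.1, 0.9, 2.0, 2.2, 1.8, 2.0]
print(segment_level_ratios(bins, [(0, 3), (3, 7)]))  # prints [1.0, 2.0]
```

The median is used here only because it is robust to a few outlying bins; a mean or another measure of central tendency would fit the same description.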
[0031] The first dataset further comprises a plurality of segment-level measures of dispersion, where each respective segment-level measure of dispersion in the plurality of segment-level measures of dispersion (i) corresponds to a respective segment in the plurality of segments and (ii) is determined using the plurality of bin-level sequence ratios corresponding to the subset of adjacent bins encompassed by the respective segment. [0032] In this aspect, the method comprises validating a copy number status annotation of a respective segment in the plurality of segments that is annotated with a copy number variation by applying the first dataset to an algorithm having a plurality of filters. A first filter in the plurality of filters is a measure of central tendency bin-level sequence ratio filter that is fired when a measure of central tendency of the plurality of bin-level sequence ratios corresponding to the subset of bins encompassed by the respective segment fails to satisfy one or more bin-level sequence ratio thresholds. A second filter in the plurality of filters is a confidence filter that is fired when the segment-level measure of dispersion corresponding to the respective segment fails to satisfy a confidence threshold. A third filter in the plurality of filters is a measure of central tendency-plus-deviation bin-level sequence ratio filter that is fired when a measure of central tendency of the plurality of bin-level sequence ratios corresponding to the subset of bins encompassed by the respective segment fails to satisfy one or more measure of central tendency-plus-deviation bin-level sequence ratio thresholds.
In the third filter, the one or more measure of central tendency-plus-deviation bin-level copy ratio thresholds are derived from (i) a measure of the bin-level sequence ratios corresponding to the plurality of bins that map to the same chromosome of the human reference genome as the respective segment, and (ii) a measure of dispersion across the bin-level sequence ratios corresponding to the plurality of bins that map to the respective chromosome.
[0033] When a filter in the plurality of filters is fired, the copy number status annotation of the respective segment is rejected; and when no filter in the plurality of filters is fired, the copy number status annotation of the respective segment is validated.
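By way of illustration only, the three filters and the reject-if-any-fires rule described above can be sketched as follows for a putatively amplified segment. The sketch assumes log2 bin-level ratios, uses the median and the median absolute deviation (MAD) as the measures of central tendency and dispersion, and treats the segment-level measure of dispersion as a confidence-interval width. The function names and the threshold values (`min_median`, `max_ci_width`, `k`) are hypothetical and are not taken from the disclosure.

```python
from statistics import median


def mad(values):
    """Median absolute deviation of the values around their median."""
    center = median(values)
    return median([abs(v - center) for v in values])


def fired_filters(segment_bins, chromosome_bins, ci_width,
                  min_median=0.5, max_ci_width=1.0, k=2.0):
    """Return the names of the filters fired for a putatively amplified segment.

    segment_bins: bin-level log2 ratios for the bins inside the segment.
    chromosome_bins: bin-level log2 ratios for all bins on the same chromosome.
    ci_width: segment-level measure of dispersion (assumed here to be a CI width).
    min_median, max_ci_width, k: hypothetical thresholds chosen for illustration.
    """
    seg_median = median(segment_bins)
    fired = []
    # First filter: central tendency of the segment's bin-level ratios.
    if seg_median < min_median:
        fired.append("median")
    # Second filter: segment-level confidence.
    if ci_width > max_ci_width:
        fired.append("confidence")
    # Third filter: threshold derived from the chromosome-wide median and MAD.
    if seg_median < median(chromosome_bins) + k * mad(chromosome_bins):
        fired.append("median_plus_mad")
    return fired


def validate(segment_bins, chromosome_bins, ci_width, **thresholds):
    """Validate the annotation only when no filter is fired."""
    return not fired_filters(segment_bins, chromosome_bins, ci_width, **thresholds)
```

In this sketch, a segment whose bins stand well above the rest of its chromosome passes all three filters, whereas a gain spanning the whole chromosome fires the third filter because the chromosome-wide threshold rises with it.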
[0034] In another aspect, the present disclosure provides a method for treating a patient with a cancer containing a copy number variation of a target gene. The method comprises determining whether the patient has an aggressive form of cancer associated with a focal copy number variation of the target gene by obtaining a first biological sample of the cancer from the patient and performing copy number variation analysis on the first biological sample to identify the copy number status of the target gene in the cancer.
[0035] The copy number variation analysis generates a first dataset comprising a plurality of bin-level sequence ratios, each respective bin-level sequence ratio in the plurality of bin-level sequence ratios corresponding to a respective bin in a plurality of bins. Each respective bin in the plurality of bins represents a corresponding region of a human reference genome, and each respective bin-level sequence ratio in the plurality of bin-level sequence ratios is determined from a sequencing of a plurality of nucleic acids in the first biological sample of the cancer from the patient and one or more reference samples.
[0036] The first dataset also comprises a plurality of segment-level sequence ratios, each respective segment-level sequence ratio in the plurality of segment-level sequence ratios corresponding to a segment in a plurality of segments. Each respective segment in the plurality of segments represents a corresponding region of the human reference genome encompassing a subset of adjacent bins in the plurality of bins, and the plurality of segment-level sequence ratios is determined from a measure of central tendency of the plurality of bin-level sequence ratios corresponding to the subset of adjacent bins encompassed by the respective segment.
[0037] The first dataset further comprises a plurality of segment-level measures of dispersion, where each respective segment-level measure of dispersion in the plurality of segment-level measures of dispersion (i) corresponds to a respective segment in the plurality of segments and (ii) is determined using the plurality of bin-level sequence ratios corresponding to the subset of adjacent bins encompassed by the respective segment.
[0038] The method further comprises determining whether the copy number variation of the target gene is a focal copy number variation by applying the first dataset to an algorithm having a plurality of copy number variation filters. When the patient has the aggressive form of cancer associated with focal copy number variation of the target gene, a first therapy for the aggressive form of the cancer is administered to the patient, and when the patient does not have the aggressive form of cancer associated with focal copy number variation of the target gene, a second therapy for a less aggressive form of the cancer is administered to the patient.
[0039] Additionally, there is a need in the art for improved methods and systems for identifying somatic tumor mutations in cell-free DNA, particularly where the sample has low tumor fractions. Advantageously, the present disclosure solves this and other needs in the art by providing improved somatic variant identification methodology that better accounts for locus-specific and/or sample-specific considerations to more accurately identify true somatic mutations in a liquid biopsy sample. For example, by using an application of Bayes' theorem to account for one or more of (i) the prevalence of variants at a specific locus in a specific cancer type, (ii) the variant allele fraction for the variant being evaluated, (iii) the prevalence of sequencing errors at a particular locus, and (iv) the actual sequencing error rate of a particular reaction, the variant filter methodologies described herein tune the specificity and sensitivity of variant count thresholds in a locus-specific fashion to achieve higher accuracy of true somatic variant calling in a liquid biopsy assay.
[0040] For example, in one aspect, the present disclosure provides a method of validating a somatic sequence variant in a test subject having a cancer condition. The method is performed at a computer system having one or more processors, and memory storing one or more programs for execution by the one or more processors. The method includes obtaining, from a first sequencing reaction, a corresponding sequence of each cell-free DNA fragment in a first plurality of cell-free DNA fragments in a liquid biopsy sample of the test subject, thus obtaining a first plurality of sequence reads. Each respective sequence read in the first plurality of sequence reads is aligned to a reference sequence for the species of the subject, thus identifying a variant allele fragment count for a candidate variant that maps to a locus in the reference sequence, and a locus fragment count for the locus encompassing the candidate variant.
[0041] The method further includes comparing the variant allele fragment count for the candidate variant against a dynamic variant count threshold for the locus in the reference sequence that the candidate variant maps to. The dynamic variant count threshold is based upon a pre-test odds of a positive variant call for the locus based on the prevalence of variants in a genomic region that includes the locus from a first set of nucleic acids obtained from a cohort of subjects having the cancer condition.
[0042] The method then includes rejecting or validating the variant as a true somatic variant based upon the dynamic variant count threshold. For instance, when the variant allele fragment count for the candidate variant satisfies the dynamic variant count threshold for the locus, the presence of the somatic sequence variant in the test subject is validated. And when the variant allele fragment count for the candidate variant does not satisfy the dynamic variant count threshold for the locus, the presence of the somatic sequence variant in the test subject is rejected.
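As a non-limiting sketch of how a locus-specific count threshold might follow from pre-test odds, the example below multiplies the pre-test odds by a binomial likelihood ratio and returns the smallest variant fragment count whose post-test probability reaches a target. The binomial model, the parameter names (`prevalence`, `expected_vaf`, `error_rate`, `min_posterior`), and the default target of 0.99 are assumptions made for illustration; the disclosure does not specify this formula.

```python
from math import log


def log_likelihood_ratio(k, n, p_variant, p_error):
    """Log ratio of binomial likelihoods for k variant fragments out of n.

    The binomial coefficient is identical under both hypotheses and cancels.
    """
    return (k * log(p_variant / p_error)
            + (n - k) * log((1 - p_variant) / (1 - p_error)))


def dynamic_threshold(depth, prevalence, expected_vaf, error_rate,
                      min_posterior=0.99):
    """Smallest variant fragment count whose post-test odds reach the target.

    prevalence: assumed pre-test probability of a true variant at the locus.
    expected_vaf: assumed variant allele fraction under the variant hypothesis.
    error_rate: assumed per-fragment sequencing error rate at the locus.
    Returns None when no count up to `depth` is sufficient.
    """
    log_prior_odds = log(prevalence / (1 - prevalence))
    log_target_odds = log(min_posterior / (1 - min_posterior))
    for k in range(1, depth + 1):
        llr = log_likelihood_ratio(k, depth, expected_vaf, error_rate)
        if log_prior_odds + llr >= log_target_odds:
            return k
    return None


def validate_variant(variant_count, threshold):
    """Validate the candidate when its fragment count satisfies the threshold."""
    return threshold is not None and variant_count >= threshold
```

Under these assumptions, a locus that is frequently mutated in the cohort receives a lower count threshold than a rarely mutated locus at the same depth and error rate, which is the locus-specific tuning described above.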
[0043] Additionally, there is a need in the art for improved methods and systems for determining accurate circulating tumor fraction estimates (ctFEs) in liquid biopsy assays.
The present disclosure solves this and other needs in the art by providing methods and systems for estimating the circulating tumor fraction of a liquid biopsy sample from a targeted-panel sequencing reaction. For example, by fitting segment-level coverage ratios for on-target and off-target sequence reads distributed relatively uniformly along the genome to integer copy states across a range of simulated tumor fractions (e.g., using maximum likelihood estimation, for example, with an expectation-maximization algorithm), the systems and methods described herein can generate an accurate estimate of the circulating tumor fraction of a liquid biopsy sample. This is achieved, in some embodiments, by identifying the expected coverage ratios, given the fitted integer copy states, that best match the experimental coverage ratios. Such an accurate estimate of the circulating tumor fraction can be used in conjunction with on-target sequencing results to improve variant detection identification, as well as serve as an informative biomarker itself.
[0044] For example, in one aspect, the present disclosure provides a method of estimating a circulating tumor fraction for a test subject from panel-enriched sequencing data for a plurality of sequences, at a computer system having one or more processors, and memory storing one or more programs for execution by the one or more processors.
[0045] The method includes obtaining, from a first panel-enriched sequencing reaction, a first plurality of sequences. The plurality of sequences includes a corresponding sequence for each cell-free DNA fragment in a first plurality of cell-free DNA fragments obtained from a liquid biopsy sample from the test subject, wherein each respective cell-free DNA fragment in the first plurality of cell-free DNA fragments corresponds to a respective probe sequence in a plurality of probe sequences used to enrich cell-free DNA fragments in the liquid biopsy sample in the first panel-enriched sequencing reaction.
[0046] The first plurality of sequences also includes a corresponding sequence for each cell-free DNA fragment in a second plurality of cell-free DNA fragments obtained from the liquid biopsy sample, wherein each respective cell-free DNA fragment in the second plurality of DNA fragments does not correspond to any probe sequence in the plurality of probe sequences.
[0047] The method includes determining a plurality of bin-level coverage ratios from the plurality of sequences, each respective bin-level coverage ratio in the plurality of bin-level coverage ratios corresponding to a respective bin in a plurality of bins. Each respective bin in the plurality of bins represents a corresponding region of a human reference genome. Each respective bin-level sequence ratio in the plurality of bin-level sequence ratios is determined from a comparison of (i) a number of sequence reads in the plurality of sequences that map to the corresponding bin and (ii) a number of sequence reads from one or more reference samples that map to the corresponding bin.
[0048] The method further includes determining a plurality of segment-level coverage ratios by forming a plurality of segments by grouping respective subsets of adjacent bins in the plurality of bins based on a similarity between the respective coverage ratios of the subset of adjacent bins, and determining, for each respective segment in the plurality of segments, a segment-level coverage ratio based on the corresponding bin-level coverage ratios for each bin in the respective segment.
[0049] For each respective simulated circulating tumor fraction in a plurality of simulated circulating tumor fractions, the method includes fitting each respective segment in the plurality of segments to a respective integer copy state in a plurality of integer copy states, by identifying the respective integer copy state in the plurality of integer copy states that best matches the segment-level coverage ratio, thus generating, for each respective simulated circulating tumor fraction in the plurality of simulated tumor fractions, a respective set of integer copy states for the plurality of segments.
[0050] The method further includes determining the circulating tumor fraction for the test subject based on a comparison between the corresponding segment-level coverage ratios and integer copy states across the plurality of simulated circulated tumor fractions. In some embodiments, the comparison includes optimization of an error between corresponding segment-level coverage ratios and integer copy states across the plurality of simulated circulated tumor fractions. In some embodiments, the comparison includes finding two or more local optima for fit (e.g., local minima for an error between corresponding segment-level coverage ratios and integer copy states across the plurality of simulated circulated tumor fractions) and choosing the local optimum (e.g., minimum) that is most consistent with one or more alternative estimations of the tumor fraction.
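The fitting procedure of paragraphs [0049] and [0050] can be illustrated with a simple grid search, offered here as a minimal sketch rather than the disclosed implementation. It assumes linear (not log-transformed) segment-level coverage ratios, a diploid normal background, a weighted squared-error measure of fit, and selection of the single global minimum; the expectation-maximization approach and the handling of multiple local minima mentioned in the disclosure are not reproduced. All function and parameter names are hypothetical.

```python
def expected_ratio(tumor_fraction, copies, normal_copies=2):
    """Expected linear coverage ratio when the tumor carries `copies` copies."""
    mixed = tumor_fraction * copies + (1 - tumor_fraction) * normal_copies
    return mixed / normal_copies


def fit_copy_states(segment_ratios, tumor_fraction, max_copies=8):
    """Assign each segment the integer copy state that best matches its ratio."""
    states = []
    for ratio in segment_ratios:
        best = min(range(max_copies + 1),
                   key=lambda c: abs(expected_ratio(tumor_fraction, c) - ratio))
        states.append(best)
    return states


def fit_error(segment_ratios, weights, tumor_fraction, max_copies=8):
    """Weighted squared error between observed and expected ratios."""
    states = fit_copy_states(segment_ratios, tumor_fraction, max_copies)
    return sum(w * (expected_ratio(tumor_fraction, c) - r) ** 2
               for r, w, c in zip(segment_ratios, weights, states))


def estimate_tumor_fraction(segment_ratios, weights, grid=None):
    """Return the simulated tumor fraction with the lowest fit error."""
    if grid is None:
        # Hypothetical grid of simulated tumor fractions from 0.01 to 0.99.
        grid = [i / 100 for i in range(1, 100)]
    return min(grid, key=lambda tf: fit_error(segment_ratios, weights, tf))
```

Because a single-copy gain at one tumor fraction can look like a two-copy gain at half that fraction, the error curve of such a search may have several minima; in this sketch the ambiguity is resolved only when the segments jointly constrain the fit, for example when a homozygous deletion is present.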
[0051] Additional aspects and advantages of the present disclosure will become readily apparent to those skilled in this art from the following detailed description, wherein only illustrative embodiments of the present disclosure are shown and described. As will be realized, the present disclosure is capable of other and different embodiments, and its several details are capable of modifications in various obvious respects, all without departing from the disclosure. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.
BRIEF DESCRIPTION OF THE DRAWINGS
[0052] Figures 1A, 1B, 1C1, 1D1, 1C2, 1D2, 1E2, 1F2, 1C3, and 1D3 collectively illustrate a block diagram of an example computing device for supporting clinical decisions in precision oncology using liquid biopsy assays (e.g., by validating a copy number variation, validating a somatic sequence variant in a test subject having a cancer condition, estimating the circulating tumor fraction of a liquid biopsy sample based on on-target and off-target sequence reads from targeted-panel sequencing data, etc.), in which dashed boxes represent optional portions of the method, in accordance with some embodiments of the present disclosure.
[0053] Figure 2A illustrates an example workflow for generating a clinical report based on information generated from analysis of one or more patient specimens, in accordance with some embodiments of the present disclosure.
[0054] Figure 2B illustrates an example of a distributed diagnostic environment for collecting and evaluating patient data for the purpose of precision oncology, in accordance with some embodiments of the present disclosure.
[0055] Figure 3 provides an example flow chart of processes and features for liquid biopsy sample collection and analysis for use in precision oncology, in which dashed boxes represent optional portions of the method, in accordance with some embodiments of the present disclosure.
[0056] Figures 4A, 4B, 4C, 4D, 4E, 4F1, 4F2, 4G1, 4G2, 4G3, and 4F3 collectively illustrate an example bioinformatics pipeline for precision oncology. Figure 4A provides an overview flow chart of processes and features in a bioinformatics pipeline, in accordance with some embodiments of the present disclosure. Figure 4B provides an overview of a bioinformatics pipeline executed with either a liquid biopsy sample alone or a liquid biopsy sample and a matched normal sample. Figure 4C illustrates that paired end reads from tumor and normal isolates are zipped and stored separately under the same order identifier, in accordance with some embodiments of the present disclosure. Figure 4D illustrates quality correction for FASTQ files, in accordance with some embodiments of the present disclosure. Figure 4E illustrates processes for obtaining tumor and normal BAM alignment files, in accordance with some embodiments of the present disclosure. Figure 4F1 provides a flow chart of a method for validating a copy number variation, in which dashed boxes represent optional portions of the method, in accordance with some embodiments of the present disclosure. Figure 4F2 provides a flow chart of a method for validating a somatic sequence variant in a test subject having a cancer condition, in which dashed boxes represent optional portions of the method, in accordance with some embodiments of the present disclosure. Figures 4G1, 4G2, and 4G3 illustrate a method of variant detection, in which dashed boxes represent optional portions of the method, in accordance with some embodiments of the present disclosure. Figure 4F3 provides an overview of a method for estimating the circulating tumor fraction for a liquid biopsy sample, based on targeted panel sequencing data, in which dashed boxes represent optional portions of the method, in accordance with some embodiments of the present disclosure.
[0057] Figures 5A1, 5B1, 5C1, 5D1, and 5E1 collectively provide a flow chart of processes and features for validating a copy number variation in a test subject, in which dashed boxes represent optional portions of the method, in accordance with some embodiments of the present disclosure.
[0058] Figures 5A2 and 5B2 collectively provide a flow chart of processes and features for validating a somatic sequence variant in a test subject, in which dashed boxes represent optional portions of the method, in accordance with some embodiments of the present disclosure.
[0059] Figures 5A3 and 5B3 collectively provide a flow chart of processes and features for estimating the circulating tumor fraction of a liquid biopsy sample based on on-target and off-target sequence reads from a targeted-panel sequencing data, in which dashed boxes represent optional portions of the method, in accordance with some embodiments of the present disclosure.
[0060] Figures 6A1, 6B1, and 6C1 collectively provide a flow chart of processes and features for treating a patient with a cancer containing a copy number variation of a target gene, in which dashed boxes represent optional portions of the method, in accordance with some embodiments of the present disclosure.
[0061] Figure 6A2 illustrates a flow chart of a method for obtaining a distribution of variant detection sensitivities as a function of circulating variant allele fraction from a cohort of subjects, in accordance with some embodiments of the present disclosure.
[0062] Figures 6A3, 6B3, and 6C3 collectively illustrate a process for fitting segment-level coverage ratios to an integer copy number (6A3 and 6B3) and subsequently determining the error associated with the fit (6C3) at a particular simulated circulating tumor fraction, in accordance with some embodiments of the present disclosure.
[0063] Figures 7A1 and 7B1 illustrate a non-focal amplified segment and a focal amplified segment comprising the MYC gene, in accordance with some embodiments of the present disclosure.
[0064] Figure 7C1 illustrates a focal deleted segment comprising the BRCA2 gene, in accordance with some embodiments of the present disclosure.
[0065] Figures 7A2 and 7B2 collectively illustrate a method of inferring an effect of a sequence variant as a gain-of-function or a loss-of-function of a gene, in accordance with some embodiments of the present disclosure.
[0066] Figure 7A3 illustrates an overview of an experimental and analytical workflow used for validation of the performance of a method for estimating the circulating tumor fraction of a liquid biopsy sample based on on-target and off-target sequence reads from a targeted-panel sequencing data, in accordance with some embodiments of the present disclosure.
[0067] Figures 8A, 8B, 8C, and 8D collectively illustrate results of an inter-assay comparison between a liquid biopsy assay, a digital droplet polymerase chain reaction (ddPCR), and a solid-tumor biopsy assay, in accordance with various embodiments of the present disclosure.
[0068] Figures 9A, 9B, 9C, 9D, 9E, 9F, 9G, and 9H collectively illustrate results of a comparison between circulating tumor fraction estimate (ctFE) and variant allele fraction (VAF) using an Off-Target Tumor Estimation Routine (OTTER) method, in accordance with various embodiments of the present disclosure.
[0069] Figures 10A and 10B collectively illustrate results of evaluating ctFE and mutational landscape according to cancer type, in accordance with various embodiments of the present disclosure.
[0070] Figures 11A, 11B, and 11C collectively illustrate results of evaluating associations between ctFE and advanced disease states, in accordance with various embodiments of the present disclosure.
[0071] Figures 12A, 12B, and 12C collectively illustrate results of comparing ctFE with recent clinical response outcomes, in accordance with various embodiments of the present disclosure.
[0072] Figure 13 illustrates a first table describing sensitivity for all SNVs, indels, CNVs, and rearrangements targeted in reference samples, in accordance with various embodiments of the present disclosure.
[0073] Figure 14 illustrates a second table describing sensitivity for all SNVs, indels, CNVs, and rearrangements targeted in reference samples, in accordance with various embodiments of the present disclosure.
[0074] Figure 15 illustrates a third table describing comparisons between the presently disclosed liquid biopsy assay and a commercial liquid biopsy kit, in accordance with various embodiments of the present disclosure.
[0075] Figures 16A, 16B, and 16C collectively illustrate a fourth table describing variants detected by a liquid biopsy assay, in accordance with various embodiments of the present disclosure.
[0076] Figure 17 illustrates a fifth table describing dynamic filtering methodology to further reduce discordance, in accordance with various embodiments of the present disclosure.
[0077] Figure 18 illustrates a sixth table describing cancer groups included in clinical profiling analysis, in accordance with various embodiments of the present disclosure.
[0078] Figure 19 illustrates an example plot of the errors between corresponding segment-level coverage ratios and integer copy states determined across a plurality of simulated circulated tumor fractions ranging from about 0 to about 1, in accordance with some embodiments of the disclosure.
[0079] Like reference numerals refer to corresponding parts throughout the several views of the drawings.
DETAILED DESCRIPTION
Introduction
[0080] As described above, conventional liquid biopsy assays do not provide accurate determination of copy number variations (CNVs) for actionable genomic targets, particularly focal amplifications. For example, some conventional methodologies determine copy number variations by mathematically converting copy ratios (e.g., of experimental samples compared to reference samples) to integer copy numbers based on tumor fraction estimates and known ploidy. These approaches have disadvantages due to the presence of artifactual variants and/or clonal heterogeneity in liquid biopsy samples, leading to unreliable tumor fraction estimates and, subsequently, unreliable copy number annotations. Furthermore, the identification of therapeutically actionable copy number variations is limited when using conventional methods because many large-scale (e.g., non-focal) copy number variations contain artifactual variants due to errors in normalization, poor sample quality, and/or other technical issues.
[0081] Thus, there is a need in the art for improved methods of validating CNV calls in order to distinguish between real and artifactual copy number variations. Specifically, there is a need in the art for a method of detecting focal copy number variations, e.g., in order to identify therapeutically actionable genomic alterations.
[0082] Advantageously, disclosed herein are methods and systems that do provide accurate determination of copy number variations by detecting actionable, focal copy number variations in circulating tumor DNA (ctDNA) with high confidence without the need for tumor fraction estimation. For example, in some embodiments, the methods and systems described herein utilize annotation and filtering that applies a statistical method to bin-level copy ratios, segment-level copy ratios and corresponding segment-level confidence intervals of binned and segmented sequence reads aligned to a reference genome. The statistical method filters out segments with non-focal copy number variations, which are either non-actionable, e.g., in the case of a copy number variation spanning a significant portion of a chromosome, or artifactual, e.g., due to incorrect data normalization.
[0083] As an example, Figure 4F1 illustrates a workflow of a method 400-1 for validating copy number variation, e.g., to identify therapeutically actionable genomic alterations, in accordance with some embodiments of the present disclosure.
[0084] In some embodiments, the methods described herein utilize conventional methodologies to putatively identify copy number variations, which are then validated using the methodologies described herein. For instance, in some embodiments, copy number variations (CNVs) are analyzed using a combination of an open-source tool, e.g., CNVkit, to putatively identify copy number variations, and a script, e.g., a Python script, to validate or reject the putative copy number variations, using the validation methodologies described herein. In other embodiments, the validation methodologies described herein are used to identify focal copy number variations independently of conventional bioinformatics tools, e.g., CNVkit.
[0085] As described herein, in some embodiments, the methods described herein include one or more data collection steps, in addition to data analysis and downstream steps. For example, as described below, e.g., with reference to Figures 2 and 3, in some embodiments, the methods include collection of a liquid biopsy sample and, optionally, one or more matching biological samples from the subject (e.g., a matched cancerous and/or matched non-cancerous sample from the subject). Likewise, as described below, e.g., with reference to Figures 2 and 3, in some embodiments, the methods include extraction of DNA from the liquid biopsy sample and, optionally, one or more matching biological samples from the subject (e.g., a matched cancerous and/or matched non-cancerous sample from the subject). Similarly, as described below, e.g., with reference to Figures 2 and 3, in some embodiments, the methods include nucleic acid sequencing of DNA from the liquid biopsy sample and, optionally, one or more matching biological samples from the subject (e.g., a matched cancerous and/or matched non-cancerous sample from the subject).
[0086] However, in other embodiments, the methods described herein begin with obtaining nucleic acid sequencing results, e.g., raw or collapsed sequence reads of DNA from a liquid biopsy sample and, optionally, one or more matching biological samples from the subject (e.g., a matched cancerous and/or matched non-cancerous sample from the subject), from which the statistics needed for focal CNV validation (e.g., bin-level sequence ratios, segment-level sequence ratios, and segment-level measures of dispersion) can be determined. For example, in some embodiments, sequencing data 122 for a patient 121 is accessed and/or downloaded over network 105 by system 100.
[0087] Likewise, in some embodiments, the methods described herein begin with obtaining genomic bin values (e.g., bin counts or bin coverages) for a sequencing of a liquid biopsy sample and, optionally, one or more matching biological samples from the subject (e.g., a matched cancerous and/or matched non-cancerous sample from the subject), from which the statistics needed for focal CNV validation (e.g., bin-level sequence ratios, segment-level sequence ratios, and segment-level measures of dispersion) can be determined. For example, in some embodiments, genomic bin values 135-cf-bv for a patient 121 are accessed and/or downloaded over network 105 by system 100.
[0088] Similarly, in some embodiments, the methods described herein begin with obtaining the statistics needed for focal CNV validation (e.g., bin-level sequence ratios, segment-level sequence ratios, and segment-level measures of dispersion) for a sequencing of a liquid biopsy sample and, optionally, one or more matching biological samples from the subject (e.g., a matched cancerous and/or matched non-cancerous sample from the subject), e.g., as an output of a conventional bioinformatics tool (such as CNVkit). For example, in some embodiments, bin-level sequence ratios 135-cf-br, segment-level sequence ratios 135-cf-sr, and segment-level measures of dispersion for a patient 121 are accessed and/or downloaded over network 105 by system 100.
[0089] Referring again to method 400-1 in Figure 4F1, in some embodiments, the method includes obtaining a dataset including cell-free DNA sequencing data (Block 402-1), and determining the statistics needed for focal CNV validation (e.g., bin-level sequence ratios, segment-level sequence ratios, and segment-level measures of dispersion). For instance, in some embodiments, system 100 obtains sequencing data 122 (e.g., sequence reads 123 and/or aligned sequences 124) and applies a copy number segmentation algorithm 153-b (e.g., CNVkit) to the sequencing data.
[0090] For example, in some embodiments, sequence reads 123 obtained from the sequencing dataset 122 are aligned to a reference human construct (Block 404-1), generating a plurality of aligned reads 124 (Block 406-1). Aligned cfDNA sequence reads are then optionally processed (e.g., using normalization, filtering, and/or quality control) (Block 408-1).
[0091] A copy number segmentation algorithm 153-b is then used for genomic region binning, coverage calculation, bias correction, normalization to a reference pool, segmentation, and/or visualization (Block 410-1). For example, in some embodiments, aligned sequence reads are sorted into bins (e.g., on-target bins 153-b-1-a and off-target bins 153-b-1-b) of pre-specified bin sizes (e.g., 100-150 base pairs) based on their genomic location using binning subroutine 153-b-1. For example, in some embodiments, binning subroutine 153-b-1 reads in mapped sequences 124 and pre-selected bins (e.g., target bins 153-b-1-a and off-target bins 153-b-1-b for target panel sequencing analysis) and assigns respective sequences to the bins based on their mapping within the reference genome. Bin values 135-bv (e.g., liquid biopsy genomic bin values 135-cf-bv) for each of the bins, e.g., bin counts or bin coverages, can be read out from binning subroutine 153-b-1. Bin values 135-bv are optionally pre-processed, e.g., normalized, standardized, corrected, etc., as described in further detail herein.
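For illustration, the assignment of mapped reads to pre-selected bins can be sketched as a sorted-interval lookup. This is not the binning subroutine of any particular tool; it assumes half-open, non-overlapping bins given as `(chromosome, start, end)` tuples and reads reduced to a single `(chromosome, position)` coordinate, and the function name `count_reads` is hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def count_reads(bins, read_positions):
    """Count reads per bin.

    bins: list of (chrom, start, end) tuples, half-open and non-overlapping.
    read_positions: iterable of (chrom, position) pairs for mapped reads.
    Returns a dict mapping each bin tuple to its read count.
    """
    # Group bin intervals by chromosome and sort them by start coordinate.
    intervals = defaultdict(list)
    for chrom, start, end in bins:
        intervals[chrom].append((start, end))
    for chrom in intervals:
        intervals[chrom].sort()
    starts = {chrom: [s for s, _ in ivs] for chrom, ivs in intervals.items()}

    counts = {b: 0 for b in bins}
    for chrom, pos in read_positions:
        if chrom not in intervals:
            continue  # read maps to a chromosome with no bins
        # Locate the last bin whose start is at or before the read position.
        i = bisect_right(starts[chrom], pos) - 1
        if i >= 0:
            start, end = intervals[chrom][i]
            if pos < end:
                counts[(chrom, start, end)] += 1
    return counts
```

Reads that fall between bins or on a chromosome without bins are simply ignored in this sketch; on-target and off-target bins can be passed together or counted in separate calls.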
[0092] Bin values 135-bv are then used to determine bin-level sequence ratios 135-br (e.g., liquid biopsy bin-level sequence ratios 135-cf-br). Briefly, a copy ratio subroutine 153-b-2 reads in bin values 135-bv and reference bin coverages 153-b-2-a determined for one or more reference samples (e.g., a matched non-cancerous sample of the subject or an average from a plurality of non-cancerous reference samples), and compares bin values for corresponding bins, thereby generating bin-level sequence ratios 135-br.
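A minimal sketch of the bin-by-bin comparison is shown below. It assumes the comparison is expressed as a log2 ratio of sample coverage to reference coverage, as is common in tools such as CNVkit, and adds a small pseudocount to avoid taking the logarithm of zero; the disclosure does not prescribe either choice, and the function name is hypothetical.

```python
from math import log2


def bin_level_ratios(sample_bins, reference_bins, pseudocount=0.5):
    """Compute a log2 ratio of sample to reference coverage for each bin.

    sample_bins, reference_bins: dicts mapping a bin identifier to a
    (normalized) bin value. Bins absent from the sample count as zero.
    pseudocount: hypothetical stabilizer that keeps the ratio finite.
    """
    ratios = {}
    for bin_id, reference in reference_bins.items():
        sample = sample_bins.get(bin_id, 0.0)
        ratios[bin_id] = log2((sample + pseudocount) / (reference + pseudocount))
    return ratios
```

On this scale, a bin with twice the reference coverage has a ratio of 1 and a bin with half the reference coverage has a ratio of -1, before accounting for the pseudocount.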
[0093] These bin-level sequence ratios 135-br are then used to group adjacent bins, having similar sequence ratios, into segments, e.g., using circular binary segmentation. For example, in some embodiments, segmentation subroutine 153-a-3 reads in and applies a segmentation model (e.g., a circular binary segmentation model) to bin-level sequence ratios 135-br, thereby generating a plurality of genomic segments, each corresponding to one or more contiguous bins.
[0094] Segment-level sequence ratios 135-sr (e.g., liquid biopsy segment-level sequence ratios 135-cf-sr) and segment-level measures of dispersion 135-sd (e.g., liquid biopsy segment-level measures of dispersion 135-cf-sd) can be determined using a statistics subroutine 153-a-4, which may be read out from the copy number segmentation algorithm 153-b, as illustrated in Figure 1D1, or may be separately implemented, e.g., by reading-in segment annotations (e.g., including bin assignments to each segment) generated by the segmentation subroutine 153-a-3 and bin-level sequence ratios 135-br from the copy ratio subroutine 153-b-2.
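The separately implemented form of the statistics step, which reads in segment annotations and bin-level ratios, can be sketched as follows. The sketch assumes the median as the measure of central tendency and the sample standard deviation as the measure of dispersion; other measures (for example, a confidence interval) could be substituted, and the function and field names are hypothetical.

```python
from statistics import median, stdev


def segment_statistics(bin_ratios, segments):
    """Summarize the bin-level ratios of each segment.

    bin_ratios: dict mapping a bin identifier to its bin-level ratio.
    segments: dict mapping a segment identifier to the list of adjacent
        bin identifiers assigned to that segment.
    """
    stats = {}
    for segment_id, bin_ids in segments.items():
        values = [bin_ratios[b] for b in bin_ids]
        stats[segment_id] = {
            "ratio": median(values),
            # A single-bin segment has no spread to estimate.
            "dispersion": stdev(values) if len(values) > 1 else 0.0,
            "n_bins": len(values),
        }
    return stats
```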
[0095] Optionally, a copy number annotation subroutine 153-a-5 reads in one or both of segment-level sequence ratios 135-sr (e.g., liquid biopsy segment-level sequence ratios 135-cf-sr) and segment-level measures of dispersion 135-sd, to provide copy number status annotations (e.g., amplified, neutral, or deleted) 135-cn (e.g., liquid biopsy copy number annotations 135-cf-cn) for one or more of the identified segments.
[0096] In some embodiments, the process above is also performed for a matched tumor tissue biopsy of the subject, e.g., thereby generating one or more tumor segment copy number annotations 135-t-cn.
[0097] The bin-level copy ratios, segment-level copy ratios and the corresponding segment-level confidence interval statistics obtained from the copy number segmentation algorithm 153 (e.g., CNVkit) output are used as inputs for a focal amplification / deletion validation algorithm, to determine whether putative segment amplifications and/or deletions can be validated. The copy number segmentation algorithm 153 applies a plurality of filters to statistics for one or more identified segments (Block 412-1). In some embodiments, these filters include one or more of:
• a bin-level measure of central tendency sequence ratio filter 153-a-1, e.g., a median bin-level copy ratio filter (Block 414-1);
• a segment-level measure of dispersion confidence filter 153-a-2, e.g., a segment-level confidence interval filter (Block 416-1);
• a bin-level measure of central tendency plus deviation filter 153-a-3, e.g., a median-plus-median absolute deviation (MAD) bin-level copy ratio filter (Block 418-1); and
• a segment-level sequence ratio filter 153-a-4, e.g., a segment-level copy ratio filter (Block 419-1).
In some embodiments, the plurality of filters includes at least two of the above filters. In some embodiments, the plurality of filters includes at least three of the above filters. In some embodiments, the plurality of filters includes all four of the above filters.
[0098] The copy number status annotation (e.g., amplified, neutral, deleted) for each segment is validated if it passes, or rejected if it fails, the plurality of copy number status annotation validation filters (Block 420-1). Specifically, when a filter in the plurality of filters is fired, the copy number annotation of the segment is rejected, and the copy number variation is determined to be a non-focal copy number variation. When no filter in the plurality of filters is fired, the copy number annotation of the segment is validated, and the copy number variation is determined to be a focal copy number variation (Block 422-1).
[0099] Validated copy number variations (e.g., focal amplifications and/or focal deletions of target genes) can then be used for variant analysis and clinical report generation. For example, focal copy number variations can be matched to the appropriate therapies and/or clinical trials (Block 426-1). A patient report indicating the validated copy number variations and any matched therapies and/or clinical trials can then be generated for use in precision oncology applications (Block 426-1).
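The reject-if-any-filter-fires logic described above can be sketched as follows for a putatively amplified segment. This is an illustration only: the function name `validate_amplification`, the dictionary keys, the use of log2 copy ratios, and every threshold and comparison inside the four filters are hypothetical placeholders, since the passage above does not specify them.

```python
from statistics import median

def validate_amplification(seg, chrom_bin_ratios, min_ratio=0.5):
    """Return ("focal" | "non-focal", names of fired filters).

    seg: dict with hypothetical keys "bin_ratios" (log2 ratios of the bins in
         the segment), "ratio" (segment-level log2 ratio), and "ci_low"
         (lower bound of the segment-level confidence interval).
    chrom_bin_ratios: log2 ratios of all bins on the same chromosome.
    min_ratio: placeholder threshold, not a value from the disclosure.
    """
    seg_median = median(seg["bin_ratios"])
    chrom_median = median(chrom_bin_ratios)
    chrom_mad = median([abs(r - chrom_median) for r in chrom_bin_ratios])

    # Each entry is True when the corresponding filter fires.
    fired_flags = {
        "median_bin_ratio": seg_median <= min_ratio,
        "segment_ci": seg["ci_low"] <= 0.0,
        "median_plus_mad": seg_median <= chrom_median + chrom_mad,
        "segment_ratio": seg["ratio"] <= min_ratio,
    }
    fired = [name for name, hit in fired_flags.items() if hit]

    # Any fired filter rejects the annotation; none fired validates it.
    return ("non-focal" if fired else "focal"), fired
```

In this sketch, a segment that is elevated only relative to a uniformly elevated chromosome fires the median-plus-MAD filter and is reported as non-focal, whereas a segment standing well above its chromosomal background passes all four filters.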
[0100] Additional embodiments of the presently disclosed systems and methods are described in further detail below with reference to Figures 2A and 4F1 (see, Example Workflow for Precision Oncology: Copy Number Variation Analysis) and Example 2 - Identification of Focal Copy Number Variation (see, Examples).
[0101] Copy number variations are considered a biomarker for cancer diagnosis and certain copy number variations are targets of treatment. For example, a subset of copy number variations that can be investigated using the methods disclosed herein include amplifications in MET, EGFR, ERBB2, CD274, CCNE1, and MYC, and deletions in BRCA1 and BRCA2. However, the analysis is not limited to these reportable genes. The method utilizes bin-level copy ratios, in addition to segment-level copy ratios, to validate the copy number variations of target genomic segments, thus allowing a highly sensitive characterization of local (both internal and external) changes in copy number to detect true copy number variations with greater accuracy. The presently disclosed systems and methods enable an automatic and reliable way to detect actionable, focal copy number variations via a liquid biopsy assay that is not achieved by conventional methods and is considerably less invasive than a tissue biopsy. The combination of liquid biopsy and copy number variation detection benefits physicians, clinicians, and medical institutions by providing a powerful tool for diagnosing cancer conditions and administering treatments. Furthermore, the methods disclosed herein can be performed alone or alongside traditional solid tumor biopsy methods as a validation method for detecting copy number variations.
[0102] Specifically, the annotation and filtering algorithm can be used to distinguish between actionable and non-actionable copy number variations of target biomarkers that are informative for precision oncology. For example, as reported in Example 2 (Identification of Focal Copy Number Variation; see Examples, below), when applied to two experimental samples both containing a conventionally obtained amplification status for the MYC gene, the method rejected the amplification in a first sample as a non-focal amplification, and validated the amplification in a second sample as a focal, and likely actionable, amplification.
[0103] The identification of actionable genomic alterations in a patient’s cancer genome is a difficult and computationally demanding problem. For instance, the determination of various prognostic metrics useful for precision oncology, such as variant allelic ratio, copy number variation, tumor mutational burden, microsatellite instability status, etc., requires analysis of hundreds of millions to billions of sequenced nucleic acid bases. An example of a typical bioinformatics pipeline established for this purpose includes at least five stages of analysis: assessment of the quality of raw next generation sequencing data, generation of collapsed nucleic acid fragment sequences and alignment of such sequences to a reference genome, detection of structural variants in the aligned sequence data, annotation of identified variants, and visualization of the data. See, Wadapurkar and Vyas, Informatics in Medicine Unlocked, 11:75-82 (2018), the content of which is hereby incorporated by reference, in its entirety, for all purposes. Each one of these procedures is computationally taxing in its own right.
[0104] For instance, the overall temporal and spatial computational complexities of simple global and local pairwise sequence alignment algorithms are quadratic in nature (e.g., second order problems) and increase rapidly as a function of the sizes of the nucleic acid sequences (n and m) being compared. Specifically, the temporal and spatial complexities of these sequence alignment algorithms can be estimated as O(mn), where O is the upper bound on the asymptotic growth rate of the algorithm, n is the number of bases in the first nucleic acid sequence, and m is the number of bases in the second nucleic acid sequence. See, Baichoo and Ouzounis, BioSystems, 156-157:72-85 (2017), the content of which is hereby incorporated by reference, in its entirety, for all purposes. Given that the human genome contains more than 3 billion bases, these alignment algorithms are extremely computationally taxing, especially when used to analyze next generation sequencing (NGS) data, which can generate more than 3 billion sequence reads per reaction.
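To make the O(mn) estimate concrete, the following is a minimal sketch of a global pairwise alignment score computed by dynamic programming in the style of Needleman-Wunsch. The scoring values (match +1, mismatch -1, gap -1) and the function name are illustrative assumptions. The table has (n+1) by (m+1) cells and each cell is filled in constant time, which is where the quadratic time and space costs come from.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score of the best global alignment of sequences a and b."""
    n, m = len(a), len(b)
    # (n+1) x (m+1) table: O(mn) space.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # a[:i] aligned entirely against gaps
    for j in range(1, m + 1):
        score[0][j] = j * gap  # b[:j] aligned entirely against gaps
    # Two nested loops over the table: O(mn) time.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[n][m]
```

Production aligners for NGS data rely on indexing and seed-and-extend heuristics rather than filling a full table against the whole genome; the sketch only illustrates the scaling of the basic algorithm.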
[0105] This is particularly true when performed in the context of a liquid biopsy assay, because liquid biopsy samples contain a complex mixture of short DNA fragments originating from many different germline (e.g., healthy) and diseased (e.g., cancerous) tissues. Thus, the cellular origins of the sequence reads are unknown, and the sequence signals originating from cancerous cells, which may constitute multiple sub-clonal populations, must be computationally deconvoluted from signals originating from germline and hematopoietic origins, in order to provide relevant information about the subject’s cancer. Thus, in addition to the computationally taxing processes required to align sequence reads to a human genome, there is a computation problem of determining whether a particular abnormal signal, e.g., one or more sequence reads corresponding to a genomic alteration, (i) is not an artifact, and (ii) originated from a cancerous source in the subject. This is increasingly difficult during the early stages of cancer — when treatment is presumably most effective — when only small amounts of ctDNA are diluted by germline and hematopoietic DNA.
[0106] In addition to the computationally demanding problem of aligning sequencing data to a human reference genome, the method comprises dividing the plurality of aligned sequence reads into “bins” (e.g., regions of a predefined span of base pairs corresponding to a reference genome), determining the copy ratio of each bin by calculating the differential read depths between experimental and reference samples, and grouping subsets of adjacent bins with shared copy ratios into segments. Grouping bins into segments divides each chromosome into regions of equal copy number that minimizes noise in the data. Such methods essentially perform a change-point or edge detection algorithm, which are either temporally limited or computationally intense. For example, in some embodiments, the segmentation is performed using circular binary segmentation. Circular binary segmentation calculates a statistic for each genomic position, where the statistic comprises a likelihood ratio for the null hypothesis (no change in copy ratio at the respective position) against the alternative (one change in copy ratio at the respective position), and where the null hypothesis is rejected if the statistic is greater than a predefined distribution threshold. Notably, in circular binary segmentation, the chromosome is assumed to be circularized, such that the calculation is performed recursively for each position (e.g., each bin) around the circumference of the circle to identify all change-points across the length of the chromosome. Furthermore, for each position (e.g., bin) under investigation, a reference distribution is generated using a permutation approach, where the copy ratios for the plurality of bins are randomized (typically 10,000 times). 
For some embodiments that utilize bins of approximately 100-150 bases long spanning a human reference genome of several billion bases, the number of permutations required to perform this recursive method contributes to a computationally intense procedure. See, for example, Olshen et al., Biostatistics 5, 4, 557-572 (2004), doi:10.1093/biostatistics/kxh008, which is hereby incorporated herein by reference in its entirety.
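The permutation-based reference distribution can be illustrated with a deliberately simplified sketch. The code below tests for a single change-point in a linear sequence of bin-level ratios; it is not circular binary segmentation itself, which considers pairs of breakpoints on a circularized chromosome and recurses on the resulting segments. The function names, the scaled difference-in-means statistic, and the default of 1,000 permutations are assumptions made for illustration.

```python
import random

def best_split(values):
    """Return (index, statistic) of the split that best separates the means."""
    n = len(values)
    total = sum(values)
    best_i, best_stat = None, 0.0
    left = 0.0
    for i in range(1, n):
        left += values[i - 1]
        diff = left / i - (total - left) / (n - i)
        # Scale by the split sizes so unbalanced splits are not favored.
        stat = abs(diff) * ((i * (n - i) / n) ** 0.5)
        if stat > best_stat:
            best_i, best_stat = i, stat
    return best_i, best_stat

def permutation_p_value(values, n_perm=1000, seed=0):
    """Estimate how often shuffled data yields a statistic at least as large."""
    rng = random.Random(seed)
    _, observed = best_split(values)
    shuffled = list(values)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # shuffling destroys any positional structure
        if best_split(shuffled)[1] >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Each permutation rescans every candidate split, so the cost grows with both the number of bins and the number of permutations, which is the source of the computational burden noted above.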
[0107] Advantageously, the present disclosure provides various systems and methods that improve the computational elucidation of actionable genomic alterations from a liquid biopsy sample of a cancer patient. Specifically, the present disclosure improves a computer-implemented method for identifying focal copy number variations by validating copy number status annotations assigned to genomic segments. As a further example, the application of the plurality of filters to the bin-level copy ratios, segment-level copy ratios, and corresponding segment-level confidence intervals is iterated, on a computer system, over each segment in the plurality of segments, and in some embodiments requires calculations using the copy ratios of each bin in the plurality of bins for each chromosome, for each segment in the plurality of segments. Taken together, the methods disclosed herein are a computational process designed to solve a computational problem.
[0108] Advantageously, the methods and systems described herein provide an improvement to the abovementioned technical problem (e.g., performing complex computer-implemented methods for analyzing a plurality of sequence reads for detection and validation of copy number variations in human genetic targets). The methods described herein therefore solve a problem in the computing art by improving upon conventional methods for identifying copy number variations for cancer diagnosis and treatment. For example, the application of a plurality of filters to the bin-level copy ratios, segment-level copy ratios, and corresponding segment-level confidence intervals provides a means for detecting true copy number variations for clinically relevant biomarkers and filtering out artifactual variations that are not therapeutically actionable, thus improving the accuracy and precision of genomic alteration detection in precision oncology.
[0109] The methods and systems described herein also improve precision oncology methods for assigning and/or administering treatment because of the improved accuracy of copy number variation detection. The identification of therapeutically actionable, focal copy number variations that can be included in a clinical report for patient and/or clinician review, and/or matched with appropriate therapies and/or clinical trials for treatment and/or monitoring, allows for more accurate assignment of treatments. Furthermore, the removal of non-therapeutically actionable, non-focal copy number variations reduces the risk of patients undergoing unnecessary or potentially harmful regimens due to misdiagnoses.
[0110] As described above, conventional liquid biopsy assays also do not provide accurate determination of variants (e.g., somatic variants), particularly at low circulating variant fractions. This is due, in large part, to the use of static variant count filters that require a common amount of support to call a variant positively as a somatic variant in sequencing data, regardless of the identity of the variant and its position within the genome. That is, conventional methods require that at least X number of unique sequence reads (e.g., 8 sequence reads) provide support for (e.g., encompass) a particular variant in order for that variant to be confirmed as a true somatic variant. While this may be fine for liquid biopsy samples having a high tumor fraction, where more copies of each somatic variant would be expected to be found, it results in a high number of false negatives when samples with lower tumor fractions are analyzed. On the other hand, simply lowering the threshold to allow calling of variants with lower support for a particular variant will increase the number of false positives, that is the number of untrue positive somatic variant calls, which are actually sequencing errors.
[0111] While there are many methods of performing noise suppression on ultra-high depth sequencing data commonly generated for liquid biopsy assays, there remains the fundamental fidelity boundary of sequencing by synthesis that cannot be overcome. Along with this, there are a variety of complexities and non-linearities within the ability to map reads across complex sets of genomic features and from these data, successfully call a variant. While it is possible to filter very stringently, one of the goals of liquid biopsy assays is to detect alterations at very low circulating fractions. This requires that low levels of support be sufficient to make a positive alteration call given that at 0.1% circulating fraction and an average depth of 5000x, only 5 reads containing alternate alleles will be present. Because of this, it is impossible to have a consistent set of thresholds that will be used to filter variants as any filter will either be too stringent or too permissive depending on the variant context and local sequence specific error generation models.
[0112] Advantageously, the present disclosure provides methods and systems that more accurately call somatic variants by adjusting the variant count threshold in a locus-by-locus fashion, e.g., by lowering the variant count threshold when there is an increased likelihood (orthogonal to the variant count in the sequencing reaction) that a variant at a particular locus is a true somatic variant and/or by raising the variant count threshold when there is an increased likelihood (orthogonal to the variant count in the sequencing reaction) that a variant at a particular locus is a result of a sequencing error, rather than a true somatic variant.
[0113] For example, in some embodiments, the methods and systems described herein employ a generalized application of Bayes’ Theorem through the likelihood ratio test that allows dynamic calibration of filtering thresholds for diagnostic assays. These thresholds are based on one or more of a sample-specific error rate, a methodology-specific sequencing error rate (e.g., from a pool of process matched healthy control samples), an estimate of the variant allele fraction for the variant being evaluated, and a historical likelihood that a variant would be present at a particular locus in a particular cancer (e.g., derived from an extensive cohort of human solid tumor tissue samples to inform probability models). This results in high sensitivity and specificity in variant detection, allowing identification of actionable oncologic targets, as well as determination of a precise limit of detection to reduce the occurrence of false negatives.
[0114] For instance, in some embodiments, the dynamic variant filtering methodology described herein uses an application of Bayes’ Theorem to dynamically tune a variant count threshold for calling a somatic variant at a particular genomic region based on the prevalence of similar mutations within that genomic region in similar cancers. For instance, where there is a high prevalence of a somatic variant in a given gene for a particular cancer (e.g., BRCA1 mutations are common in breast cancers), the dynamic filtering method accounts for this prior (e.g., the prior knowledge that BRCA mutations are commonly found in breast cancers) by setting a lower variant count threshold to call somatic variants in the BRCA1 gene for a breast cancer. That is, the dynamic filtering methodology requires less evidence in order to call a variant in the BRCA1 gene when the subject has breast cancer than when the subject has a different cancer that is not associated with a high prevalence of BRCA1 mutations.
[0115] In some embodiments, the dynamic variant filtering methodology described herein uses an application of Bayes’ Theorem to dynamically tune a variant count threshold for calling a somatic variant based on an estimated variant allele fraction for the variant being evaluated. That is, the dynamic filtering methodology takes into account the fact that in a sample having a lower tumor fraction, and therefore a lower variant allele fraction, fewer sequences encompassing a somatic variant would be expected than in a sample having a higher tumor fraction, and therefore a higher variant allele fraction. Accordingly, the sensitivity and specificity of the dynamic filter are tuned to account for the expectation that a higher percentage of variant sequences with low sequence counts (e.g., lower support) represent true somatic variants in a sample with a low tumor fraction than in a sample with a high tumor fraction, for which a higher percentage of variant sequences with low sequence counts represent sequencing errors.
[0116] In some embodiments, the dynamic variant filtering methodology described herein uses an application of Bayes’ Theorem to dynamically tune a variant count threshold for calling a somatic variant at a particular genomic locus based on a historical sequencing error rate for the locus. That is, the dynamic filtering methodology takes into account the fact that at genomic loci that are more prone to sequencing errors, such as loci with short nucleotide repeat sequences (e.g., di-nucleotide or tri-nucleotide repeats), there is a higher likelihood that a particular variant is a product of a sequencing error, rather than a true somatic mutation, than at a locus that is not prone to sequencing errors.
[0117] Similarly, in some embodiments, the dynamic variant filtering methodology described herein uses an application of Bayes’ Theorem to dynamically tune a variant count threshold for calling a somatic variant at a particular genomic locus based on a reaction-specific sequencing error rate. That is, the dynamic filtering methodology takes into account the fact that in reactions with higher sequencing error rates there is a higher likelihood that a particular variant is a product of a sequencing error, rather than a true somatic mutation.
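One way to picture how a prior, an expected variant allele fraction, and an error rate can jointly move a variant count threshold is the following sketch. It assumes a simple binomial model for the number of variant-supporting reads under each hypothesis and calls a variant when the posterior odds reach a cutoff. The function name, the binomial model, and the cutoff of 1.0 are illustrative assumptions, not the model of the disclosure.

```python
import math

def min_supporting_reads(depth, vaf, error_rate, prior, odds_cutoff=1.0):
    """Smallest variant read count whose posterior odds reach odds_cutoff.

    depth:      total reads covering the locus.
    vaf:        expected variant allele fraction if the variant is real.
    error_rate: per-read probability of the same base change arising by error
                (assumed smaller than vaf).
    prior:      prior probability that a true somatic variant is present at
                this locus in this cancer type (hypothetical input).
    """
    prior_log_odds = math.log(prior / (1.0 - prior))
    log_cutoff = math.log(odds_cutoff)
    per_alt = math.log(vaf / error_rate)
    per_ref = math.log((1.0 - vaf) / (1.0 - error_rate))
    for k in range(depth + 1):
        # Binomial coefficients cancel in the likelihood ratio.
        log_lr = k * per_alt + (depth - k) * per_ref
        if prior_log_odds + log_lr >= log_cutoff:
            return k
    return None  # no count at this depth satisfies the cutoff
```

Under this model, raising the prior lowers the required count, while raising the error rate increases it, mirroring the locus-by-locus adjustments described above.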
[0118] The present disclosure provides improved systems and methods for precision oncology based on improved variant calling in liquid biopsy data. The various improvements described herein, e.g., improved variant detection at low circulating fractions, are embodied in an example liquid biopsy workflow described in Examples 2 and 3. These examples describe an example liquid biopsy assay employing a 105-gene hybrid-capture next-generation sequencing (NGS) panel spanning 270 kb of the human genome, configured to detect targets in four variant classes, including single nucleotide variants (SNVs), insertions and/or deletions (indels), copy number variants (CNVs), and gene rearrangements. To establish robust clinical performance, extensive validation studies were conducted that demonstrated high sensitivity and specificity. Accordingly, the example liquid biopsy assay detected actionable variants with high accuracy in comparison to a commercial ctDNA NGS kit, commercial solid tumor biopsy-based assays, such as a solid tumor biopsy NGS tissue assay, and digital droplet PCR (ddPCR). As shown in the results of Figure 17, the methods and systems disclosed herein reduced false positive variant calling by 11.45% compared to conventional variant detection methods.
[0119] As described in detail above, the identification of actionable genomic alterations in a patient’s cancer genome is a difficult and computationally demanding problem.
[0120] Advantageously, the present disclosure provides various systems and methods that improve the computational elucidation of actionable genomic alterations from a liquid biopsy sample of a cancer patient. Specifically, the present disclosure improves a method for identifying variants in ctDNA using a dynamic thresholding approach. As described above, the disclosed methods and systems are necessarily computer-implemented due to their complexity and heavy computational requirements, and thus solve a problem in the computing art.
[0121] Advantageously, the methods and systems described herein provide an improvement to the abovementioned technical problem (e.g., performing complex computer-implemented methods for identifying variants in ctDNA using a dynamic thresholding approach). The methods described herein therefore solve a problem in the computing art by improving upon conventional methods for identifying variants (e.g., actionable oncologic targets) for cancer diagnosis and treatment. For example, the application of Bayes’ Theorem through the likelihood ratio test provides a means for improving detection of true positive variants and reducing detection of false positive variants for clinically relevant biomarkers, thus improving the accuracy and precision of genomic alteration detection in precision oncology.
[0122] The methods and systems described herein also improve precision oncology methods for assigning and/or administering treatment because of the improved accuracy of variation detection. The identification of therapeutically actionable variants that can be included in a clinical report for patient and/or clinician review, and/or matched with appropriate therapies and/or clinical trials for treatment and/or monitoring, allows for more accurate assignment of treatments. Furthermore, the removal of false positive variant detection reduces the risk of patients undergoing unnecessary or potentially harmful regimens due to misdiagnoses.
[0123] Additionally, as described above, conventional liquid biopsy assays do not provide accurate determination of circulating tumor fraction estimates (ctFEs). For example, while low-pass, whole-genome sequencing can be used to estimate tumor fractions, somatic variant sequences are poorly identified from low-pass, whole genome sequencing data, particularly from samples having low tumor fractions. Accordingly, conventional liquid biopsy assays typically use targeted-panel sequencing in order to achieve higher sequence coverage required to identify somatic variants present at low levels within the sample. However, targeted-panel sequencing data does not span a large enough portion of the genome to accurately estimate tumor fraction. Rather, tumor fraction estimates obtained using variant allele fractions (VAFs) in targeted-panel sequencing data are noisy, due to variant tissue source and capture bias.
[0124] Advantageously, the present disclosure provides methods and systems that do provide accurate determination of circulating tumor fraction estimates by using on-target and off-target sequence reads from targeted-panel sequencing data. For example, in some embodiments, the methods and systems described herein fit experimental coverage ratios for segmented sequence reads across the genome to integer copy numbers across a range of simulated tumor fractions. These fitted copy numbers can then be used to determine the expected coverage ratio for the segment, at the given simulated tumor fraction. The aggregate difference between the experimental coverage ratios for all segments and the expected coverage ratios based on the fitted copy number at the given simulated tumor fraction is used as a measure of the accuracy of the fit. That is, where the experimental coverage ratios closely match the expected coverage ratios, the simulated tumor fraction is a good estimate of the actual tumor fraction of the sample. Likewise, where the experimental coverage ratios do not closely match the expected coverage ratios, the simulated tumor fraction is a poor estimate of the actual tumor fraction of the sample.
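The fitting procedure described above can be sketched as a grid search over simulated tumor fractions. The sketch assumes a diploid normal background, so that a segment with integer tumor copy number c at tumor fraction t has expected coverage ratio (t·c + (1 − t)·2) / 2. That formula, the function names, the maximum copy number, the grid spacing, and the use of a summed absolute error are all assumptions made for illustration.

```python
def expected_ratio(copy_number, tumor_fraction):
    """Expected coverage ratio, assuming a diploid (2-copy) normal background."""
    return (tumor_fraction * copy_number + (1.0 - tumor_fraction) * 2.0) / 2.0

def fit_error(observed_ratios, tumor_fraction, max_copy=8):
    """Aggregate distance between observed and best-fitting expected ratios."""
    total = 0.0
    for obs in observed_ratios:
        # Fit each segment to the integer copy number with the closest ratio.
        total += min(
            abs(obs - expected_ratio(c, tumor_fraction))
            for c in range(max_copy + 1)
        )
    return total

def estimate_tumor_fraction(observed_ratios, grid=None, max_copy=8):
    """Return the simulated tumor fraction with the smallest aggregate error."""
    if grid is None:
        grid = [i / 100 for i in range(1, 100)]  # 0.01 .. 0.99
    return min(grid, key=lambda t: fit_error(observed_ratios, t, max_copy))
```

A plain minimum-error search like this can be ambiguous, because different combinations of tumor fraction and copy number may explain the same ratios equally well (for example, one extra copy at a given tumor fraction versus two extra copies at half that fraction). A practical implementation would need a likelihood model or additional constraints to resolve such ties.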
[0125] By using on-target and off-target sequence reads, the systems and methods described herein leverage data collected across a majority of the human genome, which allows for more accurate estimation of circulating tumor fraction than data that is limited to on-target probe regions. Advantageously, this method allows for both accurate tumor fraction estimation and robust variant identification from a single, low-cost sequencing reaction. Previously, in order to generate suitable data for both accurate tumor fraction estimate and robust variant identification two sequencing reactions would need to be performed; a low-pass whole genome sequencing reaction to generate data across the genome for estimating circulating tumor fraction and a targeted-panel sequencing reaction to generate sufficiently deep sequencing data to identify variants.
[0126] Accordingly, the systems and methods described herein can be used in conjunction with variant detection methods that rely on targeted panel sequencing, such as high-depth sequencing reactions. By ensuring uniform distribution of sequence reads across a genome (e.g., by a process of binning sequencing reads and correcting bins for size, GC content, sequencing depth, etc.), the systems and methods described herein ensure that any variation detected in regions of the genome are representative of the reference genome. This approach reduces noise resulting from capture bias, which can result in unreliable circulating tumor fraction estimates.
[0127] By using a maximum likelihood estimation (e.g., an expectation-maximization algorithm) to fit on-target and off-target sequence reads to genomic variations (e.g., integer copy states), the systems and methods described herein further improve the accuracy and reliability of circulating tumor fraction estimates. For example, in some embodiments, the sequencing coverage of on-target and off-target sequence reads are used to determine a test coverage ratio for regions of the genome in a test liquid biopsy sample. The test coverage ratio is compared to a set of expected coverage ratios obtained using assumptions for expected copy states and expected tumor fractions, which gives a distance (e.g., an error) of the test coverage ratio from the expected copy state. Using this model, by minimizing the distance (e.g., the error) between test parameters and expected parameters, it is possible to estimate the test tumor fraction with high confidence.
[0128] An improved method for obtaining accurate circulating tumor fraction estimates provides several benefits to liquid biopsies. Advantageously, more reliable ctFEs improve the classification accuracy of detected variants as somatic or germline variants (e.g., any variant detected at or below the ctFE can be classified as a somatic variant with high confidence). In addition, accurate ctFEs can greatly improve the sensitivity of detection of clinically relevant copy number variations, including integer copy number calling. Furthermore, in some embodiments, ctFEs are used as biomarkers for tumor burden, metastases, disease progression, or treatment resistance. For example, ctFEs have been shown to correlate with tumor volumes and vary in response to treatment.
[0129] As a result, the methods and systems disclosed herein provide a sensitive, cost-effective, and minimally invasive method to monitor patients for response to therapy, disease burden, relapse, progression, and/or emerging resistance mutations, which can translate into better care for patients. When used as part of the course of care, serial ctFE monitoring can predict objective measures of progression in at-risk individuals. Due to cost and convenience of sampling, the methods and systems disclosed herein can be applied at shorter time intervals than radiographic methods and can allow for more timely intervention in the case of disease progression.
[0130] Additionally, the methods and systems disclosed herein provide benefits to clinicians by generating more accurate variant calls and/or informative ctFE biomarkers that can aid in the prediction of clinical outcomes in patients and/or the selection of appropriate treatment plans.
[0131] Specifically, a validation of the performance of a method for on-target and off-target tumor estimation, in accordance with some embodiments of the present disclosure, revealed a correlation between ctFEs and metastases and disease progression. For example, as reported in Examples 2 and 3, when the method is applied to matched, de-identified clinical data for a cohort of 1,000 patients, high ctFEs were found to (i) correlate well with estimates derived from low-pass, whole genome sequencing, (ii) be a highly specific predictor of metastases, (iii) be positively correlated with reported “progressive disease” and (iv) be negatively correlated with better clinical outcomes. Figure 7A3 provides an overview of an experimental and analytical workflow used for validation of the off-target tumor estimation routine (OTTER).
[0132] As described in detail above, the identification of actionable genomic alterations in a patient’s cancer genome is a difficult and computationally demanding problem.
[0133] Advantageously, the present disclosure provides various systems and methods that improve the computational elucidation of actionable genomic alterations from a liquid biopsy sample of a cancer patient. Specifically, the present disclosure improves upon the accuracy of circulating tumor fractions estimated from targeted-panel sequencing. Moreover, because the methods described herein eliminate the need to process data from two different sequencing reactions, the disclosure lowers the computational budget for accurately estimating circulating tumor fractions and identifying actionable variants. As described above, the disclosed methods and systems are necessarily computer-implemented due to their complexity and heavy computational requirements, and thus solve a problem in the computing art.
[0134] Advantageously, the methods and systems described herein provide an improvement to the abovementioned technical problem (e.g., performing complex computer-implemented methods for determining accurate circulating tumor fraction estimates). The methods described herein therefore solve a problem in the computing art by improving upon conventional methods for determining tumor fraction estimates for cancer diagnosis, monitoring, and treatment. For example, the application of a maximum likelihood estimation (e.g., an expectation-maximization algorithm) to estimate genomic alterations using on-target and off-target sequence reads in liquid biopsy samples improves upon conventional approaches for precision oncology by providing highly reliable circulating tumor fraction estimates, while allowing concurrent variant detection in targeted panel sequencing of liquid biopsy samples. This in turn lowers the computational budget required for these processes, thereby improving the speed and lowering the power requirements of the computer.
[0135] The methods and systems described herein also improve precision oncology methods for assigning and/or administering treatment because of the improved accuracy of circulating tumor fraction estimations. Accurate ctFEs can be reported as biomarkers and/or used in downstream analysis for identification of therapeutically actionable variants to be included in a clinical report for patient and/or clinician review. Additionally, ctFEs and any therapeutically actionable variants identified using ctFEs can be matched with appropriate therapies and/or clinical trials, allowing for more accurate assignment of treatments. The improved accuracy of biomarker detection increases the chance of efficacy and reduces the risk of patients undergoing unnecessary or potentially harmful regimens due to misdiagnoses.
Definitions
[0136] As used herein, the term “subject” refers to any living or non-living organism including, but not limited to, a human (e.g., a male human, female human, fetus, pregnant female, child, or the like), a non-human mammal, or a non-human animal. Any human or non-human animal can serve as a subject, including but not limited to mammal, reptile, avian, amphibian, fish, ungulate, ruminant, bovine (e.g., cattle), equine (e.g., horse), caprine and ovine (e.g., sheep, goat), swine (e.g., pig), camelid (e.g., camel, llama, alpaca), monkey, ape (e.g., gorilla, chimpanzee), ursid (e.g., bear), poultry, dog, cat, mouse, rat, fish, dolphin, whale and shark. In some embodiments, a subject is a male or female of any age (e.g., a man, a woman, or a child).
[0137] As used herein, the terms “control,” “control sample,” “reference,” “reference sample,” “normal,” and “normal sample” describe a sample from a non-diseased tissue. In some embodiments, such a sample is from a subject that does not have a particular condition (e.g., cancer). In other embodiments, such a sample is an internal control from a subject, e.g., who may or may not have the particular disease (e.g., cancer), but is from a healthy tissue of the subject. For example, where a liquid or solid tumor sample is obtained from a subject with cancer, an internal control sample may be obtained from a healthy tissue of the subject, e.g., a white blood cell sample from a subject without a blood cancer or a solid germline tissue sample from the subject. Accordingly, a reference sample can be obtained from the subject or from a database, e.g., from a second subject who does not have the particular disease (e.g., cancer).
[0138] As used herein, the terms “cancer,” “cancerous tissue,” and “tumor” refer to an abnormal mass of tissue in which the growth of the mass surpasses, and is not coordinated with, the growth of normal tissue, including both solid masses (e.g., as in a solid tumor) and fluid masses (e.g., as in a hematological cancer). A cancer or tumor can be defined as “benign” or “malignant” depending on the following characteristics: degree of cellular differentiation including morphology and functionality, rate of growth, local invasion and metastasis. A “benign” tumor can be well differentiated, have characteristically slower growth than a malignant tumor and remain localized to the site of origin. In addition, in some cases a benign tumor does not have the capacity to infiltrate, invade or metastasize to distant sites. A “malignant” tumor can be poorly differentiated (anaplasia) and have characteristically rapid growth accompanied by progressive infiltration, invasion, and destruction of the surrounding tissue. Furthermore, a malignant tumor can have the capacity to metastasize to distant sites. Accordingly, a cancer cell is a cell found within the abnormal mass of tissue whose growth is not coordinated with the growth of normal tissue. Accordingly, a “tumor sample” refers to a biological sample obtained or derived from a tumor of a subject, as described herein.
[0139] Non-limiting examples of cancer types include ovarian cancer, cervical cancer, uveal melanoma, colorectal cancer, chromophobe renal cell carcinoma, liver cancer, endocrine tumor, oropharyngeal cancer, retinoblastoma, biliary cancer, adrenal cancer, neural cancer, neuroblastoma, basal cell carcinoma, brain cancer, breast cancer, non-clear cell renal cell carcinoma, glioblastoma, glioma, kidney cancer, gastrointestinal stromal tumor, medulloblastoma, bladder cancer, gastric cancer, bone cancer, non-small cell lung cancer, thymoma, prostate cancer, clear cell renal cell carcinoma, skin cancer, thyroid cancer, sarcoma, testicular cancer, head and neck cancer (e.g., head and neck squamous cell carcinoma), meningioma, peritoneal cancer, endometrial cancer, pancreatic cancer, mesothelioma, esophageal cancer, small cell lung cancer, Her2 negative breast cancer, ovarian serous carcinoma, HR+ breast cancer, uterine serous carcinoma, uterine corpus endometrial carcinoma, gastroesophageal junction adenocarcinoma, gallbladder cancer, chordoma, and papillary renal cell carcinoma.
[0140] As used herein, the terms “cancer state” or “cancer condition” refer to a characteristic of a cancer patient's condition, e.g., a diagnostic status, a type of cancer, a location of cancer, a primary origin of a cancer, a cancer stage, a cancer prognosis, and/or one or more additional characteristics of a cancer (e.g., tumor characteristics such as morphology, heterogeneity, size, etc.). In some embodiments, one or more additional personal characteristics of the subject are used to further describe the cancer state or cancer condition of the subject, e.g., age, gender, weight, race, personal habits (e.g., smoking, drinking, diet), other pertinent medical conditions (e.g., high blood pressure, dry skin, other diseases), current medications, allergies, pertinent medical history, current side effects of cancer treatments and other medications, etc.
[0141] As used herein, the term “liquid biopsy” sample refers to a liquid sample obtained from a subject that includes cell-free DNA. Examples of liquid biopsy samples include, but are not limited to, blood, whole blood, plasma, serum, urine, cerebrospinal fluid, fecal material, saliva, sweat, tears, pleural fluid, pericardial fluid, or peritoneal fluid of the subject. In some embodiments, a liquid biopsy sample is a cell-free sample, e.g., a cell-free blood sample. In some embodiments, a liquid biopsy sample is obtained from a subject with cancer. In some embodiments, a liquid biopsy sample is collected from a subject with an unknown cancer status, e.g., for use in determining a cancer status of the subject. Likewise, in some embodiments, a liquid biopsy is collected from a subject with a non-cancerous disorder, e.g., a cardiovascular disease. In some embodiments, a liquid biopsy is collected from a subject with an unknown status for a non-cancerous disorder, e.g., for use in determining a non-cancerous disorder status of the subject.
[0142] As used herein, the terms “cell-free DNA” and “cfDNA” interchangeably refer to DNA fragments that circulate in a subject’s body (e.g., bloodstream) and originate from one or more healthy cells and/or from one or more cancer cells. These DNA molecules are found outside cells, in bodily fluids such as blood, whole blood, plasma, serum, urine, cerebrospinal fluid, fecal material, saliva, sweat, tears, pleural fluid, pericardial fluid, or peritoneal fluid of a subject, and are believed to be fragments of genomic DNA expelled from healthy and/or cancerous cells, e.g., upon apoptosis and lysis of the cellular envelope.
[0143] As used herein, the term “locus” refers to a position (e.g., a site) within a genome, e.g., on a particular chromosome. In some embodiments, a locus refers to a single nucleotide position, on a particular chromosome, within a genome. In some embodiments, a locus refers to a group of nucleotide positions within a genome. In some instances, a locus is defined by a mutation (e.g., substitution, insertion, deletion, inversion, or translocation) of consecutive nucleotides within a cancer genome. In some instances, a locus is defined by a gene, a sub-genic structure (e.g., a regulatory element, exon, intron, or combination thereof), or a predefined span of a chromosome. Because normal mammalian cells have diploid genomes, a normal mammalian genome (e.g., a human genome) will generally have two copies of every locus in the genome, or at least two copies of every locus located on the autosomal chromosomes, e.g., one copy on the maternal autosomal chromosome and one copy on the paternal autosomal chromosome.
[0144] As used herein, the term “allele” refers to a particular sequence of one or more nucleotides at a chromosomal locus. In a haploid organism, the subject has one allele at every chromosomal locus. In a diploid organism, the subject has two alleles at every chromosomal locus.
[0145] As used herein, the term “base pair” or “bp” refers to a unit consisting of two nucleobases bound to each other by hydrogen bonds. Generally, the size of an organism's genome is measured in base pairs because DNA is typically double stranded. However, some viruses have single-stranded DNA or RNA genomes.
[0146] As used herein, the terms “genomic alteration,” “mutation,” and “variant” refer to a detectable change in the genetic material of one or more cells. A genomic alteration, mutation, or variant can refer to various types of changes in the genetic material of a cell, including changes in the primary genome sequence at single or multiple nucleotide positions, e.g., a single nucleotide variant (SNV), a multi-nucleotide variant (MNV), an indel (e.g., an insertion or deletion of nucleotides), a DNA rearrangement (e.g., an inversion or translocation of a portion of a chromosome or chromosomes), a variation in the copy number of a locus (e.g., an exon, gene, or a large span of a chromosome) (CNV), a partial or complete change in the ploidy of the cell, as well as changes in the epigenetic information of a genome, such as altered DNA methylation patterns. In some embodiments, a mutation is a change in the genetic information of the cell relative to a particular reference genome, or one or more ‘normal’ alleles found in the population of the species of the subject. For instance, mutations can be found in both germline cells (e.g., non-cancerous, ‘normal’ cells) of a subject and in abnormal cells (e.g., pre-cancerous or cancerous cells) of the subject. As such, a mutation in a germline of the subject (e.g., which is found in substantially all ‘normal cells’ in the subject) is identified relative to a reference genome for the species of the subject. However, many loci of a reference genome of a species are associated with several variant alleles that are significantly represented in the population of the subject and are not associated with a diseased state, e.g., such that they would not be considered ‘mutations.’ By contrast, in some embodiments, a mutation in a cancerous cell of a subject can be identified relative to either a reference genome of the subject or to the subject’s own germline genome. 
In certain instances, identification of both types of variants can be informative. For instance, in some instances, a mutation that is present in both the cancer genome of the subject and the germline of the subject is informative for precision oncology when the mutation is a so-called ‘driver mutation,’ which contributes to the initiation and/or development of a cancer. However, in other instances, a mutation that is present in both the cancer genome of the subject and the germline of the subject is not informative for precision oncology, e.g., when the mutation is a so-called ‘passenger mutation,’ which does not contribute to the initiation and/or development of the cancer. Likewise, in some instances, a mutation that is present in the cancer genome of the subject but not the germline of the subject is informative for precision oncology, e.g., where the mutation is a driver mutation and/or the mutation facilitates a therapeutic approach, e.g., by differentiating cancer cells from normal cells in a therapeutically actionable way. However, in some instances, a mutation that is present in the cancer genome but not the germline of a subject is not informative for precision oncology, e.g., where the mutation is a passenger mutation and/or where the mutation fails to differentiate the cancer cell from a germline cell in a therapeutically actionable way.
[0147] As used herein, the terms “focal copy number variation,” “focal copy number alteration,” “focal copy number variant,” and the like interchangeably refer to a genomic variation, relative to a reference genome, in the copy number of a small genomic segment. Unless otherwise specified, a small genomic segment is less than 30 Mb. However, in some embodiments, a small genomic segment is less than 25 Mb, less than 20 Mb, less than 15 Mb, less than 10 Mb, less than 5 Mb, less than 4 Mb, less than 3 Mb, less than 2 Mb, less than 1 Mb, or smaller. Generally, focal copy number variations range from several hundred bases to tens of Mb. In some embodiments, a focal copy number variation consists of one or a few exons of a gene or several genes. For more information on focal copy number variations, see, for example, Nord et al., Int. J. Cancer, 126, 1390-1402 (2010), which is hereby incorporated herein by reference in its entirety.
[0148] As used herein, the term “reference allele” refers to the sequence of one or more nucleotides at a chromosomal locus that is either the predominant allele represented at that chromosomal locus within the population of the species (e.g., the “wild-type” sequence), or an allele that is predefined within a reference genome for the species.
[0149] As used herein, the term “variant allele” refers to a sequence of one or more nucleotides at a chromosomal locus that is either not the predominant allele represented at that chromosomal locus within the population of the species (e.g., not the “wild-type” sequence), or not an allele that is predefined within a reference sequence construct (e.g., a reference genome or set of reference genomes) for the species. In some instances, sequence isoforms found within the population of a species that do not effect a change in a protein encoded by the genome, or that result in an amino acid substitution that does not substantially affect the function of an encoded protein, are not variant alleles.
[0150] As used herein, the term “variant allele fraction,” “VAF,” “allelic fraction,” or “AF” refers to the number of times a variant or mutant allele was observed (e.g., a number of reads supporting a candidate variant allele) divided by the total number of times the position was sequenced (e.g., a total number of reads covering a candidate locus).
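As an illustration only, the variant allele fraction defined above can be computed as a simple quotient. The following is a minimal sketch; the function name and the input validation are assumptions made for this example and are not part of the disclosure.

```python
def variant_allele_fraction(variant_reads: int, total_reads: int) -> float:
    """Return reads supporting the variant allele divided by reads covering the locus.

    Hypothetical helper for illustration; the names are not taken from the disclosure.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= variant_reads <= total_reads:
        raise ValueError("variant_reads must lie between 0 and total_reads")
    return variant_reads / total_reads


# Example: 12 reads support the candidate variant out of 400 reads covering the locus.
vaf = variant_allele_fraction(12, 400)
print(f"VAF = {vaf:.3f}")
```

Whether the counts are raw reads or de-duplicated fragments changes the resulting value, so the inputs should be chosen to match the convention in use.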
[0151] As used herein, the terms “variant fragment count” and “variant allele fragment count” interchangeably refer to a quantification, e.g., a raw or normalized count, of the number of sequences representing unique cell-free DNA fragments encompassing a variant allele in a sequencing reaction. That is, a variant fragment count represents a count of sequence reads representing unique molecules in the liquid biopsy sample, after duplicate sequence reads in the raw sequencing data have been collapsed, e.g., through the use of unique molecular indices (UMI) and bagging, etc. as described herein.
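The idea of counting unique fragments rather than raw reads can be sketched as follows. This is a simplified illustration that assumes each read is represented by a hypothetical tuple of (UMI, start, end, observed allele), groups reads sharing the same UMI and alignment coordinates into one family, and counts a family as variant-supporting when a strict majority of its reads carry the variant allele. The actual collapsing and bagging procedures are described elsewhere in the disclosure and may differ.

```python
from collections import defaultdict


def variant_fragment_count(reads, variant_allele):
    """Count unique fragments (UMI families) whose reads mostly carry variant_allele.

    reads: iterable of (umi, start, end, allele) tuples. This layout is an
    assumption for illustration, not a format defined by the disclosure.
    """
    families = defaultdict(list)
    for umi, start, end, allele in reads:
        # Reads with the same UMI and coordinates are treated as duplicates
        # of a single original cell-free DNA fragment.
        families[(umi, start, end)].append(allele)

    count = 0
    for alleles in families.values():
        supporting = sum(1 for a in alleles if a == variant_allele)
        # Strict-majority rule: a simplifying assumption for the consensus call.
        if supporting * 2 > len(alleles):
            count += 1
    return count


reads = [
    ("AAC", 100, 260, "T"), ("AAC", 100, 260, "T"),                          # one fragment, variant
    ("GGT", 100, 260, "T"),                                                  # one fragment, variant
    ("CTA", 90, 250, "C"), ("CTA", 90, 250, "C"), ("CTA", 90, 250, "T"),     # one fragment, reference
]
print(variant_fragment_count(reads, "T"))
```

Here six raw reads collapse to three unique fragments, two of which support the variant allele.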
[0152] As used herein, the term “germline variants” refers to genetic variants inherited from maternal and paternal DNA. Germline variants may be determined through a matched tumor-normal calling pipeline.
[0153] As used herein, the term “somatic variants” refers to variants arising as a result of dysregulated cellular processes associated with neoplastic cells, e.g., a mutation. Somatic variants may be detected via subtraction from a matched normal sample.
[0154] As used herein, the term “single nucleotide variant” or “SNV” refers to a substitution of one nucleotide to a different nucleotide at a position (e.g., site) of a nucleotide sequence, e.g., a sequence read from an individual. A substitution from a first nucleobase X to a second nucleobase Y may be denoted as “X>Y.” For example, a cytosine to thymine SNV may be denoted as “C>T.”
[0155] As used herein, the term “insertions and deletions” or “indels” refers to a variant resulting from the gain or loss of DNA base pairs within an analyzed region.
[0156] As used herein, the term “copy number variation” or “CNV” refers to the process by which large structural changes in a genome associated with tumor aneuploidy and other dysregulated repair systems are detected. These processes are used to detect large scale insertions or deletions of entire genomic regions. CNV is defined as structural insertions or deletions greater than a certain base pair (“bp”) in size, such as 500 bp.
[0157] As used herein, the term “gene fusion” refers to the product of large-scale chromosomal aberrations resulting in the creation of a chimeric protein. These expressed products can be non-functional, or they can be highly over or underactive. This can cause deleterious effects in cancer such as hyper-proliferative or anti-apoptotic phenotypes.
[0158] As used herein, the term “loss of heterozygosity” refers to the loss of one copy of a segment (e.g., including part or all of one or more genes) of the genome of a diploid subject (e.g., a human) or loss of one copy of a sequence encoding a functional gene product in the genome of the diploid subject, in a tissue, e.g., a cancerous tissue, of the subject. As used herein, when referring to a metric representing loss of heterozygosity across the entire genome of the subject, loss of heterozygosity is caused by the loss of one copy of various segments in the genome of the subject. Loss of heterozygosity across the entire genome may be estimated without sequencing the entire genome of a subject, and such methods for such estimations based on gene panel targeting-based sequencing methodologies are described in the art. Accordingly, in some embodiments, a metric representing loss of heterozygosity across the entire genome of a tissue of a subject is represented as a single value, e.g., a percentage or fraction of the genome. In some cases, a tumor is composed of various sub-clonal populations, each of which may have a different degree of loss of heterozygosity across their respective genomes. Accordingly, in some embodiments, loss of heterozygosity across the entire genome of a cancerous tissue refers to an average loss of heterozygosity across a heterogeneous tumor population. 
As used herein, when referring to a metric for loss of heterozygosity in a particular gene, e.g., a DNA repair protein such as a protein involved in the homologous DNA recombination pathway (e.g., BRCA1 or BRCA2), loss of heterozygosity refers to complete or partial loss of one copy of the gene encoding the protein in the genome of the tissue and/or a mutation in one copy of the gene that prevents translation of a full-length gene product, e.g., a frameshift or truncating (creating a premature stop codon in the gene) mutation in the gene of interest. In some cases, a tumor is composed of various sub-clonal populations, each of which may have a different mutational status in a gene of interest. Accordingly, in some embodiments, loss of heterozygosity for a particular gene of interest is represented by an average value for loss of heterozygosity for the gene across all sequenced sub-clonal populations of the cancerous tissue. In other embodiments, loss of heterozygosity for a particular gene of interest is represented by a count of the number of unique incidences of loss of heterozygosity in the gene of interest across all sequenced sub-clonal populations of the cancerous tissue (e.g., the number of unique frame-shift and/or truncating mutations in the gene identified in the sequencing data).
[0159] As used herein, the term “microsatellites” refers to short, repeated sequences of DNA. The smallest nucleotide repeated unit of a microsatellite is referred to as the “repeated unit” or “repeat unit.” In some embodiments, the stability of a microsatellite locus is evaluated by comparing some metric of the distribution of the number of repeated units at a microsatellite locus to a reference number or distribution.
[0160] As used herein, the term “microsatellite instability” or “MSI” refers to a genetic hypermutability condition associated with various cancers that results from impaired DNA mismatch repair (MMR) in a subject. Among other phenotypes, MSI causes changes in the size of microsatellite loci, e.g., a change in the number of repeated units at microsatellite loci, during DNA replication. Accordingly, the size of microsatellite repeats is varied in MSI cancers as compared to the size of the corresponding microsatellite repeats in the germline of a cancer subject. The term “Microsatellite Instability-High” or “MSI-H” refers to a state of a cancer (e.g., a tumor) that has a significant MMR defect, resulting in microsatellite loci with significantly different lengths than the corresponding microsatellite loci in normal cells of the same individual. The term “Microsatellite Stable” or “MSS” refers to a state of a cancer (e.g., a tumor) without significant MMR defects, such that there is no significant difference between the lengths of the microsatellite loci in cancerous cells and the lengths of the corresponding microsatellite loci in normal (e.g., non-cancerous) cells in the same individual. The term “Microsatellite Equivocal” or “MSE” refers to a state of a cancer (e.g., a tumor) having an intermediate microsatellite length phenotype that cannot be clearly classified as MSI-H or MSS based on statistical cutoffs used to define those two categories.
[0161] As used herein, the term “gene product” refers to an RNA (e.g., mRNA or miRNA) or protein molecule transcribed or translated from a particular genomic locus, e.g., a particular gene. The genomic locus can be identified using a gene name, a chromosomal location, or any other genetic mapping metric.
[0162] As used herein, the terms “expression level,” “abundance level,” or simply “abundance” refer to an amount of a gene product (an RNA species, e.g., mRNA or miRNA, or protein molecule) transcribed or translated by a cell, or an average amount of a gene product transcribed or translated across multiple cells. When referring to mRNA or protein expression, the term generally refers to the amount of any RNA or protein species corresponding to a particular genomic locus, e.g., a particular gene. However, in some embodiments, an expression level can refer to the amount of a particular isoform of an mRNA or protein corresponding to a particular gene that gives rise to multiple mRNA or protein isoforms. The genomic locus can be identified using a gene name, a chromosomal location, or any other genetic mapping metric.
[0163] As used herein, the term “ratio” refers to any comparison of a first metric X, or a first mathematical transformation thereof X' (e.g., measurement of a number of units of a genomic sequence in a first one or more biological samples or a first mathematical transformation thereof) to another metric Y or a second mathematical transformation thereof Y' (e.g., the number of units of a respective genomic sequence in a second one or more biological samples or a second mathematical transformation thereof) expressed as X/Y, Y/X, logN(X/Y), logN(Y/X), X'/Y, Y/X', logN(X'/Y), logN(Y/X'), X/Y', Y'/X, logN(X/Y'), logN(Y'/X), X'/Y', Y'/X', logN(X'/Y'), or logN(Y'/X'), where N is any real number greater than 1 and where example mathematical transformations of X and Y include, but are not limited to, raising X or Y to a power Z, multiplying X or Y by a constant Q, where Z and Q are any real numbers, and/or taking a base-M logarithm of X and/or Y, where M is a real number greater than 1. In one non-limiting example, X is transformed to X' prior to ratio calculation by raising X to the power of two (X^2) and Y is transformed to Y' prior to ratio calculation by raising Y to the power of 3.2 (Y^3.2), and the ratio of X and Y is computed as log2(X'/Y').
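The non-limiting example above (X raised to the power of two, Y raised to the power of 3.2, and a base-2 logarithm of the quotient) can be worked through numerically. The sketch below is illustrative only; the function name and default parameters are assumptions chosen to mirror that example.

```python
import math


def transformed_log_ratio(x: float, y: float,
                          x_power: float = 2.0,
                          y_power: float = 3.2,
                          base: float = 2.0) -> float:
    """Compute log_base(X'/Y') with X' = X**x_power and Y' = Y**y_power.

    Hypothetical helper mirroring the worked example in the text.
    """
    if x <= 0 or y <= 0:
        raise ValueError("x and y must be positive for the logarithm to be defined")
    if base <= 1:
        raise ValueError("base must be a real number greater than 1")
    x_prime = x ** x_power   # first mathematical transformation
    y_prime = y ** y_power   # second mathematical transformation
    return math.log(x_prime / y_prime, base)


# With X = 4 and Y = 2: X' = 16 and Y' = 2**3.2, so log2(X'/Y') = 4 - 3.2 = 0.8.
print(round(transformed_log_ratio(4.0, 2.0), 6))
```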
[0164] As used herein, the term “relative abundance” refers to a ratio of a first amount of a compound measured in a sample, e.g., a gene product (an RNA species, e.g., mRNA or miRNA, or protein molecule) or nucleic acid fragments having a particular characteristic (e.g., aligning to a particular locus or encompassing a particular allele), to a second amount of a compound measured in a second sample. In some embodiments, relative abundance refers to a ratio of an amount of species of a compound to a total amount of the compound in the same sample. For instance, a ratio of the amount of mRNA transcripts encoding a particular gene in a sample (e.g., aligning to a particular region of the exome) to the total amount of mRNA transcripts in the sample. In other embodiments, relative abundance refers to a ratio of an amount of a compound or species of a compound in a first sample to an amount of the compound of the species of the compound in a second sample. For instance, a ratio of a normalized amount of mRNA transcripts encoding a particular gene in a first sample to a normalized amount of mRNA transcripts encoding the particular gene in a second and/or reference sample.
[0165] As used herein, the terms “sequencing,” “sequence determination,” and the like refer to any biochemical processes that may be used to determine the order of biological macromolecules such as nucleic acids or proteins. For example, sequencing data can include all or a portion of the nucleotide bases in a nucleic acid molecule such as an mRNA transcript or a genomic locus.
[0166] As used herein, the term “genetic sequence” refers to a recordation of a series of nucleotides present in a subject’s RNA or DNA as determined by sequencing of nucleic acids from the subject.
[0167] As used herein, the term “sequence reads” or “reads” refers to nucleotide sequences produced by any nucleic acid sequencing process described herein or known in the art. Reads can be generated from one end of nucleic acid fragments (“single-end reads”) or from both ends of nucleic acid fragments (e.g., paired-end reads, double-end reads). The length of the sequence read is often associated with the particular sequencing technology. High-throughput methods, for example, provide sequence reads that can vary in size from tens to hundreds of base pairs (bp). In some embodiments, the sequence reads are of a mean, median or average length of about 15 bp to 900 bp long (e.g., about 20 bp, about 25 bp, about 30 bp, about 35 bp, about 40 bp, about 45 bp, about 50 bp, about 55 bp, about 60 bp, about 65 bp, about 70 bp, about 75 bp, about 80 bp, about 85 bp, about 90 bp, about 95 bp, about 100 bp, about 110 bp, about 120 bp, about 130 bp, about 140 bp, about 150 bp, about 200 bp, about 250 bp, about 300 bp, about 350 bp, about 400 bp, about 450 bp, or about 500 bp). In some embodiments, the sequence reads are of a mean, median or average length of about 1000 bp, 2000 bp, 5000 bp, 10,000 bp, or 50,000 bp or more. Nanopore® sequencing, for example, can provide sequence reads that can vary in size from tens to hundreds to thousands of base pairs. Illumina® parallel sequencing, for example, can provide sequence reads that do not vary as much, for example, most of the sequence reads can be smaller than 200 bp. A sequence read (or sequencing read) can refer to sequence information corresponding to a nucleic acid molecule (e.g., a string of nucleotides). For example, a sequence read can correspond to a string of nucleotides (e.g., about 20 to about 150) from part of a nucleic acid fragment, can correspond to a string of nucleotides at one or both ends of a nucleic acid fragment, or can correspond to nucleotides of the entire nucleic acid fragment. 
A sequence read can be obtained in a variety of ways, e.g., using sequencing techniques or using probes, e.g., in hybridization arrays or capture probes, or amplification techniques, such as the polymerase chain reaction (PCR) or linear amplification using a single primer or isothermal amplification.
[0168] As used herein, the term “read segment” refers to any form of nucleotide sequence read including the raw sequence reads obtained directly from a nucleic acid sequencing technique or from a sequence derived therefrom, e.g., an aligned sequence read, a collapsed sequence read, or a stitched sequence read.
[0169] As used herein, the term “read count” refers to the total number of nucleic acid reads generated, which may or may not be equivalent to the number of nucleic acid molecules generated, during a nucleic acid sequencing reaction.
[0170] As used herein, the term “read-depth,” “sequencing depth,” or “depth” can refer to a total number of unique nucleic acid fragments encompassing a particular locus or region of the genome of a subject that are sequenced in a particular sequencing reaction. Sequencing depth can be expressed as “Yx”, e.g., 50x, 100x, etc., where “Y” refers to the number of unique nucleic acid fragments encompassing a particular locus that are sequenced in a sequencing reaction. In such a case, Y is necessarily an integer, because it represents the actual sequencing depth for a particular locus. Alternatively, read-depth, sequencing depth, or depth can refer to a measure of central tendency (e.g., a mean or mode) of the number of unique nucleic acid fragments that encompass one of a plurality of loci or regions of the genome of a subject that are sequenced in a particular sequencing reaction. For example, in some embodiments, sequencing depth refers to the average depth of every locus across an arm of a chromosome, a targeted sequencing panel, an exome, or an entire genome. In such case, Y may be expressed as a fraction or a decimal, because it refers to an average coverage across a plurality of loci. When a mean depth is recited, the actual depth for any particular locus may be different than the overall recited depth. Metrics can be determined that provide a range of sequencing depths in which a defined percentage of the total number of loci fall. For instance, a range of sequencing depths within which 90% or 95%, or 99% of the loci fall. As understood by the skilled artisan, different sequencing technologies provide different sequencing depths. For instance, low-pass whole genome sequencing can refer to technologies that provide a sequencing depth of less than 5x, less than 4x, less than 3x, or less than 2x, e.g., from about 0.5x to about 3x.
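To make the distinction between per-locus depth (an integer) and mean depth (possibly fractional) concrete, the following sketch summarizes a list of per-locus depths. The function name and the crude nearest-rank method used for the central range are assumptions for illustration; other quantile conventions would give slightly different bounds.

```python
from statistics import mean


def depth_summary(per_locus_depth, central_fraction=0.90):
    """Return (mean depth, (low, high)) where roughly central_fraction of loci fall in [low, high].

    per_locus_depth: integer counts of unique fragments covering each locus.
    The nearest-rank interval is a simplifying assumption, not a prescribed method.
    """
    depths = sorted(per_locus_depth)
    if not depths:
        raise ValueError("at least one locus is required")
    n = len(depths)
    tail = (1.0 - central_fraction) / 2.0
    low = depths[int(tail * n)]
    high = depths[min(n - 1, int((1.0 - tail) * n))]
    return mean(depths), (low, high)


# Each locus has an integer depth, but the mean across loci can be fractional.
avg, (low, high) = depth_summary([48, 50, 52, 51])
print(f"mean depth = {avg}x, central range = {low}x to {high}x")
```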
[0171] As used herein, the term “sequencing breadth” refers to what fraction of a particular reference exome (e.g., human reference exome), a particular reference genome (e.g., human reference genome), or part of the exome or genome has been analyzed. Sequencing breadth can be expressed as a fraction, a decimal, or a percentage, and is generally calculated as (the number of loci analyzed / the total number of loci in a reference exome or reference genome). The denominator of the fraction can be a repeat-masked genome, and thus 100% can correspond to all of the reference genome minus the masked parts. A repeat-masked exome or genome can refer to an exome or genome in which sequence repeats are masked (e.g., sequence reads align to unmasked portions of the exome or genome). In some embodiments, any part of an exome or genome can be masked and, thus, sequencing breadth can be evaluated for any desired portion of a reference exome or genome. In some embodiments, “broad sequencing” refers to sequencing/analysis of at least 0.1% of an exome or genome.
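The breadth calculation described above, including the effect of masking part of the reference, can be sketched with simple set arithmetic. Loci are represented here as integer positions purely for illustration; the function and parameter names are assumptions.

```python
def sequencing_breadth(analyzed_loci, reference_loci, masked_loci=frozenset()):
    """Return (loci analyzed) / (loci in the reference after removing masked loci).

    All arguments are sets of hashable locus identifiers. This representation
    is an assumption for illustration only.
    """
    denominator = set(reference_loci) - set(masked_loci)
    if not denominator:
        raise ValueError("no unmasked reference loci remain")
    # Only analyzed loci that fall in the unmasked reference are counted.
    numerator = set(analyzed_loci) & denominator
    return len(numerator) / len(denominator)


reference = set(range(1000))           # toy reference of 1000 loci
masked = set(range(900, 1000))         # 100 repeat-masked loci
analyzed = set(range(0, 90)) | set(range(950, 960))  # 90 unmasked + 10 masked loci

print(sequencing_breadth(analyzed, reference, masked))
```

In this toy case 90 of the 900 unmasked loci are analyzed, so the breadth is 0.1 (10%); the ten analyzed loci that lie in masked regions do not contribute.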
[0172] As used herein, the terms “sequence ratio” and “coverage ratio” interchangeably refer to any measurement of a number of units of a genomic sequence in a first one or more biological samples (e.g., a test and/or tumor sample) compared to the number of units of the respective genomic sequence in a second one or more biological samples (e.g., a reference and/or control sample). In some embodiments, a sequence ratio is a copy ratio, a log2-transformed copy ratio (e.g., log2 copy ratio), a coverage ratio, a base fraction, an allele fraction (e.g., a variant allele fraction), and/or a tumor ploidy. In some embodiments, a sequence ratio is a logN-transformed copy ratio, where N is any real number greater than 1.
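One common form of sequence ratio, a log2 copy ratio, can be sketched as below. The normalization of each sample by its total read count is an assumption made for this example so that samples sequenced to different depths are comparable; the disclosure does not prescribe this particular normalization here, and the function name is hypothetical.

```python
import math


def log2_copy_ratio(test_count: int, test_total: int,
                    ref_count: int, ref_total: int) -> float:
    """Return log2 of the depth-normalized coverage in a test sample over a reference sample.

    test_count / ref_count: reads (or fragments) aligning to the genomic region.
    test_total / ref_total: total reads (or fragments) in each sample.
    Library-size normalization is an illustrative assumption.
    """
    if min(test_count, test_total, ref_count, ref_total) <= 0:
        raise ValueError("all counts must be positive")
    test_fraction = test_count / test_total
    ref_fraction = ref_count / ref_total
    return math.log2(test_fraction / ref_fraction)


# A region with twice the normalized coverage of the reference gives a log2 ratio of 1.
print(log2_copy_ratio(400, 1_000_000, 200, 1_000_000))
```

A value near 0 indicates coverage similar to the reference, positive values suggest a relative gain, and negative values suggest a relative loss.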
[0173] As used herein, the term “sequencing probe” refers to a molecule that binds to a nucleic acid with affinity that is based on the expected nucleotide sequence of the RNA or DNA present at that locus.
[0174] As used herein, the term “targeted panel” or “targeted gene panel” refers to a combination of probes for sequencing (e.g., by next-generation sequencing) nucleic acids present in a biological sample from a subject (e.g., a tumor sample, liquid biopsy sample, germline tissue sample, white blood cell sample, or tumor or tissue organoid sample), selected to map to one or more loci of interest on one or more chromosomes. An example set of loci/genes useful for precision oncology, e.g., via solid or liquid biopsy assay, that can be analyzed using a targeted panel is described in Table 1. In some embodiments, in addition to loci that are informative for precision oncology, a targeted panel includes one or more probes for sequencing one or more of a locus associated with a different medical condition, a locus used for internal control purposes, or a locus from a pathogenic organism (e.g., an oncogenic pathogen).
[0175] As used herein, the term, “reference exome” refers to any sequenced or otherwise characterized exome, whether partial or complete, of any tissue from any organism or pathogen that may be used to reference identified sequences from a subject. Typically, a reference exome will be derived from a subject of the same species as the subject whose sequences are being evaluated. Example reference exomes used for human subjects as well as many other organisms are provided in the on-line genome browser hosted by the National Center for Biotechnology Information (“NCBI”). An “exome” refers to the complete transcriptional profile of an organism or pathogen, expressed in nucleic acid sequences. As used herein, a reference sequence or reference exome often is an assembled or partially assembled exomic sequence from an individual or multiple individuals. In some embodiments, a reference exome is an assembled or partially assembled exomic sequence from one or more human individuals. The reference exome can be viewed as a representative example of a species’ set of expressed genes. In some embodiments, a reference exome comprises sequences assigned to chromosomes.
[0176] As used herein, the term “reference genome” refers to any sequenced or otherwise characterized genome, whether partial or complete, of any organism or pathogen that may be used to reference identified sequences from a subject. Typically, a reference genome will be derived from a subject of the same species as the subject whose sequences are being evaluated. Exemplary reference genomes used for human subjects as well as many other organisms are provided in the on-line genome browser hosted by the National Center for Biotechnology Information (“NCBI”) or the University of California, Santa Cruz (UCSC). A “genome” refers to the complete genetic information of an organism or pathogen, expressed in nucleic acid sequences. As used herein, a reference sequence or reference genome often is an assembled or partially assembled genomic sequence from an individual or multiple individuals. In some embodiments, a reference genome is an assembled or partially assembled genomic sequence from one or more human individuals. The reference genome can be viewed as a representative example of a species’ set of genes. In some embodiments, a reference genome comprises sequences assigned to chromosomes. Exemplary human reference genomes include but are not limited to NCBI build 34 (UCSC equivalent: hg16), NCBI build 35 (UCSC equivalent: hg17), NCBI build 36.1 (UCSC equivalent: hg18), GRCh37 (UCSC equivalent: hg19), and GRCh38 (UCSC equivalent: hg38). For a haploid genome, there can be only one nucleotide at each locus. For a diploid genome, heterozygous loci can be identified; each heterozygous locus can have two alleles, where either allele can allow a match for alignment to the locus.
[0177] As used herein, the term “bioinformatics pipeline” refers to a series of processing stages used to determine characteristics of a subject’s genome or exome based on sequencing data of the subject’s genome or exome. A bioinformatics pipeline may be used to determine characteristics of a germline genome or exome of a subject and/or a cancer genome or exome of a subject. In some embodiments, the pipeline extracts information related to genomic alterations in the cancer genome of a subject, which is useful for guiding clinical decisions for precision oncology, from sequencing results of a biological sample, e.g., a tumor sample, liquid biopsy sample, reference normal sample, etc., from the subject. Certain processing stages in a bioinformatics pipeline may be ‘connected,’ meaning that the results of a first respective processing stage are informative and/or essential for execution of a second, downstream processing stage. For instance, in some embodiments, a bioinformatics pipeline includes a first respective processing stage for identifying genomic alterations that are unique to the cancer genome of a subject and a second respective processing stage that uses the quantity and/or identity of the identified genomic alterations to determine a metric that is informative for precision oncology, e.g., a tumor mutational burden. In some embodiments, the bioinformatics pipeline includes a reporting stage that generates a report of relevant and/or actionable information identified by upstream stages of the pipeline, which may or may not further include recommendations for aiding clinical therapy decisions.
[0178] As used herein, the term “limit of detection” or “LOD” refers to the minimal quantity of a feature that can be identified with a particular level of confidence. Accordingly, a limit of detection can be used to describe an amount of a substance that must be present in order for a particular assay to reliably detect the substance. A limit of detection can also be used to describe a level of support needed for an algorithm to reliably identify a genomic alteration based on sequencing data, for example, a minimal number of unique sequence reads to support identification of a sequence variant such as a SNV.
[0179] As used herein, the term “BAM File” or “Binary file containing Alignment Maps” refers to a file storing sequencing data aligned to a reference sequence (e.g., a reference genome or exome). In some embodiments, a BAM file is a compressed binary version of a SAM (Sequence Alignment Map) file that includes, for each of a plurality of unique sequence reads, an identifier for the sequence read, information about the nucleotide sequence, information about the alignment of the sequence to a reference sequence, and optionally metrics relating to the quality of the sequence read and/or the quality of the sequence alignment. While BAM files generally relate to files having a particular format, for simplicity they are used herein to simply refer to a file, of any format, containing information about a sequence alignment, unless specifically stated otherwise.
[0180] As used herein, the term “measure of central tendency” refers to a central or representative value for a distribution of values. Non-limiting examples of measures of central tendency include an arithmetic mean, weighted mean, midrange, midhinge, trimean, geometric mean, geometric median, Winsorized mean, median, and mode of the distribution of values.
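As a non-limiting sketch, several of the measures of central tendency listed above can be computed with the Python standard library as follows. The helper names and the example depth values are hypothetical, and the quartiles rely on the default convention of `statistics.quantiles`, which is only one of several accepted quartile definitions.

```python
import statistics


def midrange(values):
    """Average of the smallest and largest values."""
    return (min(values) + max(values)) / 2


def midhinge(values):
    """Average of the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return (q1 + q3) / 2


def trimean(values):
    """Weighted average of the quartiles: (Q1 + 2*Q2 + Q3) / 4."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return (q1 + 2 * q2 + q3) / 4


# Hypothetical per-locus depths with one outlying locus.
depths = [10, 12, 12, 15, 40]
summary = {
    "mean": statistics.mean(depths),
    "median": statistics.median(depths),
    "mode": statistics.mode(depths),
    "midrange": midrange(depths),
    "midhinge": midhinge(depths),
    "trimean": trimean(depths),
    "geometric_mean": statistics.geometric_mean(depths),
}
```

In this example the single outlying value pulls the mean and midrange upward, while the median and mode are unaffected, which is one reason a particular measure may be preferred for a given distribution.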
[0181] As used herein, the term “Positive Predictive Value” or “PPV” means the likelihood that a variant is properly called given that a variant has been called by an assay. PPV can be expressed as (number of true positives)/ (number of false positives + number of true positives).
[0182] As used herein, the term “assay” refers to a technique for determining a property of a substance, e.g., a nucleic acid, a protein, a cell, a tissue, or an organ. An assay (e.g., a first assay or a second assay) can comprise a technique for determining the copy number variation of nucleic acids in a sample, the methylation status of nucleic acids in a sample, the fragment size distribution of nucleic acids in a sample, the mutational status of nucleic acids in a sample, or the fragmentation pattern of nucleic acids in a sample. Any assay known to a person having ordinary skill in the art can be used to detect any of the properties of nucleic acids mentioned herein. Properties of a nucleic acid can include a sequence, genomic identity, copy number, methylation state at one or more nucleotide positions, size of the nucleic acid, presence or absence of a mutation in the nucleic acid at one or more nucleotide positions, and pattern of fragmentation of a nucleic acid (e.g., the nucleotide position(s) at which a nucleic acid fragments). An assay or method can have a particular sensitivity and/or specificity, and its relative usefulness as a diagnostic tool can be measured using ROC-AUC statistics.
[0183] As used herein, the term “classification” can refer to any number(s) or other character(s) that are associated with a particular property of a sample. For example, in some embodiments, the term “classification” can refer to a type of cancer in a subject, a stage of cancer in a subject, a prognosis for a cancer in a subject, a tumor load, a presence of tumor metastasis in a subject, and the like. The classification can be binary (e.g., positive or negative) or have more levels of classification (e.g., a scale from 1 to 10 or 0 to 1). The terms “cutoff” and “threshold” can refer to predetermined numbers used in an operation. For example, a cutoff size can refer to a size above which fragments are excluded. A threshold value can be a value above or below which a particular classification applies. Either of these terms can be used in either of these contexts.
[0184] As used herein, the term “sensitivity” or “true positive rate” (TPR) refers to the number of true positives divided by the sum of the number of true positives and false negatives. Sensitivity can characterize the ability of an assay or method to correctly identify a proportion of the population that truly has a condition. For example, sensitivity can characterize the ability of a method to correctly identify the number of subjects within a population having cancer. In another example, sensitivity can characterize the ability of a method to correctly identify the one or more markers indicative of cancer.
[0185] As used herein, the term “specificity” or “true negative rate” (TNR) refers to the number of true negatives divided by the sum of the number of true negatives and false positives. Specificity can characterize the ability of an assay or method to correctly identify a proportion of the population that truly does not have a condition. For example, specificity can characterize the ability of a method to correctly identify the number of subjects within a population not having cancer. In another example, specificity characterizes the ability of a method to correctly identify one or more markers indicative of cancer.
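The positive predictive value, sensitivity, and specificity defined above follow directly from confusion-matrix counts and can be sketched, for illustration only, as follows. The function names and the example counts are hypothetical and do not represent results of any assay described herein.

```python
def positive_predictive_value(true_pos, false_pos):
    """PPV = TP / (FP + TP)."""
    return true_pos / (false_pos + true_pos)


def sensitivity(true_pos, false_neg):
    """True positive rate = TP / (TP + FN)."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """True negative rate = TN / (TN + FP)."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts for 1,000 evaluated loci.
tp, fp, fn, tn = 90, 10, 30, 870
ppv = positive_predictive_value(tp, fp)  # 90 / 100
tpr = sensitivity(tp, fn)                # 90 / 120
tnr = specificity(tn, fp)                # 870 / 880
```

With these counts, 90% of called variants are true variants (PPV), 75% of true variants are called (sensitivity), and roughly 98.9% of non-variant loci are correctly left uncalled (specificity).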
[0186] As used herein, an “actionable genomic alteration” or “actionable variant” refers to a genomic alteration (e.g., a SNV, MNV, indel, rearrangement, copy number variation, or ploidy variation), or value of another cancer metric derived from nucleic acid sequencing data (e.g., a tumor mutational burden, MSI status, or tumor fraction), that is known or believed to be associated with a therapeutic course of action that is more likely to produce a positive effect in a cancer patient that has the actionable variant than in a similarly situated cancer patient that does not have the actionable variant. For instance, administration of EGFR inhibitors (e.g., afatinib, erlotinib, gefitinib) is more effective for treating non-small cell lung cancer in patients with an EGFR mutation in exons 19/21 than for treating non-small cell lung cancer in patients that do not have an EGFR mutation in exons 19/21. Accordingly, an EGFR mutation in exon 19/21 is an actionable variant. In some instances, an actionable variant is only associated with an improved treatment outcome in one or a group of specific cancer types. In other instances, an actionable variant is associated with an improved treatment outcome in substantially all cancer types.
[0187] As used herein, a “variant of uncertain significance” or “VUS” refers to a genomic alteration (e.g., a SNV, MNV, indel, rearrangement, copy number variation, or ploidy variation), or value of another cancer metric derived from nucleic acid sequencing data (e.g., a tumor mutational burden, MSI status, or tumor fraction), whose impact on disease development/progression is unknown.
[0188] As used herein, a “benign variant” or “likely benign variant” refers to a genomic alteration (e.g., a SNV, MNV, indel, rearrangement, copy number variation, or ploidy variation), or value of another cancer metric derived from nucleic acid sequencing data (e.g., a tumor mutational burden, MSI status, or tumor fraction), that is known or believed to not contribute to disease development/progression.
[0189] As used herein, a “pathogenic variant” or “likely pathogenic variant” refers to a genomic alteration (e.g., a SNV, MNV, indel, rearrangement, copy number variation, or ploidy variation), or value of another cancer metric derived from nucleic acid sequencing data (e.g., a tumor mutational burden, MSI status, or tumor fraction), that is known or believed to contribute to disease development/progression.
[0190] As used herein, an “effective amount” or “therapeutically effective amount” is an amount sufficient to effect a beneficial or desired clinical result upon treatment. An effective amount can be administered to a subject in one or more doses. In terms of treatment, an effective amount is an amount that is sufficient to palliate, ameliorate, stabilize, reverse or slow the progression of the disease, or otherwise reduce the pathological consequences of the disease. The effective amount is generally determined by the physician on a case-by-case basis and is within the skill of one in the art. Several factors are typically taken into account when determining an appropriate dosage to achieve an effective amount. These factors include age, sex and weight of the subject, the condition being treated, the severity of the condition and the form and effective concentration of the therapeutic agent being administered.
[0191] The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the description of the invention and the appended claims, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Furthermore, to the extent that the terms “including,” “includes,” “having,” “has,” “with,” or variants thereof are used in either the detailed description and/or the claims, such terms are intended to be inclusive in a manner similar to the term “comprising.”
[0192] As used herein, the term “if” may be construed to mean “when” or “upon” or “in response to determining” or “in response to detecting,” depending on the context. Similarly, the phrase “if it is determined” or “if [a stated condition or event] is detected” may be construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event],” depending on the context.
[0193] It will also be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first subject could be termed a second subject, and, similarly, a second subject could be termed a first subject, without departing from the scope of the present disclosure. The first subject and the second subject are both subjects, but they are not the same subject. Furthermore, the terms “subject,” “user,” and “patient” are used interchangeably herein.
[0194] Reference will now be made in detail to embodiments, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, including example systems, methods, techniques, instruction sequences, and computing machine program products that embody illustrative implementations. However, the illustrative discussions below are not intended to be exhaustive or to limit the implementations to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The features described herein are not limited by the illustrated ordering of acts or events, as some acts can occur in different orders and/or concurrently with other acts or events.
[0195] The implementations provided herein are chosen and described in order to best explain the principles and their practical applications, to thereby enable others skilled in the art to best utilize the various embodiments with various modifications as are suited to the particular use contemplated. In some instances, well-known methods, procedures, components, circuits, and networks have not been described in detail so as not to unnecessarily obscure aspects of the embodiments. In other instances, it will be apparent to one of ordinary skill in the art that the present disclosure may be practiced without one or more of the specific details.
[0196] It will be appreciated that, in the development of any such actual implementation, numerous implementation-specific decisions are made in order to achieve the designer’s specific goals, such as compliance with use case- and business-related constraints, and that these specific goals will vary from one implementation to another and from one designer to another. Moreover, it will be appreciated that though such a design effort might be complex and time-consuming, it will nevertheless be a routine undertaking of engineering for those of ordinary skill in the art having the benefit of the present disclosure.
Example System Embodiments
[0197] Now that an overview of some aspects of the present disclosure and some definitions used in the present disclosure have been provided, details of an exemplary system for providing clinical support for personalized cancer therapy using a liquid biopsy assay are now described in conjunction with Figures 1A, 1B, 1C1, 1D1, 1C2, 1D2, 1E2, 1F2, 1C3, and 1D3. Figures 1A, 1B, 1C1, 1D1, 1C2, 1D2, 1E2, 1F2, 1C3, and 1D3 collectively illustrate the topology of an example system for providing clinical support for personalized cancer therapy using a liquid biopsy assay, in accordance with some embodiments of the present disclosure. Advantageously, the example system illustrated in Figures 1A, 1B, 1C1, 1D1, 1C2, 1D2, 1E2, 1F2, 1C3, and 1D3 improves upon conventional methods for providing clinical support for personalized cancer therapy by validating copy number variations, thus identifying focal copy number variations for actionable treatment, validating a somatic sequence variant in a test subject having a cancer condition, and/or determining circulating tumor fraction estimates using on-target and off-target sequence reads.
[0198] Figure 1A is a block diagram illustrating a system in accordance with some implementations. The device 100 in some implementations includes one or more processing units CPU(s) 102 (also referred to as processors), one or more network interfaces 104, a user interface 106, e.g., including a display 108 and/or an input 110 (e.g., a mouse, touchpad, keyboard, etc.), a non-persistent memory 111, a persistent memory 112, and one or more communication buses 114 for interconnecting these components. The one or more communication buses 114 optionally include circuitry (sometimes called a chipset) that interconnects and controls communications between system components. The non-persistent memory 111 typically includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, ROM, EEPROM, flash memory, whereas the persistent memory 112 typically includes CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, magnetic disk storage devices, optical disk storage devices, flash memory devices, or other non-volatile solid state storage devices. The persistent memory 112 optionally includes one or more storage devices remotely located from the CPU(s) 102. The persistent memory 112, and the non-volatile memory device(s) within the non-persistent memory 111, comprise non-transitory computer readable storage medium. In some implementations, the non-persistent memory 111 or alternatively the non-transitory computer readable storage medium stores the following programs, modules and data structures, or a subset thereof, sometimes in conjunction with the persistent memory 112:
• an operating system 116, which includes procedures for handling various basic system services and for performing hardware dependent tasks;
• a network communication module (or instructions) 118 for connecting the system 100 with other devices and/or a communication network 105;
• a test patient data store 120 for storing one or more collections of features from patients (e.g., subjects);
• a bioinformatics module 140 for processing sequencing data and extracting features from sequencing data, e.g., from liquid biopsy sequencing assays;
• a feature analysis module 160 for evaluating patient features, e.g., genomic alterations, compound genomic features, and clinical features; and
• a reporting module 180 for generating and transmitting reports that provide clinical support for personalized cancer therapy.
[0199] Although Figures 1A, 1B, 1C1, 1D1, 1C2, 1D2, 1E2, 1F2, 1C3, and 1D3 depict a “system 100,” the figures are intended more as a functional description of the various features that may be present in computer systems than as a structural schematic of the implementations described herein. In practice, and as recognized by those of ordinary skill in the art, items shown separately could be combined and some items could be separated. Moreover, although Figure 1 depicts certain data and modules in non-persistent memory 111, some or all of these data and modules may be in persistent memory 112. For example, in various implementations, one or more of the above identified elements are stored in one or more of the previously mentioned memory devices and correspond to a set of instructions for performing a function described above. The above identified modules, data, or programs (e.g., sets of instructions) need not be implemented as separate software programs, procedures, datasets, or modules, and thus various subsets of these modules and data may be combined or otherwise re-arranged in various implementations.
[0200] In some implementations, the non-persistent memory 111 optionally stores a subset of the modules and data structures identified above. Furthermore, in some embodiments, the memory stores additional modules and data structures not described above. In some embodiments, one or more of the above-identified elements is stored in a computer system, other than that of system 100, that is addressable by system 100 so that system 100 may retrieve all or a portion of such data when needed.
[0201] For purposes of illustration in Figure 1A, system 100 is represented as a single computer that includes all of the functionality for providing clinical support for personalized cancer therapy. However, while a single machine is illustrated, the term “system” shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.
[0202] For example, in some embodiments, system 100 includes one or more computers. In some embodiments, the functionality for providing clinical support for personalized cancer therapy is spread across any number of networked computers and/or resides on each of several networked computers and/or is hosted on one or more virtual machines at a remote location accessible across the communications network 105. For example, different portions of the various modules and data stores illustrated in Figures 1A, 1B, 1C1, 1D1, 1C2, 1D2, 1E2, 1F2, 1C3, and 1D3 can be stored and/or executed on the various instances of a processing device and/or processing server/database in the distributed diagnostic environment 210 illustrated in Figure 2B (e.g., processing devices 224, 234, 244, and 254, processing server 262, and database 264).
[0203] The system may operate in the capacity of a server or a client machine in a client-server network environment, as a peer machine in a peer-to-peer (or distributed) network environment, or as a server or a client machine in a cloud computing infrastructure or environment. The system may be a personal computer (PC), a tablet PC, a set-top box (STB), a Personal Digital Assistant (PDA), a cellular telephone, a web appliance, a server, a network router, a switch or bridge, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine.
[0204] In another implementation, the system comprises a virtual machine that includes a module for executing instructions for performing any one or more of the methodologies disclosed herein. In computing, a virtual machine (VM) is an emulation of a computer system that is based on computer architectures and provides functionality of a physical computer. Some such implementations may involve specialized hardware, software, or a combination of hardware and software.
[0205] One of skill in the art will appreciate that any of a wide array of different computer topologies are used for the application and all such topologies are within the scope of the present disclosure.
Test Patient Data Store (120)
[0206] Referring to Figure 1B, in some embodiments, the system (e.g., system 100) includes a patient data store 120 that stores data for patients 121-1 to 121-M (e.g., cancer patients or patients being tested for cancer) including one or more sequencing data 122, feature data 125, and clinical assessments 139. These data are used and/or generated by the various processes stored in the bioinformatics module 140 and feature analysis module 160 of system 100, to ultimately generate a report providing clinical support for personalized cancer therapy of a patient. While the feature scope of patient data 121 across all patients may be informationally dense, an individual patient’s feature set may be sparsely populated across the entirety of the collective feature scope of all features across all patients. That is to say, the data stored for one patient may include a different set of features than the data stored for another patient. Further, while illustrated as a single data construct in Figure 1B, different sets of patient data may be stored in different databases or modules spread across one or more system memories.
[0207] In some embodiments, sequencing data 122 from one or more sequencing reactions 122-i, including a plurality of sequence reads 123-1 to 123-K, is stored in the test patient data store 120. The data store may include different sets of sequencing data from a single subject, corresponding to different samples from the patient, e.g., a tumor sample, liquid biopsy sample, tumor organoid derived from a patient tumor, and/or a normal sample, and/or to samples acquired at different times, e.g., while monitoring the progression, regression, remission, and/or recurrence of a cancer in a subject. The sequence reads may be in any suitable file format, e.g., BCL, FASTA, FASTQ, etc. In some embodiments, sequencing data 122 is accessed by a sequencing data processing module 141, which performs various pre-processing, genome alignment, and demultiplexing operations, as described in detail below with reference to bioinformatics module 140. In some embodiments, sequence data that has been aligned to a reference construct, e.g., BAM file 124, is stored in test patient data store 120.
[0208] In some embodiments, the test patient data store 120 includes feature data 125, e.g., that is useful for identifying clinical support for personalized cancer therapy. In some embodiments, the feature data 125 includes personal characteristics 126 of the patient, such as patient name, date of birth, gender, ethnicity, physical address, smoking status, alcohol consumption characteristic, anthropometric data, etc.
[0209] In some embodiments, the feature data 125 includes medical history data 127 for the patient, such as cancer diagnosis information (e.g., date of initial diagnosis, date of metastatic diagnosis, cancer staging, tumor characterization, tissue of origin, previous treatments and outcomes, adverse effects of therapy, therapy group history, clinical trial history, previous and current medications, surgical history, etc.), previous or current symptoms, previous or current therapies, previous treatment outcomes, previous disease diagnoses, diabetes status, diagnoses of depression, diagnoses of other physical or mental maladies, and family medical history. In some embodiments, the feature data 125 includes clinical features 128, such as pathology data 128-1, medical imaging data 128-2, and tissue culture and/or tissue organoid culture data 128-3.
[0210] In some embodiments, yet other clinical features, such as previous laboratory testing results, are stored in the test patient data store 120. Medical history data 127 and clinical features may be collected from various sources, including at intake directly from the patient, from an electronic medical record (EMR) or electronic health record (EHR) for the patient, or curated from other sources, such as fields from various testing records (e.g., genetic sequencing reports).
[0211] In some embodiments, the feature data 125 includes genomic features 131 for the patient. Non-limiting examples of genomic features include allelic states 132 (e.g., the identity of alleles at one or more loci, support for wild type or variant alleles at one or more loci, support for SNVs/MNVs at one or more loci, support for indels at one or more loci, and/or support for gene rearrangements at one or more loci), allelic fractions 133 (e.g., ratios of variant to reference alleles (or vice versa)), methylation states 134 (e.g., a distribution of methylation patterns at one or more loci and/or support for aberrant methylation patterns at one or more loci), genomic copy numbers 135 (e.g., a copy number value at one or more loci and/or support for an aberrant (increased or decreased) copy number at one or more loci), tumor mutational burden 136 (e.g., a measure of the number of mutations in the cancer genome of the subject), and microsatellite instability status 137 (e.g., a measure of the repeated unit length at one or more microsatellite loci and/or a classification of the MSI status for the patient’s cancer). In some embodiments, one or more of the genomic features 131 are determined by a nucleic acid bioinformatics pipeline, e.g., as described in detail below with reference to Figure 4 (e.g., Figures 4A-E, 4F1, 4F2, and 4F3). In particular, in some embodiments, the feature data 125 include genomic copy numbers 135 (e.g., 135-1 for Patient 1 121-1), variant allele fractions 133, and/or circulating tumor fraction estimates 131-i, as determined using the improved methods for analyzing copy number variations (CNVs) using the copy number variation analysis module 153, validating somatic sequence variants, and/or determining circulating tumor fraction estimates, and as described in further detail below with reference to Figures 1 and 4 (e.g., Figures 1C1, 1D1, 4F1; Figures 1C2, 1D2, and 4F2; and/or Figures 1C3, 1D3, and 4F3). In some embodiments, one or more of the genomic features 131 are obtained from an external testing source, e.g., not connected to the bioinformatics pipeline as described below.
[0212] For example, referring to Figure 1C1, the one or more genomic features 131 include genomic copy numbers 135 comprising liquid biopsy genomic copy numbers 135-cf and optional tumor biopsy genomic copy numbers 135-t, in accordance with some embodiments of the present disclosure. In some embodiments, the liquid biopsy genomic copy numbers 135-cf are determined by a nucleic acid bioinformatics pipeline (e.g., as described in detail below with reference to Figures 4A-E and 4F1) using a plurality of sequence reads 123 obtained from a sequencing of cell-free nucleic acids from a liquid biopsy sample. In some embodiments, the liquid biopsy genomic copy numbers comprise a plurality of copy number annotations (e.g., 135-cf-1, 135-cf-2, ..., 135-cf-N), where each copy number annotation corresponds to a genomic target (e.g., a gene or a region of a genome). In some embodiments, a copy number annotation comprises a qualitative status and/or a quantitative copy number. In some alternative embodiments, the optional tumor biopsy genomic copy numbers 135-t are determined by a nucleic acid bioinformatics pipeline using a plurality of sequence reads 123 obtained from a sequencing of nucleic acids from a tumor (e.g., tissue) biopsy. In some embodiments, the optional tumor biopsy genomic copy numbers comprise a plurality of optional copy number annotations (e.g., 135-1-t-1, 135-1-t-2, ..., 135-1-t-O), where each copy number annotation corresponds to a genomic target (e.g., a gene or a region of a genome).
[0213] Referring again to Figure 1B, in some embodiments, the feature data 125 further includes data 138 from other -omics fields of study. Non-limiting examples of -omics fields of study that may yield feature data useful for providing clinical support for personalized cancer therapy include transcriptomics, epigenomics, proteomics, metabolomics, metabonomics, microbiomics, lipidomics, glycomics, cellomics, and organoidomics.
[0214] In some embodiments, yet other features may include features derived from machine learning approaches, e.g., based at least in part on evaluation of any relevant molecular or clinical features, considered alone or in combination, not limited to those listed above. For instance, in some embodiments, one or more latent features learned from evaluation of cancer patient training datasets improve the diagnostic and prognostic power of the various analysis algorithms in the feature analysis module 160.
[0215] The skilled artisan will know of other types of features useful for providing clinical support for personalized cancer therapy. The listing of features above is merely representative and should not be construed to be limiting.
[0216] In some embodiments, a test patient data store 120 includes clinical assessment data 139 for patients, e.g., based on the feature data 125 collected for the subject. In some embodiments, the clinical assessment data 139 includes a catalogue of actionable variants and characteristics 139-1 (e.g., genomic alterations and compound metrics based on genomic features known or believed to be targetable by one or more specific cancer therapies), matched therapies 139-2 (e.g., the therapies known or believed to be particularly beneficial for treatment of subjects having actionable variants), and/or clinical reports 139-3 generated for the subject, e.g., based on identified actionable variants and characteristics 139-1 and/or matched therapies 139-2.
[0217] In some embodiments, clinical assessment data 139 is generated by analysis of feature data 125 using the various algorithms of feature analysis module 160, as described in further detail below. In some embodiments, clinical assessment data 139 is generated, modified, and/or validated by evaluation of feature data 125 by a clinician, e.g., an oncologist. For instance, in some embodiments, a clinician (e.g., at clinical environment 220) uses feature analysis module 160, or accesses test patient data store 120 directly, to evaluate feature data 125 to make recommendations for personalized cancer treatment of a patient. Similarly, in some embodiments, a clinician (e.g., at clinical environment 220) reviews recommendations determined using feature analysis module 160 and approves, rejects, or modifies the recommendations, e.g., prior to the recommendations being sent to a medical professional treating the cancer patient.
Bioinformatics Module (140)
[0218] Referring again to Figure 1A, the system (e.g., system 100) includes a bioinformatics module 140 that includes a feature extraction module 145 and optional ancillary data processing constructs, such as a sequence data processing module 141 and/or one or more reference sequence constructs 158 (e.g., a reference genome, exome, or targeted-panel construct that includes reference sequences for a plurality of loci targeted by a sequencing panel).
[0219] In some embodiments, bioinformatics module 140 includes a sequence data processing module 141 that includes instructions for processing sequence reads, e.g., raw sequence reads 123 from one or more sequencing reactions 122-i, prior to analysis by the various feature extraction algorithms, as described in detail below. In some embodiments, sequence data processing module 141 includes one or more pre-processing algorithms 142 that prepare the data for analysis. In some embodiments, the pre-processing algorithms 142 include instructions for converting the file format of the sequence reads from the output of the sequencer (e.g., a BCL file format) into a file format compatible with downstream analysis of the sequences (e.g., a FASTQ or FASTA file format). In some embodiments, the pre-processing algorithms 142 include instructions for evaluating the quality of the sequence reads (e.g., by interrogating quality metrics like Phred score, base-calling error probabilities, Quality (Q) scores, and the like) and/or removing sequence reads that do not satisfy a threshold quality (e.g., an inferred base call accuracy of at least 80%, at least 90%, at least 95%, at least 99%, at least 99.5%, at least 99.9%, or higher). In some embodiments, the pre-processing algorithms 142 include instructions for filtering the sequence reads for one or more properties, e.g., removing sequences failing to satisfy a lower or upper size threshold or removing duplicate sequence reads.
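By way of illustration only, the quality and property filters described above can be sketched as follows. This is a minimal sketch and not the pre-processing algorithms 142 themselves: the function names, the Phred+33 quality encoding, the use of mean per-base accuracy as the read-level quality measure, and deduplication on identical sequence are all assumptions made for the example. The only fixed relationship used is the standard Phred definition, under which a score Q corresponds to an inferred base call accuracy of 1 - 10^(-Q/10).

```python
def phred_to_accuracy(q):
    """Inferred base-call accuracy for Phred score q: 1 - 10^(-q/10)."""
    return 1.0 - 10 ** (-q / 10.0)


def decode_qualities(quality_string, offset=33):
    """Decode a FASTQ quality string (Phred+33 encoding is assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def passes_quality(quality_string, min_accuracy=0.99):
    """True when the mean inferred base-call accuracy meets the threshold."""
    scores = decode_qualities(quality_string)
    if not scores:
        return False
    mean_accuracy = sum(phred_to_accuracy(q) for q in scores) / len(scores)
    return mean_accuracy >= min_accuracy


def preprocess(reads, min_accuracy=0.99, min_length=1, max_length=10_000):
    """Drop out-of-size, low-quality, and exact-duplicate reads.

    reads: iterable of (name, sequence, quality_string) tuples.
    Deduplicating on identical sequence is a simplification (assumption).
    """
    seen = set()
    kept = []
    for name, seq, qual in reads:
        if not (min_length <= len(seq) <= max_length):
            continue  # fails the lower or upper size threshold
        if not passes_quality(qual, min_accuracy):
            continue  # fails the quality threshold
        if seq in seen:
            continue  # duplicate of a read already kept
        seen.add(seq)
        kept.append((name, seq, qual))
    return kept
```

In this sketch, a threshold of 0.99 corresponds to a mean quality of roughly Q20; the other thresholds listed above (80% through 99.9%) would be supplied through the same hypothetical `min_accuracy` parameter.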
[0220] In some embodiments, sequence data processing module 141 includes one or more alignment algorithms 143, for aligning pre-processed sequence reads 123 to a reference sequence construct 158, e.g., a reference genome, exome, or targeted-panel construct. Many algorithms for aligning sequencing data to a reference construct are known in the art, for example, BWA, Blat, SHRiMP, LastZ, and MAQ. One example of a sequence read alignment package is the Burrows-Wheeler Alignment tool (BWA), which uses a Burrows-Wheeler Transform (BWT) to align short sequence reads against a large reference construct, allowing for mismatches and gaps. See Li and Durbin, Bioinformatics, 25(14):1754-60 (2009), the content of which is incorporated herein by reference, in its entirety, for all purposes. Sequence read alignment packages import raw or pre-processed sequence reads 122, e.g., in BCL, FASTA, or FASTQ file formats, and output aligned sequence reads 124, e.g., in SAM or BAM file formats.
[0221] In some embodiments, sequence data processing module 141 includes one or more demultiplexing algorithms 144, for dividing sequence read or sequence alignment files generated from sequencing reactions of pooled nucleic acids into separate sequence read or sequence alignment files, each of which corresponds to a different source of nucleic acids in the nucleic acid sequencing pool. For instance, because of the cost of sequencing reactions, it is common practice to pool nucleic acids from a plurality of samples into a single sequencing reaction. The nucleic acids from each sample are tagged with a sample-specific and/or molecule-specific sequence tag (e.g., a UMI), which is sequenced along with the molecule.
In some embodiments, demultiplexing algorithms 144 sort these sequence tags in the sequence read or sequence alignment files to demultiplex the sequencing data into separate files for each of the samples included in the sequencing reaction.
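As a hedged illustration of the sorting step described above, the following sketch splits pooled reads into per-sample groups using a sample-specific tag. It is not the demultiplexing algorithms 144 themselves: the function name, the assumption that the tag is a fixed-length prefix of each read, the exact-match lookup, and the "undetermined" bucket are all hypothetical choices made for the example.

```python
from collections import defaultdict


def demultiplex(reads, tag_to_sample, tag_length):
    """Group pooled reads by sample using a leading sample-specific tag.

    reads: iterable of read sequences whose first tag_length bases are
        assumed to be the sample tag (an assumption for this sketch).
    tag_to_sample: dict mapping tag sequence -> sample identifier.
    Returns a dict of sample identifier -> list of reads with the tag removed.
    """
    by_sample = defaultdict(list)
    for read in reads:
        tag, insert = read[:tag_length], read[tag_length:]
        # Reads whose tag matches no known sample are set aside.
        sample = tag_to_sample.get(tag, "undetermined")
        by_sample[sample].append(insert)
    return dict(by_sample)
```

Each value in the returned dictionary would then be written to its own sequence read file, one per sample in the sequencing reaction.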
[0222] Bioinformatics module 140 includes a feature extraction module 145, which includes instructions for identifying diagnostic features, e.g., genomic features 131, from sequencing data 122 of biological samples from a subject, e.g., one or more of a solid tumor sample, a liquid biopsy sample, or a normal tissue (e.g., control) sample. For instance, in some embodiments, a feature extraction algorithm compares the identity of one or more nucleotides at a locus from the sequencing data 122 to the identity of the nucleotides at that locus in a reference sequence construct (e.g., a reference genome, exome, or targeted-panel construct) to determine whether the subject has a variant at that locus. In some embodiments, a feature extraction algorithm evaluates data other than the raw sequence, to identify a genomic alteration in the subject, e.g., an allelic ratio, a relative copy number, a repeat unit distribution, etc.
[0223] For instance, in some embodiments, feature extraction module 145 includes one or more variant identification modules that include instructions for various variant calling processes. In some embodiments, variants in the germline of the subject are identified, e.g., using a germline variant identification module 146. In some embodiments, variants in the cancer genome, e.g., somatic variants, are identified, e.g., using a somatic variant identification module 150. While separate germline and somatic variant identification modules are illustrated in Figure 1A, in some embodiments they are integrated into a single module. In some embodiments, the variant identification module includes instructions for identifying one or more of nucleotide variants (e.g., single nucleotide variants (SNV) and multi-nucleotide variants (MNV)) using one or more SNV/MNV calling algorithms (e.g., algorithms 147 and/or 151), indels (e.g., insertions or deletions of nucleotides) using one or more indel calling algorithms (e.g., algorithms 148 and/or 152), and genomic rearrangements (e.g., inversions, translocations, and fusions of nucleotide sequences) using one or more genomic rearrangement calling algorithms (e.g., algorithms 149 and/or 153).
[0224] For example, referring to Figures 1C2 and 1D2, in some embodiments, feature extraction module 145 comprises, in the variant identification module 146, a variant thresholding module 146-a, a sequence variant data store 146-r, and a variant validation module 146-o. In some such embodiments, the sequence variant data store 146-r comprises one or more candidate variants for a test subject identified by aligning to a reference sequence a plurality of sequence reads obtained from sequencing a liquid biopsy sample of the test subject, the one or more candidate variants corresponding to a respective one or more loci in the reference sequence. The plurality of sequence reads aligned to the reference sequence is used to identify a variant allele fragment count for each candidate variant. The sequence variant data store 146-r further comprises, in some embodiments, a plurality of variants from a first set of nucleic acids obtained from a cohort of subjects (e.g., from a tumor tissue biopsy for each subject in a baseline cohort of subjects). The variant thresholding module 146-a performs a function for each candidate variant in the one or more candidate variants where, for each corresponding locus 146-b (e.g., 146-b-1, ..., 146-b-P), a dynamic variant count threshold 146-d (e.g., 146-d-1) is obtained based on a pre-test odds of a positive variant call for the locus, based on the prevalence of variants in the genomic region that includes the locus, using the plurality of variants for the baseline cohort. The variant thresholding module 146-a compares the variant allele fragment count 146-c (e.g., 146-c-1) for the candidate variant against the dynamic variant count threshold 146-d for the locus corresponding to the candidate variant. In some embodiments, the variant validation module 146-o determines whether the candidate variant is validated or rejected as a somatic sequence variant based on the comparison.
For example, when the variant allele fragment count for the candidate variant satisfies the dynamic variant count threshold for the locus, the somatic sequence variant is validated, and when the variant allele fragment count for the candidate variant does not satisfy the dynamic variant count threshold for the locus, the somatic sequence variant is rejected.
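The comparison described above can be sketched as follows. This is an illustrative sketch only: the disclosure states that the dynamic variant count threshold depends on the pre-test odds of a positive call, which in turn depend on variant prevalence in the surrounding genomic region, but the two-tier rule below, its cutoff, and its threshold values are hypothetical placeholders and not the mapping used by the variant thresholding module 146-a. Treating "satisfies" as "greater than or equal to" is likewise an assumption.

```python
from dataclasses import dataclass


@dataclass
class CandidateVariant:
    locus: str                   # hypothetical locus label
    variant_fragment_count: int  # fragments supporting the variant allele


def dynamic_threshold(region_prevalence, common_cutoff=0.01,
                      low_threshold=2, high_threshold=5):
    """Hypothetical rule: regions where variants are prevalent in the
    baseline cohort (higher pre-test odds) get a lower count threshold."""
    if region_prevalence >= common_cutoff:
        return low_threshold
    return high_threshold


def validate_variant(candidate, region_prevalence):
    """Return True (validated) when the variant allele fragment count
    satisfies the locus-specific threshold, else False (rejected)."""
    threshold = dynamic_threshold(region_prevalence)
    return candidate.variant_fragment_count >= threshold
```

Under these placeholder values, the same fragment count of 3 would validate a candidate in a frequently mutated region but reject it in a rarely mutated one, which is the behavior the locus-specific threshold is meant to provide.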
[0225] In some embodiments, the dynamic variant count threshold is determined based on a distribution of variant detection sensitivities as a function of circulating variant allele fraction from the cohort of subjects (e.g., the baseline cohort). For example, referring to Figure 1C2, in some such embodiments, the variant thresholding module 146-a takes as input one or more variant allele fractions 133 from the genomic features module 131. In some such embodiments, the variant allele fractions 133 comprise a plurality of variant allele fractions obtained from tumor tissue biopsies 133-t (e.g., 133-t-1, 133-t-2, ..., 133-t-O) for the cohort of subjects. In some embodiments, the variant allele fractions comprise a plurality of variant allele fractions obtained from liquid biopsy samples 133-cf (e.g., 133-cf-1, 133-cf-2, ..., 133-cf-N) for the cohort of subjects. In some embodiments, the circulating variant allele fraction is obtained by comparing the liquid biopsy variant allele fractions 133-cf to the tumor biopsy variant allele fraction 133-t.
[0226] Additional embodiments for using variant allele fractions (e.g., variant allele frequencies) to identify somatic variants are detailed below (see, Example Methods: Variant Identification).
[0227] A SNV/MNV algorithm 147 may identify a substitution of a single nucleotide that occurs at a specific position in the genome. For example, at a specific base position, or locus, in the human genome, the C nucleotide may appear in most individuals, but in a minority of individuals, the position is occupied by an A. This means that there is a SNP at this specific position and the two possible nucleotide variations, C or A, are said to be alleles for this position. SNPs underlie differences in human susceptibility to a wide range of diseases (e.g., sickle-cell anemia, β-thalassemia and cystic fibrosis result from SNPs). The severity of illness and the way the body responds to treatments are also manifestations of genetic variations. For example, a single-base mutation in the APOE (apolipoprotein E) gene is associated with a lower risk for Alzheimer's disease. A single-nucleotide variant (SNV) is a variation in a single nucleotide without any limitations of frequency and may arise in somatic cells. A somatic single-nucleotide variation (e.g., caused by cancer) may also be called a single-nucleotide alteration. An MNP (multiple-nucleotide polymorphism) module may identify the substitution of consecutive nucleotides at a specific position in the genome.
[0228] An indel calling algorithm 148 may identify an insertion or deletion of bases in the genome of an organism, a class of small genetic variation. While indels usually measure from 1 to 10,000 base pairs in length, a microindel is defined as an indel that results in a net change of 1 to 50 nucleotides. Indels can be contrasted with a SNP or point mutation. An indel inserts and/or deletes nucleotides from a sequence, while a point mutation is a form of substitution that replaces one of the nucleotides without changing the overall number in the DNA. Indels, being insertions and/or deletions, can be used as genetic markers in natural populations, especially in phylogenetic studies.
Indel frequency tends to be markedly lower than that of single nucleotide polymorphisms (SNP), except near highly repetitive regions, including homopolymers and microsatellites.
[0229] A genomic rearrangement algorithm 149 may identify hybrid genes formed from two previously separate genes. Such fusion genes can arise as a result of translocation, interstitial deletion, or chromosomal inversion. Gene fusion can play an important role in tumorigenesis. Fusion genes can contribute to tumor formation because fusion genes can produce much more active abnormal protein than non-fusion genes. Often, fusion genes are oncogenes that cause cancer; these include BCR-ABL, TEL-AML1 (ALL with t(12;21)), AML1-ETO (M2 AML with t(8;21)), and TMPRSS2-ERG with an interstitial deletion on chromosome 21, often occurring in prostate cancer. In the case of TMPRSS2-ERG, by disrupting androgen receptor (AR) signaling and inhibiting AR expression by oncogenic ETS transcription factor, the fusion product regulates prostate cancer. Most fusion genes are found in hematological cancers, sarcomas, and prostate cancer. BCAM-AKT2 is a fusion gene that is specific and unique to high-grade serous ovarian cancer. Oncogenic fusion genes may lead to a gene product with a new or different function from the two fusion partners. Alternatively, a proto-oncogene is fused to a strong promoter, and the oncogenic function is thereby activated by upregulation driven by the strong promoter of the upstream fusion partner. The latter is common in lymphomas, where oncogenes are juxtaposed to the promoters of the immunoglobulin genes. Oncogenic fusion transcripts may also be caused by trans-splicing or read-through events. Since chromosomal translocations play such a significant role in neoplasia, a specialized database of chromosomal aberrations and gene fusions in cancer has been created. This database is called the Mitelman Database of Chromosome Aberrations and Gene Fusions in Cancer.
[0230] In some embodiments, feature extraction module 145 includes instructions for identifying one or more complex genomic alterations (e.g., features that incorporate more than a change in the primary sequence of the genome) in the cancer genome of the subject. For instance, in some embodiments, feature extraction module 145 includes modules for identifying one or more of copy number variation (e.g., copy number variation analysis module 153), microsatellite instability status (e.g., microsatellite instability analysis module 154), tumor mutational burden (e.g., tumor mutational burden analysis module 155), tumor ploidy (e.g., tumor ploidy analysis module 156), and homologous recombination pathway deficiencies (e.g., homologous recombination pathway analysis module 157).
[0231] For example, referring to Figure 1D1, the copy number variation analysis module 153 performs a method that validates a copy number annotation of a genomic segment in a test subject, in accordance with some embodiments of the present disclosure. The method comprises obtaining an input data store 153-r (e.g., a dataset), where the input data store includes a bin-level sequence ratio data structure 153-r-1 containing a plurality of bin-level sequence ratios; a segment-level sequence ratio data structure 153-r-2 containing a plurality of segment-level sequence ratios; and a segment-level dispersion measure data structure 153-r-3 containing a plurality of segment-level measures of dispersion. In some embodiments, the method further comprises passing the data in the input data store 153-r to an amplification/deletion filter construct 153-a, thus applying the dataset to a plurality of filters. The amplification/deletion filter construct 153-a comprises a plurality of filters, including an optional measure of central tendency bin-level sequence ratio filter 153-a-1; an optional segment-level measure of dispersion confidence filter 153-a-2; an optional measure of central tendency-plus-deviation bin-level sequence ratio filter 153-a-3; and/or an optional segment-level sequence ratio filter 153-a-4. In some embodiments, the copy number variation analysis module further provides an output via the validation construct 153-o, where, when a filter in the amplification/deletion filter construct 153-a is fired, the copy number annotation of the genomic segment is rejected, and when no filter in the amplification/deletion filter construct 153-a is fired, the copy number annotation of the genomic segment is validated.
In some embodiments, copy number annotations validated using the copy number variation analysis module 153 in the feature extraction module 145 are used to populate the plurality of genomic copy numbers 135 in the one or more genomic features 131 of the test patient data store 120.
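The reject-if-any-filter-fires logic of the validation construct 153-o can be sketched as below for a putative amplification. This is a minimal sketch under stated assumptions: the median is used as the measure of central tendency, each filter's firing condition and every threshold value are hypothetical placeholders, and the function and parameter names are not taken from the disclosure. Only the overall control flow (any fired filter rejects the annotation; no fired filter validates it) comes from the description above.

```python
from statistics import median


def validate_amplification(bin_ratios, segment_ratio, segment_dispersion,
                           background_median, background_deviation, *,
                           min_median_bin_ratio=0.5, max_dispersion=0.5,
                           deviation_multiplier=2.0, min_segment_ratio=0.5):
    """Apply four illustrative filters to one segment's copy number annotation.

    Returns (validated, fired), where fired lists the filters that fired.
    All thresholds are placeholder values (assumptions).
    """
    segment_median = median(bin_ratios)
    fired = []
    # Central tendency of the bin-level ratios is too low for an amplification.
    if segment_median < min_median_bin_ratio:
        fired.append("central_tendency_bin_ratio")
    # Segment-level dispersion is too high to trust the call.
    if segment_dispersion > max_dispersion:
        fired.append("segment_dispersion_confidence")
    # Segment does not stand out from background by enough deviations.
    if segment_median < background_median + deviation_multiplier * background_deviation:
        fired.append("central_tendency_plus_deviation")
    # Segment-level ratio itself is too low.
    if segment_ratio < min_segment_ratio:
        fired.append("segment_ratio")
    return (not fired), fired
```

Returning the list of fired filters alongside the decision is a design choice for the sketch; it makes a rejected annotation traceable to the specific filter that fired.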
[0232] As another example, referring to Figure 1D3, in some embodiments, feature extraction module 145 comprises a tumor fraction estimation module 145-tf. In some embodiments, the tumor fraction estimation module 145-tf comprises a sequence ratio data structure 145-tf-r including a plurality of sequence ratios (e.g., coverage ratios) obtained from a sequencing of a test liquid biopsy sample of a subject. In some embodiments, the sequence ratio data structure 145-tf-r includes the sequence ratios that are used as input to determine tumor fraction estimates for the test liquid biopsy sample. In some embodiments, the tumor fraction estimation module 145-tf also comprises a tumor purity algorithm construct 145-tf-a that executes, for example, a maximum likelihood estimation (e.g., an expectation-maximization algorithm) to calculate an estimate of the circulating tumor fraction. The tumor purity algorithm construct 145-tf-a comprises an optional input data filtration construct 145-tf-k (e.g., for removing one or more inputs passed from the sequence ratio data structure based on a minimum probe threshold or a position on a sex chromosome) and a plurality of model parameters 145-tf-d (e.g., 145-tf-d-1, 145-tf-d-2, ...) used for executing the algorithm. In some embodiments, model parameters include expected sequence ratios for a set of copy states at a given tumor purity; a distance (e.g., an error) from a test sequence ratio to the closest expected sequence ratio at the given tumor purity; a minimum distance (e.g., a minimum error) from a test sequence ratio to the closest expected sequence ratio at the given tumor purity (e.g., an assigned test copy state selected from a minimal distance expected copy state); and/or a tumor purity score (e.g., a sum of weighted errors).
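The model parameters listed above can be illustrated with the following sketch, which scores candidate tumor purities by the sum of weighted distances from each observed sequence ratio to the closest expected ratio. Several points are assumptions and not taken from the disclosure: a plain grid search over purity is substituted for the expectation-maximization procedure, the expected log2 ratio for copy state c at purity p is taken to be log2((p*c + 2*(1 - p)) / 2) (a common mixture model with a diploid normal background), the copy states are limited to 0 through 4, and the per-segment weights are hypothetical.

```python
import math

COPY_STATES = (0, 1, 2, 3, 4)  # assumed set of tumor copy states


def expected_log2_ratio(copy_state, purity):
    """Expected log2 coverage ratio for a tumor copy state mixed with
    diploid normal DNA at the given purity (assumed model, purity < 1)."""
    return math.log2((purity * copy_state + (1.0 - purity) * 2.0) / 2.0)


def purity_score(segments, purity):
    """Sum of weighted minimum distances to the closest expected ratio.

    segments: iterable of (observed_log2_ratio, weight) pairs.
    """
    score = 0.0
    for ratio, weight in segments:
        min_distance = min(abs(ratio - expected_log2_ratio(c, purity))
                           for c in COPY_STATES)
        score += weight * min_distance
    return score


def estimate_tumor_fraction(segments, grid=None):
    """Return the purity on the grid with the lowest purity score."""
    if grid is None:
        grid = [i / 100 for i in range(1, 100)]  # 0.01 .. 0.99
    return min(grid, key=lambda p: purity_score(segments, p))
```

A caveat on this simplified model: different purity and copy-state combinations can predict the same ratio (for example, one copy at purity 0.30 and zero copies at purity 0.15), so the estimate is only well determined when the segments span enough distinct copy states.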
[0233] In some embodiments, referring to Figure 1C3, the tumor fraction estimation module 145-tf is used to obtain one or more circulating tumor fraction estimates 131-i that are included as feature data 125 in a test patient data store 120. For example, in some embodiments, a plurality of circulating tumor fraction estimates is obtained from a test liquid biopsy sample of a subject 131-i-cf (e.g., 131-i-cf-1, 131-i-cf-2, ..., 131-i-cf-N). In some embodiments, the plurality of circulating tumor fraction estimates is obtained from a single patient at different collection times.
[0234] Further details and specific embodiments regarding methods for analysis and validation of copy number variation, validation of a somatic sequence variant, and/or determination of a circulating tumor fraction estimate are provided below with reference to Figures 4, 5, and 6 (e.g., Figures 4F1, 5A1-5E1, and 6A1-6C1; Figures 4F2, 5A2-5B2, and 6A2, and/or Figures 4F3, 5A3-5B3, and 6A3-6C3).
Feature Analysis Module (160)
[0235] Referring again to Figure 1A, the system (e.g., system 100) includes a feature analysis module 160 that includes one or more genomic alteration interpretation algorithms 161, one or more optional clinical data analysis algorithms 165, an optional therapeutic curation algorithm 165, and an optional recommendation validation module 167. In some embodiments, feature analysis module 160 identifies actionable variants and characteristics 139-1 and corresponding matched therapies 139-2 and/or clinical trials using one or more analysis algorithms (e.g., algorithms 162, 163, 164, and 165) to evaluate feature data 125.
The identified actionable variants and characteristics 139-1 and corresponding matched therapies 139-2, which are optionally stored in test patient data store 120, are then curated by feature analysis module 160 to generate a clinical report 139-3, which is optionally validated by a user, e.g., a clinician, before being transmitted to a medical professional, e.g., an oncologist, treating the patient.
[0236] In some embodiments, the genomic alteration interpretation algorithms 161 include instructions for evaluating the effect that one or more genomic features 131 of the subject, e.g., as identified by feature extraction module 145, have on the characteristics of the patient’s cancer and/or whether one or more targeted cancer therapies may improve the clinical outcome for the patient. For example, in some embodiments, one or more genomic variant analysis algorithms 163 evaluate various genomic features 131 by querying a database, e.g., a look-up-table (“LUT”) of actionable genomic alterations, targeted therapies associated with the actionable genomic alterations, and any other conditions that should be met before administering the targeted therapy to a subject having the actionable genomic alteration. For instance, evidence suggests that depatuxizumab mafodotin (an anti-EGFR mAb conjugated to monomethyl auristatin F) has improved efficacy for the treatment of recurrent glioblastomas having EGFR focal amplifications. See van den Bent M. et al., Cancer Chemother Pharmacol., 80(6):1209-17 (2017). Accordingly, the actionable genomic alteration LUT would have an entry for the focal amplification of the EGFR gene indicating that depatuxizumab mafodotin is a targeted therapy for glioblastomas (e.g., recurrent glioblastomas) having a focal gene amplification. In some instances, the LUT may also include counter-indications for the associated targeted therapy, e.g., adverse drug interactions or personal characteristics that are counter-indicated for administration of the particular targeted therapy.
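A look-up-table query of the kind described above might be sketched as follows. The table layout, key format, field names, and function name are hypothetical and chosen only for illustration; the single entry reflects the EGFR focal amplification example given above, and its counter-indication set is left empty because none are named in the text.

```python
# Hypothetical LUT keyed by (gene, alteration type).
ACTIONABLE_ALTERATIONS = {
    ("EGFR", "focal amplification"): [
        {
            "therapy": "depatuxizumab mafodotin",
            "cancer_types": {"glioblastoma", "recurrent glioblastoma"},
            "counter_indications": set(),  # none are named in the text
        },
    ],
}


def match_therapies(alterations, cancer_type, patient_flags=frozenset()):
    """Return therapies whose LUT entry matches an observed alteration and
    the cancer type, skipping entries counter-indicated for the patient."""
    matches = []
    for alteration in alterations:
        for entry in ACTIONABLE_ALTERATIONS.get(alteration, []):
            if cancer_type not in entry["cancer_types"]:
                continue  # condition for administering the therapy not met
            if entry["counter_indications"] & set(patient_flags):
                continue  # patient has a counter-indicated characteristic
            matches.append(entry["therapy"])
    return matches
```

For example, calling `match_therapies([("EGFR", "focal amplification")], "glioblastoma")` with this table returns the single matched therapy, while the same alteration in a cancer type not listed in the entry returns no match.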
[0237] In some embodiments, a genomic alteration interpretation algorithm 161 determines whether a particular genomic feature 131 should be reported to a medical professional treating the cancer patient. In some embodiments, genomic features 131 (e.g., genomic alterations and compound features) are reported when there is clinical evidence that the feature significantly impacts the biology of the cancer, impacts the prognosis for the cancer, and/or impacts pharmacogenomics, e.g., by indicating or counter-indicating particular therapeutic approaches. For instance, a genomic alteration interpretation algorithm 161 may classify a particular CNV feature 135 as “Reportable,” e.g., meaning that the CNV has been identified as influencing the character of the cancer, the overall disease state, and/or pharmacogenomics, as “Not Reportable,” e.g., meaning that the CNV has not been identified as influencing the character of the cancer, the overall disease state, and/or pharmacogenomics, as “No Evidence,” e.g., meaning that no evidence exists supporting that the CNV is “Reportable” or “Not Reportable,” or as “Conflicting Evidence,” e.g., meaning that evidence exists supporting both that the CNV is “Reportable” and that the CNV is “Not Reportable.”
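The four reportability classes described above follow directly from whether evidence exists on each side. A minimal sketch, assuming evidence is summarized as two counts (a hypothetical representation; how the genomic alteration interpretation algorithm 161 gathers or weighs evidence is not specified here):

```python
def classify_reportability(evidence_for, evidence_against):
    """Classify a feature from counts of evidence items supporting
    'Reportable' (evidence_for) and 'Not Reportable' (evidence_against)."""
    if evidence_for > 0 and evidence_against > 0:
        return "Conflicting Evidence"
    if evidence_for > 0:
        return "Reportable"
    if evidence_against > 0:
        return "Not Reportable"
    return "No Evidence"
```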
[0238] In some embodiments, the genomic alteration interpretation algorithms 161 include one or more pathogenic variant analysis algorithms 162, which evaluate various genomic features to identify the presence of an oncogenic pathogen associated with the patient’s cancer and/or targeted therapies associated with an oncogenic pathogen infection in the cancer. For instance, RNA expression patterns of some cancers are associated with the presence of an oncogenic pathogen that is helping to drive the cancer. See, for example, U.S. Patent Application Serial No. 16/802,126, filed February 26, 2020, the content of which is hereby incorporated by reference, in its entirety, for all purposes. In some instances, the recommended therapy for the cancer is different when the cancer is associated with the oncogenic pathogen infection than when it is not. Accordingly, in some embodiments, e.g., where feature data 125 includes RNA abundance data for the cancer of the patient, one or more pathogenic variant analysis algorithms 162 evaluate the RNA abundance data for the patient’s cancer to determine whether a signature exists in the data that indicates the presence of the oncogenic pathogen in the cancer. Similarly, in some embodiments, bioinformatics module 140 includes an algorithm that searches for the presence of pathogenic nucleic acid sequences in sequencing data 122. See, for example, U.S. Provisional Patent Application Serial No. 62/978,067, filed February 18, 2020, the content of which is hereby incorporated by reference, in its entirety, for all purposes. Accordingly, in some embodiments, one or more pathogenic variant analysis algorithms 162 evaluates whether the presence of an oncogenic pathogen in a subject is associated with an actionable therapy for the infection. 
In some embodiments, system 100 queries a database, e.g., a look-up-table (“LUT”), of actionable oncogenic pathogen infections, targeted therapies associated with the actionable infections, and any other conditions that should be met before administering the targeted therapy to a subject that is infected with the oncogenic pathogen. In some instances, the LUT may also include counter-indications for the associated targeted therapy, e.g., adverse drug interactions or personal characteristics that are counter-indicated for administration of the particular targeted therapy.
[0239] In some embodiments, the genomic alteration interpretation algorithms 161 include one or more multi-feature analysis algorithms 164 that evaluate a plurality of features to classify a cancer with respect to the effects of one or more targeted therapies. For instance, in some embodiments, feature analysis module 160 includes one or more classifiers trained against feature data, one or more clinical therapies, and their associated clinical outcomes for a plurality of training subjects to classify cancers based on their predicted clinical outcomes following one or more therapies.
[0240] In some embodiments, the classifier is implemented as an artificial intelligence engine and may include gradient boosting models, random forest models, neural networks (NN), regression models, Naive Bayes models, and/or machine learning algorithms (MLA). An MLA or a NN may be trained from a training data set that includes one or more features 125, including personal characteristics 126, medical history 127, clinical features 128, genomic features 131, and/or other -omic features 138. MLAs include supervised algorithms (such as algorithms where the features/classifications in the data set are annotated) using linear regression, logistic regression, decision trees, classification and regression trees, naive Bayes, nearest neighbor clustering; unsupervised algorithms (such as algorithms where no features/classification in the data set are annotated) using Apriori, means clustering, principal component analysis, random forest, adaptive boosting; and semi-supervised algorithms (such as algorithms where an incomplete number of features/classifications in the data set are annotated) using generative approach (such as a mixture of Gaussian distributions, mixture of multinomial distributions, hidden Markov models), low density separation, graph-based approaches (such as mincut, harmonic function, manifold regularization), heuristic approaches, or support vector machines.
[0241] NNs include conditional random fields, convolutional neural networks, attention based neural networks, deep learning, long short term memory networks, or other neural models where the training data set includes a plurality of tumor samples, RNA expression data for each sample, and pathology reports covering imaging data for each sample.
[0242] While MLA and neural networks identify distinct approaches to machine learning, the terms may be used interchangeably herein. Thus, a mention of MLA may include a corresponding NN or a mention of NN may include a corresponding MLA unless explicitly stated otherwise. Training may include providing optimized datasets, labeling these traits as they occur in patient records, and training the MLA to predict or classify based on new inputs. Artificial NNs are efficient computing models which have shown their strengths in solving hard problems in artificial intelligence. They have also been shown to be universal approximators, that is, they can represent a wide variety of functions when given appropriate parameters.
[0243] In some embodiments, system 100 includes a classifier training module that includes instructions for training one or more untrained or partially trained classifiers based on feature data from a training dataset. In some embodiments, system 100 also includes a database of training data for use in training the one or more classifiers. In other embodiments, the classifier training module accesses a remote storage device hosting training data. In some embodiments, the training data includes a set of training features, including but not limited to, various types of the feature data 125 illustrated in Figure 1B. In some embodiments, the classifier training module uses patient data 121, e.g., when test patient data store 120 also stores a record of treatments administered to the patient and patient outcomes following therapy.
[0244] In some embodiments, feature analysis module 160 includes one or more clinical data analysis algorithms 165, which evaluate clinical features 128 of a cancer to identify targeted therapies which may benefit the subject. For example, in some embodiments, e.g., where feature data 125 includes pathology data 128-1, one or more clinical data analysis algorithms 165 evaluate the data to determine whether an actionable therapy is indicated based on the histopathology of a tumor biopsy from the subject, e.g., which is indicative of a particular cancer type and/or stage of cancer. In some embodiments, system 100 queries a database, e.g., a look-up-table (“LUT”), of actionable clinical features (e.g., pathology features), targeted therapies associated with the actionable features, and any other conditions that should be met before administering the targeted therapy to a subject associated with the actionable clinical features 128 (e.g., pathology features 128-1). In some embodiments, system 100 evaluates the clinical features 128 (e.g., pathology features 128-1) directly to determine whether the patient’s cancer is sensitive to a particular therapeutic agent. Further details on example methods, systems, and algorithms for classifying cancer and identifying targeted therapies based on clinical data, such as pathology data 128-1, imaging data 128-2, and/or tissue culture/organoid data 128-3 are discussed, for example, in U.S. Patent Application No. 16/830,186, filed on March 25, 2020, U.S. Patent Application No. 16/789,363, filed on Feb. 12, 2020, and U.S. Provisional Application No. 63/007,874, filed on April 9, 2020, the contents of which are hereby incorporated by reference, in their entireties, for all purposes.
[0245] In some embodiments, feature analysis module 160 includes a clinical trials module that evaluates test patient data 121 to determine whether the patient is eligible for inclusion in a clinical trial for a cancer therapy, e.g., a clinical trial that is currently recruiting patients, a clinical trial that has not yet begun recruiting patients, and/or an ongoing clinical trial that may recruit additional patients in the future. In some embodiments, a clinical trial module evaluates test patient data 121 to determine whether the results of a clinical trial are relevant for the patient, e.g., the results of an ongoing clinical trial and/or the results of a completed clinical trial. For instance, in some embodiments, system 100 queries a database, e.g., a look-up-table (“LUT”) of clinical trials, e.g., active and/or completed clinical trials, and compares patient data 121 with inclusion criteria for the clinical trials, stored in the database, to identify clinical trials with inclusion criteria that closely match and/or exactly match the patient’s data 121. In some embodiments, a record of matching clinical trials, e.g., those clinical trials that the patient may be eligible for and/or that may inform personalized treatment decisions for the patient, are stored in clinical assessment database 139.
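The comparison of patient data with stored inclusion criteria described above can be illustrated with a minimal sketch. The record layout, the field names (`cancer_type`, `variant`, `stage`), the trial identifiers, and the idea of scoring a "close" match as the fraction of criteria met are all assumptions made for illustration; the disclosure does not specify how the look-up table is structured or how closeness is measured.

```python
from dataclasses import dataclass, field


@dataclass
class Trial:
    """One row of a hypothetical clinical-trial look-up table."""
    trial_id: str
    status: str  # e.g. "recruiting", "not_yet_recruiting", "completed"
    # Inclusion criteria: patient attribute -> set of accepted values.
    inclusion: dict = field(default_factory=dict)


def match_fraction(patient: dict, trial: Trial) -> float:
    """Fraction of a trial's inclusion criteria that the patient satisfies."""
    if not trial.inclusion:
        return 1.0
    met = sum(
        1 for key, accepted in trial.inclusion.items()
        if patient.get(key) in accepted
    )
    return met / len(trial.inclusion)


def find_matching_trials(patient, trials, min_fraction=1.0):
    """Return (trial_id, fraction) pairs, best match first.

    min_fraction=1.0 keeps only exact matches; a lower value also keeps
    close matches (an assumed definition of "closely match").
    """
    scored = [(t.trial_id, match_fraction(patient, t)) for t in trials]
    kept = [(tid, frac) for tid, frac in scored if frac >= min_fraction]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical table contents and patient record.
trials = [
    Trial("TRIAL-A", "recruiting",
          {"cancer_type": {"NSCLC"}, "variant": {"EGFR L858R", "EGFR T790M"}}),
    Trial("TRIAL-B", "completed",
          {"cancer_type": {"NSCLC"}, "stage": {"IV"}}),
]
patient = {"cancer_type": "NSCLC", "variant": "EGFR L858R", "stage": "III"}

exact = find_matching_trials(patient, trials)
close = find_matching_trials(patient, trials, min_fraction=0.5)
```

Here `exact` holds only the trial whose criteria are all met, while `close` additionally holds the trial for which half of the criteria are met. A production system would likely also handle numeric ranges, exclusion criteria, and trial status filters, none of which are modeled in this sketch.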
[0246] In some embodiments, feature analysis module 160 includes a therapeutic curation algorithm 166 that assembles actionable variants and characteristics 139-1, matched therapies 139-2, and/or relevant clinical trials identified for the patient, as described above. In some embodiments, a therapeutic curation algorithm 166 evaluates certain criteria related to which actionable variants and characteristics 139-1, matched therapies 139-2, and/or relevant clinical trials should be reported and/or whether certain matched therapies, considered alone or in combination, may be counter-indicated for the patient, e.g., based on personal characteristics 126 of the patient and/or known drug-drug interactions. In some embodiments, the therapeutic curation algorithm then generates one or more clinical reports 139-3 for the patient. In some embodiments, the therapeutic curation algorithm generates a first clinical report 139-3-1 that is to be reported to a medical professional treating the patient and a second clinical report 139-3-2 that will not be communicated to the medical professional, but may be used to improve various algorithms within the system.
[0247] In some embodiments, feature analysis module 160 includes a recommendation validation module 167 that includes an interface allowing a clinician to review, modify, and approve a clinical report 139-3 prior to the report being sent to a medical professional, e.g., an oncologist, treating the patient.
[0248] In some embodiments, each of the one or more feature collections, sequencing modules, bioinformatics modules (including, e.g., alteration module(s), structural variant calling and data processing modules), classification modules and outcome modules are communicatively coupled to a data bus to transfer data between each module for processing and/or storage. In some alternative embodiments, each of the feature collection, alteration module(s), structural variant and feature store are communicatively coupled to each other for independent communication without sharing the data bus.
[0249] Further details on systems and exemplary embodiments of modules and feature collections are discussed in PCT Application PCT/US19/69149, titled “A METHOD AND PROCESS FOR PREDICTING AND ANALYZING PATIENT COHORT RESPONSE, PROGRESSION, AND SURVIVAL,” filed December 31, 2019, which is hereby incorporated herein by reference in its entirety.
Example Methods
[0250] Now that details of a system 100 for providing clinical support for personalized cancer therapy, e.g., with improved validation of copy number variation, improved validation of somatic sequence variants, and/or improved determination of circulating tumor fraction estimates have been disclosed, details regarding processes and features of the system, in accordance with various embodiments of the present disclosure, are disclosed below. Specifically, example processes are described below with reference to Figures 2A, 3, 4, 5, 6 and 7 (e.g., Figures 2A, 3, 4A-E; Figures 4F1, 5A1-5E1, 6A1-6C1, and 7A1-7C1; Figures 4F2, 5A2-5B2, 6A2, and 7A2-7B2; and/or Figures 4F3, 5A3-5B3, 6A3-6C3, and 7A3). In some embodiments, such processes and features of the system are carried out by modules 118, 120, 140, 160, and/or 170, as illustrated in Figure 1A. Referring to these methods, the systems described herein (e.g., system 100) include instructions for determining and validating focal copy number variations that are improved compared to conventional methods for copy number analysis, instructions for validating somatic variants that are improved compared to conventional methods for somatic variant detection, and/or instructions for determining accurate circulating tumor fraction estimates that are improved compared to conventional methods for obtaining circulating tumor fraction estimates.
Figure 2B: Distributed Diagnostic and Clinical Environment
[0251] In some aspects, the methods described herein for providing clinical support for personalized cancer therapy are performed across a distributed diagnostic/clinical environment, e.g., as illustrated in Figure 2B. However, in some embodiments, the improved methods described herein for supporting clinical decisions in precision oncology using liquid biopsy assays (e.g., by validating a copy number variation in a test subject, validating a somatic sequence variant in a test subject having a cancer condition, determining accurate circulating tumor fraction estimates, etc.) are performed at a single location, e.g., at a single computing system or environment, although ancillary procedures supporting the methods described herein, and/or procedures that make further use of the results of the methods described herein, may be performed across a distributed diagnostic/clinical environment.
[0252] Figure 2B illustrates an example of a distributed diagnostic/clinical environment 210. In some embodiments, the distributed diagnostic/clinical environment is connected via communication network 105. In some embodiments, one or more biological samples, e.g., one or more liquid biopsy samples, solid tumor biopsy, normal tissue samples, and/or control samples, are collected from a subject in clinical environment 220, e.g., a doctor’s office, hospital, or medical clinic, or at a home health care environment (not depicted). Advantageously, while solid tumor samples should be collected within a clinical setting, liquid biopsy samples can be acquired in a less invasive fashion and are more easily collected outside of a traditional clinical setting. In some embodiments, one or more biological samples, or portions thereof, are processed within the clinical environment 220 where collection occurred, using a processing device 224, e.g., a nucleic acid sequencer for obtaining sequencing data, a microscope for obtaining pathology data, a mass spectrometer for obtaining proteomic data, etc. In some embodiments, one or more biological samples, or portions thereof are sent to one or more external environments, e.g., sequencing lab 230, pathology lab 240, and/or molecular biology lab 250, each of which includes a processing device 234, 244, and 254, respectively, to generate biological data 121 for the subject. Each environment includes a communications device 222, 232, 242, and 252, respectively, for communicating biological data 121 about the subject to a processing server 262 and/or database 264, which may be located in yet another environment, e.g., processing/storage center 260. Thus, in some embodiments, different portions of the systems and methods described herein are fulfilled by different processing devices located in different physical environments. 
[0253] Accordingly, in some embodiments, a method for providing clinical support for personalized cancer therapy, e.g., with improved validation of copy number variations, improved validation of somatic sequence variants, and/or improved determination of circulating tumor fraction estimates, is performed across one or more environments, as illustrated in Figure 2B. For instance, in some such embodiments, a liquid biopsy sample is collected at clinical environment 220 or in a home healthcare environment. The sample, or a portion thereof, is sent to sequencing lab 230 where raw sequence reads 123 of nucleic acids in the sample are generated by sequencer 234. The raw sequencing data 123 is communicated, e.g., from communications device 232, to database 264 at processing/storage center 260, where processing server 262 extracts features from the sequence reads by executing one or more of the processes in bioinformatics module 140, thereby generating genomic features 131 for the sample. Processing server 262 may then analyze the identified features by executing one or more of the processes in feature analysis module 160, thereby generating clinical assessment 139, including a clinical report 139-3. A clinician may access clinical report 139-3, e.g., at processing/storage center 260 or through communications network 105, via recommendation validation module 167. After final approval, clinical report 139-3 is transmitted to a medical professional, e.g., an oncologist, at clinical environment 220, who uses the report to support clinical decision making for personalized treatment of the patient’s cancer.
Figure 2A: Example Workflow for Precision Oncology
[0254] Figure 2A is a flowchart of an example workflow 200 for collecting and analyzing data in order to generate a clinical report 139 to support clinical decision making in precision oncology. Advantageously, the methods described herein improve this process, for example, by improving various stages within feature extraction 206, including validating copy number variations, validating somatic sequence variants, and/or determining circulating tumor fraction estimates.
[0255] Briefly, the workflow begins with patient intake and sample collection 201, where one or more liquid biopsy samples, one or more tumor biopsies, and one or more normal and/or control tissue samples are collected from the patient (e.g., at a clinical environment 220 or home healthcare environment, as illustrated in Figure 2B). In some embodiments, personal data 126 corresponding to the patient and a record of the one or more biological samples obtained (e.g., patient identifiers, patient clinical data, sample type, sample identifiers, cancer conditions, etc.) are entered into a data analysis platform, e.g., test patient data store 120. Accordingly, in some embodiments, the methods disclosed herein include obtaining one or more biological samples from one or more subjects, e.g., cancer patients. In some embodiments, the subject is a human, e.g., a human cancer patient.
[0256] In some embodiments, one or more of the biological samples obtained from the patient are a biological liquid sample, also referred to as a liquid biopsy sample. In some embodiments, one or more of the biological samples obtained from the patient are selected from blood, plasma, serum, urine, vaginal fluid, fluid from a hydrocele (e.g., of the testis), vaginal flushing fluids, pleural fluid, ascitic fluid, cerebrospinal fluid, saliva, sweat, tears, sputum, bronchoalveolar lavage fluid, discharge fluid from the nipple, aspiration fluid from different parts of the body (e.g., thyroid, breast), etc. In some embodiments, the liquid biopsy sample includes blood and/or saliva. In some embodiments, the liquid biopsy sample is peripheral blood. In some embodiments, blood samples are collected from patients in commercial blood collection containers, e.g., using PAXgene® Blood DNA Tubes. In some embodiments, saliva samples are collected from patients in commercial saliva collection containers, e.g., using an Oragene® DNA Saliva Kit.
[0257] In some embodiments, the liquid biopsy sample has a volume of from about 1 mL to about 50 mL. For example, in some embodiments, the liquid biopsy sample has a volume of about 1 mL, about 2 mL, about 3 mL, about 4 mL, about 5 mL, about 6 mL, about 7 mL, about 8 mL, about 9 mL, about 10 mL, about 11 mL, about 12 mL, about 13 mL, about 14 mL, about 15 mL, about 16 mL, about 17 mL, about 18 mL, about 19 mL, about 20 mL, or greater.
[0258] Liquid biopsy samples include cell free nucleic acids, including cell-free DNA (cfDNA). As described above, cfDNA isolated from cancer patients includes DNA originating from cancerous cells, also referred to as circulating tumor DNA (ctDNA), cfDNA originating from germline (e.g., healthy or non-cancerous) cells, and cfDNA originating from hematopoietic cells (e.g., white blood cells). The relative proportions of cancerous and non-cancerous cfDNA present in a liquid biopsy sample vary depending on the characteristics (e.g., the type, stage, lineage, genomic profile, etc.) of the patient’s cancer. As used herein, the ‘tumor burden’ of the subject refers to the percentage of cfDNA that originated from cancerous cells.
[0259] As described herein, cfDNA is a particularly useful source of biological data for various implementations of the methods and systems described herein, because it is readily obtained from various body fluids. Advantageously, use of bodily fluids facilitates serial monitoring because of the ease of collection, as these fluids are collectable by non-invasive or minimally invasive methodologies. This is in contrast to methods that rely upon solid tissue samples, such as biopsies, which often times require invasive surgical procedures. Further, because bodily fluids, such as blood, circulate throughout the body, the cfDNA population represents a sampling of many different tissue types from many different locations.
[0260] In some embodiments, a liquid biopsy sample is separated into two different samples. For example, in some embodiments, a blood sample is separated into a blood plasma sample, containing cfDNA, and a buffy coat preparation, containing white blood cells.
[0261] In some embodiments, a plurality of liquid biopsy samples is obtained from a respective subject at intervals over a period of time (e.g., using serial testing). For example, in some such embodiments, the time between obtaining liquid biopsy samples from a respective subject is at least 1 day, at least 2 days, at least 1 week, at least 2 weeks, at least 1 month, at least 2 months, at least 3 months, at least 4 months, at least 6 months, or at least 1 year.
[0262] In some embodiments, one or more biological samples collected from the patient is a solid tissue sample, e.g., a solid tumor sample or a solid normal tissue sample. Methods for obtaining solid tissue samples, e.g., of cancerous and/or normal tissue are known in the art and are dependent upon the type of tissue being sampled. For example, bone marrow biopsies and isolation of circulating tumor cells can be used to obtain samples of blood cancers, endoscopic biopsies can be used to obtain samples of cancers of the digestive tract, bladder, and lungs, needle biopsies (e.g., fine-needle aspiration, core needle aspiration, vacuum-assisted biopsy, and image-guided biopsy) can be used to obtain samples of subdermal tumors, skin biopsies, e.g., shave biopsy, punch biopsy, incisional biopsy, and excisional biopsy, can be used to obtain samples of dermal cancers, and surgical biopsies can be used to obtain samples of cancers affecting internal organs of a patient. In some embodiments, a solid tissue sample is a formalin-fixed tissue (FFT). In some embodiments, a solid tissue sample is a macro-dissected formalin fixed paraffin embedded (FFPE) tissue. In some embodiments, a solid tissue sample is a fresh frozen tissue sample.
[0263] In some embodiments, a dedicated normal sample is collected from the patient, for co-processing with a liquid biopsy sample. Generally, the normal sample is of a non-cancerous tissue, and can be collected using any tissue collection means described above. In some embodiments, buccal cells collected from the inside of a patient’s cheeks are used as a normal sample. Buccal cells can be collected by placing an absorbent material, e.g., a swab, in the subject’s mouth and rubbing it against their cheek, e.g., for at least 15 seconds or for at least 30 seconds. The swab is then removed from the patient’s mouth and inserted into a tube, such that the tip of the tube is submerged into a liquid that serves to extract the buccal cells off of the absorbent material. An example of buccal cell recovery and collection devices is provided in U.S. Patent No. 9,138,205, the content of which is hereby incorporated by reference, in its entirety, for all purposes. In some embodiments, the buccal swab DNA is used as a source of normal DNA in circulating heme malignancies.
[0264] The biological samples collected from the patient are, optionally, sent to various analytical environments (e.g., sequencing lab 230, pathology lab 240, and/or molecular biology lab 250) for processing (e.g., data collection) and/or analysis (e.g., feature extraction). Wet lab processing 204 may include cataloguing samples (e.g., accessioning), examining clinical features of one or more samples (e.g., pathology review), and nucleic acid sequence analysis (e.g., extraction, library prep, capture + hybridize, pooling, and sequencing). In some embodiments, the workflow includes clinical analysis of one or more biological samples collected from the subject, e.g., at a pathology lab 240 and/or a molecular and cellular biology lab 250, to generate clinical features such as pathology features 128-1, imaging data 128-2, and/or tissue culture / organoid data 128-3.
[0265] In some embodiments, the pathology data 128-1 collected during clinical evaluation includes visual features identified by a pathologist’s inspection of a specimen (e.g., a solid tumor biopsy), e.g., of stained H&E or IHC slides. In some embodiments, the sample is a solid tissue biopsy sample. In some embodiments, the tissue biopsy sample is a formalin-fixed tissue (FFT), e.g., a formalin-fixed paraffin-embedded (FFPE) tissue. In some embodiments, the tissue biopsy sample is an FFPE or FFT block. In some embodiments, the tissue biopsy sample is a fresh-frozen tissue biopsy. The tissue biopsy sample can be prepared in thin sections (e.g., by cutting and/or affixing to a slide), to facilitate pathology review (e.g., by staining with immunohistochemistry stain for IHC review and/or with hematoxylin and eosin stain for H&E pathology review). For instance, analysis of slides for H&E staining or IHC staining may reveal features such as tumor infiltration, programmed death-ligand 1 (PD-L1) status, human leukocyte antigen (HLA) status, or other immunological features.
[0266] In some embodiments, a liquid sample (e.g., blood) collected from the patient (e.g., in EDTA-containing collection tubes) is prepared on a slide (e.g., by smearing) for pathology review. In some embodiments, macrodissected FFPE tissue sections, which may be mounted on a histopathology slide, from solid tissue samples (e.g., tumor or normal tissue) are analyzed by pathologists. In some embodiments, tumor samples are evaluated to determine, e.g., the tumor purity of the sample, the percent tumor cellularity as a ratio of tumor to normal nuclei, etc. For each section, background tissue may be excluded or removed such that the section meets a tumor purity threshold, e.g., where at least 20% of the nuclei in the section are tumor nuclei, or where at least 25%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or more of the nuclei in the section are tumor nuclei.
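The tumor purity threshold described above reduces to a simple proportion, sketched below. The function names and the use of per-section nucleus counts as input are assumptions for illustration; the 20% default is the example threshold named in the text, and purity is computed here as tumor nuclei over all counted nuclei, consistent with the phrase "at least 20% of the nuclei in the section are tumor nuclei".

```python
def tumor_purity(tumor_nuclei: int, normal_nuclei: int) -> float:
    """Fraction of counted nuclei in a section that are tumor nuclei."""
    total = tumor_nuclei + normal_nuclei
    if total == 0:
        raise ValueError("section contains no counted nuclei")
    return tumor_nuclei / total


def meets_purity_threshold(tumor_nuclei: int, normal_nuclei: int,
                           threshold: float = 0.20) -> bool:
    """True when the section's tumor purity is at or above the threshold."""
    return tumor_purity(tumor_nuclei, normal_nuclei) >= threshold
```

For example, a section with 30 tumor nuclei and 70 normal nuclei passes the 20% threshold, whereas a section with 10 tumor and 90 normal nuclei does not and would be a candidate for excluding or removing background tissue.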
[0267] Conversion of solid tumor test to liquid biopsy test. In one embodiment, the solid tissue sample is insufficient for NGS testing (for example, the sample is too small or too degraded, the amount or quality of nucleic acids extracted from the sample does not result in quality NGS results that would result in reliable determination of variants and/or other genetic characteristics of the sample), and the physician or patient may decide to convert the solid tissue test that was ordered to a liquid biopsy test to be performed on a liquid biopsy sample collected from the same patient. The resulting report and/or display of the results on a portal may include an “xF Conversion Badge” to distinguish any order that has been converted from solid tissue test to a liquid biopsy test (compared to, for example, a liquid biopsy test that was not initially ordered as a solid tissue test). This will allow a user to identify which orders have been converted by this process, and distinguish between orders that were intentionally placed for the liquid biopsy panel.
[0268] In some embodiments, pathology data 128-1 is extracted, in addition to or instead of visual inspection, using computational approaches to digital pathology, e.g., providing morphometric features extracted from digital images of stained tissue samples. A review of digital pathology methods is provided in Bera, K. et al., Nat. Rev. Clin. Oncol., 16:703-15 (2019), the content of which is hereby incorporated by reference, in its entirety, for all purposes. In some embodiments, pathology data 128-1 includes features determined using machine learning algorithms to evaluate pathology data collected as described above.
[0269] Further details on methods, systems, and algorithms for using pathology data to classify cancer and identify targeted therapies are discussed, for example, in U.S. Patent Application No. 16/830,186, filed on March 25, 2020, and U.S. Provisional Application No. 63/007,874, filed on April 9, 2020, the contents of which are hereby incorporated by reference, in their entireties, for all purposes.
[0270] In some embodiments, imaging data 128-2 collected during clinical evaluation includes features identified by review of in-vitro and/or in-vivo imaging results (e.g., of a tumor site), for example a size of a tumor, tumor size differentials over time (such as during treatment or during other periods of change). In some embodiments, imaging data 128-2 includes features determined using machine learning algorithms to evaluate imaging data collected as described above.
[0271] Further details on methods, systems, and algorithms for using medical imaging to classify cancer and identify targeted therapies are discussed, for example, in U.S. Patent Application No. 16/830,186, filed on March 25, 2020, and U.S. Provisional Application No. 63/007,874, filed on April 9, 2020, the contents of which are hereby incorporated by reference, in their entireties, for all purposes.
[0272] In some embodiments, tissue culture / organoid data 128-3 collected during clinical evaluation includes features identified by evaluation of cultured tissue from the subject. For instance, in some embodiments, tissue samples obtained from the patients (e.g., tumor tissue, normal tissue, or both) are cultured (e.g., in liquid culture, solid-phase culture, and/or organoid culture) and various features, such as cell morphology, growth characteristics, genomic alterations, and/or drug sensitivity, are evaluated. In some embodiments, tissue culture / organoid data 128-3 includes features determined using machine learning algorithms to evaluate tissue culture / organoid data collected as described above. Examples of tissue organoid (e.g., personal tumor organoid) culturing and feature extractions thereof are described in U.S. Provisional Application Serial No. 62/924,621, filed on October 22, 2019, and U.S. Patent Application Serial No. 16/693,117, filed on November 22, 2019, the contents of which are hereby incorporated by reference, in their entireties, for all purposes.
[0273] Nucleic acid sequencing of one or more samples collected from the subject is performed, e.g., at sequencing lab 230, during wet lab processing 204. An example workflow for nucleic acid sequencing is illustrated in Figure 3. In some embodiments, the one or more biological samples obtained at the sequencing lab 230 are accessioned (302), to track the sample and data through the sequencing process.
[0274] Next, nucleic acids, e.g., RNA and/or DNA are extracted (304) from the one or more biological samples. Methods for isolating nucleic acids from biological samples are known in the art, and are dependent upon the type of nucleic acid being isolated (e.g., cfDNA, DNA, and/or RNA) and the type of sample from which the nucleic acids are being isolated (e.g., liquid biopsy samples, white blood cell buffy coat preparations, formalin-fixed paraffin-embedded (FFPE) solid tissue samples, and fresh frozen solid tissue samples). The selection of any particular nucleic acid isolation technique for use in conjunction with the embodiments described herein is well within the skill of the person having ordinary skill in the art, who will consider the sample type, the state of the sample, the type of nucleic acid being sequenced and the sequencing technology being used.
[0275] For instance, many techniques for DNA isolation, e.g., genomic DNA isolation, from a tissue sample are known in the art, such as organic extraction, silica adsorption, and anion exchange chromatography. Likewise, many techniques for RNA isolation, e.g., mRNA isolation, from a tissue sample are known in the art. For example, acid guanidinium thiocyanate-phenol-chloroform extraction (see, for example, Chomczynski and Sacchi, 2006, Nat Protoc, 1(2):581-85, which is hereby incorporated by reference herein), and silica bead/glass fiber adsorption (see, for example, Poeckh, T. et al, 2008, Anal Biochem, 373(2):253-62, which is hereby incorporated by reference herein). The selection of any particular DNA or RNA isolation technique for use in conjunction with the embodiments described herein is well within the skill of the person having ordinary skill in the art, who will consider the tissue type, the state of the tissue, e.g., fresh, frozen, formalin-fixed, paraffin-embedded (FFPE), and the type of nucleic acid analysis that is to be performed.
[0276] In some embodiments where the biological sample is a liquid biopsy sample, e.g., a blood or blood plasma sample, cfDNA is isolated from blood samples using commercially available reagents, including proteinase K, to generate a liquid solution of cfDNA.
[0277] In some embodiments, isolated DNA molecules are mechanically sheared to an average length using an ultrasonicator (for example, a Covaris ultrasonicator). In some embodiments, isolated nucleic acid molecules are analyzed to determine their fragment size, e.g., through gel electrophoresis techniques and/or the use of a device such as a LabChip GX Touch. The skilled artisan will know of an appropriate range of fragment sizes, based on the sequencing technique being employed, as different sequencing techniques have differing fragment size requirements for robust sequencing. In some embodiments, quality control testing is performed on the extracted nucleic acids (e.g., DNA and/or RNA), e.g., to assess the nucleic acid concentration and/or fragment size. For example, sizing of DNA fragments provides valuable information used for downstream processing, such as determining whether DNA fragments require additional shearing prior to sequencing.
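The quality control decision described above, i.e., using measured fragment sizes and concentration to decide whether additional shearing is needed, could be sketched as follows. The function name, the use of the median as the summary statistic, and both default thresholds (`target_max_bp` and `min_concentration_ng_per_ul`) are hypothetical placeholders; appropriate values depend on the sequencing technique and are not specified in the text.

```python
from statistics import median


def assess_extraction(fragment_sizes_bp, concentration_ng_per_ul,
                      target_max_bp=300, min_concentration_ng_per_ul=0.5):
    """Summarize fragment-size and concentration QC for one extraction.

    Thresholds are illustrative assumptions, not values from the disclosure.
    """
    if not fragment_sizes_bp:
        raise ValueError("no fragment sizes supplied")
    med = median(fragment_sizes_bp)
    return {
        "median_bp": med,
        # Fragments longer than the target range need further shearing.
        "needs_shearing": med > target_max_bp,
        "sufficient_input": concentration_ng_per_ul >= min_concentration_ng_per_ul,
    }


# Hypothetical measurements: short cfDNA-like fragments versus long gDNA.
cfdna_qc = assess_extraction([150, 170, 160], concentration_ng_per_ul=2.0)
gdna_qc = assess_extraction([1000, 5000, 12000], concentration_ng_per_ul=0.1)
```

In this sketch the short-fragment extraction is flagged as ready for library preparation, while the long-fragment, low-concentration extraction is flagged both for additional shearing and for insufficient input.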
[0278] Wet lab processing 204 then includes preparing a nucleic acid library from the isolated nucleic acids (e.g., cfDNA, DNA, and/or RNA). For example, in some embodiments, DNA libraries (e.g., gDNA and/or cfDNA libraries) are prepared from isolated DNA from the one or more biological samples. In some embodiments, the DNA libraries are prepared using a commercial library preparation kit, e.g., the KAPA Hyper Prep Kit, a New England Biolabs (NEB) kit, or a similar kit.
[0279] Conversion of solid tumor test to liquid biopsy test. In one embodiment, the solid tissue sample is insufficient for NGS testing (for example, the sample is too small or too degraded, the amount or quality of nucleic acids extracted from the sample does not result in quality NGS results that would result in reliable determination of variants and/or other genetic characteristics of the sample), and the physician or patient may decide to convert the solid tissue test that was ordered to a liquid biopsy test to be performed on a liquid biopsy sample collected from the same patient. The resulting report and/or display of the results on a portal may include an “xF Conversion Badge” to distinguish any order that has been converted from solid tissue test to a liquid biopsy test (compared to, for example, a liquid biopsy test that was not initially ordered as a solid tissue test). This will allow a user to identify which orders have been converted by this process, and distinguish between orders that were intentionally placed for the liquid biopsy panel.
[0280] In some embodiments, during library preparation, adapters (e.g., UDI adapters, such as Roche SeqCap dual end adapters, or UMI adapters such as full length or stubby Y adapters) are ligated onto the nucleic acid molecules. In some embodiments, the adapters include unique molecular identifiers (UMIs), which are short nucleic acid sequences (e.g., 3-10 base pairs) that are added to ends of DNA fragments during adapter ligation. In some embodiments, UMIs are degenerate base pairs that serve as a unique tag that can be used to identify sequence reads originating from a specific DNA fragment. In some embodiments, e.g., when multiplex sequencing will be used to sequence DNA from a plurality of samples (e.g., from the same or different subjects) in a single sequencing reaction, a patient-specific index is also added to the nucleic acid molecules. In some embodiments, the patient-specific index is a short nucleic acid sequence (e.g., 3-20 nucleotides) that is added to ends of DNA fragments during library construction and that serves as a unique tag that can be used to identify sequence reads originating from a specific patient sample. Examples of identifier sequences are described, for example, in Kivioja et al., Nat. Methods 9(1):72-74 (2011) and Islam et al., Nat. Methods 11(2):163-66 (2014), the contents of which are hereby incorporated by reference, in their entireties, for all purposes.
[0281] In some embodiments, an adapter includes a PCR primer landing site, designed for efficient binding of a PCR or second-strand synthesis primer used during the sequencing reaction. In some embodiments, an adapter includes an anchor binding site, to facilitate binding of the DNA molecule to anchor oligonucleotide molecules on a sequencer flow cell, serving as a seed for the sequencing process by providing a starting point for the sequencing reaction. During PCR amplification following adapter ligation, the UMIs, patient indexes, and binding sites are replicated along with the attached DNA fragment. This provides a way to identify sequence reads that came from the same original fragment in downstream analysis.
[0282] In some embodiments, DNA libraries are amplified and purified using commercial reagents (e.g., Axygen MAG PCR clean up beads). In some such embodiments, the concentration and/or quantity of the DNA molecules are then quantified using a fluorescent dye and a fluorescence microplate reader, standard spectrofluorometer, or filter fluorometer. In some embodiments, library amplification is performed on a device (e.g., an Illumina C-Bot2) and the resulting flow cell containing amplified target-captured DNA libraries is sequenced on a next generation sequencer (e.g., an Illumina HiSeq 4000 or an Illumina NovaSeq 6000) to a unique on-target depth selected by the user. In some embodiments, DNA library preparation is performed with an automated system, using a liquid handling robot (e.g., a SciClone NGSx).
[0283] In some embodiments, where feature data 125 includes methylation states 132 for one or more genomic locations, nucleic acids isolated from the biological sample (e.g., cfDNA) are treated to convert unmethylated cytosines to uracils, e.g., prior to generating the sequencing library. Accordingly, when the nucleic acids are sequenced, all cytosines called in the sequencing reaction were necessarily methylated, since the unmethylated cytosines were converted to uracils and accordingly would have been called as thymidines, rather than cytosines, in the sequencing reaction. Commercial kits are available for bisulfite-mediated conversion of unmethylated cytosines to uracils, for instance, the EZ DNA Methylation™-Gold, EZ DNA Methylation™-Direct, and EZ DNA Methylation™-Lightning kits (available from Zymo Research Corp (Irvine, CA)). Commercial kits are also available for enzymatic conversion of unmethylated cytosines to uracils, for example, the APOBEC-Seq kit (available from NEBiolabs, Ipswich, MA).
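By way of non-limiting illustration, the inference described above (a cytosine that survives conversion is read as methylated, while a cytosine read as thymine was unmethylated) can be sketched in Python as follows. The function names, the toy reference sequence, and the simplification that the read is already aligned to the same strand as the reference without mismatches are assumptions made for this example only; they are not part of any kit or of the disclosed pipeline.

```python
from typing import Dict, Set


def bisulfite_convert(sequence: str, methylated: Set[int]) -> str:
    """Simulate conversion plus sequencing: an unmethylated C is read as T."""
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(sequence)
    )


def call_methylation(reference: str, read: str) -> Dict[int, bool]:
    """Map each reference cytosine position to True (methylated) or False."""
    calls = {}
    for i, (ref_base, read_base) in enumerate(zip(reference, read)):
        if ref_base != "C":
            continue
        if read_base == "C":
            calls[i] = True   # cytosine protected from conversion
        elif read_base == "T":
            calls[i] = False  # cytosine converted, hence unmethylated
    return calls


reference = "ACGTCCGA"  # hypothetical reference fragment
read = bisulfite_convert(reference, methylated={1})
print(read)
print(call_methylation(reference, read))
```

A production methylation caller would additionally handle strand orientation, incomplete conversion, and sequencing errors, none of which are modeled here.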
[0284] In some embodiments, wet lab processing 204 includes pooling (308) DNA molecules from a plurality of libraries, corresponding to different samples from the same and/or different patients, to form a sequencing pool of DNA libraries. When the pool of DNA libraries is sequenced, the resulting sequence reads correspond to nucleic acids isolated from multiple samples. The sequence reads can be separated into different sequence read files, corresponding to the various samples represented in the sequencing reaction, based on the unique identifiers present in the added nucleic acid fragments. In this fashion, a single sequencing reaction can generate sequence reads from multiple samples. Advantageously, this allows for the processing of more samples per sequencing reaction.
[0285] In some embodiments, wet lab processing 204 includes enriching (310) a sequencing library, or pool of sequencing libraries, for target nucleic acids, e.g., nucleic acids encompassing loci that are informative for precision oncology and/or used as internal controls for the sequencing or bioinformatics processes. In some embodiments, enrichment is achieved by hybridizing target nucleic acids in the sequencing library to probes that hybridize to the target sequences, and then isolating the captured nucleic acids away from off-target nucleic acids that are not bound by the capture probes. In some embodiments, one or more off-target nucleic acids will remain in the final sequencing pool.
[0286] Advantageously, enriching for target sequences prior to sequencing nucleic acids significantly reduces the costs and time associated with sequencing, facilitates multiplex sequencing by allowing multiple samples to be mixed together for a single sequencing reaction, and significantly reduces the computation burden of aligning the resulting sequence reads, as a result of significantly reducing the total amount of nucleic acids analyzed from each sample.
[0287] In some embodiments, the enrichment is performed prior to pooling multiple nucleic acid sequencing libraries. However, in other embodiments, the enrichment is performed after pooling nucleic acid sequencing libraries, which has the advantage of reducing the number of enrichment assays that have to be performed.
[0288] In some embodiments, the enrichment is performed prior to generating a nucleic acid sequencing library. This has the advantage that fewer reagents are needed to perform both the enrichment (because there are fewer target sequences at this point, prior to library amplification) and the library production (because there are fewer nucleic acid molecules to tag and amplify after the enrichment). However, this raises the possibility of pull-down bias and/or that small variations in the enrichment protocol will result in less consistent results.
[0289] In some embodiments, nucleic acid libraries are pooled (two or more DNA libraries may be mixed to create a pool) and treated with reagents to reduce off-target capture, for example Human COT-1 and/or IDT xGen Universal Blockers. Pools may be dried in a vacufuge and resuspended. DNA libraries or pools may be hybridized to a probe set (for example, a probe set specific to a panel that includes loci from at least 100, 600, 1,000, 10,000, etc. of the 19,000 known human genes) and amplified with commercially available reagents (for example, the KAPA HiFi HotStart Ready Mix). For example, in some embodiments, a pool is incubated in an incubator, PCR machine, water bath, or other temperature-modulating device to allow probes to hybridize. Pools may then be mixed with Streptavidin-coated beads or another means for capturing hybridized DNA-probe molecules, such as DNA molecules representing exons of the human genome and/or genes selected for a genetic panel.
[0290] Pools may be amplified and purified more than once using commercially available reagents, for example, the KAPA HiFi Library Amplification kit and Axygen MAG PCR clean up beads, respectively. The pools or DNA libraries may be analyzed to determine the concentration or quantity of DNA molecules, for example by using a fluorescent dye (for example, PicoGreen pool quantification) and a fluorescence microplate reader, standard spectrofluorometer, or filter fluorometer. In one example, the DNA library preparation and/or capture is performed with an automated system, using a liquid handling robot (for example, a SciClone NGSx).
[0291] In some embodiments, e.g., where a whole genome sequencing method will be used, nucleic acid sequencing libraries are not target-enriched prior to sequencing, in order to obtain sequencing data on substantially all of the competent nucleic acids in the sequencing library. Similarly, in some embodiments, e.g., where a whole genome sequencing method will be used, nucleic acid sequencing libraries are not mixed, because of bandwidth limitations related to obtaining significant sequencing depth across an entire genome. However, in other embodiments, e.g., where a low pass whole genome sequencing (LPWGS) methodology will be used, nucleic acid sequencing libraries can still be pooled, because very low average sequencing coverage is achieved across a respective genome, e.g., between about 0.5x and about 5x.
[0292] In some embodiments, a plurality of nucleic acid probes (e.g., a probe set) is used to enrich one or more target sequences in a nucleic acid sample (e.g., an isolated nucleic acid sample or a nucleic acid sequencing library), e.g., where one or more target sequences is informative for precision oncology. For instance, in some embodiments, one or more of the target sequences encompasses a locus that is associated with an actionable allele. That is, variations of the target sequence are associated with targeted therapeutic approaches. In some embodiments, one or more of the target sequences and/or a property of one or more of the target sequences is used in a classifier trained to distinguish two or more cancer states.
[0293] In some embodiments, the probe set includes probes targeting one or more gene loci, e.g., exon or intron loci. In some embodiments, the probe set includes probes targeting one or more loci not encoding a protein, e.g., regulatory loci, miRNA loci, and other non-coding loci, e.g., that have been found to be associated with cancer. In some embodiments, the plurality of loci includes at least 25, 50, 100, 150, 200, 250, 300, 350, 400, 500, 750, 1000, 2500, 5000, or more human genomic loci.
[0294] In some embodiments, the probe set includes probes targeting one or more of the genes listed in Table 1. In some embodiments, the probe set includes probes targeting at least 5 of the genes listed in Table 1. In some embodiments, the probe set includes probes targeting at least 10 of the genes listed in Table 1. In some embodiments, the probe set includes probes targeting at least 25 of the genes listed in Table 1. In some embodiments, the probe set includes probes targeting at least 50 of the genes listed in Table 1. In some embodiments, the probe set includes probes targeting at least 75 of the genes listed in Table 1. In some embodiments, the probe set includes probes targeting at least 100 of the genes listed in Table 1. In some embodiments, the probe set includes probes targeting all of the genes listed in Table 1.
[0295] Table 1. An example panel of 105 genes that are informative for precision oncology.
[0296] In some embodiments, the probe set includes probes targeting one or more of the genes listed in List 1, provided below. In some embodiments, the probe set includes probes targeting at least 5 of the genes listed in List 1. In some embodiments, the probe set includes probes targeting at least 10 of the genes listed in List 1. In some embodiments, the probe set includes probes targeting at least 25 of the genes listed in List 1. In some embodiments, the probe set includes probes targeting at least 50 of the genes listed in List 1. In some embodiments, the probe set includes probes targeting at least 70 of the genes listed in List 1. In some embodiments, the probe set includes probes targeting all of the genes listed in List 1.
[0297] In some embodiments, the probe set includes probes targeting one or more of the genes listed in List 2, provided below. In some embodiments, the probe set includes probes targeting at least 5 of the genes listed in List 2. In some embodiments, the probe set includes probes targeting at least 10 of the genes listed in List 2. In some embodiments, the probe set includes probes targeting at least 25 of the genes listed in List 2. In some embodiments, the probe set includes probes targeting at least 50 of the genes listed in List 2. In some embodiments, the probe set includes probes targeting at least 75 of the genes listed in List 2. In some embodiments, the probe set includes probes targeting at least 100 of the genes listed in List 2. In some embodiments, the probe set includes probes targeting all of the genes listed in List 2.
[0298] In some embodiments, panels of genes including one or more genes from the following lists are used for analyzing specimens, sequencing, and/or identification. In some embodiments, panels of genes for analyzing specimens, sequencing, and/or identification include one or more genes from List 1 or List 2. In some embodiments, panels of genes for analyzing specimens, sequencing, and/or identification include one or more genes from:
[0299] List 1: AKT1 (14q32.33), ALK (2p23.2-23.1), APC (5q22.2), AR (Xq12), ARAF (Xp11.3), ARID1A (1p36.11), ATM (11q22.3), BRAF (7q34), BRCA1 (17q21.31), BRCA2 (13q13.1), CCND1 (11q13.3), CCND2 (12p13.32), CCNE1 (19q12), CDH1 (16q22.1), CDK4 (12q14.1), CDK6 (7q21.2), CDKN2A (9p21.3), CTNNB1 (3p22.1), DDR2 (1q23.3), EGFR (7p11.2), ERBB2 (17q12), ESR1 (6q25.1-25.2), EZH2 (7q36.1), FBXW7 (4q31.3), FGFR1 (8p11.23), FGFR2 (10q26.13), FGFR3 (4p16.3), GATA3 (10p14), GNA11 (19p13.3), GNAQ (9q21.2), GNAS (20q13.32), HNF1A (12q24.31), HRAS (11p15.5), IDH1 (2q34), IDH2 (15q26.1), JAK2 (9p24.1), JAK3 (19p13.11), KIT (4q12), KRAS (12p12.1), MAP2K1 (15q22.31), MAP2K2 (19p13.3), MAPK1 (22q11.22), MAPK3 (16p11.2), MET (7q31.2), MLH1 (3p22.2), MPL (1p34.2), MTOR (1p36.22), MYC (8q24.21), NF1 (17q11.2), NFE2L2 (2q31.2), NOTCH1 (9q34.3), NPM1 (5q35.1), NRAS (1p13.2), NTRK1 (1q23.1), NTRK3 (15q25.3), PDGFRA (4q12), PIK3CA (3q26.32), PTEN (10q23.31), PTPN11 (12q24.13), RAF1 (3p25.2), RB1 (13q14.2), RET (10q11.21), RHEB (7q36.1), RHOA (3p21.31), RIT1 (1q22), ROS1 (6q22.1), SMAD4 (18q21.2), SMO (7q32.1), STK11 (19p13.3), TERT (5p15.33), TP53 (17p13.1), TSC1 (9q34.13), and VHL (3p25.3).
[0300] List 2: ABL1, ACVR1B, AKT1, AKT2, AKT3, ALK, ALOX12B, AMER1 (FAM123B), APC, AR, ARAF, ARFRP1, ARID1A, ASXL1, ATM, ATR, ATRX, AURKA, AURKB, AXIN1, AXL, BAP1, BARD1, BCL2, BCL2L1, BCL2L2, BCL6, BCOR, BCORL1, BRAF, BRCA1, BRCA2, BRD4, BRIP1, BTG1, BTG2, BTK, C11orf30 (EMSY), C17orf39 (GID4), CALR, CARD11, CASP8, CBFB, CBL, CCND1, CCND2, CCND3, CCNE1, CD22, CD274 (PD-L1), CD70, CD79A, CD79B, CDC73, CDH1, CDK12, CDK4, CDK6, CDK8, CDKN1A, CDKN1B, CDKN2A, CDKN2B, CDKN2C, CEBPA, CHEK1, CHEK2, CIC, CREBBP, CRKL, CSF1R, CSF3R, CTCF, CTNNA1, CTNNB1, CUL3, CUL4A, CXCR4, CYP17A1, DAXX, DDR1, DDR2, DIS3, DNMT3A, DOT1L, EED, EGFR, EP300, EPHA3, EPHB1, EPHB4, ERBB2, ERBB3, ERBB4, ERCC4, ERG, ERRFI1, ESR1, EZH2, FAM46C, FANCA, FANCC, FANCG, FANCL, FAS, FBXW7, FGF10, FGF12, FGF14, FGF19, FGF23, FGF3, FGF4, FGF6, FGFR1, FGFR2, FGFR3, FGFR4, FH, FLCN, FLT1, FLT3, FOXL2, FUBP1, GABRA6, GATA3, GATA4, GATA6, GNA11, GNA13, GNAQ, GNAS, GRM3, GSK3B, H3F3A, HDAC1, HGF, HNF1A, HRAS, HSD3B1, ID3, IDH1, IDH2, IGF1R, IKBKE, IKZF1, INPP4B, IRF2, IRF4, IRS2, JAK1, JAK2, JAK3, JUN, KDM5A, KDM5C, KDM6A, KDR, KEAP1, KEL, KIT, KLHL6, KMT2A, KMT2D (MLL2), KRAS, LTK, LYN, MAF, MAP2K1 (MEK1), MAP2K2 (MEK2), MAP2K4, MAP3K1, MAP3K13, MAPK1, MCL1, MDM2, MDM4, MED12, MEF2B, MEN1, MERTK, MET, MITF, MKNK1, MLH1, MPL, MRE11A, MSH2, MSH3, MSH6, MST1R, MTAP, MTOR, MUTYH, MYC, MYCL (MYCL1), MYCN, MYD88, NBN, NF1, NF2, NFE2L2, NFKBIA, NKX2-1, NOTCH1, NOTCH2, NOTCH3, NPM1, NRAS, NSD3 (WHSC1L1), NT5C2, NTRK1, NTRK2, NTRK3, P2RY8, PALB2, PARK2, PARP1, PARP2, PARP3, PAX5, PBRM1, PDCD1 (PD-1), PDCD1LG2 (PD-L2), PDGFRA, PDGFRB, PDK1, PIK3C2B, PIK3C2G, PIK3CA, PIK3CB, PIK3R1, PIM1, PMS2, POLD1, POLE, PPARG, PPP2R1A, PPP2R2A, PRDM1, PRKAR1A, PRKCI, PTCH1, PTEN, PTPN11, PTPRO, QKI, RAC1, RAD21, RAD51, RAD51B, RAD51C, RAD51D, RAD52, RAD54L, RAF1, RARA, RB1, RBM10, REL, RET, RICTOR, RNF43, ROS1, RPTOR, SDHA, SDHB, SDHC, SDHD, SETD2, SF3B1, SGK1, SMAD2, SMAD4, SMARCA4, SMARCB1, SMO, SNCAIP, SOCS1, SOX2, SOX9, SPEN, SPOP, SRC, STAG2, STAT3, STK11, SUFU, SYK, TBX3, TEK, TERC, TERT, TET2, ncRNA, Promoter, TGFBR2, TIPARP, TNFAIP3, TNFRSF14, TP53, TSC1, TSC2, TYRO3, U2AF1, VEGFA, VHL, WHSC1, WT1, XPO1, XRCC2, ZNF217, and ZNF703.
[0301] Generally, probes for enrichment of nucleic acids (e.g., cfDNA obtained from a liquid biopsy sample) include DNA, RNA, or a modified nucleic acid structure with a base sequence that is complementary to a locus of interest. For instance, a probe designed to hybridize to a locus in a cfDNA molecule can contain a sequence that is complementary to either strand, because the cfDNA molecules are double stranded. In some embodiments, each probe in the plurality of probes includes a nucleic acid sequence that is identical or complementary to at least 10, at least 11, at least 12, at least 13, at least 14, or at least 15 consecutive bases of a locus of interest. In some embodiments, each probe in the plurality of probes includes a nucleic acid sequence that is identical or complementary to at least 20, 25, 30, 40, 50, 75, 100, 150, 200, or more consecutive bases of a locus of interest.
[0302] Targeted panels provide several benefits for nucleic acid sequencing. For example, in some embodiments, algorithms for discriminating between, e.g., a first and second cancer condition can be trained on smaller, more informative data sets (e.g., fewer genes), which leads to more computationally efficient training of classifiers that discriminate between the first and second cancer states. Such improvements in computational efficiency, owing to the reduced size of the discriminating gene set, can advantageously either be used to speed up classifier training or be used to improve the performance of such classifiers (e.g., through more extensive training of the classifier).
[0303] In some embodiments, the gene panel is a whole-exome panel that analyzes the exomes of a biological sample. In some embodiments, the gene panel is a whole-genome panel that analyzes the genome of a specimen. In some embodiments, the gene panel is optimized for use with liquid biopsy samples (e.g., to provide clinical decision support for solid tumors). See, for example, Table 1 above.
[0304] In some embodiments, the probes include additional nucleic acid sequences that do not share any homology to the locus of interest. For example, in some embodiments, the probes also include nucleic acid sequences containing an identifier sequence, e.g., a unique molecular identifier (UMI), e.g., that is unique to a particular sample or subject. Examples of identifier sequences are described, for example, in Kivioja et al., 2011, Nat. Methods 9(1), pp. 72-74 and Islam et al., 2014, Nat. Methods 11(2), pp. 163-66, which are incorporated by reference herein. Similarly, in some embodiments, the probes also include primer nucleic acid sequences useful for amplifying the nucleic acid molecule of interest, e.g., using PCR. In some embodiments, the probes also include a capture sequence designed to hybridize to an anti-capture sequence for recovering the nucleic acid molecule of interest from the sample.
[0305] Likewise, in some embodiments, the probes each include a non-nucleic acid affinity moiety covalently attached to a nucleic acid molecule that is complementary to the locus of interest, for recovering the nucleic acid molecule of interest. Non-limiting examples of non-nucleic acid affinity moieties include biotin, digoxigenin, and dinitrophenol. In some embodiments, the probe is attached to a solid-state surface or particle, e.g., a dipstick or magnetic bead, for recovering the nucleic acid of interest. In some embodiments, the methods described herein include amplifying the nucleic acids that bound to the probe set prior to further analysis, e.g., sequencing. Methods for amplifying nucleic acids, e.g., by PCR, are well known in the art.
[0306] Sequence reads are then generated (312) from the sequencing library or pool of sequencing libraries. Sequencing data may be acquired by any methodology known in the art. For example, in some embodiments, a next generation sequencing (NGS) technique is used, such as sequencing-by-synthesis technology (Illumina), pyrosequencing (454 Life Sciences), ion semiconductor technology (Ion Torrent sequencing), single-molecule real-time sequencing (Pacific Biosciences), sequencing by ligation (SOLiD sequencing), nanopore sequencing (Oxford Nanopore Technologies), or paired-end sequencing. In some embodiments, massively parallel sequencing is performed using sequencing-by-synthesis with reversible dye terminators. In some embodiments, sequencing is performed using next generation sequencing technologies, such as short-read technologies. In other embodiments, long-read sequencing or another sequencing method known in the art is used.
[0307] Next-generation sequencing produces millions of short reads (e.g., sequence reads) for each biological sample. Accordingly, in some embodiments, the plurality of sequence reads obtained by next-generation sequencing of cfDNA molecules are DNA sequence reads. In some embodiments, the sequence reads have an average length of at least fifty nucleotides. In other embodiments, the sequence reads have an average length of at least 50, 60, 70, 80, 90, 100, 150, 200, 250, 300, or more nucleotides.
[0308] In some embodiments, sequencing is performed after enriching for nucleic acids (e.g., cfDNA, gDNA, and/or RNA) encompassing a plurality of predetermined target sequences, e.g., human genes and/or non-coding sequences associated with cancer. Advantageously, sequencing a nucleic acid sample that has been enriched for target nucleic acids, rather than all nucleic acids isolated from a biological sample, significantly reduces the average time and cost of the sequencing reaction. Accordingly, in some embodiments, the methods described herein include obtaining a plurality of sequence reads of nucleic acids that have been hybridized to a probe set for hybrid-capture enrichment (e.g., of one or more genes listed in Table 1).
[0309] In some embodiments, panel-targeting sequencing is performed to an average on-target depth of at least 500x, at least 750x, at least 1000x, at least 2500x, at least 5000x, at least 10,000x, or greater depth. In some embodiments, samples are further assessed for uniformity above a sequencing depth threshold (e.g., 95% of all targeted base pairs at 300x sequencing depth). In some embodiments, the sequencing depth threshold is a minimum depth selected by a user or practitioner.
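By way of non-limiting illustration, a uniformity assessment of the kind described above (e.g., 95% of targeted base pairs at 300x) can be sketched in Python as follows. The function names and the representation of coverage as one integer depth per targeted base are assumptions made for this example; the default thresholds simply restate the example values given above.

```python
from typing import Sequence


def fraction_at_depth(depths: Sequence[int], threshold: int) -> float:
    """Fraction of targeted bases covered at or above the depth threshold."""
    if not depths:
        raise ValueError("at least one targeted base is required")
    return sum(depth >= threshold for depth in depths) / len(depths)


def passes_uniformity(
    depths: Sequence[int],
    threshold: int = 300,        # example depth threshold from the text
    min_fraction: float = 0.95,  # example fraction from the text
) -> bool:
    """True when enough targeted bases reach the depth threshold."""
    return fraction_at_depth(depths, threshold) >= min_fraction


# Hypothetical per-base depths for a 100-base target region.
depths = [500] * 95 + [100] * 5
print(fraction_at_depth(depths, 300), passes_uniformity(depths))
```

In practice the per-base depths would be derived from aligned reads over the panel's target intervals rather than supplied as a literal list.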
[0310] In some embodiments, the sequence reads are obtained by a whole genome or whole exome sequencing methodology. In some such embodiments, whole exome capture is performed with an automated system, using a liquid handling robot (for example, a SciClone NGSx). Whole genome sequencing, and to some extent whole exome sequencing, is typically performed at lower sequencing depth than smaller target-panel sequencing reactions, because many more loci are being sequenced. For example, in some embodiments, whole genome or whole exome sequencing is performed to an average sequencing depth of at least 3x, at least 5x, at least 10x, at least 15x, at least 20x, or greater. In some embodiments, low-pass whole genome sequencing (LPWGS) techniques are used for whole genome or whole exome sequencing. LPWGS is typically performed to an average sequencing depth of about 0.25x to about 5x, more typically to an average sequencing depth of about 0.5x to about 3x.
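By way of non-limiting illustration, the relationship between read count and average sequencing depth can be sketched in Python as follows, using the standard expectation that average depth equals the number of reads multiplied by the read length and divided by the size of the sequenced region. The function names, the 150-nucleotide read length, and the approximate genome size of 3.1 billion bases are assumptions made for this example; duplicate and unaligned reads are ignored.

```python
import math


def mean_depth(num_reads: int, read_length: int, target_size: int) -> float:
    """Expected average depth: total sequenced bases divided by target size."""
    return num_reads * read_length / target_size


def reads_needed(depth: float, read_length: int, target_size: int) -> int:
    """Smallest whole number of reads expected to reach the requested depth."""
    return math.ceil(depth * target_size / read_length)


GENOME_SIZE = 3_100_000_000  # assumed approximate human genome size, in bases

# Reads of 150 nucleotides needed for 0.5x low-pass whole genome coverage.
print(reads_needed(0.5, 150, GENOME_SIZE))
```

Under these assumptions, roughly ten million 150-nucleotide reads suffice for 0.5x coverage of a whole genome, which is why low-pass libraries can be pooled while deep whole genome libraries generally are not.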
[0311] Because of the differences in the sequencing methodologies, data obtained from targeted-panel sequencing is better suited for certain analyses than data obtained from whole genome/whole exome sequencing, and vice versa. For instance, because of the higher sequencing depth achieved by targeted-panel sequencing, the resulting sequence data is better suited for the identification of variant alleles present at low allelic fractions in the sample, e.g., less than 20%. By contrast, data generated from whole genome/whole exome sequencing is better suited for the estimation of genome-wide metrics, such as tumor mutational burden, because the entire genome is better represented in the sequencing data. Accordingly, in some embodiments, a nucleic acid sample, e.g., a cfDNA, gDNA, or mRNA sample, is evaluated using both targeted-panel sequencing and whole genome/whole exome sequencing (e.g., LPWGS).
[0312] In some embodiments, the raw sequence reads resulting from the sequencing reaction are output from the sequencer in a native file format, e.g., a BCL file. In some embodiments, the native file is passed directly to a bioinformatics pipeline (e.g., variant analysis 206), components of which are described in detail below. In other embodiments, pre-processing is performed prior to passing the sequences to the bioinformatics platform. For instance, in some embodiments, the format of the sequence read file is converted from the native file format (e.g., BCL) to a file format compatible with one or more algorithms used in the bioinformatics pipeline (e.g., FASTQ or FASTA). In some embodiments, the raw sequence reads are filtered to remove sequences that do not meet one or more quality thresholds. In some embodiments, raw sequence reads generated from the same unique nucleic acid molecule in the sequencing reaction are collapsed into a single sequence read representing the molecule, e.g., using UMIs as described above. In some embodiments, one or more of these pre-processing activities is performed within the bioinformatics pipeline itself.
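By way of non-limiting illustration, collapsing reads that originate from the same nucleic acid molecule can be sketched in Python as follows. In this sketch, reads sharing a UMI and an alignment start position are treated as one family, and a majority vote at each position yields the consensus read. The function names, the use of the alignment start as part of the family key, and the assumption that reads within a family are pre-aligned and of equal length are simplifications made for this example and are not asserted to be the consensus method of the disclosed pipeline.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def consensus(sequences: List[str]) -> str:
    """Majority base at each position; assumes equal-length, aligned reads."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*sequences)
    )


def collapse_by_umi(
    reads: Iterable[Tuple[str, int, str]]
) -> Dict[Tuple[str, int], str]:
    """Collapse (umi, alignment_start, sequence) reads into one read per family."""
    families = defaultdict(list)
    for umi, start, sequence in reads:
        families[(umi, start)].append(sequence)
    return {key: consensus(seqs) for key, seqs in families.items()}


# Hypothetical reads: three copies of one molecule (one with an error)
# and a single read from a second molecule at the same position.
reads = [
    ("AAC", 100, "ACGT"),
    ("AAC", 100, "ACGT"),
    ("AAC", 100, "ACTT"),
    ("GGT", 100, "ACGA"),
]
print(collapse_by_umi(reads))
```

Because the erroneous base appears in only one of three family members, it is outvoted in the consensus, which is the error-suppression effect that UMI-based collapsing is intended to provide.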
[0313] In one example, a sequencer may generate a BCL file. A BCL file may include raw image data of a plurality of patient specimens which are sequenced. BCL image data is an image of the flow cell across each cycle during sequencing. A cycle may be implemented by illuminating a patient specimen with a specific wavelength of electromagnetic radiation, generating a plurality of images which may be processed into base calls via BCL to FASTQ processing algorithms which identify which base pairs are present at each cycle. The resulting FASTQ file includes the entirety of reads for each patient specimen paired with a quality metric, e.g., in a range from 0 to 64 where a 64 is the best quality and a 0 is the worst quality. In embodiments where both a liquid biopsy sample and a normal tissue sample are sequenced, sequence reads in the corresponding FASTQ files may be matched, such that a liquid biopsy-normal analysis may be performed.
[0314] FASTQ format is a text-based format for storing both a biological sequence, such as a nucleotide sequence, and its corresponding quality scores. These FASTQ files are analyzed to determine what genetic variants or copy number changes are present in the sample. Each FASTQ file contains reads that may be paired-end or single reads, and may be short-reads or long-reads, where each read represents one detected sequence of nucleotides in a nucleic acid molecule that was isolated from the patient sample or a copy of the nucleic acid molecule, detected by the sequencer. Each read in the FASTQ file is also associated with a quality rating. The quality rating may reflect the likelihood that an error occurred during the sequencing procedure that affected the associated read. In some embodiments, the results of paired-end sequencing of each isolated nucleic acid sample are contained in a split pair of FASTQ files, for efficiency. Thus, in some embodiments, forward (Read 1) and reverse (Read 2) sequences of each isolated nucleic acid sample are stored separately but in the same order and under the same identifier.
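By way of non-limiting illustration, reading FASTQ records and checking that a split pair of files lists the same identifiers in the same order can be sketched in Python as follows. The function and class names are hypothetical, the example headers are invented, and the quality decoding assumes the common Phred+33 ASCII encoding, which is an assumption of this example rather than a statement about the quality scale used in any particular embodiment.

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    qualities: List[int]


def parse_fastq(text: str, offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ text (header, bases, '+', qualities)."""
    lines = [line for line in text.splitlines() if line]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ text must contain four lines per record")
    for i in range(0, len(lines), 4):
        header, sequence, plus, quality = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        # The identifier is the header text before the first whitespace.
        identifier = header[1:].split()[0]
        scores = [ord(char) - offset for char in quality]
        yield FastqRecord(identifier, sequence, scores)


def pairs_in_sync(read1_text: str, read2_text: str) -> bool:
    """True when Read 1 and Read 2 list the same identifiers in the same order."""
    ids1 = [record.identifier for record in parse_fastq(read1_text)]
    ids2 = [record.identifier for record in parse_fastq(read2_text)]
    return ids1 == ids2
```

A pipeline might run a check such as `pairs_in_sync` before alignment, since paired-end aligners generally rely on the two files being in matching order.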
[0315] In various embodiments, the bioinformatics pipeline may filter FASTQ data from the corresponding sequence data file for each respective biological sample. Such filtering may include correcting or masking sequencer errors and removing (trimming) low quality sequences or bases, adapter sequences, contaminations, chimeric reads, overrepresented sequences, biases caused by library preparation, amplification, or capture, and other errors.
[0316] While workflow 200 illustrates obtaining a biological sample, extracting nucleic acids from the biological sample, and sequencing the isolated nucleic acids, in some embodiments, sequencing data used in the improved systems and methods described herein (e.g., which include improved methods for validating copy number variations, improved methods for validating a somatic sequence variant in a test subject having a cancer condition, and/or improved methods for determining accurate circulating tumor fraction estimates) is obtained by receiving previously generated sequence reads, in electronic form.
[0317] Referring again to Figure 2A, nucleic acid sequencing data 122 generated from the one or more patient samples is then evaluated (e.g., via variant analysis 206) in a bioinformatics pipeline, e.g., using bioinformatics module 140 of system 100, to identify genomic alterations and other metrics in the cancer genome of the patient. An example overview for a bioinformatics pipeline is described below with respect to Figure 4 (e.g., Figure 4A-E, 4F1-3, and/or 4G1-3). Advantageously, in some embodiments, the present disclosure improves bioinformatics pipelines, like pipeline 206, by improving methods and systems for the validation of copy number variations, the validation of somatic sequence variants, and/or the determination of circulating tumor fraction estimates.
[0318] Figure 4A illustrates an example bioinformatics pipeline 206 (e.g., as used for feature extraction in the workflows illustrated in Figures 2A and 3) for providing clinical support for precision oncology. As shown in Figure 4A, sequencing data 122 obtained from the wet lab processing 204 (e.g., sequence reads 314) is input into the pipeline.
[0319] In various embodiments, the bioinformatics pipeline includes a circulating tumor DNA (ctDNA) pipeline for analyzing liquid biopsy samples. The pipeline may detect SNVs, INDELs, copy number amplifications/deletions and genomic rearrangements (for example, fusions). The pipeline may employ unique molecular index (UMI)-based consensus base calling as a method of error suppression as well as a Bayesian tri-nucleotide context-based position level error suppression. In various embodiments, it is able to detect variants having a 0.1%, 0.15%, 0.2%, 0.25%, 0.3%, 0.4%, or 0.5% variant allele fraction.
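By way of non-limiting illustration, the arithmetic relating variant allele fraction to sequencing depth can be sketched in Python as follows. The function names and the requirement of a minimum number of variant-supporting reads are assumptions made for this example; the sketch computes only an expected read count and does not model the UMI-based or Bayesian error suppression described above.

```python
import math


def variant_allele_fraction(alt_reads: int, total_reads: int) -> float:
    """Fraction of reads at a position that support the variant allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return alt_reads / total_reads


def min_depth_for_support(vaf: float, min_alt_reads: int) -> int:
    """Smallest depth at which the expected variant read count reaches min_alt_reads."""
    if not 0 < vaf <= 1:
        raise ValueError("vaf must be in the interval (0, 1]")
    return math.ceil(min_alt_reads / vaf)


# Three variant-supporting reads out of 2,000 unique reads at a position.
print(variant_allele_fraction(3, 2000))

# Depth at which a 0.1% variant is expected to yield five supporting reads
# (the five-read requirement is a hypothetical threshold).
print(min_depth_for_support(0.001, 5))
```

This illustrates why detection at fractions such as 0.1% calls for on-target depths in the thousands.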
[0320] In some embodiments, the sequencing data is processed (e.g., using sequence data processing module 141) to prepare it for genomic feature identification 385. For instance, in some embodiments as described above, the sequencing data is present in a native file format provided by the sequencer. Accordingly, in some embodiments, the system (e.g., system 100) applies a pre-processing algorithm 142 to convert the file format (318) to one that is recognized by one or more upstream processing algorithms. For example, BCL file outputs from a sequencer can be converted to a FASTQ file format using the bcl2fastq or bcl2fastq2 conversion software (Illumina®). FASTQ format is a text-based format for storing both a biological sequence, such as nucleotide sequence, and its corresponding quality scores. These FASTQ files are analyzed to determine what genetic variants, copy number changes, etc., are present in the sample.
[0321] In some embodiments, other preprocessing functions are performed, e.g., filtering sequence reads 122 based on a desired quality, e.g., size and/or quality of the base calling. In some embodiments, quality control checks are performed to ensure the data is sufficient for variant calling. For instance, entire reads, individual nucleotides, or multiple nucleotides that are likely to have errors may be discarded based on the quality rating associated with the read in the FASTQ file, the known error rate of the sequencer, and/or a comparison between each nucleotide in the read and one or more nucleotides in other reads that have been aligned to the same location in the reference genome. Filtering may be done in part or in its entirety by various software tools, for example, a software tool such as Skewer. See, Jiang, H. et al., BMC Bioinformatics 15(182):1-12 (2014). FASTQ files may be analyzed for rapid assessment of quality control and reads, for example, by a sequencing data QC software such as AfterQC, Kraken, RNA-SeQC, FastQC, or another similar software program. For paired end reads, reads may be merged.
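By way of non-limiting illustration, quality-based trimming and filtering of a single read can be sketched in Python as follows. The function names and the default thresholds (a per-base quality of 20, a minimum length of 30 nucleotides, and a minimum mean quality of 20) are assumptions made for this example; the sketch is a simplified stand-in and is not the algorithm used by Skewer or any other named tool.

```python
from typing import List, Tuple


def trim_low_quality_tail(
    sequence: str, qualities: List[int], min_quality: int = 20
) -> Tuple[str, List[int]]:
    """Remove trailing bases whose quality score is below min_quality."""
    end = len(sequence)
    while end > 0 and qualities[end - 1] < min_quality:
        end -= 1
    return sequence[:end], qualities[:end]


def keep_read(
    sequence: str,
    qualities: List[int],
    min_length: int = 30,
    min_mean_quality: float = 20.0,
) -> bool:
    """Keep a read only if it is long enough and its mean quality is adequate."""
    if not qualities or len(sequence) < min_length:
        return False
    return sum(qualities) / len(qualities) >= min_mean_quality


# Hypothetical read whose last two bases are low quality.
trimmed = trim_low_quality_tail("ACGTAC", [30, 30, 30, 30, 10, 5])
print(trimmed)
print(keep_read(*trimmed, min_length=4))
```

Trimming before filtering lets a read with a poor 3' end be retained when the remainder still satisfies the length and quality criteria.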
[0322] In some embodiments, when both a liquid biopsy sample and a normal tissue sample from the patient are sequenced, two FASTQ output files are generated, one for the liquid biopsy sample and one for the normal tissue sample. A ‘matched’ (e.g., panel-specific) workflow is run to jointly analyze the liquid biopsy-normal matched FASTQ files. When a matched normal sample is not available from the patient, FASTQ files from the liquid biopsy sample are analyzed in the ‘tumor-only’ mode. See, for example, Figure 4B. If two or more patient samples are processed simultaneously on the same sequencer flow cell, e.g., a liquid biopsy sample and a normal tissue sample, a difference in the sequence of the adapters used for each patient sample barcodes nucleic acids extracted from both samples, to associate each read with the correct patient sample and facilitate assignment to the correct FASTQ file.
[0323] For efficiency, in some embodiments, the results of paired-end sequencing of each isolate are contained in a split pair of FASTQ files. Forward (Read 1) and reverse (Read 2) sequences of each tumor and normal isolate are stored separately but in the same order and under the same identifier. See, for example, Figure 4C. In various embodiments, the bioinformatics pipeline may filter FASTQ data from each isolate. Such filtering may include correcting or masking sequencer errors and removing (trimming) low quality sequences or bases, adapter sequences, contaminations, chimeric reads, overrepresented sequences, biases caused by library preparation, amplification, or capture, and other errors. See, for example, Figure 4D.
[0324] Similarly, in some embodiments, sequencing (312) is performed on a pool of nucleic acid sequencing libraries prepared from different biological samples, e.g., from the same or different patients. Accordingly, in some embodiments, the system demultiplexes (320) the data (e.g., using demultiplexing algorithm 144) to separate sequence reads into separate files for each sequencing library included in the sequencing pool, e.g., based on UMI or patient identifier sequences added to the nucleic acid fragments during sequencing library preparation, as described above. In some embodiments, the demultiplexing algorithm is part of the same software package as one or more pre-processing algorithms 142. For instance, the bcl2fastq or bcl2fastq2 conversion software (Illumina®) includes instructions for both converting the native file format output from the sequencer and demultiplexing sequence reads 122 output from the reaction.
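The demultiplexing step can be sketched as follows. This is a simplified illustration that assumes the sample identifier occupies the first `barcode_len` bases of each read and must match exactly; real sequencers typically report index sequences separately and tolerate mismatches. The function name, the read layout, and the `"undetermined"` bin are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Read = Tuple[str, str]  # (read identifier, nucleotide sequence)


def demultiplex(
    reads: Iterable[Read],
    barcode_to_sample: Dict[str, str],
    barcode_len: int,
) -> Dict[str, List[Read]]:
    """Assign each read to a sample by its leading barcode (exact match).

    Assumption: the barcode is inline at the start of the read; reads whose
    barcode is not in the sample sheet go to an 'undetermined' bin.
    """
    bins: Dict[str, List[Read]] = defaultdict(list)
    for read_id, sequence in reads:
        barcode = sequence[:barcode_len]
        insert = sequence[barcode_len:]  # barcode is trimmed from the read
        sample = barcode_to_sample.get(barcode, "undetermined")
        bins[sample].append((read_id, insert))
    return dict(bins)


sample_sheet = {"ACGT": "liquid_biopsy", "TGCA": "normal"}
pooled = [
    ("r1", "ACGTGGGATTACA"),
    ("r2", "TGCACCCATTACA"),
    ("r3", "NNNNGGGATTACA"),
]
by_sample = demultiplex(pooled, sample_sheet, barcode_len=4)
```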
[0325] The sequence reads are then aligned (322), e.g., using an alignment algorithm 143, to a reference sequence construct 158, e.g., a reference genome, reference exome, or other reference construct prepared for a particular targeted-panel sequencing reaction. For example, in some embodiments, individual sequence reads 123, in electronic form (e.g., in FASTQ files), are aligned against a reference sequence construct for the species of the subject (e.g., a reference human genome) by identifying a sequence in a region of the reference sequence construct that best matches the sequence of nucleotides in the sequence read. In some embodiments, the sequence reads are aligned to a reference exome or reference genome using methods known in the art to determine alignment position information. The alignment position information may indicate a beginning position and an end position of a region in the reference genome that corresponds to a beginning nucleotide base and end nucleotide base of a given sequence read. Alignment position information may also include sequence read length, which can be determined from the beginning position and end position. A region in the reference genome may be associated with a gene or a segment of a gene. Any of a variety of alignment tools can be used for this task.
[0326] For instance, local sequence alignment algorithms compare subsequences of different lengths in the query sequence (e.g., sequence read) to subsequences in the subject sequence (e.g., reference construct) to create the best alignment for each portion of the query sequence. In contrast, global sequence alignment algorithms align the entirety of the sequences, e.g., end to end. Examples of local sequence alignment algorithms include the Smith-Waterman algorithm (see, for example, Smith and Waterman, J Mol. Biol., 147(1): 195-97 (1981), which is incorporated herein by reference), Lalign (see, for example, Huang and Miller, Adv. Appl. Math, 12:337-57 (1991), which is incorporated by reference herein), and PatternHunter (see, for example, Ma B. et al., Bioinformatics, 18(3):440-45 (2002), which is incorporated by reference herein).
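For illustration, a minimal Smith-Waterman local alignment with a linear gap penalty can be sketched as follows. The scoring values (match +2, mismatch -1, gap -2) are arbitrary assumptions chosen for the example, and production aligners use substantially more elaborate scoring and heuristics.

```python
from typing import Tuple


def smith_waterman(
    query: str, subject: str, match: int = 2, mismatch: int = -1, gap: int = -2
) -> Tuple[int, str, str, int]:
    """Return (best score, aligned query, aligned subject, 0-based subject start)."""
    rows, cols = len(query) + 1, len(subject) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the scoring matrix; cells are floored at zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if query[i - 1] == subject[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + pair,  # align the two bases
                score[i - 1][j] + gap,       # gap in subject
                score[i][j - 1] + gap,       # gap in query
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero-scoring cell is reached.
    i, j = best_pos
    aligned_q, aligned_s = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        pair = match if query[i - 1] == subject[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + pair:
            aligned_q.append(query[i - 1])
            aligned_s.append(subject[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            aligned_q.append(query[i - 1])
            aligned_s.append("-")
            i -= 1
        else:
            aligned_q.append("-")
            aligned_s.append(subject[j - 1])
            j -= 1
    return best, "".join(reversed(aligned_q)), "".join(reversed(aligned_s)), j


result = smith_waterman("ACGT", "TTACGTTT")
```

Here the four-base query matches the subject exactly starting at 0-based position 2, giving a score of 8 under the assumed scoring scheme.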
[0327] In some embodiments, the read mapping process starts by building an index of either the reference genome or the reads, which is then used to retrieve the set of positions in the reference sequence where the reads are more likely to align. Once this subset of possible mapping locations has been identified, alignment is performed in these candidate regions with slower and more sensitive algorithms. See, for example, Hatem et al., 2013, “Benchmarking short sequence mapping tools,” BMC Bioinformatics 14: p. 184; and Flicek and Birney, 2009, “Sense from sequence reads: methods for alignment and assembly,” Nat Methods 6(Suppl. 11), S6-S12, each of which is hereby incorporated by reference. In some embodiments, the mapping tools methodology makes use of a hash table or a Burrows-Wheeler transform (BWT). See, for example, Li and Homer, 2010, “A survey of sequence alignment algorithms for next-generation sequencing,” Brief Bioinformatics 11, pp. 473-483, which is hereby incorporated by reference.
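The index-then-align strategy described above can be illustrated with a hash table of k-mers. The sketch below indexes the reference, then lets each k-mer of a read vote for the reference start position it implies; the top-voted positions are the candidate regions that a slower, more sensitive aligner would then examine. The function names and the choice of k are hypothetical, and this sketch does not represent how any particular named aligner is implemented.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def build_kmer_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Map every k-mer in the reference to the positions where it occurs."""
    index: Dict[str, List[int]] = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_positions(
    read: str, index: Dict[str, List[int]], k: int
) -> List[Tuple[int, int]]:
    """Return (implied read start, supporting k-mer count), best first."""
    votes: Counter = Counter()
    for offset in range(len(read) - k + 1):
        for ref_pos in index.get(read[offset:offset + k], ()):
            start = ref_pos - offset  # where the read would begin
            if start >= 0:
                votes[start] += 1
    return votes.most_common()


reference = "GATTACAGATTTCA"
index = build_kmer_index(reference, k=4)
candidates = candidate_positions("TACAGA", index, k=4)
```

In this example all three 4-mers of the read agree on reference position 3, so that position is returned as the single candidate with three supporting seeds.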
[0328] Other software programs designed to align reads include, for example, Novoalign (Novocraft, Inc.), Bowtie, Burrows Wheeler Aligner (BWA), and/or programs that use a Smith-Waterman algorithm. Candidate reference genomes include, for example, hg19, GRCh38, hg38, GRCh37, and/or other reference genomes developed by the Genome Reference Consortium. In some embodiments, the alignment generates a SAM file, which stores the locations of the start and end of each read according to coordinates in the reference genome and the coverage (number of reads) for each nucleotide in the reference genome.
[0329] For example, in some embodiments, each read of a FASTQ file is aligned to a location in the human genome having a sequence that best matches the sequence of nucleotides in the read. There are many software programs designed to align reads, for example, Novoalign (Novocraft, Inc.), Bowtie, Burrows Wheeler Aligner (BWA), programs that use a Smith-Waterman algorithm, etc. Alignment may be directed using a reference genome (for example, hg19, GRCh38, hg38, GRCh37, other reference genomes developed by the Genome Reference Consortium, etc.) by comparing the nucleotide sequences in each read with portions of the nucleotide sequence in the reference genome to determine the portion of the reference genome sequence that is most likely to correspond to the sequence in the read. In some embodiments, one or more SAM files are generated for the alignment, which store the locations of the start and end of each read according to coordinates in the reference genome and the coverage (number of reads) for each nucleotide in the reference genome. The SAM files may be converted to BAM files. In some embodiments, the BAM files are sorted, and duplicate reads are marked for deletion, resulting in de-duplicated BAM files.
[0330] In some embodiments, adapter-trimmed FASTQ files are aligned to the 19th edition of the human reference genome build (HG19) using Burrows-Wheeler Aligner (BWA; Li and Durbin, Bioinformatics, 25(14): 1754-60 (2009)). Following alignment, reads are grouped by alignment position and UMI family and collapsed into consensus sequences, for example, using fgbio tools (e.g., available on the internet at fulcrumgenomics.github.io/fgbio/). Bases with insufficient quality or significant disagreement among family members (for example, when it is uncertain whether the base is an adenine, cytosine, guanine, etc.) may be replaced by N's to represent a wildcard nucleotide type. PHRED scores are then scaled based on initial base calling estimates combined across all family members. Following single-strand consensus generation, duplex consensus sequences are generated by comparing the forward and reverse oriented PCR products with mirrored UMI sequences. In various embodiments, a consensus can be generated across read pairs. Otherwise, single-strand consensus calls will be used. Following consensus calling, filtering is performed to remove low-quality consensus fragments. The consensus fragments are then re-aligned to the human reference genome using BWA. A BAM output file is generated after the re-alignment, then sorted by alignment position, and indexed.
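The grouping and collapsing of reads into single-strand consensus sequences can be sketched as follows. This is a deliberately simplified majority vote, not the quality-aware model used by fgbio: it assumes all reads in a family have equal length, ignores base qualities, and uses a hypothetical `min_agreement` fraction of 0.7 below which a position is masked with N.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def majority_consensus(family: List[str], min_agreement: float = 0.7) -> str:
    """Collapse equal-length reads into one sequence, masking disagreement with N."""
    if not family:
        raise ValueError("empty read family")
    length = len(family[0])
    if any(len(read) != length for read in family):
        raise ValueError("this sketch requires equal-length reads in a family")
    consensus = []
    for column in zip(*family):  # iterate position by position
        base, count = Counter(column).most_common(1)[0]
        agreed = count / len(family) >= min_agreement
        consensus.append(base if agreed else "N")
    return "".join(consensus)


def collapse_by_umi(
    reads: Iterable[Tuple[int, str, str]]
) -> Dict[Tuple[int, str], str]:
    """Group (alignment position, UMI, sequence) reads and collapse each family."""
    families: Dict[Tuple[int, str], List[str]] = defaultdict(list)
    for position, umi, sequence in reads:
        families[(position, umi)].append(sequence)
    return {key: majority_consensus(seqs) for key, seqs in families.items()}


aligned_reads = [
    (100, "AAT", "ACGT"),
    (100, "AAT", "ACGT"),
    (100, "AAT", "ACTT"),  # disagrees at the third base
    (250, "GGC", "TTTT"),
]
collapsed = collapse_by_umi(aligned_reads)
```

In the first family only two of three reads agree at the third base, which falls below the assumed 0.7 agreement level, so that position is reported as N.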
[0331] In some embodiments, where both a liquid biopsy sample and a normal tissue sample are analyzed, this process produces a liquid biopsy BAM file (e.g., Liquid BAM 124-1-i-cf) and a normal BAM file (e.g., Germline BAM 124-1-i-g), as illustrated in Figure 4A. In various embodiments, BAM files may be analyzed to detect genetic variants and other genetic features, including single nucleotide variants (SNVs), copy number variants (CNVs), gene rearrangements, etc.
[0332] In some embodiments, the sequencing data is normalized, e.g., to account for pull down, amplification, and/or sequencing bias (e.g., mappability, GC bias, etc.). See, for example, Schwartz et al., PLoS ONE 6(1):e16685 (2011) and Benjamini and Speed, Nucleic Acids Research 40(10):e72 (2012), the contents of which are hereby incorporated by reference, in their entireties, for all purposes.
[0333] In some embodiments, SAM files generated after alignment are converted to BAM files 124. Thus, after preprocessing sequencing data generated for a pooled sequencing reaction, BAM files are generated for each of the sequencing libraries present in the master sequencing pools. For example, as illustrated in Figure 4A, separate BAM files are generated for each of three samples acquired from subject 1 at time i (e.g., tumor BAM 124-1-i-t corresponding to alignments of sequence reads of nucleic acids isolated from a solid tumor sample from subject 1, Liquid BAM 124-1-i-cf corresponding to alignments of sequence reads of nucleic acids isolated from a liquid biopsy sample from subject 1, and Germline BAM 124-1-i-g corresponding to alignments of sequence reads of nucleic acids isolated from a normal tissue sample from subject 1), and one or more samples acquired from one or more additional subjects at time j (e.g., Tumor BAM 124-2-j-t corresponding to alignments of sequence reads of nucleic acids isolated from a solid tumor sample from subject 2). In some embodiments, BAM files are sorted, and duplicate reads are marked for deletion, resulting in de-duplicated BAM files. For example, tools like SamBAMBA mark and filter duplicate alignments in the sorted BAM files.
[0334] Many of the embodiments described below, in conjunction with Figure 4 (e.g., Figure 4A-E, 4F1-3, and/or 4G1-3), relate to analyses performed using sequencing data from cfDNA of a cancer patient, e.g., obtained from a liquid biopsy sample of the patient. Generally, these embodiments are independent and, thus, not reliant upon any particular sequencing data generation methods, e.g., sample preparation, sequencing, and/or data pre-processing methodologies. However, in some embodiments, the methods described below include one or more features 204 of generating sequencing data, as illustrated in Figures 2A and 3.
[0335] Alignment files prepared as described above (e.g., BAM files 124) are then passed to a feature extraction module 145, where the sequences are analyzed (324) to identify genomic alterations (e.g., SNVs/MNVs, indels, genomic rearrangements, copy number variations, etc.) and/or determine various characteristics of the patient’s cancer (e.g., MSI status, TMB, tumor ploidy, HRD status, tumor fraction, tumor purity, methylation patterns, etc.). Many software packages for identifying genomic alterations are known in the art, for example, freebayes, PolyBayes, samtools, GATK, pindel, SAMtools, Breakdancer, Cortex, Crest, Delly, Gridss, Hydra, Lumpy, Manta, and Socrates. For a review of many of these variant calling packages see, for example, Cameron, D.L. et al., Nat. Commun., 10(3240):1-11 (2019), the content of which is hereby incorporated by reference, in its entirety, for all purposes. Generally, these software packages identify variants in sorted SAM or BAM files 124, relative to one or more reference sequence constructs 158. The software packages then output a file, e.g., a raw VCF (variant call format) file, listing the variants (e.g., genomic features 131) called and identifying their location relative to the reference sequence construct (e.g., where the sequence of the sample nucleic acids differs from the corresponding sequence in the reference construct). In some embodiments, system 100 digests the contents of the native output file to populate feature data 125 in test patient data store 120. In other embodiments, the native output file serves as the record of these genomic features 131 in test patient data store 120.
[0336] Generally, the systems described herein can employ any combination of available variant calling software packages and internally developed variant identification algorithms. In some embodiments, the output of a particular algorithm of a variant calling software is further evaluated, e.g., to improve variant identification. Accordingly, in some embodiments, system 100 employs an available variant calling software package to perform some or all of the functionality of one or more of the algorithms shown in feature extraction module 145.
[0337] In some embodiments, as illustrated in Figure 1A, separate algorithms (or the same algorithm implemented using different parameters) are applied to identify variants unique to the cancer genome of the patient and variants existing in the germline of the subject. In other embodiments, variants are identified indiscriminately and later classified as either germline or somatic, e.g., based on sequencing data, population data, or a combination thereof. In some embodiments, variants are classified as germline variants, and/or non-actionable variants, when they are represented in the population above a threshold level, e.g., as determined using a population database such as ExAC or gnomAD. For instance, in some embodiments, variants that are represented in at least 1% of the alleles in a population are annotated as germline and/or non-actionable. In other embodiments, variants that are represented in at least 2%, at least 3%, at least 4%, at least 5%, at least 7.5%, at least 10%, or more of the alleles in a population are annotated as germline and/or non-actionable. In some embodiments, sequencing data from a matched sample from the patient, e.g., a normal tissue sample, is used to annotate variants identified in a cancerous sample from the subject. That is, variants that are present in both the cancerous sample and the normal sample represent those variants that were in the germline prior to the patient developing cancer and can be annotated as germline variants.
[0338] In various aspects, the detected genetic variants and genetic features are analyzed as a form of quality control. For example, a pattern of detected genetic variants or features may indicate an issue related to the sample, sequencing procedure, and/or bioinformatics pipeline (e.g., contamination of the sample, mislabeling of the sample, a change in reagents, a change in the sequencing procedure and/or bioinformatics pipeline, etc.).
[0339] Figure 4E illustrates an example workflow for genomic feature identification (324). This particular workflow is only an example of one possible collection and arrangement of algorithms for feature extraction from sequencing data 124. Generally, any combination of the modules and algorithms of feature extraction module 145, e.g., illustrated in Figure 1A, can be used for a bioinformatics pipeline, and particularly for a bioinformatics pipeline for analyzing liquid biopsy samples. For instance, in some embodiments, an architecture useful for the methods and systems described herein includes at least one of the modules or variant calling algorithms shown in feature extraction module 145. In some embodiments, an architecture includes at least 2, 3, 4, 5, 6, 7, 8, 9, 10, or more of the modules or variant calling algorithms shown in feature extraction module 145. Further, in some embodiments, feature extraction modules and/or algorithms not illustrated in Figure 1A find use in the methods and systems described herein.
Variant Identification
[0340] In some embodiments, variant analysis of aligned sequence reads, e.g., in SAM or BAM format, includes identification of single nucleotide variants (SNVs), multiple nucleotide variants (MNVs), indels (e.g., nucleotide additions and deletions), and/or genomic rearrangements (e.g., inversions, translocations, and gene fusions) using variant identification module 146, e.g., which includes a SNV/MNV calling algorithm (e.g., SNV/MNV calling algorithm 147), an indel calling algorithm (e.g., indel calling algorithm 148), and/or one or more genomic rearrangement calling algorithms (e.g., genomic rearrangement calling algorithm 149). An overview of an example method for variant identification is shown in Figure 4E. Essentially, the module first identifies a difference between the sequence of an aligned sequence read 124 and the reference sequence to which the sequence read is aligned (e.g., an SNV/MNV, an indel, or a genomic rearrangement) and makes a record of the variant, e.g., in a variant call format (VCF) file. For instance, software packages such as freebayes and pindel are used to call variants using sorted BAM files and reference BED files as the input. For a review of variant calling packages see, for example, Cameron, D.L. et al., Nat. Commun., 10(3240): 1-11 (2019). A raw VCF (variant call format) file is output, showing the locations where the nucleotide base in the sample is not the same as the nucleotide base in that position in the reference sequence construct.
[0341] In some embodiments, as illustrated in Figure 4E, raw VCF data is then normalized, e.g., by parsimony and left alignment. For example, software packages such as vcfbreakmulti and vt are used to normalize multi-nucleotide polymorphic variants in the raw VCF file and a variant normalized VCF file is output. See, for example, E. Garrison, “Vcflib: A C++ library for parsing and manipulating VCF files,” GitHub, available on the internet at github.com/ekg/vcflib (2012), the content of which is hereby incorporated by reference, in its entirety, for all purposes. In some embodiments, a normalization algorithm is included within the architecture of a broader variant identification software package.
[0342] An algorithm is then used to annotate the variants in the (e.g., normalized) VCF file, e.g., to determine the source of the variation, e.g., whether the variant is from the germline of the subject (e.g., a germline variant), a cancerous tissue (e.g., a somatic variant), a sequencing error, or of an undeterminable source. In some embodiments, an annotation algorithm is included within the architecture of a broader variant identification software package. However, in some embodiments, an external annotation algorithm is applied to (e.g., normalized) VCF data obtained from a conventional variant identification software package. The choice to use a particular annotation algorithm is well within the purview of the skilled artisan, and in some embodiments is based upon the data being annotated.
[0343] For example, in some embodiments, where both a liquid biopsy sample and a normal tissue sample of the patient are analyzed, variants identified in the normal tissue sample inform annotation of the variants in the liquid biopsy sample. In some embodiments, where a particular variant is identified in the normal tissue sample, that variant is annotated as a germline variant in the liquid biopsy sample. Similarly, in some embodiments, where a particular variant identified in the liquid biopsy sample is not identified in the normal tissue sample, the variant is annotated as a somatic variant when the variant otherwise satisfies any additional criteria placed on somatic variant calling, e.g., a threshold variant allele fraction (VAF) in the sample.
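The matched-normal annotation logic described above can be sketched as a set lookup. In this illustration a variant is keyed by a hypothetical (chromosome, position, reference allele, alternate allele) tuple, the only additional somatic criterion applied is a minimum VAF, and the `"unannotated"` label for variants that fail that criterion is an assumption, since the text does not name a label for that case.

```python
from typing import Dict, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chromosome, position, ref, alt)


def annotate_with_matched_normal(
    liquid_variants: Dict[Variant, float],
    normal_variants: Set[Variant],
    min_somatic_vaf: float = 0.0025,  # assumed example threshold (0.25% VAF)
) -> Dict[Variant, str]:
    """Label liquid biopsy variants using variants seen in the matched normal."""
    annotations: Dict[Variant, str] = {}
    for variant, vaf in liquid_variants.items():
        if variant in normal_variants:
            annotations[variant] = "germline"
        elif vaf >= min_somatic_vaf:
            annotations[variant] = "somatic"
        else:
            annotations[variant] = "unannotated"  # hypothetical label
    return annotations


liquid = {
    ("chr7", 100, "T", "A"): 0.012,
    ("chr17", 200, "C", "T"): 0.49,
    ("chr12", 300, "G", "A"): 0.0004,
}
normal = {("chr17", 200, "C", "T")}
labels = annotate_with_matched_normal(liquid, normal)
```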
[0344] By contrast, in some embodiments, where only a liquid biopsy sample is being analyzed, the annotation algorithm relies on other characteristics of the variant in order to annotate the origin of the variant. For instance, in some embodiments, the annotation algorithm evaluates the VAF of the variant in the sample, e.g., alone or in combination with additional characteristics of the sample, e.g., tumor fraction. Accordingly, in some embodiments, where the VAF is within a first range encompassing a value that corresponds to a 1:1 distribution of variant and reference alleles in the sample, the algorithm annotates the variant as a germline variant, because it is presumably represented in cfDNA originating from both normal and cancer tissues. Similarly, in some embodiments, where the VAF is below a baseline variant threshold, the algorithm annotates the variant as undeterminable, because there is not sufficient evidence to distinguish between the possibility that the variant arose as a result of an amplification or sequencing error and the possibility that the variant originated from a cancerous tissue. Finally, in some embodiments, where the VAF falls between the baseline variant threshold and the first range, the algorithm annotates the variant as a somatic variant.
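The three VAF-based outcomes described above can be sketched as a simple decision function. The germline range of 40% to 60% VAF is a hypothetical choice made only for illustration, the 0.25% baseline is one value within the ranges recited in this disclosure, and the `"unclassified"` label for VAFs above the germline range is an assumption because the text does not address that case.

```python
from typing import Tuple


def annotate_by_vaf(
    vaf: float,
    baseline_threshold: float = 0.0025,               # assumed: 0.25% VAF
    germline_range: Tuple[float, float] = (0.4, 0.6),  # assumed range around 1:1
) -> str:
    """Annotate a variant's likely origin from its variant allele fraction."""
    low, high = germline_range
    if vaf < baseline_threshold:
        return "undeterminable"  # cannot distinguish error from tumor signal
    if low <= vaf <= high:
        return "germline"        # roughly 1:1 variant to reference alleles
    if vaf < low:
        return "somatic"         # between the baseline and the germline range
    return "unclassified"        # above the range; not addressed in the text


calls = [annotate_by_vaf(v) for v in (0.001, 0.05, 0.5, 0.95)]
```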
[0345] In some embodiments, the baseline variant threshold is a value from 0.01% VAF to 0.5% VAF. In some embodiments, the baseline variant threshold is a value from 0.05% VAF to 0.35% VAF. In some embodiments, the baseline variant threshold is a value from 0.1% VAF to 0.25% VAF. In some embodiments, the baseline variant threshold is about 0.01% VAF, 0.015% VAF, 0.02% VAF, 0.025% VAF, 0.03% VAF, 0.035% VAF, 0.04% VAF, 0.045% VAF, 0.05% VAF, 0.06% VAF, 0.07% VAF, 0.075% VAF, 0.08% VAF, 0.09% VAF, 0.1% VAF, 0.15% VAF, 0.2% VAF, 0.25% VAF, 0.3% VAF, 0.35% VAF, 0.4% VAF, 0.45% VAF, 0.5% VAF, or greater. In some embodiments, the baseline variant threshold is different for variants located in a first region, e.g., a region identified as a mutational hotspot and/or having high genomic complexity, than for variants located in a second region, e.g., a region that is not identified as a mutational hotspot and/or having average genomic complexity. For example, in some embodiments, the baseline variant threshold is a value from 0.01% to 0.25% for variants located in the first region and is a value from 0.1% to 0.5% for variants located in the second region.
[0346] In some embodiments, the first region is a region of interest in the genome that may have been manually selected based on criteria (for example, selection may be based on a known likelihood that a region is associated with variants) and the second region is a region that did not meet the selection criteria. In some embodiments, the baseline variant threshold is a value from 0.01% to 0.5% for variants located in the first region and is a value from 1% to 5% for variants located in the second region. In some embodiments, the first region is a region of interest in the genome that may have been manually selected based on criteria (for example, selection may be based on a known likelihood that a region is associated with variants) and the second region is a region selected based on a second set of criteria.
[0347] In some embodiments, a baseline variant threshold is influenced by the sequencing depth of the reaction, e.g., a locus-specific sequencing depth and/or an average sequencing depth (e.g., across a targeted panel and/or complete reference sequence construct). In some embodiments, the baseline variant threshold is dependent upon the type of variant being detected. For example, in some embodiments, different baseline variant thresholds are set for SNPs/MNVs than for indels and/or genomic rearrangements. For instance, while an apparent SNP may be introduced by amplification and/or sequencing errors, it is much less likely that a genomic rearrangement is introduced this way and, thus, a lower baseline variant threshold may be appropriate for genomic rearrangements than for SNPs/MNVs.
[0348] In some embodiments, one or more additional criteria are required to be satisfied before a variant can be annotated as a somatic variant. For instance, in some embodiments, a threshold number of unique sequence reads encompassing the variant must be present to annotate the variant as somatic. In some embodiments, the threshold number of unique sequence reads is 2, 3, 4, 5, 7, 10, 12, 15, or greater. In some embodiments, the threshold number of unique sequence reads is only applied when certain conditions are met, e.g., when the variant allele is located in a region of a certain genomic complexity. In some embodiments, the certain genomic complexity is a low genomic complexity. In some embodiments, the certain genomic complexity is an average genomic complexity. In some embodiments, the certain genomic complexity is a high genomic complexity.
[0349] In some embodiments, a threshold sequencing coverage, e.g., a locus-specific and/or an average sequencing depth (e.g., across a targeted panel and/or complete reference sequence construct) must be satisfied to annotate the variant as somatic. In some embodiments, the threshold sequencing coverage is 50X, 100X, 150X, 200X, 250X, 300X, 350X, 400X or greater. In some embodiments, the variant is located in a microsatellite instable (MSI) region. In some embodiments, the variant is not located in a microsatellite instable (MSI) region. In some embodiments, the variant has sufficient signal-to-noise ratio.
[0350] In some embodiments, bases contributing to the variant satisfy a threshold mapping quality to annotate the variant as somatic. In some embodiments, alignments contributing to the variant must satisfy a threshold alignment quality to annotate the variant as somatic. In some embodiments, a threshold value is determined for a variant detected in a somatic (cancer) sample by analyzing the threshold metric (for example, the baseline variant threshold is determined by analyzing VAF, or the threshold sequencing coverage is determined by analyzing coverage) associated with that variant in a group of germline (normal) samples that were each processed by the same sample processing and sequencing protocol as the somatic sample (process-matched). This may be used to ensure the variants are not caused by observed artifact generating processes.
[0351] In some embodiments, the threshold value is set above the median base fraction of the threshold metric value associated with the variant in more than a specified percentage of process-matched germline samples, e.g., 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more standard deviations above the median base fraction of the threshold metric value associated with 25%, 30%, 40%, 50%, 60%, 70%, 75%, or more of the process-matched germline samples. For example, in one embodiment, the threshold value is set to a value 5 standard deviations above the median base fraction of the threshold metric value associated with the variant in more than 50% of the process-matched germline samples.
[0352] In some embodiments, variants around homopolymer and multimer regions known to generate artifacts may be specifically filtered to avoid such artifacts. For example, in some embodiments, strand specific filtering is performed in the direction of the read in order to minimize stranded artifacts. Similarly, in some embodiments, variants that do not exceed the stranded minimum deviation for their specific locus within a known artifact generating region may be filtered to avoid artifacts.
[0353] Variants may be filtered using dynamic methods, such as through the application of Bayes’ Theorem through a likelihood ratio test. In some such embodiments, the threshold is dynamically calibrated to account for variants with low support (e.g., due to low tumor fraction, low circulating tumor fraction, and/or low sequencing depths). The dynamic threshold may be based on, for example, factors such as sample specific error rate, the error rate from a healthy reference pool (e.g., a pool of process matched healthy control samples for validation of variants detected in tumor samples), and information from internal human solid tumors (e.g., for validation of variants detected in liquid biopsy samples). Accordingly, in some embodiments, the dynamic filtering method employs a tri-nucleotide context-based Bayesian model. That is, in some embodiments, the threshold for filtering any particular putative variant is dynamically calibrated using a context-based Bayesian model that considers one or more of a sample-specific sequencing error rate, a process-matched control sequencing error rate, and/or a variant-specific frequency (e.g., determined from similar cancers). In this fashion, a minimum number of alternative alleles required to positively identify a true variant is determined for individual alleles and/or loci.
[0354] In some embodiments, the dynamic threshold is selected from a Bayesian probability model, where the selection is based on one or more error rates and/or information from one or more baseline variant distributions. For example, in some embodiments, the dynamic threshold is selected based on a variant detection specificity that is calculated using a distribution of variant detection sensitivities, where the distribution of variant detection sensitivities is a function of circulating variant allele fraction from a plurality of baseline and/or reference alleles (e.g., from a cohort of subjects). Filtration of variants using a dynamic threshold (e.g., to validate the presence of a somatic variant) is performed by comparing the number of unique sequence reads encompassing the variant (e.g., a variant allele fragment count for the variant) against the dynamic threshold.
[0355] As described herein, in some embodiments, the methods described herein (e.g., methods 400-2, 450, and 500-2 as illustrated in Figures 4 and 5) include one or more data collection steps, in addition to data analysis and downstream steps. For example, as described herein, e.g., with reference to Figures 2 and 3, in some embodiments, the methods include collection of a liquid biopsy sample and, optionally, one or more matching biological samples from the subject (e.g., a matched cancerous and/or matched non-cancerous sample from the subject). Likewise, as described herein, e.g., with reference to Figures 2 and 3, in some embodiments, the methods include extraction of cfDNA from the liquid biopsy sample and, optionally, one or more matching biological samples from the subject (e.g., a matched cancerous and/or matched non-cancerous sample from the subject). Similarly, as described herein, e.g., with reference to Figures 2 and 3, in some embodiments, the methods include nucleic acid sequencing of cfDNA from the liquid biopsy sample and, optionally, one or more matching biological samples from the subject (e.g., a matched cancerous and/or matched non-cancerous sample from the subject).
[0356] However, in other embodiments, the methods described herein begin with obtaining nucleic acid sequencing results, e.g., raw or collapsed sequence reads of cfDNA from a liquid biopsy sample and, optionally, one or more matching biological samples from the subject (e.g., a matched cancerous and/or matched non-cancerous sample from the subject), from which the statistics needed for somatic variant identification (e.g., variant allele count 133-ac and/or variant allele fraction 133-af) can be determined. For example, in some embodiments, sequencing data 122 for a patient 121 is accessed and/or downloaded over network 105 by system 100.
[0357] Similarly, in some embodiments, the methods described herein begin with obtaining the genomic features needed for somatic variant identification (e.g., variant allele count 133-ac and/or variant allele fraction 133-af) from sequencing of a liquid biopsy sample and, optionally, one or more matching biological samples from the subject (e.g., a matched cancerous and/or matched non-cancerous sample from the subject). For example, in some embodiments, variant allele counts 133-cf-ac and/or variant allele fractions 133-cf-af for sequencing data 122 of patient 121 are accessed and/or downloaded over network 105 by system 100.
[0358] One goal of the liquid biopsy assays described herein is to detect variant alterations at low circulating fractions, which requires that low levels of support be sufficient to call a variant. Therefore, fixed thresholds that do not take into account variant context and local sequence-specific error cannot be used to filter variants.
[0359] In some embodiments, a dynamic variant filtering method is applied which uses an application of Bayes' Theorem through the likelihood ratio test. The dynamic threshold is based on sample specific error rate, the error rate from a healthy reference pool, and from internal human solid tumors. The basic application of the likelihood ratio test is as follows: post-test-odds = pre-test-odds * sensitivity / (1 - specificity)
[0360] Given a fixed value for post-test-odds, the specificity can be solved for. The specificity represents the minimum acceptable quantile of an error distribution (e.g., a BetaBinomial, Beta, or Poisson error distribution). The above equation can be rearranged to the one below: specificity = 1 - pre-test-odds * sensitivity / post-test-odds
[0361] Specificity can then be plugged into the quantile error (e.g., BetaBinomial, Beta, or Poisson) function to derive the minimum number of alternative alleles that must be observed at a given depth to validate a candidate somatic variant.
[0362] In some embodiments, the post-test odds are post-test probability / (1 - post-test probability). The post-test probability is the probability of having a positive variant given Bayes' Theorem. The post-test-odds is pre-defined.
[0363] In some embodiments, the pre-test odds are pre-test probability / (1 - pre-test probability). The pre-test probability is the probability of having a positive variant given the patient's cancer-type and the prevalence of variant alterations within a genomic region encompassing a candidate somatic sequence variant in a reference population having the same cancer type.
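As a minimal sketch of the rearrangement in paragraphs [0359] to [0363], the following Python converts probabilities to odds and solves the likelihood-ratio relation for specificity. The function names and the example probabilities are illustrative assumptions and are not values taken from the disclosure.

```python
def odds(probability: float) -> float:
    """Convert a probability in [0, 1) to odds: p / (1 - p)."""
    if not 0.0 <= probability < 1.0:
        raise ValueError("probability must be in [0, 1)")
    return probability / (1.0 - probability)


def required_specificity(pre_test_prob: float, sensitivity: float,
                         post_test_prob: float) -> float:
    """Solve post_odds = pre_odds * sensitivity / (1 - specificity)."""
    post_odds = odds(post_test_prob)
    if post_odds <= 0.0:
        raise ValueError("post-test probability must be greater than 0")
    # alpha = 1 - specificity = pre-test-odds * sensitivity / post-test-odds
    alpha = odds(pre_test_prob) * sensitivity / post_odds
    return 1.0 - alpha


# Hypothetical inputs: 2% regional prevalence, 90% sensitivity,
# and a pre-defined 99% post-test probability.
spec = required_specificity(0.02, 0.9, 0.99)
```

A higher pre-test probability yields a larger alpha and therefore a lower required specificity, which is the mechanism by which prevalent variants need less read support.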
[0364] In some embodiments, a pre-test-odds multiplier is applied to the pre-test odds for a resistance mutation that would develop and/or become more prominent within a heterogeneous population of cancer cells in response to therapeutic treatment. The multiplier is applied to specific genomic regions (e.g., exon windows) containing the resistance mutation position. In some embodiments, the multiplier is only applied in specified cancer contexts. For example, in some embodiments, a multiplier is applied to a pre-test odds for a genomic region containing a mutation that is resistant to at least one cancer therapy used to treat the type of cancer the subject has. For example, if a given mutation is known to have resistance to a therapy used to treat breast cancer, but not to any of the therapies used to treat brain cancer, a multiplier will be applied to the pre-test odds for the genomic region encompassing the mutation if the subject has breast cancer, but not if the subject has brain cancer.
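The context-restricted use of the multiplier can be sketched as a lookup keyed on both cancer type and genomic region. The table contents, the region label, and the function name below are hypothetical placeholders chosen for illustration only.

```python
# Hypothetical table: (cancer type, genomic region) -> pre-test-odds multiplier.
# Only combinations listed here receive a multiplier.
RESISTANCE_MULTIPLIERS = {
    ("breast", "GENE_A:exon_window_1"): 1000.0,
}


def adjusted_pre_test_odds(pre_test_odds: float, cancer_type: str,
                           region: str,
                           multipliers=RESISTANCE_MULTIPLIERS) -> float:
    """Scale the pre-test odds only when the region carries a resistance
    mutation relevant to the subject's cancer type; otherwise leave as-is."""
    return pre_test_odds * multipliers.get((cancer_type, region), 1.0)
```

With this table, a breast cancer subject has the odds for the listed region scaled, while a brain cancer subject with a variant in the same region does not.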
[0365] In some embodiments, sensitivity is the fraction of variants detected by the liquid biopsy assay at a given variant allele fraction (e.g., 0.1%, 0.25%, 0.5%, etc.).
[0366] Calculating the pre-test probability. In some embodiments, the pre-test probability is calculated using historical data for a set of reference subjects having the same type of cancer, e.g., from sequencing of solid tumor samples. In this fashion, it is possible to accurately assess the prevalence of specific variants within the population of advanced human tumors. In some embodiments, the set of reference subjects is at least 10 reference subjects.
In some embodiments, the set of reference subjects is at least 50 reference subjects. In some embodiments, the set of reference subjects is at least 100 reference subjects. In some embodiments, the set of reference subjects is at least 500 reference subjects. In some embodiments, the set of reference subjects is at least 1000 reference subjects. In some embodiments, the set of reference subjects is at least 5000 reference subjects. In some embodiments, the set of reference subjects is at least 10000 reference subjects.
[0367] In some embodiments, variant prevalence is calculated by indexing genomic regions (e.g., exons) in the reference sample and counting the number of variants in each genomic region (e.g., exon) for each cancer-type. The number of patients who have at least one variant in the genomic region (e.g., the exon) / the number of patients equals the variant prevalence. The pre-test-odds are calculated from the prevalence by pre-test-odds = prevalence / (1 - prevalence).
[0368] In some embodiments, for a cancer where the number of patients in the reference is too low to calculate prevalence, a default pan cancer cancer-type is used. Where no prevalence can be calculated, the mean variant prevalence across cancer-types is used.
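The prevalence calculation and the pan-cancer fallback described above can be sketched as follows. The input layout (one record per reference patient), the `min_patients` cutoff, and all names are assumptions made for the example; the disclosure does not specify them.

```python
from collections import Counter

PAN_CANCER = "pan_cancer"  # hypothetical label for the default cancer type


def prevalence_table(patients):
    """patients: iterable of (cancer_type, regions_with_at_least_one_variant).
    Returns per-label patient totals and per-(label, region) carrier counts,
    tallied both for the specific cancer type and for the pan-cancer pool."""
    totals, carriers = Counter(), Counter()
    for cancer_type, regions in patients:
        for label in (cancer_type, PAN_CANCER):
            totals[label] += 1
            for region in set(regions):
                carriers[(label, region)] += 1
    return totals, carriers


def lookup_prevalence(totals, carriers, cancer_type, region, min_patients=2):
    """Prevalence = patients with a variant in the region / patients.
    Falls back to the pan-cancer pool when the cohort is too small."""
    label = cancer_type if totals[cancer_type] >= min_patients else PAN_CANCER
    if totals[label] == 0:
        raise ValueError("no reference patients available")
    return carriers[(label, region)] / totals[label]


def pre_test_odds(prevalence: float) -> float:
    """pre-test-odds = prevalence / (1 - prevalence)."""
    if not 0.0 <= prevalence < 1.0:
        raise ValueError("prevalence must be in [0, 1)")
    return prevalence / (1.0 - prevalence)


patients = [
    ("breast", {"R1"}), ("breast", {"R1", "R2"}),
    ("breast", set()), ("breast", {"R2"}),
    ("lung", {"R1"}),
]
totals, carriers = prevalence_table(patients)
```

In practice such a table would be generated once and stored, consistent with reading pre-test-odds from a pre-existing file rather than recomputing them per sample.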
[0369] In some embodiments, pre-test-odds are not calculated each time an input sample is run. Rather, in some embodiments, they are read from a pre-existing file, which is evaluated and regenerated if deemed necessary.
[0370] Calculating the pre-test-odds multiplier. Resistance mutations have historically low prevalence and variant allele fraction and may incorrectly be filtered by the dynamic variant filtering method due to low pre-test-odds. The resistance mutations develop in response to therapeutic treatment, and detecting resistance mutations early provides insights into the current treatment strategy. Low variant allele frequency, low prevalence resistance mutations in historic solid tumor samples have been identified. The high sensitivity of the liquid biopsy assay described herein permits the early detection of these resistance mutations in circulating DNA. Examples of such resistance mutations include PIK3CA p.E545K in breast cancer, EGFR p.T790M in non-small cell lung cancer, and AR p.H875Y for prostate cancer.
[0371] In some embodiments, to estimate the pre-test-odds-multiplier required to pass resistance mutations down to low variant allele fractions (e.g., 0.1% or 0.25% VAF), the average depth for each variant position is taken from the reference pool (e.g., the reference pool used to determine the pre-test odds), at a high minimum average depth (e.g., of 2500X). For each resistance mutation, the number of alternate alleles required to achieve a 0.1% or 0.25% VAF is calculated. The total alternate alleles and depth for each resistance mutation are input to the Dynamic Variant Filtering method, and multipliers are applied until those resistance mutations pass the filtering strategy.
[0372] In some embodiments, the minimum multiplier required to pass resistance mutations is determined when the input sample alternate allele count is greater than the background alternate allele count (as outlined in Calculating Testing Sample Alt Allele Count and Calculating Background Alt Allele Count below). In some embodiments, the multiplier is selected based on the multiplier required to pass the variant at a low variant allele fraction (e.g., 0.1% VAF or 0.25% VAF). In some embodiments, a maximum value for the multiplier is applied, in order to prevent excessive artifacts from passing the filter. Large multipliers may permit false positive variants to pass the Dynamic Variant Filtering method; however, large multipliers are necessary to pass resistance mutations that have historically low prevalence. In some embodiments, the maximum multiplier is between 750 and 1500. In some embodiments, the maximum multiplier is between 900 and 1100. In some embodiments, the maximum multiplier is between 1000 and 1050.
[0373] In some embodiments, the usage of the pre-test-odds-multiplier is limited by cancer-type context and genomic region (e.g., exon-window). In some embodiments, therefore, the multipliers will not be applied to all genomic regions (e.g., exon-windows) given a specified cancer-type, nor all cancer-types given a specific genomic region (e.g., exon-window).
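The multiplier estimation can be sketched in two steps: compute the alternate allele count implied by a target VAF at a given depth, then search for the smallest multiplier at which the variant passes, subject to a cap. The `passes` callback stands in for the full filtering method, and the step size and default cap of 1000 are assumptions chosen to fall within the ranges stated above.

```python
import math


def alt_alleles_for_vaf(depth: int, vaf: float) -> int:
    """Smallest whole number of alternate alleles reaching `vaf` at `depth`."""
    return math.ceil(depth * vaf)


def minimum_passing_multiplier(passes, max_multiplier: float = 1000.0,
                               step: float = 1.0):
    """Increase the multiplier from 1 until `passes(multiplier)` is True.
    Returns None if no multiplier up to the cap lets the variant pass."""
    multiplier = 1.0
    while multiplier <= max_multiplier:
        if passes(multiplier):
            return multiplier
        multiplier += step
    return None
```

For example, at 2500X a 0.1% VAF corresponds to 3 alternate alleles after rounding up, and that count would be the input whose pass or fail status drives the search.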
[0374] Calculating testing sample variant allele count. In some embodiments, the filtering method (the statistical method used for the Dynamic Variant Filtering method) is selected from a beta-binomial distribution model, a beta distribution model, and a Poisson distribution model. In some embodiments, the model is a beta-binomial model. In some embodiments, when applying a quantile beta-binomial distribution, the sum of the input sample alternate reads is divided by the input sample sequencing depth at each variant position, and then multiplied by the reference pool depth (the sequencing depth at genomic positions for a pool of reference, e.g., healthy normal, controls).
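The normalization described for the quantile beta-binomial case can be written as a one-line rescaling, shown here as a sketch with assumed argument names.

```python
def normalized_alt_count(sample_alt_reads: int, sample_depth: int,
                         reference_pool_depth: float) -> float:
    """Rescale the sample's alternate read count to the reference pool depth:
    (alt reads / sample depth) * reference pool depth."""
    if sample_depth <= 0:
        raise ValueError("sample depth must be positive")
    return sample_alt_reads / sample_depth * reference_pool_depth
```

Rescaling puts the sample count on the same depth scale as the background distribution, so the two can be compared directly.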
[0375] Calculating background variant allele count. In some embodiments, the background variant allele count calculation takes into account the background error from a pool of reference subjects (e.g., healthy normal subjects), the input sample error, and the prevalence of historical variants in the reference cancer subjects. The quantile beta-binomial model considers (i) the reference pool depth (the sequencing depth at genomic positions for a pool of reference, e.g., healthy normal, controls), (ii) the background posterior error average from the input sample, and (iii) an alpha calculated from the pre-test-odds, sensitivity, and the post-test-odds (e.g., where alpha is equal to 1 - specificity = pre-test-odds * sensitivity / post-test-odds). The pre-test-odds calculated for a specific genomic region (e.g., exon window) and cancer-type will yield a unique alpha for each variant, given that the variants do not fall in the same genomic region (e.g., exon window).
[0376] In some embodiments, the background posterior error incorporates a trinucleotide error average (e.g., a reaction-specific sequencing error rate), the reference pool error (e.g., a locus-specific, process-matched sequencing error rate; e.g., a sum of alternate reads for each position / depth from a pool of healthy normal controls), and a shrinkage weight parameter.
In some embodiments, the trinucleotide error average is an aggregate of the input sample background average, where the input sample background average equals the error counts for each position divided by the position-specific sequencing depth. In some embodiments, the sample background average is then aggregated for each trinucleotide context. The trinucleotide average is used to calculate the shrinkage weight parameter. In some embodiments, the shrinkage weight parameter equals the trinucleotide error average divided by the sum of the trinucleotide error average and the reference pool error. In instances when the shrinkage weight parameter is undefined, it is changed to 1. In some embodiments, the final calculation of the background posterior error is calculated as: background posterior error = shrinkage weight parameter * trinucleotide error average + (1 - shrinkage weight parameter) * healthy subject error.
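The background posterior error calculation above can be sketched as follows. Taking the mean as the aggregation across positions within a trinucleotide context is an assumption, since the text does not name the aggregate; function and argument names are likewise illustrative.

```python
from collections import defaultdict


def trinucleotide_error_average(positions):
    """positions: iterable of (trinucleotide_context, error_count, depth).
    Returns {context: mean of per-position error_count / depth}."""
    rates = defaultdict(list)
    for context, error_count, depth in positions:
        if depth > 0:
            rates[context].append(error_count / depth)
    return {ctx: sum(vals) / len(vals) for ctx, vals in rates.items()}


def shrinkage_weight(trinuc_error: float, ref_pool_error: float) -> float:
    """trinucleotide error / (trinucleotide error + reference pool error)."""
    denominator = trinuc_error + ref_pool_error
    if denominator == 0:
        return 1.0  # an undefined weight is set to 1, as stated in the text
    return trinuc_error / denominator


def background_posterior_error(trinuc_error: float,
                               ref_pool_error: float) -> float:
    """w * trinucleotide error + (1 - w) * reference pool error."""
    w = shrinkage_weight(trinuc_error, ref_pool_error)
    return w * trinuc_error + (1.0 - w) * ref_pool_error
```

With a trinucleotide error of 0.001 and a reference pool error of 0.003, the weight is 0.25 and the posterior error is 0.0025.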
[0377] In some embodiments, a reference pool error can be used in place of an input sample background average, for calculating the background posterior average error rate.
[0378] In some embodiments, the alpha for the beta-binomial distribution is calculated using the pre-test-odds, sensitivity, and post-test-odds, where: alpha = 1 - specificity = pre-test-odds * sensitivity / post-test-odds
[0379] Accordingly, in some embodiments, the background posterior average, the reference pool depth, and the alpha are used in calculating the input to the quantile beta-binomial function. The alpha is used in calculating the mean value of the beta-binomial distribution, which equals 1 - alpha / 2. The size of the quantile beta-binomial is the matrix of the reference pool depth. The shape 1 parameter for the quantile beta-binomial function is the reference pool depth multiplied by the background posterior average error rate, and the shape 2 parameter of the quantile beta-binomial function is the shape 1 parameter subtracted from reference pool depth.
[0380] The output from the quantile BetaBinomial function is the minimum value a variant needs to be called. Any variant that has a normalized allele count below the quantile(BetaBinomial) output will be filtered due to the high background error observed at that position.
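A dependency-free sketch of the quantile beta-binomial step is given below, using the parameterization stated in paragraph [0379] (size equal to the reference pool depth, shape 1 equal to depth times posterior error, shape 2 equal to depth minus shape 1, evaluated at probability 1 - alpha / 2). The quantile is computed by summing the probability mass function directly; this is one straightforward implementation choice, not necessarily the one used in the disclosed system, and a single integer depth is assumed rather than a matrix.

```python
import math


def betabinom_logpmf(k: int, n: int, a: float, b: float) -> float:
    """log P(X = k) for a beta-binomial with size n and shapes a, b."""
    log_choose = (math.lgamma(n + 1) - math.lgamma(k + 1)
                  - math.lgamma(n - k + 1))
    log_beta_num = (math.lgamma(k + a) + math.lgamma(n - k + b)
                    - math.lgamma(n + a + b))
    log_beta_den = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return log_choose + log_beta_num - log_beta_den


def betabinom_quantile(q: float, n: int, a: float, b: float) -> int:
    """Smallest k such that P(X <= k) >= q."""
    cumulative = 0.0
    for k in range(n + 1):
        cumulative += math.exp(betabinom_logpmf(k, n, a, b))
        if cumulative >= q:
            return k
    return n  # reached only if rounding keeps the sum just below q


def min_alt_count(reference_pool_depth: int, posterior_error: float,
                  alpha: float) -> int:
    """Minimum normalized alternate allele count needed to call a variant."""
    if not 0.0 < posterior_error < 1.0:
        raise ValueError("posterior error must be in (0, 1)")
    shape1 = reference_pool_depth * posterior_error
    shape2 = reference_pool_depth - shape1
    return betabinom_quantile(1.0 - alpha / 2.0, reference_pool_depth,
                              shape1, shape2)
```

A smaller alpha (for example, from lower pre-test odds) pushes the quantile further into the tail of the background distribution and so raises the required count.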
[0381] For example, Figure 4F2 illustrates a flow chart of a method 400-2 for validating a somatic sequence variant in a test subject having a cancer condition, in accordance with some embodiments of the present disclosure.
[0382] In some embodiments, the method includes obtaining (402-2) cell-free DNA sequencing data 122 from a sequencing reaction of a liquid biopsy sample of a test subject 121 (e.g., sequence reads 123-1-1-1, ..., 123-1-1-K for sequence run 122-1-1 for a liquid biopsy sample from patient 121-1, as illustrated in Figure 1B). As described herein, in some embodiments, the obtaining includes a step of sequencing cell-free nucleic acids from a liquid biopsy sample. Example methods for sequencing cell-free nucleic acids are described herein.
[0383] Sequence reads 123 from the sequencing data 122 are then aligned (404-2) to a human reference sequence (e.g., a human genome or a portion of a human genome, e.g., 1%, 5%, 10%, 15%, 20%, 25%, 30%, 40%, 50%, 60%, 75%, 90%, 95%, 99%, or more of the human genome, or to a map of a human reference genome or a set of human reference genomes, or a portion thereof), thereby generating a plurality of aligned reads 124.
Optionally, the pre-aligned sequence reads 123 and/or aligned sequence reads 124 are pre-processed (408-2) using any of the methods disclosed above (e.g., normalization, bias correction, etc.). In some embodiments, as described herein, device 100 obtains previously aligned sequence reads.
[0384] The aligned sequence reads 124 are then evaluated to identify mismatches with the reference construct (e.g., reference genome or set of reference genomes), thereby identifying one or more candidate somatic sequence variants 132-c at respective genomic loci. The number of aligned sequence reads containing the sequence variant at the locus is determined, thereby defining a variant allele fragment count 132-c-ac (e.g., variant allele fragment count 132-c-l-ac as illustrated in Figure 1C2). In some embodiments, the number of aligned sequence reads containing the locus of the candidate variant allele (regardless of the identity of the allele represented in the sequence read) is also determined, thereby defining a variant allele locus count 132-c-lc (e.g., variant allele locus count 132-c-l-lc as illustrated in Figure 1C2). Accordingly, in some embodiments, the variant allele fragment count 132-c-ac can be compared to the variant allele locus count 132-c-lc to determine a variant allele fraction 132-c-vf (e.g., variant allele fraction 132-c-l-vf as illustrated in Figure 1C2) for the candidate variant allele. This represents a measure of the portion of sequence reads encompassing the nucleotide(s) that is altered in the candidate variant allele that include the candidate variant. In some embodiments, as described below, this measure can be used to define a sensitivity for the detection of the candidate variant based on a distribution of detection sensitivities corresponding to detection of a variant within a genomic region encompassing the locus in reference samples with defined variant allele fractions.
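The three per-locus quantities described above (fragment count, locus count, and their ratio) can be illustrated with a short sketch. Representing the aligned reads at a locus as a simple list of base calls is a simplifying assumption made for the example.

```python
def allele_counts(base_calls, alt_allele: str):
    """base_calls: the base observed at the locus in each aligned read.
    Returns (variant allele fragment count, variant allele locus count,
    variant allele fraction)."""
    locus_count = len(base_calls)
    fragment_count = sum(1 for base in base_calls if base == alt_allele)
    fraction = fragment_count / locus_count if locus_count else 0.0
    return fragment_count, locus_count, fraction


# Hypothetical locus: 1000 reads cover it, 2 of which carry the variant base.
ac, lc, vf = allele_counts(["A"] * 998 + ["T"] * 2, "T")
```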
[0385] Method 400-2 then includes obtaining (412-2) a dynamic variant count threshold 191 for the candidate variant allele. As described herein, in some embodiments, the dynamic variant count threshold is based upon a prevalence of sequence variations in a genomic region encompassing the locus of the candidate variant allele in cancer patients sharing one or more similarities with the test subject. For example, in some embodiments, this prevalence defines a pre-test odds that the test subject has a sequence variant within the genomic region encompassing the locus at which the candidate sequence variant is located. In some embodiments, this pre-test odds is used in an application of Bayes theorem to derive a minimal amount of support required of the sequencing reaction to validate the presence of the candidate sequence variant in a cancerous tissue of the subject at a desired confidence level. Information about Bayes theorem and Bayesian inference can be found, for instance, in Section 8.7 of Stuart, A. and Ord, K. (1994), Kendall's Advanced Theory of Statistics:
Volume I - Distribution Theory, Edward Arnold; and Gelman, A. et al. (2013), Bayesian Data Analysis, Third Edition, Chapman and Hall/CRC, ISBN 978-1-4398-4095-5, the disclosures of both of which are incorporated herein by reference for their teachings of how to implement Bayes' theorem and Bayesian inference.
[0386] In some embodiments, the prevalence of sequence variants in the genomic region encompassing the locus of the candidate variant allele is determined from a population of reference cancer subjects having the same type of cancer. In some embodiments, the population of reference cancer subjects is further defined by a matching personal characteristic, e.g., an age, gender, race, smoking status, or any other personal characteristic. In some embodiments, the population of reference subjects is further defined by a plurality of matching personal characteristics, e.g., at least 2, 3, 4, 5, 6, 7, 8, 9, 10, or more personal characteristics, in addition to cancer type.
[0387] For instance, in some embodiments, the prevalence of sequence variants is determined from variant prevalence training data 192, as illustrated in Figure 1F. The variant prevalence training data 192 includes data on the variants found in a cancerous tissue from a plurality of reference subjects 193. For example, training data 192 for reference subject 1 193-1 includes a cancer type 194-1 and a list of somatic sequence variants 195-1, including individual variants 196-1-1 . . . 196-1-S. To determine a prevalence for a particular candidate sequence variant detected for a test subject, a genomic region encompassing the locus of the candidate sequence variant is defined (e.g., the exon of a gene in which a candidate sequence variant is detected). Then, it is determined what portion of reference subjects 193, that have the same cancer as the test subject, have a sequence variant located within the defined genomic region (e.g., the exon of the gene).
[0388] In some embodiments, e.g., when only a limited set of defined candidate variants will be validated, sequence variant prevalence is predetermined and stored in a database, e.g., in non-persistent memory 111, or in an addressable remote server, as a look-up table. In other embodiments, system 100 determines a sequence variant prevalence for a genomic region and matching patient profile upon identification of a candidate sequence variant, e.g., by filtering variant prevalence training data 192 for the relevant genomic region and matching reference subjects.
[0389] Generally, the genomic region encompassing the candidate sequence variant is larger than a single nucleotide. For example, in some embodiments, the genomic region includes at least 10 nucleotides, at least 50 nucleotides, at least 100 nucleotides, at least 250 nucleotides, at least 500 nucleotides, at least 1000 nucleotides, at least 2500 nucleotides, or more nucleotides. In some embodiments, the genomic region is no larger than 10,000 nucleotides, no larger than 7500 nucleotides, no larger than 5000 nucleotides, no larger than 2500 nucleotides, or fewer nucleotides. In some embodiments, the genomic region is from 10 nucleotides to 10,000 nucleotides. In some embodiments, the genomic region is from 25 nucleotides to 5000 nucleotides. In some embodiments, the genomic region is from 50 nucleotides to 2500 nucleotides.
[0390] In some embodiments, when the candidate sequence variant falls within a protein coding sequence, the genomic region is defined as the exon in which the candidate sequence variant is located. In some embodiments, the genomic region is defined as several adjacent exons, including the exon in which the candidate sequence variant is located. In some embodiments, when the candidate sequence variant falls within a protein coding sequence, the genomic region is defined as all exons of the gene in which the candidate sequence variant is located. In some embodiments, when the candidate sequence variant falls within a protein coding sequence, the genomic region is defined as the entire gene in which the candidate sequence variant is located. Similarly, in some embodiments, when the candidate sequence variant falls within an intronic sequence of a gene, the genomic region is defined as the entire intron in which the candidate sequence variant is located, or several adjacent introns including the intron in which the candidate sequence variant is located.
[0391] In some embodiments, the genomic region encompassing the candidate sequence variant is a fixed window encompassing, e.g., surrounding, the candidate sequence variant. For example, in some embodiments, when the candidate sequence variant falls within a non-coding portion of the genome, the genomic region is defined as a fixed window surrounding the candidate sequence variant. However, in some embodiments, when the sequence variant falls within a non-coding genetic element, e.g., a promoter, enhancer, etc., the genomic region is defined as the entirety of the genetic element.
[0392] In some implementations, the genomic region encompassing the candidate sequence variant is dependent upon the sequence context of the locus. For example, when the candidate sequence variant falls within a coding sequence, the exon or several adjacent exons defines the genomic region, but when the candidate sequence variant falls within a non-coding sequence, the genomic region is defined by a fixed window encompassing the candidate sequence variant.
[0393] In some embodiments, the genomic region encompassing the candidate sequence variant is dependent upon a known or inferred effect of the sequence variant. For instance, as described in more detail below, in some embodiments, when the candidate sequence variant causes, or is inferred to cause, a partial or complete loss of function mutation in a gene, the genomic region is defined by all exons of the gene in which the candidate sequence variant is located. Similarly, as described in more detail below, in some embodiments, when the candidate sequence variant causes, or is inferred to cause, a gain of function mutation in a gene having one or more hotspots for gain of function mutations, the genomic region is defined as those exons of the gene encompassing the one or more hotspots.
[0394] In some embodiments, when the candidate sequence variant falls within a genomic region associated with a known therapeutic resistance gene for the cancer of the subject, the pre-test odds determined based on the historical prevalence data is multiplied by a pre-test-odds multiplier (e.g., as described above).
[0395] In some embodiments, the Bayesian analysis is further informed by defining the specificity of variant detection based on an apparent variant allele fraction in the sample. For example, in some embodiments, the variant allele fraction for the candidate sequence variant is determined by a comparison of the variant allele fragment count 132-c-ac to the variant allele locus count 132-c-lc (e.g., a ratio of the variant allele fragment count to the variant allele locus count), thereby determining a variant allele fraction 132-c-vf. In some embodiments, the variant allele fraction is then compared to a distribution of variant detection specificities established based on a set of training samples (e.g., sensitivity distribution training data) with known variant allele fractions. For example, in some embodiments, nucleic acids from each of a plurality of training samples 181 having a known variant allele fraction 184 for one or more variant alleles 183 are sequenced according to a process-matched sequencing reaction (e.g., using a substantially identical or identical sequencing reaction), and it is determined whether each sequence variant can be detected, e.g., defining a detection status 185 for each locus/variant 183. Over a large number of training samples, a specificity of detection of variants having different variant allele fractions can be determined. In some embodiments, the specificity is determined on a locus-by-locus basis, such that the specificity of detection is specific for the genomic region or locus encompassing the candidate sequence variant. In some embodiments, the specificity is determined globally, e.g., not on a locus-by-locus basis.
[0396] A correlation can then be established between the measured detection specificity and the variant allele fraction (e.g., variant detection sensitivity distribution 186). In some embodiments, the correlation is a linear or non-linear fit between measured detection specificities and variant allele fractions. In other embodiments, the correlation is determined by binning specificities (e.g., in bins 187) as a function of ranges of variant allele fractions 188, and determining a measure of central tendency (e.g., a mean) for the specificities 189 in the bin. The variant allele fraction 132-c-vf determined for the candidate sequence variant is then compared to the established correlation (e.g., variant detection sensitivity distribution 186) to define the specificity of detection for the candidate sequence variant.
[0397] In some embodiments, the Bayesian analysis is further informed by accounting for the sequencing error rate for the variant allele and, accordingly, the probability that the candidate sequence variant is a product of a sequencing error, rather than a genomic variant. In some embodiments, a reaction-specific error rate (e.g., a trinucleotide sequencing error rate) is determined for the sequencing reaction (e.g., using an internal control spiked into the reaction). In some embodiments, a locus-specific error rate is determined from historical sequencing errors at the genomic region, or specific locus, encompassing the candidate sequence variant. In some embodiments, both a reaction-specific sequencing error rate and a locus-specific error rate are used to define a variant count distribution (e.g., variant count distribution 190), representing the number of variant allele counts (e.g., variant allele fragment count 132-c-ac) necessary to validate the presence of the candidate variant sequence in the cancer of the subject at a defined detection sensitivity.
In some embodiments, a beta binomial distribution is established based on the reaction-specific sequencing error rate and the locus-specific error rate.
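The binning approach described in paragraph [0396] can be sketched as follows: training records with known variant allele fractions and a detected or not-detected status are grouped into VAF ranges, the mean detection rate is computed per bin, and a candidate's VAF is then looked up against the bins. The bin edges, record layout, and function names are assumptions made for the example.

```python
from bisect import bisect_right


def bin_detection_rates(records, edges):
    """records: iterable of (known_vaf, detected) pairs from training samples.
    edges: ascending bin edges; bin i covers [edges[i], edges[i + 1]).
    Returns the mean detection rate per bin (None for empty bins)."""
    hits = [0] * (len(edges) - 1)
    totals = [0] * (len(edges) - 1)
    for vaf, detected in records:
        i = bisect_right(edges, vaf) - 1
        if 0 <= i < len(totals):
            totals[i] += 1
            hits[i] += int(detected)
    return [h / t if t else None for h, t in zip(hits, totals)]


def lookup_rate(rates, edges, vaf):
    """Detection rate of the bin containing `vaf`, or None if out of range."""
    i = bisect_right(edges, vaf) - 1
    return rates[i] if 0 <= i < len(rates) else None


# Hypothetical training data and bin edges.
edges = [0.0, 0.001, 0.0025, 0.005, 1.0]
records = [(0.0005, False), (0.0005, True), (0.002, True),
           (0.002, True), (0.002, False), (0.004, True)]
rates = bin_detection_rates(records, edges)
```

A fitted curve could replace the bins, as the text notes; binning is shown because it needs no model assumptions beyond the choice of edges.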
[0398] Method 400-2 then includes applying (414-2) the dynamic variant count threshold (e.g., locus-specific dynamic variant count threshold 191) to the sequencing data, e.g., by determining whether the variant allele fragment count 132-c-ac for the candidate sequence variant satisfies the threshold, and validating the candidate sequence variant (e.g., creating a record 132-v of the validation) when the threshold is satisfied or rejecting the candidate sequence variant when the threshold is not satisfied. In some embodiments, one or more additional filters, relating to global sequencing metrics and/or locus-specific sequencing metrics (e.g., one or more of variant locus coverage filter(s) 463, variant allele fraction filter(s) 465, variant support mapping filter(s) 467, variant support sequencing quality filter(s) 469, and low complexity region filter(s) 471, as illustrated in Figure 1D2) must be satisfied before validating a candidate sequence variant.
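The validation step can be sketched as a threshold comparison followed by any additional filters. Treating a count equal to the threshold as satisfying it is an assumption, as are the `Candidate` fields and the example cutoffs, which stand in for the coverage, mapping, and quality filters named above.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Candidate:
    alt_count: float   # variant allele fragment count (or normalized count)
    locus_depth: int   # reads covering the locus
    mean_mapq: float   # mean mapping quality of supporting reads


def validate(candidate: Candidate, dynamic_threshold: float,
             extra_filters: Sequence[Callable[[Candidate], bool]] = ()) -> bool:
    """Validate only if the dynamic count threshold and every extra filter
    are satisfied; otherwise the candidate is rejected."""
    if candidate.alt_count < dynamic_threshold:
        return False
    return all(check(candidate) for check in extra_filters)


# Hypothetical additional filters with illustrative cutoffs.
filters = [
    lambda c: c.locus_depth >= 100,   # variant locus coverage filter
    lambda c: c.mean_mapq >= 30.0,    # variant support mapping filter
]
```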
[0399] As described in further detail herein, in some embodiments, one or more validated variant statuses 132-v are used to match (424-2) the subject with a targeted therapy and/or a clinical trial. In some embodiments, as described in further detail herein, one or more validated variant statuses 132-v for one or more actionable variants 139-1-1, one or more matched therapies 139-1-2, and/or one or more matched clinical trials are used to generate (426-2) a patient report 139-1-3. In some embodiments, the patient report is transmitted to a medical professional treating the subject. In some embodiments, the patient is then administered (428-2) a personalized course of therapy, e.g., based on a matched therapy and/or clinical trial.
[0400] In some embodiments, the methods of validating a candidate somatic sequence variant using a dynamic threshold described herein fall within the context of a larger variant detection method, e.g., as illustrated by method 450 illustrated in Figures 4G1-4G3. For example, in some embodiments, the method includes obtaining (452) cfDNA sequence reads, as described herein, and aligning (454) those reads to a reference construct (e.g., a reference genome or mapped representation of several reference genomes), to generate aligned sequences 124 (e.g., a plurality of unique sequence reads). In some embodiments, putative somatic sequence variants are identified (456), e.g., those sequence variants having a variant allele fraction that is lower than expected for a germline sequence variant (which should be around 50% after accounting for an estimated circulating tumor fraction for the liquid biopsy sample), e.g., less than 30%, less than 20%, less than 10% etc. One or more candidate somatic sequence variants are then validated by applying one or more filters. For instance, as described herein, a dynamic variant count threshold is determined (459) and then used to apply (460) a dynamic probabilistic variant count filter to sequencing data for the candidate somatic sequence variant. In some embodiments, the method also includes applying (462) a variant loci coverage filter. In some embodiments, the method also includes applying (464) a variant allele fraction filter. In some embodiments, the method also includes applying (466) a variant support mapping filter. In some embodiments, the method also includes applying (468) a variant support sequencing quality filter. In some embodiments, the method also includes applying (470) a low complexity region filter. When all selected candidate somatic sequence variants have been validated or rejected according to these filters (472), the process proceeds with a reporting function.
[0401] In some embodiments, method 450 also includes validating (474) the sequencing data globally, using any of the metrics described herein. In some embodiments, the validation includes applying (476) a loci minimal coverage filter. In some embodiments, the validation includes applying (478) a loci central tendency coverage filter. In some embodiments, the validation includes applying (480) a total sequence read filter. In some embodiments, the validation includes applying (481) a sequence read quality filter. In some embodiments, the validation includes applying a sequencing control filter (482). The entire sequencing reaction is then validated or rejected (483) based on whether the sequencing data passes these global filters.
[0402] In some embodiments, method 450 also includes validating (485) one or more germline mutations. In some embodiments, candidate germline sequence variants are identified (484), e.g., those sequence variants having a variant allele fraction that is higher than expected for a somatic sequence variant. In some embodiments, the validation includes applying (486) a germline-specific variant allele fraction filter. In some embodiments, the validation includes applying (487) a variant support mapping filter. In some embodiments, the validation includes applying (488) a variant support sequencing quality filter. When all selected candidate germline sequence variants have been validated or rejected according to these filters (489), the process proceeds with a reporting function.
[0403] As described in further detail herein, in some embodiments, one or more validated variant statuses 132-v are used to match (490) the subject with a targeted therapy and/or a clinical trial. In some embodiments, as described in further detail herein, one or more validated variant statuses 132-v for one or more actionable variants 139-1-1, one or more matched therapies 139-1-2, and/or one or more matched clinical trials are used to generate (492) a patient report 139-1-3. In some embodiments, the patient report is transmitted to a medical professional treating the subject. In some embodiments, the patient is then administered (494) a personalized course of therapy, e.g., based on a matched therapy and/or clinical trial.
[0404] In some embodiments, all, or nearly all, of the aligned sequence reads are evaluated to identify candidate sequence variants (e.g., candidate somatic sequence variants and/or candidate germline sequence variants). In other embodiments, a subset of the aligned sequence reads is evaluated to identify candidate sequence variants. For example, in one embodiment, a targeted-panel sequencing reaction is used to generate sequencing data 122 and only sequence reads corresponding to the target panel (on-target reads) are evaluated to identify candidate sequence variants. In some embodiments, a targeted-panel sequencing reaction is used to generate sequencing data 122 and a subset of sequence reads corresponding to a subset of the target panel is evaluated to identify candidate sequence variants. In some embodiments, a subset of the sequence reads corresponding to a subset of genes, regardless of whether the sequencing reaction is a targeted-panel sequencing reaction, a whole exome sequencing reaction, or a whole genome sequencing reaction, is evaluated to identify candidate sequence variants. In some embodiments, a subset of sequence reads corresponding to a defined set of regions within the genome, e.g., one or more genes, one or more introns, one or more exons, one or more subregions of an intron and/or exon associated with cancer etiology, etc., is evaluated to identify candidate sequence variants.
[0405] Alternatively, in some embodiments, regardless of what subset of aligned sequence reads is evaluated to identify candidate sequence variants, only a subset of candidate sequence variants is further validated. For example, in some embodiments, only candidate sequence variants corresponding to the target panel (on-target reads) are validated. Similarly, in some embodiments, only candidate sequence variants corresponding to a subset of the target panel are validated. Likewise, in some embodiments, only candidate sequence variants corresponding to a subset of genes, regardless of whether the sequencing reaction is a targeted-panel sequencing reaction, a whole exome sequencing reaction, or a whole genome sequencing reaction, are validated. Similarly, in some embodiments, only candidate variants corresponding to a defined set of regions within the genome, e.g., one or more genes, one or more introns, one or more exons, one or more subregions of an intron and/or exon associated with cancer etiology, etc., are validated.
[0406] In some embodiments, different sets of sequence variants are evaluated depending on the type of cancer being evaluated. That is, when the subject has a first type of cancer, candidate sequence variants in a first set of genomic loci are evaluated, typically associated with the etiology of the first type of cancer and/or a particular course of actionable therapy for the first type of cancer, and when the subject has a second type of cancer, candidate sequence variants in a second set of genomic loci are evaluated, typically associated with the etiology of the second type of cancer and/or a particular course of actionable therapy for the second type of cancer. These selections may be applied at the level of initial sequence read evaluation (e.g., only sequence reads corresponding to a defined set of loci are evaluated to identify a candidate sequence variant) or the validation level (e.g., sequence reads corresponding to a larger set of loci are evaluated to identify candidate sequence variants, but only those candidates corresponding to a defined set are further validated).
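By way of non-limiting illustration, a cancer-type-dependent selection of loci can be sketched as a lookup from cancer type to a set of genomic positions, applied either to the loci of aligned reads or to the loci of already-identified candidates. The names `LOCI_BY_CANCER` and `restrict_to_cancer_type` are hypothetical, and the two short position sets are a small illustrative subset of the chromosomal positions enumerated elsewhere in this section for breast cancer and prostate cancer, not a complete list.

```python
from typing import Dict, Iterable, List, Set, Tuple

Locus = Tuple[str, int]  # (chromosome, position)

# Illustrative subset only: ERBB2 and ESR1 positions for breast cancer,
# AR and ERBB2 positions for prostate cancer.
LOCI_BY_CANCER: Dict[str, Set[Locus]] = {
    "breast": {("17", 37880220), ("6", 152419922)},
    "prostate": {("X", 66766292), ("17", 37880220)},
}


def restrict_to_cancer_type(loci: Iterable[Locus],
                            cancer_type: str) -> List[Locus]:
    """Keep only loci in the set defined for the given cancer type.

    The same function serves both levels described above: pass the loci
    of aligned reads to restrict at the read-evaluation level, or the loci
    of candidate variants to restrict at the validation level.
    """
    allowed = LOCI_BY_CANCER.get(cancer_type, set())
    return [locus for locus in loci if locus in allowed]
```

An unknown cancer type yields an empty selection in this sketch; an actual embodiment might instead fall back to a pan-cancer set.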
[0407] Similarly, in some embodiments, for one or more target loci falling within a gene exon, only candidate sequence variants that would result in an amino acid change in the amino acid sequence encoded by the gene are evaluated. In some embodiments, any candidate sequence variant resulting in an amino acid change is evaluated. In some embodiments, candidate sequence variants resulting in a defined amino acid change, e.g., an amino acid change associated with cancer etiology and/or a particular actionable cancer therapy, are evaluated. In some embodiments, only a subset of validated sequence variants is included on a clinical report for the sample. That is, in some embodiments, aligned sequence reads corresponding to all or a subset of genomic loci are evaluated to identify candidate sequence variants, all or a subset of identified candidate sequence variants are evaluated for validation, and only a subset of all possibly validated sequence variants is included on a clinical report generated for the sample.
[0408] For example, lists of example candidate sequence variants for evaluation in breast cancer, non-small cell lung cancer, prostate cancer, pan cancer, and cancer of unknown origin are provided below. Standard nomenclature is used to describe chromosomal location and specific amino acid variants, as described further by the Human Genome Variation Society, e.g., at the URL varnomen.hgvs.org/recommendations/protein/variant/substitution/.
[0409] For example, in some embodiments, the subject has breast cancer and candidate variants associated with at least one of the following genes and/or genetic loci are evaluated: ERBB2 (or a genetic locus including a chromosomal position of 17:37880220 and/or 17:37881064), EGFR (or a genetic locus including a chromosomal position of 7:55227926, 7:55242511, and/or 7:55249022), ESR1 (or a genetic locus including a chromosomal position of 6:152419922, 6:152419923 and/or 6:152419926), KRAS (or a genetic locus including a chromosomal position of 12:25380275, 12:25380276, 12:25380277, and/or 12:25380279), MAP2K1 (or a genetic locus including a chromosomal position of 15:66729162 and/or 15:66729163), MET (or a genetic locus including a chromosomal position of 7:116422117 and/or 7:116423413), MTOR (or a genetic locus including a chromosomal position of 1:11187094, 1:11187096, and/or 1:11187796), NTRK1 (or a genetic locus including a chromosomal position of 1:156846342, 1:156849044 and/or 1:156849144), and PIK3CA (or a genetic locus including a chromosomal position of 3:178936082, 3:178936091, 3:178936092, 3:178936093, 3:178952084, and/or 3:178952085). In some embodiments, the subject has breast cancer and candidate variants associated with at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, or at least 8 of the genes listed above (or loci including the enumerated corresponding chromosomal positions) are evaluated. In some embodiments, the subject has breast cancer and candidate variants associated with any of the genes listed above (or loci including the enumerated corresponding chromosomal positions) are evaluated.
[0410] In some of the embodiments described above where the subject has breast cancer, only a subset of possible candidate sequence variants in the ERBB2 gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the ERBB2 gene includes variants resulting in an amino acid change selected from L755*, L755S, L755W, T798I, T798K, and T798R.
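By way of non-limiting illustration, restricting evaluation or reporting to an enumerated subset of amino acid changes can be sketched as parsing the short protein-level notation used in these lists (reference residue, position, variant residue, with `*` denoting a stop codon) and testing membership in the subset. The function names `parse_change` and `is_reportable` and the regular expression are assumptions made for the example. The set `ERBB2_BREAST` reproduces the ERBB2 changes listed above for breast cancer.

```python
import re
from typing import Set, Tuple

# One-letter reference residue, position, one-letter variant residue or "*".
AA_CHANGE = re.compile(r"^([A-Z])(\d+)([A-Z*])$")

ERBB2_BREAST: Set[str] = {"L755*", "L755S", "L755W", "T798I", "T798K", "T798R"}


def parse_change(change: str) -> Tuple[str, int, str]:
    """Split a change such as "L755S" into ("L", 755, "S")."""
    match = AA_CHANGE.match(change)
    if match is None:
        raise ValueError(f"not a recognized amino acid change: {change!r}")
    ref, pos, alt = match.groups()
    return ref, int(pos), alt


def is_reportable(change: str, allowed: Set[str]) -> bool:
    """True when a well-formed change belongs to the enumerated subset."""
    parse_change(change)  # reject malformed input before the lookup
    return change in allowed
```

Validating the notation before the lookup means that an extraction error in a variant label raises an exception rather than silently failing the membership test.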
[0411] In some of the embodiments described above where the subject has breast cancer, only a subset of possible candidate sequence variants in the EGFR gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the EGFR gene includes variants resulting in an amino acid change selected from G465*, G465R, D761H, D761N, D761Y, V774L, and V774M.
[0412] In some of the embodiments described above where the subject has breast cancer, only a subset of possible candidate sequence variants in the ESR1 gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the ESR1 gene includes variants resulting in an amino acid change selected from Y537D, Y537H, Y537N, Y537C, Y537S, Y537F, D538A, D538G, and D538V.
[0413] In some of the embodiments described above where the subject has breast cancer, only a subset of possible candidate sequence variants in the KRAS gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the KRAS gene includes variants resulting in an amino acid change selected from G60D, Q61H, Q61Q, Q61L, Q61P, Q61R, Q61*, Q61E, and Q61K.
[0414] In some of the embodiments described above where the subject has breast cancer, only a subset of possible candidate sequence variants in the MAP2K1 gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the MAP2K1 gene includes variants resulting in an amino acid change selected from P124A, P124S, P124T, P124R, P124L, and P124Q.
[0415] In some of the embodiments described above where the subject has breast cancer, only a subset of possible candidate sequence variants in the MET gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the MET gene includes variants resulting in an amino acid change selected from F1200I, F1200L, F1200V, Y1230D, Y1230H, and Y1230N.
[0416] In some of the embodiments described above where the subject has breast cancer, only a subset of possible candidate sequence variants in the MTOR gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the MTOR gene includes variants resulting in an amino acid change selected from A2034E, A2034G, A2034V, F2108F, F2108I, F2108L, and F2108V.
[0417] In some of the embodiments described above where the subject has breast cancer, only a subset of possible candidate sequence variants in the NTRK1 gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the NTRK1 gene includes variants resulting in an amino acid change selected from G595R, G595W, F646I, F646L, F646V, D679A, D679G, and D679V.
[0418] In some of the embodiments described above where the subject has breast cancer, only a subset of possible candidate sequence variants in the PIK3CA gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the PIK3CA gene includes variants resulting in an amino acid change selected from E542K, E545*, E545K, E545Q, E545A, E545G, E545V, E545D, E545E, H1047D, H1047Y, H1047N, H1047L, H1047P, and H1047R.
[0419] Similarly, in some embodiments, the subject has non-small cell lung cancer and candidate variants associated with at least one of the following genes and/or genetic loci are evaluated: ALK (or a genetic locus including a chromosomal position of 2:29443613, 2:29443631, 2:29443695, 2:29443697, 2:29445213, and/or 2:29445258), B2M (or a genetic locus including a chromosomal position of 15:45003745), BRAF (or a genetic locus including a chromosomal position of 7:140453135, 7:140453136, and/or 7:140453137), EGFR (or a genetic locus including a chromosomal position of 7:55227926, 7:55241704, 7:55241705, 7:55241706, 7:55242469, 7:55242511, 7:55249022, 7:55249071, 7:55249091, 7:55249092, 7:55249093, 7:55249094, and/or 7:55259515), ERBB2 (or a genetic locus including a chromosomal position of 17:37880220), KRAS (or a genetic locus including a chromosomal position of 12:25378562, 12:25378643, 12:25380275, 12:25380276, 12:25380277, 12:25380279, 12:25398255, 12:25398280, 12:25398281, 12:25398282, 12:25398283, 12:25398284, and/or 12:25398285), MAP2K1 (or a genetic locus including a chromosomal position of 15:66729162 and/or 15:66729163), MET (or a genetic locus including a chromosomal position of 7:116422117 and/or 7:116423413), NTRK1 (or a genetic locus including a chromosomal position of 1:156846342, 1:156849044, and/or 1:156849144), PIK3CA (or a genetic locus including a chromosomal position of 3:178936091, 3:178936092, 3:178936093, 3:178952072, 3:178952084, and/or 3:178952085), and STK11 (or a genetic locus including a chromosomal position of 19:1218483, 19:1220370, 19:1220487, 19:1220629, and/or 19:1220649). In some embodiments, the subject has non-small cell lung cancer and candidate variants associated with at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, or at least 10 of the genes listed above (or loci including the enumerated corresponding chromosomal positions) are evaluated. In some embodiments, the subject has non-small cell lung cancer and candidate variants associated with any of the genes listed above (or loci including the enumerated corresponding chromosomal positions) are evaluated.
[0420] In some of the embodiments described above where the subject has non-small cell lung cancer, only a subset of possible candidate sequence variants in the ALK gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the ALK gene includes variants resulting in an amino acid change selected from G1202*, G1202R, L1196L, L1196M, L1196V, F1174F, F1174L, F1174I, F1174V, I1171N, I1171S, I1171T, C1156F, C1156S, and C1156Y.
[0421] In some of the embodiments described above where the subject has non-small cell lung cancer, only a subset of possible candidate sequence variants in the BRAF gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the BRAF gene includes variants resulting in an amino acid change selected from V600*, V600A, V600E, V600G, V600L, and V600M.
[0422] In some of the embodiments described above where the subject has non-small cell lung cancer, only a subset of possible candidate sequence variants in the EGFR gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the EGFR gene includes variants resulting in an amino acid change selected from G465*, G465R, L718L, L718M, L718V, L718P, L718Q, L718R, L747I, L747L, L747V, D761H, D761N, D761Y, V774L, V774M, T790K, T790M, T790R, C797G, C797R, C797S, C797F, C797Y, C797*, C797C, C797W, L798F, L798I, L798V, L858P, L858Q, and L858R.
[0423] In some of the embodiments described above where the subject has non-small cell lung cancer, only a subset of possible candidate sequence variants in the ERBB2 gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the ERBB2 gene includes variants resulting in an amino acid change selected from L755*, L755S, and L755W.
[0424] In some of the embodiments described above where the subject has non-small cell lung cancer, only a subset of possible candidate sequence variants in the KRAS gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the KRAS gene includes variants resulting in an amino acid change selected from A146T, D119N, Q61H, Q61Q, Q61L, Q61P, Q61R, Q61*, Q61E, Q61K, G60V, Q22K, G13G, G13A, G13V, G13D, G13C, G13R, G13S, G12G, G12A, G12V, G12D, G12C, G12R, and G12S.
[0425] In some of the embodiments described above where the subject has non-small cell lung cancer, only a subset of possible candidate sequence variants in the MAP2K1 gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the MAP2K1 gene includes variants resulting in an amino acid change selected from P124A, P124S, P124T, P124R, P124L, and P124Q.
[0426] In some of the embodiments described above where the subject has non-small cell lung cancer, only a subset of possible candidate sequence variants in the MET gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the MET gene includes variants resulting in an amino acid change selected from F1200I, F1200L, F1200V, Y1230D, Y1230H, and Y1230N.
[0427] In some of the embodiments described above where the subject has non-small cell lung cancer, only a subset of possible candidate sequence variants in the NTRK1 gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the NTRK1 gene includes variants resulting in an amino acid change selected from G595R, G595W, F646I, F646L, F646V, D679A, D679G, and D679V.
[0428] In some of the embodiments described above where the subject has non-small cell lung cancer, only a subset of possible candidate sequence variants in the PIK3CA gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the PIK3CA gene includes variants resulting in an amino acid change selected from E545*, E545K, E545Q, E545A, E545G, E545V, E545D, E545E, M1043V, H1047D, H1047Y, H1047N, H1047L, H1047P, and H1047R.
[0429] In some of the embodiments described above where the subject has non-small cell lung cancer, only a subset of possible candidate sequence variants in the STK11 gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the STK11 gene includes variants resulting in an amino acid change selected from E120*, D194Y, S216F, and E223*, as well as nucleotide substitution c.465-2A>T.
[0430] Similarly, in some embodiments, the subject has prostate cancer and candidate variants associated with at least one of the following genes and/or genetic loci are evaluated: AR (or a genetic locus including a chromosomal position of X:66766292, X:66931463, X:66931504, X:66937370, X:66937371, X:66937372, X:66943543, X:66943549, and/or X:66943552), EGFR (or a genetic locus including a chromosomal position of 7:55227926, 7:55242511, and/or 7:55249022), ERBB2 (or a genetic locus including a chromosomal position of 17:37880220), KRAS (or a genetic locus including a chromosomal position of 12:25380275, 12:25380276, and/or 12:25380277), MAP2K1 (or a genetic locus including a chromosomal position of 15:66729162 and/or 15:66729163), MET (or a genetic locus including a chromosomal position of 7:116422117 and/or 7:116423413), NTRK1 (or a genetic locus including a chromosomal position of 1:156846342, 1:156849044, and/or 1:156849144), and PIK3CA (or a genetic locus including a chromosomal position of 3:178952084 and/or 3:178952085). In some embodiments, the subject has prostate cancer and candidate variants associated with at least 2, at least 3, at least 4, at least 5, at least 6, or at least 7 of the genes listed above (or loci including the enumerated corresponding chromosomal positions) are evaluated. In some embodiments, the subject has prostate cancer and candidate variants associated with any of the genes listed above (or loci including the enumerated corresponding chromosomal positions) are evaluated.
[0431] In some of the embodiments described above where the subject has prostate cancer, only a subset of possible candidate sequence variants in the AR gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the AR gene includes variants resulting in an amino acid change selected from W435L, L702H, L702P, L702R, V716M, W742G, W742R, W742*, W742L, W742S, W742C, H875Y, F877L, T878A, T878P, and T878S.
[0432] In some of the embodiments described above where the subject has prostate cancer, only a subset of possible candidate sequence variants in the EGFR gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the EGFR gene includes variants resulting in an amino acid change selected from G465*, G465R, D761H, D761N, D761Y, V774L, and V774M.
[0433] In some of the embodiments described above where the subject has prostate cancer, only a subset of possible candidate sequence variants in the ERBB2 gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the ERBB2 gene includes variants resulting in an amino acid change selected from L755*, L755S, and L755W.
[0434] In some of the embodiments described above where the subject has prostate cancer, only a subset of possible candidate sequence variants in the KRAS gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the KRAS gene includes variants resulting in an amino acid change selected from Q61H, Q61Q, Q61L, Q61P, Q61R, Q61*, Q61E, and Q61K.
[0435] In some of the embodiments described above where the subject has prostate cancer, only a subset of possible candidate sequence variants in the MAP2K1 gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the MAP2K1 gene includes variants resulting in an amino acid change selected from P124A, P124S, P124T, P124R, P124L, and P124Q.
[0436] In some of the embodiments described above where the subject has prostate cancer, only a subset of possible candidate sequence variants in the MET gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the MET gene includes variants resulting in an amino acid change selected from F1200I, F1200L, F1200V, Y1230D, Y1230H, and Y1230N.
[0437] In some of the embodiments described above where the subject has prostate cancer, only a subset of possible candidate sequence variants in the NTRK1 gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the NTRK1 gene includes variants resulting in an amino acid change selected from G595R, G595W, F646I, F646L, F646V, D679A, D679G, and D679V.
[0438] In some of the embodiments described above where the subject has prostate cancer, only a subset of possible candidate sequence variants in the PIK3CA gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the PIK3CA gene includes variants resulting in an amino acid change selected from H1047D, H1047Y, H1047N, H1047L, H1047P, and H1047R.
[0439] In one example, the cancer condition is any type of cancer (for example, pan cancer) and the somatic variants validated by this method include variants associated with any of the following genes: EGFR (or a genetic locus including a chromosomal position of 7:55227926, 7:55242511, and/or 7:55249022), ERBB2 (or a genetic locus including a chromosomal position of 17:37880220), KRAS (or a genetic locus including a chromosomal position of 12:25380275, 12:25380276, and/or 12:25380277), MAP2K1 (or a genetic locus including a chromosomal position of 15:66729162 and/or 15:66729163), MET (or a genetic locus including a chromosomal position of 7:116422117 and/or 7:116423413), NTRK1 (or a genetic locus including a chromosomal position of 1:156846342, 1:156849044, and/or 1:156849144), PIK3CA (or a genetic locus including a chromosomal position of 3:178952084 and/or 3:178952085), and TP53. In some embodiments, the subject has any cancer (e.g., pan cancer) and candidate variants associated with at least 2, at least 3, at least 4, at least 5, at least 6, or at least 7 of the genes listed above (or loci including the enumerated corresponding chromosomal positions) are evaluated. In some embodiments, the subject has any cancer (e.g., pan cancer) and candidate variants associated with any of the genes listed above (or loci including the enumerated corresponding chromosomal positions) are evaluated.
[0440] In some of the embodiments described above where the subject has any type of cancer (e.g., pan-cancer), only a subset of possible candidate sequence variants in the EGFR gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the EGFR gene includes variants resulting in an amino acid change selected from G465*, G465R, D761H, D761N, D761Y, V774L, and V774M.
[0441] In some of the embodiments described above where the subject has any type of cancer (e.g., pan-cancer), only a subset of possible candidate sequence variants in the ERBB2 gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the ERBB2 gene includes variants resulting in an amino acid change selected from L755*, L755S, and L755W.
[0442] In some of the embodiments described above where the subject has any type of cancer (e.g., pan-cancer), only a subset of possible candidate sequence variants in the KRAS gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the KRAS gene includes variants resulting in an amino acid change selected from Q61H, Q61Q, Q61L, Q61P, Q61R, Q61*, Q61E, and Q61K.
[0443] In some of the embodiments described above where the subject has any type of cancer (e.g., pan-cancer), only a subset of possible candidate sequence variants in the MAP2K1 gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the MAP2K1 gene includes variants resulting in an amino acid change selected from P124A, P124S, P124T, P124R, P124L, and P124Q.
[0444] In some of the embodiments described above where the subject has any type of cancer (e.g., pan-cancer), only a subset of possible candidate sequence variants in the MET gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the MET gene includes variants resulting in an amino acid change selected from F1200I, F1200L, F1200V, Y1230D, Y1230H, and Y1230N.
[0445] In some of the embodiments described above where the subject has any type of cancer (e.g., pan-cancer), only a subset of possible candidate sequence variants in the NTRK1 gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the NTRK1 gene includes variants resulting in an amino acid change selected from G595R, G595W, F646I, F646L, F646V, D679A, D679G, and D679V.
[0446] In some of the embodiments described above where the subject has any type of cancer (e.g., pan-cancer), only a subset of possible candidate sequence variants in the PIK3CA gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the PIK3CA gene includes variants resulting in an amino acid change selected from H1047D, H1047Y, H1047N, H1047L, H1047P, and H1047R.
[0447] Similarly, in some embodiments, the subject has a tumor of unknown origin or a cancer of unknown primary and candidate variants associated with at least one of the following genes and/or genetic loci are evaluated: EGFR (or a genetic locus including a chromosomal position of 7:55227926, 7:55242511, and/or 7:55249022), ERBB2 (or a genetic locus including a chromosomal position of 17:37880220), KRAS (or a genetic locus including a chromosomal position of 12:25380275, 12:25380276, 12:25380277, and/or 12:25398255), MAP2K1 (or a genetic locus including a chromosomal position of 15:66729162 and/or 15:66729163), MET (or a genetic locus including a chromosomal position of 7:116422117 and/or 7:116423413), NRAS (or a genetic locus including a chromosomal position of 1:115258748), NTRK1 (or a genetic locus including a chromosomal position of 1:156846342, 1:156849044, and/or 1:156849144), PIK3CA (or a genetic locus including a chromosomal position of 3:178927980, 3:178952084 and/or 3:178952085), and TP53. In some embodiments, the subject has any cancer (e.g., pan cancer) and candidate variants associated with at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, or at least 8 of the genes listed above (or loci including the enumerated corresponding chromosomal positions) are evaluated. In some embodiments, the subject has any cancer (e.g., pan cancer) and candidate variants associated with any of the genes listed above (or loci including the enumerated corresponding chromosomal positions) are evaluated.
[0448] In some of the embodiments described above where the subject has a tumor of unknown origin or a cancer of unknown primary, only a subset of possible candidate sequence variants in the EGFR gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the EGFR gene includes variants resulting in an amino acid change selected from G465*, G465R, D761H, D761N, D761Y, V774L, and V774M.
[0449] In some of the embodiments described above where the subject has a tumor of unknown origin or a cancer of unknown primary, only a subset of possible candidate sequence variants in the ERBB2 gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the ERBB2 gene includes variants resulting in an amino acid change selected from L755*, L755S, and L755W.
[0450] In some of the embodiments described above where the subject has a tumor of unknown origin or a cancer of unknown primary, only a subset of possible candidate sequence variants in the KRAS gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the KRAS gene includes variants resulting in an amino acid change selected from Q61H, Q61Q, Q61L, Q61P, Q61R, Q61*, Q61E, Q61K, and Q22K.
[0451] In some of the embodiments described above where the subject has a tumor of unknown origin or a cancer of unknown primary, only a subset of possible candidate sequence variants in the MAP2K1 gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the MAP2K1 gene includes variants resulting in an amino acid change selected from P124A, P124S, P124T, P124R, P124L, and P124Q.
[0452] In some of the embodiments described above where the subject has a tumor of unknown origin or a cancer of unknown primary, only a subset of possible candidate sequence variants in the MET gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the MET gene includes variants resulting in an amino acid change selected from F1200I, F1200L, F1200V, Y1230D, Y1230H, and Y1230N.
[0453] In some of the embodiments described above where the subject has a tumor of unknown origin or a cancer of unknown primary, only a subset of possible candidate sequence variants in the NRAS gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the NRAS gene includes variants resulting in an amino acid change of G12S.
[0454] In some of the embodiments described above where the subject has a tumor of unknown origin or a cancer of unknown primary, only a subset of possible candidate sequence variants in the NTRK1 gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the NTRK1 gene includes variants resulting in an amino acid change selected from G595R, G595W, F646I, F646L, F646V, D679A, D679G, and D679V.
[0455] In some of the embodiments described above where the subject has a tumor of unknown origin or a cancer of unknown primary, only a subset of possible candidate sequence variants in the PIK3CA gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the PIK3CA gene includes variants resulting in an amino acid change selected from C420R, H1047D, H1047Y, H1047N, H1047L, H1047P, and H1047R.
[0456] In other embodiments, the cancer condition is acute myeloid leukemia, adrenal cancer, b cell lymphoma, basal cell carcinoma, biliary cancer, bladder cancer, brain cancer, breast cancer, cervical cancer, chromophobe renal cell carcinoma, clear cell renal cell carcinoma, colorectal cancer, confirm at path review (cancer type unconfirmed), endocrine tumor, endometrial cancer, esophageal cancer, gastric cancer, gastrointestinal stromal tumor, glioblastoma, head and neck cancer, head and neck squamous cell carcinoma, heme other, high-grade glioma, kidney cancer, liver cancer, low grade glioma, medulloblastoma, melanoma, meningioma, mesothelioma, multiple myeloma, neuroblastoma, non-clear cell renal cell carcinoma, non-small cell lung cancer, oropharyngeal cancer, ovarian cancer, pan cancer, pancreatic cancer, peritoneal cancer, prostate cancer, sarcoma, skin cancer, small cell lung cancer, t cell lymphoma, testicular cancer, thymoma, thyroid cancer, tumor of unknown origin, or uveal melanoma.
[0457] In some embodiments, certain variants pre-identified on a whitelist may be rescued, e.g., not filtered out, when they fail to pass selective filters, e.g., MSI/SN, a Bayesian filtering method, and/or a coverage, VAF or region-based filter. The rationale for whitelisting a variant is to apply less stringent filtering criteria to such a variant so that it can be reviewed and/or reported. In some embodiments, one or more variants on the whitelist are common pathogenic variants, e.g., with high clinical relevance. In this fashion, when a variant on the whitelist fails to pass certain filters, it will be rescued and not filtered out. As used herein, MSI/SN refers to a variant filter for filtering out potential artifactual variants based on the MSI (microsatellite instable) and SN (signal-to-noise ratio) values calculated by the variant caller VarDict. See, for example, VarDict documentation, available on the internet at github.com/AstraZeneca-NGS/VarDictJava.
[0458] In some embodiments, one or more locus and/or genomic region is blacklisted, preventing somatic variant annotation for variants identified at the locus or region. In some embodiments, the variant has a length of 120, 100, 80, 60, 40, 20, 10, 5 or less base pairs. In various embodiments, any combination of the additional criteria, as well as additional criteria not listed above, may be applied to the variant calling process. Again, in some embodiments, different criteria are applied to the annotation of different types of variants.
[0459] In some embodiments, liquid biopsy assays are used to detect variant alterations present at low circulating fractions in the patient’s blood. In such circumstances, it may be warranted to lower the requirements for positively identifying a variant. That is, in some embodiments, low levels of support may be sufficient to call a variant, dependent upon the reason for using the liquid biopsy assay.
[0460] In some embodiments, SNV/INDEL detection is accomplished using VarDict (available on the internet at github.com/AstraZeneca-NGS/VarDictJava). Both SNVs and INDELs are called and then sorted, deduplicated, normalized and annotated. The annotation uses SnpEff to add transcript information, 1000 genomes minor allele frequencies, COSMIC reference names and counts, ExAC allele frequencies, and Kaviar population allele frequencies. The annotated variants are then classified as germline, somatic, or uncertain using a Bayesian model based on prior expectations informed by databases of germline and cancer variants. In some embodiments, uncertain variants are treated as somatic for filtering and reporting purposes.
[0461] In some embodiments, genomic rearrangements (e.g., inversions, translocations, and gene fusions) are detected following de-multiplexing by aligning tumor FASTQ files against a human reference genome using a local alignment algorithm, such as BWA. In some embodiments, DNA reads are sorted, and duplicates may be marked with a software, for example, SAMBlaster. Discordant and split reads may be further identified and separated. These data may be read into a software, for example, LUMPY, for structural variant detection. In some embodiments, structural alterations are grouped by type, recurrence, and presence and stored within a database and displayed through a fusion viewer software tool. The fusion viewer software tool may reference a database, for example, Ensembl, to determine the gene and proximal exons surrounding the breakpoint for any possible transcript generated across the breakpoint. The fusion viewer tool may then place the breakpoint 5’ or 3’ to the subsequent exon in the direction of transcription. For inversions, this orientation may be reversed for the inverted gene. After positioning of the breakpoint, the translated amino acid sequences may be generated for both genes in the chimeric protein, and a plot may be generated containing the remaining functional domains for each protein, as returned from a database, for example, Uniprot.
[0462] For instance, in an example implementation, gene rearrangements are detected using the SpeedSeq analysis pipeline. Chiang et al., 2015, “SpeedSeq: ultra-fast personal genome analysis and interpretation,” Nat Methods, (12), pg. 966. Briefly, FASTQ files are aligned to hg19 using BWA. Split reads mapped to multiple positions and read pairs mapped to discordant positions are identified and separated, then utilized to detect gene rearrangements by LUMPY. Layer et al., 2014, “LUMPY: a probabilistic framework for structural variant discovery,” Genome Biol, (15), pg. 84. Fusions can then be filtered according to the number of supporting reads.
[0463] In some embodiments, putative fusion variants supported by less than a minimum number of unique sequence reads are filtered. In some embodiments, the minimum number of unique sequence reads is 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 15, or 20 unique sequence reads.
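The read-support filter described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the `FusionCall` record, its field names, and the default minimum of 5 unique reads (one of the listed values) are assumptions chosen for the example, as are the gene pairs.

```python
from typing import List, NamedTuple


class FusionCall(NamedTuple):
    """Hypothetical record for a putative fusion; field names are illustrative."""
    gene_5p: str
    gene_3p: str
    unique_reads: int  # number of unique sequence reads supporting the call


def filter_fusions(calls: List[FusionCall], min_unique_reads: int = 5) -> List[FusionCall]:
    """Keep only fusions supported by at least `min_unique_reads` unique reads."""
    return [c for c in calls if c.unique_reads >= min_unique_reads]


# Illustrative input: one well-supported call and one weakly supported call.
calls = [
    FusionCall("EML4", "ALK", 12),
    FusionCall("GENE_A", "GENE_B", 2),
]
kept = filter_fusions(calls)
```

Here the second call is dropped because it falls below the assumed minimum.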
Allelic Fraction Determination
[0464] In some embodiments, the analysis of aligned sequence reads, e.g., in SAM or BAM format, includes determination of variant allele fractions (133) for one or more of the variant alleles 132 identified as described above. In some embodiments, a variant allele fraction module 151 tallies the instances that each allele is represented by a unique sequence read encompassing the variant locus of interest, generating a count for each allele represented at that locus. In some embodiments, these tallies are used to determine the ratio of the variant allele, e.g., an allele other than the most prevalent allele in the subject’s population for a respective locus, to a reference allele. This variant allele fraction 133 can be used in several places in the feature extraction 206 workflow. For instance, in some embodiments, a variant allele fraction is used during annotations of identified variants, e.g., when determining whether the allele originated from a germline cell or a somatic cell. In other instances, a variant allele fraction is used in a process for estimating a tumor fraction for a liquid biopsy sample or a tumor purity for a solid tumor fraction. For instance, variant allele fractions for a plurality of somatic alleles can be used to estimate the percentage of sequence reads originating from one copy of a cancerous chromosome. Assuming a 100% tumor purity and that each cancer cell carries one copy of the variant allele, the overall purity of the tumor can be estimated. This estimate, of course, can be further corrected based on other information extracted from the sequencing data, such as copy number alterations, tumor ploidy aberrations, tumor heterozygosity, etc.
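The allele tally described above can be sketched as follows. The sketch assumes the input is the base observed at the locus in each unique (de-duplicated) read, and it reports each non-reference allele as a fraction of all reads covering the locus, which is one common convention; the function name and inputs are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def variant_allele_fractions(observed_alleles: Iterable[str], reference_allele: str) -> Dict[str, float]:
    """Tally alleles seen at one locus and return the fraction of reads
    carrying each non-reference allele (assumed convention: alt / total depth)."""
    counts = Counter(observed_alleles)
    depth = sum(counts.values())
    if depth == 0:
        return {}
    return {allele: n / depth for allele, n in counts.items() if allele != reference_allele}


# Illustrative locus: 95 unique reads carry the reference base, 5 carry a variant.
reads_at_locus = ["A"] * 95 + ["T"] * 5
vafs = variant_allele_fractions(reads_at_locus, reference_allele="A")
```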
Methylation Determination
[0465] In some embodiments, where the nucleic acid sequencing library was processed by bisulfite treatment or enzymatic methyl-cytosine conversion, as described above, the analysis of aligned sequence reads, e.g., in SAM or BAM format, includes determination of methylation states 132 for one or more loci in the genome of the patient. In some embodiments, methylation sequencing data is aligned to a reference sequence construct 158 in a different fashion than non-methylation sequencing, because non-methylated cytosines are converted to uracils, and the resulting uracils are ultimately sequenced as thymines, whereas methylated cytosines are not converted and are sequenced as cytosines. Different approaches, therefore, have to be used to align these modified sequences to a reference sequence construct, such as seeding alignments with shorter regions of identity or converting all cytosines to thymidines in the sequencing data and then aligning the data to reference sequence constructs for both the plus and minus strand of the sequence construct. For review of these approaches, see Zhou Q. et al., BMC Bioinformatics, 20(47): 1-11 (2019), the content of which is hereby incorporated by reference, in its entirety, for all purposes. Algorithms for calling methylated bases are known in the art. For example, Bismark is able to distinguish between cytosines in CpG, CHG, and CHH contexts. Krueger F. and Andrews SR, Bioinformatics, 27(11): 1571-71 (2011), the content of which is hereby incorporated by reference, in its entirety, for all purposes.
Copy Number Variation:
[0466] In some embodiments, the analysis of aligned sequence reads, e.g., in SAM or BAM format, includes determination of the copy number 135 for one or more loci, using a copy number variation analysis module 153. For example, Figure 4F1 illustrates a workflow of an exemplary method 400-1 for validating copy number variation to be used in generating clinical reports to support clinical decision making in precision oncology, in accordance with some embodiments of the present disclosure. More specifically, method 400-1 describes a bioinformatics pipeline for extraction and identification of genomic copy number variation (e.g., a method for feature extraction 206), in accordance with some embodiments of the present disclosure.
[0467] Referring to Block 402-1, the method comprises obtaining a dataset of cell-free DNA sequencing data. The sequencing data can be obtained using any of the methods and/or embodiments disclosed herein, including any of the implementations for wet lab processing 204. In some embodiments, where both a liquid biopsy sample and a normal tissue sample of the patient are analyzed, de-duplicated BAM files and a VCF generated from the variant calling pipeline are used to compute read depth and variation in heterozygous germline SNVs between sequencing reads for each sample. By contrast, in some embodiments, where only a liquid biopsy sample is being analyzed, comparison between a tumor sample and a pool of process-matched normal controls is used.
[0468] Pre-processing and/or alignment can be applied to the cfDNA sequencing data, as described in detail above. For example, referring to Block 404-1, in some embodiments, sequence reads obtained from the cfDNA sequencing data are aligned to a reference human construct, thus generating a plurality of aligned reads 406-1. Referring to Block 408-1, the method further comprises optionally processing the aligned cfDNA sequence reads by, for example, normalization, filtering, and/or quality control, as described in detail above.
[0469] Referring to Block 410-1, in some embodiments, the method further comprises obtaining for validation one or more copy number status annotations (e.g., amplified, neutral, deleted). In some embodiments, the copy number status annotations are obtained via copy number analysis.
[0470] For instance, in an example implementation, copy number variants (CNVs) are analyzed using the CNVkit package. See, Talevich et al., PLoS Comput Biol, 12: 1004873 (2016), the content of which is hereby incorporated by reference, in its entirety, for all purposes. CNVkit is used for genomic region binning, coverage calculation, bias correction, normalization to a reference pool, segmentation and visualization. The log2 ratios between the tumor sample and a pool of process matched healthy samples from the CNVkit output are then annotated and filtered using statistical models whereby the amplification status (amplified or not-amplified) of each gene is predicted and non-focal amplifications are removed.
[0471] In some embodiments, copy number variations (CNVs) are analyzed using a combination of an open-source tool, such as CNVkit, and an annotation/filtering algorithm, e.g., implemented via a python script. CNVkit is used initially to perform genomic region binning, coverage calculation, bias correction, normalization to a reference pool, segmentation and, optionally, visualization. The bin-level copy ratios and segment-level copy ratios, in addition to their corresponding confidence intervals, from the CNVkit output are then used in the annotation and filtering where the copy number state (amplified, neutral, deleted) of each segment and bin is determined and non-focal amplifications/deletions are filtered out based on a set of acceptance criteria. In some embodiments, one or more copy number variations selected from amplifications in the MET, EGFR, ERBB2, CD274, CCNE1, and MYC genes, and deletions in the BRCA1 and BRCA2 genes are analyzed. However, the methods described herein are not limited to only these reportable genes.
[0472] In some embodiments, CNV analysis is performed using a tumor BAM file, a target region BED file, a pool of process matched normal samples, and inputs for initial reference pool construction. Inputs for initial reference pool construction include one or more of normal BAM files, a human reference genome file, mappable regions of the genome, and a blacklist that contains recurrent problematic areas of the genome.
[0473] CNVkit utilizes both targeted captured sequencing reads and non-specifically captured off-target reads to infer copy number information. The targeted genomic regions specified in the probe target BED file are divided to target bins with an average size of, e.g., 100 base pairs, which can be specified by the user. The genomic regions between the target regions, e.g., excluding regions that cannot be mapped reliably, are automatically divided into off-target (also referred to as anti-target) bins with an average size of, e.g., 150 kbp, which again can be specified by the user. Raw log2-transformed depths are then calculated from the alignments in the input BAM file and written to two tab-delimited .cnn files, one for each of the target and off-target bins.
[0474] A pooled reference is constructed from a panel of process matched normal samples. The raw log2 depths of target and off-target bins in each normal sample are computed as described above, and then each are median-centered and corrected for bias including GC content, genome sequence repetitiveness, target size, and/or spacing. The corrected target and off-target log2 depths are combined, and a weighted average and spread are calculated as Tukey’s biweight location and midvariance in each bin. These values are written to a tab delimited reference .cnn file, which is used to normalize an input tumor sample as follows.
[0475] The raw log2 depths of an input sample are median-centered and bias-corrected as described in the reference construction. The corresponding log2 depth in the reference file is then subtracted from the corrected log2 depth of each bin, resulting in the log2 copy ratios (also referred to as copy ratios or log2 ratios) between the input tumor sample and the reference pool. These values are written to a tab-delimited .cnr file.
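The normalization step above can be sketched in a few lines. This is a simplified illustration that is not CNVkit code: it performs only the median-centering and the reference subtraction, omits the bias corrections (GC content, repetitiveness, target size and spacing), and uses hypothetical function names and toy depths.

```python
import math
from statistics import median
from typing import List, Sequence


def median_center(log2_depths: Sequence[float]) -> List[float]:
    """Shift per-bin log2 depths so that their median is zero."""
    m = median(log2_depths)
    return [d - m for d in log2_depths]


def log2_copy_ratios(sample_log2: Sequence[float], reference_log2: Sequence[float]) -> List[float]:
    """Per-bin log2 copy ratio: centered sample depth minus reference depth.
    Bias correction is deliberately omitted in this sketch."""
    if len(sample_log2) != len(reference_log2):
        raise ValueError("sample and reference must cover the same bins")
    centered = median_center(sample_log2)
    return [s - r for s, r in zip(centered, reference_log2)]


# Toy example: five bins with raw depths 100, 200, 100, 100, 400 and a flat reference.
sample = [math.log2(d) for d in (100, 200, 100, 100, 400)]
reference = [0.0] * 5
ratios = log2_copy_ratios(sample, reference)
```

With these toy values the second and fifth bins come out at log2 ratios of about 1 and 2, i.e., roughly two-fold and four-fold the median depth.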
[0476] The copy ratios are then segmented, e.g., via a circular binary segmentation (CBS) algorithm or another suitable segmentation algorithm, whereby adjacent bins are grouped to larger genomic regions (segments) of equal copy number. The segment’s copy ratio is calculated as the weighted mean of all bins within the segment. The confidence interval of the segment mean is estimated by bootstrapping the bin-level copy ratios within the segment. The segments’ genomic ranges, copy ratios and confidence intervals are written to a tab-delimited .cns file.
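The segment-level summary described above (weighted mean plus a bootstrapped confidence interval) can be sketched as follows. This is an illustrative percentile bootstrap under assumed defaults (1000 resamples, a 95% interval, a fixed seed); it is not the CNVkit implementation, and the bin values and weights are made up.

```python
import random
from typing import List, Sequence, Tuple


def weighted_mean(values: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted mean of bin-level copy ratios; weights must be positive."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def bootstrap_ci(values: Sequence[float], weights: Sequence[float],
                 n_boot: int = 1000, alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for the segment mean, obtained by
    resampling the segment's bins with replacement."""
    rng = random.Random(seed)
    indices = list(range(len(values)))
    means: List[float] = []
    for _ in range(n_boot):
        picked = rng.choices(indices, k=len(indices))
        means.append(weighted_mean([values[i] for i in picked],
                                   [weights[i] for i in picked]))
    means.sort()
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy segment of three bins; the third bin has twice the weight of the others.
bin_ratios = [1.0, 1.2, 0.8]
bin_weights = [1.0, 1.0, 2.0]
seg_mean = weighted_mean(bin_ratios, bin_weights)
ci_low, ci_high = bootstrap_ci(bin_ratios, bin_weights)
```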
[0477] In some embodiments, copy number analysis includes application of a circular binary segmentation algorithm and selection of segments with highly differential log2 ratios between the cancer sample and its comparator (e.g., a matched normal or normal pool). In some embodiments, approximate integer copy number is assessed from a combination of differential coverage in segmented regions and an estimate of stromal admixture (for example, tumor purity, or the portion of a sample that is cancerous vs. non-cancerous, such as a tumor fraction for a liquid biopsy sample) is generated by analysis of heterozygous germline SNVs. In some embodiments, the integer copy number of a genomic segment in a cancer sample is used to assign a copy number status annotation to the genomic segment (e.g., amplified, neutral, deleted) based on a comparison with the integer copy number of a corresponding genomic segment in a reference pool.
[0478] Validation filters. Referring again to Block 410-1, the annotation/filtering algorithm is subsequently applied to the bin-level copy ratios and segment-level copy ratios, in addition to their corresponding confidence intervals, obtained from the CNVkit output.
The annotation/filtering algorithm comprises a plurality of filters for validation of copy number status annotations 412-1, including an optional median bin-level copy ratio filter 414-1; an optional segment-level confidence interval filter 416-1; an optional median-plus-median absolute deviation (MAD) bin-level copy ratio filter 418-1; and/or an optional segment-level copy ratio filter. Referring to Block 420-1, the method further comprises validating or rejecting a copy number variation as a focal copy number variation based on the plurality of copy number status annotation validation filters. Specifically, when a filter in the plurality of filters is fired, the copy number annotation of the segment is rejected, and the copy number variation is determined to be a non-focal copy number variation. When no filter in the plurality of filters is fired, the copy number annotation of the segment is validated 422-1 and the copy number variation is determined to be a focal copy number variation.
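The accept/reject logic above (any fired filter rejects the annotation) can be sketched as follows. Only the combination rule is taken from the text. The two example filter predicates, the `MIN_LOG2` threshold of 0.5, and the dictionary layout of a segment are hypothetical placeholders, since the actual filter criteria are described elsewhere in the disclosure.

```python
from statistics import median
from typing import Callable, Dict, List, Tuple

Segment = Dict[str, object]
Filter = Callable[[Segment], bool]  # returns True when the filter "fires"


def validate_copy_number_annotation(segment: Segment, filters: Dict[str, Filter]) -> Tuple[str, List[str]]:
    """Return ("focal", []) when no filter fires; otherwise ("non-focal", fired names)."""
    fired = [name for name, has_fired in filters.items() if has_fired(segment)]
    return ("non-focal" if fired else "focal"), fired


MIN_LOG2 = 0.5  # hypothetical threshold for an amplification call

# Hypothetical stand-ins for two of the named filters.
filters: Dict[str, Filter] = {
    "median_bin_ratio": lambda s: median(s["bin_log2"]) < MIN_LOG2,
    "segment_ci": lambda s: s["ci"][0] < MIN_LOG2,
}

supported = {"bin_log2": [0.9, 1.1, 1.0], "ci": (0.8, 1.2)}
uncertain = {"bin_log2": [0.9, 1.1, 1.0], "ci": (0.1, 1.2)}
result_supported = validate_copy_number_annotation(supported, filters)
result_uncertain = validate_copy_number_annotation(uncertain, filters)
```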
[0479] The extracted features (e.g., validated status of copy number variation 422-1) can then be used for variant analysis 208 and clinical report generation (e.g., as described in further detail below with reference to Figure 2A). For example, referring to Block 424-1, the method further comprises matching therapies and/or clinical trials based on the status (e.g., validated or rejected) of the respective copy number annotation. Referring to Block 426-1, the method further comprises generating a patient report indicating the CNV status, in addition to matched therapies and/or clinical trials based on the CNV status.
[0480] Specific embodiments and further details regarding systems and methods for validating copy number status annotations are provided in following sections with reference to Figures 5A1-5E1 and 6A1-6C1.
Microsatellite Instability (MSI):
[0481] In some embodiments, analysis of aligned sequence reads, e.g., in SAM or BAM format, includes analysis of the microsatellite instability status 137 of a cancer, using a microsatellite instability analysis module 154. In some embodiments, an MSI classification algorithm classifies a cancer into three categories: microsatellite instability-high (MSI-H), microsatellite stable (MSS), or microsatellite equivocal (MSE). Microsatellite instability is a clinically actionable genomic indication for cancer immunotherapy. In microsatellite instability-high (MSI-H) tumors, defects in DNA mismatch repair (MMR) can cause a hypermutated phenotype where alterations accumulate in the repetitive microsatellite regions of DNA. MSI detection is conventionally performed by subjecting tumor tissue (“solid biopsy”) to clinical next-generation sequencing or specific assays, such as MMR IHC or MSI PCR.
[0482] For example, microsatellite instability status can be assessed by determining the number of repeating units present at a plurality of microsatellite loci, e.g., 5, 10, 15, 20, 25, 30, 40, 50, 75, 100, 250, 500, 750, 1000, 2500, 5000, or more loci. In some embodiments, only reads encompassing a microsatellite locus that include a significant number of flanking nucleotides on both ends, e.g., at least 5, 10, 15, or more nucleotides flanking each end, are used for the analysis in order to avoid using reads that do not completely cover the locus. In some embodiments, a minimal number of reads, e.g., at least 5, 10, 20, 30, 40, 50, or more reads have to meet this criteria in order to use a particular microsatellite locus, in order to ensure the accuracy of the determination given the high incidence of polymerase slipping during replication of these repeated sequences.
[0483] In some embodiments, each locus is tested individually for instability, e.g., as measured by a change or variance in the number of nucleotide base repeats, e.g., in cancer-derived nucleotide sequences relative to a normal sample or standard, for example, using the Kolmogorov-Smirnov test. For example, if p < 0.05, the locus is considered unstable. The proportion of unstable microsatellite loci may be fed into a logistic regression classifier trained on samples from various cancer types, especially cancer types which have clinically determined MSI statuses, for example, colorectal and endometrial cohorts. For MSI testing where only a liquid biopsy sample is analyzed, the mean and variance for the number of repeats may be calculated for each microsatellite locus. A vector containing the mean and variance data may be put into a classifier (e.g., a support vector machine classification algorithm) trained to provide a probability that the patient is MSI-H, which may be compared to a threshold value. In some embodiments, the threshold value for calling the patient as MSI-H is at least 60% probability, or at least 65% probability, 70% probability, 75% probability, 80% probability, or greater. In some embodiments, a baseline threshold may be established to call the patient as MSS. In some embodiments, the baseline threshold is no more than 40%, or no more than 35% probability, 30% probability, 25% probability, 20% probability, or less. In some embodiments, when the output of the classifier falls within the range between the MSI-H and MSS thresholds, the patient is identified as MSE.
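The three-way call at the end of the paragraph above can be sketched as follows. The sketch takes the classifier's MSI-H probability as given and uses 60% and 40% as the two thresholds, which are among the values listed; treating a probability exactly at a threshold as MSI-H or MSS, respectively, is an assumption of this example.

```python
def classify_msi(p_msi_h: float, msi_h_threshold: float = 0.60, mss_threshold: float = 0.40) -> str:
    """Map a classifier's MSI-H probability to MSI-H, MSS, or MSE (equivocal)."""
    if not 0.0 <= p_msi_h <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if p_msi_h >= msi_h_threshold:
        return "MSI-H"
    if p_msi_h <= mss_threshold:
        return "MSS"
    return "MSE"  # falls between the MSS and MSI-H thresholds


statuses = [classify_msi(p) for p in (0.90, 0.50, 0.10)]
```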
[0484] Other methods for determining the MSI status of a subject are known in the art. For example, in some embodiments, microsatellite instability analysis module 154 employs an MSI evaluation method described in U.S. Provisional Patent Application Serial No. 62/881,845, filed August 1, 2019, or U.S. Provisional Application Serial No. 62/931,600, filed November 6, 2019, the contents of which are hereby incorporated by reference, in their entireties, for all purposes.
Tumor Mutational Burden (TMB):
[0485] In some embodiments, the analysis of aligned sequence reads, e.g., in SAM or BAM format, includes determination of a mutation burden for the cancer (e.g., a tumor mutational burden 136), using a tumor mutational burden analysis module 155. Generally, a tumor mutational burden is a measure of the mutations in a cancer per unit of the patient’s genome. For example, a tumor mutational burden may be expressed as a measure of central tendency (e.g., an average) of the number of somatic variants per million base pairs in the genome. In some embodiments, a tumor mutational burden refers to only a set of possible mutations, e.g., one or more of SNVs, MNVs, indels, or genomic rearrangements. In some embodiments, a tumor mutational burden refers to only a subset of one or more types of possible mutations, e.g., non-synonymous mutations, meaning those mutations that alter the amino acid sequence of an encoded protein. In other embodiments, for example, a tumor mutational burden refers to the number of one or more types of mutations that occur in protein coding sequences, e.g., regardless of whether they change the amino acid sequence of the encoded protein.
[0486] As an example, in some embodiments, a tumor mutational burden (TMB) is calculated by dividing the number of mutations (e.g., all variants or non-synonymous variants) identified in the sequencing data (e.g., as represented in a VCF file) by the size (e.g., in megabases) of a capture probe panel used for targeted sequencing. In some embodiments, a variant is included in tumor mutation burden calculation only when certain criteria are met. For instance, in some embodiments, a threshold sequence coverage for the locus associated with the variant must be met before the variant is included in the calculation, e.g., at least 25x, 50x, 75x, 100x, 250x, 500x, or greater. Similarly, in some embodiments, a minimum number of unique sequence reads encompassing the variant allele must be identified in the sequencing data, e.g., at least 4, 5, 6, 7, 8, 9, 10, or more unique sequence reads. In some embodiments, a variant allelic fraction threshold must be satisfied before the variant is included in the calculation, e.g., at least 0.01%, 0.1%, 0.25%, 0.5%, 0.75%, 1%, 1.5%, 2%, 2.5%, 3%, 4%, 5%, or greater. In some embodiments, the inclusion criteria may be different for different types of variants and/or different variants of the same type. For instance, a variant detected in a mutation hotspot within the genome may face less rigorous criteria than a variant detected in a more stable locus within the genome.
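The calculation above can be sketched as follows. The `Variant` record and its fields are hypothetical, and the default criteria (100x coverage, 5 supporting reads, 0.5% variant allele fraction, non-synonymous variants only) are simply one combination drawn from the listed values, not a recommended setting.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Variant:
    """Hypothetical per-variant summary; field names are illustrative."""
    coverage: int         # sequence coverage at the locus
    alt_reads: int        # unique reads supporting the variant allele
    nonsynonymous: bool   # whether the variant alters the encoded protein

    @property
    def vaf(self) -> float:
        return self.alt_reads / self.coverage if self.coverage else 0.0


def tumor_mutational_burden(variants: Iterable[Variant], panel_size_mb: float,
                            min_coverage: int = 100, min_alt_reads: int = 5,
                            min_vaf: float = 0.005, nonsynonymous_only: bool = True) -> float:
    """Count variants meeting all inclusion criteria and divide by panel size in megabases."""
    counted = [
        v for v in variants
        if v.coverage >= min_coverage
        and v.alt_reads >= min_alt_reads
        and v.vaf >= min_vaf
        and (v.nonsynonymous or not nonsynonymous_only)
    ]
    return len(counted) / panel_size_mb


variants = [
    Variant(1000, 20, True),   # passes all criteria
    Variant(50, 10, True),     # fails coverage
    Variant(2000, 6, True),    # fails VAF (0.3%)
    Variant(500, 10, False),   # synonymous, excluded
    Variant(800, 8, True),     # passes all criteria
]
tmb = tumor_mutational_burden(variants, panel_size_mb=0.5)
```

In this toy input two variants pass on a 0.5 Mb panel, giving 4 mutations per megabase.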
[0487] Other methods for calculating tumor mutation burden in liquid biopsy samples and/or solid tissue samples are known in the art. See, for example, Fenizia F et al., Transl Lung Cancer Res., 7(6):668-77 (2018) and Georgiadis A et al., Clin. Cancer Res., 25(23):7024-34 (2019), the disclosures of which are hereby incorporated by reference, in their entireties, for all purposes.
Homologous Recombination Status (HRD):
[0488] In some embodiments, analysis of aligned sequence reads, e.g., in SAM or BAM format, includes analysis of whether the cancer is homologous recombination deficient (HRD status 137-3), using a homologous recombination pathway analysis module 157.
[0489] Homologous recombination (HR) is a normal, highly conserved DNA repair process that enables the exchange of genetic information between identical or closely related DNA molecules. It is most widely used by cells to accurately repair harmful breaks (e.g., damage) that occur on both strands of DNA. DNA damage may occur from exogenous (external) sources like UV light, radiation, or chemical damage; or from endogenous (internal) sources like errors in DNA replication or other cellular processes that create DNA damage. Double strand breaks are a type of DNA damage. Using poly (ADP-ribose) polymerase (PARP) inhibitors in patients with HRD compromises two pathways of DNA repair, resulting in cell death (apoptosis). The efficacy of PARP inhibitors is improved not only in ovarian cancers displaying germline or somatic BRCA mutations, but also in cancers in which HRD is caused by other underlying etiologies.
[0490] In some embodiments, HRD status can be determined by inputting features correlated with HRD status into a classifier trained to distinguish between cancers with homologous recombination pathway deficiencies and cancers without homologous recombination pathway deficiencies. For example, in some embodiments, the features include one or more of (i) a heterozygosity status for a first plurality of DNA damage repair genes in the genome of the cancerous tissue of the subject, (ii) a measure of the loss of heterozygosity across the genome of the cancerous tissue of the subject, (iii) a measure of variant alleles detected in a second plurality of DNA damage repair genes in the genome of the cancerous tissue of the subject, and (iv) a measure of variant alleles detected in the second plurality of DNA damage repair genes in the genome of the non-cancerous tissue of the subject. In some embodiments, all four of the features described above are used as features in an HRD classifier. More details about HRD classifiers using these and other features are described in U.S. Patent Application Serial No. 16/789,363, filed February 12, 2020, the content of which is hereby incorporated by reference, in its entirety, for all purposes.
Circulating Tumor Fraction:
[0491] In some embodiments, the analysis of aligned sequence reads, e.g., in SAM or BAM format, includes estimation of a circulating tumor fraction for the liquid biopsy sample. Tumor fraction or circulating tumor fraction is the fraction of cell free nucleic acid molecules in the sample that originates from a cancerous tissue of the subject, rather than from a non-cancerous tissue (e.g., a germline or hematopoietic tissue). Several open source analysis packages have modules for calculating tumor fraction from solid tumor samples. For instance, PureCN (Riester, M., et al., Source Code Biol Med, 11:13 (2016)) is designed to estimate tumor purity from targeted short-read sequencing data of solid tumor samples. Similarly, FACETS (Shen R, Seshan VE, Nucleic Acids Res., 44(16):e131 (2016)) is designed to estimate tumor fraction from sequencing data of solid tumor samples. However, estimating tumor fraction from a liquid biopsy sample is more difficult because of the generally lower tumor fraction relative to a solid tumor sample and the typically small size of a targeted panel used for liquid biopsy sequencing. Indeed, packages such as PureCN and FACETS perform poorly at low tumor fractions and with sequencing data generated using small targeted panels.
[0492] In some embodiments, circulating tumor fraction is estimated from a targeted-panel sequencing reaction of a liquid biopsy sample using an off-target read methodology, e.g., as described herein with reference to Figures 4 and 5 (e.g., Figures 4F3, 5A3-5B3). Briefly, a circulating tumor fraction estimate is determined from reads in the target captured regions, as well as off-target reads uniformly distributed across the human reference genome. Segments having similar copy ratios, e.g., as assigned via circular binary segmentation (CBS) during CNV analysis, are fit to integer copy states, e.g., via an expectation-maximization algorithm using the sum of squared error of the segment log2 ratios (normalized to genomic interval size) to expected ratios given a putative copy state and tumor fraction. A measure of fit between corresponding segment-level coverage ratios and assigned integer copy states across the plurality of simulated circulating tumor fractions is then used to select the simulated circulating tumor fraction to be used as the circulating tumor fraction for the liquid biopsy sample. In some embodiments, error minimization is used to identify the simulated tumor fraction providing the best fit to the data.
[0493] In some embodiments, circulating tumor fraction is estimated from a targeted-panel sequencing reaction of a liquid biopsy sample using an off-target read methodology, e.g., as described herein with reference to Figures 4 and 5 (e.g., Figures 4F3, 5A3-5B3). Briefly, a circulating tumor fraction estimate is determined from reads in the target captured regions, as well as off-target reads uniformly distributed across the human reference genome. Segments having similar copy ratios, e.g., as assigned via circular binary segmentation (CBS) during CNV analysis, are fit to integer copy states, e.g., via an expectation-maximization algorithm using the sum of squared error of the segment log2 ratios (normalized to genomic interval size) to expected ratios given a putative copy state and tumor fraction. For more information on expectation maximization algorithms see, for example, Sundberg, Rolf (1974). "Maximum likelihood theory for incomplete data from an exponential family". Scandinavian Journal of Statistics. 1 (2): 49-58, the content of which is hereby incorporated by reference in its entirety. A measure of fit between corresponding segment-level coverage ratios and assigned integer copy states across the plurality of simulated circulating tumor fractions is then used to select the simulated circulating tumor fraction to be used as the circulating tumor fraction for the liquid biopsy sample. In some embodiments, error minimization is used to identify the simulated tumor fraction providing the best fit to the data.
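The fitting idea above can be illustrated with a simplified grid search. This sketch is not the expectation-maximization procedure of the disclosure: for each simulated tumor fraction it assigns every segment the integer copy state with the smallest squared error, weights the errors by segment length, and keeps the tumor fraction with the lowest total. The diploid-normal mixture model, the 1% grid, and the cap of 5 copies are assumptions of the example.

```python
import math
from typing import List, Optional, Sequence, Tuple

Segment = Tuple[float, float]  # (observed log2 ratio, segment length in bp)


def expected_log2_ratio(copy_state: int, tumor_fraction: float, normal_ploidy: int = 2) -> float:
    """Expected log2 ratio when a fraction of molecules carries `copy_state`
    copies and the remainder is diploid normal (assumed mixture model)."""
    mixed = tumor_fraction * copy_state + (1.0 - tumor_fraction) * normal_ploidy
    return math.log2(mixed / normal_ploidy)


def fit_error(segments: Sequence[Segment], tumor_fraction: float, max_copy: int = 5) -> float:
    """Length-weighted sum of squared errors, using the best integer copy state per segment."""
    total_length = sum(length for _, length in segments)
    error = 0.0
    for ratio, length in segments:
        best = min((ratio - expected_log2_ratio(c, tumor_fraction)) ** 2
                   for c in range(max_copy + 1))
        error += best * length / total_length
    return error


def estimate_tumor_fraction(segments: Sequence[Segment],
                            grid: Optional[List[float]] = None, max_copy: int = 5) -> float:
    """Return the simulated tumor fraction on the grid with the smallest fit error."""
    grid = grid or [i / 100 for i in range(1, 100)]  # 1% to 99%
    return min(grid, key=lambda f: fit_error(segments, f, max_copy))


# Toy segments simulated at a 20% tumor fraction with copy states 2, 3, 1 and 4.
segments = [(expected_log2_ratio(c, 0.2), length)
            for c, length in [(2, 5e7), (3, 2e7), (1, 1e7), (4, 5e6)]]
best_fraction = estimate_tumor_fraction(segments)
```

Without the copy-state cap, a 10% tumor fraction with copy states 4, 0 and 6 would fit these toy segments equally well, which is the kind of ambiguity among optima that a second tumor fraction estimate is used to resolve.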
[0494] In some embodiments, a measure of fit between corresponding segment-level coverage ratios and assigned integer copy states across the plurality of simulated circulating tumor fractions (e.g., using an error minimization algorithm) provides a number of local optima (e.g., local minima for an error minimization model or local maxima for a fit maximization model) for the best fit between the segment-level coverage ratios and assigned integer copy states. In some such embodiments, a second estimate of circulating tumor fraction is used to select the local optimum (e.g., the local minimum in best agreement with the second estimate of circulating tumor fraction) to be used as the circulating tumor fraction for the liquid biopsy sample.
[0495] For example, in some embodiments, multiple local optima (e.g., minima) can be disambiguated based on a difference between somatic and germline variant allele fractions. The assumption is that the variant allele fraction (VAF) of germline variants that exhibit loss of heterozygosity (LOH) will increase or decrease by an amount approximately equal to half of the tumor purity (e.g., the circulating tumor fraction for a liquid biopsy sample). With a matched normal sample (e.g., where sequencing data for both a liquid biopsy sample and a non-cancerous sample from the subject is available, or where sequencing data for both a solid tumor sample and a non-cancerous sample from the subject is available), for a given heterozygous germline variant, the VAF delta can be calculated as delta = abs(VAFtumor - VAFnormal). However, for tumor-only sequencing (e.g., where sequencing data is only available for a liquid biopsy sample or a solid tumor sample), the VAFnormal is unknown. In some embodiments, the VAFnormal is assumed to be 50%. To increase statistical power and account for the imprecision in the VAF by sequencing, the deltas for all such variants are calculated and the circulating tumor fraction estimate (ctFE) for this method is calculated as ctFE = max(2 x delta) for all variant delta values. While this can be used as a method for ctFE alone, its precision is limited by the number of detected LOH variants. For a small panel, there are few expected LOH variants and thus the ctFE may not be precise on its own. However, it can be used to disambiguate multiple local optima (e.g., minima), especially for high tumor fraction values estimated by the off-target read methodology described herein. For that, the off-target read methodology ctFE peaks corresponding to all the local optima (e.g., minima) are identified and the one closest to the ctFE estimated by LOH delta is chosen as the most likely global optimum (e.g., minimum).
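As a non-limiting sketch of the LOH-based calculation described above, the following Python functions compute ctFE = max(2 x delta) and select the local optimum closest to that estimate. The function names and the example values are hypothetical. The 50% default for VAFnormal follows the tumor-only assumption stated above.

```python
def ctfe_from_loh(tumor_vafs, normal_vafs=None, assumed_normal_vaf=0.5):
    """Estimate circulating tumor fraction as max(2 * delta) over LOH variants.

    tumor_vafs: VAFs of heterozygous germline variants in the liquid biopsy.
    normal_vafs: matched-normal VAFs, or None for tumor-only sequencing,
    in which case each normal VAF is assumed to be 50%.
    Raises ValueError if no variants are supplied.
    """
    if normal_vafs is None:
        normal_vafs = [assumed_normal_vaf] * len(tumor_vafs)
    deltas = [abs(t - n) for t, n in zip(tumor_vafs, normal_vafs)]
    return max(2.0 * d for d in deltas)


def pick_closest_optimum(local_optima, reference_estimate):
    """Choose the local optimum nearest to a second tumor fraction estimate."""
    return min(local_optima, key=lambda tf: abs(tf - reference_estimate))


# Hypothetical example: three candidate minima from the off-target method.
loh_estimate = ctfe_from_loh([0.62, 0.41, 0.55])
chosen = pick_closest_optimum([0.15, 0.30, 0.60], loh_estimate)
```

In this example the largest delta is approximately 0.12, so the LOH-based estimate is approximately 0.24 and the candidate at 0.30 is selected.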
[0496] Several other methods may also be used to estimate circulating tumor fractions. In some embodiments, these methods are used in combination with the off-target tumor estimate method described herein. For example, in some embodiments, one or more of these methodologies is used to generate an estimate of tumor fraction, which is then used to identify the nearest local optima (e.g., minima) obtained from the tumor fraction estimation methods described above, and further herein.
[0497] For example, the ichorCNA package applies a probabilistic model to normalized read coverages from ultra-low pass whole genome sequencing data of cell-free DNA to estimate tumor fraction in the liquid biopsy sample. For more information, see, Adalsteinsson, V.A. et al., Nat Commun 8:1324 (2017), the content of which is disclosed herein for its description of a probabilistic tumor fraction estimation model in the “methods” section. Similarly, Tiancheng H. et al. describe a Maximum Likelihood model based on the copy number of an allele in the sample and variant allele frequency in paired-control samples. For more information, see, Tiancheng H. et al., Journal of Clinical Oncology 37:15 suppl, e13053-e13053 (2019), the content of which is disclosed herein for its description of a Maximum Likelihood tumor fraction estimation model.
[0498] In some embodiments, a statistic for somatic variant allele fractions determined for the liquid biopsy sample is used as an estimate for the circulating tumor fraction of the liquid biopsy sample. For example, in some embodiments, a measure of central tendency (e.g., a mean or median) for a plurality of variant allele fractions determined for the liquid biopsy sample is used as an estimate of circulating tumor fraction. In some embodiments, a lowest (minimum) variant allele fraction determined for the liquid biopsy sample is used as an estimate of circulating tumor fraction. In some embodiments, a highest (maximum) variant allele fraction determined for the liquid biopsy sample is used as an estimate of circulating tumor fraction. In some embodiments, a range defined by two or more of these statistics is used to limit the range of simulated tumor fraction analysis via the off-target read methodology described herein. For instance, in some embodiments, lower and upper bounds of the simulated tumor fraction analysis are defined by the minimum variant allele fraction and the maximum variant allele fraction determined for a liquid biopsy sample, respectively. In some embodiments, the range is further expanded, e.g., on either or both the lower and upper bounds. For example, in some embodiments, the lower bound of a simulated tumor fraction analysis is defined as 0.5-times the minimum variant allele fraction, 0.75-times the minimum variant allele fraction, 0.9-times the minimum variant allele fraction, 1.1-times the minimum variant allele fraction, 1.25-times the minimum variant allele fraction, 1.5-times the minimum variant allele fraction, or a similar multiple of the minimum variant allele fraction determined for the liquid biopsy sample. Similarly, in some embodiments, the upper bound of a simulated tumor fraction analysis is defined as 2.5-times the maximum variant allele fraction, 2-times the maximum variant allele fraction, 1.75-times the maximum variant allele fraction, 1.5-times the maximum variant allele fraction, 1.25-times the maximum variant allele fraction, 1.1-times the maximum variant allele fraction, 0.9-times the maximum variant allele fraction, or a similar multiple of the maximum variant allele fraction determined for the liquid biopsy sample.
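For illustration only, the bounding of the simulated tumor fraction range by multiples of the minimum and maximum variant allele fractions might be sketched in Python as follows. The function names, the default multiples of 0.5 and 2, the 0.01 step, and the capping of the upper bound at 1.0 are assumptions of this sketch rather than requirements of the method.

```python
def simulation_bounds(vafs, lower_multiple=0.5, upper_multiple=2.0):
    """Return (lower, upper) bounds for the simulated tumor fraction range.

    The bounds are multiples of the minimum and maximum somatic VAFs.
    Capping the upper bound at 1.0 is an assumption of this sketch,
    since a tumor fraction cannot exceed 100%.
    """
    lower = lower_multiple * min(vafs)
    upper = min(1.0, upper_multiple * max(vafs))
    return lower, upper


def simulation_grid(vafs, step=0.01, **multiples):
    """List simulated tumor fractions between the bounds at a fixed step."""
    lower, upper = simulation_bounds(vafs, **multiples)
    count = int(round((upper - lower) / step))
    return [round(lower + i * step, 10) for i in range(count + 1)]
```

A grid produced this way could be supplied to the off-target read methodology in place of a grid spanning the full 0% to 100% range.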
[0499] In some embodiments, circulating tumor fraction is estimated based on a distribution of the lengths of cfDNA in the liquid biopsy sample. In some embodiments, sequence reads are binned according to their position within the genome, e.g., as described elsewhere herein. For each bin, the length of each fragment is determined. Each fragment is then classified as belonging to one of a plurality of classes, e.g., one of two classes corresponding to a population of short fragments and a population of long fragments. In some embodiments, the classification is performed using a static length threshold, e.g., that is the same across all the bins. In some embodiments, the classification is performed using a dynamic length threshold. In some embodiments, a dynamic length threshold is determined by comparing the distribution of fragment lengths in liquid biopsy samples from reference subjects that do not have cancer to the distribution of fragment lengths in liquid biopsy samples from reference subjects that have cancer, in a positional fashion.
[0500] For example, in some embodiments, the comparison is done over windows spanning entire chromosomes, e.g., each chromosome defines a comparison window over which a dynamic length threshold is determined. In some embodiments, the comparison is done over a window spanning a single bin, e.g., each bin defines a comparison window over which a dynamic length threshold is determined. In certain embodiments, the bin determination may be made according to various genomic features. For example, the comparison window may be defined on a chromosome-by-chromosome basis, or a chromosomal arm-by-chromosomal arm basis. In some embodiments, the comparison window is defined on a gene-level basis. In some embodiments, the comparison window is a fixed size, such as 1 KB, 5 KB, 10 KB, 25 KB, 50 KB, 100 KB, 250 KB, 500 KB, 1 MB, 2 MB, 3 MB, or more. In some embodiments, the reference subjects having cancer used to determine the dynamic length threshold are matched to the cancer type of the subject whose liquid biopsy sample is being evaluated.
[0501] Once each fragment is classified as belonging to either the population of short fragments or the population of long fragments, a model trained to estimate circulating tumor fraction based on fragment length distribution data across the genome is applied to the binned data to generate an estimate of the circulating tumor fraction for the liquid biopsy sample. In some embodiments, a comparison of (i) the population of short fragments and (ii) the population of long fragments is made for each bin, e.g., a ratio of the number of short fragments to the number of long fragments in each bin is determined and used as an input for the model. In some embodiments, the model is a probabilistic model (e.g., an application of Bayes' theorem), a deep learning model (e.g., a neural network, such as a convolutional neural network), or an admixture model.
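By way of a hedged example, the per-bin classification of fragments into short and long populations, and the resulting ratio used as a model input, might be sketched in Python as follows. The input layout, the 5 MB bin size, and the 150-base static threshold are hypothetical values chosen only for illustration. The optional per-bin dictionary stands in for a dynamic length threshold determined from reference samples, and the trained model itself is not shown.

```python
from collections import defaultdict


def short_long_ratios(fragments, bin_size=5_000_000, static_threshold=150,
                      thresholds_by_bin=None):
    """Compute the short-to-long fragment ratio for each genomic bin.

    fragments: iterable of (chromosome, start_position, fragment_length).
    thresholds_by_bin: optional {(chromosome, bin_index): length_threshold}
    overriding the static threshold for specific bins.
    Returns {(chromosome, bin_index): ratio}, with None where a bin has
    no long fragments and the ratio is undefined.
    """
    overrides = thresholds_by_bin or {}
    counts = defaultdict(lambda: [0, 0])  # [short_count, long_count]
    for chrom, start, length in fragments:
        key = (chrom, start // bin_size)
        cutoff = overrides.get(key, static_threshold)
        counts[key][0 if length <= cutoff else 1] += 1
    return {
        key: (short / long if long else None)
        for key, (short, long) in counts.items()
    }
```

The resulting dictionary of ratios, ordered by bin, is the kind of feature vector that could be passed to a probabilistic, deep learning, or admixture model.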
[0502] In some embodiments, two or more of the circulating tumor estimation models described herein are used to generate respective tumor fraction estimates, which are combined to form a final tumor fraction estimate. For example, in some embodiments, a measure of central tendency (e.g., a mean) for several tumor fraction estimates is determined and used as the final tumor fraction estimate. In some embodiments, a tumor fraction estimate derived from a plurality of estimation models, e.g., a measure of central tendency for several tumor fraction estimates is used to identify the nearest local optima (e.g., minima) obtained from the tumor fraction estimation methods described above, and further herein.
Quality Control
[0503] In some embodiments, a positive sensitivity control sample is processed and sequenced along with one or more clinical samples. In some embodiments, the control sample is included in at least one flow cell of a multi-flow cell reaction and is processed and sequenced each time a set of samples is sequenced or periodically throughout the course of a plurality of sets of samples. In some embodiments, the control includes a pool of controls. In some embodiments, a quality control analysis requires that read metrics of variants present in the control sample fall within acceptable criteria. In some embodiments, a quality control requires approval by a pathologist before the results are reported.
[0504] In some embodiments, the quality control system includes methods that pass samples for reporting if various criteria are met. Similarly, in some embodiments, the system includes methods that allow for more manual review if a sample does not meet the criteria established for automatic pass. In some embodiments, the criteria for pass of panel sequencing results include one or more of the following:
• A criterion for the on-target rate of the sequencing reaction, defined as a comparison (e.g., a ratio) of (i) the number of sequenced nucleotides or reads falling within the targeted panel region of a genome and (ii) the number of sequenced nucleotides or reads falling outside of the targeted panel region of the genome. Generally, an on-target rate threshold will be selected based on the sequencing technology used, the size of the targeted panel, and the expected number of sequence reads generated by the combination of the technology and targeted panel used. For example, in some embodiments where next generation sequencing-by-synthesis technology is used, the criterion is implemented as a minimum on-target rate threshold of at least 30%, at least 40%, at least 50%, at least 60%, at least 70%, or greater. In some embodiments, the on-target rate criterion is implemented as a range of acceptable on-target rates, e.g., requiring that the on-target rate for a reaction is from 30% to 70%, from 30% to 80%, from 40% to 70%, from 40% to 80%, and the like.
• A criterion for the number of total reads generated by the sequencing reaction, including both unique sequence reads and non-unique sequence reads. Generally, a total read number threshold will be selected based on the sequencing technology used, the size of the targeted panel, and the expected number of sequence reads generated by the combination of the technology and targeted panel used. For example, in some embodiments where next generation sequencing-by-synthesis technology is used, the criterion is implemented as a minimum number of total reads threshold of at least 100 million, 110 million, 120 million, 130 million, 140 million, 150 million, 160 million, 170 million, 180 million, 190 million, 200 million, or more total sequence reads. In some embodiments, the criterion is implemented as a range of acceptable number of total reads, e.g., requiring that the sequencing reaction generate from 50 million to 300 million total sequence reads, from 100 million to 300 million sequence reads, from 100 million to 200 million sequence reads, and the like.
• A criterion for the number of unique reads generated by the sequencing reaction. Generally, a unique read number threshold will be selected based on the sequencing technology used, the size of the targeted panel, and the expected number of sequence reads generated by the combination of the technology and targeted panel used. For example, in some embodiments where next generation sequencing-by-synthesis technology is used, the criterion is implemented as a minimum number of unique reads threshold of at least 3 million, 4 million, 5 million, 6 million, 7 million, 8 million, 9 million, or more unique sequence reads. In some embodiments, the criterion is implemented as a range of acceptable number of unique reads, e.g., requiring that the sequencing reaction generate from 2 million to 10 million unique sequence reads, from 3 million to 9 million unique sequence reads, and the like.
• A criterion for unique read depth across the panel, defined as a measure of central tendency (e.g., a mean or median) for a distribution of the number of unique reads in the sequencing reaction encompassing the genomic regions targeted by each probe. For instance, in some embodiments, an average unique read depth is calculated for each targeted region defined in a target region BED file, using a first calculation of the number of reads mapped to the region multiplied by the read length, divided by the length of the region, if the length of the region is longer than the read length, or otherwise using a second calculation of the number of reads falling within the region multiplied by the read length. The median of unique read depth across the panel is then calculated as the median of those average unique read depths of all targeted regions. In some embodiments, the resolution as to how depth is calculated is increased or decreased, e.g., in cases where it is necessary or desirable to calculate depth for each base, or for a single gene. Generally, a unique read depth threshold will be selected based on the sequencing technology used, the size of the targeted panel, and the expected number of sequence reads generated by the combination of the technology and targeted panel used. For example, in some embodiments where next generation sequencing-by-synthesis technology is used, the criterion is implemented as a minimum unique read depth threshold of at least 1500, 1750, 2000, 2250, 2500, 2750, 3000, 3250, 3500, or higher unique read depth. In some embodiments, the criterion is implemented as a range of acceptable unique read depth, e.g., requiring that the sequencing reaction generate a unique read depth of from 1000 to 4000, from 1500 to 4000, and the like.
• A criterion for the unique read depth of a lowest percentile across the panel, defined as a measure of central tendency (e.g., a mean or median) for a distribution of the number of unique reads in the sequencing reaction encompassing the genomic regions targeted by each probe that fall within the lowest percentile of genomic regions by read depth (e.g., the first, second, third, fourth, fifth, tenth, fifteenth, twentieth, twenty-fifth, or similar percentile). Generally, a unique read depth at a lowest percentile threshold will be selected based on the sequencing technology used, the size of the targeted panel, the lowest percentile selected, and the expected number of sequence reads generated by the combination of the technology and targeted panel used. For example, in some embodiments where next generation sequencing-by-synthesis technology is used, the criterion is implemented as a minimum unique read depth threshold at the fifth percentile of at least 500, 750, 1000, 1250, 1500, 1750, 2000, 2250, 2500, or higher unique read depth. In some embodiments, the criterion is implemented as a range of acceptable unique read depth at the fifth percentile, e.g., requiring that the sequencing reaction generate a unique read depth at the fifth percentile of from 250 to 3000, from 500 to 3000, from 500 to 2500, and the like.
• A criterion for the deamination or OxoG Q-score of a sequencing reaction, defined as a Q-score for the occurrence of artifacts arising from template oxidation/deamination. Generally, a deamination or OxoG Q-score threshold will be selected based on the sequencing technology used. For example, in some embodiments where next generation sequencing-by-synthesis technology is used, the criterion is implemented as a minimum deamination or OxoG Q-score threshold of at least 10, 20, 30, 40, 50, 60, 70, 80, 90, or higher. In some embodiments, the criterion is implemented as a range of acceptable deamination or OxoG Q-scores, e.g., from 10 to 100, from 10 to 90, and the like.
• A criterion for the estimated contamination fraction of a sequencing reaction, defined as an estimate of the fraction of template fragments in the sample being sequenced arising from contamination of the sample, commonly expressed as a decimal, e.g., where 1% contamination is expressed as 0.01. An example method for estimating contamination in a sequencing method is described in Jun G. et al., Am. J. Hum. Genet., 91:839-48 (2012). For example, in some embodiments, the criterion is implemented as a maximum contamination fraction threshold of no more than 0.001, 0.0015, 0.002, 0.0025, 0.003, 0.0035, or 0.004. In some embodiments, the criterion is implemented as a range of acceptable contamination fractions, e.g., from 0.0005 to 0.005, from 0.0005 to 0.004, from 0.001 to 0.004, and the like.
• A criterion for the fingerprint correlation score of a sequencing reaction, defined as a Pearson correlation coefficient calculated between the variant allele fractions of a set of pre-defined single nucleotide polymorphisms (SNPs) in two samples. An example method for determining a fingerprint correlation score is described in Sejoon L. et al., Nucleic Acids Research, Volume 45, Issue 11, 20 June 2017, Page e103, the content of which is incorporated herein by reference, in its entirety, for all purposes. For example, in some embodiments, the criterion is implemented as a minimum fingerprint correlation score threshold of at least 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, or higher. In some embodiments, the criterion is implemented as a range of acceptable fingerprint correlation scores, e.g., from 0.1 to 0.9, from 0.2 to 0.9, from 0.3 to 0.9, and the like.
• A criterion for the raw coverage of a minimum percentage of the genomic regions targeted by a probe, defined as a minimum number of unique reads in the sequencing reaction encompassing each of a minimum percentage (e.g., at least 80%, 85%, 90%, 95%, 98%, 99%, 99.5%, 99.9%, and the like) of the genomic regions targeted by the probe panel. In some embodiments, the term "unique read depth" is used to distinguish deduplicated reads from raw reads that may contain multiple reads sequenced from the same original DNA molecule via PCR. Generally, a raw coverage of a minimum percentage of the genomic regions targeted by a probe threshold will be selected based on the sequencing technology used, the size of the targeted panel, the minimum percentage selected, and the expected number of sequence reads generated by the combination of the technology and targeted panel used. For example, in some embodiments where next generation sequencing-by-synthesis technology is used, the criterion is implemented as a raw coverage of 95% of the genomic regions targeted by a probe threshold of at least 500, 750, 1000, 1250, 1500, 1750, 2000, 2250, 2500, or higher unique read depth. In some embodiments, the criterion is implemented as a range of acceptable unique read depth for 95% of the genomic regions targeted by a probe, e.g., requiring that the sequencing reaction generate a unique read depth for 95% of the targeted regions of from 250 to 3000, from 500 to 3000, from 500 to 2500, and the like.
• A criterion for the PCR duplication rate of a sequencing reaction, defined as the percentage of sequence reads that arise from the same template molecule as at least one other sequence read generated by the reaction. Generally, a PCR duplication rate threshold will be selected based on the sequencing technology used, the size of the targeted panel, and the expected number of sequence reads generated by the combination of the technology and targeted panel used. For example, in some embodiments where next generation sequencing-by-synthesis technology is used, the criterion is implemented as a minimum PCR duplication rate threshold of at least 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or higher. In some embodiments, the criterion is implemented as a range of acceptable PCR duplication rates, e.g., of from 90% to 100%, from 90% to 99%, from 91% to 99%, and the like.
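As a non-limiting sketch of the unique read depth criterion listed above, the following Python functions compute the average unique read depth for each targeted region using the two calculations as they are worded in that criterion, and then take the median across the panel. The function names and the representation of each region as a (read count, region length) pair are assumptions of this sketch. Parsing of the target region BED file is not shown.

```python
from statistics import median


def region_unique_depth(unique_read_count, read_length, region_length):
    """Average unique read depth for one targeted region.

    Follows the two calculations as worded in the criterion above:
    reads * read_length / region_length when the region is longer than
    the read length, and otherwise reads * read_length.
    """
    if region_length > read_length:
        return unique_read_count * read_length / region_length
    return unique_read_count * read_length


def panel_median_unique_depth(regions, read_length):
    """Median of the per-region depths across all targeted regions.

    regions: iterable of (unique_read_count, region_length) pairs.
    """
    return median(
        region_unique_depth(count, read_length, length)
        for count, length in regions
    )


def passes_depth_criterion(regions, read_length, minimum_depth=1500):
    """Apply a minimum unique read depth threshold (1500 is one listed example)."""
    return panel_median_unique_depth(regions, read_length) >= minimum_depth
```

The same per-region depths could be reused for the lowest-percentile criterion by taking a low percentile of the distribution rather than its median.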
[0505] Similarly, in some embodiments, the quality control system includes methods that fail samples for reporting if various criteria are met. In some embodiments, the system includes methods that allow for more manual review if a sample does meet the criteria established for automatic fail. In some embodiments, the criteria for failing panel sequencing results include one or more of the following:
• A criterion for the on-target rate of the sequencing reaction, defined as a comparison (e.g., a ratio) of (i) the number of sequenced nucleotides or reads falling within the targeted panel region of a genome and (ii) the number of sequenced nucleotides or reads falling outside of the targeted panel region of the genome. Generally, an on-target rate threshold will be selected based on the sequencing technology used, the size of the targeted panel, and the expected number of sequence reads generated by the combination of the technology and targeted panel used. For example, in some embodiments where next generation sequencing-by-synthesis technology is used, the criterion is implemented as a maximum on-target rate threshold of no more than 30%, 40%, 50%, 60%, 70%, or greater. That is, the criterion for failing the sample is satisfied when the on-target rate for the sequencing reaction is below the maximum on-target rate threshold. In some embodiments, the on-target rate criterion is implemented as not falling within a range of acceptable on-target rates, e.g., falling outside of an on-target rate for a reaction of from 30% to 70%, from 30% to 80%, from 40% to 70%, from 40% to 80%, and the like.
• A criterion for the number of total reads generated by the sequencing reaction, including both unique sequence reads and non-unique sequence reads. Generally, a total read number threshold will be selected based on the sequencing technology used, the size of the targeted panel, and the expected number of sequence reads generated by the combination of the technology and targeted panel used. For example, in some embodiments where next generation sequencing-by-synthesis technology is used, the criterion is implemented as a maximum number of total reads threshold of no more than 100 million, 110 million, 120 million, 130 million, 140 million, 150 million, 160 million, 170 million, 180 million, 190 million, 200 million, or more total sequence reads. That is, the criterion for failing the sample is satisfied when the number of total reads for the sequencing reaction is below the maximum total read threshold. In some embodiments, the criterion is implemented as not falling within a range of acceptable number of total reads, e.g., falling outside of a range of from 50 million to 300 million total sequence reads, from 100 million to 300 million sequence reads, from 100 million to 200 million sequence reads, and the like.
• A criterion for the number of unique reads generated by the sequencing reaction. Generally, a unique read number threshold will be selected based on the sequencing technology used, the size of the targeted panel, and the expected number of sequence reads generated by the combination of the technology and targeted panel used. For example, in some embodiments where next generation sequencing-by-synthesis technology is used, the criterion is implemented as a maximum number of unique reads threshold of no more than 3 million, 4 million, 5 million, 6 million, 7 million, 8 million, 9 million, or more unique sequence reads. That is, the criterion for failing the sample is satisfied when the number of unique reads for the sequencing reaction is below the maximum unique read threshold. In some embodiments, the criterion is implemented as not falling within a range of acceptable number of unique reads, e.g., falling outside of a range of from 2 million to 10 million unique sequence reads, from 3 million to 9 million unique sequence reads, and the like.
• A criterion for unique read depth across the panel, defined as a measure of central tendency (e.g., a mean or median) for a distribution of the number of unique reads in the sequencing reaction encompassing the genomic regions targeted by each probe. Generally, a unique read depth threshold will be selected based on the sequencing technology used, the size of the targeted panel, and the expected number of sequence reads generated by the combination of the technology and targeted panel used. For example, in some embodiments where next generation sequencing-by-synthesis technology is used, the criterion is implemented as a maximum unique read depth threshold of no more than 1500, 1750, 2000, 2250, 2500, 2750, 3000, 3250, 3500, or higher unique read depth. That is, the criterion for failing the sample is satisfied when the unique read depth across the panel for the sequencing reaction is below the maximum unique read depth threshold. In some embodiments, the criterion is implemented as falling outside of a range of acceptable unique read depth, e.g., falling outside of a unique read depth range of from 1000 to 4000, from 1500 to 4000, and the like.
• A criterion for the unique read depth of a lowest percentile across the panel, defined as a measure of central tendency (e.g., a mean or median) for a distribution of the number of unique reads in the sequencing reaction encompassing the genomic regions targeted by each probe that fall within the lowest percentile of genomic regions by read depth (e.g., the first, second, third, fourth, fifth, tenth, fifteenth, twentieth, twenty-fifth, or similar percentile). Generally, a unique read depth at a lowest percentile threshold will be selected based on the sequencing technology used, the size of the targeted panel, the lowest percentile selected, and the expected number of sequence reads generated by the combination of the technology and targeted panel used. For example, in some embodiments where next generation sequencing-by-synthesis technology is used, the criterion is implemented as a maximum unique read depth threshold at the fifth percentile of no more than 500, 750, 1000, 1250, 1500, 1750, 2000, 2250, 2500, or higher unique read depth. That is, the criterion for failing the sample is satisfied when the unique read depth at a lowest percentile threshold for the sequencing reaction is below the maximum unique read depth at a lowest percentile threshold. In some embodiments, the criterion is implemented as falling outside of a range of acceptable unique read depth at the fifth percentile, e.g., falling outside of a unique read depth at the fifth percentile range of from 250 to 3000, from 500 to 3000, from 500 to 2500, and the like.
• A criterion for the deamination or OxoG Q-score of a sequencing reaction, defined as a Q-score for the occurrence of artifacts arising from template oxidation/deamination. Generally, a deamination or OxoG Q-score threshold will be selected based on the sequencing technology used. For example, in some embodiments where next generation sequencing-by-synthesis technology is used, the criterion is implemented as a maximum deamination or OxoG Q-score threshold of no more than 10, 20, 30, 40, 50, 60, 70, 80, 90, or higher. That is, the criterion for failing the sample is satisfied when the deamination or OxoG Q-score for the sequencing reaction is below the maximum deamination or OxoG Q-score threshold. In some embodiments, the criterion is implemented as falling outside of a range of acceptable deamination or OxoG Q-scores, e.g., falling outside of a deamination or OxoG Q-score range of from 10 to 100, from 10 to 90, and the like.
• A criterion for the estimated contamination fraction of a sequencing reaction, defined as an estimate of the fraction of template fragments in the sample being sequenced arising from contamination of the sample, commonly expressed as a decimal, e.g., where 1% contamination is expressed as 0.01. An example method for estimating contamination in a sequencing method is described in Jun G. et al., Am. J. Hum. Genet., 91:839-48 (2012). For example, in some embodiments, the criterion is implemented as a minimum contamination fraction threshold of at least 0.001, 0.0015, 0.002, 0.0025, 0.003, 0.0035, or 0.004. That is, the criterion for failing the sample is satisfied when the contamination fraction for the sequencing reaction is above the minimum contamination fraction threshold. In some embodiments, the criterion is implemented as falling outside of a range of acceptable contamination fractions, e.g., falling outside of a contamination fraction range of from 0.0005 to 0.005, from 0.0005 to 0.004, from 0.001 to 0.004, and the like.
• A criterion for the fingerprint correlation score of a sequencing reaction, defined as a Pearson correlation coefficient calculated between the variant allele fractions of a set of pre-defined single nucleotide polymorphisms (SNPs) in two samples. An example method for determining a fingerprint correlation score is described in Sejoon L. et al., Nucleic Acids Research, Volume 45, Issue 11, 20 June 2017, Page e103. For example, in some embodiments, the criterion is implemented as a maximum fingerprint correlation score threshold of no more than 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, or higher. That is, the criterion for failing the sample is satisfied when the fingerprint correlation score for the sequencing reaction is below the maximum fingerprint correlation score threshold. In some embodiments, the criterion is implemented as falling outside of a range of acceptable fingerprint correlation scores, e.g., falling outside of a fingerprint correlation range of from 0.1 to 0.9, from 0.2 to 0.9, from 0.3 to 0.9, and the like.
• A criterion for the raw coverage of a minimum percentage of the genomic regions targeted by a probe, defined as a minimum number of unique reads in the sequencing reaction encompassing each of a minimum percentage (e.g., at least 80%, 85%, 90%, 95%, 98%, 99%, 99.5%, 99.9%, and the like) of the genomic regions targeted by the probe panel. Generally, a raw coverage of a minimum percentage of the genomic regions targeted by a probe threshold will be selected based on the sequencing technology used, the size of the targeted panel, the minimum percentage selected, and the expected number of sequence reads generated by the combination of the technology and targeted panel used. For example, in some embodiments where next generation sequencing-by-synthesis technology is used, the criterion is implemented as a raw coverage of 95% of the genomic regions targeted by a probe threshold of no more than 500, 750, 1000, 1250, 1500, 1750, 2000, 2250, 2500, or higher unique read depth. That is, the criterion for failing the sample is satisfied when the raw coverage of a minimum percentage of the genomic regions targeted by a probe for the sequencing reaction is below the maximum raw coverage of a minimum percentage of the genomic regions targeted by a probe threshold. In some embodiments, the criterion is implemented as falling outside of a range of acceptable unique read depth for 95% of the genomic regions targeted by a probe, e.g., requiring that the sequencing reaction generate a unique read depth for 95% of the targeted regions falling outside of a range of from 250 to 3000, from 500 to 3000, from 500 to 2500, and the like.
• A criterion for the PCR duplication rate of a sequencing reaction, defined as the percentage of sequence reads that arise from the same template molecule as at least one other sequence read generated by the reaction. Generally, a PCR duplication rate threshold will be selected based on the sequencing technology used, the size of the targeted panel, and the expected number of sequence reads generated by the combination of the technology and targeted panel used. For example, in some embodiments where next generation sequencing-by-synthesis technology is used, the criterion is implemented as a maximum PCR duplication rate threshold of at least 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or higher. That is, the criterion for failing the sample is satisfied when the PCR duplication rate for the sequencing reaction is below the maximum PCR duplication rate threshold. In some embodiments, the criterion is implemented as falling outside of a range of acceptable PCR duplication rates, e.g., of from 90% to 100%, from 90% to 99%, from 91% to 99%, and the like.
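As an illustration of two of the metrics defined in the list above, the following minimal sketch computes a fingerprint correlation score as a Pearson coefficient over paired variant allele fractions, and a PCR duplication rate from the number of reads observed per template molecule. The function names and the input layout (one list of allele fractions per sample, one read count per template) are assumptions made for this example and are not taken from the disclosure.

```python
from math import sqrt


def fingerprint_correlation(vafs_a, vafs_b):
    """Pearson correlation of variant allele fractions at the same SNPs in two samples."""
    if len(vafs_a) != len(vafs_b) or len(vafs_a) < 2:
        raise ValueError("need two equal-length lists with at least two SNPs")
    mean_a = sum(vafs_a) / len(vafs_a)
    mean_b = sum(vafs_b) / len(vafs_b)
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(vafs_a, vafs_b))
    var_a = sum((a - mean_a) ** 2 for a in vafs_a)
    var_b = sum((b - mean_b) ** 2 for b in vafs_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined when one sample is constant")
    return cov / sqrt(var_a * var_b)


def pcr_duplication_rate(reads_per_template):
    """Percentage of reads that share a template molecule with at least one other read."""
    total = sum(reads_per_template)
    if total == 0:
        raise ValueError("no reads")
    # Every read in a family of size >= 2 has at least one sibling read.
    duplicated = sum(n for n in reads_per_template if n > 1)
    return 100.0 * duplicated / total
```

For instance, three templates observed 1, 1, and 2 times give four reads, two of which share a template, for a duplication rate of 50%.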
[0506] Thresholds for the auto-pass and auto-fail criteria may be established with reference to one another but are not necessarily set at the same level. For instance, in some embodiments, samples with a metric that falls between auto-pass and auto-fail criteria may be routed for manual review by a qualified bioinformatics scientist. Samples that are failed either automatically or by manual review may be routed to medical and laboratory teams for final review and can be released for downstream processing at the discretion of the laboratory medical director or designee.
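The routing described in the preceding paragraph can be sketched as follows for a metric where higher values are better (e.g., unique read depth). The threshold parameter names and the example values are hypothetical; they only illustrate that the auto-pass and auto-fail levels may differ, leaving a band that is sent to manual review.

```python
def triage(value, pass_at_or_above, fail_below):
    """Route a higher-is-better QC metric to auto-pass, auto-fail, or manual review."""
    if fail_below > pass_at_or_above:
        raise ValueError("auto-fail level must not exceed auto-pass level")
    if value >= pass_at_or_above:
        return "auto-pass"
    if value < fail_below:
        return "auto-fail"
    # Between the two levels: neither criterion is met.
    return "manual-review"
```

With hypothetical levels of 1000 (auto-pass) and 500 (auto-fail), a sample at 750 would be routed to manual review.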
Systems and Methods for Improved Validation of Copy Number Variation
[0507] An overview of methods for providing clinical support for personalized cancer therapy is described above with reference to Figures 2-4. Below, systems and methods for improving validation of copy number variation in a test subject, e.g., within the context of the methods and systems described above, are described with reference to Figures 5A1-E1 and 6A1-C1.
[0508] Many of the embodiments described below, in conjunction with Figures 5A1-E1 and 6A1-C1, relate to analyses performed using sequencing data for cfDNA obtained from a liquid biopsy sample of a cancer patient. Generally, these embodiments are independent and, thus, not reliant upon any particular DNA sequencing methods. However, in some embodiments, the methods described below include generating the sequencing data.
[0509] In one aspect, the disclosure provides a method for validating a copy number variation (e.g., identifying a true focal copy number variation) in a test subject, by applying one or more filters to segmented copy ratio data from a sequencing assay performed on a liquid biopsy sample from the subject. The method includes obtaining, from a first sequencing reaction, a corresponding sequence of each cell-free DNA fragment in a first plurality of cell-free DNA fragments in a liquid biopsy sample of the test subject, thereby obtaining a first plurality of sequence reads, e.g., a plurality of de-duplicated sequence reads, where each sequence read corresponds to a unique cell-free DNA fragment from the sample.
In some embodiments, the first plurality of sequence reads includes at least 1000 sequence reads. In some embodiments, the first plurality of sequence reads includes at least 10,000 sequence reads. In some embodiments, the first plurality of sequence reads includes at least 100,000 sequence reads. In some embodiments, the first plurality of sequence reads includes at least 200,000, 300,000, 400,000, 500,000, 750,000, 1,000,000, 2,500,000, 5,000,000 sequence reads, or more.
[0510] The method then includes aligning each respective sequence read in the first plurality of sequence reads to a reference sequence for the species of the subject. As described above, in some embodiments, the reference sequence is a reference genome, e.g., a reference human genome. In some embodiments, a reference genome has several blacklisted regions, such that the reference genome covers only about 75%, 80%, 85%, 90%, 95%, 98%, 99%, 99.5%, or 99.9% of the entire genome for the species of the subject. In some embodiments, the reference sequence for the subject covers at least 10% of the entire genome for the species of the subject, or at least 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, or more of the entire genome for the species of the subject. In some embodiments, the reference sequence for the subject represents a partial or whole exome for the species of the subject. For instance, in some embodiments, the reference sequence for the subject covers at least 10% of the exome for the species of the subject, or at least 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 98%, 99%, 99.9%, or 100% of the exome for the species of the subject. In some embodiments, the reference sequence covers a plurality of loci that constitute a panel of genomic loci, e.g., a panel of genes used in a panel-enriched sequencing reaction. Examples of genes useful for precision oncology, e.g., which may be targeted with such a panel, are shown in Table 1. Accordingly, in some embodiments, the reference sequence for the subject covers at least 100 kb of the genome for the species of the subject. In other embodiments, the reference sequence for the subject covers at least 250 kb, 500 kb, 750 kb, 1 Mb, 2 Mb, 5 Mb, 10 Mb, 25 Mb, 50 Mb, 100 Mb, 250 Mb, or more of the genome for the species of the subject.
However, in some embodiments, there is no size limitation of the reference sequence. For example, in some embodiments, the reference sequence can be a sequence for a single locus (e.g., a single exon, gene, etc.) within the genome for the species of the subject.
[0511] The method then includes determining several metrics for the sequencing data. In some embodiments, the metrics include a plurality of bin-level sequence ratios, each respective bin-level sequence ratio in the plurality of bin-level sequence ratios corresponding to a respective bin in a plurality of bins. In some embodiments, the plurality of bins includes at least 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10,000, 25,000, 50,000, or more bins distributed across the reference sequence (e.g., the genome) for the species of the subject. Each respective bin in the plurality of bins represents a corresponding region of a reference sequence (e.g., genome) for the species of the subject. In some embodiments, the bins are distributed relatively uniformly across the reference sequence, e.g., such that each bin encompasses a similar number of bases, e.g., about 0.5 kb, 1 kb, 2 kb, 5 kb, 10 kb, 25 kb, 50 kb, 100 kb or more bases. Each respective bin-level sequence ratio in the plurality of bin-level sequence ratios is determined from a comparison of the first plurality of sequence reads to sequence reads from one or more reference samples. In some embodiments, the one or more reference samples are process-matched reference samples. That is, in some embodiments the one or more reference samples are prepared for sequencing using the same methodology as used to prepare the sample from the test subject. Similarly, in some embodiments, the one or more reference samples are sequenced using the same sequencing methodology as used to sequence the sample from the test subject. In this fashion, internal biases for particular regions or sequences are controlled for in the reference samples.
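One way to picture the comparison described above is a per-bin log2 ratio of the test sample's share of reads to the median share across the reference samples. This is a minimal sketch under several assumptions that the text does not fix: ratios are expressed on a log2 scale, each sample is normalized by its total read count, the reference is summarized by a per-bin median, and a small floor avoids taking the logarithm of zero.

```python
from math import log2
from statistics import median


def to_fractions(counts):
    """Convert per-bin read counts to each bin's share of the sample's reads."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("sample has no reads")
    return [c / total for c in counts]


def bin_level_ratios(test_counts, reference_counts, floor=1e-9):
    """Return one log2 ratio per bin: test sample versus median of reference samples."""
    if not reference_counts:
        raise ValueError("need at least one reference sample")
    test = to_fractions(test_counts)
    refs = [to_fractions(sample) for sample in reference_counts]
    if any(len(sample) != len(test) for sample in refs):
        raise ValueError("all samples must cover the same bins")
    ratios = []
    for i, test_share in enumerate(test):
        ref_share = median([sample[i] for sample in refs])
        ratios.append(log2(max(test_share, floor) / max(ref_share, floor)))
    return ratios
```

Because the reference samples are assumed to be process-matched, a bin that is systematically under-covered appears low in both numerator and denominator and largely cancels out of the ratio.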
[0512] In some embodiments, the metrics include a plurality of segment-level sequence ratios, each respective segment-level sequence ratio in the plurality of segment-level sequence ratios corresponding to a segment in a plurality of segments. Each respective segment in the plurality of segments represents a corresponding region of the reference genome for the species of the subject encompassing a subset of adjacent bins in the plurality of bins. Each respective segment-level sequence ratio in the plurality of segment-level sequence ratios is determined from a measure of central tendency of the plurality of bin-level sequence ratios corresponding to the subset of adjacent bins encompassed by the respective segment. That is, in some embodiments, bins adjacent to each other in the reference sequence (e.g., reference genome) are grouped together to form segments of the reference sequence (e.g., genome) having similar sequence ratios and, therefore, presumably the same copy number in the cancerous tissue of the subject.
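Given segment boundaries, the segment-level ratio described above reduces to a measure of central tendency over the bins each segment contains. The sketch below assumes the median as that measure and represents each segment as a half-open pair of bin indices; how the segments themselves are found (the segmentation algorithm) is outside the scope of this example.

```python
from statistics import median


def segment_level_ratios(bin_ratios, segments):
    """Median bin-level ratio for each segment, given (start, end) bin index pairs."""
    results = []
    for start, end in segments:
        if not 0 <= start < end <= len(bin_ratios):
            raise ValueError(f"segment ({start}, {end}) is out of range")
        # Half-open slice: bins start, start+1, ..., end-1 belong to the segment.
        results.append(median(bin_ratios[start:end]))
    return results
```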
[0513] In some embodiments, the metrics include a plurality of segment-level measures of dispersion. Each respective segment-level measure of dispersion in the plurality of segment-level measures of dispersion corresponds to a respective segment in the plurality of segments. Each respective segment-level measure of dispersion in the plurality of segment-level measures of dispersion is determined using the plurality of bin-level sequence ratios corresponding to the subset of adjacent bins encompassed by the respective segment. That is, a measure of the dispersion of the individual bin-level sequence ratios that make up a segment is determined.
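The text does not commit to a particular measure of dispersion, so the following sketch reports two common candidates for the bins of one segment: the sample standard deviation and the median absolute deviation (MAD). Both choices, and the half-open index convention for the segment, are assumptions made for illustration.

```python
from statistics import median, stdev


def segment_dispersion(bin_ratios, start, end):
    """Dispersion of the bin-level ratios in bins start..end-1 of one segment."""
    values = bin_ratios[start:end]
    if len(values) < 2:
        raise ValueError("need at least two bins to measure dispersion")
    center = median(values)
    # MAD is less sensitive than the standard deviation to a few outlying bins.
    mad = median([abs(v - center) for v in values])
    return {"stdev": stdev(values), "mad": mad}
```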
[0514] The method then includes validating a copy number status annotation (e.g., determining whether a copy number variation is a focal amplification or deletion) of a respective segment in the plurality of segments that is annotated with a copy number variation by applying the first dataset to an algorithm having one or more criteria filters. The copy number status annotation of the respective segment (e.g., whether or not a segment represents a focal amplification or focal deletion) is then verified or rejected based on a predetermined pattern of firing or lack of firing of each of the filters in the one or more filters.
[0515] In some embodiments, the one or more filters includes a measure of central tendency bin-level sequence ratio filter that is fired when a measure of central tendency of the plurality of bin-level sequence ratios corresponding to the subset of bins encompassed by the respective segment fails to satisfy one or more bin-level sequence ratio thresholds.
[0516] In some embodiments, the one or more filters includes a confidence filter that is fired when the segment-level measure of dispersion corresponding to the respective segment fails to satisfy a confidence threshold.
[0517] In some embodiments, the one or more filters includes a measure of central tendency-plus-deviation bin-level sequence ratio filter that is fired when a measure of central tendency of the plurality of bin-level sequence ratios corresponding to the subset of bins encompassed by the respective segment fails to satisfy one or more measure of central tendency-plus-deviation bin-level sequence ratio thresholds. The one or more measure of central tendency-plus-deviation bin-level copy ratio thresholds are derived from (i) a measure of central tendency of the bin-level sequence ratios corresponding to the plurality of bins that map to the same chromosome of the reference genome for the species of the subject as the respective segment, and (ii) a measure of dispersion across the bin-level sequence ratios corresponding to the plurality of bins that map to the respective chromosome.
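The three filters described above can be sketched together for the case of a putative amplification. Everything numeric here is hypothetical: the threshold values, the multiplier `k`, the use of the median and population standard deviation, and the rule that the annotation is validated only when no filter fires are assumptions chosen to make the example concrete, not values from the disclosure.

```python
from statistics import median, pstdev


def validate_amplification(segment_bins, chromosome_bins,
                           min_ratio=0.5, max_dispersion=0.4, k=2.0):
    """Apply three filters to one segment; return (validated, which filters fired)."""
    seg_center = median(segment_bins)
    chrom_center = median(chromosome_bins)
    chrom_spread = pstdev(chromosome_bins)
    fired = {
        # Fires when the segment's central ratio does not reach the fixed threshold.
        "central_tendency": seg_center < min_ratio,
        # Fires when the bins within the segment disagree too much with each other.
        "confidence": pstdev(segment_bins) > max_dispersion,
        # Fires when the segment does not stand out from its own chromosome.
        "central_tendency_plus_deviation": seg_center < chrom_center + k * chrom_spread,
    }
    return not any(fired.values()), fired
```

The chromosome-relative filter is what distinguishes a focal event from a broad one: if most of the chromosome is elevated, its median rises and a segment must exceed that higher baseline to pass.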
[0518] This general method is described with further, optional details below, with reference to method 500-1. Referring to method 500-1, the present disclosure provides a method for validating a copy number variation in a test subject.
[0519] Subjects and Biological Samples.
[0520] Referring to Block 502-1, the method includes obtaining a first dataset that comprises a plurality of bin-level sequence ratios, each respective bin-level sequence ratio in the plurality of bin-level sequence ratios corresponding to a respective bin in a plurality of bins. Each respective bin in the plurality of bins represents a corresponding region of a human reference genome, and each respective bin-level sequence ratio in the plurality of bin-level sequence ratios is determined from a sequencing of a plurality of cell-free nucleic acids in a first liquid biopsy sample of the test subject and one or more reference samples. In some embodiments, the plurality of bin-level sequence ratios comprises 2 or more, 3 or more, 4 or more, 5 or more, 6 or more, 7 or more, 8 or more, 100 or more, 1000 or more, 1500 or more, 2000 or more, 2500 or more, 3000 or more, 3500 or more, 4000 or more, 4500 or more, 5000 or more, 5500 or more, 6000 or more, 6500 or more, 7000 or more, 7500 or more, 8000 or more, 8500 or more, 9000 or more, 9500 or more, 10,000 or more, 20,000 or more, 50,000 or more, or 100,000 or more bin-level sequence ratios. In some embodiments, the plurality of bin-level sequence ratios consists of between 100 and 100,000 bin-level sequence ratios.
[0521] In some embodiments, the test subject is a patient in a clinical trial. Referring to Block 504-1, in some embodiments, the test subject is a patient with a cancer. In some such embodiments, the cancer is a solid tumor cancer. In some embodiments, the cancer is Ovarian Cancer, Cervical Cancer, Uveal Melanoma, Colorectal Cancer, Chromophobe Renal Cell Carcinoma, Liver Cancer, Endocrine Tumor, Oropharyngeal Cancer, Retinoblastoma, Biliary Cancer, Adrenal cancer, Neural, Neuroblastoma, Basal Cell Carcinoma, Brain Cancer, Breast Cancer, Melanoma, Non-Clear Cell Renal Cell Carcinoma, Glioblastoma, Glioma, Tumor of Unknown Origin, Kidney Cancer, Gastrointestinal Stromal Tumor, Medulloblastoma, Bladder Cancer, Gastric Cancer, Bone Cancer, Non-Small Cell Lung Cancer, Thymoma, Low Grade Glioma, Prostate Cancer, Clear Cell Renal Cell Carcinoma, Skin Cancer, Thyroid Cancer, Sarcoma, Testicular cancer, Head and Neck Cancer, Head and Neck Squamous Cell Carcinoma, Meningioma, Peritoneal cancer, Endometrial Cancer, Pancreatic Cancer, Mesothelioma, Esophageal Cancer, Small Cell Lung Cancer, Her2 Negative Breast Cancer, Solid Tumor, Ovarian Serous Carcinoma, HR+ Breast Cancer, Uterine Serous Carcinoma, Endometrial Cancer, Uterine Corpus Endometrial Carcinoma, Gastroesophageal Junction Adenocarcinoma, Gallbladder Cancer, Chordoma, or Papillary Renal Cell Carcinoma.
[0522] Referring to Block 506-1, in some embodiments, the liquid biopsy sample is a liquid biopsy sample. Referring to Block 508-1, in some embodiments, the liquid biopsy sample is blood. For example, in some embodiments, the liquid biopsy sample comprises blood, whole blood, peripheral blood, plasma, serum, or lymph of the test subject. In some alternative embodiments, the liquid biopsy sample is any of the embodiments described above (see, Definitions: Liquid Biopsy and/or Example Methods: Figure 2A: Example Workflow for Precision Oncology).
[0523] In some embodiments, the method further comprises obtaining the liquid biopsy sample from a sample repository or database (e.g., BioIVT, TSC Biosample Repository, BioLINCC, etc.). In some embodiments, the liquid biopsy sample is obtained from the test subject at least 1 hour, at least 2 hours, at least 12 hours, at least 1 day, at least 2 days, at least 1 week, at least 1 month, or at least 1 year prior to processing and/or sequencing the liquid biopsy sample. In some such embodiments, the liquid biopsy sample is fresh, frozen, dried, and/or fixed. In some embodiments, the liquid biopsy sample is processed and/or sequenced at least 1 day, at least 2 days, at least 1 week, at least 1 month, or at least 1 year prior to obtaining the first dataset. For example, in some embodiments, the sequencing data for the liquid biopsy sample are obtained from a data repository (e.g., GenBank, NCBI Assembly, DNA DataBank of Japan, European Nucleotide Archive, European Variation Archive, etc.).
[0524] Concurrent Testing
[0525] Unless stated otherwise, as used herein, the term “concurrent” as it relates to assays refers to a period of time between zero and ninety days. In some embodiments, concurrent tests using different biological samples from the same subject (e.g., two or more of a liquid biopsy sample, cancerous tissue, such as a solid tumor sample or blood sample for a blood-based cancer, and a non-cancerous sample) are performed within a period of time (e.g., the biological samples are collected within the period of time) of from 0 days to 90 days. In other embodiments, the period of time is from 0 days to 60 days, from 0 days to 30 days, from 0 days to 21 days, from 0 days to 14 days, from 0 days to 7 days, or from 0 days to 3 days.
[0526]
[0527] In some embodiments, a liquid biopsy assay may be used concurrently with a solid tumor assay to return more comprehensive information about a patient’s variants. For example, a blood specimen and a solid tumor specimen may be sent to a laboratory for evaluation. The solid tumor specimen may be analyzed using a bioinformatics pipeline to produce a solid tumor result. A solid tumor assay is described, for instance, in U.S. Patent Application No. 16/657,804. The cancer type of the solid tumor may include, for example, non-small cell lung cancer, colorectal cancer, or breast cancer. Alterations identified in the tumor/matched normal result may include, for example, EGFR+ for non-small cell lung cancer; HER2+ for breast cancer; or KRAS G12C for several cancers.
[0528] In some embodiments, the blood specimen may be divided into a first portion and a second portion. The first portion of the blood specimen and the solid tumor specimen may be analyzed using a bioinformatics pipeline to produce a tumor/matched normal result. The second portion of the blood specimen may be analyzed using a bioinformatics pipeline to produce a liquid biopsy result. For example, the blood specimen may be analyzed using at least an improvement in somatic variant identification, e.g., as described herein in the section entitled “Systems and Methods for Improved Validation of Somatic Sequence Variants” and/or “Variant Identification.” For example, the blood specimen may be analyzed using an improvement in focal copy number identification, e.g., as described herein in the section entitled “Systems and Methods for Improved Validation of Copy Number Variation” and/or “Copy Number Variation.” For example, the blood specimen may be analyzed using an improvement in circulating tumor fraction determination, e.g., as described above in the section entitled “Systems and Methods for Improved Circulating Tumor Fraction Estimates” and/or “Circulating Tumor Fraction.”
[0529] Therapies may be identified for further consideration in response to receiving the tumor or tumor/matched normal result along with the liquid biopsy result. For example, if the results overall indicate that the patient has HER2+ breast cancer, neratinib may be identified along with the test results for further consideration by the ordering clinician.
[0530] The liquid biopsy assay and the solid tumor or tumor/matched normal assay may be ordered concurrently; their results may be delivered concurrently; and they may be analyzed concurrently.
[0531] In some embodiments, the liquid biopsy sample corresponds to a matched tumor sample (e.g., a solid tumor sample obtained from the test subject). For example, in some embodiments, the method further comprises obtaining a second dataset that is determined from a sequencing of a plurality of cell-free nucleic acids in a matched tumor sample of the test subject. In some embodiments, the matched tumor sample is obtained from the test subject concurrently with the liquid biopsy sample. In some embodiments, the matched tumor sample is obtained from the test subject at a different time point from the obtaining the liquid biopsy sample. In some embodiments, the matched tumor sample is any of the embodiments described above (see, Example Methods: Figure 2A: Example Workflow for Precision Oncology). In some embodiments, the method further comprises obtaining the matched tumor sample from a sample repository or database (e.g., BioIVT, TSC Biosample Repository, BioLINCC, etc.). In some embodiments, the matched tumor sample is obtained from the test subject at least 1 hour, at least 2 hours, at least 12 hours, at least 1 day, at least 2 days, at least 1 week, at least 1 month, or at least 1 year prior to obtaining the liquid biopsy sample. In some such embodiments, the matched tumor sample is fresh, frozen, dried, and/or fixed. In some embodiments, the matched tumor sample is processed and/or sequenced at least 1 day, at least 2 days, at least 1 week, at least 1 month, or at least 1 year prior to obtaining the second dataset. For example, in some such embodiments, the sequencing data for the plurality of nucleic acids in the matched tumor sample are obtained from a data repository (e.g., GenBank, NCBI Assembly, DNA DataBank of Japan, European Nucleotide Archive, European Variation Archive, etc.).
[0532] In some embodiments, the one or more reference samples are non-cancerous samples. In some embodiments, the one or more reference samples is a matched normal sample (e.g., a normal sample obtained from the test subject). In some embodiments, the matched normal sample is obtained from the test subject concurrently with the liquid biopsy sample. In some embodiments, the matched normal sample is obtained from the test subject at a different time point from the obtaining the liquid biopsy sample. In some embodiments, the matched normal sample is any of the embodiments described above (see, Example Methods: Figure 2A: Example Workflow for Precision Oncology).
[0533] In some alternative embodiments, the one or more reference samples comprise a pool of normal (e.g., non-cancerous) samples obtained from a plurality of control subjects (e.g., healthy subjects). In some such embodiments, the method further comprises obtaining the one or more reference samples from a sample repository or database (e.g., BioIVT, TSC Biosample Repository, BioLINCC, etc.). In some embodiments, the one or more reference samples include liquid biopsy samples comprising a plurality of cell-free nucleic acids and/or solid tissue samples comprising a plurality of nucleic acids. In some embodiments, the one or more reference samples are processed and/or sequenced at least 1 day, at least 2 days, at least 1 week, at least 1 month, or at least 1 year prior to obtaining the first dataset. For example, in some such embodiments, the sequencing data for the one or more reference samples are obtained from a data repository (e.g., GenBank, NCBI Assembly, DNA DataBank of Japan, European Nucleotide Archive, European Variation Archive, etc.).
[0534] Referring to Block 510-1, in some embodiments, the cell-free nucleic acids (e.g., in the first liquid biopsy sample of the test subject and the one or more reference samples) comprise circulating tumor DNA (ctDNA). In some embodiments, the method further comprises isolating the plurality of cell-free nucleic acids from the liquid biopsy sample of the test subject prior to the sequencing. In some embodiments, the sequencing is multiplexed sequencing. In some embodiments, the sequencing is short-read sequencing or long-read sequencing.
[0535] In some embodiments, the sequencing is a panel-enriched sequencing reaction. In some such embodiments, the sequencing reaction is performed at a read depth of 100X or more, 250X or more, 500X or more, 1000X or more, 2500X or more, 5000X or more, 10,000X or more, 20,000X or more, or 30,000X or more. In some embodiments, the sequencing panel comprises 1 or more, 10 or more, 20 or more, 50 or more, 100 or more, 150 or more, 200 or more, 300 or more, 500 or more, or 1000 or more genes. In some embodiments, the sequencing panel comprises one or more genes listed in Table 1. In some embodiments, the sequencing panel includes at least 2, 3, 4, 5, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80, 90, 100, or all of the genes listed in Table 1. In some embodiments, the sequencing panel comprises one or more genes selected from the group consisting of MET, EGFR, ERBB2, CD274, CCNE1, MYC, BRCA1 and BRCA2. In some embodiments, the sequencing panel includes at least 2, 3, 4, 5, 6, 7, or all 8 of MET, EGFR, ERBB2, CD274, CCNE1, MYC, BRCA1 and BRCA2. In some embodiments, the sequencing reaction is a whole exome sequencing reaction.
[0536] In some embodiments, the sequencing reaction is a whole genome sequencing reaction. In some such embodiments, the sequencing reaction is performed at an average read depth of 10X or more, 15X or more, 20X or more, 25X or more, 30X or more, 40X or more, or 50X or more. In some embodiments, certain regions of the genome are blacklisted from the analysis of a whole genome sequencing reaction, e.g., centromeres, telomeres, highly repeated sequences, and the like, for which accurate sequencing results are difficult to obtain.
[0537] In some embodiments, the obtaining the first dataset further comprises aligning a plurality of sequence reads, obtained from a sequencing of the plurality of cell-free nucleic acids in the first liquid biopsy sample of the test subject, to the human reference genome.
[0538] In some embodiments, on average, each respective bin in the plurality of bins has two or more, three or more, five or more, ten or more, fifteen or more, twenty or more, fifty or more, one hundred or more, five hundred or more, one thousand or more, ten thousand or more, or 100,000 or more sequence reads in the plurality of sequence reads mapping onto the portion of the reference genome corresponding to the respective bin, where each such sequence read uniquely represents a different molecule in the plurality of cell-free nucleic acids in the liquid biopsy sample. For instance, in some embodiments, the plurality of cell-free nucleic acids in the liquid biopsy sample are sequenced with a sequencing methodology that makes use of unique molecular identifiers (UMIs) for each cell-free nucleic acid in the liquid biopsy sample and each sequence read in the plurality of sequence reads has a unique UMI. In such embodiments, sequence reads with the same UMI are bagged (collapsed) into a single sequence read bearing the UMI.
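The collapsing of reads that share a UMI can be sketched as grouping reads by UMI and emitting one consensus read per group. The per-position majority vote used below to form the consensus, and the plain `(umi, sequence)` input layout, are assumptions for illustration; the text only states that same-UMI reads are collapsed into a single read, and practical pipelines typically also take the alignment position into account.

```python
from collections import Counter, defaultdict


def collapse_by_umi(reads):
    """Collapse (umi, sequence) pairs into one consensus sequence per UMI."""
    families = defaultdict(list)
    for umi, sequence in reads:
        families[umi].append(sequence)

    collapsed = {}
    for umi, sequences in families.items():
        # Only positions covered by every read in the family are considered.
        length = min(len(s) for s in sequences)
        consensus = []
        for i in range(length):
            base, _count = Counter(s[i] for s in sequences).most_common(1)[0]
            consensus.append(base)
        collapsed[umi] = "".join(consensus)
    return collapsed
```

After this step, the number of entries in the result equals the number of distinct template molecules observed, which is the quantity the per-bin read counts above are meant to reflect.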
[0539] In some embodiments, the sequencing of the plurality of cell-free nucleic acids in the first liquid biopsy sample of the test subject is performed at a central laboratory or sequencing facility. In some such embodiments, the obtaining the first dataset comprises accessing one or more sequencing datasets and/or one or more auxiliary files, in electronic form, through a cloud-based interface. For example, a first dataset can be obtained by performing a bioinformatics pipeline using tumor BAM files, normal BAM files, a human reference genome file, a target region BED file, a list of mappable regions of the genome, and/or a blacklist of recurrent problematic areas of the genome.
[0540] In some embodiments, the obtaining the first dataset comprises accessing the first dataset, in electronic form, through a cloud-based interface. For example, a first dataset can comprise one or more outputs from a bioinformatics pipeline (e.g., CNVkit outputs “.cns” and/or “.cnr”).
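As a rough illustration of reading such an output, the sketch below parses a tab-separated bin-level table with a header row. It assumes the file carries `chromosome`, `start`, `end`, and `log2` columns, which is how CNVkit's bin-level `.cnr` files are commonly laid out; the column set should be checked against the actual files in use, and the rows shown are made-up values, not data from the disclosure.

```python
import csv
import io


def read_bin_ratios(handle):
    """Read (chromosome, start, end, log2) tuples from a tab-separated table."""
    rows = csv.DictReader(handle, delimiter="\t")
    return [
        (row["chromosome"], int(row["start"]), int(row["end"]), float(row["log2"]))
        for row in rows
    ]


# Made-up example rows in the assumed layout.
example = io.StringIO(
    "chromosome\tstart\tend\tgene\tdepth\tlog2\tweight\n"
    "chr7\t1000\t1100\tGENE_A\t812.4\t1.25\t0.93\n"
    "chr7\t1100\t1200\tGENE_A\t790.1\t1.19\t0.95\n"
)
bins = read_bin_ratios(example)
```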
[0541] Additional methods and embodiments for sequencing nucleic acids, including aligning and preprocessing sequence reads, are described in further detail above (see, Example Methods: Figure 2A: Example Workflow for Precision Oncology). Additional methods and embodiments for performing the presently disclosed methods at a distributed diagnostic and clinical environment are described in detail above (see, Example Methods: Figure 2B: Distributed Diagnostic and Clinical Environment). Other embodiments and/or any combinations, substitutions, additions or deletions thereof are possible, as will be apparent to one skilled in the art.
[0542] Bins and Sequence Ratios.
[0543] In some embodiments, the methods and systems described herein bin sequences (e.g., sequence reads) across one or more regions of a genome to evaluate the copy number at one or more locations of the genome in a tissue of a subject. In this fashion, a count of the number of sequences generated for a test sample that map to the region of the genome corresponding to the bin, or a measure of depth of coverage across the region of the genome corresponding to the bin, is determined. All or a portion of these bin values (e.g., copy number or count number) can then be compared to reference values for the same corresponding bins, to evaluate how the genomic copy number of the genome corresponding to the test sample differs from that of a reference, which can be a single sample or an average of a plurality of samples. For instance, where the reference bin values represent one or more non-cancerous reference samples, a comparison of bin values for the test sample to these reference values can reveal copy number differences having biological significance for the diagnosis and/or treatment of cancer in the test subject.
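The counting step described above can be sketched as assigning each aligned read to the bin that contains its position. The example assumes non-overlapping, half-open bins given as `(chromosome, start, end)` and represents each read by a single `(chromosome, position)` pair; both representations are simplifications chosen for this illustration.

```python
from bisect import bisect_right
from collections import defaultdict


def count_reads_per_bin(bins, read_positions):
    """Count reads falling in each bin; returns counts in the same order as bins."""
    by_chrom = defaultdict(list)
    for index, (chrom, start, end) in enumerate(bins):
        by_chrom[chrom].append((start, end, index))
    for intervals in by_chrom.values():
        intervals.sort()
    starts = {chrom: [iv[0] for iv in ivs] for chrom, ivs in by_chrom.items()}

    counts = [0] * len(bins)
    for chrom, pos in read_positions:
        if chrom not in starts:
            continue  # read maps to a chromosome with no bins
        # Rightmost bin whose start is <= pos, if any.
        i = bisect_right(starts[chrom], pos) - 1
        if i >= 0:
            start, end, index = by_chrom[chrom][i]
            if start <= pos < end:
                counts[index] += 1
    return counts
```

Reads that fall between bins or outside the binned regions are simply not counted, which mirrors the use of blacklists and target regions elsewhere in the workflow.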
[0544] Generally, each bin in a plurality of bins corresponds to a contiguous and non-overlapping region, of any size, of a reference genome (e.g., a reference human genome or equivalent construct). For example, in some embodiments, each bin in a plurality of bins (e.g., spanning all or a portion of a reference genome) is at least 50 base pairs (bp), at least 100 bp, at least 150 bp, at least 200 bp, at least 300 bp, at least 400 bp, at least 500 bp, at least 750 bp, at least 1 kilobase pair (kb), at least 2.5 kb, at least 5 kb, at least 10 kb, at least 25 kb, or more. In some embodiments, each bin in a plurality of bins (e.g., spanning all or a portion of a reference genome) is less than 250 kb, less than 100 kb, less than 50 kb, less than 25 kb, less than 10 kb, less than 5 kb, less than 2.5 kb, or less. In some embodiments, the average bin size of the plurality of bins is from 50 bp to 25 kb, from 50 bp to 5 kb, from 50 bp to 1 kb, from 50 bp to 500 bp, or within any other range starting no lower than 25 bp and ending no higher than 350 kb.
[0545] When targeted-panel sequencing is used, generally bins encompassing on-target reads (those sequence reads corresponding to fragments bound by an enrichment probe) will have smaller sizes than bins encompassing off-target reads (those sequence reads corresponding to fragments not bound by an enrichment probe). Accordingly, in some embodiments, the size of each bin depends on whether it is an on-target bin or an off-target bin. In some embodiments, bin size also varies bin to bin, e.g., such that the number of reads per bin is similar (e.g., within 25% or less of each other). For example, CNVkit automatically adjusts each bin's size so that the number of reads per bin is roughly consistent.
[0546] In some embodiments, on-target bins have an average size of about 100 bp. In some embodiments, on-target bins have an average size of from 25 to 500 bp. In some embodiments, on-target bins have an average size of from 25 to 250 bp. In some embodiments, on-target bins have an average size of from 50 to 250 bp. In some embodiments, on-target bins have an average size of from 50 to 150 bp. A smaller size could be used if a higher resolution (for segmentation and subsequent CNV calling) is desired, but the bins may be noisier since they would contain fewer reads. Thus, the optimal bin size may depend on sequencing depth and sensitivity requirements.
[0547] In some embodiments, off-target bins have an average size of at least 1 kb. In some embodiments, off-target bins have an average size of at least 5 kb. In some embodiments, off-target bins have an average size of from 5 kb to 350 kb. In some embodiments, off-target bins have an average size of from 10 kb to 250 kb. The size of off-target bins may depend on both the on-target and off-target sequencing depths of a sequencing reaction.
[0548] Generally, each bin has a defined start nucleotide and a defined ending nucleotide in the reference genome for the species of subject. For example, where the test species is a human, each bin comprises a start and end position that indicates its location in the human reference genome. In some embodiments, each bin corresponds to (i) a first subset of bins that map to the same position of the human reference genome as a locus in a targeted sequencing panel (e.g., target bins), or (ii) a second subset of bins that map to an off-target portion of a reference genome that is not represented in the targeted sequencing panel (e.g., off-target bins). In some embodiments, each bin in the first subset of bins represents a different gene, open reading frame, or genetic feature (e.g., promoter of a gene, enhancer of a gene, repressor of a gene) in a reference genome.
[0549] In some embodiments, each bin in a plurality of bins is approximately the same size (e.g., spans about the same number of base pairs in the reference genome as every other bin). For example, in some embodiments, the bin size is specified by a user, such that the number of bins is dependent upon the size of the region that the plurality of bins spans. In some embodiments, the number of bins spanning a region is specified by a user, such that the size of each respective bin is dependent upon the size of the region that the plurality of bins spans.
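The two user-specified options above (fixed bin size, or fixed number of bins) can be illustrated with a minimal sketch. The function names `bins_by_size` and `bins_by_count` and the half-open coordinate convention are assumptions made for illustration only and are not part of the disclosure.

```python
def bins_by_size(start, end, bin_size):
    """Split the half-open region [start, end) into bins of `bin_size` bp.

    The number of bins follows from the region length; the last bin may be
    shorter than `bin_size` when the length is not an exact multiple.
    """
    return [(s, min(s + bin_size, end)) for s in range(start, end, bin_size)]


def bins_by_count(start, end, n_bins):
    """Split the half-open region [start, end) into `n_bins` near-equal bins.

    The size of each bin follows from the region length.
    """
    length = end - start
    edges = [start + (length * k) // n_bins for k in range(n_bins + 1)]
    return list(zip(edges[:-1], edges[1:]))


fixed_size = bins_by_size(0, 1000, 300)
fixed_count = bins_by_count(0, 1000, 4)
```

In the first call the region determines how many bins result; in the second it determines how large each bin is.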
[0550] In some embodiments, each bin in a plurality of bins is not the same size. For example, in some embodiments, each bin size is determined based on the number of sequences falling within the bins in one or more reference samples, e.g., to normalize for an expected number of sequence reads mapping to each bin. In some embodiments, where panel-enriched sequencing is used, bins in a first subset of bins spanning regions of the genome corresponding to the enrichment panel (e.g., bins corresponding to on-target reads of the sequencing reaction) are smaller than a second subset of bins spanning regions of the genome that do not correspond to the enrichment panel (e.g., bins corresponding to off-target reads of the sequencing reaction).
[0551] In some embodiments, the plurality of bins covers at least 1 Mb of a reference genome for the species of the subject (e.g., the human genome). In some embodiments, the plurality of bins covers at least 2.5 Mb, at least 5 Mb, at least 10 Mb, at least 25 Mb, at least 50 Mb, at least 100 Mb, at least 250 Mb, at least 500 Mb, at least 1000 Mb, at least 2000 Mb, at least 3000 Mb, or more of the reference genome. In some embodiments, the plurality of bins covers at least 25% of a reference genome for the species of the subject (e.g., the human genome). In some embodiments, the plurality of bins covers at least 50%, at least 75%, at least 90%, at least 95%, at least 98%, at least 99%, or more of the reference genome.
[0552] In some embodiments, a plurality of sequence reads are obtained from a sequencing of nucleic acids (e.g., in the liquid biopsy sample and/or in the one or more reference samples), and the obtained sequences, e.g., collapsed (de-duplicated) sequence reads, are assigned to respective bins corresponding to the region of the genome that the sequence reads map to. In some embodiments, the sequencing data is pre-processed to correct biases or errors using one or more methods such as normalization, correction of GC biases, correction of biases due to PCR over-amplification, etc., prior to binning.
[0553] In some embodiments, the bin values are processed to correct for biases or errors, e.g., by normalization, standardization, etc. For instance, in some embodiments, a median bin value across a plurality of bin values for a sample is obtained, and each respective bin value in the plurality of bin values is divided by this median value, assuring that the bin values for the respective training subject are centered on a known value (e.g., on zero):
$$bv_i^{*} = \frac{bv_i}{\mathrm{median}(bv_j)}$$
where $bv_i$ = the bin value of bin i in the plurality of bin values for the sample, $bv_i^{*}$ = the normalized bin value of bin i in the plurality of bin values for the sample upon this first normalization, and $\mathrm{median}(bv_j)$ = the median bin value across the plurality of unnormalized bin values for the sample. In some embodiments, rather than using the median bin value across the corresponding plurality of bin values, some other measure of central tendency is used, such as an arithmetic mean, weighted mean, midrange, midhinge, trimean, Winsorized mean, mean, or mode across the plurality of bin values of the sample.
[0554] In some embodiments, each respective normalized bin count $bv_i^{*}$ is further normalized by the median normalized value for the respective bin across a plurality of samples k:
$$bv_i^{**} = \frac{bv_i^{*}}{\mathrm{median}_k(bv_{ik}^{*})}$$
where $bv_i^{*}$ = the normalized bin value of bin i in the first plurality of bin values for the sample from the first normalization procedure described above, $bv_i^{**}$ = the normalized bin value of bin i for the respective sample upon this second normalization described here, and $\mathrm{median}_k(bv_{ik}^{*})$ = the median normalized bin value $bv_i^{*}$ for bin i across the plurality of samples (e.g., reference samples).
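The two normalization steps above can be sketched with NumPy as follows. This is a minimal illustration, not the disclosed implementation: the function names are hypothetical, and the second step assumes that "normalized by the median" means division by the per-bin median across reference samples. Note that dividing by the median leaves each sample's median bin value at one; a subsequent log transform would center it on zero.

```python
import numpy as np


def median_normalize(bin_values):
    """First normalization: divide each sample's bin values by that sample's median.

    bin_values has shape (n_samples, n_bins).
    """
    bv = np.asarray(bin_values, dtype=float)
    return bv / np.median(bv, axis=1, keepdims=True)


def reference_normalize(bv_star, ref_bv_star):
    """Second normalization (assumed to be a division): divide each bin by the
    median of that bin's first-pass values across the reference samples."""
    ref_median = np.median(np.asarray(ref_bv_star, dtype=float), axis=0)
    return np.asarray(bv_star, dtype=float) / ref_median


# Toy counts: 3 samples x 3 bins; the samples double as their own reference set.
counts = np.array([[10.0, 20.0, 30.0],
                   [20.0, 40.0, 80.0],
                   [5.0, 10.0, 15.0]])
bv_star = median_normalize(counts)
bv_2star = reference_normalize(bv_star, bv_star)
```

In this toy input the second sample's third bin stays elevated after both steps, while bins that merely reflect overall depth differences collapse to one.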
[0555] In some embodiments, the un-normalized bin values (counts) $bv_i$ are GC normalized. In some embodiments, the normalized bin values $bv_i^{*}$ are GC normalized. In some embodiments, the normalized bin values $bv_i^{**}$ are GC normalized. In such embodiments, GC counts of respective sequence reads in the plurality of sequence reads of each sample in the plurality of reference samples are binned. A curve describing the conditional mean fragment count per GC value is estimated by such binning (Yoon et al., 2009, Genome Research 19(9): 1586), or, alternatively, by assuming smoothness (Boeva et al., 2011, Bioinformatics 27(2), p. 268; Miller et al., 2011, PLoS ONE 6(1), p. e16327). The resulting GC curve determines a predicted count for each bin based on the bin's GC. These predictions can be used directly to normalize the original signal (e.g., $bv_i^{*}$, $bv_i$, or $bv_i^{**}$). As a non-limiting example, in the case of binning and direct normalization, for each respective G+C percentage in the set {0%, 1%, 2%, 3%, ..., 100%}, the value $m_{GC}$, the median value of $bv_i^{**}$ of all bins across the plurality of training subjects having this respective G+C percentage, is determined and subtracted from the normalized bin values $bv_i^{**}$ of those bins having the respective G+C percentage to form GC normalized bin values $bv_i^{***}$. In some embodiments, rather than using the median value of $bv_i^{**}$ of all bins across the first plurality of subjects having this respective G+C percentage, some other form of measure of central tendency of $bv_i^{**}$ of all bins across the plurality of training subjects having this respective G+C percentage is used, such as an arithmetic mean, weighted mean, midrange, midhinge, trimean, Winsorized mean, mean, or mode. In some embodiments, a correction curve is determined using a locally weighted scatterplot smoothing model (e.g., LOESS, LOWESS, etc.). See, for example, Benjamini and Speed, 2012, Nucleic Acids Research 40(10): e72; and Alkan et al., 2009, Nat Genet 41:1061-7. For example, in some embodiments, the GC bias curve is determined by LOESS regression of count by GC (e.g., using the ‘loess’ function in R) on a random sampling (or exhaustive sampling) of bins from the plurality of training subjects. In some embodiments, the GC bias curve is determined by LOESS regression of count by GC (e.g., using the ‘loess’ function in R), or some other form of curve fitting, on a random sampling of bins from a cohort of reference samples that were sequenced using the same sequencing techniques used to sequence the test sample.
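The "binning and direct normalization" variant described above (median per integer G+C percentage, then subtraction) can be sketched as follows. This is an illustrative sketch only: the function name `gc_normalize` and the input layout are assumptions, and the LOESS-based alternative is not shown.

```python
import numpy as np


def gc_normalize(bin_values, gc_percent):
    """Subtract, from every bin, the median value of all bins (across all
    samples) that share the same integer G+C percentage.

    bin_values: shape (n_samples, n_bins); gc_percent: length n_bins.
    """
    bv = np.atleast_2d(np.asarray(bin_values, dtype=float))
    gc = np.rint(np.asarray(gc_percent, dtype=float)).astype(int)
    out = bv.copy()
    for g in np.unique(gc):
        mask = gc == g
        m_gc = np.median(bv[:, mask])  # median over samples and bins at this GC
        out[:, mask] = bv[:, mask] - m_gc
    return out


# Toy input: 2 samples x 4 bins, two bins at 40% GC and two at 60% GC.
values = np.array([[1.0, 1.2, 0.8, 0.6],
                   [1.0, 1.4, 0.8, 0.4]])
gc = [40, 40, 60, 60]
gc_corrected = gc_normalize(values, gc)
```

After correction, the median of the bins within each G+C stratum is zero, so residual differences are no longer explained by G+C content alone.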
[0556] In some embodiments, the bin counts are normalized using principal component analysis (PCA) to remove higher-order artifacts for a population-based (e.g., healthy subjects) correction. See, for example, Price et al., 2006, Nat Genet 38, pp. 904-909; Leek and Storey, 2007, PLoS Genet 3, pp. 1724-1735; and Zhao et al., 2015, Clinical Chemistry 61(4), pp. 608-616. Such normalization can be in addition to or instead of any of the above-identified normalization techniques. In some such embodiments, to train the PCA normalization, a data matrix comprising LOESS normalized bin counts $bv_i^{***}$ from young healthy subjects in the plurality of training subjects (or another cohort that was sequenced in the same manner as the plurality of training subjects) is used and the data matrix is transformed into principal component space thereby obtaining the top N number of principal components across the training set. In some embodiments, the top 2, the top 3, the top 4, the top 5, the top 6, the top 7, the top 8, the top 9 or the top 10 such principal components are used to build a linear regression model:
$$LM(PC_1, \ldots, PC_N)$$
Then, each bin value $bv_i^{***}$ of each respective bin of each respective sample in the plurality of reference samples is fit to this linear model to form a corresponding PCA-normalized bin count $bv_i^{****}$:
$$bv_i^{****} = bv_i^{***} - \mathrm{fit}_{LM(PC_1, \ldots, PC_N)}$$
In other words, for each respective sample in the plurality of reference samples, a linear regression model is fit between its normalized bin counts $\{bv_1^{***}, \ldots, bv_I^{***}\}$ and the top principal components from the training set. The residuals of this model serve as final normalized bin values $\{bv_1^{****}, \ldots, bv_I^{****}\}$ for the respective sample. Intuitively, the top principal components represent noise commonly seen in reference samples, and therefore removing such noise (in the form of the top principal components derived from the healthy cohort) from the bin values $bv_i^{***}$ can effectively improve normalization. See Zhao et al., 2015, Clinical Chemistry 61(4), pp. 608-616 for further disclosure on PCA normalization of sequence reads using a healthy population. Regarding the above normalization, it will be appreciated that all variables are standardized (e.g., by subtracting their means and dividing by their standard deviations) when necessary.
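The residual-based PCA normalization described above can be sketched with NumPy. This is a minimal sketch under stated assumptions: the function names are hypothetical, the principal components are obtained by singular value decomposition of the mean-centered training matrix, and each sample is centered with the per-bin training mean before regression (one possible reading of "standardized when necessary").

```python
import numpy as np


def fit_top_components(train, n_components):
    """Return (per-bin mean, top principal components) of a training matrix.

    train has shape (n_training_samples, n_bins); the returned components
    have shape (n_components, n_bins) with orthonormal rows.
    """
    train = np.asarray(train, dtype=float)
    mean = train.mean(axis=0)
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    return mean, vt[:n_components]


def pca_normalize(sample, mean, components):
    """Regress one sample's centered bin values on the top components and
    return the residuals as the PCA-normalized bin values."""
    y = np.asarray(sample, dtype=float) - mean
    X = components.T  # shape (n_bins, n_components)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef


# Synthetic illustration: a sample that is mostly "common noise" along PC1.
rng = np.random.default_rng(0)
train = rng.normal(size=(20, 50))
mean, pcs = fit_top_components(train, 3)
sample = mean + 2.5 * pcs[0] + rng.normal(scale=0.01, size=50)
resid = pca_normalize(sample, mean, pcs)
```

Because the residuals are orthogonal to the retained components, any signal shared with the healthy-cohort components is removed, and only the small sample-specific part remains.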
[0557] It will be appreciated that any form of representation of the number of nucleic acid sequence reads mapping to a given bin i can constitute a “bin value” and that such a bin value can be in un-normalized form (e.g., $bv_i$) or normalized form (e.g., $bv_i^{*}$, $bv_i^{**}$, $bv_i^{***}$, $bv_i^{****}$, etc.).
[0558] After binning sequences, a bin count or read depth is determined for each bin. For example, in some embodiments, the read depth is the average number of times that the corresponding region of the human reference genome spanned by the respective bin is represented in the plurality of sequence reads obtained from the sequencing reaction.
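As an illustration of the read-depth definition above, the following sketch computes, for each bin, the average number of times its bases are covered by aligned reads. The function name `mean_depth` and the half-open `(start, end)` interval representation of bins and reads are assumptions for illustration.

```python
def mean_depth(bins, reads):
    """Average per-base coverage of each bin.

    bins and reads are lists of half-open (start, end) intervals on the same
    chromosome. For each bin, the overlap with every read is summed and
    divided by the bin length.
    """
    depths = []
    for b_start, b_end in bins:
        covered = sum(
            max(0, min(b_end, r_end) - max(b_start, r_start))
            for r_start, r_end in reads
        )
        depths.append(covered / (b_end - b_start))
    return depths


bins = [(0, 100), (100, 200)]
reads = [(0, 50), (25, 125), (150, 200)]
depths = mean_depth(bins, reads)
```

A read that straddles a bin boundary contributes to both bins in proportion to its overlap, which is the difference between a depth measure and a simple read count.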
[0559] In some embodiments, the read depths for each respective bin, in the plurality of bins, are determined by binning sequence reads obtained for the plurality of cell-free nucleic acids in a panel-enriched sequencing reaction. In some embodiments, the panel-enriched sequencing reaction is an ultra-high depth sequencing, where each locus in the plurality of loci in the targeted sequencing panel is sequenced at an average coverage of at least 1000x, at least 2500x, or at least 5000x. In some embodiments, read depths are obtained from targeted captured sequencing reads (e.g., target bins) and non-specifically captured off-target reads (e.g., off-target bins).
[0560] The bin values (e.g., bin counts or read depth) generated from the binning operation are then compared to bin values for one or more reference samples. Referring to Block 512-1, in some embodiments, each respective bin-level sequence ratio in the plurality of bin-level sequence ratios is derived from a comparison of (a) a read depth for the corresponding bin in the plurality of bins, determined from a sequencing of a plurality of cell-free nucleic acids in a liquid biopsy sample of the test subject, to (b) a measure of central tendency of read depths for the corresponding bin, across one or more reference samples (or simply the read depth of a single reference sample in the case where only one reference sample is used). Thus, in some such embodiments, a sequence ratio for a respective bin is a comparison of the read depths between the test sample and one or more reference samples, e.g., a pool of reference samples. In some embodiments, the one or more reference samples is a single sample, two or more samples, five or more samples, or 100 or more samples.
[0561] For example, in some embodiments, the (a) read depth and the (b) read depths are determined by binning sequence reads from one or more panel-enriched sequencing reactions, and the plurality of bin-level sequence ratios comprises (i) a first sub-plurality of bin-level sequence ratios corresponding to bins that map to the same position of the human reference genome as an enriched locus in the panel-enriched sequencing reaction; and (ii) a second sub-plurality of bin-level sequence ratios corresponding to bins that do not map to the same position of the human reference genome as any enriched locus in the panel-enriched sequencing reaction. For example, in some such embodiments, the bin-level sequence ratios for target bins and the bin-level sequence ratios for off-target bins are separately determined.
[0562] In some embodiments, the (a) read depth and the (b) read depths are log2-transformed (e.g., log2 read depths).
[0563] In some embodiments the ratio of the (a) read depth (X) and the (b) measure of central tendency of the read depths (Y) is taken as X/Y, Y/X, logN(X/Y), logN(Y/X), X'/Y, Y/X', logN(X'/Y), logN(Y/X'), X/Y', Y'/X, logN(X/Y'), logN(Y'/X), X'/Y', Y'/X', logN(X'/Y'), or logN(Y'/X'), where N is any real number greater than 1 and where X' and Y' are mathematical transformations of X and Y, respectively. Example mathematical transformations of X and Y include, but are not limited to, raising X or Y to a power Z, multiplying X or Y by a constant Q, where Z and Q are any real numbers, and/or taking an M based logarithm of X and/or Y, where M is a real number greater than 1.
[0564] In some embodiments, the (a) read depth and the (b) read depths are centered and corrected. In some such embodiments, the (a) read depth and the (b) read depths are median-centered. In some embodiments, the correcting comprises correcting for bias (e.g., GC content, genome sequence repetitiveness, target size and/or spacing). For example, in some embodiments, the method further comprises, for each sample, centering and correcting the plurality of read depths corresponding to the plurality of bins, across all target and off-target bins in the sample.
[0565] In some embodiments, the (b) measure of central tendency of read depths for the corresponding bin, across the one or more reference samples, is an arithmetic mean, a weighted mean, a midrange, a midhinge, a trimean, a Winsorized mean, a mean, a median, or a mode. In some embodiments, the (b) measure of central tendency of read depths for the corresponding bin, across the one or more reference samples, is Tukey’s biweight location.
[0566] In some embodiments, the method further comprises determining the spread of the (b) read depths for the corresponding bin, across the one or more reference samples. In some such embodiments, the spread is a measure of dispersion including, but not limited to, a range, a standard deviation, a standard error, and/or a confidence interval. In some embodiments, the spread is a midvariance. For additional background on these statistical methods, see, for example, Lax, J Am Stat Assoc, 80, 736-741 (1985), and Randal, Comput Stat Data An, 52, 5014-5021 (2008), each of which is hereby incorporated herein by reference in its entirety.
[0567] Referring to Block 514-1, in some embodiments, each bin-level sequence ratio in the plurality of bin-level sequence ratios is a copy ratio.
[0568] For example, in some embodiments, the log2 read depth of the corresponding bin in the one or more reference samples (e.g., the reference pool) is subtracted from the centered and corrected log2 read depth of each bin in the test sample (e.g., the liquid biopsy sample). This generates a log2 copy ratio between the test sample and the one or more reference samples. Then, in some embodiments, the copy ratio of a bin can be defined as:
$$\log_2(\text{copy ratio}) = \log_2(\text{test}) - \log_2(\text{ref})$$
where log2(test) (e.g., the test log2 read depth) is the median-centered and corrected log2-transformed read depth, for the liquid biopsy sample, for the respective bin, and log2(ref) (e.g., the reference log2 read depth) is determined by calculating the weighted average of median-centered and corrected log2-transformed read depths, for each reference sample in the plurality of reference samples, for the respective bin.
[0569] In some embodiments, the one or more reference samples includes one or more test samples comprising less than a threshold number of copy number variations. In some embodiments, the one or more reference samples includes one or more test samples comprising one or more copy number variations, where each copy number variation occurs fewer than a threshold number of times in the one or more test samples. In some embodiments, the threshold number of copy number variations is 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 or more than 10. In some embodiments, the threshold number of occurrences for each of the one or more copy number variations is 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 or more than 10.
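The log2 copy ratio of paragraph [0568] (test log2 read depth minus pooled reference log2 read depth) can be sketched as follows. This is a simplified illustration: the function name and arguments are hypothetical, all depths are assumed to be strictly positive, only median-centering is applied, and the bias corrections (GC content, repetitiveness, target size) are omitted.

```python
import numpy as np


def log2_copy_ratio(test_depth, ref_depths, ref_weights=None):
    """Per-bin log2 copy ratio of a test sample against pooled references.

    test_depth: length n_bins; ref_depths: shape (n_refs, n_bins);
    ref_weights: optional per-reference weights for the pooled average.
    """
    # Median-center the log2-transformed test depths.
    test_log2 = np.log2(np.asarray(test_depth, dtype=float))
    test_log2 = test_log2 - np.median(test_log2)
    # Median-center each reference sample, then pool bin by bin.
    ref_log2 = np.log2(np.asarray(ref_depths, dtype=float))
    ref_log2 = ref_log2 - np.median(ref_log2, axis=1, keepdims=True)
    ref_pooled = np.average(ref_log2, axis=0, weights=ref_weights)
    return test_log2 - ref_pooled


test = [100, 100, 200, 100, 100]
refs = [[50, 50, 50, 50, 50], [80, 80, 80, 80, 80]]
ratios = log2_copy_ratio(test, refs)
```

In this toy input the third bin has twice the depth of its neighbors in the test sample but not in the references, so its log2 copy ratio is 1 while the other bins are 0.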
[0570] In some alternative embodiments, the one or more reference samples includes one or more process-matched normal samples. In some such embodiments, the one or more process-matched normal samples are not pooled (e.g., the read depths are not averaged across the one or more matched normal samples), and each test sample is normalized against its process-matched normal sample.
[0571] In some embodiments, no reference samples are obtained, and each test sample is normalized using one or more fixed values for normalization (e.g., a specified log2 depth correction value for each bin in the tumor sample). For example, in some embodiments, a fixed value for log2 depth correction is a neutral copy number (e.g., log2 1.0).
[0572] In some embodiments, the method further comprises removing (e.g., filtering), from the plurality of bins, each bin that fails to satisfy one or more filtering criteria.
[0573] In some embodiments, the one or more filtering criteria comprises a threshold reference log2 read depth. For example, in some embodiments, each bin that has a reference log2 read depth below a threshold value is removed from the plurality of bins. In some such embodiments, the threshold reference log2 read depth is less than 5, less than 1, less than 0, less than -1, less than -2, less than -3, less than -4, less than -5, less than -6, less than -7, less than -8, less than -9, or less than -10. In some embodiments, the threshold reference log2 read depth is between 0 and -10.
[0574] In some embodiments, the one or more filtering criteria comprises a threshold test log2 read depth. For example, in some embodiments, each bin that has a test log2 read depth below a threshold value is removed from the plurality of bins. In some such embodiments, the threshold test log2 read depth is less than 10, less than 5, less than 1, less than 0, less than -1, less than -2, less than -3, less than -4, less than -5, less than -6, less than -7, less than -8, less than -9, or less than -10. In some embodiments, the threshold test log2 read depth is between 5 and -5.
[0575] In some embodiments, the one or more filtering criteria comprises a proximity of a test log2 read depth to a blacklist value. For example, in some embodiments, each bin that has a test log2 read depth that is within a specified range around a blacklist value is removed from the plurality of bins. In some embodiments, the blacklist value is 0, and the specified range is +/- 1 or less (e.g., each bin that has a test log2 read depth between -1 and 1 is removed from the plurality of bins). In some embodiments, the blacklist value is 0, and the specified range is +/- 0.9 or less, +/- 0.8 or less, +/- 0.7 or less, +/- 0.6 or less, +/- 0.5 or less, +/- 0.4 or less, +/- 0.3 or less, +/- 0.2 or less, or +/- 0.1 or less. In some embodiments, the specified range is greater than +/- 1.
[0576] In some embodiments, the one or more filtering criteria comprises a distance of a test log2 read depth from a whitelist value. For example, in some such embodiments, each bin that has a test log2 read depth that is outside of a specified range around a whitelist value is removed from the plurality of bins. In some embodiments, the whitelist value is a measure of central tendency of the test log2 read depths for a subset of bins in the plurality of bins. The measure of central tendency can be a mean, a median, or a mode. In some embodiments, the subset of bins is 2, 3, 4, 5, 6, 7, 8, 9, 10, or more than 10 bins including the respective bin. In some embodiments, the subset of bins is 2, 3, 4, 5, 6, 7, 8, 9, 10, or more than 10 contiguous bins including the respective bin. For example, the measure of central tendency of the test log2 read depths for a subset of bins can be a local average of log2 read depths, where the local average of log2 read depths is determined by calculating a rolling average for the subset of bins including the respective bin. In some embodiments, the specified range around the whitelist value is at least +/- 1 (e.g., each bin that has a test log2 read depth that differs by 1 or more from the rolling average is removed from the plurality of bins). In some embodiments, the specified range is at least +/- 2, at least +/- 3, at least +/- 4, or at least +/- 5. In some embodiments, the specified range is less than +/- 1.
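The rolling-average (whitelist) filter described above can be sketched as follows. The function name, the window of 5 contiguous bins, and the truncation of the window at the ends of the array are assumptions for illustration; the +/- 1 range is one of the values recited above.

```python
import numpy as np


def rolling_mean_filter(log2_depth, window=5, max_diff=1.0):
    """Return a boolean mask that is False for bins whose log2 read depth
    differs from the local rolling average by `max_diff` or more.

    The rolling average is taken over up to `window` contiguous bins centered
    on (and including) the respective bin; the window is truncated at the ends.
    """
    x = np.asarray(log2_depth, dtype=float)
    half = window // 2
    keep = np.ones(x.size, dtype=bool)
    for i in range(x.size):
        lo, hi = max(0, i - half), min(x.size, i + half + 1)
        keep[i] = abs(x[i] - x[lo:hi].mean()) < max_diff
    return keep


depths = [0.0, 0.1, -0.1, 3.0, 0.0, 0.1, -0.1]
keep = rolling_mean_filter(depths)
filtered = [d for d, k in zip(depths, keep) if k]
```

Only the isolated spike is removed; its neighbors are retained because the spike shifts their local averages by less than the allowed range.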
[0577] In some embodiments, the one or more filtering criteria comprises a threshold spread (e.g., a standard deviation, a standard error, and/or a confidence interval) of reference log2 read depths, for the respective bin, across all samples in the one or more reference samples. For example, in some such embodiments, each bin that has a spread of read depths greater than a threshold value is removed from the plurality of bins.
[0578] In some embodiments, each bin in the plurality of bins is assigned a weight, and the one or more filtering criteria comprises a threshold weight. In some such embodiments, the weight is determined based on one or more of: a size of the bin (e.g., the number of base pairs in the respective bin); a deviation (e.g., distance) from 0 of the log2 read depth for the respective bin in the pooled reference; and/or the spread of log2 read depths for the respective bin in the pooled reference. For example, in some embodiments, each bin with a weight of 0 is removed from the plurality of bins.
[0579] Other methods for binning and/or determining sequence ratios are possible, as will be apparent to one skilled in the art. See, for example, Talevich et al., PLoS Comput Biol, 12(4):e1004873 (2016), the content of which is hereby incorporated by reference, in its entirety, for all purposes.
[0580] Segments.
[0581] Referring again to Block 502-1, the first dataset further comprises a plurality of segment-level sequence ratios, each respective segment-level sequence ratio in the plurality of segment-level sequence ratios corresponding to a segment in a plurality of segments.
[0582] Each respective segment in the plurality of segments represents a corresponding region of the human reference genome encompassing a subset of adjacent bins in the plurality of bins, and each respective segment-level sequence ratio in the plurality of segment-level sequence ratios is determined from a measure of central tendency of the plurality of bin-level sequence ratios corresponding to the subset of adjacent bins encompassed by the respective segment.
[0583] The first dataset further comprises a plurality of segment-level measures of dispersion, each respective segment-level measure of dispersion in the plurality of segment-level measures of dispersion (i) corresponding to a respective segment in the plurality of segments and (ii) determined using the plurality of bin-level sequence ratios corresponding to the subset of adjacent bins encompassed by the respective segment.
[0584] Referring to Block 516-1, in some embodiments, one or more respective segments in the plurality of segments that represents a corresponding region of the human reference genome encodes a target gene. Referring to Block 518-1, in some embodiments, the target gene is MET, EGFR, ERBB2, CD274, CCNE1, MYC, BRCA1 or BRCA2. Referring to Block 520-1, in some embodiments, the target gene is any of the genes listed in Table 1.
[0585] Referring to Block 522-1, the method further comprises, for each respective segment in the plurality of segments that represents a corresponding region of the human reference genome, grouping the respective subset of adjacent bins in the plurality of bins based on a similarity between the respective sequence ratios of the subset of adjacent bins. Referring to Block 524-1, in some such embodiments, the grouping is performed using circular binary segmentation (CBS).
[0586] Circular binary segmentation groups bins into larger segments that divide each chromosome into regions comprising equal sequence ratios (e.g., copy number or copy ratio). This is generally performed by calculating a statistic for each genomic position, where the statistic comprises a likelihood ratio for the null hypothesis (no change in sequence ratio at the respective position) against the alternative (one change in sequence ratio at the respective position), and where the null hypothesis is rejected if the statistic is greater than a predefined distribution threshold. Notably, in circular binary segmentation, the chromosome is assumed to be circularized, such that the calculation is performed recursively for each position (e.g., each bin) around the circumference of the circle to identify all change-points across the length of the chromosome. See, for example, Olshen et al., Biostatistics 5, 4, 557-572 (2004), doi: 10.1093/biostatistics/kxh008, which is hereby incorporated herein by reference in its entirety.
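The core step of circular binary segmentation, scanning every arc of the circularized chromosome for the one whose mean differs most from the remainder, can be sketched as follows. This is a deliberately simplified sketch and not the full algorithm: the function name is hypothetical, the statistic is left unscaled by the noise standard deviation, and the significance test and the recursion into the resulting sub-segments are omitted.

```python
import numpy as np


def best_arc(values):
    """Return ((i, j), z) for the arc values[i:j] whose mean differs most from
    the mean of the rest of the circularized chromosome.

    z is the absolute difference in means divided by sqrt(1/k + 1/(n - k)),
    where k is the arc length and n the number of bins.
    """
    x = np.asarray(values, dtype=float)
    n = x.size
    s = np.concatenate(([0.0], np.cumsum(x)))  # prefix sums for fast arc means
    best, best_z = None, 0.0
    for i in range(n):
        for j in range(i + 1, n + 1):
            k = j - i
            if k == n:
                continue  # the whole chromosome has no complement
            inside = (s[j] - s[i]) / k
            outside = (s[n] - s[j] + s[i]) / (n - k)
            z = abs(inside - outside) / np.sqrt(1.0 / k + 1.0 / (n - k))
            if z > best_z:
                best, best_z = (i, j), z
    return best, best_z


# Noise-free toy profile with a gain spanning bins 10 to 14.
profile = [0.0] * 10 + [1.0] * 5 + [0.0] * 10
arc, z = best_arc(profile)
```

Because an arc and its complement are compared together, a single scan locates both change-points of an interior gain or loss at once, which is the motivation for treating the chromosome as a circle.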
[0587] In some embodiments, the grouping (e.g., segmentation) is performed using a Fused Lasso algorithm, a wavelet-based algorithm (e.g., HaarSeg), and/or a Hidden Markov Model. For example, in some embodiments, the grouping is performed using a 3-state Hidden Markov Model, a 5-state Hidden Markov Model, and/or a 3-state Hidden Markov Model with fixed amplitude for the loss, neutral, and gain states. In some embodiments, the grouping is performed by dividing a respective chromosome into a plurality of predefined regions (e.g., chromosome arms) and calculating the sequence ratios for each predefined region using a measure of central tendency of the sequence ratios of all bins within the predefined region (e.g., a weighted mean of the log2 copy ratios of all bins within each chromosome arm).
[0588] As described above, the segment-level sequence ratio is then calculated, for each segment, as a measure of central tendency for the one or more bins grouped together by the segmentation. In some embodiments, the measure of central tendency of the plurality of bin-level sequence ratios corresponding to the subset of bins encompassed by the respective segment is an arithmetic mean, a weighted mean, a midrange, a midhinge, a trimean, a Winsorized mean, a mean, a median, or a mode. Referring to Block 526-1, in some embodiments, the measure of central tendency of the plurality of bin-level sequence ratios is a weighted mean. For example, a segment-level copy ratio can be calculated as the weighted mean of the plurality of copy ratios for all bins grouped within the segment.
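The weighted-mean example above can be sketched as follows. The function name `segment_ratio`, the half-open bin index ranges, and the example weights are assumptions for illustration; how the bin weights are derived is not specified here.

```python
import numpy as np


def segment_ratio(bin_log2, weights, start, end):
    """Weighted mean of the bin-level log2 copy ratios for bins start..end-1."""
    return float(np.average(bin_log2[start:end], weights=weights[start:end]))


bin_log2 = [0.0, 0.1, 1.0, 1.2, 0.9]
weights = [1.0, 1.0, 2.0, 1.0, 1.0]
segments = [(0, 2), (2, 5)]  # bin index ranges produced by a segmentation step
segment_log2 = [segment_ratio(bin_log2, weights, s, e) for s, e in segments]
```

Bins with larger weights pull the segment-level value toward their own ratio, so noisier or smaller bins can be down-weighted without being discarded.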
[0589] In some embodiments, the segmentation further comprises obtaining a measure of dispersion based on the sequence ratios (e.g., copy ratios) for each bin in the subset of adjacent bins. In some embodiments, each respective segment-level measure of dispersion in the plurality of segment-level measures of dispersion is a confidence interval, a standard deviation, a standard error, a variance, or a range. Referring to Block 528-1, in some embodiments, each respective segment-level measure of dispersion in the plurality of segment-level measures of dispersion is a confidence interval, and determining each respective segment-level measure of dispersion in the plurality of segment-level measures of dispersion comprises bootstrapping the plurality of bin-level sequence ratios corresponding to the subset of bins encompassed by the respective segment. In some alternative embodiments, determining segment-level measures of dispersion (e.g., segment-level confidence intervals) is performed using normal distributions, binomial distributions, and/or statistical models for estimation as will be apparent to one skilled in the art.
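A bootstrapped segment-level confidence interval, as described above, can be sketched as follows. This is one possible percentile-bootstrap scheme under stated assumptions: the function name, the number of resamples, the 95% level, the fixed random seed, and the use of a (weighted) mean as the resampled statistic are all illustrative choices, and any weights are assumed to be positive.

```python
import numpy as np


def bootstrap_ci(bin_log2, weights=None, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a segment's weighted mean.

    Bins within the segment are resampled with replacement `n_boot` times and
    the weighted mean is recomputed for each resample.
    """
    x = np.asarray(bin_log2, dtype=float)
    w = np.ones_like(x) if weights is None else np.asarray(weights, dtype=float)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, x.size, size=(n_boot, x.size))
    means = (x[idx] * w[idx]).sum(axis=1) / w[idx].sum(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


segment_bins = [0.9, 1.1, 1.0, 1.2, 0.8, 1.0, 1.05, 0.95]
ci_low, ci_high = bootstrap_ci(segment_bins)
```

A segment whose interval excludes zero on the log2 scale is more plausibly a true gain or loss than one whose interval straddles zero.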
[0590] Copy Number Status Annotations.
[0591] Referring again to Block 500-1, the present disclosure provides systems and methods for validating a copy number variation in a test subject, such as a copy number status annotation assigned to a genomic segment.
[0592] In some embodiments, a respective segment in the plurality of segments is annotated with a copy number status annotation when the corresponding segment-level sequence ratio satisfies one or more segment-level sequence ratio thresholds.
[0593] In some embodiments, a copy number status annotation is a qualitative status. For example, in some such embodiments, a copy number status annotation is selected from the group consisting of “amplified”, “deleted”, or “neutral”.
[0594] As an example, the annotation can comprise, when the segment-level sequence ratio is a positive number, marking the segment as “amplified”; when the segment-level sequence ratio is a negative number, marking the segment as “deleted”; and when the segment-level sequence ratio is zero or within a specified range around zero, marking the segment as “neutral”.
[0595] As another example, the annotation can comprise, when the segment-level sequence ratio is greater than a first threshold, marking the segment as “amplified”; when the segment-level sequence ratio is less than a second threshold, marking the segment as “deleted”; and when the segment-level sequence ratio is between the first and the second thresholds, marking the segment as “neutral”. In an embodiment, the one or more segment-level sequence ratio thresholds are one or more segment-level copy ratio thresholds, where the copy number status annotation is “amplified” if the segment-level copy ratio is greater than 0.03, “deleted” if the segment-level copy ratio is less than -0.5, or “neutral” if between -0.5 and 0.03.
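As an illustrative, non-limiting sketch, the two-threshold annotation described above can be expressed as follows. The function name is hypothetical; the default thresholds of 0.03 and -0.5 are the example values given above.

```python
def annotate_segment(segment_ratio: float,
                     amplification_threshold: float = 0.03,
                     deletion_threshold: float = -0.5) -> str:
    """Assign a qualitative copy number status from a segment-level copy ratio."""
    if segment_ratio > amplification_threshold:
        return "amplified"
    if segment_ratio < deletion_threshold:
        return "deleted"
    # Ratios between the two thresholds (inclusive) are treated as neutral.
    return "neutral"
```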
[0596] In some embodiments, a copy number status annotation is a quantitative status (e.g., an integer copy number).
[0597] In some embodiments, the annotation comprises, for each segment, rounding the segment-level sequence ratio to the nearest integer and assigning an absolute copy number based on one or more integer segment-level sequence ratio thresholds. For example, in some embodiments, segment-level copy numbers can be estimated based on positive correlations with segment-level copy ratios. In some embodiments, the annotation further comprises, for each segment, determining whether the segment-level sequence ratio falls within a specified range in a plurality of specified ranges, and assigning an absolute copy number (e.g., an integer copy number) based on the specified range.
[0598] In some embodiments, the annotation further comprises rescaling the segment-level sequence ratio based on one or more scaling factors (e.g., tumor fraction, B-allele frequency, known ploidy, and/or point estimates (mean, median, maximum, etc.) of somatic variant allele frequencies). For example, a segment-level copy ratio can be divided by a tumor fraction estimate for the test subject or the biological sample, thus estimating the copy ratio that would be expected in a pure tumor sample.
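As an illustrative, non-limiting sketch, the tumor-fraction rescaling described above amounts to a single division. The function name and the input validation are assumptions added for illustration.

```python
def rescale_by_tumor_fraction(segment_ratio: float,
                              tumor_fraction: float) -> float:
    """Divide a segment-level copy ratio by an estimated tumor fraction."""
    if not 0.0 < tumor_fraction <= 1.0:
        raise ValueError("tumor fraction must be in the interval (0, 1]")
    return segment_ratio / tumor_fraction
```

For example, a segment-level copy ratio of 0.1 observed at an estimated tumor fraction of 0.25 would be rescaled to 0.4.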
[0599] In some embodiments, the method further comprises removing (e.g., filtering) from the plurality of segments each respective segment that fails to satisfy one or more filtering criteria.
[0600] For example, in some embodiments, the one or more filtering criteria comprises a threshold absolute copy number, where each segment that is annotated with an absolute copy number lower than the threshold is removed from the plurality of segments. In some such embodiments, the threshold absolute copy number is 2, 3, 4, 5, 6, 7, 8, 9, 10, or more than 10 copies.
[0601] In some embodiments, the one or more filtering criteria comprises one or more threshold values for a measure of dispersion, where (i) each segment in the plurality of segments is annotated with an absolute copy number; (ii) for a subset of adjacent segments, the measure of dispersion is calculated using the absolute copy number of each segment in the subset of adjacent segments; and (iii) the removing from the plurality of segments each respective segment that fails to satisfy the one or more filtering criteria comprises removing each segment in the subset of adjacent segments. Thus, if the measure of dispersion for a group of adjacent segments fails to satisfy a filtering criterion, then all of the segments used to calculate the measure of dispersion are removed from the plurality of segments. In some embodiments, the measure of dispersion is a confidence interval, and the filtering criterion is inclusion of zero.
[0602] Other methods for annotating and preprocessing genomic segments are possible, as will be apparent to one skilled in the art. See, for example, Talevich et al., PLoS Comput Biol, 12:1004873 (2016), the content of which is hereby incorporated by reference, in its entirety, for all purposes.
[0603] In some embodiments, genomic region binning, coverage calculation, bias correction, normalization to a reference pool, segmentation, visualization and annotation are performed using any methods and/or software, or any embodiments, combinations, substitutions, additions, and/or deletions thereof as will be apparent to one skilled in the art.
[0604] Validation Filters.
[0605] Referring to Block 530-1, the method further comprises validating a copy number status annotation of a respective segment in the plurality of segments that is annotated with a copy number variation by applying the first dataset to an algorithm having a plurality of filters.
[0606] Referring to Block 532-1, the plurality of filters comprises (1) a measure of central tendency bin-level sequence ratio filter that is fired when a measure of central tendency of the plurality of bin-level sequence ratios corresponding to the subset of bins encompassed by the respective segment fails to satisfy one or more bin-level sequence ratio thresholds.
[0607] In some embodiments, the measure of central tendency of the plurality of bin-level sequence ratios corresponding to the subset of bins encompassed by the respective segment fails to satisfy one or more bin-level sequence ratio thresholds when the measure of central tendency is lower than a bin-level sequence ratio amplification threshold. In some embodiments, a bin-level sequence ratio amplification threshold is between -0.5 and 5, between -0.1 and 3, between -0.047 and 1.6, or between 0 and 0.5. In some embodiments, a bin-level sequence ratio amplification threshold is lower than 0.3.
[0608] In some alternative embodiments, the measure of central tendency of the plurality of bin-level sequence ratios corresponding to the subset of bins encompassed by the respective segment fails to satisfy one or more bin-level sequence ratio thresholds when the measure of central tendency is higher than a bin-level sequence ratio deletion threshold. In some embodiments, a bin-level sequence ratio deletion threshold is between -5 and 0.5, between -2 and 0, between -1 and -0.2, or between -0.75 and -0.25.
[0609] In some embodiments, the measure of central tendency of the plurality of bin-level sequence ratios corresponding to the subset of bins encompassed by the respective segment is an arithmetic mean, a weighted mean, a midrange, a midhinge, a trimean, a Winsorized mean, a mean, a median or a mode of the bin-level sequence ratios for all the respective bins encompassed by the respective segment. Referring to Block 534-1, in some embodiments, the measure of central tendency of the plurality of bin-level sequence ratios is a median.
[0610] In some embodiments, the measure of central tendency of the plurality of bin-level sequence ratios in the (1) measure of central tendency bin-level sequence ratio filter is different from the measure of central tendency of the plurality of bin-level sequence ratios used to determine the segment-level sequence ratio (e.g., where the (1) filter is a median copy ratio filter for all the bins in the segment, and the segment-level sequence ratio is calculated from a weighted mean of the bins in the segment).
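As an illustrative, non-limiting sketch, the (1) filter can be implemented with a median as the measure of central tendency. The function name and the status strings are assumptions; the thresholds are passed in by the caller because the disclosure gives ranges rather than a single value.

```python
from statistics import median
from typing import Sequence


def median_ratio_filter_fired(bin_ratios: Sequence[float],
                              status: str,
                              amplification_threshold: float,
                              deletion_threshold: float) -> bool:
    """Return True when filter (1) is fired for the annotated segment."""
    segment_median = median(bin_ratios)
    if status == "amplified":
        # Fired when the median falls below the amplification threshold.
        return segment_median < amplification_threshold
    if status == "deleted":
        # Fired when the median rises above the deletion threshold.
        return segment_median > deletion_threshold
    raise ValueError("status must be 'amplified' or 'deleted'")
```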
[0611] Referring to Block 536-1, the plurality of filters further comprises (2) a confidence filter that is fired when the segment-level measure of dispersion (e.g., confidence interval) corresponding to the respective segment fails to satisfy a confidence threshold.
[0612] In some embodiments, the segment-level measure of dispersion (e.g., confidence interval) corresponding to the respective segment fails to satisfy a confidence threshold (e.g., for amplification) when the lower bound of the measure of dispersion is lower than the confidence threshold. In some alternative embodiments, the segment-level measure of dispersion (e.g., confidence interval) corresponding to the respective segment fails to satisfy a confidence threshold (e.g., for deletion) when the upper bound of the measure of dispersion is higher than the confidence threshold.
[0613] Referring to Block 538-1, in some embodiments, the confidence threshold is a measure of central tendency of the segment-level sequence ratios corresponding to all other segments that map to the same chromosome of the human reference genome as the respective segment (e.g., all other segments excluding the respective segment, if the segment is located at an end of the chromosome).
[0614] Referring to Block 540-1, in some embodiments, the confidence threshold comprises a measure of central tendency of the segment-level sequence ratios corresponding to all preceding segments that map to the same chromosome of the human reference genome as the respective segment, and the measure of central tendency of the segment-level sequence ratios corresponding to all subsequent segments that map to the same chromosome of the human reference genome as the respective segment (e.g., all preceding segments and all following segments, if the respective segment is not located at an end of the chromosome).
In some such embodiments, the (2) confidence filter tests the upper or lower bound of the measure of dispersion (e.g., confidence interval) against two independent confidence thresholds (e.g., one preceding measure of central tendency, and one following measure of central tendency), where the bound of the measure of dispersion must satisfy both confidence thresholds in order to pass the filter. In some such embodiments, the two independent confidence thresholds have different values.
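As an illustrative, non-limiting sketch, the (2) confidence filter with two independent thresholds can be written as follows. The use of a median for the preceding and following segments, the function name, and the handling of a segment at a chromosome end (only the non-empty neighbor group is used) are assumptions made for illustration.

```python
from statistics import median
from typing import Sequence, Tuple


def confidence_filter_fired(confidence_interval: Tuple[float, float],
                            preceding_ratios: Sequence[float],
                            following_ratios: Sequence[float],
                            status: str) -> bool:
    """Return True when filter (2) is fired for the annotated segment."""
    lower, upper = confidence_interval
    # One threshold per non-empty group of neighboring segments on the chromosome.
    thresholds = [median(group)
                  for group in (preceding_ratios, following_ratios)
                  if len(group) > 0]
    if status == "amplified":
        # The lower bound must not fall below either threshold.
        return any(lower < t for t in thresholds)
    if status == "deleted":
        # The upper bound must not rise above either threshold.
        return any(upper > t for t in thresholds)
    raise ValueError("status must be 'amplified' or 'deleted'")
```

In this sketch, a segment with no other segments on its chromosome has no thresholds to test against and therefore never fires the filter; a practitioner may prefer different behavior for that case.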
[0615] In some embodiments, the measure of central tendency of the segment-level sequence ratios in the (2) confidence filter is an arithmetic mean, a weighted mean, a midrange, a midhinge, a trimean, a Winsorized mean, a mean, a median or a mode.
[0616] Referring to Block 542-1, the plurality of filters further comprises (3) a measure of central tendency-plus-deviation bin-level sequence ratio filter that is fired when a measure of central tendency of the plurality of bin-level sequence ratios corresponding to the subset of bins encompassed by the respective segment fails to satisfy one or more measure of central tendency-plus-deviation bin-level sequence ratio thresholds. The one or more measure of central tendency-plus-deviation bin-level copy ratio thresholds are derived from (i) a measure of central tendency of the bin-level sequence ratios corresponding to the plurality of bins that map to the same chromosome of the human reference genome as the respective segment, and (ii) a measure of dispersion across the bin-level sequence ratios corresponding to the plurality of bins that map to the respective chromosome. The (i) measure of central tendency is calculated using all of the bins that map to the respective chromosome, including the bins encompassed by the respective segment under investigation.
[0617] In some embodiments, the measure of dispersion across the bin-level sequence ratios in the (3) measure of central tendency-plus-deviation filter is a variance, standard deviation, or interquartile range across the bin-level copy ratios. In some embodiments, the measure of dispersion is a median of a plurality of absolute deviations, where each absolute deviation corresponds to a bin in the plurality of bins that map to the chromosome and is calculated by subtracting the “chromosome sequence ratio” (e.g., the median of all bin-level sequence ratios for the plurality of bins in the chromosome) from each bin’s sequence ratio.
[0618] In some embodiments, the one or more measures of central tendency-plus-deviation bin-level sequence ratio thresholds (e.g., for amplifications) is a sum of (i) a measure of central tendency value of the bin-level sequence ratios corresponding to the plurality of bins that map to the same chromosome (e.g., the “chromosome sequence ratio”), and (ii) the measure of central tendency value of a plurality of absolute dispersions, where each absolute dispersion is determined using a comparison (e.g., a subtraction) between each bin-level sequence ratio corresponding to each bin in the plurality of bins that map to the same chromosome as the respective segment, and the measure of central tendency value of the bin-level sequence ratios measured in (i). The measure of central tendency of the plurality of bin-level sequence ratios corresponding to the subset of bins encompassed by the respective segment fails to satisfy the one or more measure of central tendency-plus-deviation bin-level sequence ratio thresholds when the measure of central tendency of the plurality of bin-level sequence ratios corresponding to the subset of bins encompassed by the respective segment is lower than the one or more measure of central tendency-plus-deviation bin-level sequence ratio thresholds.
[0619] For example, in some embodiments, a segment annotated with an amplification status will pass the (3) filter if the median copy ratio of all bins encompassed in the segment is equal to or higher than the median plus the median absolute deviation (MAD) of all bins’ copy ratios on the same chromosome.
[0620] In some embodiments, the one or more measure of central tendency-plus-deviation bin-level sequence ratio thresholds (e.g., for deletions) comprises (i) a measure of central tendency value of the bin-level sequence ratios corresponding to the plurality of bins that map to the same chromosome (e.g., the “chromosome sequence ratio”), minus (ii) the measure of central tendency value of a plurality of absolute dispersions, where each absolute dispersion is determined using a comparison (e.g., a subtraction) between each bin-level sequence ratio corresponding to each bin in the plurality of bins that map to the same chromosome as the respective segment, and the measure of central tendency value of the bin-level sequence ratios measured in (i). The measure of central tendency of the plurality of bin-level sequence ratios corresponding to the subset of bins encompassed by the respective segment fails to satisfy the one or more measure of central tendency-plus-deviation bin-level sequence ratio thresholds when the measure of central tendency of the plurality of bin-level sequence ratios corresponding to the subset of bins encompassed by the respective segment is higher than the one or more measure of central tendency-plus-deviation bin-level sequence ratio thresholds.
[0621] In some such embodiments, the one or more measure of central tendency-plus-deviation bin-level sequence ratio thresholds is the measure of central tendency value of the bin-level sequence ratios corresponding to the plurality of bins that map to the same chromosome, minus the measure of central tendency value of the plurality of absolute dispersions multiplied by a factor k. In some embodiments, k is between 0.1 and 0.95, between 0.3 and 0.9, between 0.5 and 0.85, between 0.65 and 0.8, or between 0.73 and 0.77.
[0622] For example, in some embodiments, a segment annotated with a deletion status will pass the (3) filter if the median copy ratio of all bins encompassed in the segment is less than or equal to the median minus 0.75 of the median absolute deviation (MAD) of all bins’ copy ratios on the same chromosome.
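As an illustrative, non-limiting sketch, the (3) filter can be implemented using the median and the median absolute deviation (MAD) of all bins on the chromosome, as in the examples above. The function names and status strings are assumptions; the default factor k of 0.75 for deletions is the example value given above.

```python
from statistics import median
from typing import Sequence


def median_absolute_deviation(values: Sequence[float]) -> float:
    """Median of the absolute deviations from the median."""
    center = median(values)
    return median([abs(v - center) for v in values])


def mad_filter_fired(segment_bin_ratios: Sequence[float],
                     chromosome_bin_ratios: Sequence[float],
                     status: str,
                     k_deletion: float = 0.75) -> bool:
    """Return True when filter (3) is fired for the annotated segment.

    chromosome_bin_ratios includes the bins of the segment under test.
    """
    chromosome_median = median(chromosome_bin_ratios)
    chromosome_mad = median_absolute_deviation(chromosome_bin_ratios)
    segment_median = median(segment_bin_ratios)
    if status == "amplified":
        # Pass requires segment median >= chromosome median + MAD.
        return segment_median < chromosome_median + chromosome_mad
    if status == "deleted":
        # Pass requires segment median <= chromosome median - k * MAD.
        return segment_median > chromosome_median - k_deletion * chromosome_mad
    raise ValueError("status must be 'amplified' or 'deleted'")
```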
[0623] Referring to Block 544-1, in some embodiments, the plurality of filters further comprises (4) a segment-level sequence ratio filter that is fired when the segment-level sequence ratio corresponding to the respective segment fails to satisfy one or more segment-level sequence ratio thresholds.
[0624] In some embodiments, the segment-level sequence ratio corresponding to the respective segment fails to satisfy one or more segment-level sequence ratio thresholds when the segment-level sequence ratio is lower than a segment-level sequence ratio amplification threshold. In some such embodiments, a segment-level sequence ratio amplification threshold is between -0.5 and 5, between -0.1 and 3, between -0.047 and 1.6, or between 0 and 0.5.
[0625] In some alternative embodiments, the segment-level sequence ratio corresponding to the respective segment fails to satisfy one or more segment-level sequence ratio thresholds when the segment-level sequence ratio is higher than a segment-level sequence ratio deletion threshold. In some such embodiments, a segment-level sequence ratio deletion threshold is between -5 and 0.5, between -2 and 0, between -1 and -0.2, or between -0.75 and -0.25.
[0626] For example, in some embodiments, a segment annotated with an amplification status will pass the (4) segment-level sequence ratio filter if the segment’s copy ratio is greater than 0.03, and a segment annotated with a deletion status will pass the (4) segment-level sequence ratio filter if the segment’s copy ratio is less than -0.5. In some embodiments, the amplification and/or deletion thresholds are specified by the user or practitioner. In some embodiments, the amplification and/or deletion thresholds are optimized for improved specificity and sensitivity for one or more test samples.
[0627] In some embodiments, the threshold for the (4) segment-level sequence ratio filter is determined by (i) estimating a circulating tumor fraction for the liquid biopsy sample, and (ii) calculating an expected log2 copy ratio for a high copy gain or deletion, where the expected log2 copy ratio is used as the threshold. In some embodiments, a high copy gain is at least 4 copies. In some embodiments, a high copy gain is at least 5 copies. In some embodiments, a high copy gain is at least 6 copies. In some embodiments, a high copy gain is at least 7 copies. In some embodiments, a high copy gain is at least 8 copies. In some embodiments, a high copy gain is at least 9 copies. In some embodiments, a high copy gain is at least 10 copies.
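As an illustrative, non-limiting sketch, an expected log2 copy ratio can be computed from a circulating tumor fraction estimate under a simple two-component mixture model. The mixture formula, the assumption of a diploid (two-copy) non-tumor background, and the function name are assumptions made for illustration; the disclosure does not specify how the expected log2 copy ratio is calculated.

```python
import math


def expected_log2_copy_ratio(tumor_fraction: float,
                             tumor_copies: float,
                             normal_copies: float = 2.0) -> float:
    """Expected log2 copy ratio for a locus in a tumor/normal cfDNA mixture.

    Assumed model: observed copies = tf * tumor_copies + (1 - tf) * normal_copies,
    expressed relative to normal_copies.
    """
    if not 0.0 <= tumor_fraction <= 1.0:
        raise ValueError("tumor fraction must be in the interval [0, 1]")
    mixed_copies = (tumor_fraction * tumor_copies
                    + (1.0 - tumor_fraction) * normal_copies)
    if mixed_copies <= 0:
        # Homozygous deletion in a pure tumor sample: the ratio is unbounded below.
        return float("-inf")
    return math.log2(mixed_copies / normal_copies)
```

Under this assumed model, a low tumor fraction compresses the expected ratio toward zero, which is why a tumor-fraction-dependent threshold may be preferable to a fixed one.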
[0628] In some embodiments, an additional filter is used that filters out candidate segments that are longer than a threshold length. In some embodiments, the threshold length is determined empirically. In some embodiments, the threshold length is at least 15 Mb. In some embodiments, the threshold length is at least 20 Mb. In some embodiments, the threshold length is at least 25 Mb. In some embodiments, the threshold length is at least 30 Mb. In some embodiments, the threshold length is at least 35 Mb. In some embodiments, the threshold length is no more than 50 Mb. In some embodiments, the threshold length is no more than 40 Mb. In some embodiments, the threshold length is no more than 30 Mb. In some embodiments, the threshold length is from 15 Mb to 50 Mb.
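As an illustrative, non-limiting sketch, the segment length filter can be expressed as a comparison of the segment span against a threshold. The function name, the base-pair coordinate convention, and the default of 25 Mb (one value within the ranges recited above) are assumptions made for illustration.

```python
def length_filter_fired(segment_start: int,
                        segment_end: int,
                        max_length_bp: int = 25_000_000) -> bool:
    """Return True when the candidate segment is longer than the threshold length."""
    if segment_end < segment_start:
        raise ValueError("segment end must not precede segment start")
    return (segment_end - segment_start) > max_length_bp
```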
[0629] In some embodiments, one or more of the validation filters disclosed herein are optionally included in the plurality of validation filters applied to the first dataset. For example, in some embodiments, the plurality of validation filters comprises less than one, less than two, less than three, or less than four of the validation filters described in the present disclosure. In some embodiments, any one or more of the validation filters described herein can include any modifications, substitutions, additions and/or combinations thereof, as will be apparent to one skilled in the art.
[0630] Validating Copy Number Variations.
[0631] Referring to Block 546-1, the method further comprises, when a filter in the plurality of filters is fired, rejecting the copy number status annotation of the respective segment; and when no filter in the plurality of filters is fired, validating the copy number status annotation of the respective segment.
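As an illustrative, non-limiting sketch, the reject-if-any-filter-fires rule can be written as follows. The function name and the mapping of hypothetical filter labels to boolean outcomes are assumptions made for illustration.

```python
from typing import Mapping


def validate_annotation(filter_fired: Mapping[str, bool]) -> str:
    """Reject the annotation if any filter fired; otherwise validate it.

    filter_fired maps a filter label to True when that filter was fired.
    """
    return "rejected" if any(filter_fired.values()) else "validated"
```

For example, a segment for which only the confidence filter fired would be rejected, while a segment for which no filter fired would be validated.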
[0632] In some embodiments, validation of an amplification status requires satisfaction of each filter in a plurality of amplification filters, and validation of a deletion status requires satisfaction of each filter in a plurality of deletion filters. Thus, all filters in the plurality of filters applied to the first dataset must be appropriate for the type of copy number status annotation to be validated.
[0633] For example, referring to Block 548-1, in some embodiments, the method further comprises validating an amplification status of a respective segment in the plurality of segments, by applying the first dataset to an algorithm having a plurality of filters. The plurality of filters comprises (1) a measure of central tendency bin-level sequence ratio filter that is fired when a measure of central tendency of the plurality of bin-level sequence ratios corresponding to the subset of bins encompassed by the respective segment is lower than a bin-level sequence ratio amplification threshold; (2) a confidence filter that is fired when the lower bound of the segment-level measure of dispersion corresponding to the respective segment is lower than the confidence threshold; and (3) a measure of central tendency-plus-deviation bin-level sequence ratio filter that is fired when a measure of central tendency of the plurality of bin-level sequence ratios corresponding to the subset of bins encompassed by the respective segment is lower than the measure of central tendency-plus-deviation bin-level sequence ratio threshold. When a filter in the plurality of filters is fired, the amplification status of the respective segment is rejected; and when no filter in the plurality of filters is fired, the amplification status of the respective segment is validated.
[0634] Referring to Block 550-1, in some alternative embodiments, the method further comprises validating a deletion status of a respective segment in the plurality of segments, by applying the first dataset to an algorithm having a plurality of filters. The plurality of filters comprises (1) a measure of central tendency bin-level sequence ratio filter that is fired when a measure of central tendency of the plurality of bin-level sequence ratios corresponding to the subset of bins encompassed by the respective segment is higher than a bin-level sequence ratio deletion threshold; (2) a confidence filter that is fired when the upper bound of the segment-level measure of dispersion corresponding to the respective segment is higher than the confidence threshold; and (3) a measure of central tendency-plus-deviation bin-level sequence ratio filter that is fired when a measure of central tendency of the plurality of bin-level sequence ratios corresponding to the subset of bins encompassed by the respective segment is higher than the measure of central tendency-plus-deviation bin-level sequence ratio threshold. When a filter in the plurality of filters is fired, the deletion status of the respective segment is rejected; and when no filter in the plurality of filters is fired, the deletion status of the respective segment is validated.
[0635] In some embodiments, the plurality of filters comprises a plurality of amplification filters and a plurality of deletion filters.
[0636] In some embodiments, the copy number status annotation is “neutral”, and the validating the copy number status annotation comprises firing at least one filter in the plurality of amplification filters and firing at least one filter in the plurality of deletion filters.
[0637] In some embodiments, a segment is flagged as ambiguous if less than a threshold number of filters is fired. For example, in some embodiments, a segment is flagged as ambiguous if less than 4, less than 3, or less than 2 filters are fired.
[0638] In some embodiments, a validated copy number variation for a segment is assigned to the segment and to each bin encompassed by the respective segment.
[0639] Applications to Precision Oncology.
[0640] Referring to Block 552-1, in some embodiments, the method further comprises, after the validating, applying the validated copy number variation of the respective segment to a diagnostic assay.
[0641] For example, in some embodiments, the method further comprises treating a patient with a cancer containing a copy number variation of a target gene by determining whether the copy number variation of the target gene is a focal copy number variation by validating the copy number variation in the patient, thus determining whether the patient has an aggressive form of the cancer associated with a focal copy number variation of the target gene. The method further comprises, when the patient has the aggressive form of cancer associated with focal copy number variation of the target gene, administering a first therapy for the aggressive form of the cancer to the patient, and when the patient does not have the aggressive form of cancer associated with focal copy number variation of the target gene, administering a second therapy for a less aggressive form of the cancer to the patient.
[0642] In some such embodiments, the first therapy is selected from Table 2. In some embodiments, the first therapy is trastuzumab, lapatinib, or crizotinib.
[0643] Table 2. Matched therapies for selected targeted panel genes.
[0644] In some embodiments, the method further comprises generating a report (e.g., for use by a physician) comprising the validated copy number status of the respective segment for the biological sample of the respective test subject. In some such embodiments, the generated report further comprises matched therapies (e.g., treatments and/or clinical trials) based on the copy number status of the respective segment.
[0645] In some embodiments, the method further comprises disease screening and/or monitoring over a plurality of time points. For example, in some embodiments, the method is used for monitoring disease progression and/or recurrence after treatment, for assessing the efficacy of a treatment, and/or for performing comparative studies using liquid biopsy samples and matched solid tissue samples.
[0646] For example, in some embodiments, the method further comprises obtaining a second dataset that comprises a plurality of bin-level sequence ratios, each respective bin-level sequence ratio in the plurality of bin-level sequence ratios corresponding to a respective bin in a plurality of bins, where each respective bin in the plurality of bins represents a corresponding region of a human reference genome, and each respective bin-level sequence ratio in the plurality of bin-level sequence ratios is determined from a sequencing of a plurality of cell-free nucleic acids in a second liquid biopsy sample of the test subject and one or more reference samples. The second dataset further comprises a plurality of segment-level sequence ratios, each respective segment-level sequence ratio in the plurality of segment-level sequence ratios corresponding to a segment in a plurality of segments, where each respective segment in the plurality of segments represents a corresponding region of the human reference genome encompassing a subset of adjacent bins in the plurality of bins, and each respective segment-level sequence ratio in the plurality of segment-level sequence ratios is determined from a measure of central tendency of the plurality of bin-level sequence ratios corresponding to the subset of adjacent bins encompassed by the respective segment. The second dataset also includes a plurality of segment-level measures of dispersion, each respective segment-level measure of dispersion in the plurality of segment-level measures of dispersion (i) corresponding to a respective segment in the plurality of segments and (ii) determined using the plurality of bin-level sequence ratios corresponding to the subset of adjacent bins encompassed by the respective segment.
[0647] The method further includes validating a copy number status annotation of a respective segment in the plurality of segments that is annotated with a copy number variation by applying the second dataset to an algorithm having a plurality of filters. The plurality of filters can include any of the filters disclosed herein.
[0648] In some such embodiments, the first liquid biopsy sample is obtained at a first time point and the second liquid biopsy sample of the test subject is obtained at a second time point. For example, in some embodiments, the second time point is at least 1 day, at least 1 week, at least 1 month, at least 2 months, at least 3 months, at least 6 months, or at least 1 year after the first time point.
[0649] Longitudinal Testing
[0650] In some embodiments, one or more liquid biopsy assays described herein may be used to analyze specimens from a patient taken over the course of the patient’s treatment.
For example, a blood specimen may be obtained periodically and/or upon indication of response to therapy, disease relapse, and/or disease progression. In some embodiments, the one or more liquid biopsy assays may be used on a specimen collected from the patient each month, every two months, every three months, every four months, every five months, every 6-12 months, and so forth. In some embodiments, the longitudinal use of liquid biopsy assays may be used to track clonal evolution to identify resistance mutations. In some embodiments, the longitudinal use of liquid biopsy assays may be used to track evolution of mutations, such as EGFR or APC mutations.
[0651] In some embodiments, longitudinal use of liquid biopsy assays may be used to detect emerging therapy resistance mechanisms. In some embodiments, longitudinal use of liquid biopsy assays may be used to detect AR gene alterations. In some embodiments, longitudinal use of liquid biopsy assays may be used to detect WNT pathway alterations in mCRPC associated with resistance to enzalutamide and abiraterone. In some embodiments, longitudinal use of liquid biopsy assays may be used to detect ER mutations, such as ER mutations associated with resistance to endocrine therapy in breast cancer. In some embodiments, longitudinal use of liquid biopsy assays may be used to detect EGFR mutations responsible for anti-EGFR therapy resistance (e.g., T790M) in NSCLC. In some embodiments, longitudinal use of liquid biopsy assays may be used to detect KRAS, NRAS, MET, ERBB2, FLT3, or EGFR mutations associated with primary or acquired resistance to EGFR inhibitors in colorectal cancer. In some embodiments, longitudinal use of liquid biopsy assays may be used to assess gene alterations from tumor cells shed by primary tumor and metastatic sites.
[0652] In some embodiments the one or more blood specimens may be collected from the patient in a home-based environment. For example, the blood specimens may be collected by a mobile phlebotomist.
[0653] For example, a first blood specimen, a second blood specimen, and a third blood specimen may be collected from a patient during the course of treatment.
[0654] In some embodiments, the first blood specimen may be analyzed using at least an improvement in somatic variant identification, e.g., as described herein in the section entitled “Systems and Methods for Improved Validation of Somatic Sequence Variants” and/or “Variant Identification,” the second blood specimen may be analyzed using at least an improvement in somatic variant identification, e.g., as described herein in the section entitled “Systems and Methods for Improved Validation of Somatic Sequence Variants” and/or “Variant Identification,” and the third blood specimen may be analyzed using at least an improvement in somatic variant identification, e.g., as described herein in the section entitled “Systems and Methods for Improved Validation of Somatic Sequence Variants” and/or “Variant Identification.” [0655] In some embodiments, the first blood specimen may be analyzed using at least an improvement in focal copy number identification, e.g., as described herein in the section entitled “Systems and Methods for Improved Validation of Copy Number Variation” and/or “Copy Number Variation,” the second blood specimen may be analyzed using at least an improvement in focal copy number identification, e.g., as described herein in the section entitled “Systems and Methods for Improved Validation of Copy Number Variation” and/or “Copy Number Variation,” and the third blood specimen may be analyzed using at least an improvement in focal copy number identification, e.g., as described herein in the section entitled “Systems and Methods for Improved Validation of Copy Number Variation” and/or “Copy Number Variation.”
[0656] In some embodiments, the first blood specimen may be analyzed using at least an improvement in circulating tumor fraction determination, e.g., as described herein in the section entitled “Systems and Methods for Improved Circulating Tumor Fraction Estimates” and/or “Circulating Tumor Fraction,” the second blood specimen may be analyzed using at least an improvement in circulating tumor fraction determination, e.g., as described herein in the section entitled “Systems and Methods for Improved Circulating Tumor Fraction Estimates” and/or “Circulating Tumor Fraction,” and the third blood specimen may be analyzed using at least an improvement in circulating tumor fraction determination, e.g., as described herein in the section entitled “Systems and Methods for Improved Circulating Tumor Fraction Estimates” and/or “Circulating Tumor Fraction.”
[0657] Diagnostic Applications.
[0658] Referring now to Block 600-1, the present disclosure also provides a method for treating a patient with a cancer containing a copy number variation of a target gene.
[0659] Referring to Block 602-1, the method comprises determining whether the patient has an aggressive form of cancer associated with a focal copy number variation of the target gene.
[0660] A focal copy number variation of a target gene can be associated with, for example, recurrence, high-grade forms of a cancer, aggressive forms of a cancer, tumor growth, and/or other aberrations. See, for example, Nord et al., Int. J. Cancer, 126, 1390-1402 (2010), which is hereby incorporated herein by reference in its entirety.
[0661] In some embodiments, the target gene is any of the embodiments described above. For example, referring to Block 604-1, in some embodiments, the target gene is any of the genes listed in Table 1. Referring to Block 606-1, in some embodiments, the target gene is MET, EGFR, ERBB2, CD274, CCNE1, MYC, BRCA1 or BRCA2.
[0662] Referring to Block 608-1, the method further comprises obtaining a first biological sample of the cancer from the patient. In some embodiments, the biological sample is a liquid biopsy sample or a solid tissue biological sample. In some embodiments, the biological sample is a liquid biopsy sample or a tumor biopsy sample. In some embodiments, the biological sample comprises (e.g., is obtained, prepared, sequenced, and/or analyzed by) any of the methods and/or embodiments described above, or any modifications, substitutions, and/or combinations thereof as will be apparent to one skilled in the art.
[0663] Referring to Block 610-1, the method further comprises performing copy number variation analysis on the first biological sample to identify the copy number status of the target gene in the cancer, where the copy number variation analysis generates a first dataset.
[0664] The first dataset includes a plurality of bin-level sequence ratios, each respective bin-level sequence ratio in the plurality of bin-level sequence ratios corresponding to a respective bin in a plurality of bins, where each respective bin in the plurality of bins represents a corresponding region of a human reference genome, and each respective bin-level sequence ratio in the plurality of bin-level sequence ratios is determined from a sequencing of a plurality of nucleic acids in the first biological sample of the cancer from the patient and one or more reference samples.
[0665] The first dataset also includes a plurality of segment-level sequence ratios, each respective segment-level sequence ratio in the plurality of segment-level sequence ratios corresponding to a segment in a plurality of segments, where each respective segment in the plurality of segments represents a corresponding region of the human reference genome encompassing a subset of adjacent bins in the plurality of bins, and each respective segment-level sequence ratio in the plurality of segment-level sequence ratios is determined from a measure of central tendency of the plurality of bin-level sequence ratios corresponding to the subset of adjacent bins encompassed by the respective segment.
[0666] The first dataset further comprises a plurality of segment-level measures of dispersion, each respective segment-level measure of dispersion in the plurality of segment-level measures of dispersion (i) corresponding to a respective segment in the plurality of segments and (ii) determined using the plurality of bin-level sequence ratios corresponding to the subset of adjacent bins encompassed by the respective segment.
[0667] Methods for obtaining the first dataset, including binning, segmenting, calculating sequence ratios and measures of dispersion, normalizing and/or preprocessing, can comprise any of the methods and/or embodiments described above, or any modifications, substitutions, and/or combinations thereof as will be apparent to one skilled in the art.
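As an illustration only, the relationship between bin-level sequence ratios, segment-level sequence ratios, and segment-level measures of dispersion can be sketched as follows. The sketch assumes the median as the measure of central tendency and the median absolute deviation as the measure of dispersion; both choices, and the names `segment_stats`, `bin_ratios`, and `segments`, are hypothetical and are not specified by the disclosure.

```python
from statistics import median


def segment_stats(bin_ratios, segments):
    """Summarize each segment from the bin-level ratios it encompasses.

    bin_ratios: list of bin-level sequence ratios in genomic order.
    segments:   list of (start, end) half-open index ranges, each covering
                a run of adjacent bins (hypothetical representation).
    """
    summaries = []
    for start, end in segments:
        values = bin_ratios[start:end]
        # Assumed measure of central tendency: the median of the bin ratios.
        center = median(values)
        # Assumed measure of dispersion: median absolute deviation (MAD).
        dispersion = median([abs(v - center) for v in values])
        summaries.append(
            {
                "start": start,
                "end": end,
                "n_bins": len(values),
                "segment_ratio": center,
                "segment_dispersion": dispersion,
            }
        )
    return summaries
```

Any other measure of central tendency (e.g., a mean) or dispersion (e.g., a standard deviation or confidence interval) could be substituted without changing the structure of the dataset.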
[0668] Referring to Block 612-1, the method further comprises determining whether the copy number variation of the target gene is a focal copy number variation by applying the first dataset to an algorithm having a plurality of copy number variation filters.
[0669] Referring to Block 614-1, in some embodiments, the plurality of copy number variation filters comprises a measure of central tendency bin-level sequence ratio filter that is fired when a measure of central tendency of the plurality of bin-level sequence ratios corresponding to the subset of bins encompassed by the respective segment fails to satisfy one or more bin-level sequence ratio thresholds, thus determining that the copy number variation of the target gene is not a focal copy number variation when fired.
[0670] Referring to Block 616-1, in some embodiments, the plurality of copy number variation filters further comprises a confidence filter that is fired when the segment-level measure of dispersion corresponding to the respective segment fails to satisfy a confidence threshold, thus determining that the copy number variation of the target gene is not a focal copy number variation when fired.
[0671] Referring to Block 618-1, in some embodiments, the plurality of copy number variation filters further comprises a measure of central tendency-plus-deviation bin-level sequence ratio filter that is fired when a measure of central tendency of the plurality of bin-level sequence ratios corresponding to the subset of bins encompassed by the respective segment fails to satisfy one or more measure of central tendency-plus-deviation bin-level sequence ratio thresholds. The one or more measure of central tendency-plus-deviation bin-level copy ratio thresholds are derived from (i) a measure of the bin-level sequence ratios corresponding to the plurality of bins that map to the same chromosome of the human reference genome as the respective segment, and (ii) a measure of dispersion across the bin-level sequence ratios corresponding to the plurality of bins that map to the respective chromosome. The method further comprises determining that the copy number variation of the target gene is not a focal copy number variation when fired.
[0672] Referring to Block 620-1, in some embodiments, the plurality of copy number variation filters further comprises a segment-level sequence ratio filter that is fired when the segment-level sequence ratio corresponding to the respective segment fails to satisfy one or more segment-level sequence ratio thresholds, thus determining that the copy number variation of the target gene is not a focal copy number variation when fired.
[0673] In some embodiments, the plurality of copy number variation filters comprises any of the methods and/or embodiments described above, or any modifications, substitutions, and/or combinations thereof as will be apparent to one skilled in the art.
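The four filters described with reference to Blocks 614-1 through 620-1 can be sketched, under stated assumptions, as a cascade in which any fired filter rejects the focal call. The sketch below handles only the amplification direction, uses the median, mean, and population standard deviation as the measures of central tendency and dispersion, and uses placeholder threshold values; all names and numbers are hypothetical and are not values from the disclosure.

```python
from dataclasses import dataclass
from statistics import mean, median, pstdev


@dataclass
class Thresholds:
    # All values are illustrative placeholders, not values from the disclosure.
    min_bin_median: float = 0.8        # bin-level central tendency threshold
    max_dispersion: float = 0.5        # confidence threshold on segment dispersion
    deviation_multiplier: float = 2.0  # multiples of chromosome-level dispersion
    min_segment_ratio: float = 0.8     # segment-level sequence ratio threshold


def evaluate_focal_amplification(segment_bins, segment_ratio, segment_dispersion,
                                 chromosome_bins, thresholds=None):
    """Return (is_focal, fired_filters) for one segment covering the target gene."""
    t = thresholds or Thresholds()
    fired = []
    bin_center = median(segment_bins)

    # Measure of central tendency bin-level sequence ratio filter.
    if bin_center < t.min_bin_median:
        fired.append("bin_central_tendency")

    # Confidence filter on the segment-level measure of dispersion.
    if segment_dispersion > t.max_dispersion:
        fired.append("confidence")

    # Central tendency-plus-deviation filter, derived from all bins that map
    # to the same chromosome as the segment.
    cutoff = mean(chromosome_bins) + t.deviation_multiplier * pstdev(chromosome_bins)
    if bin_center < cutoff:
        fired.append("central_tendency_plus_deviation")

    # Segment-level sequence ratio filter.
    if segment_ratio < t.min_segment_ratio:
        fired.append("segment_ratio")

    # The variation is treated as focal only when no filter has fired.
    return (not fired), fired
```

Returning the list of fired filters alongside the decision makes it straightforward to report why a candidate was rejected.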
[0674] Referring to Block 622-1, the method further comprises, when the patient has the aggressive form of cancer associated with focal copy number variation of the target gene, administering a first therapy for the aggressive form of the cancer to the patient, and when the patient does not have the aggressive form of cancer associated with focal copy number variation of the target gene, administering a second therapy for a less aggressive form of the cancer to the patient.
[0675] Referring to Block 624-1 and Block 626-1, in some embodiments, the first therapy is selected from Table 2. In some such embodiments, the first therapy is trastuzumab, lapatinib, or crizotinib.
[0676] Referring to Block 628-1 and Block 630-1, in some embodiments, the method further comprises generating a report (e.g., for use by a physician) comprising the copy number status of the target gene. In some such embodiments, the generated report further comprises matched therapies (e.g., treatments and/or clinical trials) based on the copy number status of the respective segment. When the patient has the aggressive form of cancer associated with focal copy number variation of the target gene, a first therapy for the aggressive form of the cancer is matched to the patient, and when the patient does not have the aggressive form of cancer associated with focal copy number variation of the target gene, a second therapy for a less aggressive form of the cancer is matched to the patient.
[0677] The present disclosure also provides a computer system comprising one or more processors and a non-transitory computer-readable medium including computer-executable instructions that, when executed by the one or more processors, cause the processors to perform any of the methods and embodiments disclosed herein.
[0678] The present disclosure also provides a non-transitory computer-readable storage medium having stored thereon program code instructions that, when executed by a processor, cause the processor to perform any of the methods and embodiments disclosed herein.
[0679] In some embodiments, the methods described herein include generating a clinical report 139-3 (e.g., a patient report), providing clinical support for personalized cancer therapy, and/or using the information curated from sequencing of a liquid biopsy sample, as described above. In some embodiments, the report is provided to a patient, physician, medical personnel, or researcher in a digital copy (for example, a JSON object, a pdf file, or an image on a website or portal) or a hard copy (for example, printed on paper or another tangible medium). A report object, such as a JSON object, can be used for further processing and/or display. For example, information from the report object can be used to prepare a clinical laboratory report for return to an ordering physician. In some embodiments, the report is presented as text, as audio (for example, recorded or streaming), as images, or in another format and/or any combination thereof.
[0680] The report includes information related to the specific characteristics of the patient’s cancer, e.g., detected genetic variants, epigenetic abnormalities, associated oncogenic pathogenic infections, and/or pathology abnormalities. In some embodiments, other characteristics of a patient’s sample and/or clinical records are also included in the report. For example, in some embodiments, the clinical report includes information on clinical variants, e.g., one or more of copy number variants (e.g., for actionable genes CCNE1, CD274(PD-L1), EGFR, ERBB2(HER2), MET, MYC, BRCA1, and/or BRCA2), fusions, translocations, and/or rearrangements (e.g., in actionable genes ALK, ROS1, RET, NTRK1, FGFR2, FGFR3, NTRK2 and/or NTRK3), pathogenic single nucleotide polymorphisms, insertion-deletions (e.g., somatic/tumor and/or germline/normal), therapy biomarkers, microsatellite instability status, and/or tumor mutational burden.
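A report object of the kind described above might, for example, be serialized as JSON along the following lines. This is a minimal sketch: the function name `build_report` and every field name are hypothetical, and the disclosure does not define a report schema.

```python
import json


def build_report(patient_id, copy_number_calls, msi_status=None, tmb=None):
    """Serialize a minimal report object as a JSON string.

    copy_number_calls: dict mapping a gene symbol to a copy number status
    string. All field names here are illustrative assumptions.
    """
    report = {
        "patient_id": patient_id,
        "copy_number_variants": [
            {"gene": gene, "status": status}
            for gene, status in sorted(copy_number_calls.items())
        ],
        "microsatellite_instability_status": msi_status,
        "tumor_mutational_burden": tmb,
    }
    # sort_keys gives a stable layout for downstream processing and display.
    return json.dumps(report, indent=2, sort_keys=True)
```

The resulting string can be stored, rendered in a portal, or parsed again with `json.loads` for further processing.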
[0681] Conversion of solid tumor test to liquid biopsy test. In one embodiment, the solid tissue sample is insufficient for NGS testing (for example, the sample is too small or too degraded, the amount or quality of nucleic acids extracted from the sample does not result in quality NGS results that would result in reliable determination of variants and/or other genetic characteristics of the sample), and the physician or patient may decide to convert the solid tissue test that was ordered to a liquid biopsy test to be performed on a liquid biopsy sample collected from the same patient. The resulting report and/or display of the results on a portal may include an “xF Conversion Badge” to distinguish any order that has been converted from solid tissue test to a liquid biopsy test (compared to, for example, a liquid biopsy test that was not initially ordered as a solid tissue test). This will allow a user to identify which orders have been converted by this process, and distinguish between orders that were intentionally placed for the liquid biopsy panel.
[0682] Longitudinal Reporting. In various embodiments, a report may include and/or compare the results of multiple liquid biopsy tests and/or solid tumor tests (for example, multiple tests associated with the same patient). The results of multiple liquid biopsy tests and/or solid tumor tests may be displayed on a portal in a variety of configurations that may be selected and/or customized by the viewer. The tests may have been performed at different times, and the samples on which the tests were performed may have been collected at different times.
[0683] Download result. Clinical and/or molecular data associated with a patient (for example, information that would be included in the report), may be aggregated and made available via the portal. Any portion of the report data may be available for download (for example, as a CSV file) by the physician and/or patient. In various embodiments, the data may include data related to genetic variants, RNA expression levels, immunotherapy markers (including MSI and TMB), RNA fusions, etc. In one embodiment, if a physician or medical facility has ordered multiple tests (all tests may be associated with the same patient or tests may be associated with multiple patients), results associated with more than one test may be aggregated into a single file for downloading.
Systems and Methods for Improved Validation of Somatic Sequence Variants
[0684] Below, systems and methods for improving validation of somatic sequence variants, e.g., within the context of the methods and systems described above, are described with reference to Figures 5A2 and 5B2.
[0685] Many of the embodiments described below, in conjunction with Figures 5A2 and 5B2, relate to analyses performed using sequencing data for cfDNA obtained from a liquid biopsy sample of a cancer patient. Generally, these embodiments are independent and, thus, not reliant upon any particular DNA sequencing methods. However, in some embodiments, the methods described below include generating the sequencing data.
[0686] For example, provided herein is a generalized application of Bayes’ Theorem through the likelihood ratio test for diagnostic assays that allows dynamic calibration of filtering thresholds for somatic sequence variant detection in a patient, in accordance with some embodiments of the present disclosure. These thresholds are based on a sample-specific error rate, an error rate from a pool of process-matched healthy control samples, and/or a cohort of human solid tumors that inform the probability models. The method takes the form of the following formula:
where:
odds(post-test) is the post-test odds of a variant being positive given the application of Bayes’ Theorem,
odds(pre-test) is the pre-test odds of a positive given the cancer type of the patient and the prevalence (measured as a fraction) of alterations detected in that gene or within a specific genomic window within a reference population with the cancer type,
sensitivity is the sensitivity bin nearest that measured for the assay at a proposed circulating variant fraction,
specificity is a term to be solved for, denoting the level of uncertainty that is acceptable given some fixed value of odds(post-test); specificity can be replaced by the quantile of the beta-binomial distribution (see below) defined by the within-sample trinucleotide error rate and the background base-position-specific error rate,
d(beta-binomial) is a beta-binomial distribution defined by specified parameters (alpha, beta, Pr), and
Min(AO) is the minimum number of alternate alleles observed for a given sample.
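For orientation, the standard likelihood-ratio form of Bayes’ Theorem for a diagnostic test relates these terms as shown below. This is offered as an assumption about the intended relationship, consistent with the definitions above, rather than as the exact expression of the disclosure; the second line is the same relation rearranged for specificity at a fixed post-test odds.

```latex
\begin{align}
\operatorname{odds}(\text{post-test})
  &= \operatorname{odds}(\text{pre-test}) \times
     \frac{\text{sensitivity}}{1 - \text{specificity}} \\
\text{specificity}
  &= 1 - \text{sensitivity} \times
     \frac{\operatorname{odds}(\text{pre-test})}{\operatorname{odds}(\text{post-test})}
\end{align}
```

Under this form, a lower pre-test odds or a lower sensitivity pushes the required specificity toward 1, which in turn demands a higher quantile of the error distribution and therefore more supporting alternate alleles.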
[0687] Given a fixed value for odds(post-test), it is possible to solve instead for specificity or, rather, the minimum acceptable quantile of the beta-binomial error distribution. Therefore, the equation can be reframed as:
Solving for specificity gives the quantile of the beta binomial function which can then be plugged into quantile(beta-binomial) to derive a minimum number of alternative alleles observed at a given depth, or:
Min(AO) = quantile (sensitivity) x
[0688] Determining pre-test probability:
[0689] In some embodiments, pre-test probability, which is related to odds(pre-test), is defined as:
$$\Pr(\text{pre-test}) = \frac{\operatorname{odds}(\text{pre-test})}{1 + \operatorname{odds}(\text{pre-test})}$$
and is determined through historical data derived from matched solid tumor test data. By analyzing an extensive set of cancers and using process matched liquid biopsy and tissue biopsy samples to identify somatic variants with high confidence, it is possible to accurately assess the prevalence of specific variants within a population of advanced human cancers.
For a population of patients most likely to require liquid biopsy type tests, the sampling distribution most closely models the distribution into which any given patient receiving the test will fall. To model this prevalence, there are two factors at play: gene level prevalence, and genomic window level prevalence.
[0690] Assessing prevalence by sliding window segmentation:
[0691] In some embodiments, in order to get an accurate estimate of prevalence, it is critical to stratify the estimated rate of mutation by the mechanism of disease. Gain of function (GOF) mutations tend to cluster in “hotspots,” whereas loss of function (LOF) mutations tend to be scattered throughout a gene and suppress or eliminate a protein’s wild type behaviors. Due to this evolutionary constraint on mutation position, prevalence calculation must take into account whether a gene has a GOF or LOF mechanism of disease. While this cannot be directly analyzed given available data, it is possible to bootstrap this calculation by segmentation of mutational prevalence across exons.
[0692] Based on historical sequencing data, it is possible to bin mutations by exon. In order to assess whether a single exon is enriched for mutations over the rest of the gene (a hotspot or GOF gene), a rolling Poisson test of difference is applied jumping from exon to exon. If one or more exons show a statistically significant deviation from the other exons within the gene, that region is annotated as the window of interest. Prevalence is subsequently calculated as prevalence within the exon(s) encompassing the window of interest.
[0693] If no exons can be shown to be over-represented for mutations, the gene is assumed to have an LOF mechanism of action and the prevalence for the whole gene having an alteration within the specified cancer type is used. When a variant is being assessed for filtering, the prevalence within the pre-specified window or the prevalence within the gene itself is used as the pre-test probability (Pr(pre-test)) for the likelihood ratio test.
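The exon-by-exon enrichment test can be sketched as follows. The disclosure does not specify how the Poisson test of difference is computed; this sketch assumes an exact conditional test for comparing two Poisson rates (conditioning on the total count turns it into a one-sided binomial test) with a Bonferroni correction across exons. The function names, the input layout, and the significance level are hypothetical.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def window_of_interest(exons, alpha=0.05):
    """Return names of exons enriched for mutations relative to the gene.

    exons: list of (name, length_bp, mutation_count) tuples for one gene.
    An empty result means no hotspot was found, in which case the whole-gene
    prevalence would be used (the assumed LOF case).
    """
    total_length = sum(length for _, length, _ in exons)
    total_mutations = sum(count for _, _, count in exons)
    if total_mutations == 0 or len(exons) < 2:
        return []

    enriched = []
    for name, length, count in exons:
        # Under equal per-base mutation rates, the exon's share of all
        # mutations follows Binomial(total_mutations, length / total_length).
        p_null = length / total_length
        p_value = binomial_upper_tail(count, total_mutations, p_null)
        # Bonferroni correction for testing every exon in the gene.
        if p_value < alpha / len(exons):
            enriched.append(name)
    return enriched
```

When the returned list is non-empty, prevalence would be computed within those exons; otherwise the gene-level prevalence serves as the pre-test probability.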
[0694] Referring to Block 500-2, the present disclosure provides a method for validating a somatic sequence variant in a test subject having a cancer condition, at a computer system having one or more processors, and memory storing one or more programs for execution by the one or more processors.
[0695] Referring to Block 502-2, the method comprises obtaining, from a first sequencing reaction, a corresponding sequence of each cell-free DNA fragment in a first plurality of cell-free DNA fragments in a liquid biopsy sample of the test subject, thus obtaining a first plurality of sequence reads, e.g., a plurality of de-duplicated sequence reads, where each sequence read corresponds to a unique cell-free DNA fragment from the sample.
In some embodiments, the first plurality of sequence reads includes at least 1000 sequence reads. In some embodiments, the first plurality of sequence reads includes at least 10,000 sequence reads. In some embodiments, the first plurality of sequence reads includes at least 100,000 sequence reads. In some embodiments, the first plurality of sequence reads includes at least 200,000, 300,000, 400,000, 500,000, 750,000, 1,000,000, 2,500,000, 5,000,000 sequence reads, or more.
[0696] In some embodiments, the liquid biopsy sample is blood. In some embodiments, the liquid biopsy sample comprises blood, whole blood, peripheral blood, plasma, serum, or lymph of the test subject.
[0697] In some embodiments, the cancer condition is a particular type and stage of cancer (e.g., stage 2 lung cancer). Advantageously, the variant filtering methods described herein are superior to filtering methods that simply account for the tumor fraction of a sample. This is achieved, in part, by accounting for the types of mutations found in a particular type of cancer, which improves the quality of the pre-test probability of finding a particular type of variant (e.g., a variant within a particular genomic region) in a sample from a subject with a known type of cancer. Accordingly, in some embodiments, the pre-test probabilities are based on as specific a cancer type as possible, e.g., accounting for one or more of a type of cancer, an origin of the cancer, the stage of the cancer, any previously known genomic variants in the cancer (e.g., whether a breast cancer subject is BRCA1 or BRCA2 positive), a personal characteristic of the subject (e.g., age, gender, race, smoking status, alcohol consumption status, etc.), any pathology classification of the cancer, etc. However, there are practical considerations when determining the level of specificity with which a subject’s cancer should be specified when matching the cancer to a training cohort. For instance, when an insufficient number of matching training samples is available for calculation of pre-test odds, the specificity of the cancer classification should be reduced in order to provide a large enough sample of training data to provide meaningful prior information.
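One way to picture the trade-off between cohort specificity and cohort size is a back-off rule that drops the most specific descriptor until enough training samples remain. The sketch below is an assumption for illustration; the function name, the descriptor ordering, and the `min_samples` cutoff are hypothetical and are not taken from the disclosure.

```python
def select_training_cohort(cohort_sizes, descriptors, min_samples=100):
    """Pick the most specific cohort definition with enough training samples.

    descriptors:  tuple ordered from most general to most specific,
                  e.g. ("breast", "stage 4", "BRCA1+").
    cohort_sizes: dict mapping a descriptor prefix (as a tuple) to the number
                  of training samples available for that definition.
    Returns the chosen prefix, or an empty tuple to indicate that no
    cancer-specific cohort was large enough.
    """
    for depth in range(len(descriptors), 0, -1):
        key = tuple(descriptors[:depth])
        if cohort_sizes.get(key, 0) >= min_samples:
            return key
    return ()
```

For example, with 12 samples for `("breast", "stage 4", "BRCA1+")` and 150 for `("breast", "stage 4")`, the rule falls back to the two-descriptor cohort.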
[0698] In some embodiments, the test subject, the liquid biopsy sample, the cancer condition, and/or methods and systems for obtaining, accessioning, storing, processing, preparing and/or analyzing thereof, comprise any of the embodiments as described above in the present disclosure with reference to Figures 2-4.
[0699] In some embodiments, the first sequencing reaction is a panel-enriched sequencing reaction. For example, in some embodiments, the first sequencing reaction is a panel-enriched sequencing reaction of a first plurality of enriched loci, and each respective locus in the plurality of enriched loci is sequenced at an average unique sequence depth of at least 250x. In some such embodiments, each respective locus in the plurality of enriched loci is sequenced at an average unique sequence depth of at least 1000x. In some embodiments, the first plurality of sequence reads is obtained from ultra-high depth sequencing (e.g., where each locus in a plurality of loci is sequenced at an average coverage of at least 1000x, at least 2500x, or at least 5000x). Example genes that are informative for precision oncology, e.g., when implemented in a liquid biopsy-based assay, are shown in Table 1. In some embodiments, a panel-enriched sequencing reaction described herein uses a probe set that includes at least 5, 10, 15, 20, 25, 30, 40, 50, 60, 70, 75, 80, 90, 100, or all 105 of the genes listed in Table 1.
[0700] In some embodiments, the first sequencing reaction is a whole genome sequencing reaction, and the average sequencing depth of the reaction across the genome is at least 5x, 10x, 15x, 20x, 25x, 30x, 40x, 50x, or higher.
[0701] In some embodiments, the first plurality of sequence reads includes at least 50,000 sequence reads, at least 100,000 sequence reads, at least 250,000 sequence reads, at least 500,000 sequence reads, at least 1,000,000 sequence reads, at least 5,000,000 sequence reads, or more.
[0702] In some embodiments, the first sequencing reaction and/or the first plurality of sequence reads includes any of the embodiments as described above in the present disclosure. For example, in some embodiments, methods and systems for nucleic acid extraction, library preparation, capture and hybridization, pooling, sequencing, aligning, normalization and/or other sequence read processing comprise any of the embodiments as described above in the present disclosure with reference to Figures 2-4.
[0703] Referring to Block 504-2, the method further comprises aligning each respective sequence read in the first plurality of sequence reads to a reference sequence for the species of the subject thus identifying (i) a variant allele fragment count for a candidate variant, where the candidate variant maps to a locus in the reference sequence, and (ii) a locus fragment count for the locus encompassing the candidate variant. In some embodiments, the variant allele fragment count refers to a unique number of sequence reads in the test subject that encompass the candidate variant. In some embodiments, the locus fragment count refers to the number of sequence reads in the test subject that map to the respective locus encompassing the candidate variant.
[0704] As described above, in some embodiments, the reference sequence is a reference genome, e.g., a reference human genome. In some embodiments, a reference genome has several blacklisted regions, such that the reference genome covers only about 75%, 80%, 85%, 90%, 95%, 98%, 99%, 99.5%, or 99.9% of the entire genome for the species of the subject. In some embodiments, the reference sequence for the subject covers at least 10% of the entire genome for the species of the subject, or at least 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, or more of the entire genome for the species of the subject. In some embodiments, the reference sequence for the subject represents a partial or whole exome for the species of the subject. For instance, in some embodiments, the reference sequence for the subject covers at least 10% of the exome for the species of the subject, or at least 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 98%, 99%, 99.9%, or 100% of the exome for the species of the subject. In some embodiments, the reference sequence covers a plurality of loci that constitute a panel of genomic loci, e.g., a panel of genes used in a panel-enriched sequencing reaction. An example of genes useful for precision oncology, e.g., which may be targeted with such a panel, are shown in Table 1. Accordingly, in some embodiments, the reference sequence for the subject covers at least 100 kb of the genome for the species of the subject.
In other embodiments, the reference sequence for the subject covers at least 250 kb, 500 kb, 750 kb, 1 Mb, 2 Mb, 5 Mb, 10 Mb, 25 Mb, 50 Mb, 100 Mb, 250 Mb, or more of the genome for the species of the subject. However, in some embodiments, there is no size limitation of the reference sequence. For example, in some embodiments, the reference sequence can be a sequence for a single locus (e.g., a single exon, gene, etc.) within the genome for the species of the subject.
[0705] Referring to Block 506-2, the method further comprises comparing the variant allele fragment count for the candidate variant against a dynamic variant count threshold for the locus in the reference sequence that the candidate variant maps to. The dynamic variant count threshold is based upon a pre-test odds of a positive variant call for the locus based upon the prevalence of variants in a genomic region that includes the locus from a first set of nucleic acids obtained from a cohort of subjects having the cancer condition.
[0706] For example, in some embodiments, the dynamic variant count threshold is determined based on the number of sequence variants that map to the respective locus, obtained from a sequencing of nucleic acids from a cohort of subjects having the cancer condition (e.g., a baseline variant threshold). In some embodiments, the cohort of subjects having the cancer condition are matched to at least one personal characteristic of the test subject (e.g., age, gender, race, smoking status, average alcohol consumption, other underlying medical conditions, etc.).
[0707] In some embodiments, the dynamic variant count threshold is also based upon a sequencing error rate for the sequencing reaction. For example, in some such embodiments, the sequencing error rate for the sequencing reaction is a trinucleotide sequencing error rate. In some embodiments, the dynamic variant count threshold is also based upon a background sequencing error rate determined for the locus.
[0708] Referring to Block 508-2, in some embodiments, the method further comprises obtaining a distribution of variant detection sensitivities as a function of circulating variant allele fraction from the cohort of subjects. The distribution of variant detection sensitivities is based on the circulating variant allele fraction of a second set of nucleic acids collected from the cohort of subjects relative to variant alleles detected in the first set of nucleic acids collected from the cohort of subjects. The first set of nucleic acids are from solid tumor biopsies of the cohort of subjects, and the second set of nucleic acids are cell-free nucleic acids from liquid biopsies of the cohort of subjects.
[0709] Figure 6A2 illustrates a flow chart of a method 600-2 for obtaining a distribution of variant detection sensitivities as a function of circulating variant allele fraction from a cohort of subjects, in accordance with some embodiments of the present disclosure. For example, referring to Block 602-2, matched liquid biopsy and solid tumor samples are obtained from a set of training subjects. In some embodiments, the training subjects comprise any of the cancer conditions, personal characteristics, and/or feature data described above in the present disclosure. Furthermore, in some embodiments, obtaining the matched liquid biopsy and solid tumor samples comprise any of the methods and embodiments described above in the present disclosure.
[0710] Referring to Block 604-2, the solid tumor sample is sequenced (e.g., by extracting nucleic acids from the solid tumor sample and performing a sequencing reaction for the sample). The plurality of sequence reads obtained from sequencing the solid tumor sample are aligned to a reference genome (e.g., a human reference genome), thus determining any sequence variants included in the solid tumor sample. Referring to Block 606-2, the liquid biopsy sample is sequenced as described above for the solid tumor sample, thus determining any sequence variants included in the liquid biopsy sample.
[0711] Referring to Block 608-2, the results of the sequencing reactions are compared by comparing the sequence variants detected in the liquid biopsy sample against the sequence variants detected in the solid tumor sample (e.g., a measure of how many of the variants detected in the solid tumor sample were also detected in the liquid biopsy sample, or a circulating variant allele fraction). The comparison determines a variant detection sensitivity for each variant (e.g., corresponding to a respective locus) in the liquid biopsy sample. Referring to Block 610-2, each variant detection sensitivity is binned, in a plurality of bins, with respect to an estimated tumor fraction for the liquid biopsy sample, thus obtaining a distribution of variant detection sensitivities.
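The binning step of Block 610-2 can be sketched as follows, assuming that each variant found in a solid tumor sample is paired with a fraction (an estimated tumor fraction or a circulating variant allele fraction) and a flag indicating whether the matched liquid biopsy detected it. The function name, input layout, and bin edges are hypothetical.

```python
from bisect import bisect_right


def sensitivity_by_bin(observations, bin_edges):
    """Estimate detection sensitivity within each fraction bin.

    observations: iterable of (fraction, detected) pairs, one per variant
                  known from the matched solid tumor sample.
    bin_edges:    ascending edges, e.g. [0.0, 0.001, 0.01, 0.1, 1.0].
    Returns a list of (lower, upper, sensitivity) tuples, where sensitivity
    is None for bins that received no observations.
    """
    n_bins = len(bin_edges) - 1
    totals = [0] * n_bins
    detected_counts = [0] * n_bins

    for fraction, detected in observations:
        index = bisect_right(bin_edges, fraction) - 1
        if fraction == bin_edges[-1]:
            index = n_bins - 1  # include the top edge in the last bin
        if index < 0 or index >= n_bins:
            continue  # fraction lies outside the binned range
        totals[index] += 1
        if detected:
            detected_counts[index] += 1

    return [
        (bin_edges[i], bin_edges[i + 1],
         detected_counts[i] / totals[i] if totals[i] else None)
        for i in range(n_bins)
    ]
```

The sensitivity used for a candidate variant would then be read from the bin nearest its circulating variant fraction.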
[0712] In some embodiments, a distribution of variant detection sensitivities is established based on a set of training samples (e.g., sensitivity distribution training data) with known variant allele fractions, e.g., samples derived from a solid tumor sample for which one or more variant allele fraction has been determined (e.g., by deep sequencing of the sample). For example, in some embodiments, nucleic acids from each of a plurality of training samples 181 having a known variant allele fraction 184 for one or more variant alleles 183 are sequenced according to a process-matched sequencing reaction (e.g., using a substantially identical or identical sequencing reaction), and it is determined whether each sequence variant can be detected, e.g., defining a detection status 185 for each locus/variant 183. Over a large number of training samples, a specificity of detection of variants having different variant allele fractions can be determined. In some embodiments, the specificity is determined on a locus-by-locus basis, such that the specificity of detection is specific for the genomic region or locus encompassing the candidate sequence variant. In some embodiments, the specificity is determined globally, e.g., not on a locus-by-locus basis.
[0713] Referring again to Block 508-2, in some embodiments, the method comprises estimating a circulating variant fraction for the candidate variant. In some embodiments, the circulating variant fraction for the candidate variant is the ratio of the variant allele fragment count to the locus fragment count (e.g., the proportion of sequence reads that include the candidate variant in the plurality of sequence reads that map to the respective locus encompassing the variant). In some embodiments, the circulating variant fraction is based only upon the variant allele frequency for that locus. In some alternative embodiments, the circulating variant fraction is a circulating tumor fraction determined for the sample.
[0714] For example, in some embodiments, the circulating variant fraction is specific to the variant being validated. In some such embodiments, the estimated variant fraction is determined by calculating the percentage of sequence reads encompassing the locus that include the variant (e.g., a variant allele fraction).
[0715] In some embodiments, the estimated circulating variant fraction for the candidate variant is an estimated tumor fraction for the sample, where the estimated tumor fraction for the sample is estimated based on a second sequencing reaction comprising low-pass whole-genome methylation sequencing of a second plurality of cell-free DNA fragments in the liquid biopsy sample of the test subject.
[0716] In some such embodiments, the dynamic threshold for the locus is set based upon a desired variant detection specificity determined by a relationship in which sensitivity is the variant detection sensitivity in the distribution of variant detection sensitivities that corresponds to the circulating variant fraction for the candidate variant, odds(post-test) is the post-test odds of a positive variant call for the locus, and odds(pre-test) is the pre-test odds of the positive variant call for the locus.
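The quantities named above (sensitivity, specificity, pre-test odds, and post-test odds) are linked by the likelihood-ratio form of Bayes' rule. The following is a sketch of that standard identity, offered as an assumption about how these quantities can be related rather than as a reproduction of the disclosed relationship:

```latex
\begin{aligned}
\mathrm{odds}(\text{post-test})
  &= \mathrm{odds}(\text{pre-test}) \times \frac{\text{sensitivity}}{1 - \text{specificity}} \\[4pt]
\Longrightarrow\quad
\text{specificity}
  &= 1 - \text{sensitivity} \times \frac{\mathrm{odds}(\text{pre-test})}{\mathrm{odds}(\text{post-test})}
\end{aligned}
```

Under this identity, fixing a desired post-test odds and knowing the pre-test odds and the sensitivity determines the specificity that a variant call must achieve.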
[0717] In some embodiments, the specificity is used to select a quantile of a beta-binomial distribution of the minimal variant allele fragment count required to support a positive variant call for the locus, thus defining the dynamic threshold for the locus. The beta-binomial distribution is defined by the sequencing error rate for the sequencing reaction and the background sequencing error rate determined for the locus. For example, in some embodiments, the minimum number of alternative alleles required to validate a positive variant call is represented by
$$\mathrm{Min}(AO) = \mathrm{quantile}\left(\mathrm{sensitivity} \times \frac{\mathrm{odds}(\text{pre-test})}{\mathrm{odds}(\text{post-test})},\; d(\text{beta-binomial})\right)$$
[0718] In some embodiments, as described in Figure 6A2, obtaining the distribution of variant detection sensitivities comprises binning variant detection sensitivities in a plurality of bins as a function of circulating variant allele fraction. Each bin in the plurality of bins is associated with a corresponding variant detection sensitivity, and the sensitivity used to set the dynamic threshold is the variant detection sensitivity corresponding to the respective bin, in the plurality of bins, that encompasses the circulating variant fraction for the candidate variant. In some alternative embodiments, the distribution of variant detection sensitivities is a continuous function.
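Looking up the sensitivity for a candidate variant from a binned distribution can be sketched as below. The function name `lookup_sensitivity`, the bin edges, and the per-bin sensitivities are hypothetical values chosen for illustration only.

```python
from bisect import bisect_right


def lookup_sensitivity(variant_fraction, bin_edges, bin_sensitivities):
    """Return the sensitivity of the bin that encompasses variant_fraction.

    bin_edges: ascending boundaries; bin i covers [bin_edges[i], bin_edges[i + 1]),
        with the last bin also including its right edge.
    bin_sensitivities: one sensitivity per bin (len(bin_edges) - 1 values).
    """
    if len(bin_sensitivities) != len(bin_edges) - 1:
        raise ValueError("expected one sensitivity per bin")
    if not bin_edges[0] <= variant_fraction <= bin_edges[-1]:
        raise ValueError("variant fraction falls outside the binned range")
    idx = bisect_right(bin_edges, variant_fraction) - 1
    idx = min(idx, len(bin_sensitivities) - 1)  # right-most edge maps to last bin
    return bin_sensitivities[idx]
```

A continuous sensitivity function, as in the alternative embodiments, would replace the lookup with a direct evaluation of that function at the variant fraction.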
[0719] Additional details and embodiments for obtaining thresholds for filtering variants (e.g., dynamic thresholds) are described above in the present disclosure (see, Example Methods: Variant Identification).
[0720] In some embodiments, the pre-test odds of a positive variant call for the locus is based on (i) the prevalence of variants in the genomic region that includes the locus from the first set of nucleic acids obtained from the cohort of subjects having the cancer condition (e.g., the percentage of patients with the particular cancer type that have a variant in the region of interest), and (ii) a known or inferred effect of the variants. When the known or inferred effect of a variant is loss-of-function (LOF) of a gene that includes the locus, the genomic region used to compute the pre-test probability is the entire gene, and when the known or inferred effect of a variant is gain-of-function (GOF) of the gene that includes the locus, the genomic region used to compute the pre-test probability is the exon, of the gene, that includes the locus.
[0721] In some such embodiments, the effect of the variants is inferred by binning each respective variant of the variants in the genomic region that includes the locus from the first set of nucleic acids obtained from the cohort of subjects having the cancer condition into a respective bin, in a plurality of bins for the gene that includes the locus, corresponding to the exon encompassing the respective variant in the gene. Each bin in the plurality of bins corresponds to a different exon of the respective gene. After determining whether any bin in the plurality of bins contains significantly more variants than the other bins in the plurality of bins, the effect of the sequence variant is inferred to be a gain-of-function of the gene when a bin contains significantly more variants than the other bins in the plurality of bins. Alternatively, the effect of the sequence variant is inferred to be a loss-of-function of the gene when no bin in the plurality of bins contains significantly more sequence variants than the other bins in the plurality of bins.
[0722] For example, Figures 7A2 and 7B2 illustrate a method of inferring an effect of a sequence variant as a gain-of-function or a loss-of-function of a gene, in accordance with some embodiments of the present disclosure.
[0723] Figure 7A2 illustrates a gene 700-A with a plurality of exons 701-A, 702-A, 703-A. Each exon corresponds to a bin in a plurality of bins. A first exon 701-A comprises a region of interest (e.g., a locus) that encompasses a candidate variant. A plurality of sequence variants (e.g., 704-A, 705-A, 706-A, 707-A, 708-A, 709-A) is obtained from a sequencing of nucleic acids from a cohort of subjects, where each sequence variant maps to a respective locus in the gene. The effect of the variants is inferred by binning each sequence variant into the respective bin corresponding to the exon to which the respective variant maps. Thus, sequence variants 704-A, 705-A, 706-A, and 707-A are binned into the bin corresponding to exon 701-A, sequence variant 708-A is binned into the bin corresponding to exon 702-A, and sequence variant 709-A is binned into the bin corresponding to exon 703-A. In Figure 7A2, it can be determined that the bin corresponding to exon 701-A contains significantly more variants than the other bins in the plurality of bins, and thus the effect of the sequence variant is inferred to be a gain-of-function of the gene. In such case, the genomic region used to compute the pre-test probability is the exon 701-A of the gene, that includes the locus encompassing the candidate variant.
[0724] Alternatively, Figure 7B2 illustrates a gene 700-B with a plurality of exons 701-B, 702-B, 703-B. Each exon corresponds to a bin in a plurality of bins. A first exon 701-B comprises a region of interest (e.g., a locus) that encompasses a candidate variant. A plurality of sequence variants (e.g., 704-B, 705-B, 706-B, 707-B, 708-B, 709-B) is obtained from a sequencing of nucleic acids from a cohort of subjects, where each sequence variant maps to a respective locus in the gene. The effect of the variants is inferred by binning each sequence variant into the respective bin corresponding to the exon to which the respective variant maps. Thus, sequence variants 704-B and 705-B are binned into the bin corresponding to exon 701-B, sequence variants 706-B and 707-B are binned into the bin corresponding to exon 702-B, and sequence variants 708-B and 709-B are binned into the bin corresponding to exon 703-B. In Figure 7B2, it can be determined that no bin in the plurality of bins contains significantly more sequence variants than the other bins in the plurality of bins, and thus the effect of the sequence variant is inferred to be a loss-of-function of the gene. In such case, the genomic region used to compute the pre-test probability is the entire gene.
[0725] In some such embodiments, determining whether any bin in the plurality of bins contains significantly more variants than the other bins in the plurality of bins comprises applying a rolling Poisson test of difference between bin counts corresponding to adjacent exons in the gene.
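One way to realize a rolling test over adjacent exon bins is sketched below. The text does not specify which Poisson test is used, so this sketch assumes the conditional exact test for two Poisson counts (given their total, the first count is Binomial(n, 0.5) under equal rates). The function names, the significance level `alpha`, and the rule that any significant adjacent pair implies gain-of-function are illustrative assumptions; multiple-testing correction is omitted.

```python
from math import comb


def poisson_difference_pvalue(count_a, count_b):
    """Two-sided exact p-value for equal Poisson rates behind two counts.

    Assumed test: conditional on n = count_a + count_b, count_a follows
    Binomial(n, 0.5) under the null hypothesis of equal rates.
    """
    n = count_a + count_b
    if n == 0:
        return 1.0
    smaller = min(count_a, count_b)
    # Lower-tail probability of observing a count this small or smaller.
    tail = sum(comb(n, i) for i in range(smaller + 1)) / 2 ** n
    # The null distribution is symmetric, so double the tail (capped at 1).
    return min(1.0, 2.0 * tail)


def infer_effect(exon_counts, alpha=0.05):
    """Roll the test over adjacent exon bins (hypothetical decision rule)."""
    for left, right in zip(exon_counts, exon_counts[1:]):
        if poisson_difference_pvalue(left, right) < alpha:
            return "gain-of-function"  # variants cluster in one exon
    return "loss-of-function"  # variants spread evenly across exons
```

With only a handful of variants per exon, as drawn schematically in Figures 7A2 and 7B2, an exact test has little power (for example, counts of 4 and 1 give a p-value of 0.375), so cohort-scale counts are needed before clustering registers as significant.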
[0726] Referring to Block 510-2, the method further comprises validating the presence of the somatic sequence variant in the test subject when the variant allele fragment count for the candidate variant satisfies the dynamic variant count threshold for the locus, or rejecting the presence of the somatic sequence variant in the test subject when the variant allele fragment count for the candidate variant does not satisfy the dynamic variant count threshold for the locus. In some embodiments, the validating includes other variant filtering criteria, as described above in the present disclosure (see, Example Methods: Variant Identification).
[0727] In some embodiments, the methods and systems disclosed herein are used for precision oncology applications. For example, in some embodiments, the method further comprises generating a report for the test subject comprising the identity of variant alleles having variant allele counts, in the first sequencing reaction, that satisfy the dynamic variant count threshold. In some embodiments, the generated report further comprises therapeutic recommendations for the test subject based on the identity of one or more of the reported variant alleles. Additional embodiments for precision oncology applications, including matched clinical trials, matched therapies, report generation, and/or other aspects of the digital and laboratory health care platform are described in detail below.
[0728] Another aspect of the present disclosure provides a computer system comprising one or more processors and a non-transitory computer-readable medium including computer-executable instructions that, when executed by the one or more processors, cause the processors to perform a method according to any one of the embodiments disclosed herein.
[0729] Another aspect of the present disclosure provides a non-transitory computer-readable storage medium having stored thereon program code instructions that, when executed by a processor, cause the processor to perform the method according to any one of the embodiments disclosed herein.
[0730] In some embodiments, the methods described herein include generating a clinical report 139-3 (e.g., a patient report), providing clinical support for personalized cancer therapy, and/or using the information curated from sequencing of a liquid biopsy sample, as described above. In some embodiments, the report is provided to a patient, physician, medical personnel, or researcher in a digital copy (for example, a JSON object, a pdf file, or an image on a website or portal) or a hard copy (for example, printed on paper or another tangible medium). A report object, such as a JSON object, can be used for further processing and/or display. For example, information from the report object can be used to prepare a clinical laboratory report for return to an ordering physician. In some embodiments, the report is presented as text, as audio (for example, recorded or streaming), as images, or in another format and/or any combination thereof.
[0731] The report includes information related to the specific characteristics of the patient’s cancer, e.g., detected genetic variants, epigenetic abnormalities, associated oncogenic pathogenic infections, and/or pathology abnormalities. In some embodiments, other characteristics of a patient’s sample and/or clinical records are also included in the report. For example, in some embodiments, the clinical report includes information on clinical variants, e.g., one or more of copy number variants (e.g., for actionable genes CCNE1, CD274(PD-L1), EGFR, ERBB2(HER2), MET, MYC, BRCA1, and/or BRCA2), fusions, translocations, and/or rearrangements (e.g., in actionable genes ALK, ROS1, RET, NTRK1, FGFR2, FGFR3, NTRK2 and/or NTRK3), pathogenic single nucleotide polymorphisms, insertion-deletions (e.g., somatic/tumor and/or germline/normal), therapy biomarkers, microsatellite instability status, and/or tumor mutational burden.
[0732] Conversion of solid tumor test to liquid biopsy test. In one embodiment, the solid tissue sample is insufficient for NGS testing (for example, the sample is too small or too degraded, the amount or quality of nucleic acids extracted from the sample does not result in quality NGS results that would result in reliable determination of variants and/or other genetic characteristics of the sample), and the physician or patient may decide to convert the solid tissue test that was ordered to a liquid biopsy test to be performed on a liquid biopsy sample collected from the same patient. The resulting report and/or display of the results on a portal may include an “xF Conversion Badge” to distinguish any order that has been converted from a solid tissue test to a liquid biopsy test (compared to, for example, a liquid biopsy test that was not initially ordered as a solid tissue test). This allows a user to identify which orders have been converted by this process and to distinguish them from orders that were intentionally placed for the liquid biopsy panel.
[0733] Longitudinal Reporting. In various embodiments, a report may include and/or compare the results of multiple liquid biopsy tests and/or solid tumor tests (for example, multiple tests associated with the same patient). The results of multiple liquid biopsy tests and/or solid tumor tests may be displayed on a portal in a variety of configurations that may be selected and/or customized by the viewer. The tests may have been performed at different times, and the samples on which the tests were performed may have been collected at different times.
[0734] Download result. Clinical and/or molecular data associated with a patient (for example, information that would be included in the report), may be aggregated and made available via the portal. Any portion of the report data may be available for download (for example, as a CSV file) by the physician and/or patient. In various embodiments, the data may include data related to genetic variants, RNA expression levels, immunotherapy markers (including MSI and TMB), RNA fusions, etc. In one embodiment, if a physician or medical facility has ordered multiple tests (all tests may be associated with the same patient or tests may be associated with multiple patients), results associated with more than one test may be aggregated into a single file for downloading.
Systems and Methods for Improved Circulating Tumor Fraction Estimates
[0735] Below, systems and methods for improving circulating tumor fraction estimates, e.g., within the context of the methods and systems described above, are described with reference to Figures 4F3, 5A3-B3, and 6A3-C3.
[0736] Many of the embodiments described below, in conjunction with Figures 4F3, 5A3-B3, and 6A3-C3, relate to analyses performed using sequencing data for cfDNA obtained from a liquid biopsy sample of a cancer patient. Generally, these embodiments are independent and, thus, not reliant upon any particular DNA sequencing methods. However, in some embodiments, the methods described below include generating the sequencing data.
[0737] As described herein, in some embodiments, the methods described herein (e.g., methods 400-3 and 500-3 as illustrated in Figures 4F3 and 5A3-B3) include one or more data collection steps, in addition to data analysis and downstream steps. For example, as described herein, e.g., with reference to Figures 2 and 3, in some embodiments, the methods include collection of a liquid biopsy sample and, optionally, one or more matching biological samples from the subject (e.g., a matched cancerous and/or matched non-cancerous sample from the subject). Likewise, as described herein, e.g., with reference to Figures 2 and 3, in some embodiments, the methods include extraction of DNA from the liquid biopsy sample (cfDNA) and, optionally, one or more matching biological samples from the subject (e.g., a matched cancerous and/or matched non-cancerous sample from the subject). Similarly, as described herein, e.g., with reference to Figures 2 and 3, in some embodiments, the methods include nucleic acid sequencing of DNA from the liquid biopsy (cfDNA) sample and, optionally, one or more matching biological samples from the subject (e.g., a matched cancerous and/or matched non-cancerous sample from the subject).
[0738] However, in other embodiments, the methods described herein begin with obtaining nucleic acid sequencing results, e.g., raw or collapsed sequence reads of DNA from a liquid biopsy sample (cfDNA) and, optionally, one or more matching biological samples from the subject (e.g., a matched cancerous and/or matched non-cancerous sample from the subject), from which the genomic features needed for estimating circulating tumor fraction (e.g., variant allele count and/or variant allele fraction) can be determined. For example, in some embodiments, sequencing data 122 for a patient 121 is accessed and/or downloaded over network 105 by system 100.
[0739] Similarly, in some embodiments, the methods described herein begin with obtaining the genomic features needed for estimating circulating tumor fraction (e.g., variant allele count and/or variant allele fraction) for a sequencing of a liquid biopsy sample and, optionally, one or more matching biological samples from the subject (e.g., a matched cancerous and/or matched non-cancerous sample from the subject). For example, in some embodiments, variant allele counts and/or variant allele fractions for sequencing data 122 of patient 121 are accessed and/or downloaded over network 105 by system 100.
[0740] Figure 4F3 illustrates a flow chart of a method for precision oncology including determining accurate circulating tumor fraction estimates using on-target and off-target sequence reads, in accordance with some embodiments of the present disclosure.
[0741] In some embodiments, the method includes obtaining (402-3) cell-free DNA sequencing data 122 from a sequencing reaction of a liquid biopsy sample of a test subject 121 (e.g., sequence reads 123-1-1-1, . . . 123-1-1-K for sequence run 122-1-1 for a liquid biopsy sample from patient 121-1, as illustrated in Figure 1B). As described herein, in some embodiments, the obtaining includes a step of sequencing cell-free nucleic acids from a liquid biopsy sample. Example methods for sequencing cell-free nucleic acids are described herein. The sequence reads obtained from the targeted-panel sequencing include a first subset of sequence reads that map to one or more target genes (e.g., on-target reads) in the panel and a second subset of sequence reads that map to an off-target portion of the reference genome (e.g., off-target reads). In some embodiments, the plurality of sequence reads includes at least 1000 sequence reads. In some embodiments, the first plurality of sequence reads includes at least 10,000 sequence reads. In some embodiments, the first plurality of sequence reads includes at least 100,000 sequence reads. In some embodiments, the first plurality of sequence reads includes at least 200,000, 300,000, 400,000, 500,000, 750,000, 1,000,000, 2,500,000, 5,000,000 sequence reads, or more.
[0742] In some embodiments, the panel size is relatively small, e.g., less than 1000 genes, less than 750 genes, less than 500 genes, less than 250 genes, less than 200 genes, less than 150 genes, less than 125 genes, less than 100 genes, less than 75 genes, less than 50 genes, etc. In some such embodiments, the sequencing reaction is performed at a read depth of 100X or more, 250X or more, 500X or more, 1000X or more, 2500X or more, 5000X or more, 10,000X or more, 20,000X or more, or 30,000X or more. In some embodiments, the sequencing panel comprises 1 or more, 10 or more, 20 or more, 50 or more, 100 or more, 150 or more, 200 or more, 300 or more, 500 or more, or 1000 or more genes. In some embodiments, the sequencing panel comprises one or more genes listed in Table 1. In some embodiments, the sequencing panel includes at least 2, 3, 4, 5, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80, 90, 100, or all of the genes listed in Table 1. In some embodiments, the sequencing panel comprises one or more genes selected from the group consisting of MET, EGFR, ERBB2, CD274, CCNE1, MYC, BRCA1 and BRCA2. In some embodiments, the sequencing panel includes at least 2, 3, 4, 5, 6, 7, or all 8 of MET, EGFR, ERBB2, CD274, CCNE1, MYC, BRCA1 and BRCA2. In some embodiments, the sequencing reaction is a whole exome sequencing reaction.
[0743] Sequence reads 123 from the sequencing data 122 are then aligned (404-3) to a human reference sequence (e.g., a human genome or a portion of a human genome, e.g., 1%, 5%, 10%, 15%, 20%, 25%, 30%, 40%, 50%, 60%, 75%, 90%, 95%, 99%, or more of the human genome, or to a map of a human reference genome or a set of human reference genomes, or a portion thereof), thereby generating a plurality of aligned reads 124. Optionally, the pre-aligned sequence reads 123 and/or aligned sequence reads 124 are pre-processed (408-3) using any of the methods disclosed above (e.g., normalization, bias correction, etc.). In some embodiments, as described herein, device 100 obtains previously aligned sequence reads.
[0744] As described above, in some embodiments, the reference sequence is a reference genome, e.g., a reference human genome. In some embodiments, a reference genome has several blacklisted regions, such that the reference genome covers only about 75%, 80%, 85%, 90%, 95%, 98%, 99%, 99.5%, or 99.9% of the entire genome for the species of the subject. In some embodiments, the reference sequence for the subject covers at least 10% of the entire genome for the species of the subject, or at least 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, or more of the entire genome for the species of the subject. In some embodiments, the reference sequence for the subject represents a partial or whole exome for the species of the subject. For instance, in some embodiments, the reference sequence for the subject covers at least 10% of the exome for the species of the subject, or at least 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 98%, 99%, 99.9%, or 100% of the exome for the species of the subject. In some embodiments, the reference sequence covers a plurality of loci that constitute a panel of genomic loci, e.g., a panel of genes used in a panel-enriched sequencing reaction. Examples of genes useful for precision oncology, e.g., which may be targeted with such a panel, are shown in Table 1. Accordingly, in some embodiments, the reference sequence for the subject covers at least 100 kb of the genome for the species of the subject. In other embodiments, the reference sequence for the subject covers at least 250 kb, 500 kb, 750 kb, 1 Mb, 2 Mb, 5 Mb, 10 Mb, 25 Mb, 50 Mb, 100 Mb, 250 Mb, or more of the genome for the species of the subject. However, in some embodiments, there is no size limitation of the reference sequence. For example, in some embodiments, the reference sequence can be a sequence for a single locus (e.g., a single exon, gene, etc.) within the genome for the species of the subject.
[0745] In some embodiments, the bins for off-target sequence reads (those sequence reads that do not correspond to a sequencing panel enrichment probe) are established to provide roughly uniform distribution of sequence reads to each bin, e.g., based on training data establishing historical distributions of sequence reads across the genome for a given targeted-panel sequencing reaction. In some embodiments, the method includes processes for enforcing uniformity, such as defining different bin sizes, GC correction, and sequencing depth corrections. In other embodiments, the binning is performed based upon a predetermined bin size. In some embodiments, the plurality of bins includes at least 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10,000, 25,000, 50,000, or more bins distributed across the reference sequence (e.g., the genome) for the species of the subject. In some embodiments, the bins are distributed relatively uniformly across the reference sequence, e.g., such that each bin encompasses a similar number of bases, e.g., about 0.5 kb, 1 kb, 2 kb, 5 kb, 10 kb, 25 kb, 50 kb, 100 kb or more bases. Each respective bin in the plurality of bins represents a corresponding region of a reference sequence (e.g., genome) for the species of the subject. Each respective bin-level sequence ratio in the plurality of bin-level sequence ratios is determined from a comparison of the first plurality of sequence reads to sequence reads from one or more reference samples. In some embodiments, the one or more reference samples are process-matched reference samples. That is, in some embodiments the one or more reference samples are prepared for sequencing using the same methodology as used to prepare the sample from the test subject. Similarly, in some embodiments, the one or more reference samples are sequenced using the same sequencing methodology as used to sequence the sample from the test subject. In this fashion, internal biases for particular regions or sequences are controlled for in the reference samples.
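Fixed-size binning and a bin-level sequence ratio against a reference sample can be sketched as follows. This is a simplified illustration: it assumes reads are represented only by their start coordinates on one chromosome, uses a predetermined bin size, and normalizes by total read count; the GC and depth corrections mentioned above are omitted, and all function names are hypothetical.

```python
from collections import Counter


def count_reads_per_bin(read_starts, bin_size):
    """Count reads per fixed-size bin, keyed by bin index (start // bin_size)."""
    return Counter(pos // bin_size for pos in read_starts)


def bin_level_ratios(sample_counts, reference_counts):
    """Ratio of depth-normalized sample counts to reference counts, per bin.

    Bins with no reference reads are skipped because their ratio is undefined.
    """
    sample_total = sum(sample_counts.values())
    reference_total = sum(reference_counts.values())
    if sample_total == 0 or reference_total == 0:
        raise ValueError("both samples need at least one read")
    ratios = {}
    for bin_index, ref_n in reference_counts.items():
        if ref_n == 0:
            continue
        sample_share = sample_counts.get(bin_index, 0) / sample_total
        reference_share = ref_n / reference_total
        ratios[bin_index] = sample_share / reference_share
    return ratios
```

A ratio above 1 indicates that the test sample has relatively more reads in that bin than the reference sample, and a ratio below 1 indicates relatively fewer.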
[0746] In some embodiments, binned sequence reads are segmented via circular binary segmentation (CBS). For example, in some embodiments, the method includes genomic region binning, coverage calculation, bias correction, normalization to a reference pool, segmentation, and/or visualization (e.g., using CNVkit).
[0747] In some embodiments, the method includes determining a sequence ratio (e.g., a coverage ratio) for a plurality of segments of the genome using the, e.g., binned, corrected, normalized, and/or segmented sequence reads as described above. In some embodiments, coverage ratio (CR) is calculated for the plurality of segments based on the following relationship (Block 410-3):
$$\log_2(CR) = \log_2\left(\frac{\text{normalized sample coverage}}{\text{normalized pool coverage}}\right) \quad (1)$$
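The coverage-ratio relationship of Eq. (1) can be written as a small helper. The function name and the input validation are assumptions of this sketch.

```python
from math import log2


def log2_coverage_ratio(normalized_sample_coverage, normalized_pool_coverage):
    """log2 of sample coverage over reference-pool coverage, as in Eq. (1)."""
    if normalized_sample_coverage <= 0 or normalized_pool_coverage <= 0:
        raise ValueError("coverages must be positive to take a log ratio")
    return log2(normalized_sample_coverage / normalized_pool_coverage)
```

A value of 0 means the segment is covered as deeply as in the reference pool; a value of -1 corresponds to half the pool coverage.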
[0748] In some embodiments, the data is then cleaned up by (i) removing segments located on sex chromosomes, and/or (ii) removing segments with fewer probes than a minimal threshold. In some embodiments, segments are then fitted to integer copy states via a maximum likelihood estimation (e.g., an expectation-maximization algorithm 412-3) using, for example, the sum of squared error of segment log2 ratios (e.g., normalized to genomic interval size) to expected coverage ratios given a putative copy state and tumor purity.
[0749] For example, in some embodiments, the method includes calculating expected sequence ratios 414-3 (e.g., coverage ratios) for a set of copy states at a given tumor purity. For instance, for a set of tumor purity values TP, and a set of copy states CN, the expected log2 coverage ratio is calculated for each tumor purity (TP_i) and copy number state (CN_j) according to:
$$\log_2(CR) = \log_2\left(\frac{2(1 - TP_i) + CN_j \cdot TP_i}{2}\right) \quad (2)$$
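The expected log2 coverage ratio for each combination of tumor purity and copy state can be tabulated as in the sketch below. The function names and the particular purity and copy-state grids are illustrative assumptions.

```python
from math import log2


def expected_log2_ratio(tumor_purity, copy_number):
    """Expected log2 coverage ratio for a tumor copy number at a given purity.

    Normal cells contribute 2 copies with weight (1 - tumor_purity); tumor cells
    contribute copy_number copies with weight tumor_purity. The mixture is
    divided by the diploid baseline of 2.
    """
    mixture = 2 * (1 - tumor_purity) + copy_number * tumor_purity
    if mixture <= 0:
        raise ValueError("zero copies at 100% purity has no finite log ratio")
    return log2(mixture / 2)


def expected_ratio_table(purities, copy_states):
    """Map every (purity, copy_state) pair to its expected log2 ratio."""
    return {
        (tp, cn): expected_log2_ratio(tp, cn)
        for tp in purities
        for cn in copy_states
    }
```

For example, at 50% purity a copy-neutral segment (two copies) has an expected log2 ratio of 0, whereas at 100% purity a single-copy loss has an expected log2 ratio of -1.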
[0750] In some embodiments, the method includes calculating the distance 416-3 to the closest copy state expected sequence ratio (e.g., coverage ratio) at the given tumor purity, where the distance (e.g., error) for a segment k ($CR_k$) from the expected copy state is defined as:
$$d_k = \left| CR_{TP} - CR_k \right| \quad (3)$$
[0751] In some embodiments, the method includes assigning segment copy states by selecting expected copy states with the closest sequence ratio. That is, each segment is assigned the copy state whose expected sequence ratio lies at the smallest distance from the sequence ratio of the segment:
$$CN_k = \operatorname{argmin}(d_k) \quad (4)$$
The error for that segment is therefore the minimum distance:
$$\varepsilon_k = \min(d_k) \quad (5)$$
[0752] In some embodiments, the method then includes estimating the circulating tumor fraction for the test subject based on a measure of fit between corresponding segment-level coverage ratios and integer copy states across the plurality of simulated circulated tumor fractions.
[0753] In some embodiments, estimating the circulating tumor fraction comprises minimization of an error between corresponding segment-level coverage ratios and integer copy states across the plurality of simulated circulated tumor fractions. For example, in some embodiments, the method includes summing the weighted errors for each tumor purity and selecting the model with the lowest score. In some embodiments, the scores 418-3 for each segment are weighted by the number of probes on that segment. The number of probes is highly co-linear with the length of the segment. In some embodiments, the weighting is performed according to:
$$w_i = \sum_k \varepsilon_k l_k \quad (6)$$
where:
$w_i$ is the weighted score for the sample copy ratios at tumor purity i,
$\varepsilon_k$ is the error of segment k to its closest copy state, and
$l_k$ is the number of probes on segment k.
[0754] The circulating tumor fraction estimate, therefore, is selected as the tumor purity with the lowest score (Block 420-3):
$$TP = \operatorname{argmin}(w) \quad (7)$$
e.g., where $w = \{w_{0.01}, \ldots, w_{0.99}\}$.
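The scoring and selection procedure of Eqs. (2) through (7) can be sketched end to end as a grid search. This is a minimal illustration under stated assumptions: segments are given as hypothetical `(log2_ratio, n_probes)` pairs, distances are taken between log2 ratios, the purity grid runs from 0.01 to 0.99 in steps of 0.01, and the default copy states 1 through 4 are an arbitrary choice for the example.

```python
from math import log2


def expected_log2_ratio(tumor_purity, copy_number):
    """Expected log2 coverage ratio, as in Eq. (2)."""
    return log2((2 * (1 - tumor_purity) + copy_number * tumor_purity) / 2)


def score_purity(segments, tumor_purity, copy_states):
    """Probe-weighted sum of each segment's error to its closest copy state."""
    expected = [expected_log2_ratio(tumor_purity, cn) for cn in copy_states]
    score = 0.0
    for log2_ratio, n_probes in segments:
        # Eqs. (3)-(5): distance to every copy state, keep the minimum.
        error = min(abs(e - log2_ratio) for e in expected)
        # Eq. (6): weight the error by the number of probes on the segment.
        score += error * n_probes
    return score


def estimate_tumor_fraction(segments, copy_states=(1, 2, 3, 4), purities=None):
    """Return (best_purity, scores); Eq. (7) picks the lowest-scoring purity."""
    if purities is None:
        purities = [i / 100 for i in range(1, 100)]
    scores = {tp: score_purity(segments, tp, copy_states) for tp in purities}
    best = min(scores, key=scores.get)
    return best, scores
```

When several purities fit equally well, for instance when every segment is copy-neutral, `min` returns the first purity in the grid, so a tie-breaking rule is still needed in practice.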
[0755] In some embodiments, estimating the circulating tumor fraction includes identifying a plurality of local optima for fit (e.g., minima for the error between corresponding segment-level coverage ratios and integer copy states across the plurality of simulated circulated tumor fractions), and selecting the local optimum (e.g., minimum) that is closest to a second estimate of circulating tumor fraction determined by a different methodology.
[0756] For example, Figure 19 is an example plot of the errors between corresponding segment-level coverage ratios and integer copy states determined across a plurality of simulated circulated tumor fractions ranging from about 0 to about 1. As seen in the plot, there are two local minima 1902 and 1904 for the error, representing two possible solutions for the circulating tumor fraction for the liquid biopsy sample. In some embodiments, a second estimation of circulating tumor fraction 1906 or 1908 is determined, e.g., according to any of the methods described in the “Circulating Tumor Fraction” section above. The second estimation of circulating tumor fraction is then compared with the local minima, and the local minimum that is closest to the second circulating tumor estimate is selected as the circulating tumor fraction for the liquid biopsy sample. For instance, if the second circulating tumor fraction 1906 was determined to be about 0.35, first local minimum 1902 would be selected, and the circulating tumor fraction for the sample would be estimated to be about 0.325. However, if the second circulating tumor fraction 1908 was determined to be about 0.65, second local minimum 1904 would be selected, and the circulating tumor fraction for the sample would be estimated to be about 0.625.
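Selecting among several local minima using a second estimate can be sketched as follows. The function names are hypothetical, and the strict-inequality definition of a local minimum (which ignores plateaus and the two endpoints of the grid) is a simplifying assumption.

```python
def local_minima(purities, scores):
    """Return purities whose score is strictly lower than both neighbors."""
    return [
        purities[i]
        for i in range(1, len(scores) - 1)
        if scores[i] < scores[i - 1] and scores[i] < scores[i + 1]
    ]


def pick_closest_minimum(minima, second_estimate):
    """Choose the local minimum nearest to an independent second estimate."""
    if not minima:
        raise ValueError("no local minima to choose from")
    return min(minima, key=lambda tp: abs(tp - second_estimate))
```

Using the values discussed for Figure 19, `pick_closest_minimum([0.325, 0.625], 0.35)` selects 0.325, and `pick_closest_minimum([0.325, 0.625], 0.65)` selects 0.625.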
[0757] In some embodiments, the second estimate of circulating tumor fraction is generated by detecting a plurality of germline variants in the liquid biopsy sample based on the first plurality of sequence reads and determining, for each respective germline variant in the plurality of germline variants, a corresponding germline variant allele frequency for the liquid biopsy sample, thereby determining a plurality of germline variant allele frequencies for the liquid biopsy sample. For each respective germline variant in the plurality of germline variants, an absolute value of the difference between the corresponding germline variant allele frequency for the liquid biopsy sample and a germline variant allele frequency for the respective germline variant allele in a non-cancerous tissue of the subject is then determined, thereby generating a plurality of germline variant allele deltas for the liquid biopsy sample. The second estimated circulating tumor fraction for the liquid biopsy sample is then defined as twice the value of the maximum germline variant allele delta in the plurality of germline variant allele deltas.
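The germline-based second estimate can be sketched as below. The function name is hypothetical, and defaulting the non-cancerous allele frequency to 0.5 (the expectation for a heterozygous germline variant) when no matched normal frequencies are supplied is an assumption of this sketch.

```python
def tumor_fraction_from_germline(liquid_vafs, normal_vafs=None):
    """Estimate tumor fraction as twice the largest germline VAF delta.

    liquid_vafs: germline variant allele frequencies in the liquid biopsy.
    normal_vafs: matching frequencies in non-cancerous tissue; defaults to 0.5
        for every variant (assumed heterozygous expectation).
    """
    if not liquid_vafs:
        raise ValueError("at least one germline variant is required")
    if normal_vafs is None:
        normal_vafs = [0.5] * len(liquid_vafs)
    if len(normal_vafs) != len(liquid_vafs):
        raise ValueError("one normal VAF is required per germline variant")
    deltas = [abs(liquid - normal) for liquid, normal in zip(liquid_vafs, normal_vafs)]
    return 2 * max(deltas)
```

For instance, liquid biopsy frequencies of 0.50, 0.62, and 0.41 give deltas of 0.00, 0.12, and 0.09, and therefore an estimate of about 0.24.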
[0758] In some embodiments, for each respective germline variant in the plurality of germline variants, the corresponding germline variant allele frequency for the respective germline variant allele in a non-cancerous tissue of the subject is defined as 0.5. However, in other embodiments, for each respective germline variant in the plurality of germline variants, the corresponding germline variant allele frequency for the respective germline variant allele in a non-cancerous tissue of the subject is determined based on a second sequencing reaction of nucleic acids from a non-cancerous sample of the subject. For example, in some embodiments, a plurality of somatic variants is detected in the liquid biopsy sample based on the first plurality of sequence reads. For each respective somatic variant in the plurality of somatic variants, a corresponding somatic variant allele frequency is determined for the liquid biopsy sample, thereby determining a plurality of somatic variant allele frequencies for the liquid biopsy sample. The second estimated circulating tumor fraction for the liquid biopsy sample is then defined as twice the value of the largest somatic variant allele frequency in the plurality of somatic variant allele frequencies.
[0759] In some embodiments, the second estimate of circulating tumor fraction is generated by detecting a plurality of somatic variants in the liquid biopsy sample based on the first plurality of sequence reads, determining, for each respective somatic variant in the plurality of somatic variants, a corresponding somatic variant allele frequency for the liquid biopsy sample, thereby determining a plurality of somatic variant allele frequencies for the liquid biopsy sample, and then estimating the circulating tumor fraction for the liquid biopsy sample as the value of the largest somatic variant allele frequency in the plurality of somatic variant allele frequencies.
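The somatic-variant estimate can be sketched as below. The function name and the `multiplier` parameter are hypothetical. A multiplier of 1 corresponds to the embodiment that uses the largest somatic variant allele frequency directly, and a multiplier of 2 corresponds to the embodiment that uses twice that value.

```python
from typing import Sequence


def ctf_from_somatic_vafs(somatic_vafs: Sequence[float], multiplier: float = 1.0) -> float:
    """Estimate circulating tumor fraction from the largest somatic VAF.

    multiplier is a hypothetical parameter: 1.0 uses the largest VAF as-is,
    2.0 doubles it.
    """
    if not somatic_vafs:
        raise ValueError("at least one somatic variant is required")
    return multiplier * max(somatic_vafs)


# Illustrative values only.
vafs = [0.02, 0.11, 0.07]
print(ctf_from_somatic_vafs(vafs))
print(ctf_from_somatic_vafs(vafs, multiplier=2.0))
```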
[0760] An example of the off-target tumor estimation method described above is illustrated in Figures 6A3, 6B3, and 6C3, in accordance with some embodiments of the present disclosure. The plot in Figure 6A3 shows the log2 coverage ratios, calculated using Eq. (1) using off-target sequence reads from a test liquid biopsy sample (e.g., binned, corrected, normalized, and segmented using CNVkit). Segments were filtered to remove segments on sex chromosomes and segments with fewer than a minimum number of probes and arranged according to chromosome (indicated along the x-axis).
[0761] A set of tumor purity values TP and a set of copy states CN were selected for calculation of expected log2 coverage ratio. In this implementation, TP = [0.01, 0.02, ... , 0.99] and CN = [0, 1, 2, 3, 4]. Thus, using Eq. (2), the expected log2 coverage ratio can be calculated for each possible combination of TPi and CNj. For example, for TP = 0.5 and CN = 4, the expected log2 coverage ratio is 0.58, and for TP = 0.5 and each possible value of CN, the set of expected log2 coverage ratios is CR_{TP=0.5} = [-1, -0.415, 0, 0.322, 0.585]. Values for expected log2 coverage ratios are indicated in Figure 6A3 by the horizontal bars marked CN0, CN1, CN2, CN3, CN4.
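The expected values quoted above can be reproduced with a short Python sketch. Eq. (2) is not reproduced in this section, so the formula below is an assumption: it models the sample as a mixture of tumor DNA at copy state CN and diploid normal DNA, which yields the listed values at TP = 0.5. The function name `expected_log2_ratio` is hypothetical.

```python
import math

TUMOR_PURITIES = [i / 100 for i in range(1, 100)]  # 0.01, 0.02, ..., 0.99
COPY_STATES = [0, 1, 2, 3, 4]


def expected_log2_ratio(tp: float, cn: int, normal_ploidy: int = 2) -> float:
    """Expected log2 coverage ratio for tumor purity tp and tumor copy state cn.

    Assumed model: coverage is proportional to tp * cn + (1 - tp) * normal_ploidy,
    expressed relative to a diploid reference.
    """
    mixture = tp * cn + (1 - tp) * normal_ploidy
    return math.log2(mixture / normal_ploidy)


# Expected ratios for every (TP, CN) combination.
grid = {(tp, cn): expected_log2_ratio(tp, cn) for tp in TUMOR_PURITIES for cn in COPY_STATES}

# Values at TP = 0.5, rounded to three decimals for comparison with the text.
print([round(expected_log2_ratio(0.5, cn), 3) for cn in COPY_STATES])
```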
[0762] Referring to Figure 6B3, the distances (e.g., error) for each segment from the expected copy state were determined using Eq. (3) and indicated by the vertical arrows (e.g., for a segment k). Eqs. (4) and (5) were then used to determine the copy state of the segment by selecting the expected copy state with the minimum distance. For example, in Figure 6B3, at TP = 0.5 (e.g., a tumor purity of 50%) the segment k is closest to the expected copy state of 0, and thus the segment is assigned a copy state of 0. Figure 6C3 further illustrates the selection of copy states and minimum distances for each segment in the plurality of segments across each chromosome in the reference genome.
[0763] The minimum distances for each segment in the plurality of segments across the reference genome were summed, for each tumor purity value in the set, thus obtaining a score for each tumor purity. For example, Figure 6C3 illustrates a plurality of minimum distances, between each segment and its closest copy state value, for the plurality of segments across the reference genome. Additionally, the scores for each segment were weighted by the number of probes on the segment, according to Eq. (6). Finally, the tumor purity with the smallest score (e.g., smallest error) was selected according to Eq. (7), thus obtaining the circulating tumor fraction estimate for the test liquid biopsy sample.
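The procedure described in the preceding paragraphs can be sketched end to end on synthetic data. Eqs. (1) through (7) are not reproduced in this section, so the sketch rests on stated assumptions: the expected ratio uses a diploid-normal mixture model, the per-segment distance is the squared difference between observed and expected log2 ratios, and each distance is weighted by the segment's probe count. All function names and the synthetic segments are hypothetical.

```python
import math
from typing import List, Tuple

COPY_STATES = [0, 1, 2, 3, 4]
TUMOR_PURITIES = [i / 100 for i in range(1, 100)]  # 0.01 .. 0.99


def expected_log2_ratio(tp: float, cn: int) -> float:
    """Assumed mixture model: tumor at copy state cn, normal cells diploid."""
    return math.log2((tp * cn + 2 * (1 - tp)) / 2)


def score_purity(tp: float, segments: List[Tuple[float, int]]) -> float:
    """Probe-weighted sum of each segment's minimum distance to any copy state.

    segments: list of (observed_log2_ratio, probe_count) pairs.
    """
    total = 0.0
    for observed, probes in segments:
        # Assign the nearest copy state by keeping only the smallest distance.
        min_distance = min((observed - expected_log2_ratio(tp, cn)) ** 2 for cn in COPY_STATES)
        total += min_distance * probes
    return total


def estimate_tumor_fraction(segments: List[Tuple[float, int]]) -> float:
    """Return the simulated tumor purity with the smallest score."""
    return min(TUMOR_PURITIES, key=lambda tp: score_purity(tp, segments))


# Synthetic, noise-free segments generated at a known purity of 0.30.
true_tp = 0.30
segments = [
    (expected_log2_ratio(true_tp, cn), probes)
    for cn, probes in [(1, 40), (2, 120), (3, 60), (4, 25)]
]
print(estimate_tumor_fraction(segments))
```

On this noise-free input the score is exactly zero at the generating purity, so the grid search recovers it. Real data contain noise, and multiple purities can produce similar scores, which is the ambiguity discussed with reference to Figure 19.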
[0764] Optionally, the method generates a circulating tumor fraction estimate 422-3 that can be reported as a biomarker. The ctFE is used, in some embodiments, to match therapies and/or clinical trials (Block 424-3) and can be included in a patient report 426-3 indicating the ctFE.
[0765] Optionally, the tumor fraction estimate obtained by the method is used (423-3) in one or more of the variant identification methods described herein, e.g., with respect to feature extraction module 145 illustrated in Figure 1A.
[0766] Figures 5A3-5B3 collectively provide a flow chart of processes and features for determining accurate circulating tumor fraction estimates using off-target sequence reads, in accordance with some embodiments of the present disclosure.
[0767] The present disclosure provides a method 500-3 for estimating a circulating tumor fraction for a test subject from panel-enriched sequencing data for a plurality of sequences.
[0768] Referring to Block 502-3, the method comprises obtaining, from a first panel-enriched sequencing reaction, a first plurality of sequence reads, where the first plurality of sequence reads comprises at least 100,000 sequence reads.
[0769] The plurality of sequences comprises (i) a corresponding sequence for each cell-free DNA fragment in a first plurality of cell-free DNA fragments obtained from a liquid biopsy sample from the test subject. Each respective cell-free DNA fragment in the first plurality of cell-free DNA fragments corresponds to a respective probe sequence in a plurality of probe sequences used to enrich cell-free DNA fragments in the liquid biopsy sample in the first panel-enriched sequencing reaction.
[0770] The plurality of sequences further comprises (ii) a corresponding sequence for each cell-free DNA fragment in a second plurality of cell-free DNA fragments obtained from the liquid biopsy sample. Each respective cell-free DNA fragment in the second plurality of DNA fragments does not correspond to any probe sequence in the plurality of probe sequences.
[0771] For example, in some embodiments, the plurality of sequence reads from a first panel-enriched sequencing reaction includes a first subset of sequence reads that correspond to cfDNA fragments targeted by one or more probes in a targeted enrichment panel (e.g., on-target), and a second subset of sequence reads that correspond to cfDNA fragments that map to an off-target region of the reference genome not targeted by any of the probes in the targeted enrichment panel (e.g., off-target).
[0772] In some embodiments, the plurality of sequence reads comprises at least 500,000 sequence reads, or at least 1,000,000 sequence reads, or at least 2,000,000 sequence reads, or at least 5,000,000 sequence reads.
[0773] In some embodiments, the obtaining, accessioning, storing, preparing, processing and/or analyzing the liquid biopsy sample from the test subject comprises any of the methods and/or embodiments described above in the present disclosure. In some embodiments, the sequencing reaction comprises any of the methods and/or embodiments described above in the present disclosure.
[0774] In some embodiments, the plurality of probe sequences used to enrich cell-free DNA fragments in the liquid biopsy sample in the first panel-enriched sequencing reaction collectively maps to at least 25 different genes in the human reference genome. In some embodiments, the plurality of probe sequences collectively maps to at least 50, at least 100, at least 250, at least 500, or at least 1000 different genes in the human reference genome. In some embodiments, the plurality of probe sequences collectively maps to at least 10 of the genes listed in Table 1. In some embodiments, the plurality of probe sequences collectively maps to at least 20, 25, 30, 40, 50, 60, 75, 100, or all 105 of the genes listed in Table 1.
[0775] For example, in some embodiments, a targeted enrichment panel comprises any of the embodiments described above in the present disclosure. For example, in some embodiments, the targeted enrichment panel includes probes targeting one or more gene loci, e.g., exon or intron loci. In some embodiments, the targeted enrichment panel includes probes targeting one or more loci not encoding a protein, e.g., regulatory loci, miRNA loci, and other non-coding loci, e.g., that have been found to be associated with cancer. In some embodiments, the plurality of loci includes at least 25, 50, 100, 150, 200, 250, 300, 350, 400, 500, 750, 1000, 2500, 5000, or more human genomic loci.
[0776] In some embodiments, the targeted enrichment panel includes probes targeting one or more of the genes listed in Table 1. In some embodiments, the targeted enrichment panel includes probes targeting at least 5 of the genes listed in Table 1. In some embodiments, the targeted enrichment panel includes probes targeting at least 10 of the genes listed in Table 1. In some embodiments, the targeted enrichment panel includes probes targeting at least 25 of the genes listed in Table 1. In some embodiments, the targeted enrichment panel includes probes targeting at least 50 of the genes listed in Table 1. In some embodiments, the targeted enrichment panel includes probes targeting at least 75 of the genes listed in Table 1. In some embodiments, the targeted enrichment panel includes probes targeting at least 100 of the genes listed in Table 1. In some embodiments, the targeted enrichment panel includes probes targeting all of the genes listed in Table 1.
[0777] Referring to Block 504-3, the method comprises determining a plurality of bin-level coverage ratios from the plurality of sequences. Each respective bin-level coverage ratio in the plurality of bin-level coverage ratios corresponds to a respective bin in a plurality of bins, and each respective bin in the plurality of bins represents a corresponding region of a human reference genome. Additionally, each respective bin-level coverage ratio in the plurality of bin-level coverage ratios is determined from a comparison of (i) a number of sequence reads in the first plurality of sequence reads that map to the corresponding bin and (ii) a number of sequence reads from one or more reference samples that map to the corresponding bin.
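One simple way to form such a comparison is a log2 ratio of depth-normalized read counts, sketched below. This is an illustration under stated assumptions rather than the disclosed calculation: the total-count normalization, the `pseudocount` parameter, and the function name are all hypothetical, and production tools typically apply further corrections (for example, for GC content).

```python
import math
from typing import List, Sequence


def bin_log2_ratios(
    sample_counts: Sequence[int],
    reference_counts: Sequence[int],
    pseudocount: float = 0.5,
) -> List[float]:
    """Compute a log2 coverage ratio per bin from test and reference read counts.

    Counts are scaled by each library's total so that sequencing depth cancels out.
    The pseudocount (an assumption) avoids taking the log of zero for empty bins.
    """
    if len(sample_counts) != len(reference_counts):
        raise ValueError("sample and reference must cover the same bins")
    sample_total = sum(sample_counts)
    reference_total = sum(reference_counts)
    ratios = []
    for s, r in zip(sample_counts, reference_counts):
        sample_fraction = (s + pseudocount) / sample_total
        reference_fraction = (r + pseudocount) / reference_total
        ratios.append(math.log2(sample_fraction / reference_fraction))
    return ratios


# Illustrative counts: the middle bin is over-represented in the test sample.
print(bin_log2_ratios([100, 200, 100], [100, 100, 100]))
```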
[0778] In some embodiments, each bin is defined as any region of a reference genome (e.g., that maps to a location in a reference genome). For example, in some embodiments, a bin is any number of bases in size. In some embodiments, a bin is 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or more than 30 base pairs long. In some embodiments, a bin is at least 30, at least 40, at least 50, at least 60, at least 70, at least 80, at least 90, at least 100, at least 110, at least 120, at least 130, at least 140, at least 150, at least 160, at least 170, at least 180, at least 190, or at least 200 base pairs long. In some embodiments, a bin is between 5 base pairs and 100,000 base pairs long. In some embodiments, a bin is between 10 and 10,000 base pairs long. In some embodiments, a bin is greater than 100,000 base pairs long. In some embodiments, each bin in the plurality of bins is the same size. In some embodiments, a first bin in the plurality of bins is a different size from a second bin in the plurality of bins. In some embodiments, each bin further comprises a start and end position that corresponds to a location on a reference genome. In some embodiments, the plurality of bins comprises at least 10, at least 50, at least 100, at least 1,000, at least 2,000, at least 5,000, at least 10,000, at least 20,000, at least 50,000, at least 100,000, at least 500,000, at least 1x10^6, at least 2x10^6, at least 5x10^6, at least 1x10^7, or at least 1x10^8 bins.
[0779] In some embodiments, on average, each respective bin in the plurality of bins has two or more, three or more, five or more, ten or more, fifteen or more, twenty or more, fifty or more, one hundred or more, five hundred or more, one thousand or more, ten thousand or more, or 100,000 or more sequence reads in the plurality of sequence reads mapping onto the portion of the reference genome corresponding to the respective bin, where each such sequence read uniquely represents a different molecule in the plurality of cell-free nucleic acids in the liquid biopsy sample. For instance, in some embodiments, the plurality of cell-free nucleic acids in the liquid biopsy sample are sequenced with a sequencing methodology that makes use of unique molecular identifiers (UMIs) for each cell-free nucleic acid in the liquid biopsy sample and each sequence read in the plurality of sequence reads has a unique UMI. In such embodiments, sequence reads with the same UMI are bagged (collapsed) into a single sequence read bearing the UMI.
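Collapsing reads by UMI before counting can be sketched as follows. The input layout (pairs of UMI string and bin index) and the function name are hypothetical. The sketch assumes exact UMI matching, whereas practical pipelines often also tolerate sequencing errors in the UMI and consider mapping position.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def count_unique_molecules(reads: Iterable[Tuple[str, int]]) -> Dict[int, int]:
    """Count unique molecules per bin, treating reads with the same UMI as duplicates.

    reads: iterable of (umi, bin_index) pairs, one per mapped sequence read.
    """
    umis_per_bin = defaultdict(set)
    for umi, bin_index in reads:
        # A set keeps one entry per UMI, so duplicate reads collapse to one molecule.
        umis_per_bin[bin_index].add(umi)
    return {bin_index: len(umis) for bin_index, umis in umis_per_bin.items()}


# Illustrative reads: two duplicates of UMI "AAC" in bin 0 count once.
reads = [("AAC", 0), ("AAC", 0), ("GGT", 0), ("AAC", 1)]
print(count_unique_molecules(reads))
```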
[0780] In some embodiments, each bin-level coverage ratio in the plurality of bin-level coverage ratios comprises any measurement of a number of copies of a genomic sequence compared to a reference sequence (e.g., a copy ratio, log2 ratio, coverage ratio, base fraction, allele fraction (e.g., VAF), tumor ploidy, etc.).
[0781] In some embodiments, each sequence read in the first plurality of sequence reads that map to the corresponding bin (e.g., used for comparison to determine a bin-level coverage ratio) is a unique sequence read. In some embodiments, each sequence read in the first plurality of sequence reads that map to the corresponding bin comprises one or more unique identifiers (e.g., a unique molecular identifier or UMI). For example, in some embodiments, each sequence read that originates from (e.g., was amplified or sequenced from) a unique original cfDNA fragment comprises an identifier that indicates the original cfDNA fragment from which the sequence read is derived. In some such embodiments, a plurality of duplicate sequence reads originating from the same original cfDNA fragment share the same identifier.
[0782] In some embodiments, the sequence reads from the one or more reference samples that map to the corresponding bin are prepared using a DNA extraction and enrichment matched process, e.g., where the same process used on the test sample is also used on the one or more reference samples. In some embodiments, the sequence reads from the one or more reference samples are prepared using the same sequencing methodology as is used to generate the sequence reads for the test sample.
[0783] In some alternative embodiments, the determining a plurality of bin-level coverage ratios from the plurality of sequences is determined from a comparison of (i) a number of sequence reads in the second plurality of sequence reads that map to the corresponding bin and (ii) a number of sequence reads from one or more reference samples that map to the corresponding bin. For example, in some such embodiments, the determining the plurality of bin-level coverage ratios is performed using off-target sequence reads (e.g., not panel-enriched sequence reads) rather than on-target sequence reads (e.g., panel-enriched). In some such embodiments, the bins, sequence reads, method of preparing the one or more reference samples and/or method of determining the coverage ratios comprises any of the presently disclosed embodiments described above.
[0784] Referring to Block 506-3, the method further comprises determining a plurality of segment-level coverage ratios. A plurality of segments is formed by grouping respective subsets of adjacent bins in the plurality of bins based on a similarity between the respective coverage ratios of the subset of adjacent bins. For each respective segment in the plurality of segments, a segment-level coverage ratio is determined based on the corresponding bin-level coverage ratios for each bin in the respective segment.
[0785] In some embodiments, the segmentation is performed using circular binary segmentation (CBS). In some embodiments, the segment-level coverage ratio comprises any measurement of a number of copies of a genomic sequence compared to a reference sequence (e.g., a copy ratio, log2 ratio, coverage ratio, base fraction, allele fraction (e.g., VAF), tumor ploidy, etc.). In some embodiments, the segment-level coverage ratio is obtained by a measure of central tendency of the plurality of bin-level coverage ratios for each bin in the respective segment. For example, in some embodiments, the segment-level coverage ratio is obtained by an arithmetic mean, a weighted mean, a midrange, a midhinge, a trimean, a Winsorized mean, a mean, a median or a mode of the plurality of bin-level coverage ratios for each bin in the respective segment.
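Summarizing bin-level ratios into a segment-level ratio can be sketched as below. The sketch assumes the segment boundaries have already been produced by a segmentation step such as CBS, which is not implemented here. The function name and the half-open index ranges are hypothetical, and only two of the listed measures of central tendency are shown.

```python
from statistics import mean, median
from typing import List, Sequence, Tuple


def segment_level_ratios(
    bin_ratios: Sequence[float],
    segments: Sequence[Tuple[int, int]],
    method: str = "median",
) -> List[float]:
    """Summarize bin-level ratios within each segment.

    segments: (start, end) half-open index ranges over bin_ratios, assumed to
              come from an upstream segmentation step.
    method:   "mean" or "median".
    """
    summarizers = {"mean": mean, "median": median}
    if method not in summarizers:
        raise ValueError(f"unsupported method: {method}")
    summarize = summarizers[method]
    return [summarize(bin_ratios[start:end]) for start, end in segments]


# Illustrative bins: a near-neutral segment followed by a gained segment.
bin_ratios = [0.0, 0.1, -0.1, 0.6, 0.5, 0.7]
print(segment_level_ratios(bin_ratios, [(0, 3), (3, 6)], method="median"))
```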
[0786] In some embodiments, each segment is further filtered to remove one or more segments that fail to satisfy a filtering criterion. In some such embodiments, the filtering criterion is a position on a sex chromosome, where segments that are located on sex chromosomes are removed from the plurality of segments. In some embodiments, the filtering criterion is a minimum per-segment probe threshold. In some such embodiments, the filtering is performed by tallying the number of probes in the targeted enrichment panel that correspond to reference sequences spanned by the respective segment. If the probe count for the respective segment obtained from the tallying is below a specified probe threshold, then the segment is removed from the plurality of segments.
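The two filtering criteria can be sketched as follows. The `Segment` record, the chromosome naming, and the threshold value used in the example are hypothetical; the disclosure does not fix a specific probe threshold in this passage.

```python
from dataclasses import dataclass
from typing import List

# Assumption: sex chromosomes may be named with or without a "chr" prefix.
SEX_CHROMOSOMES = {"chrX", "chrY", "X", "Y"}


@dataclass
class Segment:
    chrom: str
    log2_ratio: float
    probe_count: int  # probes in the enrichment panel that fall within the segment


def filter_segments(segments: List[Segment], min_probes: int) -> List[Segment]:
    """Drop segments on sex chromosomes and segments below the probe threshold."""
    return [
        s
        for s in segments
        if s.chrom not in SEX_CHROMOSOMES and s.probe_count >= min_probes
    ]


# Illustrative segments and an illustrative threshold of 10 probes.
segments = [
    Segment("chr1", 0.02, 150),
    Segment("chr7", 0.45, 4),
    Segment("chrX", -0.90, 80),
]
print(filter_segments(segments, min_probes=10))
```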
[0787] Referring to Block 508-3, the method further comprises fitting, for each respective simulated circulating tumor fraction in a plurality of simulated circulating tumor fractions, each respective segment in the plurality of segments to a respective integer copy state in a plurality of integer copy states. The fitting is performed by identifying the respective integer copy state in the plurality of integer copy states that best matches the segment-level coverage ratio. The fitting thus generates, for each respective simulated circulating tumor fraction in the plurality of simulated tumor fractions, a respective set of integer copy states for the plurality of segments.
[0788] In some embodiments, a simulated circulating tumor fraction is a specified value. In some embodiments, the simulated circulating tumor fraction is between 10^-6 and 0.999. In some embodiments, the simulated circulating tumor fraction is between 10^-5 and 0.999. In some embodiments, the simulated circulating tumor fraction is between 10^-4 and 0.999. In some embodiments, the simulated circulating tumor fraction is between 0.001 and 0.999. In some embodiments, the simulated circulating tumor fraction is between 0.01 and 0.99. In some embodiments, the simulated circulating tumor fraction is 0 or 100. In some embodiments, the plurality of simulated circulating tumor fractions comprises at least 10 simulated circulating tumor fractions. In some embodiments, the plurality of simulated circulating tumor fractions comprises at least 5, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80, 90, 100 or more simulated circulating tumor fractions.
[0789] In some embodiments, the plurality of circulating tumor fractions comprises 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, or more than 100 circulating tumor fraction values.
[0790] In some embodiments, the plurality of simulated circulating tumor fractions spans a range of at least from 5% to 25%. In some embodiments, the plurality of simulated circulating tumor fractions spans a range of at least from 1% to 50%. In some embodiments, the plurality of simulated circulating tumor fractions spans a range having a lower boundary between about 0.1% and about 5% (e.g., 0.1%, 0.2%, 0.3%, 0.4%, 0.5%, 0.6%, 0.7%, 0.8%, 0.9%, 1%, 1.5%, 2%, 3%, 4%, or 5%) and an upper boundary between about 25% and about 100% (e.g., 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, or 100%). In some embodiments, the span between each consecutive pair of simulated tumor fractions is no more than 5%. In some embodiments, the span between consecutive pairs of simulated tumor fractions is no more than 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, or 10%. In some embodiments, the span between consecutive pairs of simulated tumor fractions is consistent through the entire range of simulated tumor fractions. In other embodiments, the span between consecutive pairs of simulated tumor fractions increases as the simulated tumor fraction increases. That is, in some embodiments, the span between low simulated tumor fractions is small and the span between high tumor fractions is larger.
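Both kinds of grid, one with a consistent span and one whose span grows with the simulated tumor fraction, can be sketched as below. The function names, the geometric growth rule, and the example parameters are hypothetical; the disclosure does not specify how the span increases.

```python
from typing import List


def uniform_grid(lower: float, upper: float, step: float) -> List[float]:
    """Simulated tumor fractions with a constant span between consecutive values."""
    n_steps = int(round((upper - lower) / step))
    # Rounding suppresses floating-point drift in the generated values.
    return [round(lower + i * step, 10) for i in range(n_steps + 1)]


def widening_grid(lower: float, upper: float, first_step: float, growth: float) -> List[float]:
    """Simulated tumor fractions whose span grows by a constant factor each step.

    Assumption: geometric growth is one possible way to make the span increase.
    """
    values = []
    current, step = lower, first_step
    while current <= upper + 1e-12:
        values.append(round(current, 10))
        current += step
        step *= growth
    return values


print(len(uniform_grid(0.01, 0.99, 0.01)))
print(widening_grid(0.001, 0.5, 0.001, 1.5))
```

A widening grid concentrates candidate values at low tumor fractions, where small differences in the estimate are more likely to matter, while keeping the total number of simulated fractions small.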
[0791] In some embodiments, the plurality of circulating tumor fractions comprises every value between 0 and 1 (that is, between 0% circulating tumor fraction and 100% circulating tumor fraction) with a span of 0.01 between each pair of values (e.g., 0.01, 0.02, 0.03,...0.98, 0.99).
[0792] In some embodiments, the plurality of integer copy states comprises a 1-copy state, a 2-copy state, a 3-copy state, and a 4-copy state. In some embodiments, the plurality of integer copy states includes at least 3 states, at least 4 states, at least 5 states, at least six states, or more. In some embodiments, the plurality of states represents a span of consecutive integer values, generally starting from 1. In some embodiments, the plurality of integer copy states comprises at least 1, at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, or at least 10 copy states.
[0793] In some embodiments, the integer copy state is used to obtain a coverage ratio for each respective copy state in the plurality of copy states and each respective simulated circulating tumor fraction in the plurality of simulated circulating tumor fractions. In some embodiments, the coverage ratio is a log2-transformed coverage ratio (e.g., where negative numbers indicate copy number loss and positive numbers indicate copy number gain). In some embodiments, the coverage ratio is between -3 and 3. In some embodiments, the coverage ratio is between -4 and -3, between -3 and -2, between -2 and -1, between -1 and 0, between 0 and 1, between 1 and 2, between 2 and 3, or between 3 and 4.
[0794] In some embodiments, the fitting includes using a maximum likelihood estimation method to fit each respective segment in the plurality of segments to the respective integer copy state. In some such embodiments, the maximum likelihood estimation method is an expectation maximization algorithm that considers the error between each of the plurality of copy states and the segment-level coverage ratio at each of the plurality of simulated circulating tumor fractions.
[0795] In some such embodiments, the identifying the respective integer copy state that best matches the segment-level coverage ratio is performed by, for each respective segment in the plurality of segments, selecting the copy state with the smallest distance (e.g., the smallest error) from the segment-level coverage ratio for the respective segment, and assigning the respective copy state to the segment. In some such embodiments, the method further comprises assigning a copy state to each segment in the plurality of segments, based on a consideration (e.g., a minimization) of the error. In some embodiments, the consideration is performed for each possible copy state corresponding to each segment in the plurality of segments, and the procedure is then repeated for each simulated circulating tumor fraction in the plurality of circulating tumor fractions. Thus, each iteration of the procedure will produce a plurality of sets of integer copy states, where each set of integer copy states is associated with a simulated circulating tumor fraction in the plurality of circulating tumor fractions, and where each integer copy state in the set of integer copy states is associated with a segment in the plurality of segments.
[0796] In some embodiments, the fitting includes, for each respective simulated tumor fraction in the plurality of simulated tumor fractions: determining, for each respective integer copy state in the plurality of integer copy states, a corresponding expected coverage ratio; comparing, for each respective segment in the plurality of segments, the corresponding segment-level coverage ratio to the expected coverage ratio for each respective integer copy state in the plurality of integer copy states; and assigning, for each respective segment in the plurality of segments, a corresponding integer copy state based on the comparison.
[0797] Thus, in some such embodiments, the consideration of the error between each integer copy state and the segment-level coverage ratio of each segment is determined using a comparison between the expected coverage ratio of each integer copy state and the segment-level coverage ratio of each segment.
[0798] In some such embodiments, for each respective integer copy state in the plurality of integer copy states, the corresponding expected coverage ratio is determined according to the relationship:
[0800] where CR is the expected coverage ratio; TPi is the respective simulated circulating tumor fraction, and CNj is the respective integer copy state.
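The relationship referenced in paragraph [0798] is not reproduced in this text. One form that is consistent with the worked values given for Figure 6A3 (for example, TP = 0.5 and CN = 4 giving an expected log2 coverage ratio of about 0.585) is shown below. It is an assumption that treats the sample as a mixture of tumor DNA at copy state CNj and diploid normal DNA, expressed relative to a diploid reference:

```latex
CR_{i,j} = \log_2\!\left( \frac{TP_i \cdot CN_j + 2\,(1 - TP_i)}{2} \right)
```

As a check under this assumption, TP = 0.5 and CN = 4 gives log2((0.5 x 4 + 2 x 0.5) / 2) = log2(1.5), which is approximately 0.585, and CN = 2 gives log2(1) = 0 for any TP.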
[0801] Referring to Block 510-3, the method further comprises determining the circulating tumor fraction for the test subject based on optimization (e.g., minimization) of an error between corresponding segment-level coverage ratios and integer copy states (e.g., relative to the fitted integer copy state) across the plurality of simulated circulating tumor fractions.
[0802] In some embodiments, the determining the circulating tumor fraction for the test subject comprises determining a measure of fit, for each respective simulated tumor fraction in the plurality of simulated tumor fractions, based on the aggregate of a difference, for each respective segment in the plurality of segments, between the respective segment-level coverage ratio and the expected coverage ratio for the corresponding copy state fit to the respective segment. The determining further comprises selecting the simulated tumor fraction, in the plurality of tumor fractions, with the best measure of fit.
[0803] In some embodiments, the measure of fit for each respective segment, in the plurality of segments, is defined by the relationship:
[0804] $w_i = \sum_k \varepsilon_k l_k$,
[0805] where $w_i$ is the measure of fit for simulated tumor fraction i, $\varepsilon_k$ is the square of the difference between the respective segment-level coverage ratio and expected coverage ratio for the copy state k at tumor fraction i, and $l_k$ is the number of probe sequences, in the plurality of probe sequences, that fall within the respective segment.
[0806] For example, in some embodiments, the optimization of the respective segment-level errors is a minimization of error to obtain an error score. In some embodiments, the error score is determined by calculating the sum of errors between each of the plurality of assigned copy states and the segment-level coverage ratio (e.g., relative to the fitted integer copy state), for each segment in the plurality of segments, for each of a plurality of simulated circulating tumor fractions. Thus, in embodiments where the error between the segment-level coverage ratio for each segment and the assigned copy state for the respective segment is a minimized error (e.g., due to the selection of nearest copy states), the sum of errors thus generates a minimized error score for each simulated circulating tumor fraction in the plurality of circulating tumor fractions. In some such embodiments, the minimized error scores are compared, and the smallest score is selected, thus selecting the circulating tumor fraction estimate having the corresponding smallest score as the circulating tumor fraction estimate for the test subject. In some embodiments, the error scores are further weighted prior to summing (e.g., by weighting each error in the summed error score based upon a number of probes corresponding to the respective segment). Additional embodiments and examples illustrating determining the circulating tumor fraction for the test subject are described above with reference to Figures 4F3 and 6A3-C3.
[0807] In some embodiments, the obtained circulating tumor fraction estimate is used for further downstream analysis and biomarker detection (e.g., calculation of variant allele fractions, variant calling, and/or identification of other metrics). In some embodiments, the obtained circulating tumor fraction estimate is used as a metric for disease detection, diagnosis, and/or treatment. In some embodiments, the obtained circulating tumor fraction estimate is included in a clinical report made available to the patient or a clinician. In some embodiments, the obtained circulating tumor fraction estimate is used to select appropriate therapies and/or clinical trials for assessment of treatment response.
[0808] In some embodiments, the methods described herein include generating a clinical report 139-3 (e.g., a patient report), providing clinical support for personalized cancer therapy, and/or using the information curated from sequencing of a liquid biopsy sample, as described above. In some embodiments, the report is provided to a patient, physician, medical personnel, or researcher in a digital copy (for example, a JSON object, a pdf file, or an image on a website or portal) or a hard copy (for example, printed on paper or another tangible medium). A report object, such as a JSON object, can be used for further processing and/or display. For example, information from the report object can be used to prepare a clinical laboratory report for return to an ordering physician. In some embodiments, the report is presented as text, as audio (for example, recorded or streaming), as images, or in another format and/or any combination thereof.
[0809] The report includes information related to the specific characteristics of the patient’s cancer, e.g., detected genetic variants, epigenetic abnormalities, associated oncogenic pathogenic infections, and/or pathology abnormalities. In some embodiments, other characteristics of a patient’s sample and/or clinical records are also included in the report. For example, in some embodiments, the clinical report includes information on clinical variants, e.g., one or more of copy number variants (e.g., for actionable genes CCNE1, CD274(PD-L1), EGFR, ERBB2(HER2), MET, MYC, BRCA1, and/or BRCA2), fusions, translocations, and/or rearrangements (e.g., in actionable genes ALK, ROS1, RET, NTRK1, FGFR2, FGFR3, NTRK2 and/or NTRK3), pathogenic single nucleotide polymorphisms, insertion-deletions (e.g., somatic/tumor and/or germline/normal), therapy biomarkers, microsatellite instability status, and/or tumor mutational burden.
[0810] Conversion of solid tumor test to liquid biopsy test. In one embodiment, the solid tissue sample is insufficient for NGS testing (for example, the sample is too small or too degraded, the amount or quality of nucleic acids extracted from the sample does not result in quality NGS results that would result in reliable determination of variants and/or other genetic characteristics of the sample), and the physician or patient may decide to convert the solid tissue test that was ordered to a liquid biopsy test to be performed on a liquid biopsy sample collected from the same patient. The resulting report and/or display of the results on a portal may include an “xF Conversion Badge” to distinguish any order that has been converted from solid tissue test to a liquid biopsy test (compared to, for example, a liquid biopsy test that was not initially ordered as a solid tissue test). This will allow a user to identify which orders have been converted by this process, and distinguish between orders that were intentionally placed for the liquid biopsy panel.
[0811] Longitudinal Reporting. In various embodiments, a report may include and/or compare the results of multiple liquid biopsy tests and/or solid tumor tests (for example, multiple tests associated with the same patient). The results of multiple liquid biopsy tests and/or solid tumor tests may be displayed on a portal in a variety of configurations that may be selected and/or customized by the viewer. The tests may have been performed at different times, and the samples on which the tests were performed may have been collected at different times.
[0812] Download result. Clinical and/or molecular data associated with a patient (for example, information that would be included in the report) may be aggregated and made available via the portal. Any portion of the report data may be available for download (for example, as a CSV file) by the physician and/or patient. In various embodiments, the data may include data related to genetic variants, RNA expression levels, immunotherapy markers (including MSI and TMB), RNA fusions, etc. In one embodiment, if a physician or medical facility has ordered multiple tests (all tests may be associated with the same patient or tests may be associated with multiple patients), results associated with more than one test may be aggregated into a single file for downloading.
[0813] Methods Integrating Multiple Improvement
[0814] Advantageously, the present disclosure describes several improvements relating to the analysis of cell-free DNA in a liquid biopsy sample from a subject with cancer. For instance, among other aspects, the present disclosure describes improvements in (i) somatic variant (e.g., SNP) identification, (ii) focal copy number variation identification, and (iii) circulating tumor fraction determination. It is contemplated that various combinations of these improvements, as well as other non-conventional aspects described herein, may be integrated into a common bioinformatics pipeline for analyzing liquid biopsy samples. For instance, in some embodiments, a bioinformatics pipeline integrating one, two, or all three of these improvements is further improved by parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, by parallel analysis of nucleic acids from a non-cancerous tissue of the subject, or both. Examples of various combinations of improvements that may be combined into a single liquid biopsy bioinformatic pipeline, methods associated therewith, systems for performing such methods, and/or non-transitory computer readable media for executing such methods are described below. It will be appreciated that these combinations can be performed with any other preparatory or bioinformatic steps described in the other methods described herein, e.g., methods 200, 400-1, 400-2, 400-3, 450, 500-1, 500-2, 500-3, 600-1, and 600-2 as illustrated in Figures 2, 4, 5, and 6, and further described above.
[0815] In some embodiments, a bioinformatics pipeline for analyzing nucleic acids in a liquid biopsy is provided that integrates at least an improvement in somatic variant identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Somatic Sequence Variants” and/or “Variant Identification,” and an improvement in focal copy number identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Copy Number Variation” and/or “Copy Number Variation.”
[0816] Accordingly, in some embodiments, a method is provided for analyzing a liquid biopsy sample from a subject with cancer that includes (i) obtaining, from a first sequencing reaction of cell-free DNA fragments, a first plurality of sequence reads aligned to a reference sequence for the species of the subject, (ii) determining whether a respective candidate sequence variant (e.g., a SNP) identified from the first plurality of aligned sequence reads can be validated as a somatic sequence variant by comparing a corresponding variant allele fragment count for the respective candidate sequence variant to a dynamic variant count threshold for the locus of the reference sequence that the candidate variant maps to, where the dynamic variant count threshold is based upon a pre-test odds of a positive variant call for the locus based upon a prevalence of variants in a genomic region that includes the locus from a first set of nucleic acids obtained from a cohort of subjects having the cancer condition, such that when the corresponding variant fragment count satisfies the dynamic variant count threshold, the presence of the somatic sequence is validated, and when the corresponding variant fragment count does not satisfy the dynamic variant count threshold, the presence of the somatic sequence is rejected, and (iii) determining whether a candidate focal copy number variation for a respective genomic segment, identified from the first plurality of aligned sequence reads, can be validated as a somatic focal copy number variation by (a) determining bin-level sequence ratios, segment-level sequence ratios, and segment-level measures of dispersion from a comparison of (i) the sequence reads in the first plurality of sequence reads that map to respective genomic bins or genomic segments to (ii) sequence reads from one or more reference samples that map to the same respective genomic bins or genomic segments, e.g., as described above in the section titled “Systems and Methods for Improved Validation of Copy Number Variation” and/or “Copy Number Variation,” and (b) determining whether determined bin-level sequence ratios, segment-level sequence ratios, and segment-level measures of dispersion corresponding to the respective genomic segment satisfy a plurality of filters that include (1) a measure of central tendency bin-level sequence ratio filter that is fired when a measure of central tendency of the plurality of bin-level sequence ratios corresponding to the subset of bins encompassed by the respective segment fails to satisfy one or more bin-level sequence ratio thresholds, (2) a confidence filter that is fired when the segment-level measure of dispersion corresponding to the respective segment fails to satisfy a confidence threshold, and (3) a measure of central tendency-plus-deviation bin-level sequence ratio filter that is fired when a measure of central tendency of the plurality of bin-level sequence ratios corresponding to the subset of bins encompassed by the respective segment fails to satisfy one or more measure of central tendency-plus-deviation bin-level sequence ratio thresholds, e.g., as described above in the section titled “Systems and Methods for Improved Validation of Copy Number Variation” and/or “Copy Number Variation,” such that when the determined bin-level sequence ratios, segment-level sequence ratios, and segment-level measures of dispersion satisfy all of the filters in the plurality of filters, the focal copy number variation is validated, and when the determined bin-level sequence ratios, segment-level sequence ratios, and segment-level measures of dispersion do not satisfy all of the filters in the plurality of filters, the focal copy number variation is rejected.
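The three-filter validation of a candidate focal copy number variation recited in step (iii)(b) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the disclosed implementation: the median is used as the measure of central tendency, the population standard deviation as the measure of deviation, the candidate is assumed to be an amplification (so larger ratios pass), and all function names, parameter names, and numeric values are hypothetical.

```python
from statistics import median, pstdev


def center_plus_deviation(background_ratios, k=2.0):
    """Assumed form of a central tendency-plus-deviation threshold:
    background median plus k population standard deviations."""
    return median(background_ratios) + k * pstdev(background_ratios)


def fired_filters(segment_bin_ratios, segment_dispersion, ratio_threshold,
                  confidence_threshold, plus_deviation_threshold):
    """Return the names of the filters that fire for one candidate segment."""
    center = median(segment_bin_ratios)  # assumed measure of central tendency
    fired = []
    if center < ratio_threshold:
        fired.append("central_tendency")
    if segment_dispersion > confidence_threshold:
        fired.append("confidence")
    if center < plus_deviation_threshold:
        fired.append("central_tendency_plus_deviation")
    return fired


def validate_focal_cnv(*args, **kwargs):
    """The candidate is validated only when no filter fires."""
    return not fired_filters(*args, **kwargs)


# Made-up log2-style ratios used purely for illustration.
background = [-0.1, 0.0, 0.1, 0.0]
threshold = center_plus_deviation(background)
amplified = validate_focal_cnv([1.0, 1.2, 1.1], 0.2, 0.8, 0.5, threshold)
weak = fired_filters([0.1, 0.05, 0.12], 0.2, 0.8, 0.5, threshold)
noisy = fired_filters([1.0, 1.2, 1.1], 0.9, 0.8, 0.5, threshold)
```

Returning the list of fired filters, rather than a single Boolean, keeps the reason for a rejection available for reporting or review.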
[0817] In some embodiments, the bioinformatics pipeline integrating at least an improvement in somatic variant identification and an improvement in focal copy number identification, also integrates an improvement in circulating tumor fraction determination, e.g., as described above in the section entitled “Systems and Methods for Improved Circulating Tumor Fraction Estimates” and/or “Circulating Tumor Fraction.” In some embodiments, the bioinformatics pipeline integrating at least an improvement in somatic variant identification and an improvement in focal copy number identification, is further improved by parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing.” In some embodiments, the bioinformatics pipeline integrating at least an improvement in somatic variant identification and an improvement in focal copy number identification, is further improved by parallel analysis of nucleic acids from a non-cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing.”
[0818] In some embodiments, the bioinformatics pipeline integrating at least an improvement in somatic variant identification and an improvement in focal copy number identification, also integrates an improvement in circulating tumor fraction determination, e.g., as described above in the section entitled “Systems and Methods for Improved Circulating Tumor Fraction Estimates” and/or “Circulating Tumor Fraction,” and is further improved by parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing.” In some embodiments, the bioinformatics pipeline integrating at least an improvement in somatic variant identification and an improvement in focal copy number identification, also integrates an improvement in circulating tumor fraction determination, e.g., as described above in the section entitled “Systems and Methods for Improved Circulating Tumor Fraction Estimates” and/or “Circulating Tumor Fraction,” and is further improved by parallel analysis of nucleic acids from a non-cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing.” In some embodiments, the bioinformatics pipeline integrating at least an improvement in somatic variant identification and an improvement in focal copy number identification, is further improved by parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing” and is further improved by parallel analysis of nucleic acids from a non-cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing.”
[0819] In some embodiments, the bioinformatics pipeline integrating at least an improvement in somatic variant identification and an improvement in focal copy number identification, also integrates an improvement in circulating tumor fraction determination, e.g., as described above in the section entitled “Systems and Methods for Improved Circulating Tumor Fraction Estimates” and/or “Circulating Tumor Fraction,” is further improved by parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing” and is further improved by parallel analysis of nucleic acids from a non-cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing.”
[0820] In some embodiments, a bioinformatics pipeline for analyzing nucleic acids in a liquid biopsy is provided that integrates at least an improvement in somatic variant identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Somatic Sequence Variants” and/or “Variant Identification,” and an improvement in circulating tumor fraction determination, e.g., as described above in the section entitled “Systems and Methods for Improved Circulating Tumor Fraction Estimates” and/or “Circulating Tumor Fraction.”
[0821] Accordingly, in some embodiments, a method is provided for analyzing a liquid biopsy sample from a subject with cancer that includes (i) obtaining, from a first sequencing reaction of cell-free DNA fragments, a first plurality of sequence reads aligned to a reference sequence for the species of the subject, (ii) determining whether a respective candidate sequence variant (e.g., a SNP) identified from the first plurality of aligned sequence reads can be validated as a somatic sequence variant by comparing a corresponding variant allele fragment count for the respective candidate sequence variant to a dynamic variant count threshold for the locus of the reference sequence that the candidate variant maps to, where the dynamic variant count threshold is based upon a pre-test odds of a positive variant call for the locus based upon a prevalence of variants in a genomic region that includes the locus from a first set of nucleic acids obtained from a cohort of subjects having the cancer condition, such that when the corresponding variant fragment count satisfies the dynamic variant count threshold, the presence of the somatic sequence is validated, and when the corresponding variant fragment count does not satisfy the dynamic variant count threshold, the presence of the somatic sequence is rejected, and (iii) estimating a circulating tumor fraction for the subject by (a) determining bin-level coverage ratios and segment-level coverage ratios from a comparison of (i) the number of sequence reads in the first plurality of sequence reads that map to respective genomic bins or genomic segments and (ii) the number of sequence reads from one or more reference samples that map to the same respective genomic bins or genomic segments, e.g., as described above in the section titled “Systems and Methods for Improved Circulating Tumor Fraction Estimates” and/or “Circulating Tumor Fraction,” (b) identifying integer copy states that best match segment-level coverage ratios by fitting segments to integer copy states for a plurality of simulated circulating tumor fractions, and (c) estimating the circulating tumor fraction for the test subject based on a measure of fit between corresponding segment-level coverage ratios and integer copy states across the plurality of simulated circulating tumor fractions, e.g., as described above in the section titled “Systems and Methods for Improved Circulating Tumor Fraction Estimates” and/or “Circulating Tumor Fraction.”
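Steps (iii)(b) and (iii)(c) can be illustrated with the following minimal Python sketch. It assumes a simple mixture model that is not stated in the disclosure: a diploid normal background, linear (not log-transformed) segment-level coverage ratios, an unweighted sum of squared residuals as the measure of fit, and a hypothetical cap on the tumor copy number. All names and example values are illustrative.

```python
def expected_ratio(copies, tumor_fraction, normal_copies=2):
    """Expected coverage ratio for a segment with `copies` tumor copies,
    assuming a mixture of tumor-derived and diploid normal cfDNA."""
    mixed = tumor_fraction * copies + (1 - tumor_fraction) * normal_copies
    return mixed / normal_copies


def fit_segments(segment_ratios, tumor_fraction, max_copies=5):
    """Assign each segment its best-matching integer copy state and return
    the states together with the summed squared residual."""
    states, error = [], 0.0
    for observed in segment_ratios:
        best = min(
            range(max_copies + 1),
            key=lambda c: abs(observed - expected_ratio(c, tumor_fraction)),
        )
        states.append(best)
        error += (observed - expected_ratio(best, tumor_fraction)) ** 2
    return states, error


def estimate_tumor_fraction(segment_ratios, candidate_fractions, max_copies=5):
    """Pick the simulated tumor fraction whose integer fit has the least error."""
    fits = {f: fit_segments(segment_ratios, f, max_copies)
            for f in candidate_fractions}
    best_fraction = min(fits, key=lambda f: fits[f][1])
    return best_fraction, fits[best_fraction][0]


# Made-up ratios consistent with copy states 1, 2, 3, 4 at a tumor fraction of 0.4.
grid = [i / 20 for i in range(1, 20)]
fraction, copy_states = estimate_tumor_fraction([0.8, 1.0, 1.2, 1.4], grid)
```

With a single altered segment, several combinations of tumor fraction and copy state explain the data equally well under this model, which is why the example uses several segments and a bounded copy number; a practical implementation would likely need additional constraints or weighting.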
[0822] In some embodiments, the bioinformatics pipeline integrating at least an improvement in somatic variant identification and an improvement in circulating tumor fraction determination, also integrates an improvement in focal copy number identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Copy Number Variation” and/or “Copy Number Variation.” In some embodiments, the bioinformatics pipeline integrating at least an improvement in somatic variant identification and an improvement in circulating tumor fraction determination, is further improved by parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing.” In some embodiments, the bioinformatics pipeline integrating at least an improvement in somatic variant identification and an improvement in circulating tumor fraction determination, is further improved by parallel analysis of nucleic acids from a non-cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing.”
[0823] In some embodiments, the bioinformatics pipeline integrating at least an improvement in somatic variant identification and an improvement in circulating tumor fraction determination, also integrates an improvement in focal copy number identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Copy Number Variation” and/or “Copy Number Variation,” and is further improved by parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing.” In some embodiments, the bioinformatics pipeline integrating at least an improvement in somatic variant identification and an improvement in circulating tumor fraction determination, also integrates an improvement in focal copy number identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Copy Number Variation” and/or “Copy Number Variation,” and is further improved by parallel analysis of nucleic acids from a non-cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing.” In some embodiments, the bioinformatics pipeline integrating at least an improvement in somatic variant identification and an improvement in circulating tumor fraction determination, is further improved by parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing,” and is further improved by parallel analysis of nucleic acids from a non-cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing.”
[0824] In some embodiments, the bioinformatics pipeline integrating at least an improvement in somatic variant identification and an improvement in circulating tumor fraction determination, also integrates an improvement in focal copy number identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Copy Number Variation” and/or “Copy Number Variation,” is further improved by parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing” and is further improved by parallel analysis of nucleic acids from a non-cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing.”
[0825] In some embodiments, a bioinformatics pipeline for analyzing nucleic acids in a liquid biopsy is provided that integrates at least an improvement in somatic variant identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Somatic Sequence Variants” and/or “Variant Identification,” and parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing.”
[0826] Accordingly, in some embodiments, a method is provided for analyzing a liquid biopsy sample from a subject with cancer that includes (i) obtaining, from a first sequencing reaction of cell-free DNA fragments, a first plurality of sequence reads aligned to a reference sequence for the species of the subject, (ii) determining whether a respective candidate sequence variant (e.g., a SNP) identified from the first plurality of aligned sequence reads can be validated as a somatic sequence variant by comparing a corresponding variant allele fragment count for the respective candidate sequence variant to a dynamic variant count threshold for the locus of the reference sequence that the candidate variant maps to, where the dynamic variant count threshold is based upon a pre-test odds of a positive variant call for the locus based upon a prevalence of variants in a genomic region that includes the locus from a first set of nucleic acids obtained from a cohort of subjects having the cancer condition, such that when the corresponding variant fragment count satisfies the dynamic variant count threshold, the presence of the somatic sequence is validated, and when the corresponding variant fragment count does not satisfy the dynamic variant count threshold, the presence of the somatic sequence is rejected, and (iii) obtaining, from a second sequencing reaction of nucleic acid fragments in a solid tumor biopsy sample from the subject, a second plurality of sequence reads aligned to a reference sequence for the species of the subject, and analyzing the nucleic acids from the solid tumor biopsy sample using a parallel analysis including, at least, determining whether a respective candidate sequence variant (e.g., a SNP) identified from the second plurality of aligned sequence reads can be validated as a somatic sequence variant.
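The comparison of a variant allele fragment count to a dynamic, locus-specific threshold in step (ii) can be sketched as follows. The disclosure states only that the threshold is based on pre-test odds derived from cohort prevalence; the binomial noise model, the assumed variant allele fraction, the target post-test odds, and every function and parameter name below are hypothetical choices made for illustration.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def dynamic_count_threshold(prevalence, depth, error_rate=0.001,
                            assumed_vaf=0.01, target_odds=99.0):
    """Smallest fragment count whose post-test odds reach `target_odds`.

    `prevalence` (strictly between 0 and 1) is the cohort prevalence of
    variants in the genomic region containing the locus.
    """
    pre_test_odds = prevalence / (1 - prevalence)
    for k in range(1, depth + 1):
        signal = binomial_tail(k, depth, assumed_vaf)  # true variant present
        noise = binomial_tail(k, depth, error_rate)    # sequencing error only
        if noise == 0.0 or pre_test_odds * signal / noise >= target_odds:
            return k
    return depth + 1  # no achievable count is sufficient


def validate_variant(alt_fragment_count, threshold):
    """A candidate is validated when its count satisfies the threshold."""
    return alt_fragment_count >= threshold


# Made-up prevalences: a frequently mutated region and a rarely mutated one.
hotspot_threshold = dynamic_count_threshold(prevalence=0.2, depth=1000)
rare_threshold = dynamic_count_threshold(prevalence=0.001, depth=1000)
```

Under these assumptions a locus in a frequently mutated region never requires more supporting fragments than a locus in a rarely mutated region, which reflects the stated dependence of the threshold on pre-test odds. The same comparison could be applied to the second plurality of sequence reads from the solid tumor sample.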
[0827] In some embodiments, the bioinformatics pipeline integrating at least an improvement in somatic variant identification and parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, also integrates an improvement in focal copy number identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Copy Number Variation” and/or “Copy Number Variation.” In some embodiments, the bioinformatics pipeline integrating at least an improvement in somatic variant identification and parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, also integrates an improvement in circulating tumor fraction determination, e.g., as described above in the section entitled “Systems and Methods for Improved Circulating Tumor Fraction Estimates” and/or “Circulating Tumor Fraction.” In some embodiments, the bioinformatics pipeline integrating at least an improvement in somatic variant identification and parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, is further improved by parallel analysis of nucleic acids from a non-cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing.”
[0828] In some embodiments, the bioinformatics pipeline integrating at least an improvement in somatic variant identification and parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, also integrates an improvement in focal copy number identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Copy Number Variation” and/or “Copy Number Variation,” and also integrates an improvement in circulating tumor fraction determination, e.g., as described above in the section entitled “Systems and Methods for Improved Circulating Tumor Fraction Estimates” and/or “Circulating Tumor Fraction.” In some embodiments, the bioinformatics pipeline integrating at least an improvement in somatic variant identification and parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, also integrates an improvement in focal copy number identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Copy Number Variation” and/or “Copy Number Variation,” and is further improved by parallel analysis of nucleic acids from a non-cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing.” In some embodiments, the bioinformatics pipeline integrating at least an improvement in somatic variant identification and parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, also integrates an improvement in circulating tumor fraction determination, e.g., as described above in the section entitled “Systems and Methods for Improved Circulating Tumor Fraction Estimates” and/or “Circulating Tumor Fraction,” and is further improved by parallel analysis of nucleic acids from a non-cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing.”
[0829] In some embodiments, the bioinformatics pipeline integrating at least an improvement in somatic variant identification and parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, also integrates an improvement in focal copy number identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Copy Number Variation” and/or “Copy Number Variation,” also integrates an improvement in circulating tumor fraction determination, e.g., as described above in the section entitled “Systems and Methods for Improved Circulating Tumor Fraction Estimates” and/or “Circulating Tumor Fraction,” and is further improved by parallel analysis of nucleic acids from a non-cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing.”
[0830] In some embodiments, a bioinformatics pipeline for analyzing nucleic acids in a liquid biopsy is provided that integrates at least an improvement in somatic variant identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Somatic Sequence Variants” and/or “Variant Identification,” and parallel analysis of nucleic acids from a non-cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing.”
[0831] Accordingly, in some embodiments, a method is provided for analyzing a liquid biopsy sample from a subject with cancer that includes (i) obtaining, from a first sequencing reaction of cell-free DNA fragments, a first plurality of sequence reads aligned to a reference sequence for the species of the subject, (ii) determining whether a respective candidate sequence variant (e.g., a SNP) identified from the first plurality of aligned sequence reads can be validated as a somatic sequence variant by comparing a corresponding variant allele fragment count for the respective candidate sequence variant to a dynamic variant count threshold for the locus of the reference sequence that the candidate variant maps to, where the dynamic variant count threshold is based upon a pre-test odds of a positive variant call for the locus based upon a prevalence of variants in a genomic region that includes the locus from a first set of nucleic acids obtained from a cohort of subjects having the cancer condition, such that when the corresponding variant fragment count satisfies the dynamic variant count threshold, the presence of the somatic sequence is validated, and when the corresponding variant fragment count does not satisfy the dynamic variant count threshold, the presence of the somatic sequence is rejected, and (iii) obtaining, from a second sequencing reaction of nucleic acid fragments in a non-cancerous tissue sample from the subject, a second plurality of sequence reads aligned to a reference sequence for the species of the subject, and analyzing the nucleic acids from the non-cancerous tissue sample using a parallel analysis including, at least, determining whether a respective candidate sequence variant (e.g., a SNP) identified from the second plurality of aligned sequence reads can be validated as a somatic sequence variant.
[0832] In some embodiments, the bioinformatics pipeline integrating at least an improvement in somatic variant identification and parallel analysis of nucleic acids from a non-cancerous tissue sample of the subject, also integrates an improvement in focal copy number identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Copy Number Variation” and/or “Copy Number Variation.” In some embodiments, the bioinformatics pipeline integrating at least an improvement in somatic variant identification and parallel analysis of nucleic acids from a non-cancerous tissue sample of the subject, also integrates an improvement in circulating tumor fraction determination, e.g., as described above in the section entitled “Systems and Methods for Improved Circulating Tumor Fraction Estimates” and/or “Circulating Tumor Fraction.” In some embodiments, the bioinformatics pipeline integrating at least an improvement in somatic variant identification and parallel analysis of nucleic acids from a non-cancerous tissue sample of the subject, is further improved by parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing.”
[0833] In some embodiments, the bioinformatics pipeline integrating at least an improvement in somatic variant identification and parallel analysis of nucleic acids from a non-cancerous tissue sample of the subject, also integrates an improvement in focal copy number identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Copy Number Variation” and/or “Copy Number Variation,” and also integrates an improvement in circulating tumor fraction determination, e.g., as described above in the section entitled “Systems and Methods for Improved Circulating Tumor Fraction Estimates” and/or “Circulating Tumor Fraction.” In some embodiments, the bioinformatics pipeline integrating at least an improvement in somatic variant identification and parallel analysis of nucleic acids from a non-cancerous tissue sample of the subject, also integrates an improvement in focal copy number identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Copy Number Variation” and/or “Copy Number Variation,” and is further improved by parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing.” In some embodiments, the bioinformatics pipeline integrating at least an improvement in somatic variant identification and parallel analysis of nucleic acids from a non-cancerous tissue sample of the subject, also integrates an improvement in circulating tumor fraction determination, e.g., as described above in the section entitled “Systems and Methods for Improved Circulating Tumor Fraction Estimates” and/or “Circulating Tumor Fraction,” and is further improved by parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing.”
[0834] In some embodiments, the bioinformatics pipeline integrating at least an improvement in somatic variant identification and parallel analysis of nucleic acids from a non-cancerous tissue sample of the subject, also integrates an improvement in focal copy number identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Copy Number Variation” and/or “Copy Number Variation,” also integrates an improvement in circulating tumor fraction determination, e.g., as described above in the section entitled “Systems and Methods for Improved Circulating Tumor Fraction Estimates” and/or “Circulating Tumor Fraction,” and is further improved by parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing.”
[0835] In some embodiments, a bioinformatics pipeline for analyzing nucleic acids in a liquid biopsy is provided that integrates at least an improvement in focal copy number identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Copy Number Variation” and/or “Copy Number Variation,” and an improvement in circulating tumor fraction determination, e.g., as described above in the section entitled “Systems and Methods for Improved Circulating Tumor Fraction Estimates” and/or “Circulating Tumor Fraction.”
[0836] Accordingly, in some embodiments, a method is provided for analyzing a liquid biopsy sample from a subject with cancer that includes (i) obtaining, from a first sequencing reaction of cell-free DNA fragments, a first plurality of sequence reads aligned to a reference sequence for the species of the subject, (ii) determining whether a candidate focal copy number variation for a respective genomic segment, identified from the first plurality of aligned sequence reads, can be validated as a somatic focal copy number variation by (a) determining bin-level sequence ratios, segment-level sequence ratios, and segment-level measures of dispersion from a comparison of (i) the sequence reads in the first plurality of sequence reads that map to respective genomic bins or genomic segments to (ii) sequence reads from one or more reference samples that map to the same respective genomic bins or genomic segments, e.g., as described above in the section titled “Systems and Methods for Improved Validation of Copy Number Variation” and/or “Copy Number Variation,” and (b) determining whether determined bin-level sequence ratios, segment-level sequence ratios, and segment-level measures of dispersion corresponding to the respective genomic segment satisfy a plurality of filters that include (1) a measure of central tendency bin-level sequence ratio filter that is fired when a measure of central tendency of the plurality of bin-level sequence ratios corresponding to the subset of bins encompassed by the respective segment fails to satisfy one or more bin-level sequence ratio thresholds, (2) a confidence filter that is fired when the segment-level measure of dispersion corresponding to the respective segment fails to satisfy a confidence threshold, and (3) a measure of central tendency-plus-deviation bin-level sequence ratio filter that is fired when a measure of central tendency of the plurality of bin-level sequence ratios corresponding to the subset of bins encompassed by the respective segment fails to satisfy one or more measure of central tendency-plus-deviation bin-level sequence ratio thresholds, e.g., as described above in the section titled “Systems and Methods for Improved Validation of Copy Number Variation” and/or “Copy Number Variation,” such that when the determined bin-level sequence ratios, segment-level sequence ratios, and segment-level measures of dispersion satisfy all of the filters in the plurality of filters, the focal copy number variation is validated, and when the determined bin-level sequence ratios, segment-level sequence ratios, and segment-level measures of dispersion do not satisfy all of the filters in the plurality of filters, the focal copy number variation is rejected, and (iii) estimating a circulating tumor fraction for the subject by (a) determining bin-level coverage ratios and segment-level coverage ratios from a comparison of (i) the number of sequence reads in the first plurality of sequence reads that map to respective genomic bins or genomic segments and (ii) the number of sequence reads from one or more reference samples that map to the same respective genomic bins or genomic segments, e.g., as described above in the section titled “Systems and Methods for Improved Circulating Tumor Fraction Estimates” and/or “Circulating Tumor Fraction,” (b) identifying integer copy states that best match segment-level coverage ratios by fitting segments to integer copy states for a plurality of simulated circulating tumor fractions, and (c) estimating the circulating tumor fraction for the subject based on a measure of fit between corresponding segment-level coverage ratios and integer copy states across the plurality of simulated circulating tumor fractions, e.g., as described above in the section titled “Systems and Methods for Improved Circulating Tumor Fraction Estimates” and/or “Circulating Tumor Fraction.”
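The three filters recited in step (ii)(b) above can be illustrated with a minimal sketch. It assumes the median as the measure of central tendency, handles amplifications only, and uses placeholder threshold values; the names FilterThresholds, fired_filters, and validate_focal_cnv are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Sequence


@dataclass
class FilterThresholds:
    # Every value below is an illustrative placeholder, not a disclosed value.
    min_median_ratio: float = 1.5      # filter 1: floor on the bin-level median ratio
    max_dispersion: float = 0.25       # filter 2: widest tolerated segment-level dispersion
    background_median: float = 1.0     # filter 3: central tendency of background bins
    background_deviation: float = 0.1  # filter 3: deviation of background bins
    n_deviations: float = 3.0          # filter 3: multiplier applied to the deviation


def fired_filters(bin_ratios: Sequence[float],
                  segment_dispersion: float,
                  t: FilterThresholds) -> List[str]:
    """Return the names of the filters that fire for one candidate segment."""
    center = median(bin_ratios)  # assumed measure of central tendency
    fired = []
    if center < t.min_median_ratio:
        fired.append("central_tendency")
    if segment_dispersion > t.max_dispersion:
        fired.append("confidence")
    if center < t.background_median + t.n_deviations * t.background_deviation:
        fired.append("central_tendency_plus_deviation")
    return fired


def validate_focal_cnv(bin_ratios: Sequence[float],
                       segment_dispersion: float,
                       t: FilterThresholds) -> bool:
    """A candidate is validated only when no filter fires."""
    return not fired_filters(bin_ratios, segment_dispersion, t)
```

A candidate deletion would need mirrored comparisons (ceilings rather than floors on the ratio), which this sketch omits.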
[0837] In some embodiments, the bioinformatics pipeline integrating at least an improvement in focal copy number identification and an improvement in circulating tumor fraction determination, also integrates an improvement in somatic variant identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Somatic Sequence Variants” and/or “Variant Identification.” In some embodiments, the bioinformatics pipeline integrating at least an improvement in focal copy number identification and an improvement in circulating tumor fraction determination, is further improved by parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing.” In some embodiments, the bioinformatics pipeline integrating at least an improvement in focal copy number identification and an improvement in circulating tumor fraction determination, is further improved by parallel analysis of nucleic acids from a non-cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing.”
[0838] In some embodiments, the bioinformatics pipeline integrating at least an improvement in focal copy number identification and an improvement in circulating tumor fraction determination, also integrates an improvement in somatic variant identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Somatic Sequence Variants” and/or “Variant Identification,” and is further improved by parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing.” In some embodiments, the bioinformatics pipeline integrating at least an improvement in focal copy number identification and an improvement in circulating tumor fraction determination, also integrates an improvement in somatic variant identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Somatic Sequence Variants” and/or “Variant Identification,” and is further improved by parallel analysis of nucleic acids from a non-cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing.” In some embodiments, the bioinformatics pipeline integrating at least an improvement in focal copy number identification and an improvement in circulating tumor fraction determination, is further improved by parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing,” and is further improved by parallel analysis of nucleic acids from a non-cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing.”
[0839] In some embodiments, the bioinformatics pipeline integrating at least an improvement in focal copy number identification and an improvement in circulating tumor fraction determination, also integrates an improvement in somatic variant identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Somatic Sequence Variants” and/or “Variant Identification,” is further improved by parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing,” and is further improved by parallel analysis of nucleic acids from a non-cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing.”
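Paragraphs [0837] to [0839] walk through every way of adding one, two, or all three optional components to the base pipeline of focal copy number identification plus circulating tumor fraction determination. The short sketch below enumerates the same seven configurations; the component labels are shorthand chosen for illustration, and pipeline_configurations is a hypothetical helper.

```python
from itertools import combinations
from typing import List, Tuple

# Components every configuration in paragraphs [0837]-[0839] shares.
BASE = ("focal copy number validation", "circulating tumor fraction estimation")

# Components that are added singly, in pairs, or all together.
OPTIONAL = (
    "somatic variant validation",
    "parallel solid tumor analysis",
    "parallel non-cancerous tissue analysis",
)


def pipeline_configurations() -> List[Tuple[str, ...]]:
    """List each base-plus-extras configuration, smallest additions first."""
    configs = []
    for k in range(1, len(OPTIONAL) + 1):
        for extras in combinations(OPTIONAL, k):
            configs.append(BASE + extras)
    return configs
```

Three single additions, three pairs, and one triple give seven configurations, matching the three sentences of [0837], the three sentences of [0838], and the single sentence of [0839].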
[0840] In some embodiments, a bioinformatics pipeline for analyzing nucleic acids in a liquid biopsy is provided that integrates at least an improvement in focal copy number identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Copy Number Variation” and/or “Copy Number Variation,” and parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing.”
[0841] Accordingly, in some embodiments, a method is provided for analyzing a liquid biopsy sample from a subject with cancer that includes (i) obtaining, from a first sequencing reaction of cell-free DNA fragments, a first plurality of sequence reads aligned to a reference sequence for the species of the subject, (ii) determining whether a candidate focal copy number variation for a respective genomic segment, identified from the first plurality of aligned sequence reads, can be validated as a somatic focal copy number variation by (a) determining bin-level sequence ratios, segment-level sequence ratios, and segment-level measures of dispersion from a comparison of (i) the sequence reads in the first plurality of sequence reads that map to respective genomic bins or genomic segments to (ii) sequence reads from one or more reference samples that map to the same respective genomic bins or genomic segments, e.g., as described above in the section titled “Systems and Methods for Improved Validation of Copy Number Variation” and/or “Copy Number Variation,” and (b) determining whether determined bin-level sequence ratios, segment-level sequence ratios, and segment-level measures of dispersion corresponding to the respective genomic segment satisfy a plurality of filters that include (1) a measure of central tendency bin-level sequence ratio filter that is fired when a measure of central tendency of the plurality of bin-level sequence ratios corresponding to the subset of bins encompassed by the respective segment fails to satisfy one or more bin-level sequence ratio thresholds, (2) a confidence filter that is fired when the segment-level measure of dispersion corresponding to the respective segment fails to satisfy a confidence threshold, and (3) a measure of central tendency-plus-deviation bin-level sequence ratio filter that is fired when a measure of central tendency of the plurality of bin-level sequence ratios corresponding to the subset of bins encompassed by the respective segment fails to satisfy one or more measure of central tendency-plus-deviation bin-level sequence ratio thresholds, e.g., as described above in the section titled “Systems and Methods for Improved Validation of Copy Number Variation” and/or “Copy Number Variation,” such that when the determined bin-level sequence ratios, segment-level sequence ratios, and segment-level measures of dispersion satisfy all of the filters in the plurality of filters, the focal copy number variation is validated, and when the determined bin-level sequence ratios, segment-level sequence ratios, and segment-level measures of dispersion do not satisfy all of the filters in the plurality of filters, the focal copy number variation is rejected, and (iii) obtaining, from a second sequencing reaction of nucleic acid fragments in a solid tumor biopsy sample from the subject, a second plurality of sequence reads aligned to a reference sequence for the species of the subject, and analyzing the nucleic acids from the solid tumor biopsy sample using a parallel analysis including, at least, determining whether a candidate focal copy number variation for a respective genomic segment, identified from the second plurality of aligned sequence reads, can be validated as a somatic focal copy number variation.
[0842] In some embodiments, the bioinformatics pipeline integrating at least an improvement in focal copy number identification and parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, also integrates an improvement in somatic variant identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Somatic Sequence Variants” and/or “Variant Identification.” In some embodiments, the bioinformatics pipeline integrating at least an improvement in focal copy number identification and parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, also integrates an improvement in circulating tumor fraction determination, e.g., as described above in the section entitled “Systems and Methods for Improved Circulating Tumor Fraction Estimates” and/or “Circulating Tumor Fraction.”
[0843] In some embodiments, the bioinformatics pipeline integrating at least an improvement in focal copy number identification and parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, also integrates an improvement in somatic variant identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Somatic Sequence Variants” and/or “Variant Identification,” and also integrates an improvement in circulating tumor fraction determination, e.g., as described above in the section entitled “Systems and Methods for Improved Circulating Tumor Fraction Estimates” and/or “Circulating Tumor Fraction.” In some embodiments, the bioinformatics pipeline integrating at least an improvement in focal copy number identification and parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, also integrates an improvement in circulating tumor fraction determination, e.g., as described above in the section entitled “Systems and Methods for Improved Circulating Tumor Fraction Estimates” and/or “Circulating Tumor Fraction,” and is further improved by parallel analysis of nucleic acids from a non-cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing.”
[0844] In some embodiments, the bioinformatics pipeline integrating at least an improvement in focal copy number identification and parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, also integrates an improvement in somatic variant identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Somatic Sequence Variants” and/or “Variant Identification,” also integrates an improvement in circulating tumor fraction determination, e.g., as described above in the section entitled “Systems and Methods for Improved Circulating Tumor Fraction Estimates” and/or “Circulating Tumor Fraction,” and is further improved by parallel analysis of nucleic acids from a non-cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing.”
[0845] In some embodiments, a bioinformatics pipeline for analyzing nucleic acids in a liquid biopsy is provided that integrates at least an improvement in focal copy number identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Copy Number Variation” and/or “Copy Number Variation,” and parallel analysis of nucleic acids from a non-cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing.”
[0846] Accordingly, in some embodiments, a method is provided for analyzing a liquid biopsy sample from a subject with cancer that includes (i) obtaining, from a first sequencing reaction of cell-free DNA fragments, a first plurality of sequence reads aligned to a reference sequence for the species of the subject, (ii) determining whether a candidate focal copy number variation for a respective genomic segment, identified from the first plurality of aligned sequence reads, can be validated as a somatic focal copy number variation by (a) determining bin-level sequence ratios, segment-level sequence ratios, and segment-level measures of dispersion from a comparison of (i) the sequence reads in the first plurality of sequence reads that map to respective genomic bins or genomic segments to (ii) sequence reads from one or more reference samples that map to the same respective genomic bins or genomic segments, e.g., as described above in the section titled “Systems and Methods for Improved Validation of Copy Number Variation” and/or “Copy Number Variation,” and (b) determining whether determined bin-level sequence ratios, segment-level sequence ratios, and segment-level measures of dispersion corresponding to the respective genomic segment satisfy a plurality of filters that include (1) a measure of central tendency bin-level sequence ratio filter that is fired when a measure of central tendency of the plurality of bin-level sequence ratios corresponding to the subset of bins encompassed by the respective segment fails to satisfy one or more bin-level sequence ratio thresholds, (2) a confidence filter that is fired when the segment-level measure of dispersion corresponding to the respective segment fails to satisfy a confidence threshold, and (3) a measure of central tendency-plus-deviation bin-level sequence ratio filter that is fired when a measure of central tendency of the plurality of bin-level sequence ratios corresponding to the subset of bins encompassed by the respective segment fails to satisfy one or more measure of central tendency-plus-deviation bin-level sequence ratio thresholds, e.g., as described above in the section titled “Systems and Methods for Improved Validation of Copy Number Variation” and/or “Copy Number Variation,” such that when the determined bin-level sequence ratios, segment-level sequence ratios, and segment-level measures of dispersion satisfy all of the filters in the plurality of filters, the focal copy number variation is validated, and when the determined bin-level sequence ratios, segment-level sequence ratios, and segment-level measures of dispersion do not satisfy all of the filters in the plurality of filters, the focal copy number variation is rejected, and (iii) obtaining, from a second sequencing reaction of nucleic acid fragments in a non-cancerous tissue sample from the subject, a second plurality of sequence reads aligned to a reference sequence for the species of the subject, and analyzing the nucleic acids from the non-cancerous tissue sample using a parallel analysis including, at least, determining whether a candidate focal copy number variation for a respective genomic segment, identified from the second plurality of aligned sequence reads, can be validated as a somatic focal copy number variation.
[0847] In some embodiments, the bioinformatics pipeline integrating at least an improvement in focal copy number identification and parallel analysis of nucleic acids from a non-cancerous tissue sample of the subject, also integrates an improvement in somatic variant identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Somatic Sequence Variants” and/or “Variant Identification.” In some embodiments, the bioinformatics pipeline integrating at least an improvement in focal copy number identification and parallel analysis of nucleic acids from a non-cancerous tissue sample of the subject, also integrates an improvement in circulating tumor fraction determination, e.g., as described above in the section entitled “Systems and Methods for Improved Circulating Tumor Fraction Estimates” and/or “Circulating Tumor Fraction.” In some embodiments, the bioinformatics pipeline integrating at least an improvement in focal copy number identification and parallel analysis of nucleic acids from a non-cancerous tissue sample of the subject, is further improved by parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing.”
[0848] In some embodiments, the bioinformatics pipeline integrating at least an improvement in focal copy number identification and parallel analysis of nucleic acids from a non-cancerous tissue sample of the subject, also integrates an improvement in somatic variant identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Somatic Sequence Variants” and/or “Variant Identification,” and also integrates an improvement in circulating tumor fraction determination, e.g., as described above in the section entitled “Systems and Methods for Improved Circulating Tumor Fraction Estimates” and/or “Circulating Tumor Fraction.” In some embodiments, the bioinformatics pipeline integrating at least an improvement in focal copy number identification and parallel analysis of nucleic acids from a non-cancerous tissue sample of the subject, also integrates an improvement in somatic variant identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Somatic Sequence Variants” and/or “Variant Identification,” and is further improved by parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing.” In some embodiments, the bioinformatics pipeline integrating at least an improvement in focal copy number identification and parallel analysis of nucleic acids from a non-cancerous tissue sample of the subject, also integrates an improvement in circulating tumor fraction determination, e.g., as described above in the section entitled “Systems and Methods for Improved Circulating Tumor Fraction Estimates” and/or “Circulating Tumor Fraction,” and is further improved by parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing.”
[0849] In some embodiments, the bioinformatics pipeline integrating at least an improvement in focal copy number identification and parallel analysis of nucleic acids from a non-cancerous tissue sample of the subject, also integrates an improvement in somatic variant identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Somatic Sequence Variants” and/or “Variant Identification,” also integrates an improvement in circulating tumor fraction determination, e.g., as described above in the section entitled “Systems and Methods for Improved Circulating Tumor Fraction Estimates” and/or “Circulating Tumor Fraction,” and is further improved by parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing.”
[0850] In some embodiments, a bioinformatics pipeline for analyzing nucleic acids in a liquid biopsy is provided that integrates at least an improvement in circulating tumor fraction determination, e.g., as described above in the section entitled “Systems and Methods for Improved Circulating Tumor Fraction Estimates” and/or “Circulating Tumor Fraction,” and parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing.”
[0851] Accordingly, in some embodiments, a method is provided for analyzing a liquid biopsy sample from a subject with cancer that includes (i) obtaining, from a first sequencing reaction of cell-free DNA fragments, a first plurality of sequence reads aligned to a reference sequence for the species of the subject, (ii) estimating a circulating tumor fraction for the subject by (a) determining bin-level coverage ratios and segment-level coverage ratios from a comparison of (i) the number of sequence reads in the first plurality of sequence reads that map to respective genomic bins or genomic segments and (ii) the number of sequence reads from one or more reference samples that map to the same respective genomic bins or genomic segments, e.g., as described above in the section titled “Systems and Methods for Improved Circulating Tumor Fraction Estimates” and/or “Circulating Tumor Fraction,” (b) identifying integer copy states that best match segment-level coverage ratios by fitting segments to integer copy states for a plurality of simulated circulating tumor fractions, such that the circulating tumor fraction for the subject is determined from an optimization of error between corresponding segment-level coverage ratios and integer copy states across the plurality of simulated circulating tumor fractions, e.g., as described above in the section titled “Systems and Methods for Improved Circulating Tumor Fraction Estimates” and/or “Circulating Tumor Fraction,” and (iii) obtaining, from a second sequencing reaction of nucleic acid fragments in a solid tumor biopsy sample from the subject, a second plurality of sequence reads aligned to a reference sequence for the species of the subject, and analyzing the nucleic acids from the solid tumor biopsy sample using a parallel analysis including at least estimating a tumor fraction for the subject from the second plurality of aligned sequence reads. 
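Steps (ii)(b) and the error optimization recited above can be sketched as a grid search over simulated tumor fractions. The sketch assumes a diploid normal background, a length-weighted squared error as the measure of fit, a grid from 0.01 to 1.00, and a cap of six copies; none of these choices is stated in the text, and the function names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

NORMAL_COPIES = 2  # assumption: non-tumor DNA is diploid


def expected_ratio(copies: int, tumor_fraction: float) -> float:
    """Coverage ratio expected when tumor DNA carries `copies` of a segment."""
    mixed = tumor_fraction * copies + (1.0 - tumor_fraction) * NORMAL_COPIES
    return mixed / NORMAL_COPIES


def fit_error(segments: Sequence[Tuple[float, int]],
              tumor_fraction: float,
              max_copies: int = 6) -> float:
    """Length-weighted squared error after assigning each segment its
    best-matching integer copy state at the simulated tumor fraction."""
    total = 0.0
    for ratio, length in segments:
        best = min((ratio - expected_ratio(c, tumor_fraction)) ** 2
                   for c in range(max_copies + 1))
        total += length * best
    return total


def estimate_tumor_fraction(segments: Sequence[Tuple[float, int]],
                            grid: Optional[List[float]] = None,
                            max_copies: int = 6) -> float:
    """Return the simulated tumor fraction with the smallest fit error."""
    if grid is None:
        grid = [i / 100 for i in range(1, 101)]
    return min(grid, key=lambda f: fit_error(segments, f, max_copies))
```

Each segment is a (coverage ratio, length in bases) pair. The fit can be ambiguous when the observed ratios are compatible with several tumor fractions, for example when every altered segment could equally be a smaller gain at a higher fraction, so a practical implementation would need some tie-breaking rule that this sketch does not attempt.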
[0852] In some embodiments, the bioinformatics pipeline integrating at least an improvement in circulating tumor fraction determination and parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, also integrates an improvement in somatic variant identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Somatic Sequence Variants” and/or “Variant Identification.” In some embodiments, the bioinformatics pipeline integrating at least an improvement in circulating tumor fraction determination and parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, also integrates an improvement in focal copy number identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Copy Number Variation” and/or “Copy Number Variation.” In some embodiments, the bioinformatics pipeline integrating at least an improvement in circulating tumor fraction determination and parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, is further improved by parallel analysis of nucleic acids from a non-cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing.”
[0853] In some embodiments, the bioinformatics pipeline integrating at least an improvement in circulating tumor fraction determination and parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, also integrates an improvement in somatic variant identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Somatic Sequence Variants” and/or “Variant Identification,” and also integrates an improvement in focal copy number identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Copy Number Variation” and/or “Copy Number Variation.” In some embodiments, the bioinformatics pipeline integrating at least an improvement in circulating tumor fraction determination and parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, also integrates an improvement in somatic variant identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Somatic Sequence Variants” and/or “Variant Identification,” and is further improved by parallel analysis of nucleic acids from a non-cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing.” In some embodiments, the bioinformatics pipeline integrating at least an improvement in circulating tumor fraction determination and parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, also integrates an improvement in focal copy number identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Copy Number Variation” and/or “Copy Number Variation,” and is further improved by parallel analysis of nucleic acids from a non-cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing.”
[0854] In some embodiments, the bioinformatics pipeline integrating at least an improvement in circulating tumor fraction determination and parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, also integrates an improvement in somatic variant identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Somatic Sequence Variants” and/or “Variant Identification,” also integrates an improvement in focal copy number identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Copy Number Variation” and/or “Copy Number Variation,” and is further improved by parallel analysis of nucleic acids from a non-cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing.”
[0855] In some embodiments, a bioinformatics pipeline for analyzing nucleic acids in a liquid biopsy is provided that integrates at least an improvement in circulating tumor fraction determination, e.g., as described above in the section entitled “Systems and Methods for Improved Circulating Tumor Fraction Estimates” and/or “Circulating Tumor Fraction,” and parallel analysis of nucleic acids from a non-cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing.”
[0856] Accordingly, in some embodiments, a method is provided for analyzing a liquid biopsy sample from a subject with cancer that includes (i) obtaining, from a first sequencing reaction of cell-free DNA fragments, a first plurality of sequence reads aligned to a reference sequence for the species of the subject, (ii) estimating a circulating tumor fraction for the subject by (a) determining bin-level coverage ratios and segment-level coverage ratios from a comparison of (i) the number of sequence reads in the first plurality of sequence reads that map to respective genomic bins or genomic segments and (ii) the number of sequence reads from one or more reference samples that map to the same respective genomic bins or genomic segments, e.g., as described above in the section titled “Systems and Methods for Improved Circulating Tumor Fraction Estimates” and/or “Circulating Tumor Fraction,” (b) identifying integer copy states that best match segment-level coverage ratios by fitting segments to integer copy states for a plurality of simulated circulating tumor fractions, and (c) estimating the circulating tumor fraction for the subject based on a measure of fit between corresponding segment-level coverage ratios and integer copy states across the plurality of simulated circulating tumor fractions, e.g., as described above in the section titled “Systems and Methods for Improved Circulating Tumor Fraction Estimates” and/or “Circulating Tumor Fraction,” and (iii) obtaining, from a second sequencing reaction of nucleic acid fragments in a non-cancerous tissue sample from the subject, a second plurality of sequence reads aligned to a reference sequence for the species of the subject, and analyzing the nucleic acids from the non-cancerous tissue sample using a parallel analysis including at least estimating a tumor fraction for the subject from the second plurality of aligned sequence reads.
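Step (ii)(a) above, comparing read counts per bin against reference samples, might look like the following minimal sketch. It assumes library-size normalization, a per-bin median across reference samples as the baseline, and the median of bin ratios as the segment-level ratio; these are illustrative choices rather than the disclosed procedure, and all function names are hypothetical.

```python
from statistics import median
from typing import List, Sequence


def normalize(counts: Sequence[int]) -> List[float]:
    """Scale per-bin read counts by the sample's total read count."""
    total = sum(counts)
    return [c / total for c in counts]


def bin_coverage_ratios(sample_counts: Sequence[int],
                        reference_counts: Sequence[Sequence[int]]) -> List[float]:
    """Ratio of the sample's normalized depth in each bin to the median
    normalized depth of the reference samples in the same bin.
    Assumes every bin has non-zero reference coverage."""
    sample = normalize(sample_counts)
    refs = [normalize(r) for r in reference_counts]
    ratios = []
    for i, depth in enumerate(sample):
        baseline = median(ref[i] for ref in refs)
        ratios.append(depth / baseline)
    return ratios


def segment_coverage_ratio(bin_ratios: Sequence[float],
                           start: int, end: int) -> float:
    """Summarize the bins in [start, end) by their median ratio."""
    return median(bin_ratios[start:end])
```

For instance, a sample with counts [100, 100, 200, 200] compared against uniformly covered references yields ratios of about 0.67 for the first two bins and about 1.33 for the last two.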
[0857] In some embodiments, the bioinformatics pipeline integrating at least an improvement in circulating tumor fraction determination and parallel analysis of nucleic acids from a non-cancerous tissue sample of the subject, also integrates an improvement in somatic variant identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Somatic Sequence Variants” and/or “Variant Identification.” In some embodiments, the bioinformatics pipeline integrating at least an improvement in circulating tumor fraction determination and parallel analysis of nucleic acids from a non-cancerous tissue sample of the subject, also integrates an improvement in focal copy number identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Copy Number Variation” and/or “Copy Number Variation.” In some embodiments, the bioinformatics pipeline integrating at least an improvement in circulating tumor fraction determination and parallel analysis of nucleic acids from a non-cancerous tissue sample of the subject, is further improved by parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing.”
[0858] In some embodiments, the bioinformatics pipeline integrating at least an improvement in circulating tumor fraction determination and parallel analysis of nucleic acids from a non-cancerous tissue sample of the subject, also integrates an improvement in somatic variant identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Somatic Sequence Variants” and/or “Variant Identification,” and also integrates an improvement in focal copy number identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Copy Number Variation” and/or “Copy Number Variation.” In some embodiments, the bioinformatics pipeline integrating at least an improvement in circulating tumor fraction determination and parallel analysis of nucleic acids from a non-cancerous tissue sample of the subject, also integrates an improvement in somatic variant identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Somatic Sequence Variants” and/or “Variant Identification,” and is further improved by parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing.” In some embodiments, the bioinformatics pipeline integrating at least an improvement in circulating tumor fraction determination and parallel analysis of nucleic acids from a non-cancerous tissue sample of the subject, also integrates an improvement in focal copy number identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Copy Number Variation” and/or “Copy Number Variation,” and is further improved by parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing.”
[0859] In some embodiments, the bioinformatics pipeline integrating at least an improvement in circulating tumor fraction determination and parallel analysis of nucleic acids from a non-cancerous tissue sample of the subject, also integrates an improvement in somatic variant identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Somatic Sequence Variants” and/or “Variant Identification,” also integrates an improvement in focal copy number identification, e.g., as described above in the section entitled “Systems and Methods for Improved Validation of Copy Number Variation” and/or “Copy Number Variation,” and is further improved by parallel analysis of nucleic acids from a solid cancerous tissue sample of the subject, e.g., as described above in the section entitled “Concurrent Testing.”
Variant Characterization
[0860] In some embodiments, a predicted functional effect and/or clinical interpretation for one or more identified variants is curated by using information from variant databases. In some embodiments, a weighted-heuristic model is used to characterize each variant.
[0861] In some embodiments, identified clinical variants are labeled as “potentially actionable”, “biologically relevant”, “variants of unknown significance (VUSs)”, or “benign”. Potentially actionable alterations are protein-altering variants with an associated therapy based on evidence from the medical literature. Biologically relevant alterations are protein-altering variants that may have functional significance or have been observed in the medical literature but are not associated with a specific therapy. Variants of unknown significance (VUSs) are protein-altering variants exhibiting an unclear effect on function and/or without sufficient evidence to determine their pathogenicity. In some embodiments, benign variants are not reported. In some embodiments, variants are identified through aligning the patient’s DNA sequence to the human genome reference sequence version hg19 (GRCh37). In some embodiments, actionable and biologically relevant somatic variants are provided in a clinical summary during report generation.
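The four labels described above can be illustrated with a minimal sketch. The class, field names, and rule ordering below are hypothetical and are not taken from the disclosure; the sketch assumes each input variant is protein-altering unless it is flagged benign, and that the evidence flags have already been curated from variant databases and the literature.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Variant:
    # All fields are illustrative assumptions, not part of the disclosure.
    name: str
    known_benign: bool = False
    has_associated_therapy: bool = False   # literature evidence links the variant to a therapy
    has_functional_evidence: bool = False  # functional significance or prior literature observation


def label_variant(v: Variant) -> str:
    """Assign one of the four labels, checking the most specific evidence first."""
    if v.known_benign:
        return "benign"
    if v.has_associated_therapy:
        return "potentially actionable"
    if v.has_functional_evidence:
        return "biologically relevant"
    return "variant of unknown significance (VUS)"


def reportable_variants(variants: List[Variant]) -> List[Tuple[str, str]]:
    """Return (name, label) pairs, omitting benign variants from the report."""
    labeled = [(v.name, label_variant(v)) for v in variants]
    return [(name, label) for name, label in labeled if label != "benign"]
```

In this sketch a variant with both a matched therapy and functional evidence is labeled “potentially actionable”, since the therapy check runs first; a real curation process may weigh evidence differently.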
[0862] For instance, in some embodiments, variant classification and reporting is performed, where detected variants are investigated following criteria from known evolutionary models, functional data, clinical data, literature, and other research endeavors, including tumor organoid experiments. In some embodiments, variants are prioritized and classified based on known gene-disease relationships, hotspot regions within genes, internal and external somatic databases, primary literature, and other features of somatic drivers. Variants can be added to a patient (or sample, for example, organoid sample) report based on recommendations from the AMP/ASCO/CAP guidelines. Additional guidelines may be followed. Briefly, pathogenic variants with therapeutic, diagnostic, or prognostic significance may be prioritized in the report. Non-actionable pathogenic variants may be included as biologically relevant, followed by variants of uncertain significance. Translocations may be reported based on features of known gene fusions, relevant breakpoints, and biological relevance. Evidence may be curated from public and private databases or research and presented as 1) consensus guidelines 2) clinical research, or 3) case studies, with a link to the supporting literature. Germline alterations may be reported as secondary findings in a subset of genes for consenting patients. These may include genes recommended by the American College of Medical Genetics and Genomics (ACMG) and additional genes associated with cancer predisposition or drug resistance.
[0863] In some embodiments, a clinical report 139-3 includes information about clinical trials for which the patient is eligible, therapies that are specific to the patient’s cancer, and/or possible therapeutic adverse effects associated with the specific characteristics of the patient’s cancer, e.g., the patient’s genetic variations, epigenetic abnormalities, associated oncogenic pathogenic infections, and/or pathology abnormalities, or other characteristics of the patient’s sample and/or clinical records. For example, in some embodiments, the clinical report includes such patient information and analysis metrics, including cancer type and/or diagnosis, variant allele fraction, patient demographic and/or institution, matched therapies (e.g., FDA approved and/or investigational), matched clinical trials, variants of unknown significance (VUS), genes with low coverage, panel information, specimen information, details on reported variants, patient clinical history, status and/or availability of previous test results, and/or version of bioinformatics pipeline.
[0864] In some embodiments, the results included in the report, and/or any additional results (for example, from the bioinformatics pipeline), are used to query a database of clinical data, for example, to determine whether there is a trend showing that a particular therapy was effective or ineffective in treating (e.g., slowing or halting cancer progression), and/or adverse effects of such treatments in other patients having the same or similar characteristics.
[0865] In some embodiments, the results are used to design cell-based studies of the patient’s biology, e.g., tumor organoid experiments. For example, an organoid may be genetically engineered to have the same characteristics as the specimen and may be observed after exposure to a therapy to determine whether the therapy can reduce the growth rate of the organoid, and thus may be likely to reduce the growth rate of cancer in the patient associated with the specimen. Similarly, in some embodiments, the results are used to direct studies on tumor organoids derived directly from the patient. An example of such experimentation is described in U.S. Provisional Patent Application No. 62/944,292, filed December 5, 2019, the content of which is hereby incorporated by reference, in its entirety, for all purposes.
[0866] As illustrated in Figure 2A, in some embodiments, a clinical report is checked for final validation, review, and sign-off by a medical practitioner (e.g., a pathologist). The clinical report is then sent for action (e.g., for precision oncology applications).
Digital and Laboratory Health Care Platform:
[0867] In some embodiments, the methods and systems described herein are utilized in combination with, or as part of, a digital and laboratory health care platform that is generally targeted to medical care and research. It should be understood that many uses of the methods and systems described above, in combination with such a platform, are possible. One example of such a platform is described in U.S. Patent Application No. 16/657,804, filed October 18, 2019, which is hereby incorporated herein by reference in its entirety for all purposes.
[0868] For example, an implementation of one or more embodiments of the methods and systems as described above may include microservices constituting a digital and laboratory health care platform supporting analysis of liquid biopsy samples to provide clinical support for personalized cancer therapy. Embodiments may include a single microservice for executing and delivering analysis of liquid biopsy samples to clinical support for personalized cancer therapy or may include a plurality of microservices each having a particular role, which together implement one or more of the embodiments above. In one example, a first microservice may execute sequence analysis in order to deliver genomic features to a second microservice for curating clinical support for personalized cancer therapy based on the identified features. Similarly, the second microservice may execute therapeutic analysis of the curated clinical support to deliver recommended therapeutic modalities, according to various embodiments described herein.
[0869] Where embodiments above are executed in one or more micro-services with or as part of a digital and laboratory health care platform, one or more of such micro-services may be part of an order management system that orchestrates the sequence of events as needed at the appropriate time and in the appropriate order necessary to instantiate embodiments above. A microservices-based order management system is disclosed, for example, in U.S. Prov. Patent Application No. 62/873,693, filed July 12, 2019, which is hereby incorporated herein by reference in its entirety for all purposes.
[0870] For example, continuing with the above first and second microservices, an order management system may notify the first microservice that an order for curating clinical support for personalized cancer therapy has been received and is ready for processing. The first microservice may execute and notify the order management system once the delivery of genomic features for the patient is ready for the second microservice. Furthermore, the order management system may identify that execution parameters (prerequisites) for the second microservice are satisfied, including that the first microservice has completed, and notify the second microservice that it may continue processing the order to curate clinical support for personalized cancer therapy, according to various embodiments described herein.
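The notification flow between the order management system and the first and second microservices can be sketched as follows. This is an in-process illustration only; the class `OrderManager`, its method names, and the service names are hypothetical, and a deployed platform would more likely exchange these notifications over a message queue or HTTP.

```python
from typing import Callable, Dict, Set, Tuple


class OrderManager:
    """Notifies each registered service once its prerequisites are complete."""

    def __init__(self) -> None:
        self._services: Dict[str, Tuple[Set[str], Callable[[str], None]]] = {}
        self._completed: Dict[str, Set[str]] = {}
        self._notified: Dict[str, Set[str]] = {}

    def register(self, name: str, prerequisites: Set[str],
                 handler: Callable[[str], None]) -> None:
        self._services[name] = (set(prerequisites), handler)

    def submit(self, order_id: str) -> None:
        # A new order starts with nothing completed and nothing notified.
        self._completed[order_id] = set()
        self._notified[order_id] = set()
        self._dispatch(order_id)

    def mark_complete(self, order_id: str, name: str) -> None:
        # Called by a service when its work for the order is finished.
        self._completed[order_id].add(name)
        self._dispatch(order_id)

    def _dispatch(self, order_id: str) -> None:
        for name, (prerequisites, handler) in self._services.items():
            if name in self._notified[order_id]:
                continue  # each service is notified at most once per order
            if prerequisites <= self._completed[order_id]:
                self._notified[order_id].add(name)
                handler(order_id)


def run_demo() -> list:
    events = []
    manager = OrderManager()

    def sequence_analysis(order_id: str) -> None:
        events.append("sequence_analysis")
        manager.mark_complete(order_id, "sequence_analysis")

    def clinical_curation(order_id: str) -> None:
        events.append("clinical_curation")
        manager.mark_complete(order_id, "clinical_curation")

    # Registration order does not matter; prerequisites decide execution order.
    manager.register("clinical_curation", {"sequence_analysis"}, clinical_curation)
    manager.register("sequence_analysis", set(), sequence_analysis)
    manager.submit("order-1")
    return events
```

Here the second service is registered first yet still runs second, because the manager only notifies it after the first service reports completion.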
[0871] In one example, the bioinformatics pipeline (for example, the liquid biopsy bioinformatics pipeline) is encoded within a docker container that receives a direct link to access a FASTA file (for example, stored in a cloud computing environment, AWS s3 bucket, GCP storage unit, etc.), from which it generates BAM files, which may be orchestrated in part or wholly, for example, by the systems and methods disclosed in US Patent App. No. 16/927,976, filed July 13, 2020 and incorporated in its entirety herein for any and all purposes.
[0872] Where the digital and laboratory health care platform further includes a genetic analyzer system, the genetic analyzer system may include targeted panels and/or sequencing probes. An example of a targeted panel is disclosed, for example, in U.S. Prov. Patent Application No. 62/902,950, filed September 19, 2019, which is incorporated herein by reference and in its entirety for all purposes. In one example, targeted panels may enable the delivery of next generation sequencing results for providing clinical support for personalized cancer therapy according to various embodiments described herein. An example of the design of next-generation sequencing probes is disclosed, for example, in U.S. Prov. Patent Application No. 62/924,073, filed October 21, 2019, which is incorporated herein by reference and in its entirety for all purposes.
[0873] Where the digital and laboratory health care platform further includes a bioinformatics pipeline, the methods and systems described above may be utilized after completion or substantial completion of the systems and methods utilized in the bioinformatics pipeline. As one example, the bioinformatics pipeline may receive next-generation genetic sequencing results and return a set of binary files, such as one or more BAM files, reflecting nucleic acid (e.g., cfDNA, DNA and/or RNA) read counts aligned to a reference genome. The methods and systems described above may be utilized, for example, to ingest the cfDNA, DNA and/or RNA read counts and produce genomic features as a result.
[0874] When the digital and laboratory health care platform further includes an RNA data normalizer, any RNA read counts may be normalized before processing embodiments as described above. An example of an RNA data normalizer is disclosed, for example, in U.S. Patent Application No. 16/581,706, filed September 24, 2019, which is incorporated herein by reference and in its entirety for all purposes.
[0875] When the digital and laboratory health care platform further includes a genetic data deconvoluter, any system and method for deconvoluting may be utilized for analyzing genetic data associated with a specimen having two or more biological components to determine the contribution of each component to the genetic data and/or determine what genetic data would be associated with any component of the specimen if it were purified. An example of a genetic data deconvoluter is disclosed, for example, in U.S. Patent Application No. 16/732,229 and PCT/US 19/69161, filed December 31, 2019, U.S. Prov. Patent Application No. 62/924,054, filed October 21, 2019, and U.S. Prov. Patent Application No. 62/944,995, filed December 6, 2019, each of which is hereby incorporated herein by reference and in its entirety for all purposes.
[0876] When the digital and laboratory health care platform further includes an automated RNA expression caller, RNA expression levels may be adjusted to be expressed as a value relative to a reference expression level, which is often done in order to prepare multiple RNA expression data sets for analysis to avoid artifacts caused when the data sets have differences because they have not been generated by using the same methods, equipment, and/or reagents. An example of an automated RNA expression caller is disclosed, for example, in U.S. Prov. Patent Application No. 62/943,712, filed December 4, 2019, which is incorporated herein by reference and in its entirety for all purposes.
[0877] The digital and laboratory health care platform may further include one or more insight engines to deliver information, characteristics, or determinations related to a disease state that may be based on genetic and/or clinical data associated with a patient and/or specimen. Exemplary insight engines may include a tumor of unknown origin engine, a human leukocyte antigen (HLA) loss of heterozygosity (LOH) engine, a tumor mutational burden engine, a PD-L1 status engine, a homologous recombination deficiency engine, a cellular pathway activation report engine, an immune infiltration engine, a microsatellite instability engine, a pathogen infection status engine, and so forth. An example tumor of unknown origin engine is disclosed, for example, in U.S. Prov. Patent Application No. 62/855,750, filed May 31, 2019, which is incorporated herein by reference and in its entirety for all purposes. An example of an HLA LOH engine is disclosed, for example, in U.S. Prov. Patent Application No. 62/889,510, filed August 20, 2019, which is incorporated herein by reference and in its entirety for all purposes. An example of a tumor mutational burden (TMB) engine is disclosed, for example, in U.S. Prov. Patent Application No. 62/804,458, filed February 12, 2019, which is incorporated herein by reference and in its entirety for all purposes. An example of a PD-L1 status engine is disclosed, for example, in U.S. Prov. Patent Application No. 62/854,400, filed May 30, 2019, which is incorporated herein by reference and in its entirety for all purposes. An additional example of a PD-L1 status engine is disclosed, for example, in U.S. Prov. Patent Application No. 62/824,039, filed March 26, 2019, which is incorporated herein by reference and in its entirety for all purposes. An example of a homologous recombination deficiency engine is disclosed, for example, in U.S. Prov. Patent Application No. 62/804,730, filed February 12, 2019, which is incorporated herein by reference and in its entirety for all purposes. An example of a cellular pathway activation report engine is disclosed, for example, in U.S. Prov. Patent Application No. 62/888,163, filed August 16, 2019, which is incorporated herein by reference and in its entirety for all purposes. An example of an immune infiltration engine is disclosed, for example, in U.S. Patent Application No. 16/533,676, filed August 6, 2019, which is incorporated herein by reference and in its entirety for all purposes. An additional example of an immune infiltration engine is disclosed, for example, in U.S. Patent Application No. 62/804,509, filed February 12, 2019, which is incorporated herein by reference and in its entirety for all purposes. An example of an MSI engine is disclosed, for example, in U.S. Patent Application No. 16/653,868, filed October 15, 2019, which is incorporated herein by reference and in its entirety for all purposes. An additional example of an MSI engine is disclosed, for example, in U.S. Prov. Patent Application No. 62/931,600, filed November 6, 2019, which is incorporated herein by reference and in its entirety for all purposes.
[0878] When the digital and laboratory health care platform further includes a report generation engine, the methods and systems described above may be utilized to create a summary report of a patient’s genetic profile and the results of one or more insight engines for presentation to a physician. For instance, the report may provide to the physician information about the extent to which the specimen that was sequenced contained tumor or normal tissue from a first organ, a second organ, a third organ, and so forth. For example, the report may provide a genetic profile for each of the tissue types, tumors, or organs in the specimen. The genetic profile may represent genetic sequences present in the tissue type, tumor, or organ and may include variants, expression levels, information about gene products, or other information that could be derived from genetic analysis of a tissue, tumor, or organ. The report may include therapies and/or clinical trials matched based on a portion or all of the genetic profile or insight engine findings and summaries. For example, the therapies may be matched according to the systems and methods disclosed in U.S. Prov. Patent Application No. 63/130,504, filed December 24, 2020, which is incorporated herein by reference and in its entirety for all purposes. For example, the clinical trials may be matched according to the systems and methods disclosed in U.S. Patent Application No. 16/889,779, filed June 1, 2020, which is incorporated herein by reference and in its entirety for all purposes.
[0879] The report may include a comparison of the results to a database of results from many specimens. In some embodiments, a patient’s clinical data and/or molecular data, including molecular data generated through the use of the systems and methods disclosed herein, (for example, a variant call which may be generated by performing a liquid biopsy, including determination of circulating tumor fraction, dynamic variant thresholding and/or CNV) may be compared to information in a knowledge database that includes clinical and/or molecular data patterns and prescribed therapies, therapy response data, survival data, and/or prognosis data associated with one or more of those patterns. Some of the associations may be based on recommendations from regulatory bodies (for example, FDA, NCCN, etc.), scientific publications, analysis of large databases of molecular and/or clinical data, etc. For example, this comparison may be done for the purpose of determining a likely prognosis for the patient, and/or matching therapies and/or clinical trials to which the patient may be likely to respond, any of which may be included in the report. An example of methods and systems for comparing results to a database of results are disclosed in U.S. Patent Application No. 16/732,168, filed December 31, 2019, which is incorporated herein by reference and in its entirety for all purposes. The information may be used, sometimes in conjunction with similar information from additional specimens and/or clinical response information, to discover biomarkers or design a clinical trial.
[0880] In some embodiments, if the clinical history includes information indicating that the patient had previously been prescribed one or more therapy and did not respond (for example, their disease progressed during and/or after receiving the therapy), the report may include a note that the patient failed the line(s) of therapy. In this case, the report may include and/or emphasize another therapy or therapies that is/are not included in the patient’s clinical data. In one example, the report may indicate that these other therapies may be used as a second, third, or later line of therapy.
[0881] In some embodiments, the systems and methods disclosed herein include the administration of one or more therapies to the patient, which may include a therapy listed on the report.
[0882] When the digital and laboratory health care platform further includes application of one or more of the embodiments herein to organoids developed in connection with the platform, the methods and systems may be used to further evaluate genetic sequencing data derived from an organoid to provide information about the extent to which the organoid that was sequenced contained a first cell type, a second cell type, a third cell type, and so forth. For example, the report may provide a genetic profile for each of the cell types in the specimen. The genetic profile may represent genetic sequences present in a given cell type and may include variants, expression levels, information about gene products, or other information that could be derived from genetic analysis of a cell. The report may include therapies matched based on a portion or all of the deconvoluted information. These therapies may be tested on the organoid, derivatives of that organoid, and/or similar organoids to determine an organoid’s sensitivity to those therapies. For example, organoids may be cultured and tested according to the systems and methods disclosed in U.S. Patent Application No. 16/693,117, filed November 22, 2019; U.S. Prov. Patent Application No. 62/924,621, filed October 22, 2019; and U.S. Prov. Patent Application No. 62/944,292, filed December 5, 2019, each of which is incorporated herein by reference and in its entirety for all purposes.
[0883] When the digital and laboratory health care platform further includes application of one or more of the above in combination with or as part of a medical device or a laboratory developed test that is generally targeted to medical care and research, such laboratory developed test or medical device results may be enhanced and personalized through the use of artificial intelligence. An example of laboratory developed tests, especially those that may be enhanced by artificial intelligence, is disclosed, for example, in U.S. Provisional Patent Application No. 62/924,515, filed October 22, 2019, which is incorporated herein by reference and in its entirety for all purposes.
[0884] It should be understood that the examples given above are illustrative and do not limit the uses of the systems and methods described herein in combination with a digital and laboratory health care platform.
[0885] The results of the bioinformatics pipeline may be provided for report generation 208. Report generation may comprise variant science analysis, including the interpretation of variants (including somatic and germline variants as applicable) for pathogenic and biological significance. The variant science analysis may also estimate microsatellite instability (MSI) or tumor mutational burden. Targeted treatments may be identified based on gene, variant, and cancer type, for further consideration and review by the ordering physician. In some aspects, clinical trials may be identified for which the patient may be eligible, based on mutations, cancer type, and/or clinical history. Subsequent validation may occur, after which the report may be finalized for sign-out and delivery. In some embodiments, a first or second report may include additional data provided through a clinical dataflow 202, such as patient progress notes, pathology reports, imaging reports, and other relevant documents. Such clinical data is ingested, reviewed, and abstracted based on a predefined set of curation rules. The clinical data is then populated into the patient’s clinical history timeline for report generation.
[0886] Further details on clinical report generation are disclosed in US Patent Application No. 16/789,363 (PCT/US20/180002), filed February 12, 2020, which is hereby incorporated herein by reference in its entirety.
[0887] Any of the embodiments herein may be combined with an imaging validation process to identify the accuracy of the assay results, prediction results from a slide image, or to compare the results between the assay and the slide image. In one embodiment, a slide having a solid or liquid specimen thereon may be converted into a digital image. The slide may be stained beforehand or unstained. The digital image may be processed by one or more artificial intelligence engines trained to identify one or more biomarkers, molecular features including, for example, DNA and RNA or methylation, or imaging features. Examples of specimen types, artificial intelligence engines, training methods, biomarkers, molecular features, and imaging features are disclosed in U.S. Patent Applications 16/830,186 and 17/139,765, respectively filed March 25, 2020 and December 31, 2020 which are both incorporated by reference for all purposes.
[0888] Once a prediction is obtained from the one or more artificial intelligence engines, the prediction may be compared against the sequencing results to either validate the accuracy of the sequencing result or validate the prediction results. In one embodiment, the specimen may not be processed and/or sent to sequencing unless first identified as likely to occur (or above a likelihood threshold) by the artificial intelligence engine prediction.
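The two uses of an image-based prediction described above, gating a specimen before sequencing and cross-checking against sequencing results, can be sketched as follows. The function names, the default threshold of 0.5, and the dictionary layout are assumptions made for illustration and are not specified in the disclosure.

```python
from typing import Dict


def should_sequence(predicted_likelihood: float, threshold: float = 0.5) -> bool:
    """Gate: send the specimen to sequencing only if the prediction exceeds the threshold."""
    return predicted_likelihood > threshold


def concordance(predicted: Dict[str, bool], sequenced: Dict[str, bool]) -> Dict[str, bool]:
    """For each biomarker present in both sources, report whether the calls agree."""
    shared = predicted.keys() & sequenced.keys()
    return {marker: predicted[marker] == sequenced[marker] for marker in sorted(shared)}


def agreement_rate(predicted: Dict[str, bool], sequenced: Dict[str, bool]) -> float:
    """Fraction of shared biomarkers with matching calls (0.0 if none are shared)."""
    result = concordance(predicted, sequenced)
    return sum(result.values()) / len(result) if result else 0.0
```

A disagreement in this sketch does not say which source is wrong; as the paragraph notes, the comparison may be read as validating either the sequencing result or the prediction.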
[0889] Any of the results from embodiments herein may be combined with a cohort analysis engine or cohort analytics engine to identify relationships between one or more specimens contained within a cohort. A cohort may represent other specimens of similar characteristics to the current specimen or other specimens of different characteristics to the current specimen. Analysis may include survival curves to identify therapies which may improve the treatment of the patient from which the specimen was obtained. Analysis may also include identification of the origin of the specimen, for example, when the specimen is a metastasis of a tumor having no known origin at the time of biopsy. Examples of cohort analysis engines, cohort analytics engines, cohort identification, cohort selection, characteristics, and analysis algorithms including survival curves and origin identification are disclosed in U.S. Patent Applications 16/732,168 and 15/930,234, respectively filed December 31, 2019 and May 12, 2020 which are both incorporated by reference for all purposes.
[0890] Molecular data, clinical data, genomic data, and other characteristics associated with either the specimen or the patient from which the specimen is obtained are disclosed herein and may be identified, for example, from an electronic medical health record or a system comprising electronic records associated with the patient.
[0891] Additional embodiments directed to retrieving patient data from a patient data store
[0892] In some embodiments, an artificial intelligence system retrieves features associated with a patient from a patient data store. In some embodiments, a patient data store includes one or more feature modules comprising a collection of features available for every patient in the system. In some embodiments, these features are used to generate predictions of the origin of a patient’s tumor. While feature scope across all patients is informationally dense, an individual patient’s feature set, in some embodiments, is sparsely populated across the entirety of the collective feature scope of all features across all patients. For example, the feature scope across all patients may expand into the tens of thousands of features while a patient’s unique feature set may only include a subset of hundreds or thousands of the collective feature scope based upon the records available for that patient.
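The relationship between the collective feature scope and a sparsely populated per-patient feature set can be sketched with plain dictionaries. The function names and the use of `None` for missing values are illustrative assumptions; the disclosure does not specify a storage layout.

```python
from typing import Any, Dict, List


def collective_scope(patients: Dict[str, Dict[str, Any]]) -> List[str]:
    """Union of every feature name observed for any patient, in a stable order."""
    names = set()
    for features in patients.values():
        names.update(features)
    return sorted(names)


def to_dense(features: Dict[str, Any], scope: List[str], missing: Any = None) -> List[Any]:
    """Align one patient's sparse features to the shared scope, filling gaps with `missing`."""
    return [features.get(name, missing) for name in scope]


def populated_fraction(features: Dict[str, Any], scope: List[str]) -> float:
    """Share of the collective scope that is populated for one patient."""
    if not scope:
        return 0.0
    return sum(1 for name in scope if name in features) / len(scope)
```

Storing only the populated features per patient and expanding to the full scope on demand keeps the store compact when the scope spans tens of thousands of features.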
[0893] In some embodiments, feature collections may include a diverse set of fields available within patient health records. Clinical information, such as information of health records, in some embodiments, are based upon fields which have been entered into an electronic medical record (EMR) or an electronic health record (EHR) by a physician, nurse, or other medical professional or representative. Other clinical information, in some embodiments, is curated from other sources, such as molecular fields from genetic sequencing reports. In some embodiments, sequencing may include next-generation sequencing (NGS) and comprises long-read, short-read, paired-end, or other forms of sequencing a patient’s somatic and/or normal genome. In some embodiments, a comprehensive collection of features in additional feature modules combines a variety of features together across varying fields of medicine which may include diagnoses, responses to treatment regimens, genetic profiles, clinical and phenotypic characteristics, and/or other medical, geographic, demographic, clinical, molecular, or genetic features. For example, a subset of features may comprise molecular data features, such as features derived from an RNA feature module or a DNA feature module, including sequencing results of a patient’s germline or somatic specimen(s).
[0894] In some embodiments, another subset of features, imaging features from an imaging feature module, comprises features identified through review of a specimen, for example, through pathologist review, such as a review of stained H&E or IHC slides. As another example, a subset of features may comprise derivative features obtained from the analysis of the individual and combined results of such feature sets. Features derived from DNA and RNA sequencing may include genetic variants from a variant science module which are present in the sequenced tissue. Further analysis of the genetic variants may include additional steps such as identifying single or multiple nucleotide polymorphisms, identifying whether a variation is an insertion or deletion event, identifying loss or gain of function, identifying fusions, identifying splicing, calculating copy number variation (CNV), calculating microsatellite instability, calculating tumor mutational burden (TMB), or other structural variations within the DNA and RNA. Analysis of slides for H&E staining or IHC staining may reveal features such as tumor infiltration, programmed death-ligand 1 (PD-L1) status, human leukocyte antigen (HLA) status, or other immunological features.
[0895] In some embodiments, features derived from structured, curated, or electronic medical or health records may include clinical features such as diagnosis, symptoms, therapies, outcomes, patient demographics such as patient name, date of birth, gender, ethnicity, date of death, address, smoking status, diagnosis dates for cancer, illness, disease, diabetes, depression, other physical or mental maladies, personal medical history, family medical history, clinical diagnoses such as date of initial diagnosis, date of metastatic diagnosis, cancer staging, tumor characterization, tissue of origin, treatments and outcomes such as line of therapy, therapy groups, clinical trials, medications prescribed or taken, surgeries, radiotherapy, imaging, adverse effects, associated outcomes, genetic testing and laboratory information such as performance scores, lab tests, pathology results, prognostic indicators, date of genetic testing, testing provider used, testing method used, such as genetic sequencing method or gene panel, gene results, such as included genes, variants, expression levels/statuses, or corresponding dates to any of the above. Clinical features may also include imaging features.
[0896] In some embodiments, an Omics feature module comprises features derived from information from additional medical- or research-based Omics fields including proteomics, transcriptomics, epigenomics, metabolomics, microbiomics, and other multi-omic fields. In some embodiments, features derived from an organoid modeling lab include the DNA and RNA sequencing information germane to each organoid and results from treatments applied to those organoids. In some embodiments, features derived from imaging data further include reports associated with a stained slide, size of tumor, tumor size differentials over time including treatments during the period of change, as well as machine learning approaches for classifying PD-L1 status, HLA status, or other characteristics from imaging data. In some embodiments, other features include the additional derivative feature sets from other machine learning approaches based at least in part on combinations of any new features and/or those listed above. For example, imaging results may need to be combined with MSI calculations derived from RNA expressions to determine additional further imaging features. In some embodiments, a machine learning model may generate a likelihood that a patient’s cancer will metastasize to a particular organ or a patient’s future probability of metastasis to yet another organ in the body. In some embodiments, other features that are extracted from medical information are also used. There are many thousands of features, and the above listing of types of features is merely representative and should not be construed as a complete or limiting listing of features.
[0897] In some embodiments, an alterations module comprises one or more microservices, servers, scripts, or other executable algorithms which generate alteration features associated with de-identified patient features from the feature collection. In some embodiments, alterations modules retrieve inputs from the feature collection and may provide alterations for storage. Exemplary alterations modules may include one or more of the following alterations as a collection of alteration modules.
[0898] In some embodiments, an IHC (Immunohistochemistry) module identifies antigens (proteins) in cells of a tissue section by exploiting the principle of antibodies binding specifically to antigens in biological tissues. IHC staining is widely used in the diagnosis of abnormal cells such as those found in cancerous tumors. Specific molecular markers are characteristic of particular cellular events such as proliferation or cell death (apoptosis). IHC is also widely used in basic research to understand the distribution and localization of biomarkers and differentially expressed proteins in different parts of a biological tissue. Visualizing an antibody-antigen interaction can be accomplished in a number of ways. In the most common instance, an antibody is conjugated to an enzyme, such as peroxidase, that can catalyze a color-producing reaction in immunoperoxidase staining. Alternatively, the antibody can also be tagged to a fluorophore, such as fluorescein or rhodamine in immunofluorescence. In some embodiments, approximations from RNA expression data, H&E slide imaging data, or other data are generated.
[0899] In some embodiments, a Therapies module identifies differences in cancer cells (or other cells near them) that help them grow and thrive and drugs that “target” these differences. Treatment with these drugs is called targeted therapy. For example, many targeted drugs are lethal to the cancer cells with inner “programming” that makes them different from normal, healthy cells, while not affecting most healthy cells. Targeted drugs may block or turn off chemical signals that tell the cancer cell to grow and divide rapidly; change proteins within the cancer cells so the cancer cells die; stop making new blood vessels to feed the cancer cells; trigger a patient’s immune system to kill the cancer cells; or carry toxins to the cancer cells to kill them, without affecting normal cells. Some targeted drugs are more “targeted” than others. Some might target only a single change in cancer cells, while others can affect several different changes. Others boost the way a patient’s body fights the cancer cells. This can affect where these drugs work and what side effects they cause. In some embodiments, matching targeted therapies may include identifying the therapy targets in the patients and satisfying any other inclusion or exclusion criteria that might identify a patient for whom a therapy is likely to be effective.
[0900] In some embodiments, a Trial module identifies and tests hypotheses for treating cancers having specific characteristics by matching features of a patient to clinical trials. These trials have inclusion and exclusion criteria that must be matched to enroll a patient and which may be ingested and structured from publications, trial reports, or other documentation.
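By way of non-limiting illustration only, the matching of patient features to trial inclusion and exclusion criteria described above may be sketched as follows. The `Trial` record, its field names, and the representation of each criterion as a set of permitted or disqualifying values are assumptions made for this example and are not part of the disclosure.

```python
from dataclasses import dataclass, field


@dataclass
class Trial:
    """Hypothetical structured trial record; field names are illustrative."""
    trial_id: str
    # feature name -> set of values that satisfy the inclusion criterion
    inclusion: dict = field(default_factory=dict)
    # feature name -> set of values that trigger the exclusion criterion
    exclusion: dict = field(default_factory=dict)


def is_eligible(patient: dict, trial: Trial) -> bool:
    """Return True when all inclusion criteria are met and no exclusion is hit."""
    # Every inclusion criterion must be satisfied; a missing feature fails.
    for feature, allowed in trial.inclusion.items():
        if patient.get(feature) not in allowed:
            return False
    # No exclusion criterion may be triggered.
    for feature, disallowed in trial.exclusion.items():
        if patient.get(feature) in disallowed:
            return False
    return True


def match_trials(patient: dict, trials: list) -> list:
    """Return the identifiers of the trials for which the patient qualifies."""
    return [t.trial_id for t in trials if is_eligible(patient, t)]
```

In this sketch a patient lacking a feature named by an inclusion criterion is treated as not matching; an embodiment could instead flag such trials for manual review.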
[0901] In some embodiments, an Amplifications module identifies genes which increase in count (for example, the number of gene products present in a specimen) disproportionately to other genes. Amplifications may cause a gene having the increased count to go dormant, become overactive, or operate in another unexpected fashion. In some embodiments, amplifications may be detected at a gene level, variant level, RNA transcript or expression level, or even a protein level. In some embodiments, detections are performed across all the different detection mechanisms or levels and validated against one another.
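By way of non-limiting illustration only, one simple way to flag a gene whose count is disproportionate to other genes is to compare each count against the median count across genes. The function name, the use of the median as a baseline, and the default fold threshold of 4.0 are assumptions made for this example; the disclosure does not specify a particular statistic or cutoff.

```python
from statistics import median


def flag_amplified(counts: dict, fold_threshold: float = 4.0) -> list:
    """Return genes whose count is at least fold_threshold times the median count.

    counts maps a gene name to a non-negative count (hypothetical input shape).
    """
    if not counts:
        return []
    baseline = median(counts.values())
    if baseline <= 0:
        # No meaningful baseline to compare against.
        return []
    return sorted(g for g, c in counts.items() if c / baseline >= fold_threshold)
```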
[0902] In some embodiments, an Isoforms module identifies alternative splicing (AS), the biological process in which more than one mRNA type (isoform) is generated from the transcript of the same gene through different combinations of exons and introns. It is estimated by large-scale genomics studies that 30-60% of mammalian genes are alternatively spliced. The possible patterns of alternative splicing for a gene can be very complicated and the complexity increases rapidly as the number of introns in a gene increases. In silico alternative splicing prediction may find large insertions or deletions within a set of mRNA sharing a large portion of aligned sequences by identifying genomic loci through searches of mRNA sequences against genomic sequences, extracting sequences for genomic loci and extending the sequences at both ends up to 20 kb, searching the genomic sequences (repeat sequences have been masked), extracting splicing pairs (two boundaries of alignment gap with GT-AG consensus or with more than two expressed sequence tags aligned at both ends of the gap), assembling splicing pairs according to their coordinates, determining gene boundaries (splicing pair predictions are generated to this point), generating predicted gene structures by aligning mRNA sequences to genomic templates, and comparing splicing pair predictions and gene structure predictions to find alternatively spliced isoforms.
[0903] In some embodiments, an SNP (single-nucleotide polymorphism) module identifies a substitution of a single nucleotide that occurs at a specific position in the genome, where each variation is present to some appreciable degree within a population (e.g., greater than 1%). For example, at a specific base position, or locus, in the human genome, the C nucleotide may appear in most individuals, but in a minority of individuals, the position is occupied by an A. This means that there is a SNP at this specific position and the two possible nucleotide variations, C or A, are said to be alleles for this position. SNPs underlie differences in human susceptibility to a wide range of diseases (e.g., sickle-cell anemia, β-thalassemia, and cystic fibrosis result from SNPs). The severity of illness and the way the body responds to treatments are also manifestations of genetic variations. For example, a single-base mutation in the APOE (apolipoprotein E) gene is associated with a lower risk for Alzheimer’s disease. A single-nucleotide variant (SNV) is a variation in a single nucleotide without any limitations of frequency and may arise in somatic cells. A somatic single nucleotide variation (e.g., caused by cancer) may also be called a single-nucleotide alteration. In some embodiments, an MNP (multiple-nucleotide polymorphism) module identifies the substitution of consecutive nucleotides at a specific position in the genome.
[0904] In some embodiments, an Indels module may identify an insertion or deletion of bases in the genome of an organism classified among small genetic variations. While indels usually measure from 1 to 10,000 base pairs in length, a microindel is defined as an indel that results in a net change of 1 to 50 nucleotides. Indels can be contrasted with a SNP or point mutation. An indel inserts and/or deletes nucleotides from a sequence, while a point mutation is a form of substitution that replaces one of the nucleotides without changing the overall number in the DNA. Indels, being insertions and/or deletions, can be used as genetic markers in natural populations, especially in phylogenetic studies. Indel frequency tends to be markedly lower than that of single nucleotide polymorphisms (SNP), except near highly repetitive regions, including homopolymers and microsatellites.
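By way of non-limiting illustration only, the distinctions drawn above between a substitution (which leaves the number of nucleotides unchanged) and an indel (which changes it), and between an indel and a microindel (a net change of 1 to 50 nucleotides), may be sketched as follows. The function name and the returned labels are assumptions made for this example, and the sketch assumes reference and alternate alleles supplied as plain strings.

```python
def classify_variant(ref: str, alt: str) -> str:
    """Label a reference/alternate allele pair by the kind of change it represents."""
    if len(ref) == len(alt):
        if ref == alt:
            return "no_change"
        # Equal length: a substitution that leaves the nucleotide count unchanged.
        if len(ref) == 1:
            return "single_nucleotide_substitution"
        return "multi_nucleotide_substitution"
    # Unequal length: nucleotides were inserted and/or deleted.
    net_change = abs(len(alt) - len(ref))
    # The text defines a microindel as a net change of 1 to 50 nucleotides.
    return "microindel" if net_change <= 50 else "indel"
```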
[0905] In some embodiments, an MSI (microsatellite instability) module may identify genetic hypermutability (predisposition to mutation) that results from impaired DNA mismatch repair (MMR). The presence of MSI represents phenotypic evidence that MMR is not functioning normally. MMR corrects errors that spontaneously occur during DNA replication, such as single base mismatches or short insertions and deletions. The proteins involved in MMR correct polymerase errors by forming a complex that binds to the mismatched section of DNA, excises the error, and inserts the correct sequence in its place. Cells with abnormally functioning MMR are unable to correct errors that occur during DNA replication, which causes the cells to accumulate errors in their DNA. This causes the creation of novel microsatellite fragments. Polymerase chain reaction-based assays can reveal these novel microsatellites and provide evidence for the presence of MSI. Microsatellites are repeated sequences of DNA. These sequences can be made of repeating units of one to six base pairs in length. Although the length of these microsatellites is highly variable from person to person and contributes to the individual DNA “fingerprint,” each individual has microsatellites of a set length. The most common microsatellite in humans is a dinucleotide repeat of the nucleotides C and A, which occurs tens of thousands of times across the genome. Microsatellites are also known as simple sequence repeats (SSRs).
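By way of non-limiting illustration only, locating microsatellites, described above as repeated sequences with repeating units of one to six base pairs, may be sketched with a regular expression. The function name and the default minimum of five tandem copies are assumptions made for this example; the sketch only locates repeats in a single sequence and does not by itself assess instability, which would require comparing repeat lengths between samples.

```python
import re


def find_microsatellites(seq: str, min_repeats: int = 5) -> list:
    """Return (start, unit, copies) for each tandem repeat with a 1-6 bp unit.

    The lazy quantifier makes the engine try the shortest unit first, so a run
    such as CACACACA is reported with unit "CA" rather than "CACA".
    """
    if min_repeats < 2:
        raise ValueError("min_repeats must be at least 2")
    # One unit of 1-6 bases followed by at least (min_repeats - 1) more copies.
    pattern = re.compile(r"([ACGT]{1,6}?)\1{%d,}" % (min_repeats - 1))
    return [
        (m.start(), m.group(1), len(m.group(0)) // len(m.group(1)))
        for m in pattern.finditer(seq.upper())
    ]
```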
[0906] In some embodiments, a TMB (tumor mutational burden) module may identify a measurement of mutations carried by tumor cells and is a predictive biomarker being studied to evaluate its association with response to Immuno-Oncology (I-O) therapy. Tumor cells with high TMB may have more neoantigens, with an associated increase in cancer-fighting T cells in the tumor microenvironment and periphery. These neoantigens can be recognized by T cells, inciting an anti-tumor response. TMB has emerged more recently as a quantitative marker that can help predict potential responses to immunotherapies across different cancers, including melanoma, lung cancer, and bladder cancer. TMB is defined as the total number of mutations per coding area of a tumor genome. Importantly, TMB is consistently reproducible. It provides a quantitative measure that can be used to better inform treatment decisions, such as selection of targeted or immunotherapies or enrollment in clinical trials.
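By way of non-limiting illustration only, the definition above (total number of mutations per coding area) may be expressed as a simple ratio. Reporting the result per megabase, and the function and parameter names, are assumptions made for this example; which mutations are counted and how the coding area is measured are left open by the text.

```python
def tumor_mutational_burden(mutation_count: int, coding_region_bp: int) -> float:
    """Return mutations per megabase of the coding region that was sequenced."""
    if coding_region_bp <= 0:
        raise ValueError("coding_region_bp must be positive")
    if mutation_count < 0:
        raise ValueError("mutation_count must be non-negative")
    # Scale the count to one million base pairs of coding sequence.
    return mutation_count * 1_000_000 / coding_region_bp
```

For example, under these assumptions 24 counted mutations over 1,200,000 base pairs of coding region correspond to 20 mutations per megabase.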
[0907] In some embodiments, a CNV (copy number variation) module may identify deviations from the normal genome, especially in the number of copies of a gene, portions of a gene, or other portions of a genome not defined by a gene, and any subsequent implications from analyzing genes, variants, alleles, or sequences of nucleotides. CNVs are a phenomenon in which structural variations may occur in sections of nucleotides, or base pairs, that include repetitions, deletions, or inversions.
[0908] In some embodiments, a Fusions module may identify hybrid genes formed from two previously separate genes. Hybrid genes may be a result of translocation, interstitial deletion, or chromosomal inversion. Gene fusion can play an important role in tumorigenesis. Fusion genes can contribute to tumor formation because they can produce much more active abnormal protein than non-fusion genes. Often, fusion genes are oncogenes that cause cancer; these include BCR-ABL, TEL-AML1 (ALL with t(12 ; 21)), AML1-ETO (M2 AML with t(8 ; 21)), and TMPRSS2-ERG with an interstitial deletion on chromosome 21, often occurring in prostate cancer. In the case of TMPRSS2-ERG, by disrupting androgen receptor (AR) signaling and inhibiting AR expression by oncogenic ETS transcription factor, the fusion product regulates prostate cancer. Most fusion genes are found from hematological cancers, sarcomas, and prostate cancer. BCAM-AKT2 is a fusion gene that is specific and unique to high-grade serous ovarian cancer. Oncogenic fusion genes may lead to a gene product with a new or different function from the two fusion partners. Alternatively, a proto-oncogene may be fused to a strong promoter, and thereby the oncogenic function is set to function by an upregulation caused by the strong promoter of the upstream fusion partner. The latter is common in lymphomas, where oncogenes are juxtaposed to the promoters of the immunoglobulin genes. Oncogenic fusion transcripts may also be caused by trans-splicing or read-through events. Since chromosomal translocations play such a significant role in neoplasia, a specialized database of chromosomal aberrations and gene fusions in cancer has been created. This database is called Mitelman Database of Chromosome Aberrations and Gene Fusions in Cancer.
[0909] In some embodiments, a VUS (variant of unknown significance) module may identify variants which are detected in the genome of a patient (especially in a patient’s cancer specimen) but cannot be classified as pathogenic or benign at the time of detection. VUS are catalogued from publications to identify if they may be classified as benign or pathogenic.
[0910] In some embodiments, a DNA Pathways module identifies defects in DNA repair pathways which enable cancer cells to accumulate genomic alterations that contribute to their aggressive phenotype. Cancerous tumors rely on residual DNA repair capacities to survive the damage induced by genotoxic stress which leads to isolated DNA repair pathways being inactivated in cancer cells. DNA repair pathways are generally thought of as mutually exclusive mechanistic units handling different types of lesions in distinct cell cycle phases. Recent preclinical studies, however, provide strong evidence that multifunctional DNA repair hubs, which are involved in multiple conventional DNA repair pathways, are frequently altered in cancer. Identifying pathways which may be affected may lead to important patient treatment considerations.
[0911] In some embodiments, a Raw Counts module identifies a count of the variants that are detected from the sequencing data. For DNA, in some embodiments, this comprises the number of reads from sequencing which correspond to a particular variant in a gene. For RNA, in some embodiments, this comprises the gene expression counts or the transcriptome counts from sequencing.
[0912] In some embodiments, classifications comprise classifications according to one or more trained models for generating predictions, and other structural variant classifications may include evaluating features from the feature collection, alterations from the alteration module, and other classifications from one or more classification modules. Structural variant classification may provide classifications to a stored classifications storage. In an exemplary classification module, a classification of a CNV as “Reportable” may mean that the CNV has been identified in one or more reference databases as influencing the tumor cancer characterization, disease state, or pharmacogenomics, “Not Reportable” may mean that the CNV has not been identified as such, and “Conflicting Evidence” may mean that the CNV has evidence suggesting both “Reportable” and “Not Reportable.” Furthermore, a classification of therapeutic relevance is similarly ascertained from any reference dataset’s mention of a therapy which may be impacted by the detection (or non-detection) of the CNV. Other classifications may include applications of machine learning algorithms, neural networks, regression techniques, graphing techniques, inductive reasoning approaches, or other artificial intelligence evaluations within modules. In some embodiments, a classifier for clinical trials may include evaluation of variants identified from the alteration module which have been identified as significant or reportable, evaluation of all clinical trials available to identify inclusion and exclusion criteria, mapping the patient’s variants and other information to the inclusion and exclusion criteria, and classifying clinical trials as applicable to the patient or as not applicable to the patient. In some embodiments, similar classifications are performed for therapies, loss-of-function, gain-of-function, diagnosis, microsatellite instability, tumor mutational burden, indels, SNP, MNP, fusions, CNV, splicing, and other alterations which may be classified based upon the results of the alteration modules. Additionally, in some embodiments, models trained to classify a type of tumor for patients with tumors of unknown origin are generated according to the disclosure herein. In some embodiments, classifications are generated and stored as part of a feature collection in a stored classifications database.
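By way of non-limiting illustration only, the three CNV labels described above may be assigned from reference-database evidence as sketched below. The function name and the representation of the evidence as a list of booleans per gene (True where an entry supports relevance, False where an entry argues against it) are assumptions made for this example.

```python
def classify_cnv(gene: str, reference_evidence: dict) -> str:
    """Assign "Reportable", "Not Reportable", or "Conflicting Evidence" to a CNV.

    reference_evidence maps a gene name to a list of booleans, one per
    reference-database entry (hypothetical structure for this sketch).
    """
    evidence = reference_evidence.get(gene, [])
    supports = any(evidence)
    refutes = any(not entry for entry in evidence)
    if supports and refutes:
        return "Conflicting Evidence"
    if supports:
        return "Reportable"
    # Covers both explicit negative evidence and the absence of any entry.
    return "Not Reportable"
```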
[0913] In some embodiments, each of the feature collection, alteration module(s), structural variant, and feature store are communicatively coupled to a data bus to transfer data between each module for processing and/or storage. In some embodiments, each of the feature collection, alteration module(s), and classifications may be communicatively coupled to each other for independent communication without sharing the data bus.
[0914] In addition to the above features and enumerated modules, in some embodiments, feature modules may further include one or more of the following modules within their respective modules as a sub-module or as a standalone module.
[0915] In some embodiments, a germline/somatic DNA feature module comprises a feature collection associated with the DNA-derived information of a patient or a patient’s tumor. These features may include raw sequencing results, such as those stored in FASTQ, BAM, VCF, or other sequencing file types known in the art; genes; mutations; variant calls; and variant characterizations. In some embodiments, genomic information from a patient’s normal sample is stored as germline and genomic information from a patient’s tumor sample is stored as somatic.
[0916] In some embodiments, an RNA feature module comprises a feature collection associated with the RNA-derived information of a patient, such as transcriptome information. These features may include raw sequencing results, transcriptome expressions, genes, mutations, variant calls, and variant characterizations.
[0917] In some embodiments, a metadata module comprises a feature collection associated with the human genome, protein structures and their effects, such as changes in energy stability based on a protein structure.
[0918] In some embodiments, a clinical module comprises a feature collection associated with information derived from clinical records of a patient and records from family members of the patient. These may be abstracted from unstructured clinical documents, EMR, EHR, or other sources of patient history. Information may include patient symptoms, diagnosis, treatments, medications, therapies, hospice, responses to treatments, laboratory testing results, medical history, geographic locations of each, demographics, or other features of the patient which may be found in the patient’s medical record. Information about treatments, medications, therapies, and the like may be ingested as a recommendation or prescription and/or as a confirmation that such treatments, medications, therapies, and the like were administered or taken.
[0919] In some embodiments, an imaging module comprises a feature collection associated with information derived from imaging records of a patient. Imaging records may include H&E slides, IHC slides, radiology images, and other medical imaging which may be ordered by a physician during the course of diagnosis and treatment of various illnesses and diseases. These features may include TMB, ploidy, purity, nuclear-cytoplasmic ratio, large nuclei, cell state alterations, biological pathway activations, hormone receptor alterations, immune cell infiltration, immune biomarkers of MMR, MSI, PDL1, CD3, FOXP3, HRD, PTEN, PIK3CA; collagen or stroma composition, appearance, density, or characteristics; tumor budding, size, aggressiveness, metastasis, immune state, chromatin morphology; and other characteristics of cells, tissues, or tumors for prognostic predictions.
[0920] In some embodiments, an epigenome module, such as an epigenome module from Omics, comprises a feature collection associated with information derived from DNA modifications which are not changes to the DNA sequence and regulate the gene expression. These modifications are frequently the result of environmental factors based on what the patient may breathe, eat, or drink. These features may include DNA methylation, histone modification, or other factors which deactivate a gene or cause alterations to gene function without altering the sequence of nucleotides in the gene.
[0921] In some embodiments, a microbiome module, such as microbiome module from Omics, comprises a feature collection associated with information derived from the viruses and bacteria of a patient. Viral genomics may be generated to identify which viruses are present in the patient’s specimen(s) based upon the genomic features which map to viral DNA or RNA (e.g., a viral reference genome(s)) instead of the human genome. These features may include viral infections which may affect treatment and diagnosis of certain illnesses as well as the bacteria present in the patient’s gastrointestinal tract which may affect the efficacy of medicines ingested by the patient.
[0922] In some embodiments, a proteome module, such as proteome module from Omics, comprises a feature collection associated with information derived from the proteins produced in the patient. These features may include protein composition, structure, and activity; when and where proteins are expressed; rates of protein production, degradation, and steady-state abundance; how proteins are modified, for example, post-translational modifications such as phosphorylation; the movement of proteins between subcellular compartments; the involvement of proteins in metabolic pathways; how proteins interact with one another; or modifications to the protein after translation from the RNA such as phosphorylation, ubiquitination, methylation, acetylation, glycosylation, oxidation, or nitrosylation.
[0923] In some embodiments, additional Omics module(s) are included in Omics, such as a feature collection associated with all the different fields of omics, including: cognitive genomics, a collection of features comprising the study of the changes in cognitive processes associated with genetic profiles; comparative genomics, a collection of features comprising the study of the relationship of genome structure and function across different biological species or strains; functional genomics, a collection of features comprising the study of gene and protein functions and interactions including transcriptomics; interactomics, a collection of features comprising the study relating to large-scale analyses of gene-gene, protein-protein, or protein-ligand interactions; metagenomics, a collection of features comprising the study of metagenomes such as genetic material recovered directly from environmental samples; neurogenomics, a collection of features comprising the study of genetic influences on the development and function of the nervous system; pangenomics, a collection of features comprising the study of the entire collection of gene families found within a given species; personal genomics, a collection of features comprising the study of genomics concerned with the sequencing and analysis of the genome of an individual such that once the genotypes are known, the individual’s genotype can be compared with the published literature to determine likelihood of trait expression and disease risk to enhance personalized medicine suggestions; epigenomics, a collection of features comprising the study of supporting the structure of genome, including protein and RNA binders, alternative DNA structures, and chemical modifications on DNA; nucleomics, a collection of features comprising the study of the complete set of genomic components which form the cell nucleus as a complex, dynamic biological system; lipidomics, a collection of features comprising the study of cellular 
lipids, including the modifications made to any particular set of lipids produced by a patient; proteomics, a collection of features comprising the study of proteins, including the modifications made to any particular set of proteins produced by a patient; immunoproteomics, a collection of features comprising the study of large sets of proteins involved in the immune response; phosphoproteomics, a collection of features comprising the study of phosphorylation patterns of proteins, including the modifications made to any particular set of proteins produced by a patient; nutriproteomics, a collection of features comprising the study of identifying molecular targets of nutritive and non-nutritive components of the diet including the use of proteomics mass spectrometry data for protein expression studies; proteogenomics, a collection of features comprising the study of biological research at the intersection of proteomics and genomics including data which identifies gene annotations; structural genomics, a collection of features comprising the study of 3-dimensional structure of every protein encoded by a given genome using a combination of modeling approaches; glycomics, a collection of features comprising the study of sugars and carbohydrates and their effects in the patient; foodomics, a collection of features comprising the study of the intersection between the food and nutrition domains through the application and integration of technologies to improve consumer’s well-being, health, and knowledge; transcriptomics, a collection of features comprising the study of RNA molecules, including mRNA, rRNA, tRNA, and other non-coding RNA, produced in cells; metabolomics, a collection of features comprising the study of chemical processes involving metabolites, or unique chemical fingerprints that specific cellular processes leave behind, and their small-molecule metabolite profiles; metabonomics, a collection of features comprising the study of the quantitative 
measurement of the dynamic multiparametric metabolic response of cells to pathophysiological stimuli or genetic modification; nutrigenetics, a collection of features comprising the study of genetic variations on the interaction between diet and health with implications to susceptible subgroups; cognitive genomics, a collection of features comprising the study of the changes in cognitive processes associated with genetic profiles; pharmacogenomics, a collection of features comprising the study of the effect of the sum of variations within the human genome on drugs; pharmacomicrobiomics, a collection of features comprising the study of the effect of variations within the human microbiome on drugs; toxicogenomics, a collection of features comprising the study of gene and protein activity within particular cell or tissue of an organism in response to toxic substances; mitointeractome, a collection of features comprising the study of the process by which the mitochondria proteins interact; psychogenomics, a collection of features comprising the study of the process of applying the powerful tools of genomics and proteomics to achieve a better understanding of the biological substrates of normal behavior and of diseases of the brain that manifest themselves as behavioral abnormalities, including applying psychogenomics to the study of drug addiction to develop more effective treatments for these disorders as well as objective diagnostic tools, preventive measures, and cures; stem cell genomics, a collection of features comprising the study of stem cell biology to establish stem cells as a model system for understanding human biology and disease states; connectomics, a collection of features comprising the study of the neural connections in the brain; microbiomics, a collection of features comprising the study of the genomes of the communities of microorganisms that live in the digestive tract; cellomics, a collection of features comprising the study of the quantitative 
cell analysis and study using bioimaging methods and bioinformatics; tomomics, a collection of features comprising the study of tomography and omics methods to understand tissue or cell biochemistry at high spatial resolution from imaging mass spectrometry data; ethomics, a collection of features comprising the study of high-throughput machine measurement of patient behavior; and videomics, a collection of features comprising the study of a video analysis paradigm inspired by genomics principles, where a continuous image sequence, or video, can be interpreted as the capture of a single image evolving through time of mutations revealing patient insights.
[0924] In some embodiments, a sufficiently robust collection of features comprises all of the features disclosed above; however, models and predictions based on the available features comprise models which are optimized and trained from a selection of features that is much more limited than the exhaustive feature set. In some embodiments, such a constrained feature set comprises as few as tens to hundreds of features. For example, a model’s constrained feature set may include the genomic results of a sequencing of the patient’s tumor, derivative features based upon the genomic results, the patient’s tumor origin, the patient’s age at diagnosis, the patient’s gender and race, and symptoms that the patient brought to their physician’s attention during a routine checkup.
[0925] In some embodiments, a feature store may enhance a patient’s feature set through the application of machine learning and analytics by selecting from any features, alterations, or calculated output derived from the patient’s features or alterations to those features. In some embodiments, such a feature store may generate new features from the original features found in feature module or may identify and store important insights or analysis based upon the features. In some embodiments, the selection of features is based at least upon an alteration or calculation to be generated, and comprises the calculation of single or multiple nucleotide polymorphisms, insertions or deletions of the genome, a tumor mutational burden, a microsatellite instability, a copy number variation, a fusion, or other such calculations. In some embodiments, an exemplary output of an alteration or calculation generated which may inform future alterations or calculations includes a finding of hypertrophic cardiomyopathy (HCM) and variants in MYH7. In some embodiments, previously classified variants may be identified in the patient’s genome which may inform the classification of novel variants or indicate a further risk of disease. An exemplary approach includes the enrichment of variants and their respective classifications to identify a region in MYH7 that is associated with HCM. Novel variants detected from a patient’s sequencing localized to this region would increase the patient’s risk for HCM. In some embodiments, features which may be utilized in such an alteration detection include the structure of MYH7 and classification of variants therein. In some embodiments, a model focused on enrichment may isolate such variants. An exemplary output of an alteration or calculation generated which may inform future alterations or calculations includes a finding of lung cancer and variants in EGFR, an epidermal growth factor receptor gene that is mutated in ~10% of non-small cell lung cancer and ~50% of lung cancers from non-smokers. In some embodiments, previously classified variants may be identified in the patient’s genome which may inform the classification of novel variants or indicate a further risk of disease. An exemplary approach may include the enrichment of variants and their respective classifications to identify a region nearby or with evidence to interact with EGFR and associated with cancer. Novel variants detected from a patient’s sequencing localized to this region or interactions with this region would increase the patient’s risk. In some embodiments, features which may be utilized in such an alteration detection include the structure of EGFR and classification of variants therein. In some embodiments, a model focused on enrichment may isolate such variants.
[0926] In some embodiments, the above-referenced classification model may include one or more classification models which may be implemented as artificial intelligence engines and may include gradient boosting models, random forest models, neural networks (NN), regression models, Naive Bayes models, or machine learning algorithms (MLA). An MLA or a NN may be trained from a training data set. In an exemplary prediction profile, a training data set may include imaging, pathology, clinical, and/or molecular reports and details of a patient, such as those curated from an EHR or genetic sequencing reports. MLAs include supervised algorithms (such as algorithms where the features/classifications in the data set are annotated) using linear regression, logistic regression, decision trees, classification and regression trees, Naive Bayes, nearest neighbor clustering; unsupervised algorithms (such as algorithms where no features/classification in the data set are annotated) using Apriori, k-means clustering, principal component analysis, random forest, adaptive boosting; and semi-supervised algorithms (such as algorithms where an incomplete number of features/classifications in the data set are annotated) using generative approaches (such as a mixture of Gaussian distributions, mixture of multinomial distributions, hidden Markov models), low density separation, graph-based approaches (such as mincut, harmonic function, manifold regularization), heuristic approaches, or support vector machines. NNs include conditional random fields, convolutional neural networks, attention based neural networks, deep learning, long short term memory networks, or other neural models where the training data set includes a plurality of tumor samples, RNA expression data for each sample, and pathology reports covering imaging data for each sample. While MLA and neural networks identify distinct approaches to machine learning, the terms may be used interchangeably herein.
Thus, a mention of MLA may include a corresponding NN or a mention of NN may include a corresponding MLA unless explicitly stated otherwise. Training may include providing optimized datasets, labeling these traits as they occur in patient records, and training the MLA to predict or classify based on new inputs. Artificial NNs are efficient computing models which have shown their strengths in solving hard problems in artificial intelligence. They have also been shown to be universal approximators (can represent a wide variety of functions when given appropriate parameters). In some embodiments, some MLA may identify features of importance and assign a coefficient, or weight, to them. The coefficient may be multiplied with the occurrence frequency of the feature to generate a score, and once the scores of one or more features exceed a threshold, certain classifications may be predicted by the MLA. In some embodiments, a coefficient schema may be combined with a rule-based schema to generate more complicated predictions, such as predictions based upon multiple features. For example, ten key features may be identified across different classifications. In some embodiments, a list of coefficients may exist for the key features, and a rule set may exist for the classification. In some embodiments, a rule set may be based upon the number of occurrences of the feature, the scaled weights of the features, or other qualitative and quantitative assessments of features encoded in logic known to those of ordinary skill in the art. In other MLA, features may be organized in a binary tree structure. For example, key features which distinguish between the most classifications may sit at the root of the binary tree and at each subsequent branch, until a classification is awarded upon reaching a terminal node of the tree. For example, a binary tree may have a root node which tests for a first feature.
This feature either occurs or does not occur (the binary decision), and the logic may traverse the branch which is true for the item being classified. Additional rules may be based upon thresholds, ranges, or other qualitative and quantitative tests. While supervised methods are useful when the training dataset has many known values or annotations, the nature of EMR/EHR documents is that there may not be many annotations provided. When exploring large amounts of unlabeled data, unsupervised methods are useful for binning/bucketing instances in the data set. A single instance of the above models, or two or more such instances in combination, may constitute a model for the purposes of models, artificial intelligence, neural networks, or machine learning algorithms, herein.
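The coefficient schema and the binary tree structure described above can be illustrated with a minimal sketch. The feature names, weights, threshold, tree layout, and class labels below are hypothetical values chosen only for illustration and are not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set

# Hypothetical feature weights and threshold (illustrative assumptions only).
WEIGHTS: Dict[str, float] = {
    "variant_in_region": 2.0,
    "family_history": 1.5,
    "benign_variant": -1.0,
}
THRESHOLD = 3.0


def weighted_score(counts: Dict[str, int], weights: Dict[str, float]) -> float:
    """Multiply each coefficient by the feature's occurrence count and sum."""
    return sum(weight * counts.get(name, 0) for name, weight in weights.items())


def exceeds_threshold(counts: Dict[str, int],
                      weights: Dict[str, float] = WEIGHTS,
                      threshold: float = THRESHOLD) -> bool:
    """Predict the classification once the combined score exceeds the threshold."""
    return weighted_score(counts, weights) > threshold


@dataclass
class Node:
    """Internal nodes test one feature; terminal nodes carry only a label."""
    feature: Optional[str] = None
    if_true: Optional["Node"] = None
    if_false: Optional["Node"] = None
    label: Optional[str] = None


def traverse(node: Node, present: Set[str]) -> str:
    """Follow the branch that is true for the item until a terminal node."""
    while node.label is None:
        node = node.if_true if node.feature in present else node.if_false
    return node.label


# Hypothetical two-level tree: the root tests the most discriminating feature.
TREE = Node(
    feature="variant_in_region",
    if_true=Node(
        feature="family_history",
        if_true=Node(label="high risk"),
        if_false=Node(label="moderate risk"),
    ),
    if_false=Node(label="low risk"),
)
```

In this sketch a patient with two variants in the region, one family-history entry, and one benign variant scores 2.0·2 + 1.5·1 − 1.0·1 = 4.5, which exceeds the assumed threshold of 3.0.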
[0927] Online Portal
[0928] In various embodiments, one or more statistical models and analyses are combined to accommodate a particular purpose and, through a variation of the initial analysis, are used to solve a number of problems. Such a combination of statistical models and analyses is, in some embodiments, stored as a notebook in the Interactive Analysis Portal. Notebook is a feature in the Interactive Analysis Portal which provides an easily accessible framework for building statistical models and analyses. Once the statistical models and analyses have been developed, they may then be shared with different users to analyze and find answers to scientific and business questions other than those for which they were initially developed.
[0929] 1) In some embodiments, the Interactive Analysis Portal allows input customization through a simple, intuitive point-and-click/drag-and-drop interface to narrow down the cohort for analysis. Cohorts which have been selected, either through the Interactive Analysis Portal, Outliers, Smart Cohorts, or other portals of the Interactive Analysis Portal, are, in some embodiments, provided to a notebook for processing.
[0930] 2) In some embodiments, a custom application programming interface (API) having a library of function calls which interface with the Interactive Analysis Portal, underlying authorized databases, and any supported statistical models, visualizations, arithmetic models, and other provided operations may be provided to the user to integrate a notebook or workbook with the Interactive Analysis Portal data, function calls, and other resources. Exemplary function calls may include listing authorized sources of data, selecting a datasource, filtering the datasource, listing clinical events of the patients in the current filtered cohort, identification of fusions from RNA or DNA, identification of genes from RNA or DNA, identifying matching clinical trials, identifying DNA variants, identifying immunohistochemistry (IHC), identifying RNA expressions, identifying therapies in the cohort, identifying potential therapies that are applicable to treat patients in the cohort, and other cohort or dataset processing.
[0931] 3) In some embodiments, the Interactive Analysis Portal allows the Notebook generation to perform one or more statistical models, analysis, and visualization or reporting of results to the narrowed down cohort without having the user code anything in the notebook as the selected models, analysis, visualizations, or reports of the notebook itself are configured to accept the cohort from the Interactive Analysis Portal and provide the analysis on the cohort as is, without user intervention at the code level. Some models may have hyperparameters or tuning parameters which may be selected, or the models themselves may identify the optimal parameters to be applied based on the cohort and/or other models, analysis, visualizations, or reports during run-time.
[0932] 4) In some embodiments, the Interactive Analysis Portal displays the prepared results to the user based on the selected notebook.
[0933] 5) In some embodiments, an associated user selects a previously generated notebook which applies selected analysis to the narrowed down cohort without having the user code or recode anything in the notebook as the notebook itself is configured to accept the cohort from the Interactive Analysis Portal and provide the notebook results without user intervention.
[0934] 6) In some embodiments, users track the computation resources used by their notebooks for understanding the costs for cloud computing or hardware resources over the network and may track the popularity of their notebook to judge the effectiveness of the statistical analysis that they provide through the notebook.
[0935] In certain embodiments, notebooks provide a benefit to users by allowing the Interactive Analysis Portal to provide custom templates to their selected data and leverage pre-built healthcare statistical models to provide results to users who are not sophisticated in programming. Internal teams may analyze curated data in order to support new healthcare insights that both help improve patient care and improve life science research. Similarly, external users have easy access to this proprietary real-world data for analysis and access to proprietary statistical models.
[0936] A billing model for a user may be provided on a subscription basis or an on-demand basis. For example, a user may subscribe to one or more data sets for a period of time, such as a monthly or yearly subscription, or the user may pay on a per-access basis for data and notebook usage, such as for loading a specific cohort with a corresponding notebook and paying a fee to generate the instant results for consumption. Users may desire a benchmarking and optimization portal through which they may view and optimize their storage and computing resource usage.
[0937] Generating a notebook may be performed with a GUI for notebook editing. A user may configure a reporting page for a notebook. A reporting page may include text, images, and graphs as selected and populated by the users. Preconfigured elements may be selected from a list, such as a dropdown list or a drag-and-drop menu. Preconfigured elements include statistical analysis modules and machine learning models. For example, a user may wish to perform linear regression on the data with respect to specific features. A user may select linear regression, and a menu with checkboxes may appear with features from their data set which should be supplied to the linear regression model. Once filled out, a template for reporting the linear regression results with respect to the selected features may be added to the reporting page at a location identified by the active cursor or the drop location for a drag-and-drop element. If a user wishes to solve a problem using a machine learning model, it may be added to the sheet. A header may be populated identifying the model, the hyperparameters, and the reported results. In some instances, a model that was previously trained may then be applied to the current cohort. In other instances, the model may be trained on the fly, for example by selecting annotated features and associated outcomes for which the model should be trained. In an unsupervised machine learning model, the model may not require selection of annotated features as the features will be identified during training. In some embodiments, if a selected statistical model requires results from a trained model which are not computed in the template, the template may automatically add the trained model to generate the required results prior to inserting the selected statistical model to the notebook.
[0938] Statistical analysis models may be predesigned for calculating the arithmetic mean of the cohort with respect to a selected feature, the standard deviation/distribution of the cohort for a selected feature, regression relationships between variables for selected features, sample size determining models for subsetting the cohort into the optimal sub-population for analysis, or t-testing modules for identifying statistically significant features and correlations in the cohort. Other precomputed statistical analysis modules may perform cohort analysis to identify significant correlations and/or features in the cohort, data mining to identify meaningful patterns, or data dredging to match statistical models to the data and report out which models may be applicable and add those models to the notebook.
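As a minimal sketch of the kind of predesigned statistical module described above, the following computes the arithmetic mean and standard deviation of a selected feature and a two-sample t statistic between two subgroups. Welch's unequal-variance form of the t statistic is an assumption made for this illustration, as are the function names and the example ages; the disclosure does not specify a particular test variant.

```python
from math import sqrt
from statistics import mean, stdev, variance
from typing import Dict, Sequence


def summarize(values: Sequence[float]) -> Dict[str, float]:
    """Arithmetic mean and sample standard deviation of one cohort feature."""
    return {"mean": mean(values), "stdev": stdev(values)}


def welch_t(a: Sequence[float], b: Sequence[float]) -> float:
    """Welch's t statistic for two independent samples (unequal variances)."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Hypothetical ages at initial diagnosis for two subgroups of a cohort.
responders = [50, 52, 54, 56, 58]
non_responders = [52, 54, 56, 58, 60]

summary = summarize(responders)
t_statistic = welch_t(responders, non_responders)
```

Converting the statistic into a p-value additionally requires the t distribution with the Welch-Satterthwaite degrees of freedom, which is omitted here to keep the sketch within the standard library.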
[0939] Machine learning models may apply linear regression algorithms, non-linear regression, logistic regression algorithms, classification models, bootstrap resampling models, subset selection models, dimensionality reduction models, tree-based models (such as bagging, boosting, and random forest), and other supervised or unsupervised models. As each model is selected, a target output may be requested from the user specifying which feature(s) the model should identify, classify, and/or report. For example, a user may select for the model to identify which features most closely correlate to patient survival in the cohort, or which features most closely correlate with a positive treatment outcome in the cohort. The user may also select which classification labels from the classification labels of the model that they wish the model to classify. In an example where the model may classify the cohort according to five labels, the user may specify one or more labels as a binary classification (patient has label, patient does not have label) such as whether a patient with a tumor of unknown origin originated from the breast, lung, or brain. The user may select only breast to identify for any tumors of unknown origin whether the tumor may be classified as coming from the breast or not from the breast.
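The binary restriction of a multi-label classifier described above can be sketched as a simple post-processing step. The label names and the helper function below are hypothetical; the sketch assumes the model has already produced one predicted origin per patient.

```python
from typing import List

# Hypothetical set of labels the underlying model can emit.
MODEL_LABELS = ["breast", "lung", "brain", "colon", "skin"]


def as_binary(predicted: List[str], selected: str) -> List[bool]:
    """Collapse multi-class predictions to 'has label' / 'does not have label'."""
    if selected not in MODEL_LABELS:
        raise ValueError(f"unknown label: {selected}")
    return [label == selected for label in predicted]


# Predicted origins for four tumors of unknown origin (illustrative values).
predictions = ["breast", "lung", "brain", "breast"]
from_breast = as_binary(predictions, "breast")
```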
[0940] A system for predicting and analyzing patient cohort response, progression, and survival may include a back-end layer that includes a patient data store accessible by a patient cohort selector module in communication with a patient cohort timeline data storage. The patient cohort selector module interacts with a front-end layer that includes an interactive analysis portal that may be implemented, in one instance, via a web browser to allow for on-demand filtering and analysis of the data store.
[0941] The interactive analysis portal may include a plurality of user interfaces including an interactive cohort selection filtering interface that, as discussed in greater detail below, permits a user to query and filter elements of the data store. As discussed in greater detail below, the portal also may include a cohort funnel and population analysis interface, a patient timeline analysis user interface, a patient survival analysis user interface, and a patient event likelihood analysis user interface. The portal further may include a patient next event analysis user interface and one or more patient future analysis user interfaces.
[0942] The back-end layer also may include a distributed computing and modeling layer that receives data from the patient cohort timeline data storage to provide inputs to a plurality of modules, including a time-to-event modeling module that powers the patient survival analysis user interface, an event likelihood module that calculates the likelihood of one or more events received at the patient event likelihood analysis user interface for subsequent display in that user interface, a next event modeling module that generates models of one or more next events for subsequent display at the patient next event analysis user interface, and one or more future modeling modules that generate one or more future models for subsequent display at the one or more patient future analysis user interfaces.
[0943] The patient data store may be a pre-existing dataset that includes patient clinical history, such as demographics, comorbidities, diagnoses and recurrences, medications, surgeries, and other treatments along with their response and adverse effects details. The patient data store may also include patient genetic/molecular sequencing and genetic mutation details relating to the patient, as well as organoid modeling results. In one aspect, these datasets may be generated from one or more sources. For example, institutions implementing the system may be able to draw from all of their records; for example, all records from all doctors and/or patients connected with the institution may be available to the institution’s agents, physicians, researchers, or other authorized members. Similarly, doctors may be able to draw from all of their records; for example, records for all of their patients. Alternatively, certain system users may be able to buy or license access to the datasets, such as when those users do not have immediate access to a sufficiently robust dataset, when those users are looking for even more records, and/or when those users are looking for specific data types, such as data reflecting patients having certain primary cancers, metastases by origin site and/or diagnosis site, recurrences by origin, metastases, or diagnosis sites, etc.
[0944] Patient Cohort Filtering User Interface
[0945] A first embodiment of a patient cohort selection filtering interface may be provided as a side pane provided along a height (or, alternatively, a length) of a display screen, through which attribute criteria (such as clinical, molecular, demographic etc.) can be specified by the user, defining a patient population of interest for further analysis. The side pane may be hidden or expanded by selecting it, dragging it, double-clicking it, etc.
[0946] Additionally, or alternatively, the system may recognize one or more attributes defined for tumor data stored by the system, where those attributes may be, for example, genotypic, phenotypic, genealogical, or demographic. The various selectable attribute criteria may reflect patient-related metadata stored in the patient data store, where exemplary metadata may include, for instance: Project Name (which may reflect a database storing a list of patients), Gender, Race; Cancer, Cancer Site, Cancer Name; Metastasis, Cancer Name; Tumor Site (which may reflect where the tumor was located), Stage (such as I, II, III, IV, and unknown), M Stage (such as m0, m1, m2, m3, and unknown); Medication (such as by Name or Ingredient); Sequencing (such as gene name or variant), MSI (Microsatellite Instability) status, TMB (Tumor Mutational Burden) status; Procedure (such as, by Name); or Death (such as, by Event Name or Cause of Death).
[0947] The system also may permit a user to filter patient data according to any of the criteria listed herein including those listed under the heading “Features and Feature Modules,” and include one or more of the following additional criteria: institution, demographics, molecular data, assessments, diagnosis site, tumor characterization, treatment, or one or more internal criteria. The institution option may permit a user to filter according to a specific facility. The demographics option may permit a user to sort, for example, by one or more of gender, death status, age at initial diagnosis, or race. The molecular data option may permit a user to filter according to variant calls (for example, when there is molecular data available for the patient, what the particular gene name, mutation, mutation effect, and/or sample type is), abstracted variants (including, for example, gene name and/or sequencing method), MSI status (for example, stable, low, or high), or TMB status (for example, selectable within or outside of a user-defined range). Assessments may permit a user to filter according to various system-defined criteria such as smoking status and/or menopausal status. Diagnosis site may permit a user to filter according to primary and/or metastatic sites. Tumor characterization may permit the user to filter according to one or more tumor-related criteria, for example, grade, histology, stage, TNM Classification of Malignant Tumours (TNM) and/or each respective T value, N value, and/or M value. Treatment may permit the user to select from among various treatment-related options, including, for instance, an ingredient, a regimen, a treatment type, etc.
[0948] Certain criteria may permit the user to select from a plurality of sub-criteria that may be indicated once the initial criteria are selected. Other criteria may present the user with a binary option, for example, deceased or not. Still other criteria may present the user with slider or range-type options, for example, age at initial diagnosis may be presented as a slider with user-selectable lower and upper bounds. Still further, for any of these options, the system may present the user with a radio button or slider to alternate between whether the system should include or exclude patients based on the selected criterion. It should be understood that the examples described herein do not limit the scope of the types of information that may be used as criteria. Any type of medical information capable of being stored in a structured format may be used as a criterion.
[0949] In another embodiment, the user interface may include a natural language search style bar to facilitate filter criteria definition for the cohort, for example, in the “Ask Gene” tab of the user interface or via a text input of the filtering interface. In one aspect, an ability to specify a query, either via keyboard-type input or via machine-interpreted dictation, may define one or more of the subsequent layers of a cohort funnel (described in greater detail in the next section). Thus, for example, when employing traditional natural language processing software or techniques, an input of “breast cancer patients” would cause the system to recognize a filter of “cancer_site == breast cancer” and add that as the next layer of filtering. Similarly, the system would recognize an input of “pancreatic patients with adverse reactions to gemcitabine” and translate it into multiple successive layers of filtering, for example, “cancer_site == pancreatic cancer” AND “medication == gemcitabine” AND “adverse reaction == not null.”
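The translation from a free-text query to successive filter layers can be sketched as follows. A keyword lookup stands in here for the natural language processing software mentioned above, which is a deliberate simplification; the vocabularies, the tuple representation of a filter, and the function name are assumptions made for illustration.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical vocabularies; a production system would use a real NLP pipeline.
CANCER_SITES = {"breast": "breast cancer", "pancreatic": "pancreatic cancer"}
MEDICATIONS = {"gemcitabine"}

Filter = Tuple[str, str, Optional[str]]


def parse_query(text: str) -> List[Filter]:
    """Translate a free-text query into an ordered list of funnel filters."""
    tokens = re.findall(r"[a-z]+", text.lower())
    filters: List[Filter] = []
    for token in tokens:
        if token in CANCER_SITES:
            filters.append(("cancer_site", "==", CANCER_SITES[token]))
    for token in tokens:
        if token in MEDICATIONS:
            filters.append(("medication", "==", token))
    if "adverse" in tokens:
        # "not null" is represented as a != comparison against None.
        filters.append(("adverse_reaction", "!=", None))
    return filters


single = parse_query("breast cancer patients")
layered = parse_query("pancreatic patients with adverse reactions to gemcitabine")
```

Each tuple in the returned list corresponds to one successive layer of the cohort funnel, combined with AND.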
[0950] In a second aspect, the natural language processing may permit a user to use the system to query for general insights directly, thereby both narrowing down a cohort of patients via one or more funnel levels and also causing the system to display an appropriate summary panel in the user interface. Thus, in the situation that the system receives the query “What is the 5 years progression-free survival rate for stage III colorectal cancer patients, after radiotherapy?” the system would translate it into a series of filters such as “cancer_site == colorectal” AND “stage == III” AND “treatment == radiotherapy” and then display five-year progression-free survival rates using, for example, the patient survival analysis user interface. Similarly, the query “What percentage of female lung cancer patients are post menopausal at a time of diagnosis?” would be translated into a series of filters such as “gender == female,” “cancer_site == lung,” and “temporal == at diagnosis,” after which the system would determine how many of the resulting patients had data reflecting a post-menopause situation, and then determine the relevant percentage, for example, displaying the results through one or more statistical summary charts.
[0951] Cohort Funnel and Population Analysis User Interface
[0952] The cohort funnel and population analysis user interface may be configured to permit a user to conduct analysis of a cohort, for the purpose of identifying key inflection points in the distribution of patients exhibiting each attribute of interest, relative to the distributions in the general patient population or a patient population whose data is stored in the patient data store. In one aspect, the filtering and selection of additional patient-related criteria discussed above may be used in connection with the cohort funnel and population analysis user interface.
[0953] In another embodiment, the system may include a selectable button or icon that opens a dialogue box which shows a plurality of selectable tabs, each tab representing the same or similar filtering criteria discussed above (Demographics, Molecular Data, Assessments, Diagnosis Site, Tumor Characterization, and Treatment). Selection of each tab may present the user with the same or similar options for each respective filter as discussed above (for example, selecting “Demographics” may present the user with further options relating to: Gender, Death Status, Age at Initial Diagnosis, or Race). The user then may select one or more options, select “next,” and then select whether it is an inclusion or exclusion filter, and the corresponding selection is added to the funnel (discussed in greater detail below), with an icon moving to be below a next successively narrower portion of the funnel.
[0954] Additionally, or alternatively, looking at the cohort, or set of patients in a database, the system permits filtering by a plurality of clinical and molecular factors via a menu. For example, and with regard to clinical factors, the system may include filters based on patient demographics, cancer site, tumor characterization, or molecular data which further may include their own subsets of filterable options, such as histology, stage, and/or grade-based options for tumor characterization. With regard to molecular factors, the system may permit filtering according to variant calls, abstracted variants, MSI, and/or TMB.
[0955] Although the examples discussed herein provide analysis with regard to various cancer types, in other embodiments, it will be appreciated that the system may be used to indicate filtered display of other disease conditions, and it should be understood that the selection items will differ in those situations to focus particularly on the relevant conditions for the other disease.
[0956] The cohort funnel and population analysis user interface may visually depict the number of patients in the data set, either all at once or progressively upon receiving a user’s selection of multiple filtering criteria. In one aspect, the display of patient frequencies by filter attribute may be provided using an interactive funnel chart. With each selection, the user interface updates to illustrate the reduction in results matching the filter criteria; for example, as each filter criterion is added, fewer patients match all of the selected criteria.
[0957] The above filtering can be performed upon receiving each user selection of a filter criterion, the funnel updating to show the narrowing span of the dataset upon each filter selection. In that situation a filtering menu such as the one discussed above may remain visible in each tab as they are toggled, or may be collapsed to the side, or may be represented as a summary of the selected filtered options to keep the user apprised of the reduced data set/size.
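The progressive narrowing shown by the funnel can be sketched as a running count after each filter is applied. The patient records, field names, and filter labels below are hypothetical and serve only to illustrate the counting.

```python
from typing import Callable, Dict, List, Tuple

Patient = Dict[str, str]
NamedFilter = Tuple[str, Callable[[Patient], bool]]


def funnel_counts(patients: List[Patient],
                  filters: List[NamedFilter]) -> List[Tuple[str, int]]:
    """Return the number of patients remaining after each successive filter."""
    counts = [("all patients", len(patients))]
    remaining = list(patients)
    for name, predicate in filters:
        remaining = [p for p in remaining if predicate(p)]
        counts.append((name, len(remaining)))
    return counts


# Hypothetical cohort of four patients.
patients = [
    {"cancer_site": "breast", "stage": "II", "gender": "female"},
    {"cancer_site": "breast", "stage": "III", "gender": "female"},
    {"cancer_site": "lung", "stage": "III", "gender": "male"},
    {"cancer_site": "breast", "stage": "III", "gender": "male"},
]

levels = funnel_counts(patients, [
    ("cancer_site == breast", lambda p: p["cancer_site"] == "breast"),
    ("stage == III", lambda p: p["stage"] == "III"),
    ("gender == female", lambda p: p["gender"] == "female"),
])
```

Each entry of `levels` would correspond to one successively narrower band of the funnel chart.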
[0958] With regard to each filtering method discussed above, the combination of factors may be based on Boolean-style combinations. Exemplary Boolean-style combinations may include, for filtering factors A and B, permitting the user to select whether to search for patients with “A AND B,” “A OR B,” “A AND NOT B,” “B AND NOT A,” etc.
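As a short illustration, if the patients matching factors A and B are held as sets of patient identifiers, the Boolean-style combinations map directly onto set operations. The identifiers below are hypothetical.

```python
# Hypothetical patient identifiers matching each filtering factor.
matches_a = {1, 2, 3, 4}
matches_b = {3, 4, 5}

a_and_b = matches_a & matches_b        # "A AND B"
a_or_b = matches_a | matches_b         # "A OR B"
a_and_not_b = matches_a - matches_b    # "A AND NOT B"
b_and_not_a = matches_b - matches_a    # "B AND NOT A"
```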
[0959] The final filtered cohort of interest may form the basis for further detailed analysis in the modules or other user interfaces described below. The population of interest is called a “cohort”. The user interface can provide fixed functional attribute selectors pre-populated appropriately based on the available data attributes in a Patient Data Store.
[0960] The display may further indicate a geographic location clustering plot of patients and/or demographic distribution comparisons with publicly reported statistics and/or privately curated statistics.
[0961] Patient Timeline Analysis Module
[0962] Additionally, the system may include a patient timeline analysis module that permits a user to review the sequence of events in the clinical life of each patient. It will be appreciated that this data may be anonymized, as discussed above, in order to protect confidentiality of the patient data.
[0963] Once a user has provided all of his or her desired filter criteria, e.g., via the cohort funnel & population analysis user interface, the system permits the user to analyze the filtered subset of patients. With respect to the user interface depicted in the figures, this procedure may be accomplished by selecting the “Analyze Cohort” option presented in the upper right-hand corner of the interface.
[0964] After requesting analysis of the filtered subset of patients, the user interface may generate a data summary window in the patient timeline analysis user interface, with one or more regions providing information about the selected patient subset, for example, a number of other distributions across clinical and molecular features. In one aspect, a first region may include demographic information such as an average patient age and/or a plot of patient ages. A second region may include additional demographic information, such as gender information, for the subset of patients. A third region may include a summary of certain clinical data, including, for example, an analysis of the medications taken by each of the patients in the subset. Similarly, a fourth region may include molecular data about each of the patients, for example, a breakdown of each genomic variant or alteration possessed by the patients in the subset.
[0965] The user interface also permits a user to query the data summary information presented in the data summary window or region in order to sort that data further, e.g., using a control panel. The system may be configured to sort the patient data based on one or more factors including, for example, gender, histology, menopausal status, response, smoking status, stage, and surgical procedures. Unlike the filtering discussed above, selecting one or more of these options may not reduce the sample size of patients summarized in the data summary window. Instead, the sort functions may subdivide the summarized information into one or more subcategories. For example, medication information may be sorted by having additional response data layered over it within the data summary window, along with a legend explaining the layered response data.
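The distinction between filtering and sorting can be sketched as a grouping step that subdivides the summary without discarding any patient. The record fields and response categories below are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def subdivide(records: List[Dict[str, str]],
              category_key: str,
              layer_key: str) -> Dict[str, Dict[str, int]]:
    """Count records per category, split by a second, layered attribute."""
    table: Dict[str, Counter] = defaultdict(Counter)
    for record in records:
        table[record[category_key]][record[layer_key]] += 1
    return {category: dict(layers) for category, layers in table.items()}


# Hypothetical medication records with a response attribute to layer on top.
records = [
    {"medication": "gemcitabine", "response": "complete"},
    {"medication": "gemcitabine", "response": "partial"},
    {"medication": "gemcitabine", "response": "partial"},
    {"medication": "cisplatin", "response": "none"},
]

by_medication = subdivide(records, "medication", "response")
total = sum(sum(layers.values()) for layers in by_medication.values())
```

The total across all subcategories equals the original number of records, reflecting that the sample size is unchanged.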
[0966] The subset of patients selected by the user also may be compared against a second subset (or “cohort”) of patients, e.g., via a drop-down menu, thereby facilitating a side-by-side analysis of the groups. Doing so may permit the user to quickly and easily see any similarities, as well as any noticeable differences, between the subsets.
[0967] In one embodiment, an event timeline Gantt-style chart is provided for a high-level overview, coupled with a tabular detail panel. The display may also enable the visualization and comparison of multiple patients concurrently on a normalized timeline, for the purposes of identifying both areas of overlap, and potential discontinuity across a patient subset.
[0968] Patient “Survival” Analysis Module
[0969] The system further may provide survival analysis for the subset of patients through use of the patient survival analysis user interface. This modeling and visualization component may enable the user to interactively explore time until event (and probability at time) curves and their confidence intervals, for sub-groups of the filtered cohort of interest. The time series inception and target events can be selected and dynamically modified by the user, along with attributes on which to cluster patient groups within the chosen population, all while the curve visualizer reactively adapts to the provided parameters.
[0970] In order to provide the user with flexibility to define the metes and bounds of that analysis, the system may permit the user to select one or both of the starting and ending events upon which that analysis is based. Exemplary starting events include an initial primary disease diagnosis, progression, metastasis, regression, identification of a first primary cancer, an initial prescription of medication, etc. Conversely, exemplary ending events may include progression, metastasis, recurrence, death, a period of time, and treatment start/end dates. Selecting a starting event sets an anchor point for all patients from which the curve begins, and selecting an end event sets a horizon for which the curve is predicting.
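One common way to produce a time-until-event curve from user-selected starting and ending events is the Kaplan-Meier estimator. The disclosure does not name a specific estimator, so its use here is an assumption, as are the timeline representation (event name mapped to a day number) and the function names. Confidence intervals are omitted from this sketch.

```python
from typing import Dict, List, Optional, Tuple


def time_to_event(timeline: Dict[str, int], start_event: str, end_event: str,
                  last_followup_day: int) -> Optional[Tuple[int, bool]]:
    """Return (duration, observed) anchored at the start event, or None if the
    patient never had the start event. Patients without the end event are
    censored at their last follow-up."""
    if start_event not in timeline:
        return None
    start = timeline[start_event]
    if end_event in timeline and timeline[end_event] >= start:
        return timeline[end_event] - start, True
    return last_followup_day - start, False


def kaplan_meier(durations: List[float],
                 observed: List[bool]) -> List[Tuple[float, float]]:
    """Kaplan-Meier survival estimate at each time with at least one event."""
    pairs = sorted(zip(durations, observed))
    at_risk = len(pairs)
    survival = 1.0
    curve: List[Tuple[float, float]] = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        events = 0
        removed = 0
        # Group all patients whose duration equals t.
        while i < len(pairs) and pairs[i][0] == t:
            events += 1 if pairs[i][1] else 0
            removed += 1
            i += 1
        if events:
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


# Four hypothetical patients; the second is censored (no end event observed).
curve = kaplan_meier([1, 2, 3, 4], [True, False, True, True])
```

Changing the starting event moves the anchor used by `time_to_event`, and changing the ending event changes which patients count as observed, so the curve can be recomputed whenever the user modifies either selection.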
[0971] The analysis may be presented to the user in the form of a plot of ending event, for example, progression free survival or overall survival, versus time. Progression for these purposes may reflect the occurrence of one or more progression events, for example, a metastases event, a recurrence, a specific measure of progression for a drug or independent of a drug, a certain tumor size or change in tumor size, or an enriched measurement (such as measurements which are indirectly extracted from the underlying clinical data set).
Exemplary enriched measurements may include detecting a stage change (such as by detecting a stage 2 categorization changed to stage 3), a regression, or via an inference (such as both stage 3 and metastases are inferred from detection of stages 2 and 4, but no detection of stage 3).
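The inference example above (stage 3 inferred from detections of stages 2 and 4) can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name `infer_stage_events`, the `(date, stage)` record layout, and the use of integer stages are all assumptions made for the example.

```python
from datetime import date

def infer_stage_events(observations):
    """Return stage-progression events, marking skipped stages as inferred.

    observations: list of (date, stage) tuples with integer stages
    (hypothetical layout chosen for this sketch).
    """
    events = []
    ordered = sorted(observations)
    for (_, prev_stage), (when, stage) in zip(ordered, ordered[1:]):
        if stage <= prev_stage:
            continue  # no progression between these two observations
        # Stages strictly between the two observations were never recorded,
        # so they are reported as inferred; `when` is only an upper bound
        # on the date at which the skipped stage was reached.
        for skipped in range(prev_stage + 1, stage):
            events.append({"stage": skipped, "date": when, "inferred": True})
        events.append({"stage": stage, "date": when, "inferred": False})
    return events

events = infer_stage_events([(date(2019, 1, 5), 2), (date(2020, 3, 1), 4)])
```

Here the stage 4 event is detected directly, while the stage 3 event is flagged as inferred so that downstream plots can distinguish the two.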
[0972] Additionally, the system may be configured to permit the user to focus or zoom in on a particular time span within the plot. In particular, the user may be able to zoom in the x-axis only, the y-axis only, or both the x- and y-axes at the same time. This functionality may be particularly useful depending on the type of disease being analyzed, as certain aggressive diseases may benefit from analyzing a smaller window of time than other diseases. For example, survival rates for patients with pancreatic cancer tend to be significantly lower than for other types of cancer; thus, when analyzing pancreatic cancer, it may be useful to the user to zoom in to a shorter time period, for example, going from about a 5-year window to about a 1-year window.
[0973] The user interface also may be configured to modify its display and present survival information of smaller groups within the subset by receiving user inputs corresponding to additional grouping or sorting criteria. Those criteria may be clinical or molecular factors, and the user interface may include a selector such as one or more drop down menus permitting the user to select, e.g., any beginning event or ending event, as well as gender, gene, histology, regimens, smoking status, stage, surgical procedures, etc.
[0974] Selecting one of the criteria then may present the user with a plurality of options relevant to that criterion. For example, selecting “regimens” may cause the system to use one or more value sets to populate a selectable field generated within the user interface to prompt the user to select one or more of the specific medication regimens undertaken by one or more of the patients within the subset. Thus, selecting the “Gemcitabine + Paclitaxel” option, followed by the “FOLFIRINOX” option, results in the system analyzing the patient subset data, determining which patients’ records include data corresponding to either of the selected regimens, recalculating the survival statistics for those separate groups of patients, and updating the user interface to include separate survival plots for each regimen. Adding a group/adding two or more selections may result in the system plotting them on the same chart to view them side by side, and the user interface may generate a legend with name, color, and sample size to distinguish each group.
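The regimen-based grouping described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the patient record layout, the function names `group_by_regimen` and `legend_entries`, and the color names are hypothetical and are not taken from the disclosure.

```python
def group_by_regimen(patients, selected_regimens):
    """Map each selected regimen to the patients whose records include it."""
    # Keys are created in selection order so plots and legend follow it.
    groups = {regimen: [] for regimen in selected_regimens}
    for patient in patients:
        for regimen in selected_regimens:
            if regimen in patient["regimens"]:
                groups[regimen].append(patient)
    return groups

def legend_entries(groups, palette=("blue", "orange", "green")):
    """Build legend rows with name, color, and sample size for each group."""
    return [
        {"name": name, "color": palette[i % len(palette)], "n": len(members)}
        for i, (name, members) in enumerate(groups.items())
    ]

# Made-up records for illustration only.
patients = [
    {"id": "p1", "regimens": {"FOLFIRINOX"}},
    {"id": "p2", "regimens": {"Gemcitabine + Paclitaxel"}},
    {"id": "p3", "regimens": {"FOLFIRINOX", "Gemcitabine + Paclitaxel"}},
    {"id": "p4", "regimens": set()},
]
groups = group_by_regimen(patients, ["Gemcitabine + Paclitaxel", "FOLFIRINOX"])
legend = legend_entries(groups)
```

A patient who received both regimens appears in both groups in this sketch; whether such patients should be counted once, twice, or excluded is a design choice the text leaves open.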
[0975] The system may permit a greater level of analysis by calculating and overlaying statistical ranges with respect to the survival analysis. In particular, the system may calculate confidence intervals with regard to each dataset requested by the user and display those confidence intervals relative to the survival plots. In one instance, the desired confidence interval may be user-established. In another instance, the confidence interval may be pre-established by the system and may be, for example, a 68% (one standard deviation) interval, a 95% (two standard deviations) interval, or a 99.7% (three standard deviations) interval. Confidence intervals may be calculated as Kaplan-Meier confidence intervals or using another type of statistical analysis, as would be appreciated by one of ordinary skill in the relevant art.
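As one possible illustration of the confidence-interval calculation, the sketch below computes a Kaplan-Meier survival estimate with a plain (linear) Greenwood-variance interval. It is a minimal sketch and not the disclosed implementation: the function name `kaplan_meier`, the input layout, and the choice of the linear interval (rather than, for example, a log-log transformed interval) are assumptions. The multiplier `z=1.96` corresponds to an approximately 95% interval.

```python
import math

def kaplan_meier(durations, events, z=1.96):
    """Kaplan-Meier estimate with Greenwood-based confidence bounds.

    durations: time from the starting event to the ending event or censoring.
    events: 1 if the ending event was observed, 0 if the patient was censored.
    Returns a list of (time, survival, lower, upper), one per event time.
    """
    data = sorted(zip(durations, events))
    at_risk = len(data)
    survival, greenwood_sum = 1.0, 0.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = removed = 0
        # Collect every patient whose duration equals the current time.
        while i < len(data) and data[i][0] == t:
            observed += data[i][1]
            removed += 1
            i += 1
        if observed:
            survival *= 1 - observed / at_risk
            if at_risk > observed:
                greenwood_sum += observed / (at_risk * (at_risk - observed))
            se = survival * math.sqrt(greenwood_sum)
            lower = max(0.0, survival - z * se)
            upper = min(1.0, survival + z * se)
            curve.append((t, survival, lower, upper))
        at_risk -= removed
    return curve

# Made-up durations for illustration only.
curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
```

A user-established interval would map to a different `z` value; calling the function once per patient group yields the separate curves that the interface overlays.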
[0976] As will be appreciated from the previous discussion, underpinning the utility of the system is the ability to highlight features and interaction pathways of high importance driving these predictions, and the ability to further pinpoint cohorts of patients exhibiting levels of response that significantly deviate from expected norms. In this context, high importance may be understood to be based upon feature importance to an outcome of a prediction. In particular, features that provide the greatest weight to the prediction may be designated as those of high importance. The present system and user interface provide an intuitive, efficient method for patient selection and cohort definition given specific inclusion and/or exclusion criteria. The system also provides a robust user interface to facilitate internal research and analysis, including research and analysis into the impact of specific clinical and/or molecular attributes, as well as drug dosages, combinations, and/or other treatment protocols on therapeutic outcomes and patient survival for potentially large, otherwise unwieldy patient sample sizes.
[0977] The modeling and visualization framework set forth herein may enable users to interactively explore auto-detected patterns in the clinical and genomic data of their filtered patient cohort, and to analyze the relationship of those patterns to therapeutic response and/or survival likelihood. That analysis may lead a user to more informed treatment decisions for patients, earlier in the cycle than may be the case without the present system and user interface. The analysis also may be useful in the context of clinical trials, providing robust, data-backed clinical trial inclusion and/or exclusion analysis. Backed by an extensive library of clinical and molecular data, the present system unifies and applies various algorithms and concepts relating to clinical analysis and machine learning to generate a fully integrated, interactive user interface.
[0978] User interfaces, such as those described herein, may also be provided for an interactive mobile device. Examples of mobile user interfaces are disclosed in US Patent Application 16/289,027, filed February 28, 2019, which is incorporated by reference for all purposes.
[0979] The present disclosure describes an application interface that physicians can reference easily through their mobile or tablet device. Through the application interface, reports a physician sees may be supplemented with aggregated data (such as de-identified data from other patient reports) to provide critical decision informing statistics or metrics right to their fingertips. While a mobile or tablet device is referenced herein throughout for the sake of simplicity and consistency, it will be appreciated that the device running the application interface may include any device, such as a personal computer or other hardware connected through a server hosting the application, or devices such as mobile cameras that permit the capturing of digital images for transfer to another hardware system connected through a server hosting the application.
[0980] An exemplary device may be any device capable of receiving user input and capturing data a physician may desire to compare against an exemplary cohort to generate treatment recommendations. An exemplary cohort may be a patient cohort, such as a group of patients with similarities; those similarities may include diagnoses, responses to treatment regimens, genetic profiles, and/or other medical, geographic, demographic, clinical, molecular, or genetic features.
[0981] Generating a report supplement may be performed by a physician by opening or starting up the application, following prompts provided by the application to capture or upload a report or EMR, and validating for accuracy any fields from the report that the application automatically populated. Once captured, the patient’s data may be uploaded to a server and analyzed in real time; furthermore, cohort statistics relating to the patient’s profile may be delivered to the application for the physician’s reference and review.
[0982] In one embodiment, a home screen of a mobile application is displayed on the mobile device. The home screen may provide access to patient records through a patient interface that may provide, for one or more patients, patient identification information, such as the patient’s name, diagnosis, and record identifiers, enabling a user (physician or staff) to identify a patient and to confirm that the record selected by the user and/or presented on the mobile device relates to the patient that the user wishes to analyze. In the event that a desired patient is not displayed on the patient interface, the home screen may also include a search indicator that, upon selection by the user, receives text input such as a patient’s name, unique identifier, or diagnosis, that permits the user to filter the patients by the search criteria of the text input to search for a specific patient. The mobile device may include a touch screen, through which a user may select a desired patient by touching the area on the application interface that includes the desired patient identification information. A cursor (not shown) may appear on the screen where the user touches to emphasize touch or gestures received.
[0983] Alternative home screens may be implemented that provide a user with options to perform other functions, as well as to access a patient identification information screen, and it will be appreciated that exemplary embodiments referenced herein are not intended to limit the interface of the application in function or design.
[0984] A user may add a new patient by selecting a corresponding “add patient” icon on the application or through gesture recognition. Exemplary gestures may include swiping across the screen of the mobile device to the left or right, using several fingers to scroll or swipe, tapping or holding down on a portion of the patient interface not occupied by patient identification, or any other designated gesture. Alternatively, if no patient data is present in the patient interface, the interface may default to adding a user once active. Adding a patient may either be performed manually, by entering patient information into the application, or automatically by uploading patient data into the application. Furthermore, automatic uploading may be implemented by capturing an image of patient data at the mobile device, such as from a report.
[0985] Once a user has selected a patient, completed adding a new patient, or is adding a new patient from a report, an electronic document capture screen may appear. The system may be configured to capture images of documents that are saved in a plurality of different formats. Exemplary electronic document captures may include a structured data form (such as JSON, XML, HTML, etc.), an image (such as JPEG, PNG, etc.), a PDF of a document, report, or file, or a typeface or handwritten copy of a document, report, or file.
[0986] In order to electronically capture a physical copy of a document, the user may place the document on a surface, such as a surface that provides a contrasting color or texture to the document, and aim the mobile device’s camera at the document so that an image of the document appears in the document capture screen. The user then may select a document capture icon to begin a document capture process. In an alternative embodiment, an automatic capture may be generated once capture criteria are met. Exemplary capture criteria may include that the document bounds are identifiable and/or that the document is in focus.
[0987] In one embodiment, the document may be sent to an optical character recognition (OCR) process and a document classifier may process the OCR output of the electronic document capture to recognize document identifiers which are linked to features of the document stored in a predefined model for each document. Predefined models may also be referred to as predetermined models. Document identifiers may include Form numbers (such as Form CA217b, Patient Report Rev .17, AB12937, etc.) indicating a specific version of a document which provides key health information in each of the respective document’s features. Features of a document may include headers, columns, tables, graphs, and other standard forms which appear in the document.
[0988] As a result of the OCR output, the application also may identify medical data present in the document. Medical data, or key health information, may include numerous fields including, but not limited to, patient demographics (such as patient name, date of birth, gender, ethnicity, date of death, address, smoking status, diagnosis dates, personal medical history, or family medical history), clinical diagnoses (such as date of initial diagnosis, date of metastatic diagnosis, cancer staging, tumor characterization, or tissue of origin), treatments and outcomes (such as therapy groups, medications, surgeries, radiotherapy, imaging, adverse effects, associated outcomes, or corresponding dates), and genetic testing and laboratory information (such as genetic testing, performance scores, lab tests, pathology results, prognostic indicators, or corresponding dates).
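The lookup from a recognized document identifier to its predefined model can be sketched as follows. This is a hedged illustration only: the `PREDEFINED_MODELS` table, its feature lists, and the function name `classify_document` are hypothetical, and the OCR step itself is assumed to have already produced plain text.

```python
import re

# Hypothetical predefined models keyed by document identifier. The feature
# lists are placeholders, not actual form layouts.
PREDEFINED_MODELS = {
    "CA217b": {"features": ["header", "patient demographics table"]},
    "AB12937": {"features": ["header", "genetic testing table"]},
}

def classify_document(ocr_text):
    """Return (identifier, model) for the first known identifier found."""
    for identifier, model in PREDEFINED_MODELS.items():
        # Word boundaries avoid matching an identifier inside a longer token.
        if re.search(r"\b" + re.escape(identifier) + r"\b", ocr_text):
            return identifier, model
    return None, None

identifier, model = classify_document("Form CA217b\nPatient Name: ...")
```

Once a model is selected, its feature list would tell a downstream extractor where in the document to look for each field.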
[0989] Each of the fields, for example the address, cancer staging, medications, or genetic testing, may also have a plurality of subfields. The address field may have subfields for type of use (personal or business), street, city, state, zip, country, and a start or end date (date that residency at the address begins or expires). Genetic testing may have subfields for the date of genetic testing, testing provider used, test method, such as genetic sequencing method or gene panel, gene results, such as included genes, variants, expression levels/statuses, tumor mutational burden, and microsatellite instability. One type of genetic testing may be next-generation sequencing (NGS). The above-provided examples, enumerations, and lists are not intended to limit the scope of the available fields and are intended to be representative of the nature and structure that fields may take.
[0990] Thereafter one or more pages may be displayed within the application tabulating the captured information and/or presenting the captured information within the context of the analytics and/or reporting as described herein with respect to one or more embodiments.
[0991] Stand-alone Device Integration
[0992] Hardware devices incorporating one or more embodiments as described herein may be implemented. In one example, a hardware device may record progress notes or other documents, automatically converting recorded audio into features and storing them in a structured format with respect to a patient. In another example, a hardware device may broadcast a response containing one or more analytical results, patient features, or reports as described in any of the embodiments above.
[0993] It has been recognized that a relatively small and portable voice activated and audio responding interface device (hereinafter “collaboration device”) can be provided enabling oncologists to conduct at least initial database access and manipulation activities. In at least some embodiments, a collaboration device includes a processor linked to each of a microphone, a speaker and a wireless transceiver (e.g., transmitter and receiver). The processor runs software for capturing voice signals generated by an oncologist. An automated speech recognition (ASR) system converts the voice signals to a text file which is then processed by a natural language processor (NLP) or other artificial intelligence module (e.g., a natural language understanding module) to generate a data operation (e.g., commands to perform some data access or manipulation process such as a query, a filter, a memorialization, a clearing of prior queries and filter results, a note, etc.).
[0994] In at least some embodiments the collaboration device is used within a collaboration system that includes a server that maintains and manipulates an industry specific data repository. The data operation is received by the collaboration server and used to access and/or manipulate the database data, thereby generating a data response. In at least some cases, the data response is returned to the collaboration device as an audio file which is broadcast to the oncologist as a result associated with the original query.
[0995] In some cases, the voice signal to text file transcription is performed by the collaboration device processor while in other cases the voice signal is transmitted from the collaboration device to the collaboration server and the collaboration server does the transcription to a text file. In some cases, the text file is converted to a data operation by the collaboration device processor and in other cases that conversion is performed by the collaboration server. In some cases, the collaboration server maintains or has access to the industry specific database so that the server operates as an intermediary between the collaboration device and the industry specific database.
[0996] In at least some embodiments the collaboration device is a dedicated collaboration device that is provided solely as an interface to the collaboration server and industry specific database. In these cases, the collaboration interface device may be on all the time and may only run a single dedicated application program so that the device does not require any boot up time and can be activated essentially immediately via a single activation activity performed by an oncologist.
[0997] For instance, in some cases the collaboration device may have motion sensors (e.g., an accelerometer, a gyroscope, etc.) linked to the processor so that the simple act of picking up the device causes the processor to activate an application. In other cases, the collaboration device processor may be programmed to “listen” for the phrase “Hey query” and once received, activate to capture a next voice signal utterance that operates as seed data for generating the text file. In other cases, the processor may be programmed to listen for a different activation phrase, such as a brand name of the system or a combination of a brand name plus a command indication. For instance, if the brand name of the system is “One” then the activation phrase may be “One” or “Go One” or the like. In still other cases the collaboration device may simply listen for voice signal utterances that it can recognize as oncological queries and may then automatically use any recognized query as seed data for text generation.
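The activation-phrase behavior can be sketched on already-transcribed text as follows. This is a simplified illustration under stated assumptions: an actual device would typically detect the phrase in the audio signal itself, and the function name `split_activation` and the phrase table are hypothetical (the phrases are the examples given in the text).

```python
import re

# Example activation phrases taken from the text above.
ACTIVATION_PHRASES = ("hey query", "go one", "one")

def split_activation(transcript):
    """Return the utterance that follows an activation phrase, else None."""
    # Lowercase, drop punctuation, and collapse whitespace before matching.
    cleaned = re.sub(r"[^\w\s]", "", transcript.lower())
    normalized = " ".join(cleaned.split())
    # Try longer phrases first so "go one" is not mistaken for "one".
    for phrase in sorted(ACTIVATION_PHRASES, key=len, reverse=True):
        if normalized == phrase:
            return ""
        if normalized.startswith(phrase + " "):
            return normalized[len(phrase) + 1:]
    return None

seed = split_activation("Hey query, list my patients")
```

The returned string plays the role of the seed data for the text file; a return value of `None` means the device stays idle.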
[0998] In addition to providing audio responses to data operations, in at least some cases the system automatically records and stores data operations (e.g., data defining the operations) and responses as a collaboration record for subsequent access. The collaboration record may include one or the other or both of the original voice signal and broadcast response or the text file and a text response corresponding to the data response. Here, the stored collaboration record provides details regarding the oncologist’s search and data operation activities that help automatically memorialize the hypothesis or idea the oncologist was considering. In a case where an oncologist asks a series of queries, those queries and data responses may be stored as a single line of questioning so that they together provide more detail for characterizing the oncologist’s initial hypothesis or idea. At a subsequent time, the system may enable the oncologist to access the memorialized queries and data responses so that she can re-enter a flow state associated therewith and continue hypothesis testing and data manipulation using a workstation type interface or other computer device that includes a display screen and perhaps audio devices like speakers, a microphone, etc., more suitable for presenting more complex data sets and data representations.
[0999] In addition to simple data search queries, other voice signal data operation types are contemplated. For instance, the system may support filter operations where an oncologist voice signal message defines a sub-set of the industry specific database set. For example, the oncologist may voice the message “Access all medical records for male patients over 45 years of age that have had pancreatic cancer since 1990,” causing the system to generate an associated subset of data that meet the specified criteria.
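The example filter operation can be sketched as a structured object applied to a set of records. This is a minimal sketch, assuming the NLP step has already converted the voiced message into structured criteria; the `FilterOperation` class, the record layout, and the reading of “since 1990” as a diagnosis year of 1990 or later are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class FilterOperation:
    """Structured form of a voiced filter request (hypothetical layout)."""
    sex: str
    older_than: int
    diagnosis: str
    since_year: int

    def matches(self, record):
        return (
            record["sex"] == self.sex
            and record["age"] > self.older_than
            and any(
                d["name"] == self.diagnosis and d["year"] >= self.since_year
                for d in record["diagnoses"]
            )
        )

def apply_filter(operation, records):
    """Return the subset of records that meet every criterion."""
    return [r for r in records if operation.matches(r)]

# Made-up records for illustration only.
records = [
    {"id": 1, "sex": "male", "age": 60,
     "diagnoses": [{"name": "pancreatic cancer", "year": 1995}]},
    {"id": 2, "sex": "male", "age": 40,
     "diagnoses": [{"name": "pancreatic cancer", "year": 1995}]},
    {"id": 3, "sex": "female", "age": 60,
     "diagnoses": [{"name": "pancreatic cancer", "year": 1995}]},
    {"id": 4, "sex": "male", "age": 70,
     "diagnoses": [{"name": "pancreatic cancer", "year": 1985}]},
]
operation = FilterOperation("male", 45, "pancreatic cancer", 1990)
subset = apply_filter(operation, records)
```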
[1000] Importantly, some data responses to oncological queries will be “audio suitable” meaning that the response can be well understood and comprehended when broadcast as an audio message. In other cases, a data response simply may not be well suited to be presented as an audio output. For instance, where a query includes the phrase “Who is the patient that I saw during my last office visit last Thursday?”, an audio suitable response may be “Mary Brown.” On the other hand, if a query is “List all the medications that have been prescribed for males over 45 years of age that have had pancreatic cancer since 1978” and the response includes a list of 225 medications, the list would not be audio suitable as it would take a long time to broadcast each list entry and comprehension of all list entries would be dubious at best.
[1001] In cases where a data response is optimally visually presented, the system may take alternate or additional steps to provide the response in an intelligible format to the user. The system may simply indicate as part of an audio response that response data would be more suitably presented in visual format and then present the audio response. If there is a proximate large display screen, such as a computer monitor or a television (TV) such as a smart TV, the system may pair with that display and present visual data with or without audio data. The system may simply indicate that no suitable audio response is available. In some embodiments, the system may pair with a computational device that includes a display, such as a smartphone, tablet computer, etc.
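One way to decide between audio and visual presentation is a simple size heuristic, sketched below. The thresholds `max_items` and `max_words`, the function names, and the three presentation labels are hypothetical; the disclosure does not specify how audio suitability is determined.

```python
def is_audio_suitable(response_items, max_items=5, max_words=40):
    """Heuristic: short responses are read aloud, long lists are not."""
    total_words = sum(len(str(item).split()) for item in response_items)
    return len(response_items) <= max_items and total_words <= max_words

def choose_presentation(response_items, display_available):
    """Pick how to deliver a data response to the user."""
    if is_audio_suitable(response_items):
        return "audio"
    # With a paired display the data is shown visually; otherwise the
    # device only announces that the response is better viewed on a screen.
    return "visual" if display_available else "audio_notice"

short_answer = ["Mary Brown"]
long_answer = [f"medication {i}" for i in range(225)]
```

With these thresholds the single-name answer is broadcast as audio, while the 225-entry medication list is routed to a display when one is available.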
[1002] Thus, at least some inventive embodiments enable intuitive and rapid access to complex data sets essentially anywhere within a wireless communication zone so that an oncologist can initiate thought processes in real time when they occur. By answering questions when they occur, the system enables oncologists to dig deeper in the moment into data and continue the thought process through a progression of queries. Some embodiments memorialize an oncologist’s queries and responses so that at subsequent times the oncologist can re-access that information and continue queries related thereto. In cases where visual and audio responses are available, the system may adapt to provide visual responses when visual capabilities are present or may simply store the visual responses as part of a collaboration record for subsequent access when an oncologist has access to a workstation or the like.
[1003] In at least some embodiments the disclosure includes a method for interacting with a database to access data therein, the method for use with a collaboration device including a speaker, a microphone and a processor, the method comprising the steps of associating separate sets of state-specific intents and supporting information with different clinical report types, the supporting information including at least one intent-specific data operation for each state-specific intent, receiving a voice query via the microphone seeking information, identifying a specific patient associated with the query, identifying a state-specific clinical report associated with the identified patient, attempting to select one of the state-specific intents associated with the identified state-specific clinical report as a match for the query, upon selection of one of the state-specific intents, performing the at least one data operation associated with the selected state-specific intent to generate a result, using the result to form a query response and broadcasting the query response via the speaker.
[1004] In some cases the method is for use with at least a first database that includes information in addition to the clinical reports, the method further including, in response to the query, obtaining at least a subset of the information in addition to the clinical reports, the step of using the result to form a query response including using the result and the additional obtained information to form the query response.
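The intent-selection steps of this method can be sketched as a lookup from report type to intents, each carrying its own data operation. This is an illustrative sketch only: keyword matching stands in for whatever intent-matching technique is actually used, patient identification and speech handling are assumed to happen upstream, and every name, keyword set, and report value below is hypothetical.

```python
def variants_operation(report):
    """Data operation: list the variants shown on the report."""
    return ", ".join(report["variants"]) or "none reported"

def tmb_operation(report):
    """Data operation: read the tumor mutational burden from the report."""
    return str(report["tmb"])

# Separate intent sets per clinical report type (hypothetical content).
INTENTS_BY_REPORT_TYPE = {
    "molecular": {
        "list_variants": {"keywords": {"variants", "mutations"},
                          "operation": variants_operation},
        "tumor_mutational_burden": {"keywords": {"tmb", "burden"},
                                    "operation": tmb_operation},
    },
}

def answer_query(query_text, patient_name, reports):
    """Select an intent for the patient's report type and run its operation."""
    report = reports.get(patient_name)
    if report is None:
        return None
    intents = INTENTS_BY_REPORT_TYPE.get(report["type"], {})
    words = set(query_text.lower().replace("?", "").split())
    for intent in intents.values():
        if words & intent["keywords"]:
            return f"{patient_name}: {intent['operation'](report)}"
    return None  # no state-specific intent matched the query

# Made-up report values for illustration only.
reports = {"Dwayne Holder": {"type": "molecular", "variants": ["KMT2C"], "tmb": 4}}
response = answer_query("What mutations were found?", "Dwayne Holder", reports)
```

The returned string corresponds to the query response that would then be converted to audio and broadcast via the speaker.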
[1005] In some cases, the at least one data operation includes at least one data operation for accessing additional information from the database, the step of obtaining at least a subset includes obtaining data per the at least one data operation for accessing additional information from the database.
[1006] Some embodiments include a method for interacting with a database to access data therein, the method for use with a collaboration device including a speaker, a microphone and a processor, the method comprising the steps of associating separate sets of state-specific intents and supporting information with different clinical report types, the supporting information including at least one intent-specific primary data operation for each state-specific intent, receiving a voice query via the microphone seeking information, identifying a specific patient associated with the query, identifying a state-specific clinical report associated with the identified patient, attempting to select one of the state-specific intents associated with the identified state-specific clinical report as a match for the query, upon selection of one of the state-specific intents, performing the primary data operation associated with the selected state-specific intent to generate a result, performing a supplemental data operation on data from a database that includes data in addition to the clinical report data to generate additional information, using the result and the additional information to form a query response and broadcasting the query response via the speaker.
[1007] Some embodiments include a method of audibly broadcasting responses to a user based on user queries about a specific patient molecular report, the method comprising receiving an audible query from the user to a microphone coupled to a collaboration device, identifying at least one intent associated with the audible query, identifying at least one data operation associated with the at least one intent, associating each of the at least one data operations with a first set of data presented on the molecular report, executing each of the at least one data operations on a second set of data to generate response data, generating an audible response file associated with the response data and providing the audible response file for broadcasting via a speaker coupled to the collaboration device.
[1008] In at least some cases the audible query includes a question about a nucleotide profile associated with the patient. In at least some cases the nucleotide profile associated with the patient is a profile of the patient’s cancer. In at least some cases the nucleotide profile associated with the patient is a profile of the patient’s germline. In at least some cases the nucleotide profile is a DNA profile. In at least some cases the nucleotide profile is an RNA expression profile. In at least some cases the nucleotide profile is a mutation biomarker.
[1009] In at least some cases the mutation biomarker is a BRCA biomarker. In at least some cases the audible query includes a question about a therapy. In at least some cases the audible query includes a question about a gene. In at least some cases the audible query includes a question about clinical data. In at least some cases the audible query includes a question about a next-generation sequencing panel. In at least some cases the audible query includes a question about a biomarker.
[1010] In at least some cases the audible query includes a question about an immune biomarker. In at least some cases the audible query includes a question about an antibody-based test. In at least some cases the audible query includes a question about a clinical trial. In at least some cases the audible query includes a question about an organoid assay. In at least some cases the audible query includes a question about a pathology image. In at least some cases the audible query includes a question about a disease type. In at least some cases the at least one intent is an intent related to a biomarker. In at least some cases the biomarker is a BRCA biomarker. In at least some cases the at least one intent is an intent related to a clinical condition. In at least some cases the at least one intent is an intent related to a clinical trial.
[1011] In at least some cases the at least one intent is related to a drug. In at least some cases the drug intent is an intent related to chemotherapy. In at least some cases the drug intent is an intent related to a PARP inhibitor. In at least some cases the at least one intent is related to a gene. In at least some cases the at least one intent is related to immunology. In at least some cases the at least one intent is related to a knowledge database. In at least some cases the at least one intent is related to testing methods. In at least some cases the at least one intent is related to a gene panel. In at least some cases the at least one intent is related to a report. In at least some cases the at least one intent is related to an organoid process. In at least some cases the at least one intent is related to imaging.
[1012] In at least some cases the at least one intent is related to a pathogen. In at least some cases the at least one intent is related to a vaccine. In at least some cases the at least one data operation includes an operation to identify at least one treatment option. In at least some cases the at least one data operation includes an operation to identify knowledge about a therapy. In at least some cases the at least one data operation includes an operation to identify knowledge related to at least one drug (e.g., “What drugs are associated with high CD40 expression?”). In at least some cases the at least one data operation includes an operation to identify knowledge related to mutation testing (e.g., “Was Dwayne Holder’s sample tested for a KMT2D mutation?”). In at least some cases the at least one data operation includes an operation to identify knowledge related to mutation presence (e.g., “Does Dwayne Holder have a KMT2C mutation?”). In at least some cases the at least one data operation includes an operation to identify knowledge related to tumor characterization (e.g., “Could Dwayne Holder’s tumor be a BRCA2 driven tumor?”). In at least some cases the at least one data operation includes an operation to identify knowledge related to testing requirements (e.g., “What tumor percentage does Tempus require for TMB results?”). In at least some cases the at least one data operation includes an operation to query for definition information (e.g., “What is PDL1 expression?”). In at least some cases the at least one data operation includes an operation to query for expert information (e.g., “What is the clinical relevance of PDL1 expression?”; “What are the common risks associated with the Whipple procedure?”). In at least some cases the at least one data operation includes an operation to identify information related to recommended therapy (e.g., “Dwayne Holder is in the 88th percentile of PDL1 expression, is he a candidate for immunotherapy?”). In at least some cases the at least one data operation includes an operation to query for information relating to a patient (e.g., Dwayne Holder). In at least some cases the at least one data operation includes an operation to query for information relating to patients with one or more clinical characteristics similar to the patient (e.g., “What are the most common adverse events for patients similar to Dwayne Holder?”).
[1013] In at least some cases the at least one data operation includes an operation to query for information relating to patient cohorts (e.g., “What are the most common adverse events for pancreatic cancer patients?”). In at least some cases the at least one data operation includes an operation to query for information relating to clinical trials (e.g., “Which clinical trials is Dwayne the best match for?”).
[1014] In at least some cases the at least one data operation includes an operation to query about a characteristic relating to a genomic mutation. In at least some cases the characteristic is loss of heterozygosity. In at least some cases the characteristic reflects the source of the mutation. In at least some cases the source is germline. In at least some cases the source is somatic. In at least some cases the characteristic includes whether the mutation is a tumor driver. In at least some cases the first set of data comprises a patient name.
[1015] In at least some cases the first set of data comprises a patient age. In at least some cases the first set of data comprises a next-generation sequencing panel. In at least some cases the first set of data comprises a genomic variant. In at least some cases the first set of data comprises a somatic genomic variant. In at least some cases the first set of data comprises a germline genomic variant. In at least some cases the first set of data comprises a clinically actionable genomic variant. In at least some cases the first set of data comprises a loss of function variant. In at least some cases the first set of data comprises a gain of function variant.
[1016] In at least some cases the first set of data comprises an immunology marker. In at least some cases the first set of data comprises a tumor mutational burden. In at least some cases the first set of data comprises a microsatellite instability status. In at least some cases the first set of data comprises a diagnosis. In at least some cases the first set of data comprises a therapy. In at least some cases the first set of data comprises a therapy approved by the U.S. Food and Drug Administration. In at least some cases the first set of data comprises a drug therapy. In at least some cases the first set of data comprises a radiation therapy. In at least some cases the first set of data comprises a chemotherapy. In at least some cases the first set of data comprises a cancer vaccine therapy. In at least some cases the first set of data comprises an oncolytic virus therapy.
[1017] In at least some cases the first set of data comprises an immunotherapy. In at least some cases the first set of data comprises a pembrolizumab therapy. In at least some cases the first set of data comprises a CAR-T therapy. In at least some cases the first set of data comprises a proton therapy. In at least some cases the first set of data comprises an ultrasound therapy. In at least some cases the first set of data comprises a surgery. In at least some cases the first set of data comprises a hormone therapy. In at least some cases the first set of data comprises an off-label therapy. In at least some cases, the first set of data comprises a gene editing therapy. In at least some cases, the gene editing therapy is clustered regularly interspaced short palindromic repeats (CRISPR) therapy.
[1018] In at least some cases the first set of data comprises an on-label therapy. In at least some cases the first set of data comprises a bone marrow transplant event. In at least some cases the first set of data comprises a cryoablation event. In at least some cases the first set of data comprises a radiofrequency ablation. In at least some cases the first set of data comprises a monoclonal antibody therapy. In at least some cases the first set of data comprises an angiogenesis inhibitor. In at least some cases the first set of data comprises a PARP inhibitor.
[1019] In at least some cases the first set of data comprises a targeted therapy. In at least some cases the first set of data comprises an indication of use. In at least some cases the first set of data comprises a clinical trial. In at least some cases the first set of data comprises a distance to a location conducting a clinical trial. In at least some cases the first set of data comprises a variant of unknown significance. In at least some cases the first set of data comprises a mutation effect.
[1020] In at least some cases the first set of data comprises a variant allele fraction. In at least some cases the first set of data comprises a low coverage region. In at least some cases the first set of data comprises a clinical history. In at least some cases the first set of data comprises a biopsy result. In at least some cases the first set of data comprises an imaging result. In at least some cases the first set of data comprises an MRI result.
[1021] In at least some cases the first set of data comprises a CT result. In at least some cases the first set of data comprises a therapy prescription. In at least some cases the first set of data comprises a therapy administration. In at least some cases the first set of data comprises a cancer subtype diagnosis. In at least some cases the first set of data comprises a cancer subtype diagnosis by RNA class. In at least some cases the first set of data comprises a result of a therapy applied to an organoid grown from the patient’s cells. In at least some cases the first set of data comprises a tumor quality measure. In at least some cases the first set of data comprises a tumor quality measure selected from at least one of the set of PD-L1, MMR, tumor infiltrating lymphocyte count, and tumor ploidy. In at least some cases the first set of data comprises a tumor quality measure derived from an image analysis of a pathology slide of the patient’s tumor. In at least some cases the first set of data comprises a signaling pathway associated with a tumor of the patient.
[1022] In at least some cases the signaling pathway is a HER pathway. In at least some cases the signaling pathway is a MAPK pathway. In at least some cases the signaling pathway is a MDM2-TP53 pathway. In at least some cases the signaling pathway is a PI3K pathway. In at least some cases the signaling pathway is a mTOR pathway.
[1023] In at least some cases the at least one data operation includes an operation to query for a treatment option, the first set of data comprises a genomic variant, and the associating step comprises adjusting the operation to query for the treatment option based on the genomic variant. In at least some cases the at least one data operation includes an operation to query for a clinical history data element, the first set of data comprises a therapy, and the associating step comprises adjusting the operation to query for the clinical history data element based on the therapy. In at least some cases the clinical history data element is medication prescriptions, the therapy is pembrolizumab, and the associating step comprises adjusting the operation to query for the prescription of pembrolizumab.
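The associating step described above can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the `QueryOperation` class, the `associate` function, and the field names `target`, `filters`, `genomic_variant`, `therapy`, and `drug_name` are all hypothetical names introduced only for illustration. The sketch assumes a data operation can be represented as a target record type plus a set of filters, and that "adjusting" the operation means adding a filter taken from the data presented on the report.

```python
from dataclasses import dataclass, field


@dataclass
class QueryOperation:
    """Hypothetical data operation: a target record type plus filters."""
    target: str
    filters: dict = field(default_factory=dict)


def associate(operation, report_data):
    """Return a copy of the operation narrowed by data shown on the report."""
    filters = dict(operation.filters)
    # A treatment-option query is adjusted based on the reported genomic variant.
    if operation.target == "treatment_options" and "genomic_variant" in report_data:
        filters["genomic_variant"] = report_data["genomic_variant"]
    # A prescription query is adjusted based on the reported therapy.
    if operation.target == "medication_prescriptions" and "therapy" in report_data:
        filters["drug_name"] = report_data["therapy"]
    return QueryOperation(operation.target, filters)


adjusted = associate(
    QueryOperation("medication_prescriptions"), {"therapy": "pembrolizumab"}
)
print(adjusted)
```

Under these assumptions, a generic prescription query becomes a query for prescriptions of pembrolizumab, mirroring the last example in the paragraph above.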
[1024] In at least some cases the second set of data comprises clinical health information. In at least some cases the second set of data comprises genomic variant information. In at least some cases the second set of data comprises DNA sequencing information. In at least some cases the second set of data comprises RNA information. In at least some cases the second set of data comprises DNA sequencing information from short-read sequencing. In at least some cases the second set of data comprises DNA sequencing information from long-read sequencing. In at least some cases the second set of data comprises RNA transcriptome information. In at least some cases the second set of data comprises RNA full-transcriptome information. In at least some cases the second set of data is stored in a single data repository. In at least some cases the second set of data is stored in a plurality of data repositories.
[1025] In at least some cases the second set of data comprises clinical health information and genomic variant information. In at least some cases the second set of data comprises immunology marker information. In at least some cases the second set of data comprises microsatellite instability immunology marker information. In at least some cases the second set of data comprises tumor mutational burden immunology marker information. In at least some cases the second set of data comprises clinical health information comprising one or more of demographic information, diagnostic information, assessment results, laboratory results, prescribed or administered therapies, and outcomes information.
[1026] In at least some cases the second set of data comprises demographic information comprising one or more of patient age, patient date of birth, gender, race, ethnicity, institution of care, comorbidities, and smoking history. In at least some cases the second set of data comprises diagnosis information comprising one or more of tissue of origin, date of initial diagnosis, histology, histology grade, metastatic diagnosis, date of metastatic diagnosis, site or sites of metastasis, and staging information. In at least some cases the second set of data comprises staging information comprising one or more of TNM, ISS, DSS, FAB, RAI, and Binet. In at least some cases the second set of data comprises assessment information comprising one or more of performance status (including ECOG or Karnofsky status), performance status score, and date of performance status.
[1027] In at least some cases the second set of data comprises laboratory information comprising one or more of type of lab (e.g., CBC, CMP, PSA, CEA), lab results, lab units, date of lab service, date of molecular pathology test, assay type, assay result (e.g., positive, negative, equivocal, mutated, wild type), molecular pathology method (e.g., IHC, FISH, NGS), and molecular pathology provider. In at least some cases the second set of data comprises treatment information comprising one or more of drug name, drug start date, drug end date, drug dosage, drug units, drug number of cycles, surgical procedure type, date of surgical procedure, radiation site, radiation modality, radiation start date, radiation end date, radiation total dose delivered, and radiation total fractions delivered.
[1028] In at least some cases the second set of data comprises outcomes information comprising one or more of response to therapy (e.g., CR, PR, SD, PD), RECIST score, date of outcome, date of observation, date of progression, date of recurrence, adverse event to therapy, adverse event date of presentation, adverse event grade, date of death, date of last follow-up, and disease status at last follow-up. In at least some cases the second set of data comprises information that has been de-identified in accordance with a de-identification method permitted by HIPAA.
[1029] In at least some cases the second set of data comprises information that has been de-identified in accordance with a safe harbor de-identification method permitted by HIPAA. In at least some cases the second set of data comprises information that has been de-identified in accordance with a statistical de-identification method permitted by HIPAA. In at least some cases the second set of data comprises clinical health information of patients diagnosed with a cancer condition.
[1030] In at least some cases the second set of data comprises clinical health information of patients diagnosed with a cardiovascular condition. In at least some cases the second set of data comprises clinical health information of patients diagnosed with a diabetes condition. In at least some cases the second set of data comprises clinical health information of patients diagnosed with an autoimmune condition. In at least some cases the second set of data comprises clinical health information of patients diagnosed with a lupus condition.
[1031] In at least some cases the second set of data comprises clinical health information of patients diagnosed with a psoriasis condition. In at least some cases the second set of data comprises clinical health information of patients diagnosed with a depression condition. In at least some cases the second set of data comprises clinical health information of patients diagnosed with a rare disease.
[1032] In at least some embodiments, a method of audibly broadcasting responses to a user based on user queries about a specific patient’s molecular report is provided by the disclosure. The method can be used with a collaboration device that includes a processor and a microphone and a speaker linked to the processor. The method can include storing molecular reports for a plurality of patients in a system database, receiving an audible query from the user via the microphone, identifying at least one intent associated with the audible query, identifying at least one data operation associated with the at least one intent, accessing the specific patient’s molecular report, executing at least one of the identified at least one data operations on a first set of data included in the specific patient’s molecular report to generate a first set of response data, using the first set of response data to generate an audible response file, and broadcasting the audible response file via the speaker.
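The sequence of steps in the method above (query, intent, data operation, response) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the disclosed system: speech recognition and audio synthesis are out of scope, so the query is assumed to be already transcribed to text and the response is returned as text rather than as an audio file. The keyword table `INTENT_KEYWORDS`, the operation functions, the report layout (`variants`, `therapies`), and the patient identifier `p1` are all hypothetical.

```python
# Hypothetical keyword-to-intent table; a real system would likely use a
# trained language-understanding model rather than substring matching.
INTENT_KEYWORDS = {
    "mutation": "mutation_presence",
    "therapy": "recommended_therapy",
}


def mutation_presence(report, query):
    """Data operation: list reported genes that are named in the query."""
    found = [g for g in report["variants"] if g.lower() in query.lower()]
    return {"found": found}


def recommended_therapy(report, query):
    """Data operation: list therapies recorded on the report."""
    return {"therapies": report.get("therapies", [])}


# Each intent maps to one data operation executed on the molecular report.
OPERATIONS = {
    "mutation_presence": mutation_presence,
    "recommended_therapy": recommended_therapy,
}


def identify_intents(query):
    """Return every intent whose keyword appears in the query text."""
    lowered = query.lower()
    return [intent for kw, intent in INTENT_KEYWORDS.items() if kw in lowered]


def answer(query, patient_id, reports):
    """Run the pipeline and return response text for later speech synthesis."""
    report = reports[patient_id]
    results = [OPERATIONS[i](report, query) for i in identify_intents(query)]
    if not results:
        return "Sorry, I could not interpret that question."
    return "; ".join(
        f"{key}: {', '.join(values) or 'none'}"
        for result in results
        for key, values in result.items()
    )


reports = {"p1": {"variants": ["KMT2C"], "therapies": ["pembrolizumab"]}}
print(answer("Does the patient have a KMT2C mutation?", "p1", reports))
```

The returned string stands in for the first set of response data; converting it to an audible response file and broadcasting it via the speaker would be handled by whatever text-to-speech component the collaboration device provides.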
[1033] In at least some cases the method can further include identifying qualifying parameters in the audible query, the step of identifying at least one data operation including identifying the at least one data operation based on both the identified intent and the qualifying parameters.
[1034] In at least some cases at least one of the qualifying parameters includes a patient identity.
[1035] In at least some cases at least one of the qualifying parameters includes a patient’s disease state.
[1036] In at least some cases at least one of the qualifying parameters includes a genetic mutation.
[1037] In at least some cases at least one of the qualifying parameters includes a procedure type.
[1038] In at least some cases the method further includes identifying qualifying parameters in the specific patient’s molecular report, the step of identifying at least one data operation including identifying the at least one data operation based on both the identified intent and the qualifying parameters.
[1039] In at least some cases the method further includes the step of storing a general knowledge database that includes non-patient specific data about specific topics, wherein the step of identifying at least one data operation associated with the at least one intent includes identifying at least first and second data operations associated with the at least one intent, the first data operation associated with the specific patient’s molecular report and the second data operation associated with the general knowledge database.
[1040] In at least some cases the second data operation associated with the general knowledge database is executed first to generate second data operation results, the second data operation results are used to define the first data operation and the first data operation associated with the specific patient’s molecular report is executed second to generate the first set of response data.
[1041] In at least some cases the first data operation associated with the specific patient’s molecular report is executed first to generate first data operation results, the first data operation results are used to define the second data operation and the second data operation associated with the general knowledge database is executed second to generate the first set of response data.
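The two execution orders described above can be contrasted in a short Python sketch. The function names `knowledge_first` and `report_first`, the dictionary layouts, and the sample entries are hypothetical and chosen only to show how the result of one data operation can define the other; the gene note is a placeholder, not real knowledge-base content.

```python
def knowledge_first(report, knowledge_db, drug_class):
    """Knowledge-base operation runs first; its result defines the report query."""
    # Second data operation: look up which drugs belong to the class.
    drugs = knowledge_db["drug_classes"].get(drug_class, [])
    # First data operation: check the report for any of those drugs.
    return [t for t in report["therapies"] if t in drugs]


def report_first(report, knowledge_db):
    """Report operation runs first; its result defines the knowledge-base query."""
    # First data operation: pull the genes with reported variants.
    genes = report["variants"]
    # Second data operation: look up general knowledge for each of those genes.
    return {g: knowledge_db["gene_facts"].get(g, "no entry") for g in genes}


report = {"therapies": ["olaparib", "pembrolizumab"], "variants": ["BRCA2"]}
knowledge_db = {
    "drug_classes": {"PARP inhibitor": ["olaparib"]},
    "gene_facts": {"BRCA2": "placeholder note"},
}
print(knowledge_first(report, knowledge_db, "PARP inhibitor"))
print(report_first(report, knowledge_db))
```

In the first function a question such as whether the patient has received a PARP inhibitor is resolved by consulting general knowledge before the report; in the second, a question about the relevance of the patient's variants starts from the report.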
[1042] In at least some cases the step of identifying at least one intent includes determining that the audible query is associated with the specific patient, accessing the specific patient’s molecular report, determining the specific patient’s cancer state from the molecular report and then selecting an intent from a pool of cancer state related intents.
[1043] In at least some cases the method further includes the step of storing a general knowledge database that includes non-patient specific data about specific topics, the method further including the steps of, upon determining that the audible query is not associated with any specific patient, selecting an intent that is associated with the general knowledge database.
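One possible way to express this intent selection is sketched below in Python. The pool contents, the intent names, the `cancer_state` field, and the keyword matching are all assumptions made for illustration; the disclosure does not specify how the pools are stored or matched.

```python
# Hypothetical pools of (keyword, intent) pairs keyed by cancer state.
INTENT_POOLS = {
    "pancreatic cancer": [
        ("trial", "clinical_trial_match"),
        ("adverse", "cohort_adverse_events"),
    ],
}

# Hypothetical intents tied to the general knowledge database.
GENERAL_INTENTS = [("what is", "definition_lookup")]


def select_intent(query, patient_id, reports):
    """Pick an intent from the patient's cancer-state pool, else a general one."""
    if patient_id is not None and patient_id in reports:
        # Query concerns a specific patient: use the pool for that cancer state.
        state = reports[patient_id]["cancer_state"]
        pool = INTENT_POOLS.get(state, [])
    else:
        # No specific patient: fall back to general knowledge intents.
        pool = GENERAL_INTENTS
    lowered = query.lower()
    for keyword, intent in pool:
        if keyword in lowered:
            return intent
    return None


reports = {"p1": {"cancer_state": "pancreatic cancer"}}
print(select_intent("Which clinical trials is the patient a match for?", "p1", reports))
print(select_intent("What is PDL1 expression?", None, reports))
```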
[1044] In at least some cases the collaboration device includes a portable wireless device that includes a wireless transceiver.
[1045] In at least some cases the collaboration device is a handheld device.
[1046] In at least some cases the collaboration device includes at least one visual indicator, the processor linked to the visual indicator and controllable to change at least some aspect of the appearance of the visual indicator to indicate different states of the collaboration device.
[1047] In at least some cases the processor is programmed to monitor microphone input to identify a “wake up” phrase, the processor monitoring for the audible query after the wake up phrase is detected.
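The wake-phrase behavior can be sketched as a small state machine in Python. The sketch assumes microphone input has already been transcribed into a sequence of text utterances, and the wake phrase "hey assistant" and the function name `queries_after_wake` are hypothetical. Only the utterance immediately following the wake phrase is treated as a query.

```python
def queries_after_wake(utterances, wake_phrase="hey assistant"):
    """Yield only the utterance that follows each detection of the wake phrase."""
    armed = False
    for text in utterances:
        if armed:
            # The device is awake: treat this utterance as the audible query.
            yield text.strip()
            armed = False
        elif wake_phrase in text.lower():
            # Wake phrase heard: start monitoring for the next utterance.
            armed = True


heard = ["what is tmb", "Hey assistant", "What is PDL1 expression?", "ignored"]
print(list(queries_after_wake(heard)))
```

Utterances spoken before the wake phrase, or after the query has been captured, are ignored in this simplified design.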
[1048] In at least some cases a series of audible queries is received via the microphone, and the at least one of the identified data operations includes identifying a subset of data that is usable with subsequent audio queries to identify intents associated with the subsequent queries.
[1049] In at least some cases the method can further include the steps of, based on at least one audible query received via the microphone and related data in a system database, identifying at least one activity that a collaboration device user may want to perform and initiating the at least one activity.
[1050] In at least some cases the step of initiating the at least one activity includes generating a second audible response file and broadcasting the second audible response file to the user seeking verification that the at least one activity should be performed and monitoring the microphone for an affirmative response and, upon receiving an affirmative response, initiating the at least one activity.
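The verification step can be sketched in Python as a function that asks, listens, and acts only on an affirmative reply. The callables `ask` and `listen` stand in for the speaker and microphone paths and are hypothetical, as are the function name `confirm_and_run` and the list of affirmative phrases.

```python
# Hypothetical set of replies treated as affirmative.
AFFIRMATIVE = {"yes", "yeah", "sure", "ok", "okay", "please do"}


def confirm_and_run(activity_name, activity, ask, listen):
    """Ask the user to verify an activity and initiate it only on an affirmative reply."""
    # Stand-in for generating and broadcasting the second audible response file.
    ask(f"Would you like me to {activity_name}?")
    # Stand-in for monitoring the microphone for the user's reply.
    reply = listen().strip().lower().rstrip(".!")
    if reply in AFFIRMATIVE:
        return activity()
    return None


prompts = []
result = confirm_and_run(
    "check the status of the lab order",
    lambda: "status check started",
    prompts.append,
    lambda: "Yes.",
)
print(prompts, result)
```

Passing the speaker and microphone as callables keeps the sketch testable without audio hardware.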
[1051] In at least some cases the at least one activity includes periodically capturing health information from electronic health records included in the system database.
[1052] In at least some cases the at least one activity includes checking status of an existing clinical or lab order.
[1053] In at least some cases the at least one activity includes ordering a new clinical or lab order.
[1054] In at least some cases the collaboration device is one of a smartphone, a tablet computer, a laptop computer, a desktop computer, or an Amazon Echo.
[1055] In at least some cases the step of initiating the at least one activity includes automatically initiating the at least one activity without any initiating input from the user.
[1056] In at least some cases the method further includes storing and maintaining a general cancer knowledge database, persistently updating the specific patient’s molecular report, automatically identifying at least one intent and associated data operation related to the general cancer knowledge database based on the specific patient’s molecular report data, persistently executing the associated data operation on the general cancer knowledge database to generate a new set of response data not previously generated and, upon generating a new set of response data, using the new set of response data to generate another audible response file and broadcasting the another audible response file via the speaker.
[1057] In at least some cases the method is also used with an electronic health records system that maintains health records associated with a plurality of patients including the specific patient, the method further including identifying at least another data operation associated with the at least one intent and executing the another data operation on the specific patient’s health record to generate additional response data.
[1058] In at least some cases the step of using the first set of response data to generate an audible response file includes using the response data and the additional response data to generate the audible response file.
[1059] In at least some embodiments, a method of audibly broadcasting responses to a user based on user queries about a specific patient’s molecular report, the method for use with a collaboration device that includes a processor and a microphone and a speaker linked to the processor is provided by the disclosure. The method includes storing a separate molecular report for each of a plurality of patients in a system database, storing a general cancer knowledge database that includes non-patient specific data about cancer topics, receiving an audible query from the user via the microphone, identifying at least one intent associated with the audible query, identifying at least a first data operation associated with the at least one intent and the specific patient’s molecular report, identifying at least a second data operation associated with the at least one intent and the general cancer knowledge database, accessing the specific patient’s molecular report and the general cancer knowledge database, executing the at least a first data operation on a first set of data included in the specific patient’s molecular report to generate a first set of response data, executing the at least a second data operation of the general cancer knowledge database to generate a second set of response data, using at least one of the first and second sets of response data to generate an audible response file, and broadcasting the audible response file via the speaker.
[1060] In at least some embodiments, a method of audibly broadcasting responses to a user based on user queries about a specific patient’s molecular report, the method for use with a collaboration device that includes a processor and a microphone and a speaker linked to the processor is provided by the disclosure. The method includes storing molecular reports for a plurality of patients in a system database, receiving an audible query from the user via the microphone, determining that the audible query is associated with the specific patient, accessing the specific patient’s molecular report, determining the specific patient’s cancer state from the molecular report, identifying at least one intent from a pool of intents related to the specific patient’s cancer state and the audible query, identifying at least one data operation associated with the at least one intent, executing at least one of the identified at least one data operations on a first set of data included in the specific patient’s molecular report to generate a first set of response data, using the first set of response data to generate an audible response file, and broadcasting the audible response file via the speaker.
[1061] In at least some embodiments, a method of audibly broadcasting responses to a user based on user queries about a patient, the method for use with a collaboration device that includes a processor and a microphone and a speaker linked to the processor is provided by the disclosure. The method includes storing health records for a plurality of patients in a system database and storing a general cancer knowledge database, receiving an audible query from the user via the microphone, identifying a specific patient associated with the audible query, accessing the health records for the specific patient, identifying cancer related data in the specific patient’s health records, identifying at least one intent related to the identified cancer related data, identifying at least one data operation related to the at least one intent, executing the at least one data operation on the general cancer knowledge database to generate a first set of response data, using the first set of response data to generate an audible response file, and broadcasting the audible response file via the speaker.
[1062] In at least some embodiments, a method of audibly broadcasting responses to a user based on user queries about a specific patient molecular report is provided by the disclosure. The method includes receiving an audible query from the user to a microphone coupled to a collaboration device, identifying at least one intent associated with the audible query, identifying at least one data operation associated with the at least one intent, associating each of the at least one data operations with a first set of data presented on the molecular report, executing each of the at least one data operations on a second set of data to generate response data, generating an audible response file associated with the response data, and providing the audible response file for broadcasting via a speaker coupled to the collaboration device.
[1063] In at least some cases the audible query includes a question about a nucleotide profile associated with the patient.
[1064] In at least some cases the nucleotide profile associated with the patient is a profile of the patient’s cancer.
[1065] In at least some cases the nucleotide profile associated with the patient is a profile of the patient’s germline.
[1066] In at least some cases the nucleotide profile is a DNA profile.
[1067] In at least some cases the nucleotide profile is an RNA expression profile.
[1068] In at least some cases the nucleotide profile is a mutation biomarker.
[1069] In at least some cases the mutation biomarker is a BRCA biomarker.
[1070] In at least some cases the audible query includes a question about a therapy.
[1071] In at least some cases the audible query includes a question about a gene.
[1072] In at least some cases the audible query includes a question about clinical data.
[1073] In at least some cases the audible query includes a question about a next-generation sequencing panel.
[1074] In at least some cases the audible query includes a question about a biomarker.
[1075] In at least some cases the audible query includes a question about an immune biomarker.
[1076] In at least some cases the audible query includes a question about an antibody-based test.
[1077] In at least some cases the audible query includes a question about a clinical trial.
[1078] In at least some cases the audible query includes a question about an organoid assay.
[1079] In at least some cases the audible query includes a question about a pathology image.
[1080] In at least some cases the audible query includes a question about a disease type.
[1081] In at least some cases the at least one intent is an intent related to a biomarker.
[1082] In at least some cases the biomarker is a BRCA biomarker.
[1083] In at least some cases the at least one intent is an intent related to a clinical condition.
[1084] In at least some cases the at least one intent is an intent related to a clinical trial.
[1085] In at least some cases the at least one intent includes a drug intent related to a drug.
[1086] In at least some cases the drug intent is related to chemotherapy.
[1087] In at least some cases the drug intent is an intent related to a PARP inhibitor.
[1088] In at least some cases the at least one intent is related to a gene.
[1089] In at least some cases the at least one intent is related to immunology.
[1090] In at least some cases the at least one intent is related to a knowledge database.
[1091] In at least some cases the at least one intent is related to testing methods.
[1092] In at least some cases the at least one intent is related to a gene panel.
[1093] In at least some cases the at least one intent is related to a report.
[1094] In at least some cases the at least one intent is related to an organoid process.
[1095] In at least some cases the at least one intent is related to imaging.
[1096] In at least some cases the at least one intent is related to a pathogen.
[1097] In at least some cases the at least one intent is related to a vaccine.
[1098] In at least some cases the at least one data operation includes an operation to identify at least one treatment option.
[1099] In at least some cases the at least one data operation includes an operation to identify knowledge about a therapy.
[1100] In at least some cases the at least one data operation includes an operation to identify knowledge related to at least one drug.
[1101] In at least some cases the at least one data operation includes an operation to identify knowledge related to mutation testing.
[1102] In at least some cases the at least one data operation includes an operation to identify knowledge related to mutation presence.
[1103] In at least some cases the at least one data operation includes an operation to identify knowledge related to tumor characterization.
[1104] In at least some cases the at least one data operation includes an operation to identify knowledge related to testing requirements.
[1105] In at least some cases the at least one data operation includes an operation to query for definition information.
[1106] In at least some cases the at least one data operation includes an operation to query for expert information.
[1107] In at least some cases the at least one data operation includes an operation to identify information related to recommended therapy.
[1108] In at least some cases the at least one data operation includes an operation to query for information relating to a patient.
[1109] In at least some cases the at least one data operation includes an operation to query for information relating to patients with one or more clinical characteristics similar to the patient.
[1110] In at least some cases the at least one data operation includes an operation to query for information relating to patient cohorts.
[1111] In at least some cases the at least one data operation includes an operation to query for information relating to clinical trials.
[1112] In at least some cases the at least one data operation includes an operation to query about a characteristic relating to a genomic mutation.
[1113] In at least some cases the characteristic is loss of heterozygosity.
[1114] In at least some cases the characteristic can reflect the source of the mutation.
[1115] In at least some cases the source is germline.
[1116] In at least some cases the source is somatic.
[1117] In at least some cases the characteristic includes whether the mutation is a tumor driver.
[1118] In at least some cases the first set of data includes a patient name.
[1119] In at least some cases the first set of data includes a patient age.
[1120] In at least some cases the first set of data includes a next-generation sequencing panel.
[1121] In at least some cases the first set of data includes a genomic variant.
[1122] In at least some cases the first set of data includes a somatic genomic variant.
[1123] In at least some cases the first set of data includes a germline genomic variant.
[1124] In at least some cases the first set of data includes a clinically actionable genomic variant.
[1125] In at least some cases the first set of data includes a loss of function variant.
[1126] In at least some cases the first set of data includes a gain of function variant.
[1127] In at least some cases the first set of data includes an immunology marker.
[1128] In at least some cases the first set of data includes a tumor mutational burden.
[1129] In at least some cases the first set of data includes a microsatellite instability status.
[1130] In at least some cases the first set of data includes a diagnosis.
[1131] In at least some cases the first set of data includes a therapy.
[1132] In at least some cases the first set of data includes a therapy approved by the U.S. Food and Drug Administration.
[1133] In at least some cases the first set of data includes a drug therapy.
[1134] In at least some cases the first set of data includes a radiation therapy.
[1135] In at least some cases the first set of data includes a chemotherapy.
[1136] In at least some cases the first set of data includes a cancer vaccine therapy.
[1137] In at least some cases the first set of data includes an oncolytic virus therapy.
[1138] In at least some cases the first set of data includes an immunotherapy.
[1139] In at least some cases the first set of data includes a pembrolizumab therapy.
[1140] In at least some cases the first set of data includes a CAR-T therapy.
[1141] In at least some cases the first set of data includes a proton therapy.
[1142] In at least some cases the first set of data includes an ultrasound therapy.
[1143] In at least some cases the first set of data includes a surgery.
[1144] In at least some cases the first set of data includes a hormone therapy.
[1145] In at least some cases the first set of data includes an off-label therapy.
[1146] In at least some cases the first set of data includes an on-label therapy.
[1147] In at least some cases the first set of data includes a bone marrow transplant event.
[1148] In at least some cases the first set of data includes a cryoablation event.
[1149] In at least some cases the first set of data includes a radiofrequency ablation.
[1150] In at least some cases the first set of data includes a monoclonal antibody therapy.
[1151] In at least some cases the first set of data includes an angiogenesis inhibitor.
[1152] In at least some cases the first set of data includes a PARP inhibitor.
[1153] In at least some cases the first set of data includes a targeted therapy.
[1154] In at least some cases the first set of data includes an indication of use.
[1155] In at least some cases the first set of data includes a clinical trial.
[1156] In at least some cases the first set of data includes a distance to a location conducting a clinical trial.
[1157] In at least some cases the first set of data includes a variant of unknown significance.
[1158] In at least some cases the first set of data includes a mutation effect.
[1159] In at least some cases the first set of data includes a variant allele fraction.
[1160] In at least some cases the first set of data includes a low coverage region.
[1161] In at least some cases the first set of data includes a clinical history.
[1162] In at least some cases the first set of data includes a biopsy result.
[1163] In at least some cases the first set of data includes an imaging result.
[1164] In at least some cases the first set of data includes an MRI result.
[1165] In at least some cases the first set of data includes a CT result.
[1166] In at least some cases the first set of data includes a therapy prescription.
[1167] In at least some cases the first set of data includes a therapy administration.
[1168] In at least some cases the first set of data includes a cancer subtype diagnosis.
[1169] In at least some cases the first set of data includes a cancer subtype diagnosis by RNA class.
[1170] In at least some cases the first set of data includes a result of a therapy applied to an organoid grown from the patient’s cells.
[1171] In at least some cases the first set of data includes a tumor quality measure.
[1172] In at least some cases the first set of data includes a tumor quality measure selected from at least one of the set of PD-L1, MMR, tumor infiltrating lymphocyte count, and tumor ploidy.
[1173] In at least some cases the first set of data includes a tumor quality measure derived from an image analysis of a pathology slide of the patient’s tumor.
[1174] In at least some cases the first set of data includes a signaling pathway associated with a tumor of the patient.
[1175] In at least some cases the signaling pathway is a HER pathway.
[1176] In at least some cases the signaling pathway is a MAPK pathway.
[1177] In at least some cases the signaling pathway is an MDM2-TP53 pathway.
[1178] In at least some cases the signaling pathway is a PI3K pathway.
[1179] In at least some cases the signaling pathway is an mTOR pathway.
[1180] In at least some cases the at least one data operation includes an operation to query for a treatment option, the first set of data includes a genomic variant, and the associating step includes adjusting the operation to query for the treatment option based on the genomic variant.
[1181] In at least some cases the at least one data operation includes an operation to query for a clinical history data element, the first set of data includes a therapy, and the associating step includes adjusting the operation to query for the clinical history data element based on the therapy.
[1182] In at least some cases the clinical history data is medication prescriptions, the therapy is pembrolizumab, and the associating step includes adjusting the operation to query for the prescription of pembrolizumab.
[1183] In at least some cases the second set of data includes clinical health information.
[1184] In at least some cases the second set of data includes genomic variant information.
[1185] In at least some cases the second set of data includes DNA sequencing information.
[1186] In at least some cases the second set of data includes RNA information.
[1187] In at least some cases the second set of data includes DNA sequencing information from short-read sequencing.
[1188] In at least some cases the second set of data includes DNA sequencing information from long-read sequencing.
[1189] In at least some cases the second set of data includes RNA transcriptome information.
[1190] In at least some cases the second set of data includes RNA full-transcriptome information.
[1191] In at least some cases the second set of data is stored in a single data repository.
[1192] In at least some cases the second set of data is stored in a plurality of data repositories.
[1193] In at least some cases the second set of data includes clinical health information and genomic variant information.
[1194] In at least some cases the second set of data includes immunology marker information.
[1195] In at least some cases the second set of data includes microsatellite instability immunology marker information.
[1196] In at least some cases the second set of data includes tumor mutational burden immunology marker information.
[1197] In at least some cases the second set of data includes clinical health information including one or more of demographic information, diagnostic information, assessment results, laboratory results, prescribed or administered therapies, and outcomes information.
[1198] In at least some cases the second set of data includes demographic information including one or more of patient age, patient date of birth, gender, race, ethnicity, institution of care, comorbidities, and smoking history.
[1199] In at least some cases the second set of data includes diagnosis information including one or more of tissue of origin, date of initial diagnosis, histology, histology grade, metastatic diagnosis, date of metastatic diagnosis, site or sites of metastasis, and staging information.
[1200] In at least some cases the second set of data includes staging information including one or more of TNM, ISS, DSS, FAB, RAI, and Binet.
[1201] In at least some cases the second set of data includes assessment information including one or more of performance status comprising at least one of ECOG status or Karnofsky status, performance status score, and date of performance status.
[1202] In at least some cases the second set of data includes laboratory information including one or more of types of lab, lab results, lab units, date of lab service, date of molecular pathology test, assay type, assay result, molecular pathology method, and molecular pathology provider.
[1203] In at least some cases the second set of data includes treatment information including one or more of drug name, drug start date, drug end date, drug dosage, drug units, drug number of cycles, surgical procedure type, date of surgical procedure, radiation site, radiation modality, radiation start date, radiation end date, radiation total dose delivered, and radiation total fractions delivered.
[1204] In at least some cases the second set of data includes outcomes information including one or more of Response to Therapy, RECIST score, Date of Outcome, date of observation, date of progression, date of recurrence, adverse event to therapy, adverse event date of presentation, adverse event grade, date of death, date of last follow-up, and disease status at last follow up.
[1205] In at least some cases the second set of data includes information that has been de-identified in accordance with a de-identification method permitted by HIPAA.
[1206] In at least some cases the second set of data includes information that has been de-identified in accordance with a safe harbor de-identification method permitted by HIPAA.
[1207] In at least some cases the second set of data includes information that has been de-identified in accordance with a statistical de-identification method permitted by HIPAA.
[1208] In at least some cases the second set of data includes clinical health information of patients diagnosed with a cancer condition.
[1209] In at least some cases the second set of data includes clinical health information of patients diagnosed with a cardiovascular condition.
[1210] In at least some cases the second set of data includes clinical health information of patients diagnosed with a diabetes condition.
[1211] In at least some cases the second set of data includes clinical health information of patients diagnosed with an autoimmune condition.
[1212] In at least some cases the second set of data includes clinical health information of patients diagnosed with a lupus condition.
[1213] In at least some cases the second set of data includes clinical health information of patients diagnosed with a psoriasis condition.
[1214] In at least some cases the second set of data includes clinical health information of patients diagnosed with a depression condition.
[1215] In at least some cases the second set of data includes clinical health information of patients diagnosed with a rare disease.
[1216] In at least some cases the method is performed in conjunction with a digital and laboratory health care platform.
[1217] In at least some cases the digital and laboratory health care platform can generate a molecular report as part of a targeted medical care precision medicine treatment.
[1218] In at least some cases the method can operate on one or more micro-services.
[1219] In at least some cases the method is performed in conjunction with one or more microservices of an order management system.
[1220] In at least some cases the method is performed in conjunction with one or more microservices of a medical document abstraction system.
[1221] In at least some cases the method is performed in conjunction with one or more microservices of a mobile device application.
[1222] In at least some cases the method is performed in conjunction with one or more microservices of a prediction engine.
[1223] In at least some cases the method is performed in conjunction with one or more microservices of a cell-type profiling service.
[1224] In at least some cases the method is performed in conjunction with a variant calling engine to provide information to a query involving variants.
[1225] In at least some cases the method is performed in conjunction with an insight engine.
[1226] In at least some cases the method is performed in conjunction with a therapy matching engine.
[1227] In at least some cases the method is performed in conjunction with a clinical trial matching engine.
[1228] Embodiments of the information that is catalogued, stored, analyzed, or reported according to any embodiments described herein may also be provided through a stand-alone hardware device. Examples of stand-alone hardware devices are disclosed in US Patent App. No. 16/852,194, filed April 17, 2020, which is incorporated by reference for all purposes.
[1229] Specific Embodiments of the Disclosure
[1230] In some aspects, the systems and methods disclosed herein may be used to support clinical decisions for personalized treatment of cancer. For example, in some embodiments, the methods described herein identify actionable genomic variants and/or genomic states with associated recommended cancer therapies. In some embodiments, the recommended treatment is dependent upon whether or not the subject has a particular actionable variant and/or genomic status. Recommended treatment modalities can be therapeutic drugs and/or assignment to one or more clinical trials. Generally, current treatment guidelines for various cancers are maintained by various organizations, including the National Cancer Institute and Merck & Co., in the Merck Manual.
[1231] In some embodiments, the methods described herein further include assigning therapy and/or administering therapy to the subject based on the identification of an actionable genomic variant and/or genomic state, e.g., based on whether or not the subject’s cancer will be responsive to a particular personalized cancer therapy regimen. For example, in some embodiments, when the subject’s cancer is classified as having a first actionable variant and/or genomic state, the subject is assigned or administered a first personalized cancer therapy that is associated with the first actionable variant and/or genomic state, and when the subject’s cancer is classified as having a second actionable variant and/or genomic state, the subject is assigned or administered a second personalized cancer therapy that is associated with the second actionable variant and/or genomic state. Assignment or administration of a therapy or a clinical trial to a subject is thus tailored for treatment of the actionable variants and/or genomic states of the cancer patient.
Examples
[1232] Example 1 - The Cancer Genome Atlas (TCGA).
[1233] The Cancer Genome Atlas (TCGA) is a publicly available dataset comprising more than two petabytes of genomic data for over 11,000 cancer patients, including clinical information about the cancer patients, metadata about the samples (e.g., the weight of a sample portion, etc.) collected from such patients, histopathology slide images from sample portions, and molecular information derived from the samples (e.g., mRNA/miRNA expression, protein expression, copy number, etc.). The TCGA dataset includes data on 33 different cancers: breast (breast ductal carcinoma, breast lobular carcinoma), central nervous system (glioblastoma multiforme, lower grade glioma), endocrine (adrenocortical carcinoma, papillary thyroid carcinoma, paraganglioma & pheochromocytoma), gastrointestinal (cholangiocarcinoma, colorectal adenocarcinoma, esophageal cancer, liver hepatocellular carcinoma, pancreatic ductal adenocarcinoma, and stomach cancer), gynecologic (cervical cancer, ovarian serous cystadenocarcinoma, uterine carcinosarcoma, and uterine corpus endometrial carcinoma), head and neck (head and neck squamous cell carcinoma, uveal melanoma), hematologic (acute myeloid leukemia, thymoma), skin (cutaneous melanoma), soft tissue (sarcoma), thoracic (lung adenocarcinoma, lung squamous cell carcinoma, and mesothelioma), and urologic (chromophobe renal cell carcinoma, clear cell kidney carcinoma, papillary kidney carcinoma, prostate adenocarcinoma, testicular germ cell cancer, and urothelial bladder carcinoma).
[1234] Example 2 - Identification of Focal Copy Number Variation
[1235] Figures 7A1, 7B1, and 7C1 collectively illustrate identification of non-focal and focal copy number variations in biological samples, in accordance with some embodiments of the present disclosure. As defined above, focal copy number variations refer to small segments that consist of only a few exons of a gene or several genes that deviate significantly from neighboring segments.
[1236] A method in accordance with some embodiments of the present disclosure was performed using, as inputs, a test sample BAM file comprising aligned sequence reads from a sequencing of nucleic acids from a test sample, a target region BED file comprising at least the genes MYC and BRCA2, and a pool of process matched normal samples for comparison with test sample sequencing data. The method further used, as inputs for initial reference pool construction, a plurality of normal sample BAM files comprising aligned sequence reads from a sequence of nucleic acids from a plurality of process matched normal samples, a human reference genome file for alignment, a list of mappable regions of the genome and a blacklist comprising recurrent problematic areas of the genome.
[1237] The method was performed for three different test samples. For each test sample, CNVkit was performed utilizing targeted captured sequencing reads and non-specifically captured off-target sequencing reads to infer copy number information. The targeted genomic region specified in the probe target BED file was divided into target bins with an average size of 100 base pairs. The genomic regions between the target regions (excluding regions that could not be mapped reliably) were automatically divided into off-target bins with an average size of 150 kilobase pairs. Raw log2-transformed read depths for the plurality of sequence reads in each of the target and off-target bins were then calculated from the alignments in the input BAM file and written to two tab-delimited .cnn files.
[1238] A pooled reference was constructed from a panel of process matched normal samples. The raw log2-transformed read depths for the plurality of sequence reads in each of the target and off-target bins in each normal sample were computed as described above, and each log2 read depth was median-centered and corrected for bias including GC content, genome sequence repetitiveness, target size and spacing. The corrected target and off-target log2 read depths were combined across samples, and a weighted average and spread were calculated for each bin using Tukey’s biweight location and midvariance. These values were written to a tab-delimited reference .cnn file, which was used to normalize the binned sequencing data for the test sample.
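By way of illustration only, the robust per-bin average described above can be sketched as follows. This is a minimal, iterative version of Tukey's biweight location written for this example; it is not the CNVkit source code, and the function name, the tuning constant `c`, and the iteration limits are assumptions chosen for the sketch. The midvariance (spread) calculation is omitted.

```python
import numpy as np

def biweight_location(values, c=6.0, max_iter=10, tol=1e-6):
    """Tukey's biweight location: an average that down-weights outlying values.

    `c`, `max_iter`, and `tol` are illustrative defaults, not values from the text.
    """
    x = np.asarray(values, dtype=float)
    center = np.median(x)  # start from the median
    for _ in range(max_iter):
        mad = np.median(np.abs(x - center))
        if mad == 0:
            return float(center)  # no spread around the center; nothing to refine
        u = (x - center) / (c * mad)
        # Values farther than c * MAD from the center receive zero weight.
        w = np.where(np.abs(u) < 1, (1 - u ** 2) ** 2, 0.0)
        new_center = center + np.sum(w * (x - center)) / np.sum(w)
        if abs(new_center - center) < tol:
            return float(new_center)
        center = new_center
    return float(center)

# Example: log2 read depths of one bin across six normal samples, one an outlier.
depths = [0.0, 0.1, -0.1, 0.05, -0.05, 5.0]
print(biweight_location(depths))
```

In this example the outlying value of 5.0 receives zero weight, so the result stays close to the center of the remaining five samples.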
[1239] The raw log2 read depths of each test sample were median-centered and bias-corrected as described in the reference construction. The log2 read depth of each corresponding bin in the reference .cnn file was then subtracted from the corrected log2 read depth of each bin in the test sample .cnn file, thus generating log2 copy ratios indicating differential copy numbers between the test sample and the reference pool. These values were written to a tab-delimited .cnr file.
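As a simplified illustration of this normalization, the sketch below median-centers the test sample's log2 read depths and subtracts the reference value for each corresponding bin. The function name is hypothetical, and the bias correction (GC content, repetitiveness, target size and spacing) is deliberately left out, so this is an arithmetic sketch under stated assumptions rather than the CNVkit procedure.

```python
import numpy as np

def log2_copy_ratios(test_log2_depths, reference_log2_depths):
    """Per-bin log2 copy ratio: median-centered test depth minus reference depth.

    Assumes both inputs list the same bins in the same order; bias correction
    is omitted from this sketch.
    """
    test = np.asarray(test_log2_depths, dtype=float)
    ref = np.asarray(reference_log2_depths, dtype=float)
    if test.shape != ref.shape:
        raise ValueError("test and reference must cover the same bins")
    centered = test - np.median(test)  # median-centering step
    return centered - ref              # subtraction in log2 space is a ratio

print(log2_copy_ratios([1.0, 2.0, 3.0], [0.0, 0.0, 0.0]))
```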
[1240] The log2 copy ratios were then segmented via a circular binary segmentation (CBS) algorithm, in which adjacent bins were grouped to larger genomic regions (e.g., segments) of equal copy number. Each segment’s copy ratio was calculated as the weighted mean of all bins within the segment, and the confidence interval of the mean for each respective segment was estimated by bootstrapping the bin-level copy ratios within the segment. The segments’ genomic ranges, copy ratios and confidence intervals were then written to a tab-delimited .cns file. The copy ratios of each segment were used to generate an initial copy number status annotation for the respective segment, for subsequent validation.
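The segment-level summary described above can be sketched as follows, assuming the bin-level copy ratios and per-bin weights of a single segment are already available. The function name, the number of bootstrap resamples, the 95% interval, and the fixed random seed are assumptions made for this illustration; the segmentation itself (CBS) is not shown.

```python
import numpy as np

def segment_ratio_with_ci(bin_ratios, bin_weights, n_boot=1000, alpha=0.05, seed=0):
    """Weighted mean copy ratio of a segment plus a bootstrap confidence interval."""
    ratios = np.asarray(bin_ratios, dtype=float)
    weights = np.asarray(bin_weights, dtype=float)
    seg_ratio = np.sum(ratios * weights) / np.sum(weights)

    # Resample bins with replacement and recompute the weighted mean each time.
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(ratios), size=(n_boot, len(ratios)))
    boot = np.sum(ratios[idx] * weights[idx], axis=1) / np.sum(weights[idx], axis=1)

    lower, upper = np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(seg_ratio), float(lower), float(upper)

ratio, ci_low, ci_high = segment_ratio_with_ci(
    [0.8, 1.0, 1.2, 0.9, 1.1], [1.0, 1.0, 1.0, 1.0, 1.0]
)
print(ratio, ci_low, ci_high)
```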
[1241] The validation of each copy number status annotation for each segment was performed using annotation and filtering. First, bin-level copy ratios (.cnr) and segment-level copy ratios with their confidence intervals (.cns) from the CNVkit outputs, as well as the probe target region file (.bed), were passed to a Python script (e.g., annotate_cnvs_xf.py), in which each segment was examined and an amplification or deletion was called for the segment if a plurality of criteria (e.g., filters) were met. If the plurality of criteria were not met (e.g., if any of the filters were fired), the copy number status annotation was rejected.
[1242] The plurality of criteria included a first requirement that the respective segment’s copy ratio be greater than 0.03 for an amplification, or less than -0.5 for a deletion. Both thresholds were determined empirically. More stringent values (e.g., a higher value for amplifications or a lower one for deletions) were found to increase specificity but decrease sensitivity, especially in low tumor fraction cases.
[1243] The plurality of criteria also included a second requirement that the median copy ratio of all target bins within the respective segment be greater than 0.03 for an amplification, or less than -0.5 for a deletion.
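For illustration, the first and second requirements can be expressed as a single check, as in the sketch below. The threshold values are those quoted above; the function name and the string labels for the call type are hypothetical and do not come from the script referenced in the disclosure.

```python
from statistics import median

AMP_THRESHOLD = 0.03   # amplification threshold quoted in the text
DEL_THRESHOLD = -0.5   # deletion threshold quoted in the text

def passes_ratio_filters(segment_ratio, target_bin_ratios, call):
    """First and second requirements: the segment copy ratio and the median
    copy ratio of its target bins must both clear the same threshold."""
    bin_median = median(target_bin_ratios)
    if call == "amplification":
        return segment_ratio > AMP_THRESHOLD and bin_median > AMP_THRESHOLD
    if call == "deletion":
        return segment_ratio < DEL_THRESHOLD and bin_median < DEL_THRESHOLD
    raise ValueError("call must be 'amplification' or 'deletion'")

print(passes_ratio_filters(0.97, [0.9, 1.0, 1.05], "amplification"))
print(passes_ratio_filters(-1.1, [-1.0, -1.2], "deletion"))
```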
[1244] The plurality of criteria also included a third requirement that, to validate an amplification annotation, the lower bound of the segment’s copy ratio confidence interval be greater than the mean copy ratios of all preceding and all subsequent segments in the same chromosome. To validate a deletion annotation, the higher bound of the segment’s copy ratio confidence interval must be less than the mean copy ratios of all preceding and all subsequent segments in the same chromosome. Specifically, in such embodiments, the third criterion required satisfaction of two threshold values (e.g., mean copy ratio of all preceding segments and mean copy ratio of all following segments) to pass the filter.
[1245] In some alternative embodiments of the third requirement where the segment was the first or the last segment in a chromosome, then the lower (higher) bound of the segment’s copy ratio confidence interval must be greater (less) than the mean copy ratios of all other segments excluding itself for an amplification (a deletion). Specifically, in such embodiments, the third criterion required satisfaction of a single threshold value (e.g., the mean copy ratio of all segments other than the segment under examination).
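The third requirement, including the first-or-last-segment case, can be sketched as follows. The function name and arguments are hypothetical. Two points are assumptions of this sketch and are not stated in the text: the means over neighboring segments are taken as unweighted means, and a segment that is the only one on its chromosome is treated as failing the filter.

```python
def passes_neighbor_filter(segment_ratios, index, ci_low, ci_high, call):
    """Third requirement: compare a segment's confidence interval against the
    mean copy ratio of the segments before and after it on the same chromosome.

    segment_ratios: copy ratios of all segments on the chromosome, in order.
    index: position of the segment under examination in that list.
    """
    before = list(segment_ratios[:index])
    after = list(segment_ratios[index + 1:])

    def mean(xs):
        return sum(xs) / len(xs)  # unweighted mean (assumption)

    if before and after:
        thresholds = [mean(before), mean(after)]   # two threshold values
    elif before or after:
        thresholds = [mean(before + after)]        # first or last segment
    else:
        return False  # lone segment: treated as non-focal here (assumption)

    if call == "amplification":
        return all(ci_low > t for t in thresholds)
    if call == "deletion":
        return all(ci_high < t for t in thresholds)
    raise ValueError("call must be 'amplification' or 'deletion'")

# A segment standing above its neighbors passes; one level with them does not.
print(passes_neighbor_filter([0.0, 0.97, 0.05], 1, 0.90, 1.05, "amplification"))
print(passes_neighbor_filter([1.2, 1.2, 1.2], 1, 1.10, 1.30, "amplification"))
```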
[1246] Finally, the plurality of criteria further included a fourth requirement that, to validate an amplification, the median copy ratio of all target bins in the segment be no less than the median plus the median absolute deviation (MAD) of all bins' copy ratios on the same chromosome. To validate a deletion, the median copy ratio of all target bins in the segment must be no greater than the median minus 0.75 of the MAD of all bins' copy ratios on the same chromosome. A scaling factor of 0.75 of the MAD was selected for deletion annotations in order to account for lower signal-to-noise ratios observed in deletions, which were not observed in amplifications.
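A minimal sketch of the fourth requirement follows, assuming the bin-level copy ratios of the segment's target bins and of all bins on the same chromosome are available as plain lists. The function name is hypothetical; the 0.75 scaling factor for deletions is the value stated above, and the MAD is computed here as the unscaled median absolute deviation, which is an assumption.

```python
import numpy as np

def passes_mad_filter(target_bin_ratios, chrom_bin_ratios, call, deletion_scale=0.75):
    """Fourth requirement: the median of the segment's target bins must sit at
    least one MAD above (amplification) or 0.75 MAD below (deletion) the
    chromosome-wide median of bin copy ratios."""
    chrom = np.asarray(chrom_bin_ratios, dtype=float)
    chrom_median = np.median(chrom)
    mad = np.median(np.abs(chrom - chrom_median))  # unscaled MAD (assumption)
    segment_median = np.median(np.asarray(target_bin_ratios, dtype=float))

    if call == "amplification":
        return bool(segment_median >= chrom_median + mad)
    if call == "deletion":
        return bool(segment_median <= chrom_median - deletion_scale * mad)
    raise ValueError("call must be 'amplification' or 'deletion'")

chrom_bins = [0.0, 0.1, -0.1, 0.05, -0.05, 1.0, 1.1, 0.9]
print(passes_mad_filter([1.0, 1.1, 0.9], chrom_bins, "amplification"))
```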
[1247] The final amplification status of a segment was then mapped to each target bin in the segment under examination and a CSV file was generated with the following columns: “amp” - amplification status (amplified, neutral, or deleted); “chrom” - chromosome; “start” - start position of a target bin; “end” - end position of a target bin; “gene” - gene in the target bin; “meant” - mean coverage of the target bin; “corrected_log2_ratio” - copy ratio of the bin; “segment_log2_ratio” - copy ratio of the segment comprising the target bin; “seg_id” - numeric index of the segment; and “order id” - order id of the input sample.
[1248] Figures 7A1 and 7B1 illustrate the amplification status of a first test sample and a second test sample comprising the MYC gene, validated using the above method. Figure 7A1 illustrates a scatter plot of bin-level copy ratios (“+”: off-target bins; “·”: target bins) and segment-level copy ratios (horizontal lines) located on chromosome 8 of the first test sample, where bin-level and segment-level copy ratios were generated by CNVkit as described above. The vertical line highlights the position of the MYC gene on chromosome 8. As illustrated in Figure 7A1, the MYC gene locus in the first test sample was identified as having a copy ratio of 1.2, suggesting that the sample comprises a copy number variation associated with the MYC gene in comparison with pooled reference samples. However, application of the annotation and filtering of the method disclosed herein indicated that the MYC gene was located in a non-focal amplified segment, as is visually represented by the solid horizontal line comprising the MYC gene (e.g., the surrounding segments also exhibit a copy ratio of 1.2). Figure 7A1 illustrates that the amplification status attributed to the MYC gene locus by CNVkit is likely to be an artifact resulting from its inclusion in a non-focal amplified segment, and thus is not therapeutically actionable.
[1249] Figure 7B1 illustrates a scatter plot of bin-level copy ratios (“+”: off-target bins; “·”: target bins) and segment-level copy ratios (horizontal lines) on chromosome 8 of the second test sample, where bin-level and segment-level copy ratios were generated by CNVkit as described above. In contrast to Figure 7A1, the vertical line in Figure 7B1 highlights the position of the MYC gene on chromosome 8, which is in a focal amplified segment with a copy ratio of 0.97. Notably, the validation of the MYC gene amplification status as occurring in a focal amplified segment illustrates that the amplification status called for the second test sample represents a real and therapeutically actionable copy number variation.
[1250] Figures 7A1 and 7B1 thus illustrate a key application of the disclosed method, in which a focal amplification of a gene of therapeutic interest can be distinguished from a non-focal amplification (e.g., an artifactual or erroneous call) of the same gene. By achieving higher confidence in the identification of copy number variations of disease-associated genes, it is possible to avoid misdiagnoses and make accurate, informed decisions on necessary treatments or therapies.
[1251] In an alternative embodiment, Figure 7C1 illustrates the deletion status of a third test sample comprising the BRCA2 gene, validated using the above method.
[1252] Figure 7C1 illustrates a scatter plot of bin-level copy ratios (“+”: off-target bins; “·”: target bins) and segment-level copy ratios (horizontal lines) on chromosome 13 of the third test sample, where bin-level and segment-level copy ratios were generated by CNVkit as described above. The vertical line highlights the position of the BRCA2 gene on chromosome 13. As observed in Figure 7B1, visual inspection of the scatter plot confirms that the BRCA2 gene in the third test sample is contained in a focal deleted segment with a copy ratio of -1.1 (e.g., as illustrated by the solid horizontal lines on either side of the segment comprising the BRCA2 gene with a copy ratio of approximately zero). Thus, in addition to validating copy number status annotations to identify focal amplifications, the method can also be used to identify focal deletions, for genes of interest associated with human disease.
[1253] Example 3 - Method of Validating a Liquid Biopsy Assay
[1254] Conducting sample collection, storage, nucleic acid isolation, and library preparation.
[1255] To validate a liquid biopsy assay in accordance with some embodiments of the present disclosure, 188 unique specimens were sequenced. These unique specimens included 10 blood specimens purchased from BioIVT, 56 residual plasma samples, 39 whole-blood samples, 4 cfDNA reference standards set in synthetic plasma (Horizon Discovery’s Multiplex I cfDNA Reference Standards HD812, HD813, HD814, HD815), and 2 cfDNA reference standard isolates (Horizon Discovery’s Structural Multiplex cfDNA reference standard HD786, and 100% Multiplex I Wild Type Reference Standard HD776).
Furthermore, an additional 55 blood samples with matched tumor samples were utilized to compare the liquid biopsy and solid tumor tests, and 375 blood samples were sequenced for low-pass whole-genome sequencing (LPWGS) analysis. Sequence data from an additional 1,000 patient samples that were previously sequenced were utilized for retrospective and clinical analyses. All blood was received in Cell-free DNA BCT® blood collection tubes (Streck). Plasma was prepared immediately after accessioning and stored at -80 °C until later nucleic acid extraction and library preparation. At this time, cfDNA was isolated from plasma using the Qiagen QIAamp MinElute ccfDNA Midi Kit (QIAGEN), conducted according to instructions provided by the manufacturer. Automated library preparation was performed on a SciClone NGSx (Perkin Elmer). All cfDNA samples were normalized with molecular grade water to a maximum of 50 microliters (µL).
[1256] Conducting the liquid biopsy sequencing assay.
[1257] The liquid biopsy assay utilized New England Biolabs' NEBNext® Ultra™ II DNA Library Prep Kit for Illumina®, IDT's xGen CS Adapters, unique molecular indices (UMI), and 96 pairs of barcodes to prepare cfDNA sequencing libraries with unique sample identifiers (IDs). Each sample was ligated to a dual unique index. The dual unique index enables multiplexed sequencing of up to 7 patients and 1 positive control per SP NovaSeq flow cell, 16 patients and 1 positive control per S1 NovaSeq flow cell, 34 patients and 1 positive control per S2 NovaSeq flow cell, and 84 patients and 1 positive control per S4 NovaSeq flow cell. The library preparation protocol is optimized for greater than or equal to 20 nanograms (ng) cfDNA input to maximize mutation detection sensitivity. The final library was sequenced on an Illumina NovaSeq sequencer. Furthermore, analysis was performed using a bioinformatics pipeline and analysis server.
[1258] The bioinformatics pipeline.
[1259] Adapter-trimmed FASTQ files were aligned to the nineteenth edition of the human reference genome build (hg19) using Burrows-Wheeler Aligner (BWA). Li et al., 2009, “Fast and accurate short read alignment with Burrows-Wheeler transform,” Bioinformatics, (25), pg. 1754. Following alignment, reads were grouped by alignment position and UMI family, and collapsed into consensus sequences using fgbio tools (available online at fulcrumgenomics.github.io/fgbio/). Bases with insufficient quality or significant disagreement among family members were reverted to N's. Phred scores were scaled based on initial base calling estimates combined across all family members. Following single strand consensus sequence generation, duplex consensus sequences were generated by comparing the forward and reverse oriented PCR products with mirrored UMI sequences. Consensus sequences were re-aligned to the human reference genome using BWA. BAM files were generated and indexed after the re-alignment.
[1260] SNV and indel variants were detected using VarDict. Lai et al., 2016, “VarDict: a novel and versatile variant caller for next-generation sequencing in cancer research,” Nucleic Acids Res, (44), pg. 108. SNVs were called down to 0.1% VAF for specified hotspot target regions and 0.25% VAF at all other base positions across the panel. Indels were called down to 0.5% VAF for variants within specific regions of interest. Any indels outside of these regions were called down to 5% VAF. All SNVs and indels were then sorted, deduplicated, normalized, and annotated accordingly. Following annotation, variants were classified as germline, somatic, or uncertain using a Bayesian model based on prior expectations informed by various internal and external databases of germline and cancer variants. Uncertain variants were treated as somatic for filtering and reporting purposes. Following classification, variants were filtered based on a plurality of quality metrics including coverage, VAF, strand bias, and genomic complexity. Additionally, variants were filtered with a Bayesian trinucleotide context-based model with position level background error rates estimated from a pool of process matched healthy controls. Furthermore, known artifactual variants were removed.
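The VAF floors described above can be summarized in a small lookup, sketched below for illustration. The function names and arguments are hypothetical and are not part of VarDict or of the disclosed pipeline; the sketch also assumes that "called down to" a given VAF means the floor itself is included. VAFs are expressed as fractions (0.001 corresponds to 0.1%).

```python
def minimum_vaf(variant_type, in_hotspot=False, in_region_of_interest=False):
    """Return the lowest reportable VAF (as a fraction) for a variant class."""
    if variant_type == "SNV":
        return 0.001 if in_hotspot else 0.0025           # 0.1% vs 0.25%
    if variant_type == "indel":
        return 0.005 if in_region_of_interest else 0.05  # 0.5% vs 5%
    raise ValueError("variant_type must be 'SNV' or 'indel'")

def meets_vaf_floor(vaf, variant_type, in_hotspot=False, in_region_of_interest=False):
    """True when the observed VAF is at or above the floor (inclusive, by assumption)."""
    return vaf >= minimum_vaf(variant_type, in_hotspot, in_region_of_interest)

# An SNV at 0.2% VAF is reportable only inside a hotspot target region.
print(meets_vaf_floor(0.002, "SNV", in_hotspot=True))
print(meets_vaf_floor(0.002, "SNV"))
```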
[1261] Copy number variants (CNVs) were analyzed utilizing CNVkit and a CNV annotation and filtering algorithm provided by the present disclosure. Talevich et al., 2016, “CNVkit: Genome-Wide Copy Number Detection and Visualization from Targeted DNA Sequencing,” PLoS Comput Biol, (12), pg. 1004873. CNVkit provides genomic region binning, coverage calculation, bias correction, normalization to a reference pool, segmentation, and visualization. The log2 ratios between the tumor sample and a pool of process matched healthy samples from the CNVkit output were annotated and filtered using statistical models, such that the amplification status (e.g., amplified or not-amplified) of each gene is predicted and non-focal amplifications are removed.
[1262] Rearrangements were detected using the SpeedSeq analysis pipeline. Chiang et al., 2015, “SpeedSeq: ultra-fast personal genome analysis and interpretation,” Nat Methods, (12), pg. 966. Briefly, FASTQ files were aligned to hg19 using BWA. Split reads mapped to multiple positions and read pairs mapped to discordant positions were identified and separated, then utilized to detect gene rearrangements by LUMPY. Layer et al., 2014, “LUMPY: a probabilistic framework for structural variant discovery,” Genome Biol, (15), pg. 84. Fusions were then filtered according to the number of supporting reads.
[1263] Predicted functional effect and clinical interpretation for each variant was curated by automated software using information from both internal and external databases. A weighted-heuristic model was used, which has logic-based recommendations from the AMP/ASCO/CAP/ClinGen Somatic working group and ACMG guidelines. Li et al, 2017, “Standards and Guidelines for the Interpretation and Reporting of Sequence Variants in Cancer: A Joint Consensus Recommendation of the Association for Molecular Pathology, American Society of Clinical Oncology, and College of American Pathologists,” The Journal of molecular diagnostics, (19), pg. 4; Kalia et al., 2017, “Recommendations for reporting of secondary findings in clinical exome and genome sequencing, 2016 update (ACMG SF v2.0): a policy statement of the American College of Medical Genetics and Genomics,” Genetics in Medicine, (19), pg. 249.
[1264] To detect microsatellite instability, the relative frequency and distribution were determined for any read containing repetitive sequences. To predict the probability of an unstable locus, a k-nearest neighbors model (with k = 100) was utilized along with normalized percent lower, mean lower, and mean log-likelihood metrics. The percentage of unstable loci was calculated from the probabilities of each sample, with greater than 50% unstable loci considered microsatellite instability-high (MSI-H).
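The final classification step described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: it assumes the per-locus instability probabilities have already been produced by the k-nearest neighbors model, and the per-locus cut-off of 0.5, the function names, and the "not MSI-H" label are hypothetical. Only the rule that more than 50% unstable loci indicates MSI-H comes from the text.

```python
def fraction_unstable(locus_probs, locus_cutoff=0.5):
    """Fraction of loci whose predicted instability probability exceeds
    locus_cutoff (the 0.5 per-locus cut-off is an assumption)."""
    if not locus_probs:
        raise ValueError("at least one locus probability is required")
    unstable = sum(1 for p in locus_probs if p > locus_cutoff)
    return unstable / len(locus_probs)


def msi_status(locus_probs, sample_cutoff=0.5):
    """Call MSI-H when more than 50% of loci are unstable, per the text.
    The label returned otherwise is a placeholder, not from the source."""
    if fraction_unstable(locus_probs) > sample_cutoff:
        return "MSI-H"
    return "not MSI-H"
```

Because the comparison is strict, a sample with exactly half of its loci unstable is not called MSI-H under this sketch.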
[1265] The validation approach.
[1266] Extensive validation studies were conducted to establish robust technical performance of the liquid biopsy assay. Limit of detection (LOD) was determined by assessing analytical sensitivity in reference standards with 5%, 1%, 0.5%, 0.25%, and 0.1% VAF generated from the Horizon Discovery reference set. The Horizon Discovery set includes 160 bp cfDNA fragments from human cell lines in an artificial plasma matrix to closely resemble cfDNA extracted from human plasma. VAFs of SNVs and indels, including EGFR (ΔE746-A750), EGFR (V769-D770insASV), EGFR A767_V769dup, EGFR (L858R), EGFR (T790M), KRAS (G12D), NRAS (A59T), NRAS (Q61K), AKT1 E17K, PIK3CA (E545K), and GNA11 Q209L, and CNVs and rearrangements, including CCDC6/RET, SLC34A2/ROS1, MET, MYC, and MYCN, were measured in reference samples by the liquid biopsy assay of the present disclosure. Each measurement was conducted with a minimum of three replicates at 10 ng, 30 ng, and 50 ng of DNA. Sensitivity was determined by the number of detected variants divided by the total number of variants present in the reference samples. Samples with an on-target rate of less than 30% were excluded from the instant analysis, and MET (4.5 copies) was included in CNV sensitivity determinations. Sensitivity of greater than 90% was considered reliable detection.
[1267] Analytical specificity was determined using 44 normal samples titrated at 1%, 2.5%, or 5% from a wild-type cfDNA reference standard with a list of confirmed true negative SNVs, indels, CNVs and rearrangements. Specificity was determined by the number of known true-negative variants divided by the number of true-negative variants plus false-positive variants identified by the liquid biopsy assay.
[1268] To assess inter-instrument concordance between the sequencing instruments, 10 patient libraries were sequenced on each instrument (3 NovaSeqs). Variants seen below the lower limit of detection (LLOD) (0.25% for SNVs and 0.50% for indels) were excluded from concordance analysis.
[1269] To establish analytical accuracy, the results of 40 validation samples were compared to the results of an orthogonal reference method (Roche's AVENIO ctDNA assay). Analytical accuracy was determined by the number of detected variants divided by the total number of variants present in the sample. Variants that were off-target or below LLOD (0.25% for SNVs and 0.5% for indels) were excluded from the instant analysis.
[1270] Conducting digital droplet polymerase chain reaction (ddPCR).
[1271] Five variants were validated on the ddPCR platform: KRAS G12D (Integrated DNA Technologies, IDT, published sequences); TERT promoter mutations c.-124C>T (C228T) and c.-146C>T (C250T) (Thermo Fisher Scientific); and TP53 p.R273H and TP53 p.R175H (Thermo Fisher Scientific). Each amplification reaction was performed in 25 µL and contained 1X Genotyping Master Mix (Thermo Fisher Scientific), 1X droplet stabilizer (RainDance), 1X of primer/probe mixture for TERT and TP53 (for KRAS: 800 nM of each primer and 500 nM of each probe) plus template. To improve the lower limit of detection, 4-cycle amplification was conducted prior to droplet generation. Amplification for KRAS was conducted using the cycling conditions of: 1 cycle of 95°C (0.6°C/s ramp) for 10 minutes, 4 cycles of 95°C (0.6°C/s ramp) for 15 seconds and 60°C for 2 minutes, followed by 1 cycle of 98°C (0.6°C/s ramp) for 10 minutes. Cycling conditions for the TP53 variants were the same as those for KRAS with the exception of the annealing and extension temperature, which was set at 55°C for 2 minutes. Amplification for TERT followed Thermo Fisher’s recommendation as follows: 1 cycle of 96°C (1.6°C/s ramp) for 10 minutes, 4 cycles of 98°C (1.6°C/s ramp) for 30 seconds and 55°C for 2 minutes, followed by 1 cycle of 55°C (1.6°C/s ramp) for 2 minutes. Droplets were then generated on the RainDance Source, and amplification was performed following the above cycling conditions with cycle numbers of 45 for both KRAS and TP53, and 54 for TERT. Droplets were analyzed on a RainDance Sense droplet reader, and RainDrop Analyst II v1.1.0 analysis software was utilized to acquire and analyze data.
[1272] The concordance between liquid biopsy and solid tumor assays.
[1273] Matched liquid biopsy and solid tumor sample pairs (n = 55) were used to determine analytical sensitivity and specificity. Solid tumor and matched normal samples obtained from peripheral blood buffy coat were analyzed with the solid tumor assay, and corresponding blood plasma samples were analyzed with the liquid biopsy assay of the present disclosure. Only variants in the reportable range of both the solid tumor and liquid biopsy panels were included in these analyses (e.g., the genes in the liquid biopsy gene panel are a subset of the genes in the solid tumor gene panel). Germline, intronic, and synonymous variants identified in the solid tumor assay and the liquid biopsy assay were excluded from analysis with the exception of intronic splice variants. To determine analytical sensitivity, the number of variants called in both the liquid biopsy assay and the solid tumor assay (e.g., true positives) was divided by the sum of true positives and those called only in the solid tumor assay. To determine analytical specificity, the number of positions reported in neither the liquid biopsy assay nor the solid tumor assay (e.g., true negatives) was divided by the sum of true negatives and variants only called in the liquid biopsy assay.
[1274] To improve variant calling in the liquid biopsy assay, a strategy that dynamically determines local sequence errors using Bayes' theorem and the likelihood ratio test was developed. The dynamic threshold was determined using a sample-specific error rate, the error rate from healthy control samples, and the error rate from a reference cohort of solid tumor samples. Accordingly, the method of the present disclosure was conducted on 55 matched liquid biopsy/solid tumor tissue samples, with variants detected in the solid tumor assay as the source of truth. The method used sensitivity thresholds defined by the LOD analysis and fixed post-test-odds (e.g., equal to P(post-test) / [1 - P(post-test)]), as well as pre-test-odds. The pre-test-odds were determined using historical data from the solid tumor assay with an equation identical to the post-test-odds calculation. Accordingly, the following formula was determined based on the above: specificity = 1 - (pre-test-odds * sensitivity) / post-test-odds
[1275] The specificity was input to a beta-binomial function and yielded the minimum number of alternate alleles to call a variant at a particular depth. The pre-test-odds metric was specific to individual cancer cohorts and individual genes, allowing for cancer-specific pre-test-odds to be applied to individual exons.
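The two steps above (deriving a required specificity from the odds, then converting it into a minimum alternate-allele count at a given depth) can be sketched as follows. This is a hedged illustration only: the beta-binomial shape parameters `a` and `b` stand in for a background error model and are hypothetical, as are the function names and the choice to define the threshold as the smallest count whose upper-tail probability does not exceed 1 - specificity. The source does not state how its beta-binomial function was parameterized.

```python
import math


def required_specificity(pre_test_odds, sensitivity, post_test_odds):
    """specificity = 1 - (pre-test-odds * sensitivity) / post-test-odds."""
    return 1.0 - pre_test_odds * sensitivity / post_test_odds


def _log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def betabinom_pmf(k, n, a, b):
    """Beta-binomial probability of k alternate reads out of n total reads."""
    log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_comb + _log_beta(k + a, n - k + b) - _log_beta(a, b))


def min_alt_reads(depth, specificity, a, b):
    """Smallest k with P(X >= k) <= 1 - specificity under the assumed
    beta-binomial background error model (a, b are hypothetical)."""
    alpha = 1.0 - specificity
    cdf = 0.0  # running P(X <= k - 1)
    for k in range(depth + 1):
        if 1.0 - cdf <= alpha:
            return k
        cdf += betabinom_pmf(k, depth, a, b)
    return depth + 1  # no count at this depth meets the requirement
```

Under this sketch, a higher required specificity or a noisier background model raises the number of supporting reads needed to call a variant, which is the behavior the dynamic threshold is described as providing.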
[1276] Conducting low-pass whole genome sequencing and analysis.
[1277] Blood samples from 375 patients were sequenced using low-pass whole-genome sequencing (LPWGS) across four flow cells. Sequencing coverage metrics for these samples were determined using Picard CollectWgsMetrics. The tumor fraction and ploidy values for each sample were estimated using ichorCNA with a specific reference panel of 47 normal samples. Adalsteinsson et al., 2017, “Scalable whole-exome sequencing of cell-free DNA reveals high concordance with metastatic tumors,” Nat Commun, (8), pg. 1324. Reported variants from the corresponding liquid biopsy analysis of each sample were utilized to assess the accuracy of the tumor fraction estimates.
[1278] Determining estimation of circulating tumor fraction.
[1279] Circulating tumor fraction estimate (ctFE) was determined using a novel method, Off-Target Tumor Estimation Routine (OTTER), from off-target reads uniformly distributed across the human reference genome. As described above, CNVkit was run on each sample, and segments were assigned via circular binary segmentation (CBS). Olshen et al., 2004, “Circular binary segmentation for the analysis of array-based DNA copy number data,” Biostatistics, (5), pg. 557. Segments were then fit to integer copy states via an expectation-maximization algorithm using the sum of squared error of the segment log2 ratios (e.g., normalized to genomic interval size) to expected ratios given a putative copy state and tumor purity. Estimates were confirmed by comparing results against LPWGS of the original patient isolate. As such, results are shown using randomly selected, de-identified samples.
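The fitting step above can be illustrated with a simplified sketch. It is not the OTTER implementation: an exhaustive grid search over tumor purity stands in for the expectation-maximization algorithm, a diploid normal background is assumed, and weighting each segment's squared error by its length is one assumed reading of "normalized to genomic interval size". All function and parameter names are hypothetical.

```python
import math


def expected_log2_ratio(copies, purity, normal_copies=2):
    """Expected log2 ratio of a segment with `copies` tumor copies when a
    fraction `purity` of the DNA is tumor-derived (diploid normal assumed)."""
    mixed = purity * copies + (1.0 - purity) * normal_copies
    if mixed <= 0:
        return float("-inf")
    return math.log2(mixed / normal_copies)


def fit_tumor_fraction(segments, purities=None, copy_states=range(0, 9)):
    """segments: list of (log2_ratio, length_bp) tuples.
    Returns (purity, copy_state_per_segment, weighted_sse) for the purity
    on the grid with the lowest length-weighted sum of squared error."""
    if purities is None:
        purities = [i / 100 for i in range(1, 100)]
    total_len = sum(length for _, length in segments)
    best = None
    for purity in purities:
        sse = 0.0
        states = []
        for log2_ratio, length in segments:
            # Assign the integer copy state that best explains this segment.
            err, state = min(
                ((log2_ratio - expected_log2_ratio(c, purity)) ** 2, c)
                for c in copy_states
            )
            sse += err * length / total_len
            states.append(state)
        if best is None or sse < best[2]:
            best = (purity, states, sse)
    return best
```

Purity and copy state are not always jointly identifiable from log2 ratios alone (for example, a single-copy gain at one purity can look like a two-copy gain at half that purity), so a production method would need additional constraints that this sketch omits.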
[1280] Clinical profiling of liquid biopsy patients.
[1281] De-identified molecular and abstracted clinical data were evaluated in a cohort of 1,000 patients randomly selected from a specific reference clinicogenomic database. All data were de-identified in accordance with the Health Insurance Portability and Accountability Act (HIPAA). Dates used for analyses were relative to the first liquid biopsy sequencing date of each patient, and year of the first sequencing date was randomly off-set. Variants included in the analyses were those classified as pathogenic or likely pathogenic, and further divided into actionable if matched to diagnostic, prognostic or therapeutic evidence or biologically relevant. Outcomes were determined according to the most recent clinical response noted in patient records. The study protocol was submitted to the Advarra Institutional Review Board (IRB), which determined the research was exempt from IRB oversight and approved a waiver of HIPAA authorization for this study.
[1282] Example 4 - Results of Validating Liquid Biopsy Assay
[1283] Liquid biopsy validation summary.
[1284] The liquid biopsy oncology assay is a 105-gene hybrid capture NGS panel designed to detect actionable somatic variant targets in plasma. Referring to Figures 16A through 16C, the liquid biopsy assay detects mutations in four variant classes, including: single nucleotide variants (SNVs) and insertion-deletions (indels) in all 105 genes, copy number variants (CNVs) in 6 genes, and chromosomal rearrangements in 7 genes. To validate the liquid biopsy assay, a total of 188 samples were sequenced. The runs generated an average of 261.7 M ± 40.7 M total reads with 130.7 M ± 20.3 M read pairs and a unique median read depth of 4999.128 ± 1288.843. The average percent of mapped reads across all runs was 99.876% ± 0.0078.
[1285] Referring to Figure 13, the analytical sensitivity determined for all SNVs, indels, CNVs, and rearrangements targeted in the reference samples is provided. Accordingly, SNVs were reliably detected at greater than or equal to 0.25% VAF with 30 ng of input DNA (93.75% [45/48] sensitivity), indels at greater than or equal to 0.5% VAF with 30 ng (95.83% [23/24] sensitivity), CNVs at greater than or equal to 0.5% VAF with 10 ng (100.00% [8/8] sensitivity), and rearrangements at greater than or equal to 1% VAF with 30 ng (90% [9/10] sensitivity). Referring to Figure 14, analytical specificity is provided, which was 100% for SNVs, indels, and rearrangements, and 96.2% for CNVs, on samples with greater than or equal to 0.25% VAF with 30 ng of input DNA.
[1286] Accordingly, intra-assay and inter-assay concordance between the replicates in the present disclosure was 100% for SNVs, indicating a high degree of repeatability and reproducibility. Moreover, the inter-instrument concordance was 96.70% for SNVs and 100% for indels, with a combined concordance of 96.83% across instruments. Additionally, interfering substances including genomic DNA, ethanol, and isopropanol did not cause a change in the detection of variants. Concordance between controls and samples with interfering substances was high (e.g., 100%) among samples that passed filtering and were above the LOD.
[1287] The accuracy of the liquid biopsy assay compared to orthogonal assays.
[1288] Referring to Figure 15, to evaluate analytical accuracy, the liquid biopsy assay was compared to the Roche AVENIO ctDNA assay. In 30 ng cfDNA samples analyzed by the liquid biopsy assay and the AVENIO ctDNA assay (n = 40), sensitivity for SNVs, indels, CNVs, and rearrangements was 94.8%, 100%, 100%, and 100%, respectively. Of the 6 SNVs that were not detected, 5 were called but filtered out due to insufficient evidence. In 10 ng samples, sensitivity for SNVs, indels, CNVs, and rearrangements was 91.9%, 100%, 80%, and 100%, respectively. Of the 7 SNVs that were not detected, 6 were present in sequencing data but filtered out due to insufficient evidence.
[1289] Referring to Figures 8A and 8B, to further validate the liquid biopsy assay results, patients with reported variants KRAS G12D (n = 12), TERT c.-124 (n = 7), TERT c.-146 (n = 5), TP53 R273H (n = 7), and TP53 R175H (n = 7) were selected for analysis by ddPCR. Liquid biopsy NGS VAF was compared with ddPCR VAF to determine concordance. Accordingly, 100% PPV and a high correlation between ddPCR results and liquid biopsy VAF (R² = 0.892) were observed, and the correlation was also high for individual variants such as KRAS G12D (R² = 0.970), as shown in Figures 8A and 8B. These results indicate the liquid biopsy assay of the present disclosure can be used to accurately identify hotspot mutations. Specifically, Figure 8A illustrates results of an inter-assay comparison between liquid biopsy, ddPCR, and solid tumor results for patient samples with selected variants (n = 38) analyzed by ddPCR and compared with liquid biopsy variant allele fraction (VAF), resulting in high correlation overall (R² = 0.892). Figure 8B illustrates results of an inter-assay comparison between liquid biopsy, ddPCR, and solid tumor results for patient samples with individual variants such as KRAS G12D (n = 12, R² = 0.970).
[1290] The concordance between liquid biopsy and solid tumor tissue assay.
[1291] Analytical sensitivity and specificity were compared in matched solid tumor and liquid biopsy tests from 55 patients. Since solid tumor matched samples include both tumor tissue and buffy coat (e.g., normal comparator), a specific classification strategy was utilized to determine and exclude germline variants from the analysis. Beaubier et al., 2019, “Clinical validation of the xT next-generation targeted oncology sequencing assay,” Oncotarget, 10(24), pg. 2384. Removing intronic and synonymous variants, benign and likely benign variants, as well as variants below the LOD for solid tumor and liquid biopsy assays resulted in 145 concordant SNVs, 20 concordant indels, and 11 concordant CNVs. In addition, 66 SNVs, 11 indels, and 8 CNVs were reported in the solid tumor assay but not the liquid biopsy assay, and 209 SNVs, 14 indels, and 7 CNVs were reported in the liquid biopsy assay but not the solid tumor assay. Accordingly, the specificity of the liquid biopsy assay was 100.00% for SNVs and indels and 96.67% for CNVs. Referring to Figure 17, a Bayesian dynamic filtering methodology was utilized to further reduce discordance by 11.45%, improving the specificity of variant calling in the liquid biopsy assay. The overall sensitivity of the liquid biopsy assay compared to the solid tumor assay was 68.18% for SNVs and indels and 57.89% for CNVs. When limiting analysis to clinically actionable targets, 107 concordant and 37 discordant variants were reported, for a sensitivity of 74.31%.
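As a worked check of the sensitivities reported above, the following sketch recomputes them from the stated counts, treating variants called in both assays as true positives and variants called only in the solid tumor assay as false negatives, per the definition given earlier. The counts come from the text; the function name and dictionary layout are illustrative only.

```python
def concordance_sensitivity(concordant, tissue_only):
    """Sensitivity = true positives / (true positives + tissue-only calls)."""
    return concordant / (concordant + tissue_only)


# Counts taken from the paragraph above.
counts = {
    "SNVs and indels": (145 + 20, 66 + 11),
    "CNVs": (11, 8),
    "actionable variants": (107, 37),
}

for label, (concordant, tissue_only) in counts.items():
    pct = 100 * concordance_sensitivity(concordant, tissue_only)
    print(f"{label}: {pct:.2f}%")
# prints 68.18%, 57.89%, and 74.31% respectively
```

The SNV and indel figure is reproduced only when the two classes are pooled (165 concordant, 77 tissue-only), which indicates that the reported 68.18% is a combined value.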
[1292] Furthermore, the sample classification of reportable variants was compared between matched samples with liquid biopsy and solid tumor testing. Referring to Figure 8C, variants were considered CH variants if found in the plasma as well as in the solid tumor normal sample but were not present at levels consistent with germline variation. Accordingly, this classification of germline and CH variants in liquid biopsy is possible with a corresponding solid tumor assay or a germline sequencing analysis from the buffy coat. Notably, two samples have a large number of variants only detected in liquid biopsy, many of which are at low VAFs. These samples were subsequently determined to have very high tumor mutational burdens (TMBs) in their corresponding solid tumor analyses. Accordingly, the large number of liquid biopsy variants at low VAFs and high TMBs suggest that these tumors may be more heterogeneous and that some variants are more easily detected in blood. Specifically, Figure 8C illustrates results of an inter-assay comparison between liquid biopsy, ddPCR, and solid tumor results for sample classification of reportable variants, in which microsatellite instability (MSI) was detected by the liquid biopsy assay in six out of sixteen MSI-high patients, with 100% as indicated by the one or more blue dots depicted above the dotted line.
[1293] Finally, liquid biopsy validation samples were utilized to assess microsatellite instability in patients whose MSI status was previously confirmed by a specific reference clinically validated solid tumor MSI test or immunohistochemistry. Referring to Figure 8D, the liquid biopsy assay reported MSI-H status in 37.5% (6/16) of orthogonally confirmed MSI-H patients at 100% (6/6) positive predictive value. Accordingly, comparisons between the solid tumor and liquid biopsy assays demonstrate the strengths of the liquid biopsy assay and the added value of using multiple assays to detect genomic drivers of cancer.
Specifically, Figure 8D illustrates results from liquid biopsy and solid tumor assays compared in patients who received both tests (n = 55) of Figure 8A and Figure 8B, in which the percent circulating tumor DNA VAF, depicted above the dashed line, and number of reportable variants detected, depicted below the dashed line, for each individual patient were categorized by assay type and CHIP or germline status.
[1294] OTTER, a novel method for estimating tumor fraction.
[1295] An accurate measure of tumor fraction can provide an improved understanding of variants identified through liquid biopsy testing. In the present disclosure, a novel method, Off-Target Tumor Estimation Routine (OTTER), for determining a more accurate circulating tumor fraction estimate (ctFE) was developed. Referring to Figures 9A and 9B, OTTER ctFE was compared with VAFs from 1,000 random patient samples across cancer types, showing that liquid biopsy ctFE correlates with max pathogenic VAF and median VAF. Referring to Figures 9C through 9F, removing germline variants and amplified regions from these analyses further increased the correlation. Plausible liquid biopsy ctFE estimates are expected to be greater than or equal to the maximal somatic VAF in a sample that is not on an amplified region. Overall, after removing germline variants and variants on amplified regions, 90.8% of median VAFs are less than or equal to the corresponding liquid biopsy ctFEs. Referring to Figure 9H, the distribution of liquid biopsy ctFE for the liquid biopsy 1,000 cohort is provided. Accordingly, the median ctFE was 0.07 with a mean ctFE of 0.12.
[1296] In addition to VAF, LPWGS is increasingly utilized to estimate tumor fractions and is thought to be a more accurate measure than VAF. Adalsteinsson et al., 2017; Chen et al., 2019, “Next-generation sequencing in liquid biopsy: cancer screening and early detection,” Hum Genomics, (13), pg. 34. Referring to Figure 9G, comparison of the LPWGS ichorCNA-predicted circulating tumor fraction to the OTTER ctFE in matched patient samples (n = 375) showed a strong correlation between methods (R² = 0.843, P = 4.71e-152). Accordingly, this correlation indicates that OTTER ctFEs are highly concordant with estimates using LPWGS but can be determined directly from the targeted-panel sequencing without requiring additional sequencing.
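A correlation of this kind between two paired sets of tumor fraction estimates can be computed as sketched below. The sketch assumes that the reported R² denotes the squared Pearson correlation coefficient, which the text does not state explicitly, and the function name is hypothetical. No patient data are reproduced here.

```python
def pearson_r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences,
    e.g. LPWGS-derived tumor fractions (xs) and panel-derived ctFEs (ys)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return (sxy * sxy) / (sxx * syy)
```

Squared Pearson correlation measures linear association only; two methods can correlate strongly while differing by a constant offset or scale, so agreement in absolute value would need a separate check.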
[1297] Specifically, Figure 9A illustrates results from circulating tumor fraction estimate (ctFE) and variant allele fraction (VAF) in which ctFE of liquid biopsy-sequenced patients (n = 1,000) was correlated with max pathogenic VAF (R² = 0.38). Figure 9B illustrates results in which ctFE of liquid biopsy-sequenced patients (n = 1,000) was correlated with median VAF (R² = 0.35). Figure 9C illustrates results in which germline variants were removed, increasing the correlation with max pathogenic VAF (R² = 0.40). Figure 9D illustrates results in which germline variants were removed, without increasing the correlation with median VAF (R² = 0.35). Figure 9E illustrates results in which amplified regions were removed from these analyses, increasing the correlation with max pathogenic VAF (R² = 0.41). Figure 9F illustrates results in which amplified regions were removed from these analyses, increasing the correlation with median VAF (R² = 0.36). Figure 9G illustrates that, among the samples that also underwent low-pass whole genome sequencing (LPWGS, n = 375), a strong correlation between LPWGS-predicted tumor fraction and ctFE (R² = 0.843) was found. Furthermore, Figure 9H illustrates the overall distribution of ctFE across the cohort of liquid biopsy-sequenced patients (n = 1,000; median ctFE = 0.07, mean ctFE = 0.12, and standard deviation = 0.15).
[1298] Retrospective clinical profiling of the liquid biopsy assay against a 1,000-subject cohort.
[1299] To evaluate the clinical utility of the liquid biopsy, de-identified molecular and clinical data from 1,000 samples across cancer types were selected for clinical profiling. This included 55.7% female and 44.3% male patients, with a median age of 66 years, and interquartile range of 15. Referring to Figure 18, this cohort included patients from 24 cancer categories, with breast (n = 254), colorectal (n = 98), lung (n = 241), pancreatic (n = 83), and prostate (n = 96) being the most common. Referring to Figure 10A, the median ctFE predicted by OTTER was 0.07 for all cancer types, with the exception of prostate, which was 0.06. Referring to Figure 10B, in this cohort, 8,099 mutations were reported, of which 2,732 were pathogenic, and 2,238 were clinically actionable. Specifically, Figure 10A illustrates circulating tumor fraction estimate (ctFE) and mutational landscape by cancer type, in which median ctFE among the most common cancer types was 0.07, with the exception of prostate (ctFE = 0.06). Figure 10B illustrates circulating tumor fraction estimate (ctFE) and mutational landscape by cancer type, in which variants are categorized as reportable, pathogenic, or actionable. Across all patients, the most commonly mutated gene was TP53. The heatmap was normalized within rows to depict the most prevalent variants detected for each common cancer type in the cohort (breast n = 254, colorectal n = 98, lung n = 241, pancreatic n = 83, and prostate n = 96).
[1300] Accordingly, the most frequently mutated gene in the liquid biopsy 1,000 cohort was TP53 (51.1% of patients). The most commonly mutated genes were TP53, PIK3CA, ESR1, BRCA2, NF1, ATM, and APC in breast cancer; TP53, EGFR, ATM, and KRAS in lung cancer; and TP53, APC, and KRAS in colorectal cancer. These findings are consistent with existing literature on commonly mutated genes in each cancer type and suggest the liquid biopsy test accurately detects variants of interest to the broader cancer community. van Helden et al., 2019; Dal Maso et al., 2019; Savli et al., 2019, “TP53, EGFR and PIK3CA gene variations observed as prominent biomarkers in breast and lung cancer by plasma cell-free DNA genomic testing,” J Biotechnol, (300), pg. 87; Cheng et al., 2019, “Liquid Biopsy Detects Relapse Five Months Earlier than Regular Clinical Follow-Up and Guides Targeted Treatment in Breast Cancer,” Case Rep Oncol Med, pg. 6545298; Keup et al., 2019, “Targeted deep sequencing revealed variants in cell-free DNA of hormone receptor-positive metastatic breast cancer patients,” Cell Mol Life Sci, print.; Li et al., 2019, “Genomic profiling of cell-free circulating tumor DNA in patients with colorectal cancer and its fidelity to the genomics of the tumor biopsy,” J Gastrointest Oncol, (10), pg. 831.
[1301] Advanced disease is associated with higher estimated tumor fraction.
[1302] A goal of liquid biopsy assays of the present disclosure is to more efficiently monitor treatment response and predict disease progression in patients over time. To establish proof of concept, the association of ctFE with advanced disease states was investigated. Accordingly, referring to Figure 11A, a significant difference in ctFE between stages (P = 2.97e-5) was determined. However, since the majority of patients had advanced disease at the time of testing, more early stage samples are necessary to further verify these findings. Referring to Figure 11B, ctFE in patients with metastatic disease was evaluated to determine whether ctFE increases when distant sites are affected. Indeed, referring to Figure 11C, patients with no metastatic lesions had a significantly lower ctFE than patients with one or more distant sites (P = 4.77e-7), further highlighting the potential of ctFE for disease monitoring. Specifically, Figure 11A illustrates circulating tumor fraction estimate (ctFE) according to stage and number of distant metastases among the liquid biopsy 1,000 cohort, in which there was a significant difference in ctFE between stages (Kruskal-Wallis P = 2.97e-5). Accordingly, patients with stage 4 cancer (n = 879, median ctFE = 0.07) had a higher ctFE than those with stages 1 (n = 20, median ctFE = 0.06), 2 (n = 25, median ctFE = 0.06), or 3 (n = 76, median ctFE = 0.06). Figures 11B and 11C illustrate that ctFE increased with the number of metastatic distant sites (Mann-Whitney U test P = 7.57e-7), and there was a significant difference in ctFE between patients with no metastatic lesions (n = 116) and those with 1 or more distant sites affected (n = 884, Mann-Whitney U test P = 2.12e-5). The sensitivity and specificity shown on the right-hand side of Figure 11C represent the probability that a binary metastasis status prediction is correct at a given ctFE threshold. Accordingly, the model predicts metastasis with greater confidence at higher ctFE.
[1303] Estimated tumor fraction correlates with response to treatment.
[1304] To determine how ctFE changes in response to treatment, comparisons between ctFE with the most recent clinical response outcome were determined. Accordingly, referring to Figure 12A, patients classified as having complete response were determined to have a significantly lower median ctFE of 0.05, compared to 0.06, 0.06, and 0.08 in patients with stable disease, partial response, and progressive disease, respectively. Additionally, referring to Figure 12B, patients with multiple liquid biopsy tests were determined to have large differences in ctFE between test dates. For example, referring to Figure 12C, one breast cancer case had a ctFE of 0.05 at initial liquid biopsy testing. After treatment with bevacizumab and paclitaxel, clinical notes indicate the patient was classified as having stable disease. Eribulin treatment was started shortly after, but the patient was later diagnosed with progressive disease. A second liquid biopsy test, which was performed approximately 200 days after the initial liquid biopsy test, revealed a ctFE of 0.26, which supports the progressive disease diagnosis. Alternatively, in a breast cancer patient with progressive disease who was treated with investigational new drug therapies, the patient’s status was updated to stable disease shortly after the first liquid biopsy test, which revealed a ctFE of 0.05. Approximately 100 days later, the patient’s second liquid biopsy test revealed a ctFE of 0.09. The patient likely received no further treatment before the third liquid biopsy test, which revealed a ctFE of 0.27, suggesting this patient’s disease had progressed. 
Specifically, Figure 12A illustrates circulating tumor fraction estimate (ctFE) and abstracted clinical outcomes in a sub-cohort of the liquid biopsy 1,000 cohort (n = 388) in which patients with complete response (n = 9, ctFE = 0.05) exhibited lower ctFE than those with progressive disease (n = 298, ctFE = 0.08), partial response (n = 56, ctFE = 0.06), or stable disease (n = 25, ctFE = 0.06). Figure 12B illustrates that ctFE was also assessed temporally among a few randomly selected patients with multiple liquid biopsy tests throughout the course of treatment (n = 26), with most patients showing large differences in ctFE between test dates. Figure 12C illustrates four exemplary cases highlighting the utility of ctFE in relation to treatment course and disease status.
[1305] In the case of a lung cancer patient who underwent multiple rounds of treatment, including carboplatin, pemetrexed, and etoposide, a decrease in ctFE between liquid biopsy tests (0.72 to 0.47) was determined. However, the ctFE was still extremely high after treatment, making progressive disease likely. Indeed, the patient was classified as having progressive disease by their oncologist shortly before the second liquid biopsy test date. Alternatively, a patient who had undergone treatment with osimertinib and crizotinib approximately 50 days before the first liquid biopsy test showed very little change in ctFE between test dates (0.3-0.11) and was classified as stable shortly before the second liquid biopsy test. Referring to Figures 11A through 12C, while conclusions about the larger population cannot be drawn from these individual cases, the changes in ctFE in response to treatment are consistent with the above analyses showing that higher ctFEs are associated with advanced disease. Additionally, these results illustrate how serial testing can be beneficial for precision oncology in individual patients. These results further highlight the need for longitudinal studies with serial liquid biopsy testing in a larger cohort of patients.
[1306] While liquid biopsy is a promising tool for improving outcomes in precision oncology, there are challenges that must be overcome before it can replace large panel NGS tissue genotyping. For example, in early stage disease, when treatments have much higher success rates, many patients have low ctDNA fractions that may be below the LOD for liquid biopsies, limiting clinical utility because of the risk of false negatives. Bettegowda et al., 2014, “Detection of circulating tumor DNA in early- and late-stage human malignancies,” Sci Transl Med, (6), pg. 224; Xue et al., 2019, “Early detection and monitoring of cancer in liquid biopsy: advances and challenges,” Expert Rev Mol Diagn, (19), pg. 273; Hennigan et al., 2019, “Low Abundance of Circulating Tumor DNA in Localized Prostate Cancer,” JCO Precis Oncol, (3), print; Abbosh et al., 2018, “Early stage NSCLC - challenges to implementing ctDNA-based screening and MRD detection,” Nat Rev Clin Oncol, (15), pg. 577. Consequently, most studies to date have focused on late stage patients for assay validation and research. Furthermore, while validation studies of existing liquid biopsy assays have shown high sensitivity and specificity, few studies have corroborated results with orthogonal methods, or between NGS testing platforms. Cheng et al., 2019, “Clinical Validation of a Cell-Free DNA Gene Panel,” J Mol Diagn, (21), pg. 632; Hanibuchi et al., 2019, “Development, validation, and comparison of gene analysis methods for detecting EGFR mutation from non-small cell lung cancer patients-derived circulating free DNA,” Oncotarget, (10), pg. 3654; Van Laar et al., 2018, “Development and validation of a plasma-based melanoma biomarker suitable for clinical use,” Br J Cancer, (118), pg. 857; Odegaard et al., 2018, “Validation of a Plasma-Based Comprehensive Cancer Genotyping Assay Utilizing Orthogonal Tissue- and Plasma-Based Methodologies,” Clin Cancer Res, (24), pg. 3539; Clark et al., 2018, “Analytical Validation of a Hybrid Capture-Based Next-Generation Sequencing Clinical Assay for Genomic Profiling of Cell-Free Circulating Tumor DNA,” J Mol Diagn, (20), pg. 686; Plagnol et al., 2018, “Analytical validation of a next generation sequencing liquid biopsy assay for high sensitivity broad molecular profiling,” PLoS One, (13), pg. 0193802. Kuderer et al. compared commercially available liquid and tissue NGS platforms and found only 22% concordance in genetic alterations. Kuderer et al., 2017, “Comparison of 2 Commercially Available Next-Generation Sequencing Platforms in Oncology,” JAMA Oncol, (3), pg. 996. Other reports of liquid biopsy based studies are limited by comparison to non-comprehensive tissue testing algorithms including Sanger sequencing, small NGS hotspot panels, PCR, and FISH, which may not contain all NCCN guideline genes in their reportable range, thus suffering in comparison to a more comprehensive liquid biopsy assay. Leighl et al., 2019. Since the 105-gene liquid biopsy assay is a subset of the 648-gene solid tumor tissue-based assay, the concordance data presented herein (74.31% for actionable variants) represent a direct comparison to a comprehensive NGS test that includes the entire reportable range of the liquid biopsy assay. Beaubier et al., 2019, “Integrated genomic profiling expands clinical options for patients with cancer,” Nat Biotechnol, (37), pg. 1351. While this concordance is high relative to previous reports, 25.69% of actionable variants would have been missed if only one of the tests were performed. Thus, liquid biopsies provide the greatest value to patients when used in combination with standard tissue genotyping. Furthermore, having both tests enabled additional analyses to exclude germline and CH variants, significantly improving specificity.
[1307] Accordingly, the systems and methods of the present disclosure provide analytical and clinical validation of the liquid biopsy assay. The systems and methods of the present disclosure provide high accuracy compared to orthogonal methods, including tissue biopsy, Avenio liquid biopsy, ddPCR, and LPWGS. The systems and methods of the present disclosure also provide improvements upon existing methodologies for estimating circulating tumor fraction. Notably, in combination with real-world clinical data, the systems and methods of the present disclosure demonstrate the value and suitability of liquid biopsy testing for monitoring disease progression, predicting objective measures of response, and assessing treatment outcomes. As such, the results obtained through validating the systems and methods of the present disclosure strongly support utilizing the liquid biopsy assay in routine monitoring of cancer patients with advanced disease.
[1308] Example 5 - A retrospective analysis on the prognostic value of Off-Target Tumor Estimation Routine, a novel circulating tumor fraction (ctFE) calculation, in patients with advanced prostate cancer.
[1309] Prostate-specific antigen (PSA) is a biomarker for monitoring tumor burden and treatment response. However, due to multiple variable factors (e.g., variation in PSA production by prostate cancer (PC) cells, PSA level variation between patients, and PSA level variation during the course of the disease), non-invasive biomarkers are needed for better prognostication and assessing therapeutic response. We recently developed the Off-Target Tumor Estimation Routine method, described herein, which calculates circulating tumor fraction estimates (ctFE) using on- and off-target reads from a targeted-panel liquid biopsy assay (DNA-Seq of 105 genes at 5,000x depth in circulating tumor DNA [ctDNA] from peripheral blood samples). Here, we analyze the prognostic value of ctFE for advanced PC patients undergoing liquid biopsy testing.
[1310] We retrospectively analyzed 108 NGS results from 80 patients treated at Ben Taub Hospital (BTH) with locally advanced, biochemically recurrent or metastatic prostate cancer. We calculated ctFE for all patients using this method, which evaluates the copy state of regions across the genome. Survival analysis was based on a 6-month follow-up. For prognostic analysis, the highest ctFE was used for each patient with >1 xF result. Patients were classified as: 1. Low (ctFE-L: ctFE < 0.02); 2. High (ctFE-H: ctFE > 0.02); or 3. Converters (ctFE-H to L: ctFE drop below 0.02 during follow-up). In 16 metastatic PC patients receiving first-line androgen deprivation therapy (1LADT, augmented with abiraterone/prednisone), pre-treatment and on-treatment ctFE data as well as clinical follow-up (median: 12 months) were examined.
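The three-way classification described above can be illustrated with a minimal Python sketch. The function name `classify_patient`, the assumption that results are supplied in chronological order, and the treatment of a value of exactly 0.02 as high (the text defines only values below and above 0.02) are illustrative assumptions, not part of the described study.

```python
from typing import Sequence

CTFE_THRESHOLD = 0.02  # cutoff stated in the text


def classify_patient(ctfe_series: Sequence[float],
                     threshold: float = CTFE_THRESHOLD) -> str:
    """Classify a patient from chronologically ordered ctFE results.

    Hypothetical helper: a value exactly equal to the threshold is
    treated as high here, which is an assumption.
    """
    if not ctfe_series:
        raise ValueError("at least one ctFE result is required")

    # The highest ctFE decides between the low and high groups.
    if max(ctfe_series) < threshold:
        return "ctFE-L"

    # A converter has a high estimate followed by a later result
    # that drops below the threshold.
    first_high = next(i for i, v in enumerate(ctfe_series) if v >= threshold)
    if any(v < threshold for v in ctfe_series[first_high + 1:]):
        return "ctFE-H to L"
    return "ctFE-H"
```

For instance, a series of `[0.30, 0.01]` would fall in the converter group under these assumptions, while `[0.30, 0.11]` would remain in the high group.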
[1311] Results: 65/80 (81%) patients were classified as ctFE-L. Of these, 64 (98%) were alive at the 6-month follow-up, and one was deceased due to a non-PC-related cause. 15/80 (19%) patients had at least one ctFE-H estimate. Of these, 7 (47%) were deceased due to PC-related causes within 6 months (range: 2-172 days, median: 15 days), while the remaining 8 (53%) showed ctFE-H to -L conversion in response to treatment and were alive at the 6-month follow-up. Among 16 metastatic PC patients, 1LADT lowered ctFE in 12 patients; of these, 10 patients continued responding to treatment during the follow-up period. The 4 patients whose ctFE did not drop became castration-resistant during this period.
[1312] Conclusions: Our data suggest that ctFE may predict PC patient overall survival. ctFE-L status is associated with patient survival at 6-month follow-up. Conversely, ctFE-H status is associated with death unless the implementation of a new active treatment can convert the patient to ctFE-L upon rechecking. Changes in ctFE may also correlate with response to 1LADT. Our study illustrates the potential of using ctFE as a tool for PC prognostication.
REFERENCES CITED AND ALTERNATIVE EMBODIMENTS
[1313] All references cited herein are incorporated herein by reference in their entirety and for all purposes to the same extent as if each individual publication or patent or patent application was specifically and individually indicated to be incorporated by reference in its entirety for all purposes.
[1314] Log2 transformed copy ratios, log2 copy ratios, log2-transformed depths, log2-transformed read depths, log2 depths, corrected log2 depths, log2 ratios, log2 read depths, and log2 depth correction values have been discussed herein by way of example. In each instance where such a term is used, it will be appreciated that log base 2 is presented by way of example only and that the present disclosure is not so limited. Indeed, logarithms to any base N may be used (e.g., where N is a positive number greater than 1), and thus the present disclosure fully supports logN transformed copy ratios, logN copy ratios, logN-transformed depths, logN-transformed read depths, logN depths, corrected logN depths, logN ratios, logN read depths, and logN depth correction values as respective substitutes for log2 transformed copy ratios, log2 copy ratios, log2-transformed depths, log2-transformed read depths, log2 depths, corrected log2 depths, log2 ratios, log2 read depths, and log2 depth correction values.
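The equivalence between bases follows from the change-of-base identity, logN(x) = log2(x) / log2(N). A minimal Python sketch is given below for illustration; the function name `convert_log2_value` is hypothetical and is not drawn from the disclosure.

```python
import math


def convert_log2_value(value_log2: float, base_n: float) -> float:
    """Re-express a log2-scaled quantity (e.g., a log2 copy ratio) in base N.

    Uses the change-of-base identity logN(x) = log2(x) / log2(N).
    Hypothetical helper; N must be greater than 1, as stated in the text.
    """
    if base_n <= 1:
        raise ValueError("base N must be greater than 1")
    return value_log2 / math.log2(base_n)


# A copy ratio of 0.5 has a log2 copy ratio of -1.0; the same ratio
# expressed with the natural logarithm is ln(0.5).
natural_log_ratio = convert_log2_value(-1.0, math.e)
```

Because the conversion is a multiplication by a positive constant, the ordering of values is preserved; any threshold applied to log2 values would need to be rescaled by the same factor.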
[1315] The present invention can be implemented as a computer program product that comprises a computer program mechanism embedded in a non-transitory computer readable storage medium. These program modules can be stored on a CD-ROM, DVD, magnetic disk storage product, USB key, or any other non-transitory computer readable data or program storage product.
[1316] Many modifications and variations of this invention can be made without departing from its spirit and scope, as will be apparent to those skilled in the art. The specific embodiments described herein are offered by way of example only. The embodiments were chosen and described in order to best explain the principles of the invention and its practical applications, to thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated. The invention is to be limited only by the terms of the appended claims, along with the full scope of equivalents to which such claims are entitled.
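By way of further illustration only, the filter-based copy number validation recited in the claims of this disclosure (bin-level sequence ratios summarized per segment, followed by a central tendency filter, a confidence filter, and a central tendency-plus-deviation filter) might be sketched in Python as follows. Everything specific in this sketch is an assumption: the use of the median and population standard deviation, the threshold values, the restriction to amplification calls on log-scaled ratios, and the rule that an annotation is validated only when no filter fires. The function names are hypothetical.

```python
import statistics
from typing import Dict, Sequence, Tuple


def segment_stats(bin_ratios: Sequence[float]) -> Tuple[float, float]:
    """Return (central tendency, dispersion) of a segment's bin-level ratios.

    Median and population standard deviation are illustrative choices.
    """
    return statistics.median(bin_ratios), statistics.pstdev(bin_ratios)


def validate_amplification(
    segment_bins: Sequence[float],
    chromosome_bins: Sequence[float],
    ratio_threshold: float = 0.5,       # hypothetical value
    confidence_threshold: float = 0.3,  # hypothetical value
    n_deviations: float = 2.0,          # hypothetical value
) -> Tuple[bool, Dict[str, bool]]:
    """Apply three filters to a segment annotated as amplified.

    A filter "fires" when its threshold is not satisfied. The validation
    pattern assumed here: validated only if no filter fires.
    """
    seg_center, seg_dispersion = segment_stats(segment_bins)
    chrom_center, chrom_dispersion = segment_stats(chromosome_bins)

    fired = {
        # (1) segment central tendency must reach a fixed ratio threshold
        "central_tendency": seg_center < ratio_threshold,
        # (2) segment dispersion must not exceed the confidence threshold
        "confidence": seg_dispersion > confidence_threshold,
        # (3) segment central tendency must reach a threshold derived from
        #     the chromosome-wide central tendency plus scaled dispersion
        "central_tendency_plus_deviation":
            seg_center < chrom_center + n_deviations * chrom_dispersion,
    }
    return (not any(fired.values())), fired
```

A deletion call would mirror the comparisons in filters (1) and (3); other validation patterns, such as tolerating a single fired filter, fit the same structure by changing only the final expression.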

Claims (163)

WHAT IS CLAIMED IS:
1. A method of validating a copy number variation in a test subject, the method comprising: at a computer system having one or more processors, and memory storing one or more programs for execution by the one or more processors:
(A) obtaining, from a first sequencing reaction, a corresponding sequence of each cell-free DNA fragment in a first plurality of cell-free DNA fragments in a liquid biopsy sample of the test subject, thereby obtaining a first plurality of sequence reads, wherein the first plurality of sequence reads comprises at least 100,000 sequence reads;
(B) aligning each respective sequence read in the first plurality of sequence reads to a reference sequence for the species of the subject;
(C) determining:
(1) a plurality of bin-level sequence ratios, each respective bin-level sequence ratio in the plurality of bin-level sequence ratios corresponding to a respective bin in a plurality of bins, wherein: each respective bin in the plurality of bins represents a corresponding region of a reference genome for the species of the subject, and each respective bin-level sequence ratio in the plurality of bin-level sequence ratios is determined from a comparison of the first plurality of sequence reads to sequence reads from one or more reference samples;
(2) a plurality of segment-level sequence ratios, each respective segment-level sequence ratio in the plurality of segment-level sequence ratios corresponding to a segment in a plurality of segments, wherein: each respective segment in the plurality of segments represents a corresponding region of the reference genome for the species of the subject encompassing a subset of adjacent bins in the plurality of bins, and each respective segment-level sequence ratio in the plurality of segment-level sequence ratios is determined from a measure of central tendency of the plurality of bin-level sequence ratios corresponding to the subset of adjacent bins encompassed by the respective segment; and
(3) a plurality of segment-level measures of dispersion, each respective segment-level measure of dispersion in the plurality of segment-level measures of dispersion (i) corresponding to a respective segment in the plurality of segments and (ii) determined using the plurality of bin-level sequence ratios corresponding to the subset of adjacent bins encompassed by the respective segment; and
(D) validating a copy number status annotation of a respective segment in the plurality of segments that is annotated with a copy number variation by applying the first dataset to an algorithm having a plurality of filters, the plurality of filters comprising:
(1) a measure of central tendency bin-level sequence ratio filter that is fired when a measure of central tendency of the plurality of bin-level sequence ratios corresponding to the subset of bins encompassed by the respective segment fails to satisfy one or more bin-level sequence ratio thresholds;
(2) a confidence filter that is fired when the segment-level measure of dispersion corresponding to the respective segment fails to satisfy a confidence threshold; and
(3) a measure of central tendency-plus-deviation bin-level sequence ratio filter that is fired when a measure of central tendency of the plurality of bin-level sequence ratios corresponding to the subset of bins encompassed by the respective segment fails to satisfy one or more measure of central tendency-plus-deviation bin-level sequence ratio thresholds, wherein the one or more measure of central tendency-plus-deviation bin-level copy ratio thresholds are derived from (i) a measure of central tendency of the bin-level sequence ratios corresponding to the plurality of bins that map to the same chromosome of the reference genome for the species of the subject as the respective segment, and (ii) a measure of dispersion across the bin-level sequence ratios corresponding to the plurality of bins that map to the respective chromosome; wherein rejecting or validating the copy number status annotation of the respective segment is based on a predetermined pattern of firing or lack of firing of each of the filters in the plurality of filters.
2. A method of validating a copy number variation in a test subject, the method comprising: at a computer system having one or more processors, and memory storing one or more programs for execution by the one or more processors:
(A) obtaining, from a first sequencing reaction, a corresponding sequence of each cell-free DNA fragment in a first plurality of cell-free DNA fragments in a liquid biopsy sample of the test subject, thereby obtaining a first plurality of sequence reads;
(B) aligning each respective sequence read in the first plurality of sequence reads to a reference sequence for the species of the subject, wherein the reference sequence for the species represents at least 1 Mb of the genome for the species;
(C) determining:
(1) a plurality of bin-level sequence ratios, each respective bin-level sequence ratio in the plurality of bin-level sequence ratios corresponding to a respective bin in a plurality of bins, wherein: each respective bin in the plurality of bins represents a corresponding region of a reference genome for the species of the subject, and each respective bin-level sequence ratio in the plurality of bin-level sequence ratios is determined from a comparison of the first plurality of sequence reads to sequence reads from one or more reference samples;
(2) a plurality of segment-level sequence ratios, each respective segment-level sequence ratio in the plurality of segment-level sequence ratios corresponding to a segment in a plurality of segments, wherein: each respective segment in the plurality of segments represents a corresponding region of the reference genome for the species of the subject encompassing a subset of adjacent bins in the plurality of bins, and each respective segment-level sequence ratio in the plurality of segment-level sequence ratios is determined from a measure of central tendency of the plurality of bin-level sequence ratios corresponding to the subset of adjacent bins encompassed by the respective segment; and
(3) a plurality of segment-level measures of dispersion, each respective segment-level measure of dispersion in the plurality of segment-level measures of dispersion (i) corresponding to a respective segment in the plurality of segments and (ii) determined using the plurality of bin-level sequence ratios corresponding to the subset of adjacent bins encompassed by the respective segment; and
(D) validating a copy number status annotation of a respective segment in the plurality of segments that is annotated with a copy number variation by applying the first dataset to an algorithm having a plurality of filters, the plurality of filters comprising:
(1) a measure of central tendency bin-level sequence ratio filter that is fired when a measure of central tendency of the plurality of bin-level sequence ratios corresponding to the subset of bins encompassed by the respective segment fails to satisfy one or more bin-level sequence ratio thresholds;
(2) a confidence filter that is fired when the segment-level measure of dispersion corresponding to the respective segment fails to satisfy a confidence threshold; and
(3) a measure of central tendency-plus-deviation bin-level sequence ratio filter that is fired when a measure of central tendency of the plurality of bin-level sequence ratios corresponding to the subset of bins encompassed by the respective segment fails to satisfy one or more measure of central tendency-plus-deviation bin-level sequence ratio thresholds, wherein the one or more measure of central tendency-plus-deviation bin-level copy ratio thresholds are derived from (i) a measure of central tendency of the bin-level sequence ratios corresponding to the plurality of bins that map to the same chromosome of the reference genome for the species of the subject as the respective segment, and (ii) a measure of dispersion across the bin-level sequence ratios corresponding to the plurality of bins that map to the respective chromosome; wherein rejecting or validating the copy number status annotation of the respective segment is based on a predetermined pattern of firing or lack of firing of each of the filters in the plurality of filters.
3. The method of claim 1 or 2, wherein the liquid biopsy sample is a blood sample.
4. The method of any one of claims 1-3, wherein the liquid biopsy sample comprises blood, whole blood, peripheral blood, plasma, serum, or lymph of the test subject.
5. The method of any one of claims 1-4, further comprising obtaining the liquid biopsy sample from a sample repository or database.
6. The method of any one of claims 1-5, wherein the test subject is a patient in a clinical trial.
7. The method of any one of claims 1-6, wherein the test subject is a patient with a cancer.
8. The method of claim 7, wherein the cancer is a solid tumor cancer.
9. The method of claim 7 or 8, wherein the cancer is Ovarian Cancer, Cervical Cancer, Uveal Melanoma, Colorectal Cancer, Chromophobe Renal Cell Carcinoma, Liver Cancer, Endocrine Tumor, Oropharyngeal Cancer, Retinoblastoma, Biliary Cancer, Adrenal cancer, Neural, Neuroblastoma, Basal Cell Carcinoma, Brain Cancer, Breast Cancer, Melanoma, Non-Clear Cell Renal Cell Carcinoma, Glioblastoma, Glioma, Tumor of Unknown Origin, Kidney Cancer, Gastrointestinal Stromal Tumor, Medulloblastoma, Bladder Cancer, Gastric Cancer, Bone Cancer, Non-Small Cell Lung Cancer, Thymoma, Low Grade Glioma, Prostate Cancer, Clear Cell Renal Cell Carcinoma, Skin Cancer, Thyroid Cancer, Sarcoma, Testicular cancer, Head and Neck Cancer, Head and Neck Squamous Cell Carcinoma, Meningioma, Peritoneal cancer, Endometrial Cancer, Pancreatic Cancer, Mesothelioma, Esophageal Cancer, Small Cell Lung Cancer, Her2 Negative Breast Cancer, Solid Tumor, Ovarian Serous Carcinoma, HR+ Breast Cancer, Uterine Serous Carcinoma, Endometrial Cancer, Uterine Corpus Endometrial Carcinoma, Gastroesophageal Junction Adenocarcinoma, Gallbladder Cancer, Chordoma, o