EP4320618A2 - Cell-free DNA sequence data analysis method to examine nucleosome protection and chromatin accessibility - Google Patents

Cell-free DNA sequence data analysis method to examine nucleosome protection and chromatin accessibility

Info

Publication number
EP4320618A2
Authority
EP
European Patent Office
Prior art keywords
cancer
cell
determining
fragment
read data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
EP22785557.4A
Other languages
English (en)
French (fr)
Inventor
Gavin HA
David Macpherson
Peter S. Nelson
Anna-Lisa DOEBLEY
Joseph B. HIATT
Navonil DE SARKAR
Robert Patton
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fred Hutchinson Cancer Center
Original Assignee
Fred Hutchinson Cancer Research Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fred Hutchinson Cancer Research Center
Publication of EP4320618A2

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10 Gene or protein expression profiling; Expression-ratio estimation or normalisation
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H20/00 ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance
    • G16H20/10 ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance relating to drugs or medications, e.g. for ensuring correct administration to patients
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems

Definitions

  • Metastatic cancer is a late stage of cancer that often leads to cancer-related deaths.
  • treatment options are often based on clinical diagnostics from the primary tumor.
  • molecular changes in the tumor, such as genetic alterations or phenotype changes, can emerge during metastatic progression or the development of treatment resistance.
  • hormone receptor conversions in breast cancer are frequently observed during the development of targeted treatment resistance. Therefore, it is important to classify tumor subtypes and identify patterns of transcriptional regulation that drive tumor phenotype changes during therapy. This type of work has critical implications for studying mechanisms of resistance to therapies and informing clinical treatment decisions in order to provide patients with life-prolonging treatment and care.
  • breast cancer is among the most common cancers, accounting for 23% of cancer diagnoses and 14% of cancer-related deaths among women worldwide.
  • Targeted therapy is guided by tumor subtype, including the expression of three hormone receptors: estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2).
  • breast cancer tumors will undergo a switch in hormone subtype during tumor recurrence or as a mechanism of resistance to endocrine therapy.
  • clinical determination of tumor subtype remains restricted to use of tissue biopsies, which are not routinely collected in late-stage cancers or repeatedly taken during the course of therapy.
  • prostate cancer is the second most common cause of cancer mortality among men with an estimated 33,000 deaths in the United States in 2020.
  • Castration-resistant prostate cancer (CRPC) describes the stage in which the disease has developed resistance to androgen deprivation therapy. It can progress to metastatic CRPC (mCRPC), which is an invariably lethal stage with no curative treatment.
  • mCRPC is recognized to comprise multiple distinct subtype lineages and molecular subtypes, which are generally classified by specific genomic or epigenetic modifications.
  • Prostate cancer can be categorized by phenotypic features that include a spectrum of trans-differentiated disease states, including neuroendocrine (NE) carcinomas, the low androgen regulated disease state (ARlowPC), and double negative prostate cancer (DNPC; AR negative, NE negative).
  • the disclosure provides a computer-implemented method of enhancing sequence read data from cell-free DNA samples for cell type prediction.
  • the method comprises: receiving, by a computing system, sequence read data, wherein the sequence read data includes a plurality of fragment reads, wherein each fragment read has a fragment length and a GC content indicating a percentage of bases in the fragment read that are G or C; determining, by the computing system, GC bias values for each fragment read based on the fragment length and the GC content of the fragment read; generating, by the computing system, a genomic coverage distribution that is adjusted for GC bias using the sequence read data and the GC bias values; and predicting, by the computing system, the cell type based on the genomic coverage distribution.
  • predicting the cell type based on the genomic coverage distribution includes predicting a cell phenotype. In one embodiment, predicting the cell phenotype includes predicting a tissue type, a cancer type, or a cancer subtype. In one embodiment, predicting the cell phenotype includes predicting expression of one or more genes of interest.
  • determining the GC bias value based on the fragment length and the GC content of the fragment read includes: counting a number of observed reads of each combination of fragment length and GC content to determine GC counts for the sequence read data; dividing the GC counts by corresponding GC frequencies in a GC frequency matrix to determine a GC bias for each fragment length; normalizing a mean GC bias for each fragment length to determine rough GC bias values; and smoothing the rough GC bias values to determine the GC bias values.
  • the GC frequency matrix stores a frequency for each GC content for each fragment length of a plurality of fragment lengths in mappable regions of a reference genome.
  • the plurality of fragment lengths includes each fragment length from a short length threshold to a long length threshold.
  • the short length threshold is in a range of 10-20 base pairs
  • the long length threshold is in a range of 450-550 base pairs.
  • the short length threshold is 15 base pairs
  • the long length threshold is 500 base pairs.
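  • The GC bias steps above (count observed fragments, divide by expected frequencies, normalize per fragment length, smooth) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the use of a G/C base count as the GC key, the dictionary layout of the GC frequency matrix, and the moving-average smoother are all assumptions made for the example.

```python
from collections import Counter


def rough_gc_bias(fragments, gc_freq):
    """Compute rough GC bias values.

    fragments: iterable of (fragment_length, gc_count) pairs (assumed layout).
    gc_freq:   dict mapping (fragment_length, gc_count) to the expected
               frequency of that combination in mappable reference regions.
    Returns a dict with the same keys, normalized so that the mean bias
    for each fragment length is 1.
    """
    counts = Counter(fragments)
    # Observed count divided by expected frequency for each combination.
    ratio = {key: counts.get(key, 0) / freq
             for key, freq in gc_freq.items() if freq > 0}
    # Group ratios by fragment length to compute the per-length mean.
    by_length = {}
    for (length, _gc), value in ratio.items():
        by_length.setdefault(length, []).append(value)
    means = {length: sum(v) / len(v) for length, v in by_length.items()}
    return {(length, gc): value / means[length]
            for (length, gc), value in ratio.items() if means[length] > 0}


def smooth_gc_bias(rough, half_window=1):
    """Smooth across neighboring GC values at the same fragment length.

    A plain moving average is used here as a stand-in; the source does not
    specify the smoothing method in this passage.
    """
    smoothed = {}
    for (length, gc) in rough:
        neighbours = [rough[(length, g)]
                      for g in range(gc - half_window, gc + half_window + 1)
                      if (length, g) in rough]
        smoothed[(length, gc)] = sum(neighbours) / len(neighbours)
    return smoothed


# Toy example: 4 bp fragments with 1, 2, or 3 G/C bases.
toy_fragments = [(4, 2)] * 6 + [(4, 1)] * 2 + [(4, 3)]
toy_freq = {(4, 1): 1.0, (4, 2): 2.0, (4, 3): 1.0}
toy_rough = rough_gc_bias(toy_fragments, toy_freq)
toy_smooth = smooth_gc_bias(toy_rough)
```

  In a real pipeline the keys would cover every fragment length between the short and long length thresholds (for example 15 to 500 base pairs), and each fragment would later be weighted by the inverse of its smoothed bias value.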
  • the method further comprises: determining genomic regions of interest for a cell type; and filtering the genomic regions of interest to identify cell-type-informative sites.
  • determining the genomic regions of interest includes: determining a mean mappability in a fixed size window around each genomic region of interest; and discarding genomic regions of interest having a mean mappability less than a predetermined threshold.
  • filtering the genomic regions of interest to identify cell-type-informative sites includes determining sites that have differential signals between a first cell type and a second cell type.
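  • The mappability filter described above can be illustrated with a short sketch. It assumes, purely for the example, that mappability is available as an in-memory list of per-base scores for one chromosome; the function names, window size, and threshold value are hypothetical and not taken from the source.

```python
def mean_mappability(track, center, half_window):
    """Mean mappability in a fixed-size window centered on a site.

    track: list of per-base mappability scores in [0, 1] (assumed layout).
    """
    start = max(0, center - half_window)
    end = min(len(track), center + half_window)
    window = track[start:end]
    return sum(window) / len(window) if window else 0.0


def filter_sites(track, sites, half_window, min_mappability):
    """Keep sites whose mean mappability is at least the threshold."""
    return [site for site in sites
            if mean_mappability(track, site, half_window) >= min_mappability]


# Toy track: first 10 bases fully mappable, last 10 unmappable.
toy_track = [1.0] * 10 + [0.0] * 10
kept = filter_sites(toy_track, sites=[5, 10, 15], half_window=3,
                    min_mappability=0.95)
```

  Here only the site at position 5 survives, because the windows around positions 10 and 15 overlap the unmappable stretch.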
  • generating the genomic coverage distribution includes: determining fragment midpoints in a window around each cell-type-informative site; assigning a weight for each fragment read based on an inverse of the GC bias value for each fragment read; using the weighted fragment reads to determine GC-corrected midpoint coverage profiles; excluding positions that overlap excluded regions; determining a mean profile based on determining an average of GC-corrected midpoint coverage profiles for all sites; smoothing the mean profile to generate a smoothed mean profile; and normalizing the smoothed mean profile by dividing by a mean of surrounding coverage to determine a normalized mean profile.
  • the excluded regions include one or more regions that are within the ENCODE unified GRCh38 exclusion list, centromeres, gaps in the human genome assembly, fix patches, alternative haplotypes, or regions of zero mappability, or that have coverage of at least 10 standard deviations above the mean.
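  • The coverage-profile procedure above (weight each fragment midpoint by the inverse of its GC bias, mask excluded positions, average across sites, smooth, normalize) can be sketched as below. This is an illustrative sketch under stated assumptions: the data layouts and function names are hypothetical, a moving average stands in for the unspecified smoothing step, and normalization divides by the mean of the whole window as a simple proxy for "surrounding coverage".

```python
def site_profile(fragments, site, half_window, gc_bias):
    """GC-corrected midpoint coverage around one site.

    fragments: list of (start, end, gc_count) with half-open coordinates.
    gc_bias:   dict mapping (fragment_length, gc_count) to a bias value.
    """
    profile = [0.0] * (2 * half_window)
    for start, end, gc in fragments:
        length = end - start
        midpoint = start + length // 2
        offset = midpoint - (site - half_window)
        bias = gc_bias.get((length, gc))
        if 0 <= offset < len(profile) and bias:
            profile[offset] += 1.0 / bias  # weight is the inverse GC bias
    return profile


def mean_profile(profiles, masks=None):
    """Position-wise mean across sites; masked positions (True) are skipped."""
    result = []
    for i in range(len(profiles[0])):
        values = [p[i] for j, p in enumerate(profiles)
                  if not (masks and masks[j][i])]
        result.append(sum(values) / len(values) if values else 0.0)
    return result


def moving_average(values, half_window):
    """Simple smoother used here as a stand-in for the smoothing step."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - half_window):i + half_window + 1]
        out.append(sum(window) / len(window))
    return out


def normalize(profile):
    """Divide by the mean coverage of the window (assumed proxy)."""
    mean = sum(profile) / len(profile)
    return [v / mean for v in profile] if mean else list(profile)


toy_bias = {(20, 10): 2.0, (10, 5): 0.5}
toy_profile = site_profile([(90, 110, 10), (95, 105, 5)], site=100,
                           half_window=5, gc_bias=toy_bias)
```

  Both toy fragments have their midpoint at position 100, so the single non-zero bin holds 1/2.0 + 1/0.5 = 2.5 weighted fragments.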
  • predicting the cell type based on the genomic coverage distribution includes: generating one or more features based on the genomic coverage distribution; providing the one or more features as input to a classifier model; and determining the cell type based on an output of the classifier model.
  • the one or more features include a mean of coverage in a first predetermined window around each cell-type-informative site, a mean of coverage in a second predetermined window of a different size than the first predetermined window around each cell-type-informative site, and an amplitude of the genomic coverage distribution around each cell-type-informative site.
  • the first predetermined window is larger than the second predetermined window.
  • the first predetermined window has a width in a range of 1800-2200 base pairs
  • the second predetermined window has a width in a range of 40-80 base pairs.
  • the first predetermined window has a width of 2000 base pairs
  • the second predetermined window has a width of 60 base pairs.
  • the amplitude of the genomic coverage distribution around each cell-type-informative site is determined by: trimming the genomic coverage distribution to a window that contains 10 peaks; performing a fast Fourier transform on the window of the genomic coverage distribution; and determining a magnitude of the 10th frequency.
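  • The three features described above can be sketched as follows. The sketch assumes a profile with one value per base pair, centered on the site; the function names and the `amplitude_width` parameter (the window expected to span ten nucleosome peaks) are hypothetical. Instead of a full fast Fourier transform, the example evaluates the single discrete Fourier coefficient at frequency index 10 directly, which is the same value an FFT would return at that index.

```python
import cmath
import math


def centered_window(profile, width):
    """Slice of `width` positions centered on the middle of the profile."""
    center = len(profile) // 2
    half = min(width // 2, center)
    return profile[center - half:center + half]


def dft_magnitude(signal, k):
    """Magnitude of the k-th discrete Fourier coefficient of `signal`."""
    n = len(signal)
    coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal))
    return abs(coeff)


def profile_features(profile, wide=2000, narrow=60, amplitude_width=None):
    """Return the two window means and the 10th-frequency amplitude."""
    if amplitude_width is None:
        amplitude_width = len(profile)
    wide_win = centered_window(profile, wide)
    narrow_win = centered_window(profile, narrow)
    return {
        "mean_wide": sum(wide_win) / len(wide_win),
        "mean_central": sum(narrow_win) / len(narrow_win),
        "amplitude": dft_magnitude(centered_window(profile, amplitude_width), 10),
    }


# Synthetic profile: exactly 10 periodic peaks across 200 positions.
toy = [1 + math.cos(2 * math.pi * 10 * t / 200) for t in range(200)]
toy_features = profile_features(toy, wide=200, narrow=20, amplitude_width=200)
```

  For a window holding exactly ten evenly spaced peaks, almost all of the periodic signal lands in the 10th frequency term, which is why its magnitude serves as a measure of nucleosome phasing strength.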
  • the classifier model includes a logistic regression model, an artificial neural network, a decision tree, a support vector machine, or a Bayesian network.
  • the disclosure provides a method of determining a chromatin accessibility profile for a cell of interest from a sample comprising cell-free DNA derived from the cell of interest.
  • the method comprises: obtaining sequence read data from the cell-free DNA; receiving, by a computing system, sequence read data, wherein the sequence read data includes a plurality of fragment reads, wherein each fragment read has a fragment length and a GC content indicating a percentage of bases in the fragment read that are G or C; determining, by the computing system, GC bias values for each fragment read based on the fragment length and the GC content of the fragment read; generating, by the computing system, a genomic coverage distribution that is adjusted for GC bias using the sequence read data and the GC bias values; and determining the chromatin accessibility profile from the genomic coverage distribution.
  • the method further comprises determining a phenotype of the cell of interest based on the chromatin occupancy profile. In one embodiment, determining the cell phenotype comprises determining a tissue type, a cancer type, a cancer subtype, a malignancy aggressiveness phenotype, and/or a drug responsivity phenotype. In one embodiment, the method further comprises performing one or more steps of the computer implemented method described herein.
  • the disclosure provides a method for determining a cell type of a cell of interest from a sample comprising cell-free DNA derived from the cell of interest.
  • the method comprises: obtaining sequence read data generated from the sample comprising cell-free DNA; performing the computer-implemented method described herein; and determining the cell type of the cell of interest based on the prediction provided by the computing system.
  • determining the cell type comprises determining a cell phenotype. In one embodiment, determining the cell phenotype comprises determining a tissue type, a cancer type, a cancer subtype, a malignancy aggressiveness phenotype, and/or a drug responsivity phenotype. In one embodiment, determining the cell phenotype includes determining expression of one or more genes of interest.
  • the disclosure provides a method of detecting the presence of a cancer cell in a subject, comprising: obtaining sequence read data generated from the sample comprising cell-free DNA obtained from the subject; performing the computer-implemented method described herein; and determining the presence of a cancer cell in the subject based on the prediction provided by the computing system.
  • the method is performed a plurality of times over time, wherein the detected cancer cell(s) in the subject at each performance of the method are further characterized to determine a cancer subtype or phenotype of the detected cancer cell(s) based on the prediction provided by the computing system.
  • the method is performed a plurality of times over time, and the method further comprises detecting a change in phenotype of the detected cancer cell(s) over time.
  • the subject receives a cancer therapy between performances of the method, and the method further comprises determining the responsivity of the cancer cell(s) to the treatment.
  • the disclosure provides a method of determining a cancer subtype of a target cancer cell from a sample comprising cell-free DNA derived from the target cancer cell.
  • the method comprises: obtaining sequence read data generated from the sample comprising cell-free
  • the sample is obtained from a subject with cancer.
  • the cancer is characterized as metastatic breast cancer.
  • determining the cancer subtype comprises determining whether the cancer is ER+ versus ER-. In one embodiment, determining the cancer subtype comprises determining whether the cancer is PR+ versus PR-. In one embodiment, determining the cancer subtype comprises determining whether the cancer is HER2+ versus HER2-. In one embodiment, determining the cancer subtype comprises determining two or all of: whether the cancer is ER+ versus ER-, whether the cancer is PR+ versus PR-, and whether the cancer is HER2+ versus HER2-.
  • the cancer is characterized as metastatic prostate cancer.
  • determining the cancer subtype comprises determining whether the cancer is AR+ (ARPC) versus AR-. In one embodiment, determining the cancer subtype comprises determining whether the cancer is ARPC versus AR-low. In one embodiment, determining the cancer subtype comprises determining whether the cancer has a neuroendocrine prostate cancer (NEPC) phenotype signature or not. In one embodiment, determining the cancer subtype comprises determining whether the cancer is amphicrine.
  • determining the cancer subtype comprises determining two or all of: whether the cancer is AR+ (ARPC) or AR-, whether the cancer is AR-low or ARPC, whether the cancer has a neuroendocrine prostate cancer (NEPC) phenotype signature or not, whether the cancer is AR-low or NEPC, whether the cancer is amphicrine or ARPC or NEPC.
  • the cancer is characterized as lung cancer.
  • determining the cancer subtype comprises determining whether the cancer is small cell lung cancer (SCLC) or non-small cell lung cancer (NSCLC).
  • the method further comprises determining whether the NSCLC is adenocarcinoma or squamous cell carcinoma.
  • the sequence read data is generated from a panel of genomic targets.
  • the panel of genomic targets comprises transcription factor binding sites (TFBSs) of one or more transcription factors associated with SCLC.
  • the one or more transcription factors associated with SCLC comprise one or more of ASCL1, NEUROD1, POU2F3, REST, and the like, and the method comprises determining the nucleosome occupancy of the TFBSs.
  • the TFBSs are identified by ChIP-seq data, or the like, and are retained in the panel if they are proximal to a transcription start site of a gene associated with lung cancer.
  • the panel of genomic targets comprise transcription start sites (TSSs) for one or more markers associated with lung cancer, wherein the method comprises determining the nucleosome occupancy of the TSSs.
  • the sample is obtained from a subject.
  • the method can further comprise administering an effective treatment to the subject based on the determined cancer subtype.
  • the method further comprises performing the method on a plurality of samples obtained from the subject at a plurality of distinct time points after an initial diagnosis of cancer.
  • the sequence read data is generated by ultra-low pass whole genome sequencing.
  • sequence read data is generated by a chromatin accessibility assay.
  • sequence read data is generated in an ATAC-seq method.
  • sequence read data is generated in a ChIP-seq method.
  • sequence read data is generated in a DNAse sensitivity assay.
  • sequence read data is generated in a CUT&RUN assay.
  • the CUT&RUN assay incorporates an affinity reagent that targets a post-translational modification to one or more of H3K27ac, H3K4me1 and H3K27ac.
  • the method can further comprise generating the sequence read data.
  • the sequence read data comprises sequence read data generated from a panel of genomic targets.
  • the panel of genomic targets comprises transcription factor binding sites (TFBSs) of one or more transcription factors associated with a cancer type of interest.
  • the method comprises determining the nucleosome occupancy of the TFBSs.
  • the TFBSs are identified by ChIP-seq data, or the like, and are retained in the panel if they are proximal to a transcription start site of a gene associated with the cancer type of interest.
  • the panel of genomic targets comprise transcription start sites (TSSs) for one or more markers associated with the cancer type of interest, wherein the method comprises determining the nucleosome occupancy of the TSSs.
  • the sample can be blood, plasma, or serum, and the like.
  • the disclosure provides a computer-implemented method of enhancing sequence read data from cell-free DNA samples for cell type prediction.
  • the method comprises: receiving, by a computing system, sequence read data, wherein the sequence read data includes a plurality of fragment reads, and wherein each fragment read has a fragment length; determining, by the computing system, a fragment size variability for at least one gene associated with a cell type; and predicting, by the computing system, the cell type based on the fragment size variability for the at least one gene.
  • determining the fragment size variability includes determining a fragment size coefficient of variation.
  • predicting the cell type based on the genomic coverage distribution includes predicting a cell phenotype.
  • predicting the cell phenotype includes predicting a cancer subtype.
  • predicting the cell phenotype includes predicting a cancer subtype of prostate cancer.
  • predicting the cancer subtype includes distinguishing between ARPC and NEPC.
  • predicting the cell type based on the fragment size variability includes: generating one or more features based on the fragment size variability; providing the one or more features as input to a classifier model; and determining the cell type based on an output of the classifier model.
  • generating the one or more features based on the fragment size variability includes generating a log2 fold change value of a fragment size coefficient of variation in a first cell type versus a second cell type.
  • the log2 fold change value predicts at least one of gene expression and gene transcriptional activity between the first cell type and the second cell type.
  • the first cell type is an ARPC cell and the second cell type is an NEPC cell.
  • the classifier model includes a logistic regression model, an artificial neural network, a decision tree, a support vector machine, or a Bayesian network.
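  • The fragment size variability feature described above can be illustrated with a small sketch: compute the coefficient of variation (CV) of fragment lengths for a gene in each cell type, then take the log2 fold change between the two. The function names and the choice of population standard deviation are assumptions made for this example; the source does not state which estimator is used.

```python
import math
import statistics


def fragment_size_cv(lengths):
    """Coefficient of variation of fragment lengths (std / mean).

    Uses the population standard deviation; this is an assumption.
    """
    mean = statistics.fmean(lengths)
    return statistics.pstdev(lengths) / mean


def log2_fold_change(cv_first, cv_second):
    """log2 ratio of the CV in a first cell type versus a second cell type."""
    return math.log2(cv_first / cv_second)


# Toy fragment lengths overlapping one gene in two hypothetical samples.
cv_a = fragment_size_cv([100, 200])        # mean 150, std 50
cv_b = fragment_size_cv([140, 160])        # mean 150, std 10
feature = log2_fold_change(cv_a, cv_b)     # ratio of 5, so log2(5)
```

  A value like `feature` could then be supplied, one per gene, as input to whichever classifier model is chosen (for example a logistic regression distinguishing ARPC from NEPC).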
  • the disclosure provides a method for determining a cell type of a cell of interest from a sample comprising cell-free DNA derived from the cell of interest, comprising: obtaining sequence read data generated from the sample comprising cell-free DNA; performing the computer-implemented method described herein (e.g., relating to predicting the cell type based on the fragment size variability); and determining the cell type of the cell of interest based on the prediction provided by the computing system.
  • determining the cell type comprises determining a cell phenotype. In one embodiment, determining the cell phenotype comprises determining a cancer subtype. In one embodiment, determining the cancer subtype includes distinguishing between ARPC and NEPC.
  • the disclosure provides a method of detecting the presence of a cancer cell in a subject, comprising: obtaining sequence read data generated from a sample comprising cell-free DNA obtained from the subject; performing the computer-implemented method described herein (e.g., relating to predicting the cell type based on the fragment size variability); and determining the presence of a cancer cell in the subject based on the prediction provided by the computing system.
  • the method is performed a plurality of times over time, wherein the detected cancer cell(s) in the subject at each performance of the method are further characterized to determine a cancer subtype or phenotype of the detected cancer cell(s) based on the prediction provided by the computing system. In one embodiment, the method is performed a plurality of times over time, and wherein the method further comprises detecting a change in phenotype of the detected cancer cell(s) over time. In one embodiment, the subject receives a cancer therapy between performances of the method, wherein the method further comprises determining the responsivity of the cancer cell(s) to the treatment.
  • In another aspect, the disclosure provides a method of determining a cancer subtype of a target cancer cell from a sample comprising cell-free DNA derived from the target cancer cell, the method comprising: obtaining sequence read data generated from the sample comprising cell-free DNA; performing the computer-implemented method described herein (e.g., relating to predicting the cell type based on the fragment size variability); and determining the cell type of the originating cell based on the predicted cancer subtype provided by the computing system.
  • the sample is obtained from a subject with cancer.
  • the cancer is characterized as metastatic prostate cancer.
  • determining the cancer subtype comprises determining whether the cancer is AR+ (ARPC) versus AR-.
  • determining the cancer subtype comprises determining whether the cancer is ARPC versus AR-low prostate cancer (ARLPC).
  • determining the cancer subtype comprises determining whether the cancer has a neuroendocrine prostate cancer (NEPC) phenotype signature or not.
  • the sample is obtained from a subject and the method further comprises administering an effective treatment to the subject based on the determined cancer subtype.
  • the method further comprises performing the method on a plurality of samples obtained from the subject at a plurality of distinct time points after an initial diagnosis of cancer.
  • the sequence read data is generated by ultra-low pass whole genome sequencing. In one embodiment, the sequence read data is generated by a chromatin accessibility assay. In one embodiment, the sequence read data is generated in an ATAC-seq method. In one embodiment, the sequence read data is generated in a ChIP-seq method. In one embodiment, the sequence read data is generated in a DNAse sensitivity assay. In one embodiment, the sequence read data is generated in a CUT&RUN assay. In one embodiment, the CUT&RUN assay incorporates an affinity reagent that targets a post-translational modification to one or more of H3K27ac, H3K4me1 and H3K27ac. In one embodiment, the method further comprises generating the sequence read data.
  • the sequence read data is generated from a panel of genomic targets.
  • the panel of genomic targets comprises transcription factor binding sites (TFBSs) of one or more transcription factors associated with a cancer type of interest.
  • the method comprises determining the nucleosome occupancy of the TFBSs.
  • the TFBSs are identified by ChIP-seq data, or the like, and are retained in the panel if they are proximal to a transcription start site of a gene associated with the cancer type of interest.
  • the panel of genomic targets comprise transcription start sites (TSSs) for one or more markers associated with the cancer type of interest, wherein the method comprises determining the nucleosome occupancy of the TSSs.
  • the sample is blood, plasma, or serum.
  • FIGURE 1 is a flowchart that illustrates a non-limiting example embodiment of a method of cancer subtype prediction according to various aspects of the present disclosure.
  • FIGURE 2 is a flowchart that illustrates a non-limiting example embodiment of a procedure for determining informative sites for tissue, cell-type, cancer-type, or cancer-subtype of interest and filtering to identify cancer subtype-specific informative sites according to various aspects of the present disclosure.
  • FIGURE 3 is a flowchart that illustrates a non-limiting example embodiment of a procedure for determining a GC frequency matrix for a genome according to various aspects of the present disclosure.
  • FIGURE 4 is a flowchart that illustrates a non-limiting example embodiment of a procedure for using a GC frequency matrix to determine GC bias values for sequence read data according to various aspects of the present disclosure.
  • FIGURE 5 is a flowchart illustrating a non-limiting example embodiment of a procedure for using GC bias values to generate a nucleosome profile of sequence read data for subtype-specific informative sites according to various aspects of the present disclosure.
  • FIGURE 6 is a block diagram that illustrates aspects of an exemplary computing device appropriate for use as a computing device of the present disclosure.
  • FIGURES 7A and 7B illustrate the Griffin framework for cfDNA nucleosome profiling to predict cancer subtypes and tumor phenotype.
  • FIGURE 7A is an illustration of a group of accessible sites (left panel) and inaccessible sites (right panel), such as a TFBS.
  • the nucleosomes (in grey) are positioned in an organized manner around the accessible sites (box; left panel), but not around the inaccessible ones (right panel). These nucleosomes protect the DNA from degradation when it is released into peripheral blood.
  • the protected fragments from the plasma are sequenced and aligned, leading to a coverage profile which reflects the nucleosome protection in the cells of origin.
  • FIGURE 7B is a schematic showing the Griffin workflow for cfDNA nucleosome profiling analysis.
  • cfDNA whole genome sequencing (WGS) data with > 0.1x coverage is aligned to the hg38 genome build.
  • Sites of interest are selected from any assay. Paired-end reads aligned to each site are collected, fragment midpoint coverage is counted, and corrected for GC bias to produce a coverage profile.
  • Coverage profiles from all sites in a group e.g., open chromatin for tumor subtype
  • Composite profiles are normalized using the surrounding region (-5 kb to +5 kb).
  • FIGURES 8A to 8G illustrate that Griffin GC bias correction improves detection of tissue specific accessibility from cfDNA.
  • FIGURE 8A graphically illustrates the aggregated GC content at 10,000 GRHL2 binding sites and their surrounding 2 kb regions. Mean GC content (line) and interquartile range (shading) are shown.
  • FIGURE 8B graphically illustrates cfDNA GC bias is unique to each sample and each fragment length. GC bias computed for cfDNA from a healthy donor (HD_46; dashed shades) and a metastatic breast cancer (MBC_315; solid shades) sample are shown for various fragment sizes.
  • FIGURE 8C graphically illustrates the composite coverage profile of 10,000 GRHL2 binding sites before and after GC correction, shown for HD_46 (dashed) and MBC_315 (solid).
  • the 'central coverage' has a higher value due to effects of GC bias, which can obscure differential signals between samples.
  • the central coverage of the MBC sample has a lower value, which is consistent with increased GRHL2 activity in breast cancer cells but not in the immune cells making up the healthy donor sample.
  • FIGURE 8D graphically illustrates composite coverage profiles of 10,000 LYL1 sites before and after GC correction, shown for two MBC samples with deep WGS (9-25x, orange), two healthy donors (17-20x, green), and 191 MBC samples with ULP-WGS (0.1-0.3x, blue).
  • cfDNA contains a mixture of tumor and blood cells; therefore, central coverage value is expected to be positively correlated with tumor fraction (lower represents increased accessibility).
  • the boxed range represents the median ± IQR
  • whiskers represent the range of the non-outlier data (maximum extent is 1.5x the IQR).
  • Outliers are plotted in grey. The p-value was calculated using the Wilcoxon signed-rank test (two-sided).
  • FIGURE 8G illustrates boxplots showing the distribution of the mean absolute deviation (of the central coverage across 215 healthy donors [1-2x WGS]) across the 377 TFs, before and after GC correction. Box elements are the same as (8F). The p-value was calculated using the Wilcoxon signed-rank test (two-sided).
  • FIGURES 9A and 9B illustrate that Griffin enables accurate cancer detection and tissue-of-origin prediction.
  • FIGURE 9A illustrates receiver operator characteristic (ROC) curve for logistic regression classification of cancer vs. healthy controls in three datasets, the DELFI dataset (Cristiano, S. et al. Genome-wide cell-free DNA fragmentation in patients with cancer. Nature 570, 385-389 (2019)), LUCAS dataset, and LUCAS validation dataset (Mathios, D. et al. Detection and characterization of lung cancer using cell-free DNA fragmentomes. Nat Commun 12, 5060 (2021)).
  • FIGURE 9B illustrates boxplots of the AUC values for 1000 bootstrap iterations.
  • the boxed range represents the median ± IQR
  • whiskers represent the range of the non-outlier data (maximum extent is 1.5x the IQR). Values below the boxplots show the median and 95% confidence interval.
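The bootstrapped AUC values described for FIGURE 9B can be illustrated with a short sketch. It assumes a percentile interval over resampled AUCs, which the text does not specify; the function names and the toy labels and scores are hypothetical and only stand in for real classifier output.

```python
import random


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def bootstrap_auc(labels, scores, iterations=1000, seed=0):
    """Return (median, lower, upper) of the AUC over bootstrap resamples.

    The 2.5th and 97.5th percentiles are used as an assumed 95% interval.
    The input must contain both classes.
    """
    rng = random.Random(seed)
    n = len(labels)
    values = []
    while len(values) < iterations:
        idx = [rng.randrange(n) for _ in range(n)]
        resampled_labels = [labels[i] for i in idx]
        if len(set(resampled_labels)) < 2:
            continue  # a resample needs both classes for the AUC to be defined
        values.append(auc(resampled_labels, [scores[i] for i in idx]))
    values.sort()
    last = len(values) - 1
    return values[last // 2], values[int(0.025 * last)], values[int(0.975 * last)]


# Toy example: 1 = cancer, 0 = healthy control.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
point_estimate = auc(labels, scores)
median, lower, upper = bootstrap_auc(labels, scores, iterations=200)
```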
  • FIGURES 10A to 10H illustrate that Griffin enables accurate prediction of breast cancer estrogen receptor subtypes from ultra-low pass WGS.
  • FIGURE 10C illustrates a CoMut plot (Crowdis, J., He, M. X., Reardon, B. & Van Allen, E. M. CoMut: visualizing integrated molecular information with comutation plots. Bioinformatics 36, 4348-4349 (2020)) showing information about 101 MBC patients with >0.10 tumor fraction. The top row shows the ER status used for training and assessing the regression model. For most patients, this was the metastatic ER status obtained from IHC; if the metastatic ER status was not available, the primary ER status was used.
  • FIGURE 10D is a receiver operator characteristic (ROC) curve for a logistic regression model predicting ER+ and ER- subtype. ROC curve, accuracy and AUC are shown for all patients and for patients grouped by tumor fraction (TFx), 0.05-0.1 and >0.1. 95% CIs were obtained by bootstrapping.
  • FIGURE 10E graphically illustrates performance of the model on samples from three validation cohorts. For patients with multiple timepoints, the first sample was used.
  • FIGURE 10F graphically illustrates subtype prediction in patients separated by clinical metastatic ER status and clinical primary tumor ER status. P-values were calculated using a Fisher's exact test (two-sided).
  • FIGURE 10G illustrates the ROC curve for predicting ER loss among patients with a primary ER-positive tumor. The 95% CI was obtained by bootstrapping.
  • FIGURE 10H illustrates the timeline for two patients (MBC_1413 and MBC_1099) with multiple biopsies of different subtypes and multiple cfDNA samples.
  • ER+ prediction probability is shown for all cfDNA samples that passed the >0.05 tumor fraction and 0.1x coverage thresholds. The decision boundary for ER+ (>0.5) and ER- (<0.5) is indicated with a dotted line. Timelines in months from metastatic diagnosis to death are shown for each patient. For patient MBC_1413, a metastatic biopsy (pleural fluid) was taken on the day of metastatic diagnosis and indicated ER- disease. However, approximately 7 months later, another metastatic biopsy (liver) showed weak ER+ staining (5%). A final biopsy (pleural fluid) was taken at approximately 12 months and showed ER- staining once again.
  • two ER- biopsies were taken at 0 months (bone) and 7 months (liver).
  • cfDNA was drawn after this point, however between the two cfDNA draws, another biopsy (liver) indicated the presence of low level ER+ disease.
  • FIGURES 11A and 11B illustrate the workflow for characterizing advanced prostate cancer through matched tumor and liquid biopsies from PDX models.
  • FIGURE 11A, top panel, illustrates that blood and tissue samples were taken from 26 patient-derived xenograft (PDX) mouse models with tumors originating from metastatic castration-resistant prostate cancer (mCRPC) with AR-positive adenocarcinoma (ARPC), neuroendocrine prostate carcinoma (NEPC), and AR-low non-neuroendocrine prostate carcinoma (ARLPC) phenotypes.
  • cfDNA was extracted from pooled plasma collected from 7-10 mice and whole genome sequencing (WGS) was performed.
  • FIGURE 11A, middle panel, illustrates two distinct ctDNA features that were analyzed at transcription factor binding sites (TFBSs) and open chromatin sites throughout the genome using Griffin (see Example 1 and Doebley et al. (2021). Griffin: Framework for clinical cancer subtyping from nucleosome profiling of cell-free DNA. MedRxiv 2021.08.31.21262867 and Methods).
  • FIGURE 11A, bottom right panel, shows phenotype classification: a probabilistic model that accounted for ctDNA tumor content and was informed by PDX features was applied to 159 samples in three patient cohorts.
  • FIGURE 11B illustrates PDX phenotypes and mouse plasma sequencing. Inclusion status based on final mean depth after mouse read subtraction (samples with <3x coverage were excluded unless AR coordinate amplification signal was reliably detected; lower dotted line). Phenotype status, including 6 NEPC, 18 ARPC (2 excluded), and 2 ARLPC. Average depth of coverage before and after mouse subtraction (mean coverage 20.5x; upper dotted line). Percentage of the cfDNA sample that contains human ctDNA after mouse read subtraction.
  • FIGURES 12A to 12G illustrate the analysis of tumor histone modifications and ctDNA reveals nucleosome patterns consistent with transcriptional regulation in CRPC phenotype-specific genes.
  • FIGURE 12A illustrates H3K27ac peak signals between ARLPC, ARPC, and NEPC PDX tumor phenotypes at 10,000 AR binding sites (left) and at ASCL1 binding sites (right). Binding sites were selected from the GTRD (Yevshin et al. (2019). GTRD: a database on gene transcription regulation — 2019 update. Nucleic Acids Res 47, D100-D105) (Methods).
  • FIGURES 12B and 12C graphically illustrate composite coverage profiles at 1000 AR (12B) and ASCL1 (12C) binding sites in ctDNA analyzed using Griffin. Coverage profile means (lines) and 95% confidence interval with 1000 bootstraps (shading) are shown. The region ±150 bp is indicated with vertical dotted line and yellow shading.
  • FIGURE 12D is a heatmap of log2 fold change in key genes up and down regulated between ARPC and NEPC established through RNA-Seq (left) grouped by the type of histone modification which dictates translation levels: Group 1 shows genes where the predominant PTM mark is attributed to H3K27ac or H3K4me1 active marks in the gene promoters or putative distal enhancers, lacking H3K27me3 heterochromatic mark in the gene body; Group 2 features gene body spanning H3K27me3 repression marks. Central columns show differential peak intensity for each of the assayed histone modifications, separated by whether they appear upstream or in the promoter or the body of each gene.
  • FIGURE 12E graphically illustrates a comparison of the log2 fold change (ARPC vs. NEPC) of mean mRNA expression vs mean coefficient of variation (CV) in the 47 phenotypic lineage marker genes' promoter regions.
  • FIGURE 12F (top) provides illustrations of expected ctDNA coverage profiles for Group 1 genes with and without H3K27ac or H3K4me1 modification leading to active and inactive transcription, respectively.
  • FIGURE 12F (bottom) shows the ±1000 bp surrounding the promoter region for AR and ASCL1 in ARPC and NEPC.
  • FIGURE 12G is an illustration of expected ctDNA coverage profiles for Group 2 genes with repressed transcription caused by H3K27me3 modifications in the gene body.
  • Neuronal gene UNC13A has increased nucleosome phasing in ctDNA of ARPC samples compared to NEPC.
  • This list of TFs was initially selected as having differential expression between ARPC and NEPC from LuCaP PDX RNA-Seq analysis.
  • Heatmap colors indicate increased accessibility (low values; lighter) and decreased accessibility (higher values; darker) in ctDNA.
  • TFs with increased accessibility in NEPC samples (log2-fold-change > 0.05, Mann-Whitney U test p < 0.05) are indicated with red text; TFs with increased accessibility in ARPC (log2-fold-change < -0.05, p < 0.05) are indicated with blue text.
  • FIGURES 14A to 14G illustrate comprehensive evaluation of ctDNA features throughout the genome for CRPC phenotype classification in PDX models.
  • FIGURE 14A illustrates a volcano plot of log2-fold change of ATAC-Seq peak intensity between 5 ARPC and 5 NEPC lines; the dotted line demarcates sites by q-value < 0.05.
  • FIGURES 14B and 14C graphically illustrate composite coverage profiles at open chromatin sites specific to ARPC (14B) and NEPC (14C) PDX tumors analyzed by Griffin. Sites from (14A) were filtered for overlap with known TFBSs in 338 factors from GTRD (Yevshin et al. (2019). Nucleic Acids Res 47, D100-D105).
  • FIGURE 14E graphically illustrates performance of classifying ARPC vs NEPC PDX from ctDNA using supervised machine learning (XGBoost) in various region types (all genes, TFBSs, and open regions, Methods). Area under the receiver operating characteristic curve (AUC) with 95% confidence interval (100 repeats of stratified cross validation) is shown for performance of all feature types.
  • FIGURE 14F shows example composite coverage profiles at open chromatin sites specific to ARPC (left) and NEPC (right) identified in 14B-14C. Simulated admixtures generated using ARPC mixed with healthy donor (HD) (left) and NEPC mixed with HD (right) are shown for varying tumor fractions.
  • FIGURE 14G graphically illustrates performance for classification on admixture samples using the probabilistic mixture model.
  • Five ctDNA admixtures were generated for each phenotype from PDX lines, each at various sequencing coverages and tumor fractions. In total, 125 admixtures were evaluated. The mean AUC across the 5 admixtures is shown for each configuration.
  • FIGURES 15A to 15C illustrate accurate classification of NEPC phenotypes from plasma in three patient cohorts using a probabilistic model informed by PDX ctDNA features.
  • FIGURE 15A graphically illustrates the receiver operating characteristic (ROC) curve for 101 mCRPC patients (DFCI cohort I) with ultra-low-pass WGS (ULP-WGS) data. The optimal performance of 90.4% sensitivity (for predicting NEPC) and 97.5% specificity (for predicting ARPC) corresponding to a prediction score cutoff of 0.3314 is indicated with horizontal and vertical dotted lines, respectively.
  • FIGURE 15B illustrates prediction scores for 11 plasma samples from seven patients (DFCI cohort II) with both WGS and ULP-WGS data.
  • the 0.3314 score cutoff threshold (dotted line) was used for classifying NEPC and ARPC. Tumor fractions were estimated by ichorCNA from WGS data. Patients were treated for adenocarcinoma (ARPC) or had high PSA values.
  • FIGURE 15C illustrates prediction scores for 47 plasma samples with clinical phenotypes comprising 26 ARPC, 5 NEPC, and 16 mixed or ambiguous phenotypes (triangles), including double-negative prostate cancer (DNPC). Scores are shown for WGS and ULP-WGS (0.1x) for the same ctDNA sample.
  • the cutoff threshold of 0.3314 (dotted line) was used for classifying NEPC and ARPC. Tumor fractions were estimated by ichorCNA on the WGS data.
  • FIGURE 16 is a schematic of an integrated, non-invasive targeted sequencing assay based on cfDNA for detection of genetic mutations and prediction of key tumor epigenetic features in SCLC.
  • FIGURES 17A and 17B illustrate the detection of transcription factor (TF) expression in SCLC models using targeted sequencing of cfDNA.
  • FIGURE 17A is a schematic of experimental workflow for proof-of-concept negative control ("healthy donor") and positive control ("flank tumors" from SCLC cellular models) samples.
  • FIGURE 17B graphically illustrates aggregated coverage across TFBSs in targeted sequencing data for healthy donors (top row) and flank tumors (bottom row).
  • the TFBS is expected to be located at position 0 on the x axis. Data are color-coded by expected TF expression. Healthy donor-derived cfDNA is expected to reflect REST expression but not ASCL1, NEUROD1, or POU2F3. In SCLC models, systematic differences in coverage distribution as a function of TF expression are apparent.
  • FIGURES 18A to 18C illustrate transcription factor activity inference using TFBS coverage distributions from SCLC patient samples with available matched tumor gene expression data.
  • FIGURE 18A graphically illustrates aggregated coverage across TFBSs in targeted sequencing data for healthy donors (top row) and patients with SCLC (bottom row) for whom matched tumor tissue with gene expression data was available. Samples are color-coded by expected TF expression. Systematic differences in coverage distribution as a function of expected TF expression are again apparent.
  • FIGURE 18B illustrates gene expression of key genes in selected patient samples displayed as a heatmap. Cells are color coded by Z-score and the inset text is the log2(TPM+1).
  • FIGURE 18C illustrates peak to trough amplitude calculated from coverage distributions at TFBS in each patient sample displayed as a heatmap. The amplitude is displayed by color and also as inset text. Trough depth magnitude corresponds to gene expression of the key TFs in these bona fide SCLC patient samples.
  • FIGURE 19 is a series of graphs illustrating quantification of transcription factor binding site peak-to-trough amplitude by sample type. The distribution of TFBS peak-to-trough amplitude, calculated from aggregated coverage distributions, is shown according to the expected ground truth of TF expression.
  • ASCL1 site peak to trough amplitude is associated with both SCLC status and ASCL1 positivity, while NEUROD1 and POU2F3 peak to trough amplitude is associated only with TF positivity.
  • FIGURES 20A and 20B graphically illustrate gene expression inference using TSS coverage distributions in flank tumor positive control samples.
  • FIGURE 20A illustrates TSS coverage distribution from targeted sequencing of cfDNA, grouped by gene expression quintile in SCLC flank tumor models (quintiles 1-5) and blood ("B", dark blue). Shown are 1,912 TSS corresponding to 1,213 genes, which were selected based on low expression in whole blood and correlation between TSS coverage distribution and gene expression. TSS coverage distribution varies systematically according to expression of the corresponding gene.
  • FIGURE 20B illustrates receiver operating characteristic curves for prediction of gene expression as above or below a threshold value (shown for thresholds of 0.1, 0.5, 1.0, and 2.0), as inferred from the coverage distribution of the corresponding TSS.
  • An estimator of gene expression was calculated from the TSS coverage profile as the magnitude of the difference of the average coverage depth at positions +130 and +145 relative to the TSS minus the average depth at positions -45, -30, and -15 (shown as a dotted line in 20A).
  • the AUC of the ROC curve is shown in parentheses for each gene expression cutoff. TSS coverage distributions can be used to predict whether a gene is expressed above or below a certain value with good test characteristics in this preliminary analysis that is restricted to especially variable, and therefore challenging, genes.
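The TSS-based estimator described for FIGURE 20B reduces to a few lines of arithmetic. The sketch below assumes the coverage profile is available as a dictionary keyed by position relative to the TSS; that representation, the function name, and the toy depths are illustrative assumptions, while the positions themselves come from the description above.

```python
def tss_expression_estimator(coverage,
                             downstream=(130, 145),
                             upstream=(-45, -30, -15)):
    """Magnitude of (mean depth at +130, +145) minus (mean depth at -45, -30, -15).

    coverage: dict mapping position relative to the TSS -> coverage depth.
    """
    downstream_mean = sum(coverage[p] for p in downstream) / len(downstream)
    upstream_mean = sum(coverage[p] for p in upstream) / len(upstream)
    return abs(downstream_mean - upstream_mean)


# Toy profile: depleted coverage just upstream of the TSS, higher coverage downstream.
toy_coverage = {-45: 2.0, -30: 1.0, -15: 3.0, 130: 5.0, 145: 7.0}
estimate = tss_expression_estimator(toy_coverage)
```

Thresholding this value at different cutoffs would give the operating points of an ROC curve such as the one described for FIGURE 20B.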
  • FIGURES 21A to 21C are a series of graphs illustrating use of aggregated coverage profiles across large rationally selected subsets of the TSS panel for prediction of SCLC vs NSCLC status in lung cancer PDX models and patient samples.
  • an amplitude feature was calculated from each coverage distribution curve as the difference between the coverage at the -45 position and the +120 position relative to the TSS, facilitating comparison within and between samples.
  • FIGURES 22A and 22B are a series of graphs illustrating use of aggregated coverage profiles across large rationally selected subsets of the TSS panel for prediction of SCLC vs NSCLC status in lung cancer PDX models (22A) and patient samples (22B).
  • An SCLC PDX that transdifferentiated from an adenocarcinoma is identified with a thick red line.
  • FIGURE 23 is a flowchart that illustrates a non-limiting example embodiment of a method of cell (e.g., cancer, e.g., prostate cancer) subtype prediction according to an aspect of the present disclosure.
  • the present disclosure is based on the inventors' development of a facile and sensitive approach to assess the chromatin architecture from cell-free DNA (cfDNA), and to provide accurate signal to detect and differentiate cell and/or tissue phenotypes based on the determined chromatin architecture.
  • ctDNA circulating tumor DNA
  • Sequencing analysis of ctDNA to detect genomic alterations has also served to classify some subsets of tumors based on genetic differences.
  • studying the tumor phenotype from ctDNA remains challenging and is still a nascent area of research.
  • In the bloodstream, cfDNA is protected from degradation by nucleosomes and other DNA binding proteins, leading to a coverage pattern that reflects the genomic organization in the cells-of-origin.
  • the genomic organization includes patterns of chromatin accessibility and transcriptional regulation, which, in turn, drive the differential phenotypes of the cells of origin.
  • cfDNA can provide a non-invasive route to identify tumor subtypes through the analysis of tumor phenotypes beyond the traditional analysis of genotype, which involves DNA alterations.
  • the inventors have addressed the shortcomings of the art to produce a facile, robust, and sensitive approach to detecting and differentiating cell phenotypes.
  • the approach is based in part on a core method, called "Griffin", to examine nucleosome protection and chromatin accessibility by quantifying cfDNA fragments around accessible sites.
  • Griffin implements fragment length-based GC correction to remove GC biases that obscure signals, which are especially prevalent in ULP-WGS applications (e.g., as low as 0.1x coverage of WGS).
  • Griffin can analyze any region throughout the genome that may be informative for differential chromatin accessibility between cell/tissue/cancer phenotype settings. For example, key transcription factors distinguishing between tumor subtypes can be predicted using Griffin via analysis at the binding sites of these transcription factors. Furthermore, Griffin can be applied to a variety of input data developed from different assay approaches to study chromatin architecture and accessibility, including ATAC-seq, ChIP-seq, transcription factor profiling data, CUT&RUN, and the like. Moreover, in sharp contrast to existing technologies, Griffin can address countless hypotheses by enabling the analysis of multiple 'omics', such as the following:
  • the Griffin approach is adaptable to existing ctDNA sequencing techniques and, thus, permits scalability, adaptability, and accessibility, even from ULP-WGS data, which is highly susceptible to bias and signal obfuscation.
  • Major applications of the approach include tumor (subtype) classification, identification of mixed histologies/phenotypes, detection of potential subtype switches (transdifferentiation) during therapy in "real time", and prediction of biomarkers (e.g., ARv7 splice variant) that can signal therapy resistance.
  • ARPC androgen receptor active
  • NEPC neuroendocrine
  • the disclosure provides a computer-implemented method of enhancing sequence read data from cell-free DNA samples for cell type prediction.
  • cell type prediction is used in a general sense to refer to predicting the identity of, or a characteristic of, a cell of origin (i.e., a cell contributing DNA in the cfDNA sample).
  • the characteristic can be a distinguishable phenotype compared to cells with a same or similar developmental lineage, including developmental lineages with a transformation event (i.e., for cancer cells).
  • the characteristic can be a distinguishable developmental lineage compared to a distinct developmental lineage.
  • the method encompasses predicting or differentiating among different cell lineages, different tissue types, different tissue subtypes, different cancer types, different cancer subtypes (i.e., subtypes of the same cancer type), and the like.
  • the only requirement is that the cell type, as broadly defined, be distinguishable by a unique nucleosome occupancy and/or chromatin accessibility profile.
  • the method comprises: receiving, by a computing system, sequence read data, wherein the sequence read data includes a plurality of fragment reads, wherein each fragment read has a fragment length and a GC content indicating a percentage of bases in the fragment read that are G or C; determining, by the computing system, GC bias values for each fragment read based on the fragment length and the GC content of the fragment read; generating, by the computing system, a genomic coverage distribution that is adjusted for GC bias using the sequence read data and the GC bias values; and predicting, by the computing system, the cell type based on the genomic coverage distribution.
  • FIG. 1 is a flowchart that illustrates a non-limiting example embodiment of a method of cell type prediction according to various aspects of the present disclosure.
  • the method 100 includes use of the GRIFFIN techniques described elsewhere herein to enable meaningful features to be extracted from short nucleic acid sequences of cancer DNA obtained from sequencing of cell-free DNA fragments in a sample.
  • the method 100 may be used for various different types of cell type prediction, including but not limited to tissue type prediction, cell type prediction, cancer type prediction, and cancer subtype prediction.
  • the method 100 proceeds to subroutine block 102, where genomic regions of interest are determined and filtered to identify cell-type-informative sites.
  • Any suitable technique for determining and filtering cell-type-informative sites may be used, and different techniques will likely be used for different types of cancer, different molecular subtypes of a cancer type, different tissues, different cell types, and different types of assays.
  • One non-limiting example embodiment of a suitable procedure for determining and filtering cell-type-informative sites is illustrated in FIG. 2 and described in further detail below.
  • a GC frequency matrix is determined for combinations of fragment lengths and GC content.
  • fragments having certain amounts of G and C bases (“GC content”) will be overrepresented in the sequence read data. This bias is not constant, as fragments of different sizes will have different GC biases.
  • One non-limiting example technique for determining a GC frequency matrix is illustrated in FIG. 3 and described in further detail below.
  • subroutine block 102 and subroutine block 104 may be performed on reference genome data before obtaining a sample or sequence data to be analyzed.
  • sequence read data is received.
  • the sequence read data represents sequence reads generated for a sample obtained from a subject.
  • the sequence read data may be obtained from an archive or other previously obtained sample.
  • the GC frequency matrix is used to determine GC bias values for the sequence read data. Any suitable technique may be used in subroutine block 108, including but not limited to the non-limiting example illustrated in FIG. 4 and described in further detail below.
  • the GC bias values are used to generate a genomic coverage distribution of the sequence read data for the cell-type-informative sites.
  • any suitable technique may be used in subroutine block 110, including but not limited to the non-limiting example illustrated in FIG. 5 and described in further detail below.
  • features are extracted from the genomic coverage distribution. Any features suitable for use with a classifier model may be extracted, and may depend on the type of classifier model used, the assay that generated the sequence reads, and/or the cell type (e.g., type of cancer, cancer subtypes, tissue, or cell type) to be detected. As one non-limiting example, for estrogen receptor (ER) subtyping in breast cancer, three features may be extracted: mean coverage, central coverage, and amplitude.
  • Mean coverage may be extracted by determining the mean coverage in a window around an informative site.
  • the window around the informative site for determining mean coverage may be any suitable size, including but not limited to a range from 1800-2200 bp (from +/- 900 bp to +/- 1100 bp).
  • a suitable size for the window for determining mean coverage is 2000 bp (+/- 1000 bp).
  • Central coverage may be extracted by determining the mean coverage in a smaller window around the informative site.
  • the window around the informative site for determining central coverage may be any suitable size, including but not limited to a range from 40-80 bp (from +/- 20 bp to +/- 40 bp).
  • a suitable size for the window for determining central coverage is 60 bp (+/- 30 bp).
  • Amplitude may be extracted by trimming the genomic coverage distribution to an area that includes a given number of peaks (such as an area of +/- 960 bp that contains 10 peaks), performing a fast Fourier transform, and taking the magnitude of a frequency based on the given number of peaks (e.g., the 10th frequency for the area that contains 10 peaks).
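The three features described above (mean coverage, central coverage, and amplitude) can be sketched as follows. This is an illustration under stated assumptions: the profile is given as parallel lists of offsets and coverage values, the trimmed window is taken as half-open (-960 to +960, excluding +960), and a single component of a direct discrete Fourier transform is computed in place of a full fast Fourier transform, which yields the same magnitude for that frequency. The function names and the synthetic cosine profile are hypothetical.

```python
import cmath
import math


def mean_in_window(offsets, values, half_width):
    """Mean coverage over all positions within +/- half_width of the site."""
    selected = [v for o, v in zip(offsets, values) if -half_width <= o <= half_width]
    return sum(selected) / len(selected)


def fourier_amplitude(offsets, values, half_width=960, n_peaks=10):
    """Magnitude of the n_peaks-th Fourier component of the trimmed profile."""
    trimmed = [v for o, v in zip(offsets, values) if -half_width <= o < half_width]
    n = len(trimmed)
    component = sum(
        v * cmath.exp(-2j * math.pi * n_peaks * k / n) for k, v in enumerate(trimmed)
    )
    return abs(component)


def extract_features(offsets, values):
    return {
        "mean_coverage": mean_in_window(offsets, values, 1000),   # +/- 1000 bp
        "central_coverage": mean_in_window(offsets, values, 30),  # +/- 30 bp
        "amplitude": fourier_amplitude(offsets, values),          # +/- 960 bp, 10 peaks
    }


# Synthetic profile sampled every 15 bp (an assumed step) with one peak per 192 bp,
# so the +/- 960 bp window contains exactly 10 peaks.
offsets = list(range(-990, 991, 15))
values = [1.0 + 0.5 * math.cos(2 * math.pi * o / 192) for o in offsets]
features = extract_features(offsets, values)
```

For this synthetic input the amplitude equals half the cosine amplitude times the number of trimmed samples (0.5 × 128 / 2 = 32), which is a convenient sanity check on the frequency index.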
  • the features are provided as input to a classifier model to predict the cell subtype.
  • Any suitable classifier model may be used.
  • the classifier model may be a logistic regression model.
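As a rough illustration of a logistic regression classifier operating on the extracted features, the sketch below fits a model by plain gradient descent using only the standard library. The training loop, learning rate, epoch count, and the toy feature rows are all assumptions for demonstration; they are not values or procedures taken from the disclosure, and a practical analysis would more likely use an established statistics library.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """Predicted probability of the positive class for one feature vector."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def fit_logistic(X, y, learning_rate=0.5, epochs=5000):
    """Fit weights and bias by batch gradient descent on the mean log-loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            error = predict_proba(weights, bias, xi) - yi
            for j, xij in enumerate(xi):
                grad_w[j] += error * xij
            grad_b += error
        weights = [w - learning_rate * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(X)
    return weights, bias


# Made-up rows of [mean coverage, central coverage, amplitude]; label 1 = subtype A.
X = [[1.0, 0.70, 0.20], [1.0, 0.75, 0.25], [1.0, 1.10, 0.60], [1.0, 1.05, 0.55]]
y = [1, 1, 0, 0]
weights, bias = fit_logistic(X, y)
probabilities = [predict_proba(weights, bias, row) for row in X]
```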
  • the method 100 then proceeds to an end block and terminates.
  • further action may be taken once the cancer subtype is determined, including but not limited to an appropriate cancer diagnosis, identifying cancer subtype change or switch, recommending a new course of treatment, altering an existing course of treatment, or any other appropriate action.
  • cfDNA released by hematopoietic cells which leads to a lower ctDNA fraction (i.e., tumor fraction).
  • an unsupervised probabilistic model was developed to estimate the proportion of cell types contributing to an individual plasma sample.
  • This model is the explicit modeling of the ctDNA tumor fraction in patients.
  • the input into this model includes signals generated from patient-derived xenografts (PDXs).
  • PDXs provide a resource that is ideal for studying the properties of ctDNA, developing new analytical tools, and validating both genetic and phenotypic features by comparison to matching tumors.
  • Using estimates of ctDNA fraction and these input PDX signals, the model applies a statistical mixture model approach to estimate the mixture weight parameter that represents the proportion of cell types.
  • the mixture weight parameter may be used as a prediction score to classify cell types, such as ARPC and NEPC, as discussed below in Example 2 and illustrated in FIG. 14-15.
  • Other cell types, such as phenotypes and subtypes can also be modeled and predicted using this framework.
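One way to picture the mixture weight is with a simple linear mixing model. The sketch below is an assumption-laden stand-in, not the disclosed probabilistic model: it assumes each observed feature is a tumor-fraction-weighted blend of a healthy reference and a two-phenotype tumor component, and it estimates the weight by least squares rather than by fitting a likelihood. All names and numbers are hypothetical.

```python
def estimate_mixture_weight(observed, healthy, arpc, nepc, tumor_fraction):
    """Least-squares estimate of the NEPC weight w in [0, 1].

    Assumed model for each feature i:
        observed[i] = (1 - tf) * healthy[i] + tf * (w * nepc[i] + (1 - w) * arpc[i])
    where tf is the known ctDNA tumor fraction and the reference values are
    PDX-derived (arpc, nepc) or healthy-donor-derived (healthy).
    """
    tf = tumor_fraction
    numerator = 0.0
    denominator = 0.0
    for x, h, a, n in zip(observed, healthy, arpc, nepc):
        residual = x - (1 - tf) * h - tf * a   # what remains if the tumor were all ARPC
        direction = tf * (n - a)               # change in the feature per unit of w
        numerator += residual * direction
        denominator += direction * direction
    if denominator == 0:
        raise ValueError("references are identical or tumor fraction is zero")
    return min(1.0, max(0.0, numerator / denominator))


# Toy references for three features and a sample simulated with w = 0.75, tf = 0.4.
healthy = [1.0, 1.0, 1.0]
arpc = [0.6, 1.0, 0.8]
nepc = [1.0, 0.5, 0.9]
tf, true_w = 0.4, 0.75
observed = [(1 - tf) * h + tf * (true_w * n + (1 - true_w) * a)
            for h, a, n in zip(healthy, arpc, nepc)]
score = estimate_mixture_weight(observed, healthy, arpc, nepc, tf)
```

The returned weight plays the role of a prediction score: values near 1 favor NEPC and values near 0 favor ARPC under this toy model.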
  • FIG. 2 is a flowchart that illustrates a non-limiting example embodiment of a procedure for determining genomic regions of interest and filtering to identify cell-type-informative sites according to various aspects of the present disclosure.
  • the cell types of interest for which the cell-type-informative sites are determined and filtered are different cancer types, different cancer subtypes, different tissue types, or different cell types.
  • the procedure 200 advances to block 202, where a list of sites likely to be informative in the cell type of interest is selected.
  • Sites may be selected using available data, including but not limited to public research databases and repositories and published scientific and sequencing data. These data may be derived from assays, including but not limited to the Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq), micrococcal nuclease sequencing (MNase-seq), DNase hypersensitivity site mapping, chromatin immunoprecipitation sequencing (ChIP-seq), and cleavage under targets and release using nuclease (CUT&RUN).
  • Sites from these data that distinguish between cell types are selected using any suitable comparison, including but not limited to statistical hypothesis testing using two-group Mann-Whitney U (also called Wilcoxon rank-sum) tests or Student's t-tests and multi-group Kruskal-Wallis tests or analysis of variance (ANOVA). Additional filtering may be performed using fold change between groups.
  • a mean mappability score (metric representing the uniqueness of the genomic sequence) is determined in a fixed size window around each site likely to be informative, and at optional block 206, sites having a mean mappability score less than a predetermined threshold are discarded.
  • Mappability may be determined based on reference data, such as the mappability score track from the UCSC genome browser. In some embodiments, the actions of optional block 204 and optional block 206 may not be performed.
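The optional mappability filter can be sketched as below. The per-base list of scores is an assumed in-memory stand-in for a mappability track, and the window half-width and threshold are placeholder values, since the text specifies only "a fixed size window" and "a predetermined threshold".

```python
def mean_mappability(track, site, half_window):
    """Mean mappability score in a fixed window centered on a site.

    track: list of per-base mappability scores for one chromosome (assumed format).
    """
    window = track[max(0, site - half_window): site + half_window + 1]
    return sum(window) / len(window) if window else 0.0


def filter_by_mappability(sites, track, half_window=500, threshold=0.95):
    """Keep sites whose mean mappability meets the threshold; discard the rest."""
    return [s for s in sites if mean_mappability(track, s, half_window) >= threshold]


# Toy track: a uniquely mappable stretch followed by a poorly mappable one.
track = [1.0] * 50 + [0.2] * 50
kept = filter_by_mappability([20, 48, 70], track, half_window=5, threshold=0.9)
```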
  • the remaining sites that are informative for determining cell type are identified.
  • Any suitable technique may be used.
  • The Cancer Genome Atlas (TCGA) ATAC-seq data may be used to identify sites that have differential ATAC signal between ER-positive and ER-negative TCGA samples.
  • ATAC-seq read counts around each site may be provided as input to DESeq2 software, which may then identify differential sites and produce an adjusted fold change and a false discovery rate (FDR) corrected p-value for each site.
  • the sites may be further refined by examining the fold change and retaining all sites with a log2 fold change greater than 0.5 in the subtype of interest relative to the other subtype.
  • ER-positive and ER-negative sites may be separated into those that are shared with hematopoietic cells and those that are not shared with hematopoietic cells, using a separate dataset of hematopoietic ChIP-seq peaks, to generate a total of four subtype-specific informative site lists.
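The refinement into four site lists can be sketched as follows, starting from a table of differential results. The record fields, list names, and the 0.05 FDR cutoff are assumptions for illustration (the text gives only the log2 fold change threshold of 0.5), and the input rows are made up rather than real DESeq2 output.

```python
def overlaps(chrom, pos, peaks):
    """True if a position falls inside any (chrom, start, end) peak (end exclusive)."""
    return any(c == chrom and start <= pos < end for c, start, end in peaks)


def build_site_lists(results, hematopoietic_peaks, fdr_cutoff=0.05, lfc_cutoff=0.5):
    """Split differential sites into four lists by subtype and hematopoietic overlap.

    results: list of dicts with keys chrom, pos, log2fc (ER+ relative to ER-), padj.
    """
    lists = {
        "ER_pos_shared": [], "ER_pos_specific": [],
        "ER_neg_shared": [], "ER_neg_specific": [],
    }
    for row in results:
        if row["padj"] >= fdr_cutoff:
            continue  # not significant at the assumed FDR cutoff
        if row["log2fc"] > lfc_cutoff:
            subtype = "ER_pos"
        elif row["log2fc"] < -lfc_cutoff:
            subtype = "ER_neg"
        else:
            continue  # fold change too small in either direction
        shared = overlaps(row["chrom"], row["pos"], hematopoietic_peaks)
        key = f"{subtype}_{'shared' if shared else 'specific'}"
        lists[key].append((row["chrom"], row["pos"]))
    return lists


# Made-up example rows and one hematopoietic peak.
results = [
    {"chrom": "chr1", "pos": 1000, "log2fc": 1.2, "padj": 0.001},
    {"chrom": "chr1", "pos": 5000, "log2fc": -0.9, "padj": 0.010},
    {"chrom": "chr2", "pos": 300, "log2fc": 0.3, "padj": 0.001},
    {"chrom": "chr2", "pos": 800, "log2fc": 2.0, "padj": 0.200},
]
site_lists = build_site_lists(results, hematopoietic_peaks=[("chr1", 900, 1100)])
```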
  • FIG. 3 is a flowchart that illustrates a non-limiting example embodiment of a procedure for determining a GC frequency matrix for a genome according to various aspects of the present disclosure.
  • the technique described in FIG. 3 is different from previous techniques, such as the approach described in Benjamini & Speed, 2012 and implemented in deepTools (Ramírez, Dündar, Diehl, Grüning, & Manke, 2014), at least because the previous techniques did not compensate for fragments of different lengths, and were never shown to work for cell-free DNA sequencing data.
  • a separate GC bias curve is determined for each different fragment length.
  • the procedure 300 advances to an end block and terminates.
  • a range of fragment lengths between a short length threshold and a long length threshold is analyzed in the procedure 300.
  • the short length threshold may be in a range of 10-20 bp
  • the long length threshold may be in a range of 450-550 bp.
  • the short length threshold may be 15 bp
  • the long length threshold may be 500 bp.
  • the for-loop may operate on each fragment length between the short length threshold and the long length threshold.
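The per-length loop of procedure 300 can be sketched as below. This is a simplified illustration under stated assumptions: the function name is hypothetical, the genome is a single in-memory string, and the sketch ignores ambiguous bases, excluded regions, and mappability, which a real implementation would need to handle.

```python
from itertools import accumulate


def gc_frequency_matrix(sequence, short_len=15, long_len=500):
    """Count, for each fragment length, how many start positions yield a
    fragment containing each possible number of G/C bases.

    Returns a dict: length -> list where index g is the number of positions
    whose fragment of that length contains exactly g G/C bases.
    """
    seq = sequence.upper()
    # prefix[i] = number of G/C bases in seq[:i]
    prefix = [0] + list(accumulate(1 if base in "GC" else 0 for base in seq))
    matrix = {}
    for length in range(short_len, long_len + 1):
        counts = [0] * (length + 1)
        for start in range(len(seq) - length + 1):
            gc = prefix[start + length] - prefix[start]
            counts[gc] += 1
        matrix[length] = counts
    return matrix
```

The prefix-sum array makes each fragment's GC count a constant-time lookup, so the cost is one pass over the sequence per fragment length.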
  • FIG. 4 is a flowchart that illustrates a non-limiting example embodiment of a procedure for using a GC frequency matrix to determine GC bias values for sequence read data according to various aspects of the present disclosure.
  • the number of observed reads of each fragment length and GC content are counted to determine GC counts for the sequence read data.
  • the GC counts are divided by the values in the GC frequency matrix to determine GC bias for each fragment length.
  • a mean GC bias is normalized for each fragment length to determine rough GC bias values.
  • the mean GC bias may be normalized to 1. This results in a rough GC bias value for every possible combination of fragment size and GC content.
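The division and normalization steps can be sketched as follows. The data layout (a dict from fragment length to a list indexed by G/C count) and the choice of an unweighted mean over GC bins with non-zero expected frequency are assumptions for illustration.

```python
def rough_gc_bias(observed, expected):
    """Compute rough GC bias values with mean 1 for each fragment length.

    observed, expected: dict length -> list indexed by G/C count.
    Bins with zero expected frequency are reported as None.
    """
    bias = {}
    for length, exp_counts in expected.items():
        obs_counts = observed.get(length, [0] * len(exp_counts))
        ratios = [o / e if e > 0 else None for o, e in zip(obs_counts, exp_counts)]
        valid = [r for r in ratios if r is not None]
        mean_ratio = sum(valid) / len(valid) if valid else 0.0
        if mean_ratio == 0:
            bias[length] = [None] * len(ratios)  # no usable reads at this length
        else:
            bias[length] = [None if r is None else r / mean_ratio for r in ratios]
    return bias
```

For example, observed counts of `[1, 2, 3]` against expected frequencies of `[1, 1, 1]` give ratios with mean 2, which normalize to `[0.5, 1.0, 1.5]`.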
  • the rough GC bias values are smoothed to determine the GC bias values.
  • all GC bias values for similar sized fragments (as a non-limiting example, for 165 bp fragments, fragments of sizes from 155 bp to 175 bp may be considered) may be determined.
  • the GC bias values for the similar sized fragments may be sorted by GC content, and kernel smoothing may be performed by taking the median of the nearest neighbors to determine the GC bias values.
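One way to read the smoothing step is as a nearest-neighbor median over pooled values, sketched below. The neighbor count `k`, the use of GC fraction to compare fragments of different lengths, and the function name are assumptions; the ±10 bp pooling follows the 155 bp to 175 bp example above.

```python
from statistics import median


def smooth_gc_bias(rough, length, half_width=10, k=5):
    """Smooth the rough bias values for one fragment length.

    rough: dict length -> list of bias values indexed by G/C count (None = missing).
    k: number of nearest neighbors used for the median (hypothetical value).
    """
    # Pool (GC fraction, bias) points from similarly sized fragments.
    points = []
    for other in range(length - half_width, length + half_width + 1):
        for gc_count, value in enumerate(rough.get(other, [])):
            if value is not None:
                points.append((gc_count / other, value))
    points.sort()  # sorted by GC content

    smoothed = []
    for gc_count in range(length + 1):
        target = gc_count / length
        nearest = sorted(points, key=lambda p: abs(p[0] - target))[:k]
        smoothed.append(median(v for _, v in nearest) if nearest else None)
    return smoothed
```

Taking a median rather than a mean keeps a single noisy bin (for example, a GC content observed in very few fragments) from dominating the smoothed value.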
  • the procedure 400 then advances to an end block and terminates.
  • FIG. 5 is a flowchart illustrating a non-limiting example embodiment of a procedure for using GC bias values to generate a genomic coverage distribution of sequence read data for cell-type-specific informative sites according to various aspects of the present disclosure.
  • the procedure 500 advances to block 502, where fragment midpoints in a window around each cell-type-specific informative site are determined.
  • a weight is assigned to each fragment based on the GC bias value that corresponds to the fragment's length and GC content (i.e., the GC bias value determined at subroutine block 108, e.g., by procedure 400).
  • the weights are used to determine GC-corrected midpoint profiles.
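The midpoint weighting can be sketched as follows. The text above states only that the weight is based on the GC bias value; using the inverse of the bias as the weight is an assumption of this sketch, as are the function name, the fragment tuple layout, and the `half_window` value.

```python
def gc_corrected_midpoint_profile(fragments, site, gc_bias, half_window=1000):
    """Build a GC-corrected midpoint profile around one site.

    fragments: iterable of (start, end, gc_count) with end exclusive.
    gc_bias: dict length -> list of bias values indexed by G/C count.
    Returns weighted midpoint counts for site - half_window .. site + half_window.
    """
    profile = [0.0] * (2 * half_window + 1)
    for start, end, gc_count in fragments:
        length = end - start
        midpoint = start + length // 2
        offset = midpoint - site + half_window
        if not 0 <= offset < len(profile):
            continue  # midpoint falls outside the window
        values = gc_bias.get(length)
        bias = values[gc_count] if values else None
        if not bias:
            continue  # missing or zero bias value: fragment cannot be weighted
        profile[offset] += 1.0 / bias  # assumption: weight is the inverse bias
    return profile
```

Under this assumption, fragments from over-represented GC and length combinations (bias above 1) contribute less than one count each, and under-represented combinations contribute more.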
  • positions are excluded that overlap excluded regions.
  • the excluded regions may be determined using any suitable technique.
  • the excluded regions may be obtained from one or more excluded region lists.
  • Excluded region lists may include, but are not limited to, an ENCODE unified GRCh38 exclusion list, centromeres, gaps in the human genome assembly, fix patches, alternative haplotypes, regions of zero mappability, and regions with unusually high coverage (e.g., 10 standard deviations above the mean).
  • GC-corrected midpoint profiles for all sites are averaged to determine a mean profile.
  • the mean profile is smoothed to generate a smoothed mean profile.
  • Any suitable technique for smoothing may be used.
  • the mean profile may be smoothed using a Savitzky-Golay filter with a window length of 165 bp and a 3rd order polynomial.
  • the smoothed mean profile is normalized by dividing by the mean of the surrounding coverage.
  • surrounding coverage in a range of 9,000-11,000 bp (+/- 4,500 bp to +/- 5,500 bp), such as 10,000 bp (+/- 5,000 bp) is considered for normalization. This allows samples with different depths of sequencing coverage to be compared.
  • the normalized mean profile may be used as the resulting genomic coverage distribution.
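The averaging and normalization steps can be sketched as below. The smoothing step is omitted to keep the example dependency-free, and the sketch assumes each per-site profile already spans the full normalization window (e.g., +/- 5,000 bp); the function name is hypothetical.

```python
def normalized_mean_profile(profiles):
    """Average per-site profiles position by position, then divide by the
    mean of the averaged profile so that surrounding coverage equals 1.

    profiles: list of equal-length lists of GC-corrected midpoint counts.
    """
    if not profiles:
        raise ValueError("at least one profile is required")
    n_sites = len(profiles)
    width = len(profiles[0])
    mean_profile = [sum(p[i] for p in profiles) / n_sites for i in range(width)]
    surrounding = sum(mean_profile) / width
    if surrounding == 0:
        raise ValueError("no coverage in the normalization window")
    return [value / surrounding for value in mean_profile]
```

Because the output is expressed relative to the surrounding coverage, a value below 1 at the site center indicates locally depleted coverage regardless of how deeply the sample was sequenced.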
  • the procedure 500 then advances to an end block and terminates.
  • FIG. 6 is a block diagram that illustrates aspects of an exemplary computing device appropriate for use as a computing device of the present disclosure.
  • the techniques described above, including but not limited to the techniques described in method 100, may be implemented in full or in part on one or more computing systems that include one or more computing devices such as computing device 600 that are communicatively coupled to each other.
  • the exemplary computing device 600 describes various elements that are common to many different types of computing devices, including but not limited to desktop computing devices, laptop computing devices, server computing devices, mobile computing devices, and computing devices that are part of a cloud computing system. While FIG. 6 is described with reference to a computing device that is implemented as a device on a network, the description below is applicable to servers, personal computers, mobile phones, smart phones, tablet computers, embedded computing devices, and other devices that may be used to implement portions of embodiments of the present disclosure. Some embodiments of a computing device may be implemented in or may include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other customized device. Moreover, those of ordinary skill in the art and others will recognize that the computing device 600 may be any one of any number of currently available or yet to be developed devices.
  • the computing device 600 includes at least one processor 602 and a system memory 610 connected by a communication bus 608.
  • the system memory 610 may be volatile or nonvolatile memory, such as read only memory (“ROM”), random access memory (“RAM”), EEPROM, flash memory, or similar memory technology.
  • system memory 610 typically stores data and/or program modules that are immediately accessible to and/or currently being operated on by the processor 602.
  • the processor 602 may serve as a computational center of the computing device 600 by supporting the execution of instructions.
  • the computing device 600 may include a network interface 606 comprising one or more components for communicating with other devices over a network. Embodiments of the present disclosure may access basic services that utilize the network interface 606 to perform communications using common network protocols.
  • the network interface 606 may also include a wireless network interface configured to communicate via one or more wireless communication protocols, such as Wi-Fi, 2G, 3G, LTE, WiMAX, Bluetooth, Bluetooth low energy, and/or the like.
  • the network interface 606 illustrated in FIG. 6 may represent one or more wireless interfaces or physical communication interfaces described and illustrated above with respect to particular components of the computing device 600.
  • the computing device 600 also includes a storage medium 604.
  • services may be accessed using a computing device that does not include means for persisting data to a local storage medium. Therefore, the storage medium 604 depicted in FIG. 6 is represented with a dashed line to indicate that the storage medium 604 is optional.
  • the storage medium 604 may be volatile or nonvolatile, removable or nonremovable, implemented using any technology capable of storing information such as, but not limited to, a hard drive, solid state drive, CD ROM, DVD, or other disk storage, magnetic cassettes, magnetic tape, magnetic disk storage, and/or the like.
  • FIG. 6 does not show some of the typical components of many computing devices.
  • the computing device 600 may include input devices, such as a keyboard, keypad, mouse, microphone, touch input device, touch screen, tablet, and/or the like. Such input devices may be coupled to the computing device 600 by wired or wireless connections including RF, infrared, serial, parallel, Bluetooth, Bluetooth low energy, USB, or other suitable connection protocols using wireless or physical connections.
  • the computing device 600 may also include output devices such as a display, speakers, printer, etc. Since these devices are well known in the art, they are not illustrated or described further herein.
  • the computer-implemented method implementing the Griffin workflow is highly adaptable to different types of input data reflective of the chromatin architecture (e.g., nucleosome occupancy and chromatin accessibility).
  • the method can be applied to various contexts of analyses depending on the source and character of the originating cells or tissues being analyzed.
  • the disclosure provides a method of determining a chromatin accessibility profile for a cell of interest from a sample comprising cell-free DNA derived from the cell of interest.
  • This method applies the Griffin data optimization workflow, described in more detail above, to determine a chromatin accessibility profile for a cell of interest.
  • the method is flexible and permits input data obtained from a variety of sequencing and capture protocols.
  • the method comprises: obtaining sequence read data from the cell-free DNA; receiving, by a computing system, sequence read data, wherein the sequence read data includes a plurality of fragment reads, wherein each fragment read has a fragment length and a GC content indicating a percentage of bases in the fragment read that are G or C; determining, by the computing system, GC bias values for each fragment read based on the fragment length and the GC content of the fragment read; generating, by the computing system, a genomic coverage distribution that is adjusted for GC bias using the sequence read data and the GC bias values; and determining the chromatin accessibility profile from the genomic coverage distribution.
  • the method can further comprise determining a phenotype of the cell of interest based on the chromatin accessibility profile.
  • determinations of cell phenotype can include determining the tissue type of origin of the cell, determining if the cell is transformed (e.g., is cancerous or malignant), determining the cancer type or cancer subtype, determining a malignancy aggressiveness phenotype, and/or determining a drug responsivity phenotype.
  • the term malignancy aggressiveness phenotype refers to the relative aggressiveness of a transformed (e.g., cancer) cell in terms of rate of reproduction, migration, drug responsivity, and the like.
  • the phenotype can be qualitative or can be assessed by various metrics to allow for quantitative comparison.
  • drug responsivity phenotype refers to the relative responsivity (i.e., susceptibility or resistance) of a cancer cell to a cancer therapy.
  • the metric can be quantitative or qualitative. These determinations can be made using various classifiers, described in more detail above, based on sequence data optimized by the Griffin workflow. Elements of the Griffin workflow and computer implemented method are described in more detail above and incorporated into the present aspect without limitation. Exemplary, nonlimiting implementations of the Griffin workflow and associated classifiers to subtype cancer cells with distinct phenotypes are provided in the Examples.
  • the Griffin workflow enhances data from a variety of sequencing and capture platforms to provide profiles of nucleosome accessibility, and these profiles can provide highly accurate insight as to the nature of cells that contribute to the ctDNA present in biological samples.
  • These insights enable detecting and characterizing cells that contribute to the ctDNA, including enabling the ability to detect cells of a certain type and/or differentiate cells between various subtypes.
  • the disclosure also provides a method for determining or identifying a cell type of a cell of interest from a sample comprising cell-free DNA derived from the cell of interest. The method of this aspect comprises: obtaining sequence read data generated from the sample comprising cell-free
  • the determining step can be performed by any of a number of appropriate classifiers based on the data enhanced by the Griffin workflow.
  • the determining step can comprise determining a cell phenotype, such as determining tissue type, a cancer type, a cancer subtype, a malignancy aggressiveness phenotype, a drug responsivity phenotype, or expression (or expression level) of a gene of interest.
  • the disclosure provides a method for detecting the presence of a cancer cell in a subject.
  • the method comprises: obtaining sequence read data generated from the sample comprising cell-free DNA obtained from the subject; performing the computer-implemented method described in more detail above (and which is incorporated into this aspect in all of its embodiments); and determining the presence of a cancer cell in the subject based on the prediction provided by the computing system.
  • the method is performed a plurality of times. Accordingly, the method can be a method of monitoring for the presence and/or identity of cancer in the subject.
  • the cancer cell(s) detected in the subject at each performance of the method can be further characterized. For example, the cell(s) can be monitored over time using this method to determine a cancer subtype or phenotype of the detected cancer cell(s) based on the prediction provided by the computing system.
  • the method further comprises detecting a change in phenotype of the detected cancer cell(s) over time. For example, as described in more detail below certain cancer types can progress from one subtype to another during the course of disease. Cancer cells can evolve and essentially switch between characterized subtypes.
  • non-small cell lung cancer can be monitored for transdifferentiation to small cell lung cancer (SCLC).
  • SCLC subtypes can be monitored for transdifferentiation to distinct subtypes.
  • the method can be performed starting before or during the course of treatment for cancer. Accordingly, the cancer can be monitored for responsivity to the treatment, or for changes in phenotype during the course of treatment. These characteristics can inform any appropriate adjustments to the treatment regimen.
  • the method comprises implementing a treatment or treatment change based on the monitored status of the cancer cells as determined by the method.
  • the disclosure provides a method of determining a cancer subtype of a target cancer cell from a sample comprising cell-free DNA derived from the target cancer cell.
  • the method comprises: obtaining sequence read data generated from the sample comprising cell-free DNA; performing the computer-implemented method described in more detail above (and which is incorporated into this aspect in all of its embodiments); and determining the cell type of the target cancer cell based on the predicted cancer subtype provided by the computing system.
  • the sample can be a biological sample from the subject, e.g., a subject with cancer or suspected to have cancer. Exemplary biological samples are described in more detail below.
  • the method comprises obtaining the biological sample from the subject and/or generating the sequence read data from the sample, according to standard techniques appropriate for the desired sequencing platform and/or targeted capture technology.
  • the Griffin platform has been employed to successfully distinguish between important subtypes of cancers for various different, unrelated cancers, indicating the broad applicability to cancer types in general.
  • the cancer is characterized as metastatic breast cancer.
  • the determining step comprises determining the status of the breast cancer as ER+ versus ER-, which refers to the expression of estrogen receptor (ER) and whether the cancer cells respond to exposure to estrogen. This status can be critical to informing the appropriate course of therapy because ER+ breast cancers can be addressed by administration of endocrine therapies.
  • the determining step comprises determining the status of the breast cancer as PR+ versus PR-, which refers to the expression of progesterone receptor (PR) and whether the cancer cells respond to exposure to progesterone. Similarly, this status can be critical to informing the appropriate course of therapy because PR+ breast cancers can also be addressed by administration of appropriate hormonal therapies, such as tamoxifen and aromatase inhibitors.
  • the determining step comprises determining the status of the breast cancer as HER2+ versus HER2-, which refers to the expression of human epidermal growth factor receptor 2 (HER2).
  • HER2+ breast cancer cells tend to result in poorer prognosis as they grow faster and have a higher likelihood of spreading, e.g., to the lymph nodes.
  • This status can be critical to informing the appropriate course of therapy because HER2+ breast cancers can be addressed by administration of appropriate HER2-targeted therapy, such as trastuzumab or pertuzumab.
  • the disclosure also encompasses embodiments of determining the expression status of multiple informative markers.
  • the method can comprise determining: whether the cancer is ER+ versus ER-; whether the cancer is PR+ versus PR-; and/or whether the cancer is HER2+ versus HER2-, in any combination.
  • the method comprises determining whether the cancer is ER+ versus ER-, whether the cancer is PR+ versus PR-, and whether the cancer is HER2+ versus HER2-.
  • Patients with triple-negative breast cancer (i.e., ER-, PR-, HER2-)
  • the cancer is characterized as metastatic prostate cancer.
  • determining the subtype of the prostate cancer addresses determining whether the cancer expresses various markers characteristic of distinguishable subtypes.
  • the step of determining the cancer subtype comprises determining whether the prostate cancer is AR+ (ARPC) versus AR-, which refers to the status of androgen receptor (AR) expression.
  • the step of determining the cancer subtype comprises determining whether the prostate cancer is AR+ (ARPC) versus AR-low.
  • Prostate cancers that are AR+ are often treated with androgen receptor signaling inhibitors (ARSI) that repress the androgen receptor activity in the cells.
  • the step of determining the cancer subtype comprises determining whether the prostate cancer has a neuroendocrine prostate cancer (NEPC) phenotype signature or not.
  • NEPC cells lack AR activity and possess distinct transcriptional programming regulation profiles from CRPC cells, including different epigenetic modifications, that result in a distinct phenotype that requires alternative therapeutic intervention.
  • the step of determining the cancer subtype comprises determining whether the prostate cancer is amphicrine, which refers to possessing both exocrine and neuroendocrine characteristics in the same cell. As is demonstrated in Example 2 below, the Griffin workflow can be leveraged to accurately distinguish these cell types from input sequence reads generated from ctDNA.
  • determining the cancer subtype comprises determining 2, 3, 4, or all of the following: whether the cancer is AR+ (ARPC) or AR-, whether the cancer is AR-low or ARPC, whether the cancer has a neuroendocrine prostate cancer (NEPC) phenotype signature or not, whether the cancer is AR-low or NEPC, whether the cancer is amphicrine or ARPC or NEPC, in any combination.
  • the cancer is characterized as metastatic lung cancer.
  • determining the subtype of the lung cancer comprises determining whether the cancer is small cell lung cancer (SCLC) or non-small cell lung cancer (NSCLC). If the lung cancer is NSCLC, in a further embodiment, the method further comprises determining whether the NSCLC is adenocarcinoma or squamous cell carcinoma.
  • the input sequence read data can be generated from a variety of platforms and with a variety of techniques, including whole genome analysis.
  • the inventors established that whole genome analysis, however, is not required. Instead, the inventors designed and implemented a panel of genomic targets deemed to be relevant to the scientific inquiry (e.g., subtyping lung cancer cells). Accordingly, in some embodiments, the lung cancer is further subtyped using sequence read data generated from a panel of genomic targets.
  • the panel of genomic targets comprises transcription factor binding sites (TFBSs) of one or more transcription factors associated with a designated subtype that is the subject of analysis, e.g., SCLC.
  • the one or more associated transcription factors comprise one or more of ASCL1, NEUROD1, POU2F3, REST, and the like.
  • the method comprises determining the nucleosome occupancy of the TFBSs using any appropriate technique (e.g., CUT & RUN, and the like).
  • the TFBSs can be identified by ChIP-seq data, or similar techniques known in the art.
  • Candidate TFBSs can be retained in the panel if they are proximal to a transcription start site (TSS) of a gene associated with lung cancer, or the subtype of lung cancer that is of interest in the subtyping.
  • proximal can mean within a proximity such that the TFBS is functionally influential on the start of transcription at the TSS.
  • the functional influence or relationship can be established if the TSS is the closest TSS to the TFBS.
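The closest-TSS criterion can be sketched as a nearest-neighbor lookup on sorted positions. The function name and the single-chromosome integer coordinates are assumptions for illustration; a real panel design would work per chromosome and carry gene annotations alongside each TSS.

```python
from bisect import bisect_left


def closest_tss(tfbs_positions, tss_positions):
    """Map each TFBS position to the nearest TSS position on the same chromosome."""
    tss_sorted = sorted(tss_positions)
    if not tss_sorted:
        raise ValueError("at least one TSS is required")
    assignment = {}
    for pos in tfbs_positions:
        i = bisect_left(tss_sorted, pos)
        # Only the TSS just before and just after the TFBS can be the closest.
        candidates = tss_sorted[max(0, i - 1): i + 1]
        assignment[pos] = min(candidates, key=lambda tss: abs(tss - pos))
    return assignment
```

A candidate TFBS could then be retained when the TSS it maps to belongs to a gene associated with the subtype of interest, for instance by checking the mapped TSS against a set of TSS positions for those genes.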
  • the panel of genomic targets comprises transcription start sites (TSSs) for one or more markers associated with lung cancer (or the specific subtype of lung cancer that is of interest).
  • the method comprises determining the nucleosome occupancy of the TSSs through known techniques.
  • the biological sample described herein can be any sample obtained from a subject that is likely to have cell free DNA.
  • Illustrative, non-limiting examples of the sample encompassed by the disclosure include blood, plasma, and serum, which are particularly useful for assessing cfDNA and ctDNA from a subject.
  • the methods can further comprise obtaining the biological sample from the subject. Additionally, for a subject that is determined to have cancer or a cancer subtype at any time, the method can further comprise prescribing appropriate treatment or actively treating the subject appropriately based on the determination of the cancer type or subtype according to accepted practice in the medical field for the determined cancer.
  • the described method can be performed multiple times to provide multiple assessments. This can be useful to provide methods for monitoring the presence or evolution of cell types or subtypes from a source.
  • the methods can be performed from sequence read data obtained from biological samples obtained from a subject before and/or for time points at or after initial diagnosis of cancer.
  • the Griffin workflow is flexible and is not limited to a certain set of genomic regions of interest, nor to a specific type of sequence data for generating coverage profiles.
  • Exemplary, non-limiting approaches for generating sequence read data include whole genome sequencing (for example depths between 0.05X coverage and 100X coverage) and chromatin accessibility assays.
  • the sequence read data is generated by, or regions of interest are identified using, techniques such as ATAC-seq, ChIP-seq, DNase sensitivity assays, and the like, which are known in the art.
  • the sequence data is generated by CUT & RUN. See, e.g., WO 2019/060907, incorporated herein by reference in its entirety.
  • the CUT & RUN assay can incorporate use of one or more affinity reagents (e.g., antibodies or antibody fragments) that target post-translational modifications of H3K27ac, H3K4me1 and/or H3K27ac.
  • the method comprises affirmatively generating the sequence read data, using for example, any of the illustrative approaches described herein or other appropriate approaches known in the art.
  • the sequence read data can be produced from a panel of genomic targets. It will be understood that this targeted panel approach is applicable beyond lung cancer subtyping to other types of cancers.
  • the sequence read data can comprise sequence read data generated from a panel of genomic targets.
  • the panel of genomic targets can be designed and assembled according to the approach described in Example 3 in the context of lung cancer (see also FIG. 16).
  • the panel can comprise TFBSs of one or more transcription factors associated with a cancer type of interest.
  • the transcription factors associated with a cancer type of interest can be readily identified from the art.
  • the TFBSs relating to the designated transcription factor(s) can be determined by standard assays that establish binding sites in the genome, such as ChIP-seq data, and the like. Furthermore, candidate TFBSs can be further retained based on an assessment of association or proximity with transcription start sites (TSSs) of genes with transcription levels (on, off, high, low, etc.) associated with a relevant cancer or cancer subtype.
  • the panel of genomic targets comprises transcription start sites (TSSs) for one or more markers associated with the cancer type of interest.
  • the panel can be constructed using the TFBSs and/or TSSs in any combination. Once established, directed sequencing reads are generated from the targets. In some embodiments, the nucleosome occupancy of the TFBSs and/or TSSs is determined.
  • the sequence read data is the input into the computer-implemented Griffin method described above to facilitate the appropriate subtyping or other analysis.
  • the disclosure provides a computer-implemented method of enhancing sequence read data from cell-free DNA samples for cell type prediction.
  • the method comprises: receiving, by a computing system, sequence read data, wherein the sequence read data includes a plurality of fragment reads, and wherein each fragment read has a fragment length; determining, by the computing system, a fragment size variability for at least one gene associated with a cell type; and predicting, by the computing system, the cell type based on the fragment size variability for the at least one gene.
  • FIG. 23 is a flowchart that illustrates a non-limiting example embodiment of enhancing sequence read data from cell-free DNA samples for improved cell type prediction according to various aspects of the present disclosure.
  • a computing system receives sequence read data, wherein the sequence read data includes a plurality of fragment reads, and wherein each fragment read has a fragment length.
  • the computing system determines a fragment size variability for at least one gene associated with a cell type.
  • locations of genes whose mRNA expression and transcriptional activity are known to be associated with given cell types, such as the 47 genes illustrated in FIG. 12D that are known to be associated with prostate cancer, may be used.
  • a coefficient of variation of the fragment size of fragments at locations associated with one or more genes may be determined and used as fragment size variability values.
  • the coefficient of variation (CV) has been found to be particularly useful in distinguishing cell types based on fragment size variability when analyzing fragments at genes that are associated with the cell types. In particular, CV has been found to be less affected by the depth of sequencing coverage than other techniques (such as measurements of entropy).
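The coefficient of variation is the standard deviation of the fragment lengths divided by their mean. A minimal sketch follows; the function name is hypothetical, and the use of the population standard deviation rather than the sample standard deviation is an assumption, since the text does not specify which is used.

```python
from statistics import mean, pstdev


def fragment_size_cv(fragment_lengths):
    """Coefficient of variation of fragment lengths at a gene or gene set.

    fragment_lengths: lengths (bp) of fragments overlapping the locations of
    interest. Uses the population standard deviation (an assumption).
    """
    lengths = list(fragment_lengths)
    if len(lengths) < 2:
        raise ValueError("at least two fragments are required")
    return pstdev(lengths) / mean(lengths)
```

Because the CV is a ratio of two statistics that both stabilize as more fragments are observed, it is plausible that it varies less with sequencing depth than count-based measures, which is consistent with the observation above.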
  • the computing system predicts the cell type based on the fragment size variability for at least one gene.
  • features may be generated based on the fragment size variability, and the features may be provided as input to a classifier model to determine whether the features represent a given cell type.
  • a ratio of the fragment size variability in a first cell type versus a second cell type may be used as a feature.
  • the classifier model may be used to determine whether the calculated features for a given sample are more like features of a first cell type or a second cell type. Any suitable classifier model, including but not limited to logistic regression models, artificial neural networks, decision trees, support vector machines, and Bayesian networks, may be used.
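As one illustration of the classifier step, the sketch below fits a one-feature logistic regression by gradient descent. It is a toy stand-in, not the classifier used in the Examples: the feature is assumed to be a log ratio of fragment size variability between two gene sets, and the function names, learning rate, and epoch count are hypothetical.

```python
import math


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(features, labels, lr=0.1, epochs=2000):
    """Fit weight and intercept of a one-feature logistic regression.

    features: one value per sample (e.g., a log ratio of fragment size CVs).
    labels: 0 for the first cell type, 1 for the second.
    """
    w, b = 0.0, 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(features, labels):
            error = _sigmoid(w * x + b) - y
            grad_w += error * x
            grad_b += error
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def predict_proba(x, w, b):
    """Probability that a sample with feature x belongs to the second cell type."""
    return _sigmoid(w * x + b)
```

A new sample would be scored with `predict_proba` and assigned to whichever cell type has the higher probability. Any of the other model families listed above could be substituted without changing how the features are computed.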
  • One non-limiting example embodiment of the use of the method 700 is described in Example 2, where analysis of fragment size variability is used to distinguish prostate cancer cell types of androgen receptor pathway active prostate cancer (ARPC) varieties and neuroendocrine prostate cancer (NEPC) varieties.
  • subject means a mammal being assessed for treatment and/or being treated.
  • the mammal is a human.
  • the terms “subject,” “individual,” and “patient” encompass, without limitation, individuals having cancer. While subjects may be human, the term also encompasses other mammals, particularly those mammals useful as laboratory models for human disease, e.g., mouse, rat, dog, non-human primate, and the like.
  • treating and grammatical variants thereof may refer to any indicia of success in the treatment or amelioration or prevention of a disease or condition (e.g., a cancer, infectious disease, or autoimmune disease), including any objective or subjective parameter such as abatement; remission; diminishing of symptoms or making the disease condition more tolerable to the patient; slowing in the rate of degeneration or decline; or making the final point of degeneration less debilitating.
  • the treatment or amelioration of symptoms can be based on objective or subjective parameters, including the results of an examination by a physician.
  • the term “treating” includes the administration of the compounds or agents of the present disclosure to prevent or delay, to alleviate, to improve clinical outcomes, to decrease occurrence of symptoms, to improve quality of life, to lengthen disease-free status, to stabilize, to prolong survival, to arrest or inhibit development of the symptoms or conditions associated with a disease or condition (e.g., a cancer), or any combination thereof.
  • therapeutic effect refers to the reduction, elimination, or prevention of the disease or condition, symptoms of the disease or condition, or side effects of the disease or condition in the subject.
  • “nucleic acid” or “polynucleic acid” refers to a polymer of nucleotide monomer units or “residues,” typically DNA or RNA.
  • the nucleotide monomer subunits, or residues, of the nucleic acids each contain a nitrogenous base (i.e., nucleobase), a five-carbon sugar, and a phosphate group.
  • the identity of each residue is typically indicated herein with reference to the identity of the nucleobase (or nitrogenous base) structure of each residue.
  • Canonical nucleobases include adenine (A), guanine (G), thymine (T), uracil (U) (in RNA instead of thymine (T) residues) and cytosine (C).
  • the nucleic acids of the present disclosure can include any modified nucleobase, nucleobase analogs, and/or non-canonical nucleobase, as are well-known in the art.
  • Example 1 is set forth so as to provide those of ordinary skill in the art with a complete disclosure and description of how to make and use the present invention, and is not intended to limit the scope of what the inventors regard as their invention, nor is it intended to represent that the experiments below are all or the only experiments performed.
  • cfDNA Cell-free DNA
  • ctDNA circulating tumor DNA
  • genomic alterations from ctDNA have helped to distinguish molecular subsets of tumors.
  • these genomic alterations including somatic mutations, may not always fully explain treatment failure or identify therapeutic targets, exemplifying a major limitation of cancer precision medicine.
  • Tumor subtypes are often characterized by distinct transcriptional regulation, which can change during treatment resistance, leading to different clinical tumor phenotypes.
  • prostate and lung cancers may undergo trans-differentiation from adenocarcinoma to small-cell neuroendocrine phenotypes.
  • MBC metastatic breast cancer
  • treatment is guided based on clinical subtypes determined by the expression of the estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2), often in the primary tumor; endocrine therapies are prescribed to patients with ER-positive (ER+) or PR-positive (PR+) carcinomas while patients with HER2 positive tumors are prescribed anti-HER2 drugs.
  • ER estrogen receptor
  • PR progesterone receptor
  • HER2 human epidermal growth factor receptor 2
  • TNBC triple negative breast cancer
  • ER- ER-negative subtypes
  • mixtures of clinical subtypes may also co-exist across or within metastatic lesions in the same patient, presenting major clinical challenges. Therefore, accurate subtype classification and identification of transcriptional patterns underlying emergent clinical phenotypes during therapy have critical implications for studying mechanisms of resistance and informing treatment decisions.
  • nucleosomes are positioned in an organized manner that allows access for DNA binding proteins (FIG. 7A). This nucleosome organization results in a loss of sequencing coverage, reflecting DNA degradation at the unprotected binding site, with peaks of coverage at the surrounding protected locations.
  • nucleosome profiling from cfDNA has been demonstrated for cancer detection and tumor tissue-of-origin prediction, including the analysis of shorter cfDNA fragments which tend to be enriched from tumor cells. While tumor subtyping from cfDNA has been explored in prostate cancer by analyzing TFBS locations, it is believed that there have not been demonstrations of subtype classification from cfDNA in other cancers. Specifically, predicting histological subtypes in breast cancer has not been shown from cfDNA. Furthermore, current cfDNA nucleosome profiling approaches have not been optimized for ULP-WGS data. Studying the clinical phenotype of tumors from ctDNA remains challenging due to lack of robust computational methods but has obvious potential clinical benefits for guiding treatment decisions in patients with metastatic cancer.
  • Griffin a computational framework called Griffin was developed to classify tumor subtypes from nucleosome profiling of cfDNA.
  • Griffin overcomes current analytical challenges to profile nucleosome accessibility and transcriptional regulation from the analysis of standard cfDNA genome sequencing, including ULP-WGS (0.1x) coverage.
  • Griffin employs a novel GC correction procedure that is specific for DNA fragment sizes and therefore unique for cfDNA sequencing data.
  • Griffin was applied to perform cancer detection and tumor tissue-of-origin analysis with high performance. Then, the first application of breast cancer ER subtyping from cfDNA was demonstrated, showing strong classification accuracy and insights into tumor heterogeneity and prognosis, all achieved from analysis of ULP-WGS data.
  • Griffin is a generalizable framework that can detect molecular changes in transcriptional regulation and chromatin accessibility from cfDNA and possibly direct personalized treatment to improve patient outcomes.
  • Griffin was developed as an analysis framework with a GC correction procedure to accurately profile nucleosome occupancy from cfDNA. Griffin processes fragment coverage to distinguish accessible and inaccessible features of nucleosome protection (FIG. 7A). Griffin is designed to be applied to whole genome sequencing (WGS) data of cfDNA from patients with cancer to quantify nucleosome protection around sites of interest and is optimized to work for ULP-WGS data (FIG. 7B). Sites of interest can be selected from various chromatin-based assays, such as the assay for transposase-accessible chromatin using sequencing (ATAC-seq), and are tailored to address specific problems including cancer detection and tumor subtyping.
  • WGS whole genome sequencing
  • ATAC-seq assay for transposase-accessible chromatin using sequencing
  • the analysis workflow begins with computing the genome-wide fragment-based GC bias for each sample. Then, for the region at each site of interest, the fragment midpoint coverage is computed and reweighted to remove GC biases (Methods). Midpoint coverage rather than full fragment coverage is used because it produces higher amplitude nucleosome protection signals (not shown). Next, a composite coverage profile is computed as the mean of the GC- corrected coverage across the set of sites specific for a tissue type, tumor type, transcription factor (TF), or any phenotypic comparison of interest.
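  • The composite step described above can be illustrated with a minimal sketch. It assumes each site has already been reduced to a GC-corrected coverage profile of equal length; the function name `composite_profile` is hypothetical and is not taken from the Griffin code base.

```python
def composite_profile(site_profiles):
    """Average per-site coverage profiles position by position.

    site_profiles: list of equal-length lists of GC-corrected coverage values,
    one list per site (an assumed input layout, for illustration only).
    """
    if not site_profiles:
        raise ValueError("at least one site profile is required")
    length = len(site_profiles[0])
    if any(len(p) != length for p in site_profiles):
        raise ValueError("all site profiles must have the same length")
    n_sites = len(site_profiles)
    # Mean across sites at each position relative to the site center
    return [sum(p[i] for p in site_profiles) / n_sites for i in range(length)]
```

  • Averaging over many sites is what makes the signal usable at very low coverage: any single site may hold no fragments at all, but the mean over thousands of sites can still show a dip at the center.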
  • a novel aspect of Griffin is the implementation of a fragment-based GC bias correction.
  • GC-content is non-uniform, which leads to GC-related coverage biases (FIG. 8A) (Wang, J. et al. Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors. Genome Res. 22, 1798-1812 (2012)).
  • GC bias varies between samples and between different fragment lengths within a sample (Benjamini, Y. & Speed, T. P. Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Research 40, e72-e72 (2012)) (FIG.
  • nucleosome accessibility prediction (FIG. 8C).
  • Griffin computes the global estimated mean fragment coverage ("expected") using a fragment length position model (Benjamini, Y. & Speed, T. P. Nucleic Acids Research 40, e72-e72 (2012)) (Methods, FIG. 8B). Then, when calculating coverage profiles around sites of interest, each fragment is assigned a weight based on the global expected coverage for its length and GC bias. This correction eliminates unexpected increases (or decreases) in coverage at binding sites, removing technical biases to enhance the tissue-associated accessibility signals when analyzing WGS (9-25x, FIG. 8C) cancer patient cfDNA and ULP-WGS (0.1-0.3x, FIG. 8D).
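  • As a rough illustration of the weighting idea, the sketch below estimates a bias value for each (fragment length, GC count) bin as observed fragments divided by the number of genomic positions that could yield such a fragment, normalizes within each fragment length, and uses the reciprocal as the fragment weight. This is a simplified stand-in, not the fragment length position model cited above; the reciprocal weighting, the per-length normalization, and all names are assumptions made for the example.

```python
from collections import Counter

def gc_bias_table(fragments, genome_freq):
    """Estimate GC bias per (fragment_length, gc_count) bin.

    fragments: iterable of (length, gc_count) tuples for observed fragments.
    genome_freq: dict mapping (length, gc_count) -> count of genomic positions
                 that could produce such a fragment (assumed precomputed).
    """
    observed = Counter(fragments)
    raw = {key: observed.get(key, 0) / n
           for key, n in genome_freq.items() if n > 0}
    # Group raw ratios by fragment length so each length is normalized separately
    by_length = {}
    for (length, _gc), value in raw.items():
        by_length.setdefault(length, []).append(value)
    bias = {}
    for (length, gc), value in raw.items():
        mean = sum(by_length[length]) / len(by_length[length])
        bias[(length, gc)] = value / mean if mean > 0 else 0.0
    return bias

def fragment_weight(bias, length, gc_count):
    """Weight for one fragment: reciprocal of its bin's bias (assumption)."""
    b = bias.get((length, gc_count), 0.0)
    return 1.0 / b if b > 0 else 0.0
```

  • With this weighting, an over-represented bin (bias above 1) contributes less per fragment and an under-represented bin contributes more, so the weighted counts of the two bins become comparable.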
  • the estimated TFBS accessibility was compared with the amount of tumor- derived DNA (i.e. tumor fraction) predicted by ichorCNA for ULP-WGS data from 191 MBC cfDNA samples with > 0.1 tumor fraction (Adalsteinsson, V. A. et al. Scalable whole-exome sequencing of cell-free DNA reveals high concordance with metastatic tumors. Nature Communications 8, (2017)).
  • the tumor fraction was expected to be negatively correlated with the central coverage around tumor-specific sites, and positively correlated for blood-specific sites.
  • the RMSE decreased (0.062 to 0.046), indicating less inter-sample variation in the data after GC correction (FIG. 8E).
  • the central coverage for the 377 TFs was examined in a cohort of 215 healthy donors (Cristiano, S. et al. Genome-wide cell-free DNA fragmentation in patients with cancer. Nature 570, 385-389 (2019)) before and after GC correction.
  • the performance was likely reflective of the higher tumor fractions observed in late-stage cancer relative to early-stage cancer.
  • DHS DNase I Hypersensitivity Sites
  • Griffin was applied to profile nucleosome accessibility at these four sets of ER subtype-specific accessible chromatin sites, extracting a total of 12 features (FIG. 10B).
  • Circulating tumour DNA in metastatic breast cancer to guide clinical trial enrolment and precision oncology: A cohort study. PLoS Medicine 17.10 (2020): e1003363) and using the model trained on the original MBC dataset, we were able to predict ER status with 0.92 accuracy (0.96 AUC) in all samples with >0.05 tumor fraction. Looking only at samples with >0.1 tumor fraction, the accuracy was 0.96 and the AUC was 0.98. This analysis further supports that Griffin can perform accurate ER status prediction in independent datasets.
  • Griffin a new framework and analysis tool for studying transcriptional regulation and tumor phenotypes.
  • Griffin uses a novel cfDNA fragment length-specific normalization of GC-content biases that obscure chromatin accessibility information. It is demonstrated that Griffin can be used to detect cancer from low pass WGS with high accuracy. Additionally, an approach was developed to perform ER subtyping in breast cancer from ULP-WGS, which is the first time that ER phenotype prediction has been shown from ctDNA.
  • Griffin is versatile and can be used for various applications in cancer. This disclosure highlights cancer detection, tissue-of-origin, and tumor subtype use-cases. However, Griffin can also be used for any biological comparison where transcriptional regulation and chromatin accessibility differences can be delineated.
  • the applications described here use TFBSs from chromatin immunoprecipitation sequencing (ChIP-seq) and accessible chromatin sites from ATAC-seq.
  • ChIP-seq chromatin immunoprecipitation sequencing
  • ATAC-seq accessible chromatin sites from ATAC-seq.
  • Griffin differs from existing methods due to its ability to analyze custom sites of interest that are specific to any biological context. These sites may be obtained from external sources and different assays, such as ChIP-seq, DNase I hypersensitivity, ATAC-seq or cleavage under targets and release using nuclease (CUT&RUN).
  • Griffin is optimized for the analysis of ULP-WGS (0.1x) of cfDNA, while other nucleosome profiling methods have focused on deeper coverage sequencing. Griffin takes advantage of analyzing the breadth of sites as opposed to individual loci, which was inspired by a similar strategy used by Ulz, P. et al. Inference of transcription factor binding from cell-free DNA enables tumor subtype prediction and early detection. Nature Communications 10, 4666 (2019). It is demonstrated that Griffin has better performance for both detecting cancer and predicting ER status from ULP-WGS data when compared to the Ulz method, because of its novel bias correction and versatility to analyze any set of genomic regions. However, Griffin is not limited to low coverage data.
  • Increased cfDNA sequencing coverage can allow for analysis of specific gene promoters and cis-regulatory elements and may be able to inform gene expression (Ulz, P. et al. Inferring expressed genes by whole-genome sequencing of plasma DNA. Nature Genetics 48, 1273-1278 (2016)). While recent studies show the promise of cfDNA methylation and cfRNA analysis for tumor phenotype analysis and cancer detection (Beltran, H. et al. Circulating tumor DNA profile recognizes transformation to castration-resistant neuroendocrine prostate cancer. J Clin Invest 130, 1653-1668 (2020); Wu, A. et al. Genome-wide plasma DNA methylation features of metastatic prostate cancer. J Clin Invest 130, 1991-2000 (2020); Shen, S.
  • a limitation of the binary ER classification is the decreased accuracy for samples with lower tumor fraction (0.05 to 0.1); however, patients with cfDNA tumor fraction > 10% have poorer prognosis (Stover, D. G. et al. Association of Cell-Free DNA Tumor Fraction and Somatic Copy Number Alterations With Survival in Metastatic Triple-Negative Breast Cancer. JCO 36, 543-553 (2018)) and would benefit more from tumor monitoring. It may be possible to improve performance of ER subtyping for lower tumor fraction samples with additional sequencing depth or joint analysis of multiple cfDNA timepoints from the same patient.
  • the breast cancer subtyping was focused on ER prediction because its status has important utility in predicting likely benefit to endocrine therapy (Group (EBCTCG), E. B. C. T. C. Relevance of breast cancer hormone receptors and other factors to the efficacy of adjuvant tamoxifen: patient-level meta-analysis of randomised trials. The Lancet 378, 771-784 (2011)). While PR expression is also determined in the clinic and ER-/PR+ tumors are considered hormone receptor positive, these are rare, not reproducible or less useful for prognosis (Hefti, M. M. et al. Estrogen receptor negative/progesterone receptor positive breast cancer is not a reproducible subtype. Breast Cancer Research 15, R68 (2013)).
  • HER2 overexpression is relevant for prognosis and for determining treatment such as trastuzumab (Slamon, D. J. et al. Human breast cancer: correlation of relapse and survival with amplification of the HER-2/neu oncogene. Science 235, 177-182 (1987)).
  • an insufficient number of open chromatin sites were identified that were specific for distinguishing HER2 status.
  • ERBB2 encodes the HER2 protein
  • the Griffin framework is a unique advance on our previous method to analyze genomic alterations and estimate tumor fraction from ULP-WGS of cfDNA (Adalsteinsson, V. A. et al. Scalable whole-exome sequencing of cell-free DNA reveals high concordance with metastatic tumors. Nature Communications 8, (2017)). Together, these methods form a suite of tools to establish a new paradigm to study both tumor genotype and phenotype from ULP-WGS of cfDNA. Griffin has the potential to reveal clinically relevant tumor phenotypes, which will support the study of therapeutic resistance, inform treatment decisions, and accelerate applications in cancer precision medicine.
  • GC content influences the efficiency of amplification and sequencing leading to different expected coverages (coverage bias) for fragments with different GC contents and fragment lengths. This is called GC bias and is unique to each sample.
  • mappability score 1
  • mappability score 2
  • mappability score 1
  • mappability score 2
  • mappability score 3
  • centromeres
  • fix patches
  • alternative haplotypes for hg38 downloaded from UCSC table browser
  • the pipeline takes a bam file, a bedGraph file of valid (mappable, non-excluded) regions, and genome GC frequencies for those regions. For each given sample, we fetched all reads aligning to the valid regions on autosomes using pysam (github.com/pysam-developers/pysam) (Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078-2079 (2009)).
  • the Griffin nucleosome profiling pipeline performs nucleosome profiling around sites of interest. This pipeline takes a bam file, a site list, and assorted other parameters described below. For a given bam file and site list, we fetched all reads in a window (-5000 to +5000 bp) around each site using pysam (excluding those that failed quality control measures). We then filtered read pairs by fragment length and selected those in a range of fragment lengths (100-200 bp unless otherwise specified). For each read pair, we determined the GC bias for the fragment, assigned a corresponding weight to that fragment, and identified the location of the fragment midpoint.
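  • The per-site step can be sketched as follows, assuming fragments have already been read from the bam file and reduced to (start, end) coordinates. The function name, the half-open coordinate convention, and the `weight_fn` callback (which stands in for the GC-bias-derived weight) are hypothetical choices for this example, not the pipeline's actual interface.

```python
def midpoint_coverage(fragments, site, weight_fn,
                      window=5000, min_len=100, max_len=200):
    """Accumulate weighted fragment midpoints around one site.

    fragments: iterable of (start, end) pairs, 0-based half-open (assumption).
    site: genomic position of the site of interest.
    weight_fn: callable (start, end) -> float, the fragment's weight.
    Returns a list of 2*window + 1 values; index `window` is the site itself.
    """
    profile = [0.0] * (2 * window + 1)
    for start, end in fragments:
        length = end - start
        # Keep only fragments in the selected length range
        if not (min_len <= length <= max_len):
            continue
        midpoint = (start + end) // 2
        offset = midpoint - site
        # Count the fragment only if its midpoint falls inside the window
        if -window <= offset <= window:
            profile[offset + window] += weight_fn(start, end)
    return profile
```

  • Each fragment contributes at a single position (its midpoint) rather than across its full span, which matches the midpoint-coverage choice described in the text.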
  • cfDNA tumor fraction was estimated using ichorCNA (Adalsteinsson, V. A. et al. Scalable whole-exome sequencing of cell-free DNA reveals high concordance with metastatic tumors. Nature Communications 8, (2017) ).
  • An hg38 panel of normals (PoN) with a 1 Mb bin size was created using all 215 healthy donors in the dataset.
  • ichorCNA was then run on all cancer and healthy samples to estimate tumor fraction.
  • ichorCNA_fracReadsInChrYForMale was set to 0.001. Defaults were used for all other settings.
  • the LUCAS cohort included 158 patients who had no history of cancer and no future cancer diagnosis and 129 patients who were diagnosed with lung cancer within days of blood draw (0-44 days).
  • the validation cohort included 46 patients with cancer and 385 patients without cancer. All samples were realigned to hg38 as described below in sequence data processing. Tumor fraction was determined using ichorCNA as described above with a panel of normals constructed from 54 separate non-cancer samples from this same study.
  • MBC metastatic breast cancer
  • WGS of cfDNA from patients with metastatic breast cancer (MBC) and healthy donors were obtained from an existing dataset (Adalsteinsson, V. A. et al. Nature Communications 8, (2017)). Bam files were downloaded from dbGaP (accession code: phs001417.v1.p1). This data consisted of ~0.1x ultra-low pass whole genome sequencing (ULP-WGS) from 100 bp paired end Illumina sequencing reads.
  • ULP-WGS ultra-low pass whole genome sequencing
  • ER estrogen receptor
  • each sample was labeled as ER+ or ER- using information about the ER status from medical records. If metastatic ER status was known, the sample was labeled according to this status. If metastatic ER status was not known, the sample was labeled according to the primary tumor ER status (20 samples from 11 patients). ER low samples (11 samples from 6 patients) were labeled ER positive for the purpose of the binary classifier. For three patients (MBC_1405, MBC_1406, MBC_1408), we had information about multiple metastatic biopsies with different ER statuses. In these cases, we used the last biopsy taken for the purpose of the binary ER status classifier.
  • WGS of cfDNA samples from patients with MBC were obtained from an existing study as described above (Adalsteinsson, V. A. et al. Nature Communications 8, (2017)). Additional information, including primary ER status, metastatic ER status, and survival time, was abstracted from the medical records. Use of this data was approved by an institutional review board (Dana-Farber Cancer Institute IRB protocol identifiers 05-246, 09-204, 12-431 [NCT01738438; Closure effective date 6/30/2014]).
  • TFBS Transcription factor binding site selection Transcription factor binding sites (TFBSs) were downloaded from the GTRD database (Yevshin, I., GTRD: A database on gene transcription regulation - 2019 update. Nucleic Acids Research 47, D100-D105 (2019)).
  • This database contains a compilation of ChIP-seq data from various sources.
  • we used the meta clusters data version 19.10 (downloaded from gtrd.biouml.org/downloads/19.10/chip-seq/Homo%20sapiens_meta_clusters.interval.gz). This contains meta peaks observed in one or more ChIP-seq experiments.
  • the GTRD database contains some ChIP-seq experiments for targets that are not transcription factors (TFs).
  • DNase I hypersensitivity sites for a variety of tissue types were downloaded from zenodo.org/record/3838751/files/DHS_Index_and_Vocabulary_hg38_WM20190703.txt.gz (Meuleman, W. et al. Index and biological spectrum of human DNase I hypersensitive sites. Nature 584, 244-251 (2020)). These sites were split by tissue type for a total of 16 site lists. The 'summit' column was used as the site position. The sites were sorted by the number of samples where that site had been observed ('numsamples') and the top 10,000 most frequently observed sites were selected for each tissue type.
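  • The site selection described above can be sketched in a few lines, assuming the table has been parsed into one dictionary per row. The 'summit' and 'numsamples' keys follow the column names quoted in the text; the 'tissue' key and the function name are hypothetical placeholders for however tissue type is encoded in the parsed data.

```python
import heapq
from collections import defaultdict

def top_sites_by_tissue(rows, n=10000):
    """Select the n most frequently observed sites for each tissue type.

    rows: iterable of dicts with keys 'tissue' (hypothetical), 'summit',
          and 'numsamples'.
    Returns dict: tissue -> list of summit positions, most observed first.
    """
    by_tissue = defaultdict(list)
    for row in rows:
        by_tissue[row["tissue"]].append(row)
    selected = {}
    for tissue, sites in by_tissue.items():
        # nlargest avoids fully sorting very long site lists
        best = heapq.nlargest(n, sites, key=lambda r: r["numsamples"])
        selected[tissue] = [r["summit"] for r in best]
    return selected
```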
  • a differential expression experiment was run using the 'DESeq' and 'results' functions followed by log fold change shrinkage using the 'lfcShrink' function. Sites with a q-value < 5 x 10^-4 were selected. Additionally, selected sites were further filtered based on the log2 fold change between ER+ and ER- tumors. Sites with a log2 fold change > 0.5 were classified as ER+ specific, while sites with a log2 fold change < -0.5 were classified as ER- specific. These site lists were further split into sites shared with hematopoietic cells and those not shared with hematopoietic cells. Hematopoietic sites were obtained from a database of single cell ATAC-seq data (Satpathy, A.
  • nucleosome profiling with and without GC correction was performed on the top 10,000 sites for each of 377 TFs.
  • the MAD of the central coverage values was calculated both before and after GC correction.
  • the MAD values before and after GC correction were compared using a Wilcoxon signed-rank test (two-sided).
  • the realignment procedure was the same as above but using the hg19 genome (downloaded from hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz) and hg19 known polymorphic sites for base recalibration (downloaded from gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg37/Mills_and_1000G_gold_standard.indels.
  • First, we applied nucleosome profiling using 100-200 bp fragments to the 377 TFs from GTRD and extracted 3 features per profile for a total of 1131 features. We then used PCA to identify the components that explained 80% of the variance as described above. Second, we applied nucleosome profiling using 100-200 bp fragments to the 4 ER differential ATAC-seq lists and extracted 3 features per profile for a total of 12 features. Lastly, we applied nucleosome profiling using 35-150 bp fragments to the 4 ER differential ATAC-seq lists and extracted 3 features per list for a total of 12 features.
  • Sequencing data used in this study was obtained from dbGaP (accession phs001417.v1.p1) and EGA (dataset ID EGAD00001005339).
  • Griffin software and the subtype classifier tool can be obtained from github.com/adoebley/Griffin. Code for analysis and machine learning models can be accessed at github.com/adoebley/Griffin_analyses.
  • Example 1 above is a proof-of-concept demonstration that sequence analysis applying an embodiment of the Griffin workflow can enhance sequence signals with sufficient power and specificity to allow determination of breast cancer subtypes from low pass sequencing data.
  • This Example expands the application of the Griffin workflow to other cancer types and makes use of data from an alternative sequence profiling platform. Specifically, histone modification profiling was performed using CUT&RUN on different subtypes of prostate cancer cells. As with Example 1, the Griffin workflow provided robust signals to clearly differentiate different subtypes of prostate cancer, demonstrating the power and flexibility of the analytic workflow.
  • Background
  • Metastatic castration-resistant prostate cancer (mCRPC) describes the stage in which the disease has developed resistance to androgen ablation therapies and is lethal. Androgen receptor signaling inhibitors (ARSI), designed for the treatment of CRPC, repress androgen receptor (AR) activity and improve survival, but these therapies eventually fail. Since the adoption of ARSI as standard-of-care for mCRPC, there has been a prominent increase in the frequency of treatment-resistant tumors with neuroendocrine (NE) differentiation and features of small cell carcinomas. These aggressive tumors may develop through a resistance mechanism of trans-differentiation from AR-positive adenocarcinoma (ARPC) to NE prostate cancer (NEPC) that lacks AR activity.
  • ARPC AR-positive adenocarcinoma
  • NEPC neuroendocrine prostate cancer
  • Additional phenotypes can also arise based on expression of AR activity and NE genes, including AR-low prostate cancer (ARLPC) and double-negative prostate cancer (DNPC; AR-null/NE-null).
  • ARLPC AR-low prostate cancer
  • DNPC double-negative prostate cancer
  • Distinguishing prostate cancer subtypes has clinical relevance in view of differential responses to therapeutics, but the need for a biopsy to diagnose tumor histology can be challenging: invasive procedures are expensive and accompanied by morbidity, a subset of tumors are not accessible to biopsy, and bone sites pose particular challenges with respect to sample quality.
  • Circulating tumor DNA (ctDNA) released from tumor cells into the blood as cell-free DNA (cfDNA) is a non-invasive "liquid biopsy" solution for accessing tumor molecular information.
  • the analysis of ctDNA to detect mutation and copy-number alterations has served to classify genomic subtypes of CRPC tumors.
  • the defining losses of TP53 and RB1 in NEPC do not always lead to NE trans-differentiation. Rather, ARPC and NEPC tumors are associated with distinct reprogramming of transcriptional regulation.
  • Methylation analysis of cfDNA in mCRPC to profile the epigenome shows promise for distinguishing phenotypes, but requires specialized assays such as bisulfite treatment, enzymatic treatment, or immunoprecipitation.
  • cfDNA represents DNA protected by nucleosomes when released from dying cells into circulation, leading to DNA fragmentation that is reflective of the non-random enzymatic cleavage by nucleases.
  • Emerging approaches to analyze cfDNA fragmentation patterns from plasma for studying cancer can be performed directly from standard whole genome sequencing (WGS).
  • cfDNA fragments have the characteristic size of 167 bp, consistent with protection by a single core nucleosome octamer and histone linkers, but the size distribution may vary between healthy individuals and cancer patients.
  • TSS transcription start site
  • TFBS transcription factor binding site
  • nucleosome positioning and spacing are dynamic in active and repressed gene regulation. A detailed understanding of the nucleosome organization and positioning patterns associated with transcriptional regulation has not been fully explored in cfDNA.
  • ctDNA analysis A major challenge for ctDNA analysis is the low tumor content (tumor fraction) in patient plasma samples.
  • plasma from patient-derived xenograft (PDX) models may contain nearly pure human ctDNA after bioinformatic exclusion of mouse DNA reads. This provides a resource that is ideal for studying the properties of ctDNA, developing new analytical tools, and validating both genetic and phenotypic features by comparison to matching tumors.
  • Deep WGS of ctDNA from mouse plasma across 24 CRPC PDX lines with diverse phenotypes was performed.
  • the models consisted of 18 classified as ARPC, two classified as AR-low and NE-negative prostate cancer (ARLPC), and six classified as NEPC (FIG. 11A).
  • CUT&RUN Cleavage Under Targets and Release using Nuclease
  • PTMs histone post-translational modifications
  • nucleosome organization inferred from ctDNA reflects the transcriptional activity state regulated by histone PTMs (Zhou et al. (2011). Charting histone modifications and the functional organization of mammalian genomes. Nat Rev Genet 12, 7-18).
  • ctDNA coverage at TFBSs was aggregated into composite profiles representing the inferred activity (Example 1 and Doebley et al., 2021; Ulz et al. (2019). Inference of transcription factor binding from cell-free DNA enables tumor subtype prediction and early detection. Nature Communications 10, 4666). Similarly, features in the composite profiles of subtype-specific open chromatin regions were extracted for analyzing the signatures of chromatin accessibility in ctDNA. Altogether, a multi-omic sequencing dataset was assembled from matching tumor and plasma for a total of 24 PDX lines, making this a unique molecular resource and platform for developing transcriptional regulation signatures of tumor phenotype prediction from ctDNA.
  • Characterizing transcriptional activity of AR and ASCL1 in PDX phenotypes through analysis of tumor histone modifications and ctDNA
  • RNA Splicing Factors SRRM3 and SRRM4 Distinguish Molecular Phenotypes of Castration-Resistant Neuroendocrine Prostate Cancer. Cancer Research 81, 4736-4750). The transcriptional activity was further characterized in different tumor phenotypes by studying epigenetic regulation via histone PTMs.
  • Broad peak regions for H3K4me1 (median of 17,643 regions, range 1,894 - 64,934), H3K27ac (median 7,093, range 1,610 - 34,047), and H3K27me3 (median 8,737, range 2,024 - 42,495) were identified in the tumors of the 24 PDX lines and an additional nine LuCaP PDX lines where only tumor was available (total of 25 ARPC, 2 ARLPC, and 6 NEPC) (Methods).
  • H3K27ac putative active regulatory regions of enhancers and promoters
  • H3K4me1 putative active regulatory regions of enhancers and promoters
  • H3K27me3 gene repressive heterochromatic mark
  • AR and ASCL1 are two key differentially expressed TFs with known regulatory roles in ARPC and NEPC phenotypes, respectively (Brady et al. (2021). Temporal evolution of cellular heterogeneity during the progression to advanced AR-negative prostate cancer. Nat Commun 12, 3372; Cejas et al. (2021). Subtype heterogeneity and epigenetic convergence in neuroendocrine prostate cancer. Nat Commun 12, 5775; Rapa et al. (2008). Human ASH1 expression in prostate cancer with neuroendocrine differentiation. Mod Pathol 21, 700-707; Wang et al. (2020). Molecular tracing of prostate cancer lethality. Oncogene 39, 7225-7238).
  • the ctDNA composite coverage profiles were analyzed at TFBSs to evaluate the nucleosome accessibility, whereby lower normalized central (±30 bp window) mean coverage across these sites suggests more nucleosome depletion (Methods).
  • the composite coverage profile at ASCL1 TFBSs showed the strongest nucleosome depletion for NEPC samples (mean central coverage 0.69) compared to ARLPC (0.86) and ARPC (0.88) (FIG. 12C). These observations were consistent with the differential binding activity by AR and ASCL1 in their respective phenotypes from tumor tissue. Furthermore, the coverage patterns of nucleosome depletion in ctDNA resembled the nucleosome-depleted region (NDR) flanked by nucleosomes with H3K27ac and H3K4me1 peak profiles, which was exemplified when analyzing only nucleosome-sized fragments (140 bp - 200 bp) generated by CUT&RUN (FIG. 12A). Together, these results suggest that the nucleosome depletion in ctDNA at AR and ASCL1 binding sites represents active TF binding and regulatory activity in specific prostate PDX tumor phenotypes.
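  • A central coverage value of this kind might be computed as in the sketch below. It assumes the composite profile is a list centered on the site and that normalization means dividing by the mean of the whole profile; the text does not spell out the normalization here, so that choice and the function name are assumptions.

```python
def central_coverage(profile, half_width=30):
    """Mean normalized coverage within +/- half_width bp of the profile center.

    profile: composite coverage values, odd length, centered on the site.
    Normalization by the profile-wide mean is an assumption for illustration.
    """
    if len(profile) < 2 * half_width + 1:
        raise ValueError("profile is shorter than the central window")
    mean_all = sum(profile) / len(profile)
    if mean_all == 0:
        raise ValueError("profile has zero mean coverage")
    center = len(profile) // 2
    window = profile[center - half_width:center + half_width + 1]
    return (sum(window) / len(window)) / mean_all
```

  • Under this convention a value below 1 indicates a dip at the site relative to its surroundings, which is how the central coverage values quoted in the text (for example 0.69 versus 0.88) would be read.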
  • Nucleosome patterns at gene promoters inferred from ctDNA are consistent with transcriptional activity for phenotype-specific genes
  • RNA Splicing Factors SRRM3 and SRRM4 Distinguish Molecular Phenotypes of Castration-Resistant Neuroendocrine Prostate Cancer. Cancer Research 81, 4736-4750) were selected and confirmed by differential expression analysis from PDX tumor RNA-Seq data (FIG. 12D, Methods).
  • increased coverage was observed at the TSS of AR (1.08) in NEPC and ASCL1 (0.42) in ARPC, which supports the nucleosome depletion in the absence of PTMs and inactive transcription.
  • TFBSs in PDX ctDNA were identified based on the intersection of 338 TFs analyzed using Griffin and 404 differentially expressed TFs between ARPC and NEPC PDX tumors (Methods). Of these TFs, 38 had significantly different accessibility in ctDNA between ARPC and NEPC phenotypes (two-tailed Mann-Whitney U test, Benjamini-Hochberg adjusted p < 0.05). Through unsupervised hierarchical clustering of composite TFBS central coverage values for the 107 TFs, distinct groups of TFs were observed in PDX ctDNA (FIG. 13).
  • FOXA1, and GRHL2 were significantly more accessible in ARPC (and ARLPC) samples compared to NEPC (log2 fold-change < -0.57, adjusted p < 1.3 x 10^-3).
  • AR, HOXB13, and NKX3-1 had higher accessibility in ARPC compared to NEPC (log2 fold-change < -0.37, adjusted p < 1.3 x 10^-3), but with only moderate accessibility in ARLPC, as expected.
  • Other TFs including RUNX1, BCL11B, POU3F2, NEUROG2, and SOX2 also had higher activity in NEPC (log2 fold-change > 0.06, adjusted p < 0.048), although the difference was modest.
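  • The Benjamini-Hochberg adjustment named above can be written out in plain Python as a sketch. It takes the per-TF p-values as given (for example from a two-tailed Mann-Whitney U test computed elsewhere); the function name is hypothetical and this is the standard step-up procedure, not code taken from the disclosure.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

  • A TF would then be called differentially accessible when its adjusted value falls below the chosen threshold, such as 0.05.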
  • HEY1, IRF1, and IKZF1 had a similar trend consistent with increased accessibility in NEPC samples but were not significantly different from ARPC (adjusted p > 0.10).
  • Other notable factors such as MYC and ETS transcription family genes (ETV4, ETV5, ETS1, ETV1) had high accessibility across all phenotypes, while NEUROD1, RUNX3, and TP63 were inaccessible in nearly all samples.
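  • The Benjamini-Hochberg adjustment applied to the per-TF p-values above can be sketched as follows. This is a minimal, generic illustration and not the code used in the study; the function name `benjamini_hochberg` and the example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four TFs.
example_adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

  • A TF would then be called differentially accessible when its adjusted p-value falls below the chosen cutoff (0.05 in the text).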
  • ASCL1, NR3C1, HNF4G, HNF1A, and SOX2 Arora et al. (2013). Glucocorticoid Receptor Confers Resistance to Antiandrogens by Bypassing Androgen Receptor Blockade.
  • Phenotype-specific open chromatin regions in PDX tumor tissue are reflected in ctDNA profiles of nucleosome accessibility
  • Nucleosome profiling from cfDNA sequencing analysis has shown agreement with overall chromatin accessibility in tumor tissue (Snyder et al. (2016). Cell-free DNA Comprises an in Vivo Nucleosome Footprint that Informs Its Tissues-Of-Origin. Cell 164, 57-68; Sun et al. (2019). Orientation-aware plasma cell-free DNA fragmentation analysis in open chromatin regions informs tissue of origin. Genome Research 29, 418-427; Ulz et al. (2019). Inference of transcription factor binding from cell-free DNA enables tumor subtype prediction and early detection.
  • cfDNA released by hematopoietic cells which leads to a lower ctDNA fraction (i.e., tumor fraction).
  • a probabilistic model was developed to estimate the proportion of ARPC and NEPC from an individual plasma sample, accounting for the tumor fraction (Methods).
  • the focus was placed on the phenotype-specific open chromatin composite site features, and the PDX plasma ctDNA signals (FIGS. 14B and 14C) were used to inform the model.
  • the model produces a normalized prediction score that represents the estimated signature of ARPC (lower values) and NEPC (higher values).
  • the study presented here is believed to be the largest sequencing study to date of human ctDNA from mouse plasma of PDX models.
  • the sequencing of mouse plasma provided a unique opportunity to comprehensively interrogate the epigenetic nucleosome patterns in ctDNA from well-characterized tumor models.
  • Computational methodologies were developed and applied to construct a multitude of ctDNA features, each of which was associated with transcriptional regulation in the LuCaP PDX models across CRPC tumor phenotypes.
  • a probabilistic model was developed to accurately classify ARPC and NEPC phenotypes from patient plasma in three clinical cohorts.
  • PDX mouse plasma overcomes the challenge of low ctDNA content or incomplete knowledge of the tumor when studying patient samples and can expedite development of cfDNA diagnostics, basic cancer research, and clinical translation.
  • LuCaP ctDNA sequencing data complements the maturing characterization of CRPC tumor phenotypes from tissue.
  • the ctDNA data and the disclosed approaches expand on the potential utility of PDX models for translational research. While these data were focused on ARPC and NEPC phenotypes, this study can serve as a framework for the use of PDX plasma from additional CRPC phenotypes and other cancer models.
  • LuCaP PDX ctDNA sequencing data confirmed the activity of key regulators between ARPC and NEPC phenotypes, including a set of 47 established differentially expressed gene markers. While gene expression inference from ctDNA has been shown in proof-of-concept studies (Ulz et al. (2016b). Inferring expressed genes by whole-genome sequencing of plasma DNA. Nature Genetics 48, 1273-1278; Zhu et al. (2021). Tissue-specific cell-free DNA degradation quantifies circulating tumor DNA burden. Nature Communications 12, 2229), the PDX ctDNA allowed for a detailed dissection of nucleosome organization associated with transcriptional activity of individual genes that define the tumor phenotypes.
  • ASCL1 Glucocorticoid Receptor Confers Resistance to Antiandrogens by Bypassing Androgen Receptor Blockade. Cell 155, 1309-1322; Chaytor et al., 2019; Shukla et al., 2017).
  • ASCL1 is a pioneer TF with roles in neuronal differentiation and was recently described to be active during NE transdifferentiation and in NEPC (Cejas et al., 2021; Rapa et al., 2008). To our knowledge, this study is the first to demonstrate ASCL1 binding site accessibility and provide a detailed characterization of its transcriptional activity in NEPC from plasma ctDNA.
  • This model does not require training on patient samples but does require tumor fraction estimates from ichorCNA (Adalsteinsson et al. (2017). Scalable whole-exome sequencing of cell-free DNA reveals high concordance with metastatic tumors. Nature Communications 8) and a prediction score cutoff determined from DFCI cohort I.
  • the framework presented here can be extended to model multiple phenotype classes, provided the informative parameters for these additional states can be learned. Insights from additional datasets such as single-cell nucleosome and accessibility profiling (Fang et al. (2021). Comprehensive analysis of single cell ATAC-seq data with SnapATAC. Nat Commun 12, 1337; Wu et al. (2021). Single-cell CUT&Tag analysis of chromatin modifications in differentiation and tumor progression. Nat Biotechnol 39, 819-824) of PDX tumors and clinical samples may improve the resolution for ctDNA analysis.
  • Tumor heterogeneity and coexistence of different molecular phenotypes are common in mCRPC, where treatment-induced phenotypic plasticity may vary within and between tumors in an individual patient. Larger studies with comprehensive assessment of the tumor histologies will be needed for developing future extensions of the model to predict mixed phenotypes from ctDNA.
  • LuCaP patient-derived xenograft tumors (established at the University of Washington) were initiated from tumor specimens resected from men with advanced prostate cancer. The establishment and characterization of the PDX models were described previously (Lam et al. (2016). Generation of Prostate Cancer Patient-Derived Xenografts to Investigate Mechanisms of Novel Treatments and Treatment Resistance. In Prostate Cancer: Methods and Protocols, Z. Culig, ed. (New York, NY: Springer), pp. 1- 27). PDXs were propagated in vivo in male NOD scid IL2R-gamma-null (NSG) mice from Jackson Labs (cat#005557).
  • mice were caged in a pathogen-free facility, maintained on a 12-hour light/dark cycle, and given unlimited access to food and water. Surgeries were performed under isoflurane anesthesia, and mice were given supplemental buprenorphine sustained release (SR). PDX lines were evaluated using histopathology by at least two expert pathologists, and histological phenotypic subtype annotations were orthogonally validated based on transcriptome-derived signature marker expression scores to define phenotypes (Beltran et al. (2016).
  • UW cohort Blood samples were collected from men with metastatic castration resistant prostate cancer at the University of Washington (collected under University of Washington Human Subjects Division IRB protocol number CC6932 between years 2014-2021). In this study, 61 plasma samples from 30 patients were analyzed. After initial ultra-low pass whole genome sequencing (ULP-WGS) analysis, 47 plasma samples from 30 patients were retained for further high depth of coverage whole genome sequencing (WGS) analysis. All samples were de-identified prior to ctDNA analysis and a double blinded approach was employed for evaluating clinical phenotype predictions. The initial patient selection was done based on clinical disease burden information and the availability of clinically derived phenotypic subtype annotation. Clinical information on these patients is protected due to IRB protocol restrictions.
  • DFCI cohort I Plasma was collected from men diagnosed with mCRPC and treated at the Dana-Farber Cancer Institute (DFCI), Brigham and Women's Hospital, or Weill Cornell Medicine (WCM) between April 2003 and August 2021. All patients provided written informed consent for research participation and genomic analysis of their biospecimen and blood. The use of samples was approved by the DFCI (#01-045 and 09-171) and WCM (1305013903) IRBs. ULP-WGS data at mean coverage 0.5x (range 0.3x - 0.9x) for 101 patients were published previously (Berchuck et al. (2022). Detecting Neuroendocrine Prostate Cancer Through Tissue-Informed Cell-Free DNA Methylation Analysis. Clinical Cancer Research 28, 928-938).
  • DFCI cohort II Plasma samples in this cohort were collected from men diagnosed with mCRPC and treated at the Dana-Farber Cancer Institute (DFCI). All patients provided written informed consent for blood collection and the analysis of their clinical and genetic data for research purposes (DFCI Protocol # 01-045 and 11-104). WGS data at mean coverage 27x (range 11x - 44x) (Viswanathan et al. (2018). Structural Alterations Driving Castration-Resistant Prostate Cancer Revealed by Linked-Read Genome Sequencing. Cell 174, 433-447.e19), and ULP-WGS data at mean coverage 0.13x (range 0.07x - 0.18x) (Adalsteinsson et al. (2017).
  • Healthy donor plasma cfDNA WGS data used in this study were obtained from previously published studies. Two samples (HD45 and HD46) with coverage of 13x and 15x, respectively, were accessed from dbGAP under accession phs001417 (Adalsteinsson et al. (2017). Nature Communications 8; Viswanathan et al. (2018). Cell 174, 433-447.e19). These donors were consented under DFCI protocol IRB (# 03-022).
  • Blood samples were collected from NSG mice bearing subcutaneous PDX tumors at the time of sacrifice.
  • the PDX lines were maintained at vivaria in the University of Washington and FHCRC.
  • the blood was processed following methods described for human plasma DNA processing for subsequent DNA isolation.
  • Blood was collected in purple cap EDTA tubes and processed within 4 hours. All blood samples were double spun using centrifugation at 2500g for 10 minutes followed by a 16000g spin of the plasma fraction for 10 minutes at room temperature.
  • 7-10 mouse plasma samples were pooled. Processed plasma samples were preserved in clean, screw-capped cryo-microfuge tubes and stored at -80°C prior to cfDNA isolation.
  • the QIAamp Circulating Nucleic Acid Kit was used to isolate cfDNA from PDX mouse-derived plasma using the recommended protocol.
  • the pooled plasma samples from 7-10 mice for each PDX line contained ~2-3 mL total plasma volume for each line.
  • the filter retention-based cfDNA kit method does not implement any fragment size class enrichment.
  • Carrier RNA spike-in was excluded from the elution buffer.
  • Isolated cfDNA was quantified using the Qubit dsDNA HS assay (Invitrogen) and the cfDNA fragment size profiles were analyzed using Tapestation HS D5000 and HS D1000 assays (Agilent).
  • NGS libraries were prepared with 50 ng input cfDNA.
  • Illumina NGS sequencing libraries were prepared with the KAPA hyperprep kit, adopting nine cycles of amplification, and purified using lab standardized SPRI beads. KAPA UDI dual indexed library adapters were used. Library concentrations were balanced and pooled for multiplexing and sequenced using the Illumina HiSeq 2500 at the Fred Hutch Genomics Shared Resources (200 cycles) and Illumina NovaSeq platform at the Broad Institute Genomics Platform Walkup-Seq Services using S4 flow cells (300 cycles). To match with Illumina HiSeq 2500 data, truncated 200 cycles FASTQ files were generated (100 bp paired end reads).
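  • The read truncation described above (cutting longer reads to 100 bp so they match the HiSeq 2500 data) can be illustrated with a short sketch. It assumes plain 4-line FASTQ records; the function name `truncate_fastq_records` is hypothetical and this is not the tool used in the study.

```python
def truncate_fastq_records(lines, max_len=100):
    """Yield FASTQ lines with sequence and quality strings cut to max_len."""
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        # In a 4-line FASTQ record, the second line is the sequence
        # and the fourth line is the per-base quality string.
        if i % 4 in (1, 3):
            line = line[:max_len]
        yield line


# Hypothetical 150 bp read truncated to 100 bp.
record = ["@read1", "A" * 150, "+", "I" * 150]
truncated = list(truncate_fastq_records(record))
```

  • Both mates of a pair would be processed the same way so that the resulting data are 100 bp paired-end reads.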
  • ARPC and ARLPC vs. ARPC The results were then filtered using a list of 1,635 human transcription factors published previously (Lambert et al. (2018). The Human Transcription Factors. Cell 172, 650-665), which resulted in 514 genes with FDR < 0.05 and fold change > 3. Out of these 514, deregulation of gene expression for 404 transcription factor genes delineated ARPC from NEPC.
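  • The filtering step above can be sketched as a simple selection over differential expression results. This is an illustration only: the function name `select_de_tfs`, the input layout, and the choice to apply the fold-change cutoff in both directions (absolute log2 fold change above log2 of 3) are assumptions, and the example values are hypothetical.

```python
import math


def select_de_tfs(de_results, tf_names, fdr_cutoff=0.05, fold_cutoff=3.0):
    """Return TF genes passing the FDR and fold-change cutoffs.

    de_results maps gene -> (log2_fold_change, fdr).
    tf_names is the set of known transcription factor gene symbols.
    """
    min_abs_log2_fc = math.log2(fold_cutoff)
    keep = []
    for gene, (log2_fc, fdr) in de_results.items():
        if gene not in tf_names:
            continue  # not in the curated TF list
        # Assumption: the cutoff applies to up- and down-regulation alike.
        if fdr < fdr_cutoff and abs(log2_fc) > min_abs_log2_fc:
            keep.append(gene)
    return sorted(keep)


# Hypothetical differential expression results.
example_hits = select_de_tfs(
    {"ASCL1": (4.0, 0.001), "AR": (-3.5, 0.01), "GAPDH": (5.0, 0.001)},
    {"ASCL1", "AR"},
)
```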
  • CUT&RUN is an antibody targeted enzyme tethering chromatin profiling assay in which controlled cleavage by micrococcal nuclease releases specific protein-DNA complexes into the supernatant for paired-end DNA sequencing analysis.
  • CUT&RUN assays were performed for three histone modifications, H3K27ac, H3K4me1, and H3K27me3, according to published protocols (Skene and Henikoff (2017). An efficient targeted nuclease strategy for high-resolution mapping of DNA binding sites. ELife 6, e21856).
  • CUT&RUN was performed on LuCaP PDX tumors using ~75 mg flash-frozen tissue pieces.
  • frozen tissues were thoroughly chopped into small pieces and converted into smaller clusters of cells using collagenase and dispase.
  • Cell clusters were permeabilized using digitonin and nutated with target antibody in EDTA antibody buffer.
  • Time-sensitive micrococcal nuclease enzyme treatments were performed on ice. Released DNA was precipitated along with glycogen carrier, and subsequent NGS libraries were prepared using a picogram-input DNA library preparation protocol.
  • Paired-end (50 bp) sequencing was performed and reads were aligned using bowtie2 version 2.4.2 (Langmead et al. (2019). Scaling read aligners to hundreds of threads on general-purpose processors. Bioinformatics 35, 421-432) to the hg38 human reference assembly. Aligned reads were processed as described in the SEACR protocol (github.com/FredHutch/SEACR#preparing-input-bedgraph-files). Peaks were called using SEACR version 1.3 (Meers et al. (2019). Peak calling by Sparse Enrichment Analysis for CUT&RUN chromatin profiling.
  • Genome-wide peak heatmaps, targeted heatmaps, and respective profiles were plotted using deepTools. bigWig formatted files for each phenotype were obtained using the mean function in wiggletools 1.2.8 and deepTools computeMatrix. Phenotype-specific informative region coordinates were obtained from diffBind v3.5.0, and the top 10,000 most significant regions (all with FDR < 0.05) differentially open between ARPC and NEPC lines were used for downstream feature analyses (see Gene body and promoter region selection for additional subsetting criteria applied on a feature by feature basis). For heatmaps and profiles the plotHeatmap function was used.
  • Differential PTM analysis was performed with the Diffbind version 2.16.0 package (Ross-Innes et al. (2012). Differential oestrogen receptor binding is associated with clinical outcome in breast cancer. Nature 481, 389-393) in R-4.0.1 using standard parameters (bioconductor.riken.jp/packages/3.0/bioc/html/DiffBind.html).
  • ARPC, NEPC and ARLPC samples were grouped by histopathological and transcriptome signature defined phenotypes described in the "PDX mouse models" section. Samples were loaded with the dba function, reads counted with the dba.count function, and contrast specified as phenotype with dba.contrast and a minimum of 2 members.
  • Differential peak sites were computed with the dba.analyze function with default settings. Differential peak binding of NEPC and ARLPC was computed against ARPC samples. Unique binding sites in NEPC and ARLPC were catalogued using bedtools v2.29.2 (Quinlan and Hall (2010). BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841-842). Intergroup differentially bound peaks were annotated using ChIPseeker 1.28.3 (Yu et al. (2015). ChIPseeker: an R/Bioconductor package for ChIP peak annotation, comparison and visualization. Bioinformatics 31, 2382-2383) and TxDb.Hsapiens.UCSC.hg38.knownGene 3.2.2 in R 4.1.0.
  • ATAC-Seq sequence data for 15 tumor samples from 10 PDX lines were published previously and FASTQ files made available upon request (Cejas et al. (2021). Subtype heterogeneity and epigenetic convergence in neuroendocrine prostate cancer. Nat Commun 12, 5775). These lines included LuCaP PDX lines with ARPC histology (23.1, 77, 78, 81, 96) and NEPC histology (two replicates each of 49, 93, 145.1, 173.1 and one replicate of 145.2). Paired end reads were aligned using bowtie2 2.4.2 (Langmead et al. (2019). Scaling read aligners to hundreds of threads on general-purpose processors.
  • RNA-Seq derived phenotypes Phenotype specific binding sites were isolated by first selecting for positive fold change open chromatin enrichment and then using Intervene 0.6.5 (Khan and Mathelier (2017). Intervene: a tool for intersection and visualization of multiple gene or genomic region sets. BMC Bioinformatics 18, 287) where regions were considered overlapping if they shared at least 1 bp.
  • Regions with FDR adjusted p-values < 0.05 were then subset to those overlapping the 338,000 established TFBSs (338 TFs x 1,000 binding sites, see Griffin analysis for site selection) by at least 1 bp using BedTools Intersect. Only regions that overlapped an established TFBS were retained.
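  • The "at least 1 bp" overlap criterion can be made concrete with a small sketch. It assumes BED-style half-open intervals (start inclusive, end exclusive) and is only an illustration of the criterion, not a replacement for BedTools; the function names and coordinates are hypothetical.

```python
def overlap_bp(a_start, a_end, b_start, b_end):
    """Number of shared bases between two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def regions_overlapping_sites(regions, sites, min_overlap=1):
    """Keep regions sharing at least min_overlap bases with any site.

    regions and sites are lists of (chrom, start, end) tuples.
    """
    kept = []
    for chrom, start, end in regions:
        if any(
            site_chrom == chrom
            and overlap_bp(start, end, site_start, site_end) >= min_overlap
            for site_chrom, site_start, site_end in sites
        ):
            kept.append((chrom, start, end))
    return kept


# Hypothetical open-chromatin regions and TFBS intervals.
example_kept = regions_overlapping_sites(
    [("chr1", 100, 200), ("chr1", 300, 400)],
    [("chr1", 199, 250), ("chr1", 400, 450)],
)
```

  • Under the half-open convention, intervals that merely touch (one ends at 400, the other starts at 400) share no bases and are not retained. The nested scan is quadratic and is meant only for small examples.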
  • Griffin is a method for profiling nucleosome protection and accessibility on predefined genomic loci (see Example 1 and Doebley et al. (2021). Griffin: Framework for clinical cancer subtyping from nucleosome profiling of cell-free DNA. MedRxiv 2021.08.31.21262867). Griffin filters sites by mappability, estimates and corrects GC bias on a per fragment level, and generates GC-corrected coverage profiles around each site. First, griffin takes a site list and examines the mappability in a window (+/- 5000 bp around each site). Mappability (hg38 Umap multi-read mappability for 50bp reads) was obtained from UCSC genome browser (Karimzadeh et al. (2018).
  • GC biases were then smoothed by taking the median of values for fragments with similar lengths and GC contents (k nearest neighbors smoothing) to generate smoothed GC bias values.
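  • The smoothing step above can be sketched as a nearest-neighbor median over the (fragment length, GC content) grid. This is a minimal illustration, not Griffin's implementation: the function name `knn_median_smooth`, the unscaled Euclidean distance, and the value of k are assumptions.

```python
from statistics import median


def knn_median_smooth(bias, k=5):
    """Smooth GC bias values with a k-nearest-neighbor median.

    bias maps (fragment_length, gc_count) -> raw GC bias.
    Each point is replaced by the median over its k nearest points
    (itself included) in length/GC space.
    """
    points = list(bias)
    smoothed = {}
    for length, gc in points:
        # Assumption: plain squared Euclidean distance on both axes.
        nearest = sorted(
            points,
            key=lambda p: (p[0] - length) ** 2 + (p[1] - gc) ** 2,
        )[:k]
        smoothed[(length, gc)] = median(bias[p] for p in nearest)
    return smoothed


# Hypothetical raw biases with one outlier at GC count 42.
raw = {(100, 40): 1.0, (100, 41): 1.0, (100, 42): 5.0,
       (100, 43): 1.0, (100, 44): 1.0}
smooth = knn_median_smooth(raw, k=3)
```

  • Using a median rather than a mean keeps a single noisy length/GC combination from distorting its neighbors.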
  • nucleosome profiling was performed in each sample. For each mappable site of interest, fragments aligning to the region ±5000 bp from the site were fetched from the bam file. Fragments were filtered to remove duplicates and low-quality alignments (<20 mapping quality) and by fragment length. Nucleosome-size fragments (140-250 bp) were retained. Fragments were then GC corrected by assigning each fragment a weight of 1/GC_bias for that given fragment length and GC content, and the fragment midpoint was identified. The number of weighted fragment midpoints in 15 bp bins across the site was counted.
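  • The weighting and binning described above can be sketched in a few lines. This is an illustration under stated assumptions and not Griffin's code: fragments are given as already-filtered (start, end, gc_count) tuples with half-open coordinates, duplicate and mapping-quality filtering are omitted, and the name `midpoint_coverage` is hypothetical.

```python
def midpoint_coverage(fragments, site, gc_bias, window=5000, bin_size=15,
                      min_len=140, max_len=250):
    """Count GC-weighted fragment midpoints in fixed bins around a site.

    fragments: iterable of (start, end, gc_count).
    gc_bias: dict mapping (fragment_length, gc_count) -> bias value.
    """
    # Whole bins that fit in the window; a partial bin at the far
    # edge of the window is dropped.
    n_bins = (2 * window) // bin_size
    bins = [0.0] * n_bins
    for start, end, gc in fragments:
        length = end - start
        if not (min_len <= length <= max_len):
            continue  # keep nucleosome-size fragments only
        midpoint = (start + end) // 2
        offset = midpoint - (site - window)
        if 0 <= offset < n_bins * bin_size:
            # Each fragment contributes 1/GC_bias instead of 1.
            bins[offset // bin_size] += 1.0 / gc_bias[(length, gc)]
    return bins


# Hypothetical fragment near a site at position 10,000.
profile = midpoint_coverage([(9900, 10066, 80)], 10000, {(166, 80): 2.0})
```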
  • TFBS Transcription factor binding site
  • TFs 1,314 transcription factors
  • CIS-BP CIS-BP database
  • TFs from GTRD that were also in CIS-BP and had a known binding motif were retained.
  • Selected TF binding genomic loci were then filtered for mappability as described above (Griffin analysis) and TFs with fewer than 10,000 highly mappable sites on autosomes were excluded, resulting in 338 TFs.
  • In downstream analysis of LuCaP PDX cfDNA, if any lines did not meet specific criteria in a region (including differentially open histone modification regions), that feature/region combination was excluded from analysis, leading to a variable, lower number of regions considered depending on the feature. These criteria included requiring at least 10 total fragments in a region for all Fragment size analysis (see below) and a nonzero number of "short" and "long" fragments for the short-long ratio; short-long ratios less than 0.01 or greater than 10.0 were also excluded as outliers. Any region with no coverage in a line was excluded from all analyses. This resulted in gene lists that differed in numbers between genomic contexts and feature types.
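  • The exclusion criteria for the short-long ratio can be written out as a small helper. This sketch takes the short, long, and total fragment counts as inputs, so it makes no assumption about the length ranges that define "short" and "long"; the function name `short_long_ratio` is hypothetical.

```python
def short_long_ratio(n_short, n_long, n_total,
                     min_total=10, low=0.01, high=10.0):
    """Return the short-long ratio, or None if the region is excluded."""
    # Require enough fragments overall and a nonzero count of each class.
    if n_total < min_total or n_short == 0 or n_long == 0:
        return None
    ratio = n_short / n_long
    # Ratios outside [low, high] are treated as outliers.
    if ratio < low or ratio > high:
        return None
    return ratio


# Hypothetical counts for one region in one PDX line.
example_ratio = short_long_ratio(n_short=5, n_long=10, n_total=15)
```

  • Following the text, a region would be dropped for a feature if this returns None for any line.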
  • Fragments were first filtered to remove duplicates and low-quality alignments
  • fragment short- long ratio (FSLR) was computed as the ratio of short
  • Admixtures for evaluating benchmarking performance were constructed using 5 ARPC (LuCaP 35, 35CR, 58, 92, 136CR) and 5 NEPC (LuCaP 49, 93, 145.2, 173.1, 208.4) lines mixed to 1%, 5%, 10%, 20%, and 30% tumor fraction with a single healthy donor plasma line (NPH004, EGAD00001005343) at ~25X mean coverage, assuming 100% tumor fraction in post-mouse-subtracted PDX sequencing data. After extracting chromosomal DNA with SAMtools (Danecek et al. (2021). Twelve years of SAMtools and BCFtools.
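  • One way to plan such an admixture is to compute what fraction of reads to keep from each source so that the mixture reaches the target tumor fraction and coverage. The sketch below is a hypothetical calculation, not the procedure from the study; it assumes the PDX data are 100% tumor (as stated above) and that reads are subsampled uniformly. The function name and the example coverages are illustrative.

```python
def admixture_sampling_fractions(tumor_fraction, tumor_cov, healthy_cov,
                                 target_cov=25.0):
    """Return (tumor_keep, healthy_keep) read-subsampling fractions.

    tumor_cov and healthy_cov are mean coverages of the source data.
    """
    tumor_keep = tumor_fraction * target_cov / tumor_cov
    healthy_keep = (1.0 - tumor_fraction) * target_cov / healthy_cov
    if tumor_keep > 1.0 or healthy_keep > 1.0:
        raise ValueError("source coverage too low for the requested mixture")
    return tumor_keep, healthy_keep


# Hypothetical: 10% tumor fraction from a 20X PDX and a 30X healthy donor.
tumor_keep, healthy_keep = admixture_sampling_fractions(0.10, 20.0, 30.0)
```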
  • ichorCNA (Adalsteinsson et al. (2017). Scalable whole-exome sequencing of cell-free DNA reveals high concordance with metastatic tumors. Nature Communications 8) with binSize 1,000,000 bp and hg19 reference genome. Default tumor fraction estimates reported by ichorCNA were used. See github.com/GavinHaLab/CRPCSubtypingPaper/tree/main/ichorCNA_configuration for complete configuration settings.
  • a probabilistic model was developed to classify the mCRPC phenotype (ARPC or NEPC) in an individual patient plasma ctDNA sample.
  • This is a generative mixture model that is unsupervised — it does not train on the patient cohort of interest.
  • the model accepts the pre-estimated tumor fraction from ichorCNA for the given patient ctDNA sample, as well as the pre-computed ctDNA features values from the LuCaP PDX ctDNA and healthy donor ctDNA as prior information. For each patient ctDNA sample, it fits the heterogeneous tumor fractions against the pure PDX LuCaP models.
  • Q has range [0,1], where higher values indicate an increased proportion of the sample having a NEPC phenotype and was used as the NEPC prediction score metric.
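  • The idea of fitting a sample against pure reference profiles while accounting for tumor fraction can be illustrated with a deliberately simplified sketch. This is not the disclosed generative mixture model (see the repository cited in this section for that); it assumes each feature mixes linearly and estimates a single NEPC proportion by least squares over a grid. All function names and numeric values are hypothetical, except the 0.3314 cutoff, which is the threshold reported for DFCI cohort I; treating scores strictly above it as NEPC is an assumption.

```python
def expected_feature(theta, tumor_fraction, arpc, nepc, healthy):
    """Expected feature value for NEPC proportion theta in the tumor part."""
    tumor = theta * nepc + (1.0 - theta) * arpc
    return tumor_fraction * tumor + (1.0 - tumor_fraction) * healthy


def estimate_nepc_score(observed, tumor_fraction, arpc_means, nepc_means,
                        healthy_means, steps=1000):
    """Grid-search the theta in [0, 1] that minimizes squared error."""
    best_theta, best_err = 0.0, float("inf")
    for i in range(steps + 1):
        theta = i / steps
        err = sum(
            (obs - expected_feature(theta, tumor_fraction, a, n, h)) ** 2
            for obs, a, n, h in zip(observed, arpc_means, nepc_means,
                                    healthy_means)
        )
        if err < best_err:
            best_theta, best_err = theta, err
    return best_theta


def classify_phenotype(score, cutoff=0.3314):
    """Assumption: scores above the cutoff are called NEPC."""
    return "NEPC" if score > cutoff else "ARPC"
```

  • In this simplified form, a low tumor fraction shrinks the difference between the ARPC and NEPC predictions, which is why the tumor fraction estimate has to be supplied alongside the features.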
  • Code and implementation of the method can be found at github.com/GavinHaLab/CRPCSubtypingPaper/tree/main/GenerativeMixtureModel.
  • Analysis and classification of clinical patient samples
  • the model was then validated on two cohorts, beginning with the already published DFCI cohort II (Adalsteinsson et al. (2017). Nature Communications 8; Choudhury et al. (2018). Tumor fraction in cell-free DNA as a biomarker in prostate cancer. JCI Insight 3; Viswanathan et al. (2018). Structural Alterations Driving Castration-Resistant Prostate Cancer Revealed by Linked-Read Genome Sequencing. Cell 174, 433-447.e19). The analysis was restricted to eleven samples from six patients with matched ULP-WGS and WGS data with paired-end reads. Tumor fraction estimates from ichorCNA were obtained from the original study (Adalsteinsson et al. (2017). Nature Communications 8). All samples were considered adenocarcinoma (ARPC) based on clinical histories (see Human subjects). The scoring threshold of 0.3314, determined from DFCI cohort I, was used for phenotype classification.
  • Example 1 applied an embodiment of the Griffin workflow to enhance sequence signals to allow accurate determination of breast cancer subtypes from low pass sequencing data.
  • Example 2 applied an embodiment of the Griffin workflow approach to differentiate subtypes of other cancers, namely prostate cancer, successfully leveraging data from an alternative sequence profiling platform (e.g., from the CUT&RUN platform for nucleosome accessibility), demonstrating the power and flexibility of the Griffin analytic workflow for different cancers and input data.
  • This Example describes the development of targeted sequencing panels to use in conjunction with the Griffin workflow to understand transcriptional features of small cell lung cancer, non-small cell lung cancer, and other cancer types from blood ctDNA.
  • This Example describes an innovative analytical assay based on analysis of cell free DNA, demonstrating clear translational potential for clinical lung cancer diagnostics.
  • Cell-free DNA circulating in the blood of cancer patients has been widely used to assess gene mutations, and through analyses of whole genome DNA has more recently been used to infer activation of certain transcription factors.
  • Cancer cells give rise to cell-free DNA via cell death and that cell-free DNA is overwhelmingly nucleosomal, i.e. bound to a histone octamer, which protects the DNA from degradation.
  • Histone positioning in the genome is influenced by components of chromatin, including transcription factors and the RNA polymerase complex.
  • TFBSs transcription factor binding sites
  • TSSs transcription start sites
  • One innovation is the identification of highly informative TFBSs and TSSs that can be used to differentiate between NSCLC and SCLC or between subtypes of NSCLC or SCLC, which then facilitate the use of hybridization capture-based DNA sequencing of ctDNA to generate high resolution maps of nucleosome occupancy at TFBSs of key TFs in SCLC (ASCL1, NEUROD1, POU2F3, REST) and TSSs for genes that are markers of key transcriptional features of lung cancer cells.
  • these informative sites can also be examined in low-coverage whole genome sequencing to extract similar transcriptional features.
  • Targeted capture panels are routinely applied in the clinic to call mutations from ctDNA in the blood, and application of targeted sequencing to assess transcriptional activity in cancer cells is very feasible as a clinical test.
  • the technology is especially relevant and viable in SCLC, which kills ~30,000 people in the US each year.
  • Tissue sampling in SCLC is typically only performed once during a patient's disease course and is often done by transbronchial fine needle aspiration, which yields a very small amount of tissue. Surgery is very rarely performed.
  • SCLC has a high level of ctDNA compared to most other cancer types, reflecting its highly metastatic nature, making this assay both practical for application to SCLC patients and potentially especially valuable.
  • SCLC subtypes exist based not on mutations but on activation of key transcription factors and their downstream programs (such as ASCL1, NEUROD1, and POU2F3).
  • the disclosed targeted assay is designed to differentiate transcriptional subtypes of SCLC from ctDNA providing powerful clinical applications for use of this assay.
  • the panel is designed to call gene mutations in exons from a panel of ~600 genes.
  • the assay has broad clinical utility for correlative analyses of both mutations and transcriptional activity in clinical samples.
  • transdifferentiation of driver mutation positive lung cancer to SCLC is treated differently from disease that is progressing but has not acquired a notable histologic change.
  • transdifferentiation is likely significantly underdiagnosed because currently it can only be assessed via biopsy of a progressing lesion, which is often infeasible or undesirable.
  • This assay can also be applied to lung adenocarcinoma patients who develop resistance to EGFR inhibitors to determine whether this resistance is associated with activation of SCLC transcriptional profiles.
  • the major non-invasive applications include the following:
  • FIG. 16 shows the generation of a capture panel.
  • the approach included rationally designing a targeted sequencing panel for integrated detection of SCLC genetic mutations, transcription factor (TF) subtype identity, and expression of key gene programs.
  • Public mutation databases and functional mutation data were interrogated for coding mutations in approximately 600 genes related to SCLC.
  • TF subtype identity TFBSs for four key SCLC-related TFs (ASCL1, NEUROD1, POU2F3, and REST) were targeted.
  • TSSs corresponding to the vast majority of protein-coding genes in the genome were targeted. To select specific sites, multiple sources of data were integrated as follows.
  • ChIP-seq data was used to identify TFBSs, resulting in 4-30k sites per factor. These candidate sites were then annotated with the distance to the nearest gene TSS. Retained sites were sites for which the nearest gene TSS was a gene known to be upregulated in SCLC cells that expressed the factor of interest, as determined by available RNAseq data. This resulted in ~400-700 SCLC-focused sites per factor. In the final probe set, a 1 kb window symmetrically encompassing these ~2k sites (500 bp on each side) was targeted.
  • TSS profiling beginning with an established transcript annotation, non-coding transcripts, Y chromosome genes, and TSSs corresponding to multi-exon genes that had lower confidence annotations were removed, resulting in approximately 36k theoretically targeted TSSs. In the probe set, regions 260 bp downstream of the TSS and 100 bp upstream were targeted.
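  • The two target window definitions above can be sketched as follows. The TFBS window is symmetric (500 bp on each side); the TSS window is asymmetric and therefore depends on strand, since "upstream" and "downstream" are defined relative to the direction of transcription. The function names and the half-open coordinate convention are assumptions made for illustration.

```python
def tfbs_window(site, flank=500):
    """Symmetric 1 kb target window centered on a TFBS."""
    return site - flank, site + flank


def tss_window(tss, strand, upstream=100, downstream=260):
    """Strand-aware target window around a TSS.

    For '+' strand genes downstream means increasing coordinates;
    for '-' strand genes it means decreasing coordinates.
    """
    if strand == "+":
        return tss - upstream, tss + downstream
    if strand == "-":
        return tss - downstream, tss + upstream
    raise ValueError("strand must be '+' or '-'")


# Hypothetical coordinates.
example_tfbs = tfbs_window(10000)
example_tss_plus = tss_window(10000, "+")
example_tss_minus = tss_window(10000, "-")
```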
  • Use of application-specific orthogonal chromatin profiling data to select sites is a key feature of the approach. However, it will be noted that other types of chromatin profiling data could readily be substituted or added and yield same or similar results, such as ATAC-seq, CUT&RUN/TAG, DNAse-seq, modified histone ChIP-seq, etc.
  • a data analysis pipeline was developed to quantify cfDNA fragments protected by nucleosomes in both the TFBS and TSS captured DNA.
  • the analysis pipeline Griffin (described in more detail above), includes using fragment length-based GC correction to remove GC biases that obscure signals.
  • a fragment size-aware GC-bias correction approach helps to maximize signal-to-noise and optimizes the analysis of captured DNA.
  • FIGS. 17A and 17B illustrate the detection of transcription factor (TF) expression in SCLC models using targeted sequencing of cfDNA.
  • FIG. 17A is a schematic of experimental workflow for proof-of-concept negative control ("healthy donor") and positive control ("flank tumors" from SCLC cellular models) samples.
  • FIG. 17B graphically illustrates aggregated coverage across TFBSs in targeted sequencing data for healthy donors (top row) and flank tumors (bottom row).
  • the TFBS is expected to be located at position 0 on the x axis. Data are coded by expected TF expression. Healthy donor-derived cfDNA is expected to reflect REST expression but not ASCL1, NEUROD1, or POU2F3. In SCLC models, systematic differences in coverage distribution as a function of TF expression are apparent.
  • FIGS. 18A-18C illustrate transcription factor activity inference using TFBS coverage distributions from SCLC patient samples with available matched tumor gene expression data.
  • FIG. 18A graphically illustrates aggregated coverage across TFBSs in targeted sequencing data for healthy donors (top row) and patients with SCLC (bottom row) for whom matched tumor tissue with gene expression data was available. Samples are coded by expected TF expression. Systematic differences in coverage distribution as a function of expected TF expression are again apparent.
  • FIG. 18B illustrates gene expression of key genes in selected patient samples displayed as a heatmap. Cells are coded by Z-score and the inset text is the log2(TPM+1).
  • Trough depth magnitude corresponds to gene expression of the key TFs in these bona fide SCLC patient samples.
  • FIG. 19 is a series of graphs illustrating quantification of transcription factor binding site peak to trough amplitude sample types. Distribution of TFBS peak to trough amplitude calculated from aggregated coverage distributions according to expected ground truth of TF expression.
  • ASCL1 site peak to trough amplitude is associated with both SCLC status and ASCL1 positivity, while NEUROD1 and POU2F3 peak to trough amplitude is associated only with TF positivity.
  • FIGS. 20A and 20B graphically illustrate gene expression inference using TSS coverage distributions in flank tumor positive control samples.
  • FIG. 20A illustrates TSS coverage distribution from targeted sequencing of cfDNA, grouped by gene expression quintile in SCLC flank tumor models (quintiles 1-5) and blood ("B", dark blue). Shown are 1,912 TSS corresponding to 1,213 genes, which were selected based on low expression in whole blood and correlation between TSS coverage distribution and gene expression. TSS coverage distribution varies systematically according to expression of the corresponding gene.
  • FIG. 20B illustrates receiver operating characteristic curves for prediction of gene expression as above or below a threshold value (shown for thresholds of 0.1, 0.5, 1.0, and 2.0), as inferred from the coverage distribution of the corresponding TSS.
  • FIG. 21 is a series of graphs illustrating use of aggregated coverage profiles across large rationally selected subsets of the TSS panel for prediction of SCLC vs NSCLC status in lung cancer PDX models and patient samples.
  • As shown overlaid on the NSCLC PDX model, an amplitude feature was calculated from each coverage distribution curve as the difference between the coverage at the -45 position and the +120 position relative to the TSS, facilitating comparison within and between samples.
  • FIG. 22 is a series of graphs illustrating use of aggregated coverage profiles across large rationally selected subsets of the TSS panel for prediction of SCLC vs NSCLC status in lung cancer PDX models and patient samples.
  • An SCLC PDX that transdifferentiated from an adenocarcinoma is identified with a thick red line.
  • Griffin uses unique normalization of cfDNA sequence data that is specific for nucleosome profiling and chromatin accessibility analysis. This includes GC-bias correction, repetitive sequence filtering, and local coverage normalization. These normalization techniques are not available in existing proof-of-concept methods such as that of Ulz P, et al. Inference of transcription factor binding from cell-free DNA enables tumor subtype prediction and early detection. Nat Commun. 2019;10(1):4666. Further, multi-omic feature extraction from Griffin for use in machine learning classifier construction to predict cancer subtype is unique to this approach.
  • a targeted sequencing panel is expected to yield higher resolution while retaining practical cost, and is more readily integrable with resequencing of regions of interest for genetic mutation detection (i.e. cancer gene panel sequencing).
  • From the output of Griffin, many features can be extracted from each binding site of interest, and machine learning classifiers can be used to predict histological subtypes of lung cancer from the Griffin-optimized cfDNA data.
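As one illustration of the kind of feature that can be extracted from a coverage profile, the amplitude described above for FIG. 21 (coverage at the -45 position minus coverage at the +120 position relative to the TSS) can be sketched in Python as follows. This is a minimal sketch and not the Griffin implementation: the function names, the array-based input format, and the nearest-position lookup are assumptions made for illustration. The peak-to-trough helper is likewise hypothetical, since the text does not define how the peak and trough are located; here the trough is assumed to be the minimum within a window around position 0 and the peak the maximum over the whole profile.

```python
import numpy as np


def coverage_at(positions, coverage, target):
    """Return the coverage at the profile position closest to `target`.

    `positions` are offsets in bp relative to the site (TSS or TFBS at 0);
    `coverage` is the aggregated, normalized coverage at those offsets.
    """
    positions = np.asarray(positions)
    coverage = np.asarray(coverage, dtype=float)
    idx = int(np.argmin(np.abs(positions - target)))
    return float(coverage[idx])


def tss_amplitude(positions, coverage, upstream=-45, downstream=120):
    """Amplitude feature: coverage at -45 minus coverage at +120 (per the text)."""
    return (coverage_at(positions, coverage, upstream)
            - coverage_at(positions, coverage, downstream))


def peak_to_trough(positions, coverage, trough_window=30):
    """Hypothetical peak-to-trough amplitude for a TFBS profile.

    Assumption (not from the source): the trough is the minimum coverage
    within +/- `trough_window` bp of position 0, and the peak is the
    maximum coverage anywhere in the profile.
    """
    positions = np.asarray(positions)
    coverage = np.asarray(coverage, dtype=float)
    near_site = np.abs(positions) <= trough_window
    if not near_site.any():
        raise ValueError("profile does not cover the site at position 0")
    return float(coverage.max() - coverage[near_site].min())
```

A scalar of this kind, computed once per site set and per sample, is the sort of value that could then be passed as one column of a feature matrix to a downstream classifier.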

Landscapes

  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Public Health (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Genetics & Genomics (AREA)
  • Chemical & Material Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Epidemiology (AREA)
  • Molecular Biology (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Biomedical Technology (AREA)
  • Primary Health Care (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Medicinal Chemistry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioethics (AREA)
  • Pathology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Investigating Or Analysing Biological Materials (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)
EP22785557.4A 2021-04-08 2022-04-08 Cell-free DNA sequence data analysis method to examine nucleosome protection and chromatin accessibility Pending EP4320618A2 (de)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US202163172590P 2021-04-08 2021-04-08
US202163276378P 2021-11-05 2021-11-05
PCT/US2022/024082 WO2022217096A2 (en) 2021-04-08 2022-04-08 Cell-free dna sequence data analysis method to examine nucleosome protection and chromatin accessibility

Publications (1)

Publication Number Publication Date
EP4320618A2 (de) 2024-02-14

Family

ID=83545807

Family Applications (1)

Application Number Title Priority Date Filing Date
EP22785557.4A Pending EP4320618A2 (de) 2021-04-08 2022-04-08 Cell-free DNA sequence data analysis method to examine nucleosome protection and chromatin accessibility

Country Status (5)

Country Link
EP (1) EP4320618A2 (de)
JP (1) JP2024515565A (de)
AU (1) AU2022255198A1 (de)
CA (1) CA3214391A1 (de)
WO (1) WO2022217096A2 (de)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115376616B (zh) * 2022-10-24 2023-04-28 臻和(北京)生物科技有限公司 Multi-classification method and device based on cfDNA multi-omics

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8725422B2 (en) * 2010-10-13 2014-05-13 Complete Genomics, Inc. Methods for estimating genome-wide copy number variations
US10497461B2 (en) * 2012-06-22 2019-12-03 Sequenom, Inc. Methods and processes for non-invasive assessment of genetic variations
WO2016015058A2 (en) * 2014-07-25 2016-01-28 University Of Washington Methods of determining tissues and/or cell types giving rise to cell-free dna, and methods of identifying a disease or disorder using same
US20190287645A1 (en) * 2016-07-06 2019-09-19 Guardant Health, Inc. Methods for fragmentome profiling of cell-free nucleic acids
WO2018227202A1 (en) * 2017-06-09 2018-12-13 Bellwether Bio, Inc. Determination of cancer type in a subject by probabilistic modeling of circulating nucleic acid fragment endpoints
CN112805563A (zh) * 2018-05-18 2021-05-14 约翰·霍普金斯大学 Cell-free DNA for assessing and/or treating cancer
WO2020232033A1 (en) * 2019-05-14 2020-11-19 Tempus Labs, Inc. Systems and methods for multi-label cancer classification

Also Published As

Publication number Publication date
CA3214391A1 (en) 2022-10-13
AU2022255198A1 (en) 2023-11-23
WO2022217096A3 (en) 2022-12-29
WO2022217096A2 (en) 2022-10-13
JP2024515565A (ja) 2024-04-10

Similar Documents

Publication Publication Date Title
JP7455757B2 (ja) Machine learning implementation for multi-analyte assay of biological samples
Doebley et al. A framework for clinical cancer subtyping from nucleosome profiling of cell-free DNA
Schwarz et al. Spatial and temporal heterogeneity in high-grade serous ovarian cancer: a phylogenetic analysis
US11978535B2 (en) Methods of detecting somatic and germline variants in impure tumors
Haferlach et al. Landscape of genetic lesions in 944 patients with myelodysplastic syndromes
Riester et al. Combination of a novel gene expression signature with a clinical nomogram improves the prediction of survival in high-risk bladder cancer
Naumov et al. Genome-scale analysis of DNA methylation in colorectal cancer using Infinium HumanMethylation450 BeadChips
Tran et al. Cancer genomics: technology, discovery, and translation
EP3430170B1 (de) Method for genome characterization
US20220336046A1 (en) Methods and systems for refining copy number variation in a liquid biopsy assay
CN112602156A (zh) Systems and methods for detecting residual disease
CN114026646A (zh) Systems and methods for assessing tumor fraction
De Sarkar et al. Nucleosome patterns in circulating tumor DNA reveal transcriptional regulation of advanced prostate cancer phenotypes
US20230175058A1 (en) Methods and systems for abnormality detection in the patterns of nucleic acids
US20240279745A1 (en) Systems and methods for multi-analyte detection of cancer
Brannon et al. Enhanced specificity of high sensitivity somatic variant profiling in cell-free DNA via paired normal sequencing: design, validation, and clinical experience of the MSK-ACCESS liquid biopsy assay
Adams et al. Global mutational profiling of formalin-fixed human colon cancers from a pathology archive
Ren et al. SinoDuplex: an improved duplex sequencing approach to detect low-frequency variants in plasma cfDNA samples
Yadav et al. Next-Generation sequencing transforming clinical practice and precision medicine
US20180371553A1 (en) Methods and compositions for the analysis of cancer biomarkers
AU2022255198A1 (en) Cell-free dna sequence data analysis method to examine nucleosome protection and chromatin accessibility
Wang et al. Terminal modifications independent cell-free RNA sequencing enables sensitive early cancer detection and classification
KR20240104202A (ko) Multimodal analysis of circulating tumor nucleic acid molecules
Seo et al. Germline Functional Variants Contribute to Somatic Mutation and Outcomes in Neuroblastoma
Doebley Predicting cancer subtypes from nucleosome profiling of cell-free DNA

Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20231108

AK Designated contracting states

Kind code of ref document: A2

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)