WO2024124207A2 - Systems and methods for cell-free nucleic acids methylation assessment - Google Patents

Info

Publication number
WO2024124207A2
Authority
WO
WIPO (PCT)
Prior art keywords
cell
nucleic acid
free nucleic
methylation
regions
Prior art date
Application number
PCT/US2023/083236
Other languages
French (fr)
Other versions
WO2024124207A3 (en
Inventor
Maximilian Diehn
Arash Ash Alizadeh
Emily Hamilton
Diego ALMANZA
Original Assignee
The Board Of Trustees Of The Leland Stanford Junior University
Application filed by The Board Of Trustees Of The Leland Stanford Junior University
Publication of WO2024124207A2
Publication of WO2024124207A3

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • C12Q1/6886 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00 Oligonucleotides characterized by their use
    • C12Q2600/154 Methylation markers

Definitions

  • the disclosure provides description for assessment of methylated cell-free nucleic acids for the purpose of detecting a condition.
  • Lung cancer screening remains an unmet clinical need.
  • Image-based screening is the most common current screening method but analysis of circulating tumor DNA (ctDNA) represents a promising alternative and complement.
  • Previous studies leveraged features of genetic alterations (e.g., single nucleotide variants and somatic copy number variations) found in cell-free DNA (cfDNA) to predict the Lung Cancer Likelihood in Plasma (Lung-CLiP) of a given sample (J. J. Chabon, et al., Nature. 2020 Apr;580(7802):245-251, the disclosure of which is incorporated herein by reference).
  • cfDNA also reflects the epigenome of the cells from which it originates. This means that tumor-derived cfDNA molecules (ctDNA) contain cancer-associated epigenetic signals that might be additionally leveraged for detection of malignancies.
  • DNA methylation represents a promising tumor biomarker.
  • CpGs CG dinucleotides
  • DNA methylation is found at millions of loci across the genome and is known to contribute to the regulation of chromatin conformation and gene expression.
  • methylation patterns vary greatly across cell types; methylomes are cell-type specific.
  • tissue-specific methylation signatures have been used to deconvolute bulk DNA methylation data, an exercise of particular relevance to cell-free DNA, which has been shown to comprise DNA from blood cells, liver, colon, and to a smaller extent other tissues.
  • a method is for sequencing for identification of condition-related differentially methylated regions in cell-free nucleic acids.
  • the method comprises obtaining a cell-free nucleic acid sample comprising cell-free nucleic acid molecules.
  • the method comprises extracting a subset of the cell-free nucleic acid molecules from the cell-free nucleic acid sample using a panel of nucleic acid probes designed to hybridize to regions that are known to be differentially methylated in a condition.
  • the method comprises converting nucleobases of the subset of the cell-free nucleic acid molecules.
  • the conversion of a nucleobase is indicative of a methylated state of that nucleobase.
  • the method comprises sequencing the subset of the cell-free nucleic acid molecules via high-throughput sequencing.
  • the condition is cancer
  • the cancer is non-small cell lung cancer.
  • the regions that are known to be differentially methylated in a condition comprise at least 5% of the regions in Table 2.
  • the regions that are known to be differentially methylated in a condition comprise at least 50% of the regions in Table 2.
  • the panel of nucleic acid probes excludes regions known to be associated with false discovery.
  • the panel of nucleic acid probes excludes regions known to be differentially methylated in blood cells.
  • the method comprises extracting a subset of the cell-free nucleic acid molecules from the cell-free nucleic acid sample using a panel of nucleic acid probes designed to hybridize to regions known to be correlated with factors associated with the condition.
  • the condition is cancer and the regions known to be correlated with factors associated with the condition comprise regions in Table 1.
  • the method comprises extracting a subset of the cell-free nucleic acid molecules from the cell-free nucleic acid sample using a panel of nucleic acid probes designed to hybridize to regions known to be invariably hypermethylated or invariably hypomethylated.
  • the regions known to be invariably hypermethylated or invariably hypomethylated comprise regions in Table 1.
  • converting nucleobases of the subset of the cell-free nucleic acid molecules comprises at least one of the following: bisulfite treatment, TET2 oxidation and APOBEC3A conversion, or TET2 oxidation and pyridine borane treatment.
  • converting nucleobases of the subset of the cell-free nucleic acid molecules comprises TET2 oxidation and APOBEC3A conversion.
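As a minimal sketch of how a conversion readout can be interpreted in silico: with bisulfite treatment or TET2 oxidation plus APOBEC3A conversion, unmethylated cytosines are read as T while methylated cytosines are still read as C; with TET2 oxidation plus pyridine borane treatment the readout is inverted, and methylated cytosines are read as T. The function name, the `methylated_reads_as` parameter, and the toy sequences below are hypothetical, and the sketch assumes a forward-strand read aligned base-for-base to the reference with no insertions or deletions.

```python
def call_cpg_methylation(reference, read, methylated_reads_as="C"):
    """Return one call per reference CpG covered by the read (True = methylated).

    Assumptions (not taken from the source): `read` is a forward-strand read
    aligned base-for-base to `reference`, with no indels.
    methylated_reads_as="C": bisulfite or TET2/APOBEC3A style conversion.
    methylated_reads_as="T": TET2/pyridine borane style conversion.
    """
    if len(reference) != len(read):
        raise ValueError("reference and read must have the same length")
    if methylated_reads_as not in ("C", "T"):
        raise ValueError("methylated_reads_as must be 'C' or 'T'")

    calls = []
    for i in range(len(reference) - 1):
        if reference[i:i + 2] != "CG":
            continue  # only cytosines in a CpG context are assessed
        base = read[i]
        if base not in ("C", "T"):
            continue  # sequencing error or variant: no call at this CpG
        calls.append(base == methylated_reads_as)
    return calls


# Toy example: CpGs at positions 1, 5 and 9 of the reference.
reference = "ACGTTCGGACGA"
read = "ACGTTTGGACGA"  # the middle CpG cytosine was read as T
bisulfite_like_calls = call_cpg_methylation(reference, read, "C")
borane_like_calls = call_cpg_methylation(reference, read, "T")
```

The same read therefore yields opposite calls depending on which chemistry produced it, which is why the conversion method has to be carried through to the analysis step.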
  • cell-free nucleic acid sample is derived from a collection of: blood, plasma, saliva, urine, stool, mucus, lymph, or another bodily fluid.
  • cell-free nucleic acid sample comprises at least 100,000 nucleic acid molecules.
  • the cell-free nucleic acid of the cell-free nucleic acid sample is cell-free DNA.
  • the cell-free nucleic acid of the cell-free nucleic acid sample comprises at least 1 ng of cell-free DNA.
  • the cell-free nucleic acid of the cell-free nucleic acid sample comprises at least 15 ng of cell-free DNA.
  • the method comprises attaching adapters to the cell-free nucleic acid molecules.
  • the adapters are resistant to nucleobase conversion as performed in the step converting nucleobases of the subset of the cell-free nucleic acid molecules.
  • the panel of nucleic acid probes comprises at least 50 unique probes.
  • a method is for sequencing that enhances detection of differentially methylated regions for assessing a condition of an individual.
  • the method comprises preparing a cell-free nucleic acid sample for targeted methyl sequencing.
  • the prepared cell-free nucleic acid sample is collected from an individual and comprises at least 100,000 cell-free nucleic acid molecules that are derived from a plurality of regions that are known to be differentially methylated in a condition.
  • the method comprises sequencing the cell-free nucleic acid sample via a high-throughput sequencer to yield a sequencing result of the cell-free nucleic acid molecules that are derived from a plurality of regions that are known to be differentially methylated in a condition.
  • the method comprises computing, using a computational device and the sequencing result, a methylation metric, wherein the methylation metric is computed for one region of the plurality of regions that are known to be differentially methylated in a condition.
  • the methylation metric indicates an amount of methylation of cell-free nucleic acid molecules that align to the region.
  • the method comprises entering, using the computational device, the computed methylation metric as a feature into a computational model to yield an assessment of the cell-free nucleic acid sample.
  • the assessment indicates the individual has the condition.
  • the method comprises aligning each cell-free nucleic acid molecule sequencing result across a region.
  • the region is one of the plurality of regions that are differentially methylated in a condition.
  • the method comprises for a set of cell-free nucleic acid molecules that align across the region, determining an amount of methylation for each cell-free nucleic acid molecule of the set.
  • the methylation metric is based on at least one cell-free nucleic acid molecule of the set.
  • the method comprises determining a number of cell-free nucleic acid molecules within the set that are methylated more than a threshold.
  • the method comprises computing a methylated molecule fraction (MMF) for the region.
  • MMF methylated molecule fraction
  • $\mathrm{MMF} = \dfrac{\#\,\text{molecules} > \text{methylation threshold}}{\text{total molecules assessed}}$
  • #molecules > methylation threshold is the number of cell-free nucleic acid molecules within the set that are determined to be methylated more than a threshold.
  • total molecules assessed is the total number of cell-free nucleic acid molecules within the set.
  • the threshold is 60% of CpGs methylated.
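The MMF definition above can be sketched as follows. This is an illustration under stated assumptions, not the disclosed implementation: each molecule is represented as a hypothetical `(methylated_cpgs, total_cpgs)` pair, "more than a threshold" is read as a strict inequality, and returning `None` for a region with no assessable molecules is a choice made here.

```python
def methylated_molecule_fraction(molecules, threshold=0.60):
    """Fraction of molecules whose share of methylated CpGs exceeds `threshold`.

    `molecules` is a list of (methylated_cpgs, total_cpgs) pairs for the
    molecules aligned across one region (a hypothetical representation).
    Returns None when no molecule can be assessed.
    """
    assessed = [(m, n) for m, n in molecules if n > 0]
    if not assessed:
        return None
    above = sum(1 for m, n in assessed if m / n > threshold)
    return above / len(assessed)


# Four molecules covering one region: 100%, 60%, 80% and 0% methylated.
region_molecules = [(5, 5), (3, 5), (4, 5), (0, 5)]
mmf = methylated_molecule_fraction(region_molecules)
```

With the strict inequality, the molecule at exactly 60% is not counted, so two of the four molecules contribute to the numerator.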
  • computing methylation metric for a region further comprises identifying within the set of cell-free nucleic acid molecules that align across the region, the cell-free nucleic acid molecule that is most methylated.
  • the methylation metric computed is an amount of methylation of the cell-free nucleic acid molecule that is most methylated.
  • each cell-free nucleic acid molecule of the set of cell-free nucleic acid molecules that align across the region has a number of CpGs greater than a threshold.
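A minimal sketch of the most-methylated-molecule metric, assuming the same hypothetical `(methylated_cpgs, total_cpgs)` representation. The `min_cpgs` value of 5 is a placeholder, since the source does not state the CpG-count threshold here, and expressing the amount of methylation as a fraction rather than a CpG count is also an assumption.

```python
def max_molecule_methylation(molecules, min_cpgs=5):
    """Methylation fraction of the most methylated eligible molecule in a region.

    Only molecules with more than `min_cpgs` CpGs are eligible
    (`min_cpgs=5` is a placeholder value). Returns None if none qualify.
    """
    eligible = [m / n for m, n in molecules if n > min_cpgs]
    if not eligible:
        return None
    return max(eligible)


# The fully methylated 3-CpG molecule is ignored because it has too few CpGs.
region_molecules = [(2, 4), (6, 8), (7, 10), (3, 3)]
top = max_molecule_methylation(region_molecules)
```

Filtering on CpG count before taking the maximum keeps short, CpG-poor fragments from dominating the metric.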
  • the method comprises computing, using the computational device and the sequencing result, a methylation metric for each region of at least fifty percent of the plurality of regions that are known to be differentially methylated in a condition.
  • the method comprises entering, using the computational device, each computed methylation metric as features into the computational model to yield an assessment of the cell-free nucleic acid sample.
  • the method comprises computing, using the computational device and the sequencing result, a methylation metric for each region of the plurality of regions that are known to be differentially methylated in a condition.
  • the method comprises entering, using the computational device, each computed methylation metric as features into the computational model to yield an assessment of the cell-free nucleic acid sample.
  • the method comprises computing, using the computational device and the sequencing result, a plurality of methylation metrics, wherein each methylation metric is computed for one region of the plurality of regions that are known to be differentially methylated in a condition.
  • the method comprises computing, using the computational device and the plurality of methylation metrics, a sample summary statistic that combines the plurality of methylation metrics.
  • the method comprises entering, using the computational device, the sample summary statistic as a feature into a computational model to yield the assessment of the cell-free nucleic acid sample.
  • the sample summary statistic is a percentile of a number of regions with nonzero MMFs.
  • the sample summary statistic is a percentile of a number of regions where MMF is zero.
  • the sample summary statistic is a percentile of a number of regions where MMF is greater than a threshold.
  • the sample summary statistic is a percentile of the amount of methylation of the cell-free nucleic acid molecule that is most methylated.
  • the sample summary statistic is a median of the amount of methylation of the cell-free nucleic acid molecule that is most methylated.
  • the sample summary statistic is a skewness of the amount of methylation of the cell-free nucleic acid molecule that is most methylated.
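The summary statistics listed above could be combined per sample roughly as sketched below. The dictionary keys, function names, and the nearest-rank percentile definition are assumptions made for illustration. The sketch reports raw region counts; converting a count into a percentile against a reference cohort is not shown, because the source does not specify how that is done.

```python
import math
import statistics


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100] (one of several conventions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered) - 1e-9))
    return ordered[rank - 1]


def skewness(values):
    """Population (Fisher-Pearson) skewness; 0.0 when all values are equal."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0


def sample_summary(mmf_by_region, max_meth_by_region, mmf_threshold=0.0, q=75):
    """Combine per-region metrics into sample-level features (hypothetical keys)."""
    mmfs = list(mmf_by_region.values())
    maxes = list(max_meth_by_region.values())
    return {
        "regions_nonzero_mmf": sum(1 for v in mmfs if v > 0),
        "regions_zero_mmf": sum(1 for v in mmfs if v == 0),
        "regions_mmf_above_threshold": sum(1 for v in mmfs if v > mmf_threshold),
        "max_methylation_percentile": percentile(maxes, q),
        "max_methylation_median": statistics.median(maxes),
        "max_methylation_skewness": skewness(maxes),
    }


# Made-up values for four regions, for illustration only.
mmf_by_region = {"r1": 0.0, "r2": 0.25, "r3": 0.5, "r4": 0.0}
max_meth_by_region = {"r1": 0.25, "r2": 1.0, "r3": 0.75, "r4": 0.5}
features = sample_summary(mmf_by_region, max_meth_by_region)
```

Each entry of `features` can then serve as a single model feature, which keeps the feature count independent of panel size.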
  • the plurality of regions that are known to be differentially methylated in a condition comprises at least 10 genomic regions associated with a condition.
  • the plurality of regions that are known to be differentially methylated in a condition comprises at least 50 genomic regions associated with a condition.
  • the condition is a cancer.
  • Figure 1 provides an example of a method to perform sequencing of methylated cell-free nucleic acid samples.
  • Figure 2 provides an example of a method to classify a cell-free nucleic acid sample.
  • Figure 3 provides an example of a computational processing system for classification of cell-free nucleic acid samples.
  • Figures 4A to 4G provide differential methylation analysis to select CpGs for targeted NSCLC panel design.
  • Figure 4A provides a volcano plot of average methylation difference (Δβ) vs. Benjamini-Hochberg adjusted limma P-value for the differential methylation analysis between lung adenocarcinoma (LUAD) and normal lung. Each point is a CpG in the Illumina 450k array; the 412 points highlighted in green were selected as differentially methylated CpGs (DMCs) for LUAD.
  • Figure 4B provides a heatmap showing methylation states of DMCs selected for LUAD in TCGA-LUAD, blood, and normal lung.
  • Figure 4C provides a volcano plot of average methylation difference (Δβ) vs. Benjamini-Hochberg adjusted limma P-value for the differential methylation analysis between lung squamous cell carcinoma (LUSC) and normal lung. Each point is a CpG in the Illumina
  • Figure 4D provides a heatmap showing methylation states of DMCs selected for LUSC in TCGA-LUSC, blood, and normal lung. Columns are samples, rows are individual CpGs, and heat is methylation level (β value).
  • Figure 4E provides data showing the relationship between LUAD-Normal Lung Δβ and LUSC-Normal Lung Δβ for all selected DMCs. Points are colored by whether they were picked as DMCs for LUAD, LUSC, or both (‘LUAD; LUSC’).
  • Figure 4F provides a bar plot of the CpG density annotations for the 651 selected DMCs.
  • Figure 4G provides a bar plot of gene context annotations for the 651 selected DMCs. The total exceeds 651 due to several CpGs having annotations relative to multiple genes.
  • Figures 5A to 5D provide validation data of selector DMRs and selection of additional markers.
  • Figure 5A provides a heatmap of 651 selected NSCLC DMRs in NSCLC cell lines, NSCLC primary tumors from TCGA (LUAD & LUSC), normal lung and blood.
  • Figure 5B provides a heatmap of 651 selected NSCLC DMRs in additional TCGA cancer types. BLCA, bladder; BRCA, breast; COADREAD, colorectal; GBM, glioblastoma; PAAD, pancreas; PRAD, prostate.
  • Figure 5C provides a heatmap of additional CpGs selected to distinguish NSCLC from other cancer types.
  • Figure 5D provides a heatmap showing hypermethylated and hypomethylated control regions. For all heatmaps within Figs. 5A to 5D, columns are samples, CpGs are rows, and heat is methylation level (β value).
  • Figures 6A and 6B provide data showing methylation levels in NCI-H441 cell line admixtures.
  • Figure 6A provides a heatmap of NSCLC-DMRs (rows) in cell line admixture samples (columns) of NCI-H441 cell line DNA mixed into healthy control cfDNA. Libraries were prepared using a preliminary mCAPP-Seq protocol with bisulfite conversion and captured with the targeted panel. Each spike level was prepared in triplicate.
  • Figure 6B provides a violin plot of average methylation levels in hyper-DMRs across all three replicates for each cell line spike level. ****, P < 0.0001, Student’s T-test.
  • Figures 7A to 7E provide data on DMRs above background as a measure of ctDNA content.
  • Figure 7B provides a heatmap of methylated AFs in NSCLC hyper-DMRs in the same samples as in Fig.
  • Figure 7A depicts DMRs above background in NCI-H441 cell line admixture samples.
  • Figure 7D depicts DMRs above background in 12 control and 12 Stage IV patient cfDNA libraries prepared with bisulfite sequencing.
  • Figure 7E provides a linear fit for estimating the LOD of mCAPP-Seq using cell line spike samples.
  • Horizontal line denotes mean + 3 standard deviations of DMRs above background above undetected samples (0% and 0.005%).
  • Vertical line denotes spike AF (%) at which mean + 3 standard deviations above undetected samples would be expected based on linear fit (0.013%). All P-values calculated with Student’s T-test: **, P < 0.01; ****, P < 0.0001; ns, not significant.
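The limit-of-detection estimate described for Figure 7E (a detection threshold of mean + 3 standard deviations of the undetected samples, intersected with a linear fit of signal against spike AF) can be sketched as below. The function name is hypothetical, the use of the sample standard deviation and of ordinary least squares is an assumption, and the numbers are made up for illustration; they are not data from the disclosure.

```python
import statistics


def estimate_lod(spike_af, signal, undetected_signal):
    """Return (threshold, slope, intercept, lod) for a spike-in dilution series.

    threshold = mean + 3 * sample SD of the undetected samples (assumed SD type).
    slope/intercept come from an ordinary least-squares fit of signal on spike AF.
    lod is the spike AF at which the fitted line reaches the threshold.
    """
    threshold = statistics.mean(undetected_signal) + 3 * statistics.stdev(undetected_signal)

    mean_x = statistics.mean(spike_af)
    mean_y = statistics.mean(signal)
    sxx = sum((x - mean_x) ** 2 for x in spike_af)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(spike_af, signal))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    lod = (threshold - intercept) / slope
    return threshold, slope, intercept, lod


# Made-up dilution series: spike AF in percent vs. DMRs above background.
spike_af = [0.01, 0.02, 0.04]
dmrs_above_background = [2.0, 4.0, 8.0]
undetected = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
threshold, slope, intercept, lod = estimate_lod(spike_af, dmrs_above_background, undetected)
```

The estimate is only as good as the linearity of the dilution series near the threshold, so it is best read as an approximate figure.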
  • Figures 8A to 8C provide data showing comparison of conversion methods for detection of methylated cytosines.
  • Figure 8A provides mapping rates for targeted sequencing libraries prepared with bisulfite, EM-seq, TAPS, or no conversion captured with a 44kb panel. Bisulfite and EM-seq samples were mapped with Bismark and TAPS and unconverted samples were mapped with bwa aln.
  • Figure 8B provides median deduplicated depths after barcode deduping for each conversion method. *, P < 0.05, Student’s T-test.
  • Figure 8C provides rate of conversion of unmethylated cytosines to thymine in lambda control DNA spiked into EM-Seq and bisulfite samples. ****, P < 0.0001, Student’s T-test.
  • Figures 9A to 9D provide a technical assessment of mCAPP-Seq with EM-seq conversion.
  • Figure 9A provides median deduplicated depths for EM-Seq libraries prepared in duplicate for 5 different inputs of cfDNA.
  • Figure 9B provides median deduplicated depths for cell line spike admixture samples prepared with 3 cfDNA inputs and 4 tumor fractions. P-values calculated with Student’s T-test. ****, P < 0.0001.
  • Figure 9C depicts DMRs above background for each input and tumor fraction prepared in triplicate. P-values calculated for each spike level compared to 0% with Student’s T-test. *, P < 0.05; **, P < 0.01; ns, not significant.
  • Figure 9D depicts DMRs above background for each input and tumor fraction in a repeated cell line spike experiment prepared in triplicate. P-values calculated for each spike level compared to 0% with Student’s T-test. *, P < 0.05; **, P < 0.01; ***, P < 0.001; ns, not significant.
  • Figures 10A and 10B provide data showing an application of EM-seq mCAPP-Seq to advanced stage NSCLC patients.
  • Figure 10A provides boxplots of hyper-DMR methylated allele fractions for 12 control cfDNA samples and 12 stage IV cfDNA samples. Each boxplot represents one sample. T-test computed on all control DMRs vs. all patient DMRs. ****, P < 0.001, Student’s T-test.
  • Figure 10B depicts DMRs above background for 12 control cfDNA samples and 12 stage IV cfDNA samples sequenced with EM-Seq. ****, P < 0.001, Student’s T-test.
  • Figures 11A to 11F provide cfDNA extraction metrics and availability for early-stage training and validation cohorts.
  • Figure 11A provides data on volume plasma available for each cfDNA sample extracted from risk-matched controls and early-stage NSCLC patients in the mCAPP-Seq training and validation cohorts.
  • Figure 11B provides total cfDNA yields extracted from each sample.
  • Figure 11C provides data depicting percent of cfDNA in the 50-450bp size range as measured by Agilent fragment analyzer.
  • Figure 11D provides data depicting total cfDNA extracted in the 50-450bp size range per sample.
  • Figure 11E provides data on plasma cfDNA concentration (ng/mL) considering only cfDNA in the 50-450bp size range.
  • Figure 11F provides data depicting percent of samples in each cohort with sufficient cfDNA in the 50-450bp size range (> 40ng) for both mCAPP-seq and regular CAPP-Seq.
  • Figures 12A to 12E provide resultant data from applying mCAPP-Seq to early-stage NSCLC patients and risk-matched controls.
  • Figure 12A provides non-deduplicated total sequencing read pairs per sample in the early-stage training cohort NSCLC patients and controls. ns, not significant by Wilcoxon rank-sum test.
  • Figure 12B provides data on median selector-wide unique depth after removing PCR duplicates using molecular barcodes for all early-stage training cohort NSCLC patients and controls. ns, not significant by Wilcoxon rank-sum test.
  • Figure 12C provides data depicting relationship between reads on the sequencer and median unique depth in all samples in the training cohort. R and p-value calculated using Spearman correlation.
  • Figure 12D provides boxplots for each NSCLC patient and control depicting the distribution of hyper-DMR methylated allele fractions (AFs). Wilcoxon rank-sum test performed on all patients vs. all controls. ****, P < 0.0001.
  • Figure 12E depicts DMRs above background for all early-stage NSCLC patients and controls in the training cohort. ***, P < 0.001, Wilcoxon rank-sum test.
  • Figures 13A to 13D provide data on optimization and feature identification for a statistical model for classification of early-stage patient plasma from cfDNA methylation.
  • Figure 13A provides a heatmap displaying leave-one-out (LOO) detection sensitivity in early-stage cancer patient plasma samples in the training cohort at different minimum thresholds for minimum number of CpGs and minimum percent methylated CpGs required to consider a molecule ‘highly methylated.’
  • Figure 13B provides a heatmap displaying LOO specificity in risk-matched non-cancer controls for the analysis shown in Fig. 13A.
  • Figure 13C provides data on distribution of highly methylated molecule fractions across all hyper-DMRs for every cancer patient and control in the early-stage training cohort.
  • Figure 13D provides data on distribution of number of methylated CpGs per subject in the set of most highly methylated fragments in the hyper-DMRs. These distributions were summarized to calculate the fragment methylation index (FMI).
  • Figures 14A to 14G provide data depicting that fragment methylation index is a discriminatory feature for NSCLC detection.
  • Figure 14A provides data showing association between fragment methylation index and stage.
  • Figure 14B provides boxplots showing median fragment lengths per patient or control across all fragments in the hyper-DMR regions.
  • Figure 14C provides boxplots showing fraction of all fragments in the hyper-DMR regions with lengths greater than 300bp for each patient or control.
  • Figure 14F provides boxplots showing median fragment lengths per patient or control across fragments with the highest CpG count, regardless of methylation state, per hyper-DMR region for every patient and control.
  • Figure 14G provides boxplots showing median number of CpGs across fragments with the highest CpG count, regardless of methylation state, per hyper-DMR region for every patient and control.
  • Figures 15A to 15F provide data on statistical models that show biological plausibility for detecting early-stage lung cancer.
  • Figure 15A provides data on detection sensitivity by stage for a LASSO logistic regression classifier trained in a leave-one-out framework using DMR AFs as features.
  • Figure 15B provides a number of non-zero coefficients for each leave-one-out model shown in Fig. 15A.
  • Figure 15C provides results of fraction of leave-one-out models in which a feature had a non-zero coefficient for the most recurrently selected DMR features.
  • Figure 15D provides data on detection sensitivity by stage for a Ridge logistic regression classifier trained in a leave-one-out framework using quantiles of the DMR AF distribution as features.
  • Figure 15E provides data on detection sensitivity by stage for a LASSO logistic regression classifier trained in a leave-one-out framework using DMR AF features and fragment-level CpG statistics.
  • Figure 15F provides the fraction of leave-one-out models in which recurrently selected features from the LOO analysis in Fig. 15E had non-zero coefficients.
  • Figures 16A to 16E provide data showing that a detection model robustly detects early-stage NSCLC.
  • Figure 16A provides data on detection sensitivity at 95% specificity by stage for a LASSO logistic regression classifier trained in a leave-one-out framework using DMR AF quantiles and fragment-level CpG statistics as features.
  • Figure 16B provides data depicting recurrently selected features from the LOO analysis in Fig. 15F.
  • Figure 16C provides data on model detection sensitivity by disease histology.
  • Figure 16D provides data on model detection sensitivity by stage for adenocarcinoma only.
  • Figure 16E provides data on model detection sensitivity at 80% specificity.
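Reporting sensitivity at a fixed specificity, as in Figures 16A and 16E, amounts to choosing a score cutoff from the controls and counting the cases above it. The sketch below shows one way to do this; the function name, the tie-handling convention (a score equal to the cutoff is called negative), and the example scores are assumptions, not values from the disclosure.

```python
import math


def sensitivity_at_specificity(case_scores, control_scores, specificity=0.95):
    """Return (cutoff, sensitivity) with at least `specificity` of controls negative.

    A sample is called positive when its score is strictly greater than the
    cutoff, so controls tied with the cutoff count as negatives.
    """
    if not case_scores or not control_scores:
        raise ValueError("both score lists must be non-empty")
    ordered = sorted(control_scores)
    # Smallest number of controls that must fall at or below the cutoff;
    # the small epsilon guards against floating-point overshoot in the product.
    n_negative = max(1, math.ceil(specificity * len(ordered) - 1e-9))
    cutoff = ordered[n_negative - 1]
    detected = sum(1 for score in case_scores if score > cutoff)
    return cutoff, detected / len(case_scores)


# Made-up classifier scores: 20 controls and 5 cases.
control_scores = [i / 100 for i in range(1, 21)]
case_scores = [0.05, 0.19, 0.25, 0.5, 0.9]
cutoff, sensitivity = sensitivity_at_specificity(case_scores, control_scores, 0.95)
```

With 20 controls and a 95% target, 19 controls must be negative, so the cutoff is the 19th-lowest control score and exactly one control is called positive.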
  • the systems and methods provide a means for detecting cancer-derived cell-free nucleic acids (cfNA) within an individual.
  • cfNA cell-free nucleic acids
  • Cancer-derived methylation signals have been shown to be detectable in cell-free DNA (cfDNA).
  • cfDNA cell-free DNA
  • NSCLC non-small cell lung cancer
  • M. Esteller, et al., Cancer Res. 1999 Jan 1;59(1):67-70, the disclosure of which is incorporated by reference
  • high-throughput sequencing of broader swathes of the cell-free methylome has demonstrated success for cancer detection and localization (M. C. Liu, et al., DNA. Ann Oncol. 2020 Jun;31(6):745-759; and S. Y.
  • This disclosure describes a lung cancer-focused cfDNA methylation assay that integrates both epigenetic and genetic signals, improving noninvasive detection of small tumors.
  • the systems and methods of the disclosure provide a means of assessing methylation of a cfNA sample.
  • the systems and methods can generate and sequence libraries sourced from a cfNA sample.
  • the systems and methods can be utilized in any sequencing technique for identification of methylation patterns that are indicative of a condition. It is now appreciated that aberrant methylation patterns are biomarkers of various conditions, especially cancer.
  • the various embodiments of the disclosure provide a means to take advantage of methylation patterns of various conditions to detect such conditions in a cfNA sample.
  • Provided in Fig. 1 is an example of a method to perform sequencing of cell-free nucleic acid samples to detect differentially methylated regions.
  • a cfNA sample is obtained and processed for targeted methyl sequencing.
  • the method can be useful for sequencing a cfNA sample collected from an individual for the purpose of detecting a condition that is marked by aberrant methylation patterns.
  • the method can be utilized to sequence a cfNA sample to detect aberrant methylation patterns associated with a cancer.
  • the method can be used for an early detection screen (e.g.
  • Sequencing results can be utilized in downstream applications, such as performing computational analysis on the sequencing results in order to classify a sample for detecting a condition.
  • Method 100 can begin by obtaining (101) a cfNA sample.
  • the cfNA sample can comprise cell-free RNA (cfRNA) and/or cell-free DNA (cfDNA).
  • cfRNA cell-free RNA
  • cfDNA cell-free DNA
  • aberrant methylation patterns of cfDNA can be useful biomarkers of a condition such as cancer.
  • cfNA can be collected from any extracellular source, such as (for example) blood, plasma, saliva, urine, stool, mucus, lymph and/or other bodily fluids.
  • a cfNA sample is described as a liquid biopsy and biological sample, but any description of a sample that comprises cfNA molecules is applicable.
  • cfNAs can be isolated and purified by any appropriate means.
  • a human plasma sample typically contains 0.5 to 10 ng per mL of cfDNA, corresponding to 150 to 3,000 copies of the haploid human genome.
  • Some conditions such as (for example) cancer and donor transplant rejection, will result in higher levels of circulating cfDNA, with levels of greater than 1000 ng per mL having been detected.
  • cfDNA typically circulates in fragments ranging from 120 to 220 bp, with a maximum peak at about 167 bp.
  • plasma can typically have from about 15 million to 400 million cfDNA copies per mL, and greater than 40 billion cfDNA copies per mL in some conditions.
  • Other biological samples will vary in the amount of cfDNA, but generally have concentrations in the millions of copies per mL of sample, and the concentrations are greater when affected by a condition marked by high necrosis and/or DNA leakage, such as cancer.
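The correspondence stated above between cfDNA mass and haploid genome copies (0.5 to 10 ng per mL corresponding to roughly 150 to 3,000 copies) can be checked with a short calculation. The sketch assumes a haploid human genome mass of about 3.3 pg, a commonly used approximation that is not stated in the source; the constant and function names are hypothetical.

```python
# Approximate mass of one haploid human genome in picograms.
# This value is an assumption, not taken from the source.
HAPLOID_GENOME_MASS_PG = 3.3


def haploid_genome_equivalents(cfdna_ng):
    """Convert a cfDNA mass in nanograms to approximate haploid genome copies."""
    return cfdna_ng * 1000.0 / HAPLOID_GENOME_MASS_PG


low_copies = haploid_genome_equivalents(0.5)    # lower end of the quoted plasma range
high_copies = haploid_genome_equivalents(10.0)  # upper end of the quoted plasma range
```

The same conversion can be used to reason about how many genome copies a given library input mass represents, which in turn bounds the lowest tumor fraction that can be observed.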
  • the extracted and isolated cfDNA fragments can be utilized as originating nucleic acid molecules.
  • a collection of originating nucleic acid molecules for a sequencing reaction can have greater than 10,000 nucleic acid molecules, greater than 100,000 nucleic acid molecules, greater than 1,000,000 nucleic acid molecules, greater than 10,000,000 nucleic acid molecules, greater than 100,000,000 nucleic acid molecules, or greater than 1,000,000,000 nucleic acid molecules.
  • a cfNA sample is obtained prior to any indication of cancer.
  • a cfNA sample is obtained to provide an early screen in order to detect a cancer prior to a diagnosis of cancer.
  • a cfNA sample is obtained to detect if residual cancer exists after a treatment.
  • a cfNA sample is obtained during treatment to determine whether the treatment is providing the desired response. Screening of any particular cancer can be performed. In some embodiments, screening is performed to detect a cancer that develops aberrant methylation patterns in stereotypical regions in the genome, such as (for example) lung cancer. In some embodiments, screening is performed to detect a cancer in which regions of aberrant methylation were discovered utilizing a prior extracted cancer biopsy, which may be useful for monitoring treatment or detecting minimal residual disease.
  • a cfNA sample is obtained from an individual with a determined risk of developing cancer, such as those with a familial history of the disorder or who have determined risk factors (e.g., exposure to carcinogens).
  • a cfNA sample is obtained from any individual within the general population.
  • a cfNA sample is obtained from individuals within a particular age group with higher risk of cancer, such as, for example, aging individuals above the age of 50.
  • a cfNA sample is obtained from an individual diagnosed with and treated for a cancer.
  • Method 100 can further generate (103) a sequencing library targeting differentially methylated regions.
  • targeted sequencing can be performed by capturing and/or specifically amplifying particular regions of a genome.
  • adapters and/or primers are attached onto cell-free nucleic acids to facilitate sequencing.
  • any appropriate amount of input cfNA can be utilized in library preparation.
  • the limit of detection (LOD) can be affected by the amount of input cfNA.
  • LOD limit of detection
  • the amount of input cfDNA for library preparation is at least
  • targeted sequencing of particular genomic loci is to be performed, and thus particular sequences corresponding to the particular loci are captured via hybridization prior to sequencing (e.g., capture sequencing).
  • capture sequencing is performed utilizing a panel of probes that pull down (or capture) regions that have been discovered to be differentially methylated for a particular cancer (e.g., lung cancer).
  • capture sequencing is performed utilizing a panel of probes that pull down (or capture) regions that have been discovered to be differentially methylated as determined prior by methyl sequencing a biopsy of the cancer.
  • a panel of probes comprises at least 10 unique probes, at least 20 unique probes, at least 50 unique probes, at least 100 unique probes, at least 150 unique probes, at least 200 unique probes, at least 250 unique probes, at least 500 unique probes, or at least 1000 unique probes.
  • Provided in Table 2 is a set of genomic loci for detecting regions aberrantly methylated in non-small cell lung cancer (NSCLC), and in particular in lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC). All or some of these regions can be utilized for assessing NSCLC. Further, these regions may be utilized for differentiating between LUAD and LUSC, which may be useful for determining treatment options.
  • a panel of capture nucleic acid probes can be designed to hybridize to at least 5%, at least about 10%, at least 20%, at least 30%, at least about 40%, at least 50%, at least 60%, at least 70%, at least 80%, at least 90%, at least 95%, at least 99%, or 100% of the genomic regions listed in Table 2.
  • the sequence of a probe is at least 50% complementary, at least 60% complementary, at least 70% complementary, at least 80% complementary, at least 90% complementary, at least 95% complementary, or at least 99% complementary to a sequence within a genomic region listed in Table 2.
  • a standard genomic reference, such as hg19, can be utilized to retrieve sequences of genomic regions listed in Table 2.
  • methyl sequencing of the cancer can be performed to identify regions that are differentially methylated in association with the cancer tissue.
  • Nucleic acid probes can be designed to hybridize to these identified regions such that methyl sequencing of cfNA can better detect the presence of cancer-derived cfNAs in a biological sample of that individual. This personalized method of using probes designed to hybridize to identified regions that are differentially methylated can improve the ability to detect the presence of cancer when performing assessments of therapeutic progress and/or detection of minimal residual disease.
  • a panel of capture nucleic acid probes can be designed to hybridize to at least 10 genomic regions, at least 20 genomic regions, at least 50 genomic regions, at least 100 genomic regions, at least 150 genomic regions, at least 200 genomic regions, at least 250 genomic regions, at least 500 genomic regions, at least 750 genomic regions, or at least 1000 genomic regions.
  • the sequence of a probe is at least 50% complementary, at least 60% complementary, at least 70% complementary, at least 80% complementary, at least 90% complementary, at least 95% complementary, or at least 99% complementary to a sequence that has been identified to be differentially expressed in an individual’s cancer.
  • capture sequencing is performed utilizing a panel of probes that pull down (or capture) regions related to other useful information, which may be useful for performing a diagnostic.
  • certain methylated regions can be useful to provide indication of factors associated with cancer, such as regions in which methylation patterns are correlated with age, smoking history, and body mass index (BMI).
  • Provided in Table 1 are many regions that are associated with various factors, including age, BMI, cell type (cibersortX-sites), tissue origin (miniselector), multi-cancer, pan-cancer, and smoking history.
  • a panel of capture nucleic acid probes can be designed to hybridize to at least about 5%, at least about 10%, at least about 20%, at least about 30%, at least about 40%, at least about 50%, at least about 60%, at least about 70%, at least about 80%, at least about 90%, at least about 95%, at least about 99%, or about 100% of the genomic regions identified in Table 1.
  • the sequence of a probe is at least 50% complementary, at least 60% complementary, at least 70% complementary, at least 80% complementary, at least 90% complementary, at least 95% complementary, or at least 99% complementary to a sequence within a genomic region identified in Table 1.
  • a standard genomic reference, such as hg19, can be utilized to retrieve sequences of genomic regions listed in Table 1.
  • capture sequencing is performed utilizing a panel of probes that pull down (or capture) regions utilized as controls, such as regions that are invariably hypermethylated and/or invariably hypomethylated. In some embodiments, capture sequencing is performed utilizing a panel of probes that pull down (or capture) regions related to other forms of molecular information useful for diagnostics, such as single nucleotide variants, insertions, deletions, and copy number variations.
  • regions that are differentially methylated in a certain condition may also be differentially methylated for other reasons unrelated to the condition. For instance, regions may be differentially methylated between healthy individuals, upon an environmental stimulus, in association with different cell types, etc. Differential methylation in blood cells is a special concern as these cells are a high source of cfNAs. Detection of differential methylation in these regions may yield false discovery. Accordingly, in some embodiments, a panel of probes excludes regions known to be associated with false discovery. And in some embodiments, a panel of probes excludes regions known to be differentially methylated in blood cells.
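The exclusion of confounded regions described above can be sketched as a simple interval filter. This is a minimal illustration, not the disclosed panel design procedure: the `Region` tuple layout, the function names, and the coordinates are all hypothetical, and intervals are assumed to be half-open.

```python
from typing import List, Tuple

# (chromosome, start, end), half-open; a hypothetical layout for illustration.
Region = Tuple[str, int, int]


def overlaps(a: Region, b: Region) -> bool:
    """True when two half-open intervals on the same chromosome share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def exclude_confounded(candidates: List[Region], excluded: List[Region]) -> List[Region]:
    """Keep only candidate regions that overlap none of the excluded regions."""
    return [c for c in candidates if not any(overlaps(c, e) for e in excluded)]


# Illustrative coordinates only; these are not taken from Table 1 or Table 2.
candidates = [("chr1", 100, 300), ("chr1", 1000, 1200), ("chr2", 50, 250)]
blood_dmrs = [("chr1", 250, 400)]  # regions assumed to vary in blood cells
panel = exclude_confounded(candidates, blood_dmrs)
print(panel)
```

The same filter could be applied to any other list of regions associated with false discovery before probes are designed.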
  • adapters utilized for sequencing are resistant to conversion of nucleobases.
  • some adapters include methylated cytosine, which resists conversion via bisulfite treatment.
  • Method 100 can further convert (105) nucleobases, which can differentiate nucleobases that are methylated from nucleobases that are unmethylated. Any method for converting nucleobases can be utilized. In some embodiments, a chemical conversion is performed. In some embodiments, an enzymatic conversion is performed. Various methodologies can be utilized to convert methylated nucleobases, including (but not limited to) bisulfite treatment, TET2 oxidation and APOBEC3A conversion, and TET2 oxidation and pyridine borane treatment.
  • TET2 oxidation with pyridine borane treatment and TET2 oxidation with APOBEC3A conversion both provided higher mappability of reads than bisulfite treatment. It was further found that TET2 oxidation and APOBEC3A conversion provided the best unique molecule recovery. Accordingly, in some preferred embodiments, TET2 oxidation and APOBEC3A conversion is utilized to convert methylated nucleobases. For more details on nucleobase conversion inclusive of methods, data and results, see the Examples section herein.
  • nucleobases including (but not limited to) 5-methylcytosine (5-mC), 5-hydroxymethylcytosine (5-hmC), 5-formylcytosine (5-fC), and 5-carboxylcytosine (5-caC), as dependent on the biomarkers of the condition. Sequencing and detection methods of the various modifications can be utilized, as has been reported, see, e.g., R. P. Darst, Curr Protoc Mol Biol. 2010 Jul;Chapter 7: Unit 7.9.1 -17; C. X. Song, et al., Nat Biotechnol.
  • sequencing platforms can detect methylation without conversion, such as nanopore sequencing techniques. When a sequencing technique can detect methylation directly, the nucleobase conversion step can be excluded. Examples of sequencing platforms that can detect methylation include (but are not limited to) Oxford Nanopore Technologies PromethION, MinION, and GridION sequencing platforms (Oxford, UK) and Pacific Biosciences’ Single Molecule, Real-Time (SMRT) sequencing platform (Menlo Park, CA).
  • Method 100 further sequences (107) the generated and converted library to detect methylation status of differentially methylated regions.
  • Any appropriate high-throughput sequencing technique can be utilized that can detect converted nucleobases.
  • High-throughput sequencing techniques include (but are not limited to) 454 sequencing, Illumina sequencing, SOLiD sequencing, Ion Torrent sequencing, single-read sequencing, paired-end sequencing, etc.
  • a high-throughput sequencing method can simultaneously sequence at least about 10,000, at least about 100,000, at least about 1 million, at least about 10 million, at least about 100 million, or at least about 1 billion cfNA molecules.
  • Several embodiments are directed towards utilizing a computational model to detect the presence of a condition utilizing methyl sequencing data of a cfNA sample.
  • Interpretation of methyl sequencing data results is difficult in cases in which the differentiation of methylation in regions is not readily appreciated.
  • detection of Stage I lung cancer via cfDNA is a difficult task due to the low amount of cfDNA markers present in a liquid biopsy.
  • featurizing methyl sequencing results and utilizing these features within a computational classifier improved the ability to detect Stage I cancers in plasma samples.
  • a clinical intervention can be performed on the individual.
  • Method 200 can begin by obtaining (201) a targeted methyl sequencing result of a cfNA sample.
  • the sequencing result is obtained via high-throughput sequencing such that methylation of a high number of cfNA molecules is determined.
  • the sequencing can be targeted to particular regions of a genome, especially regions that are known to be differentially methylated in a condition.
  • lung cancer is to be assessed and at least some of the genomic regions identified in Table 2 are targeted for methyl sequencing.
  • a cancer is sequenced to identify genomic regions that are differentially methylated such that subsequent cfNA sequencing is targeted to those regions.
  • sequencing is performed as described in reference to Fig. 1.
  • the targeted sequence result covers at least 10 genomic regions, at least 20 genomic regions, at least 50 genomic regions, at least 100 genomic regions, at least 150 genomic regions, at least 200 genomic regions, at least 250 genomic regions, at least 500 genomic regions, at least 750 genomic regions, or at least 1000 genomic regions.
  • the genomic regions can comprise regions associated with a condition (e.g., cancer), regions correlated with factors associated with a condition, and/or control regions.
  • Method 200 further can assess (203) methylation of cfNA molecules.
  • Assessment of methylation can be done in a variety of ways, but generally assessment can comprise an amount of cfNA molecules that have at least one methylated nucleobase and/or an amount of methylated nucleobases per cfNA molecule.
  • at least 100 cfNA molecules of a sample are assessed, at least 1000 cfNA molecules of a sample are assessed, at least 10,000 cfNA molecules of a sample are assessed, at least 100,000 cfNA molecules of a sample are assessed, at least 1,000,000 molecules of a sample are assessed, or at least 10,000,000 molecules of a sample are assessed.
  • methylation is assessed by computing a methylation metric that indicates an amount of methylation of cfNA molecules at a particular locus, considering the cfNA molecules that align to the locus.
  • a set of cfNA molecules that align to the region is utilized.
  • the region comprises a CpG island.
  • each cfNA molecule to be used in the assessment comprises at least 2 CpGs, at least 4 CpGs, at least 6 CpGs, at least 8 CpGs, at least 10 CpGs, at least 15 CpGs, or at least 20 CpGs in the region.
  • a methylated molecule fraction (MMF) is computed.
  • the MMF is the number of cfNA molecules having an amount of methylation in that region over a threshold, divided by the total number of cfNA molecules assessed for that region: MMF = (number of cfNA molecules in the region with methylation over the threshold) / (total number of cfNA molecules assessed in the region).
  • the methylation threshold is at least 20% CpGs methylated, at least 30% CpGs methylated, at least 40% CpGs methylated, at least 50% CpGs methylated, at least 60% CpGs methylated, at least 70% CpGs methylated, or at least 90% CpGs methylated.
  • the MMF can be computed for any or all regions sequenced in association with a particular trait (e.g., any or all regions differentially methylated in cancer).
  • an MMF is computed for at least 50% of regions, at least 60% of regions, at least 70% of regions, at least 80% of regions, at least 90% of regions, or all regions sequenced in association with a particular trait.
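The MMF computation described above can be sketched as follows. This is a minimal sketch under stated assumptions rather than the disclosed implementation: each molecule is represented as a list of per-CpG calls (1 = methylated, 0 = unmethylated), the function names are hypothetical, and a molecule is counted as methylated when its methylated CpG fraction is at least the threshold. The defaults of 4 CpGs and 50% are arbitrary picks from the ranges listed above.

```python
from typing import List, Optional, Sequence


def molecule_is_methylated(calls: Sequence[int], threshold: float) -> bool:
    """True when the fraction of methylated CpGs (1 = methylated) meets the threshold."""
    return sum(calls) / len(calls) >= threshold


def methylated_molecule_fraction(
    molecules: List[Sequence[int]],
    min_cpgs: int = 4,
    threshold: float = 0.5,
) -> Optional[float]:
    """MMF for one region; None when no molecule covers enough CpGs."""
    # Only molecules covering at least `min_cpgs` CpGs in the region are assessed.
    eligible = [m for m in molecules if len(m) >= min_cpgs]
    if not eligible:
        return None
    methylated = sum(molecule_is_methylated(m, threshold) for m in eligible)
    return methylated / len(eligible)


# Toy per-CpG calls for four molecules aligned to one region (not real data).
region_molecules = [[1, 1, 1, 1], [1, 0, 0, 0], [1, 1, 0, 1, 1], [1, 1]]
mmf = methylated_molecule_fraction(region_molecules, min_cpgs=4, threshold=0.7)
print(mmf)
```

In this toy region the two-CpG molecule is skipped, and two of the three remaining molecules pass the 70% threshold.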
  • Method 200 optionally computes (205) a sample summary statistic. Having determined a methylation metric for a number of regions, an overall sample summary statistic can be generated by combining methylation metrics for a plurality of regions. In some embodiments, a sample summary statistic combines methylation metrics for all regions associated with a trait that were sequenced. In some embodiments, a sample summary statistic combines methylation metrics for a subset of regions associated with a trait that were sequenced. For example, a number of the top informative regions can be combined to yield a sample summary statistic. A top informative region is a region that has been determined to have a greater association with or more predictive ability of a condition (e.g., cancer) when compared to the other regions assessed.
  • a sample summary statistic can be determined by combining a methylation metric for a number of regions that have the most association with or most predictive ability of a condition.
  • at least 10% of regions assessed are combined, at least 20% of regions assessed are combined, at least 30% of regions assessed are combined, at least 40% of regions assessed are combined, at least 50% of regions assessed are combined, at least 60% of regions assessed are combined, at least 70% of regions assessed are combined, at least 80% of regions assessed are combined, at least 90% of regions assessed are combined, or all regions assessed are combined to yield a sample summary statistic.
  • A sample summary statistic can be computed in a number of different ways. In some embodiments, any statistic that can combine a methylation metric for a plurality of regions can be utilized. In some embodiments, a percentile of a distribution of region MMFs for a sample is determined, where the percentile is determined by comparing to a cohort of samples. For example, the percentile of the number of regions with nonzero MMFs is determined, as compared to a cohort. In another example, the percentile of the number of regions where MMF is zero is determined, as compared to a cohort.
  • the cfNA molecule with the most methylated CpGs is further considered; the median methylated CpG amount is determined; an X percentile methylated CpG amount is determined, where X is any percentage between 1% and 100%; the skewness of methylated CpG amount is determined.
  • a sample summary statistic is normalized based on length of cfNA molecule. As would be understood from these examples, many other sample statistics can be determined.
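One of the summary statistics mentioned above, the percentile of the number of regions with nonzero MMF relative to a cohort, might be sketched as below. The function names and toy values are hypothetical, and the percentile convention used here (percentage of cohort values at or below the sample value) is an assumption, since the text does not fix one.

```python
from typing import Dict, List


def nonzero_region_count(region_mmf: Dict[str, float]) -> int:
    """Number of regions whose MMF is greater than zero."""
    return sum(1 for value in region_mmf.values() if value > 0)


def percentile_vs_cohort(value: float, cohort_values: List[float]) -> float:
    """Percentage of cohort values that are at or below the sample value."""
    if not cohort_values:
        raise ValueError("cohort must contain at least one sample")
    at_or_below = sum(1 for v in cohort_values if v <= value)
    return 100.0 * at_or_below / len(cohort_values)


# Toy values only: per-region MMFs for one sample and nonzero-region counts
# for a hypothetical reference cohort of five samples.
sample_mmf = {"region_1": 0.0, "region_2": 0.02, "region_3": 0.10}
cohort_counts = [0, 1, 1, 2, 3]
count = nonzero_region_count(sample_mmf)
print(count, percentile_vs_cohort(count, cohort_counts))
```

The same percentile helper could be reused for the zero-MMF region count or for any other per-sample statistic compared against a cohort.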
  • Method 200 can further enter (207) one or more methylation features into a trained computational model to assess a sample, where the result indicates whether the sample is associated with a condition (e.g., cancer).
  • Methylation features comprise computed assessments of methylation of cfNAs.
  • a feature can be based on methylation of a particular region associated with the condition to be assessed.
  • a feature can be based on a summary sample statistic of a plurality of regions associated with the condition to be assessed.
  • a methylation assessment of a particular region is utilized as a feature.
  • a computed methylation metric for a particular region can be utilized as a feature.
  • each feature of a plurality of features are based on a particular region, where the features utilized within the model are based on the predictive ability of the feature or the association of methylation of its region with the condition.
  • a sample summary statistic that combines a plurality of methylation assessments, each methylation assessment being of a particular region, is utilized as a feature.
  • any sample statistic as computed in step 205 can be used as a feature.
  • ML models that can be implemented include (but are not limited to) regression-based and/or classification-based models.
  • regression-based models provide a score that indicates a likelihood of the cancer whereas a classification-based model classifies a sample as likely to include or to not include cancer.
  • Regression-based models include (but are not limited to) LASSO regression, ridge regression, k-nearest neighbors, elastic net, least angle regression (LAR), and random forest regression.
  • Classification-based models include (but are not limited to) support vector machines (SVMs), decision trees, random forests, and naive Bayes.
  • a regression-based model or a classification-based model is regularized, while in various embodiments, a regression-based model or a classification-based model is gradient boosted.
  • Computational models can be trained using a cohort of cfNA samples. For example, to train a classifier for lung cancer, cfNA samples derived from a cohort can be utilized.
  • a leave-one-out cross validation (LOOCV) machine-learning approach is used to build and train a model. In each LOOCV round, the model is trained on all samples except for one sample that is left out, and model performance is evaluated on the left-out sample. LOOCV training is attractive because it helps guard against overfitting and provides a more accurate assessment of the overall stability of the model.
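The LOOCV loop described above can be sketched in a few lines. To keep the example self-contained, a nearest-centroid classifier stands in for the regression-based and classification-based models named in the text; the classifier choice, function names, and feature values are all illustrative assumptions.

```python
from typing import Dict, List


def fit_centroids(X: List[List[float]], y: List[int]) -> Dict[int, List[float]]:
    """Stand-in model: the mean feature vector of each class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids: Dict[int, List[float]], x: List[float]) -> int:
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])),
    )


def loocv_accuracy(X: List[List[float]], y: List[int]) -> float:
    """Train on all samples but one, test on the held-out sample, and repeat."""
    correct = 0
    for i in range(len(X)):
        model = fit_centroids(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        correct += predict(model, X[i]) == y[i]
    return correct / len(X)


# Toy features per sample (e.g., a region MMF and a summary statistic); 1 = cancer.
X = [[0.0, 1.0], [0.1, 0.0], [0.05, 2.0], [0.6, 5.0], [0.7, 6.0], [0.8, 4.0]]
y = [0, 0, 0, 1, 1, 1]
print(loocv_accuracy(X, y))
```

Because each held-out sample never contributes to the model that scores it, the averaged result estimates performance on unseen samples.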
  • Method 200 can optionally perform (209) a clinical intervention when the ML model indicates that the cfDNA sample contains cfDNA molecules derived from a cancer.
  • Clinical interventions can include further clinical evaluation of or administration of a treatment to an individual.
  • a clinical procedure is performed, such as (for example) a blood test, genetic test, medical imaging, physical exam, a tumor biopsy, or any combination thereof.
  • diagnostics are performed to determine the particular stage of cancer.
  • a treatment is performed, such as (for example) chemotherapy, radiotherapy, chemoradiotherapy, immunotherapy, hormone therapy, targeted drug therapy, surgery, transplant, transfusion, medical surveillance, or any combination thereof.
  • an individual is assessed and/or treated by a medical professional, such as a doctor, physician, physician’s assistant, nurse practitioner, nurse, caretaker, dietician, or similar.
  • non-limiting examples of a treatment can include chemotherapy, radiotherapy, chemoradiotherapy, immunotherapy, adoptive cell therapy (e.g., chimeric antigen receptor (CAR) T cell therapy, CAR NK cell therapy, modified T cell receptor (TCR) T cell therapy, etc.), hormone therapy, targeted drug therapy, surgery, transplant, transfusion, or medical surveillance.
  • a treatment for a condition of a subject can comprise administering one or more therapeutic agents to the subject.
  • the one or more therapeutic drugs can be administered to the subject by one or more of the following: orally, intraperitoneally, intravenously, intraarterially, transdermally, intramuscularly, liposomally, via local delivery by catheter or stent, subcutaneously, intraadiposally, and intrathecally.
  • a computational processing system to assess differentially methylated regions in cfNA to detect a condition in accordance with the various methods of the disclosure typically utilizes a processing system including one or more of a CPU, GPU and/or neural processing engine.
  • methyl sequencing results of cfNA are processed and assessed to detect a condition using a computational processing system.
  • the computational processing system is housed within a computing device associated with a sequencer.
  • the computational processing system is housed separately from the sequencer and receives the sequencing results.
  • the computational processing system is implemented using a software application on a computing device such as (but not limited to) a mobile phone, a tablet computer, a wearable device (e.g., watch), and/or a portable computer.
  • the computational processing system 300 includes a processor system 302, an I/O interface 304, and a memory system 306.
  • the processor system 302, I/O interface 304, and memory system 306 can be implemented using any of a variety of components appropriate to the requirements of specific applications including (but not limited to) CPUs, GPUs, ISPs, DSPs, wireless modems (e.g., WiFi, Bluetooth modems), serial interfaces, depth sensors, IMUs, pressure sensors, ultrasonic sensors, volatile memory (e.g., DRAM, SRAM) and/or nonvolatile memory (e.g., NAND Flash).
  • the memory system is capable of storing sequencing data 308, an application for feature generation 310, and a computational model to detect a condition 312.
  • the application can be downloaded and/or stored in non-volatile memory.
  • the application for feature generation and/or computational model to detect a condition is capable of configuring the processing system to implement computational processes including (but not limited to) the computational processes described above and/or combinations and/or modified versions of the computational processes described above.
  • the application for feature generation 310 utilizes the sequence data 308 to generate features based on differentially methylated regions.
  • the computational model to detect a condition utilizes the generated features to determine whether a cfNA sample is derived from an individual with a condition such as cancer. Intermediate data and/or final results can be temporarily stored in the memory system during processing and/or saved for use in downstream applications.
  • computational processes and/or other processes utilized in the provision of assessing differentially methylated regions of cfNA in accordance with various embodiments of the disclosure can be implemented on any of a variety of processing devices including combinations of processing devices. Accordingly, computational devices in accordance with the disclosure should be understood as not limited to specific computational processing systems, but can be implemented using any of the combinations of systems described herein and/or modified versions of the systems described herein to perform the processes, combinations of processes, and/or modified versions of the processes described herein.
  • CAPP-Seq Cancer Personalized Profiling by deep Sequencing
  • the method developed for detecting tumor variants in cfDNA in a disease-specific manner served as a template from which to design a novel cfDNA methylation detection method. While most existing cfDNA methylation methods focus on broad genomic coverage, the current methodology aimed to target a relatively small portion of the genome. This would enable high depth of coverage at relatively low sequencing costs, which would allow incorporation of barcoding and error suppression as in CAPP-Seq. Furthermore, the assay was kept disease-specific with the understanding that initial blood-based detection tests would best serve a high-risk population. To achieve these goals, a targeted sequencing panel was designed that would cover regions of informative methylation status for lung cancer detection.
  • the top 500 most differentially methylated CpGs (DMCs) as ranked by the absolute difference in average beta values (Δβ) between LUAD and normal tissue were selected. Filtering this list to probes with low background in other normal tissues including brain, liver, and skin yielded 412 DMCs for LUAD to include in the panel design (Fig. 4B). Of these, 379 were hypermethylated in LUAD compared to blood and normal lung, while 33 were hypomethylated. A similar analysis for lung squamous cell carcinoma (LUSC) selected 370 differentially methylated CpGs (Figs. 4C and 4D).
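The ranking-and-filtering step described above could be sketched as follows. The function name, the dictionary layout, and the 0.2 background cutoff are hypothetical. How "low background" was defined for hypomethylated sites is not stated, so the sketch assumes a mirrored rule (other normal tissues must be highly methylated at sites that lose methylation in tumor).

```python
from typing import Dict, List


def select_dmcs(
    tumor: Dict[str, float],
    normal: Dict[str, float],
    other_tissues: Dict[str, List[float]],
    top_n: int = 500,
    max_background: float = 0.2,  # assumed cutoff, not from the source
) -> List[str]:
    """Rank CpGs by |mean tumor beta - mean normal beta|, keep the top N,
    then drop CpGs with tumor-like signal in other normal tissues."""
    delta = {cpg: tumor[cpg] - normal[cpg] for cpg in tumor if cpg in normal}
    ranked = sorted(delta, key=lambda c: abs(delta[c]), reverse=True)[:top_n]
    kept = []
    for cpg in ranked:
        betas = other_tissues.get(cpg, [])
        if delta[cpg] > 0:  # hypermethylated in tumor: background must be low
            clean = all(b <= max_background for b in betas)
        else:  # hypomethylated in tumor: background must be high (assumption)
            clean = all(b >= 1 - max_background for b in betas)
        if clean:
            kept.append(cpg)
    return kept


# Toy mean beta values; the CpG identifiers are placeholders.
tumor = {"cg_a": 0.95, "cg_b": 0.8, "cg_c": 0.1, "cg_d": 0.5}
normal = {"cg_a": 0.1, "cg_b": 0.1, "cg_c": 0.9, "cg_d": 0.45}
other = {"cg_a": [0.05, 0.1], "cg_b": [0.6, 0.1], "cg_c": [0.95, 0.9], "cg_d": [0.1]}
selected = select_dmcs(tumor, normal, other, top_n=3)
print(selected)
```

Here `cg_d` falls outside the top three by absolute difference, and `cg_b` is dropped because one background tissue is methylated at that site.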
  • TP73 was identified as hypomethylated in LUSC as previously observed (A. Daskalos, et al., Cancer Lett. 2011 Jan 1;300(1):79-86; the disclosure of which is incorporated herein by reference).
  • 450k array methylation data from NSCLC cell lines were downloaded, and the cell lines were found to show methylation states similar to those of TCGA primary tumors at the selected sites (Fig. 5A) (K. Walter, et al., Clin Cancer Res. 2012 Apr 15;18(8):2360-73; the disclosure of which is incorporated herein by reference).
  • a subset of major cancer types were selected from TCGA — LUAD, LUSC, bladder (BLCA), breast (BRCA), colorectal (COADREAD), B-cell lymphoma (DLBCL), hepatocellular carcinoma (LIHC), pancreas (PAAD), and prostate (PRAD) — and a differential methylation analysis was performed to identify cancer type-specific methylation signals.
  • the targeted sequencing panel was designed to detect the presence of lung tumor-derived DNA in a background of healthy cfDNA.
  • admixtures were created by spiking sheared NCI-H441 lung cancer cell line DNA into cfDNA from a healthy donor at serially decreasing tumor fractions: 100% tumor (pure cell line), 5% tumor, 0.5% tumor, 0.05% tumor, 0.005% tumor, and 0% tumor (pure cfDNA).
  • Sequencing libraries from each tumor fraction were generated in triplicate using a preliminary methylation-CAPP-Seq (mCAPP-Seq) protocol. In this protocol, conversion-resistant methylated Y-adapters were ligated to the DNA before performing bisulfite conversion and finally PCR.
  • DNA methylation states at neighboring CpGs are non-random and highly correlated. This observation has been leveraged for the deconvolution of plasma cell-free DNA, with the hypothesis that fragment-level CpG patterns would exhibit higher specificity than standard single-CpG beta values. It was therefore sought to develop a method that similarly utilized fragment-level methylation states. To do this, a threshold was initially set to require molecules to cover at least 10 CpGs. Focusing on hyper-DMRs, it was then required that at least 80% of those CpGs be methylated in a fragment for the fragment to be considered highly methylated (and thus likely tumor-derived). For each DMR region, a methylation ‘allele fraction’ (AF) was calculated as the proportion of fragments in the region with highly methylated states.
  • the controlled nature of the tumor fraction in an admixture experiment is useful for estimating the true limit of detection (LOD) of the assay, providing an early sense of the assay’s performance in clinical samples. It was found that the sum of DMRs above background was significantly elevated in tumor fractions as low as 0.5% compared to 0% tumor (pure cfDNA) (Fig. 7C, P < 0.01, Student’s t-test). It was observed that there was a log-linear relationship between spike level and DMRs above background between 0.005% and 0.5% AF. Above that level, DMRs above background became saturated, and below that, 0% AF was equivalent to 0.005% AF. The linear portion of the curve was focused on to estimate the LOD.
  • Stage I NSCLCs have ctDNA levels below 0.01% in plasma. This suggests that a good portion of the earliest stage tumors might remain undetectable with the mCAPP-Seq assay. However, the orthogonal nature of methylation signal (relative to SNV signal) might still enable detection of cases missed by CLiP due to low mutation count or low genome-wide copy number alteration.
  • EM-Seq Enzymatic Methyl-seq
  • TAPS TET-assisted pyridine borane sequencing
  • EM-Seq employs the TET2 enzyme to first oxidize 5mC to 5caC, protecting it from further conversion by APOBEC3A. Subsequently, APOBEC3A deaminates C and 5mC (but not 5caC), converting them to Ts.
  • unmethylated cytosines are converted to thymines and methylated cytosines are protected from conversion, resulting in the same sequence as bisulfite would produce.
  • TAPS also begins with a TET-mediated oxidation of 5mC to 5caC.
  • TAPS then proceeds with a chemical treatment of pyridine borane to reduce those 5caCs to dihydrouracil (DHU), which then becomes T through PCR. Therefore, TAPS converts methyl-C to T, unlike bisulfite or EM-Seq, resulting in a different final sequence. By converting only methylated Cs, the final TAPS sequence would have higher complexity.
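The difference in final sequence between the conversion chemistries can be illustrated with a small in-silico sketch. It models only the net read-level outcome on one strand after PCR (bisulfite and EM-Seq read unmethylated C as T, TAPS reads methylated C as T), ignores incomplete conversion, and uses a hypothetical function name and toy sequence.

```python
from typing import Set


def convert(seq: str, methylated: Set[int], chemistry: str) -> str:
    """Return the sequence as read after conversion and PCR.

    `methylated` holds 0-based positions of methylated cytosines in `seq`.
    """
    out = []
    for i, base in enumerate(seq):
        if base != "C":
            out.append(base)
            continue
        is_methylated = i in methylated
        if chemistry in ("bisulfite", "em-seq"):
            # Unmethylated C is converted; methylated C is protected.
            out.append("C" if is_methylated else "T")
        elif chemistry == "taps":
            # Only methylated C is converted.
            out.append("T" if is_methylated else "C")
        else:
            raise ValueError(f"unknown chemistry: {chemistry}")
    return "".join(out)


seq = "ACGTCCGA"      # toy fragment
methylated = {1}      # the C at index 1 is methylated
print(convert(seq, methylated, "em-seq"))
print(convert(seq, methylated, "taps"))
```

In the toy fragment, EM-Seq changes two of the three cytosines while TAPS changes only one, which is the sense in which the TAPS read retains higher sequence complexity.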
  • the chemical treatment may also prove harsh on the DNA. All three conversion methods were compared and evaluated for their performance using cfDNA.
  • the main readouts of this experiment would be molecule recovery (measured as unique sequencing depth), conversion efficiency, and mapping rate.
  • Conversion efficiency can be measured with non-human or synthetic spike-in control DNA that is either fully unmethylated (for bisulfite and EM-Seq) or fully methylated (for TAPS).
  • Mapping rate can be measured with any reads distributed across the human genome.
  • unique molecule recovery would be best measured in high-depth sequencing, necessitating a targeted sequencing approach.
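As an illustration of the conversion-efficiency readout described above, the sketch below counts converted cytosines in reads from a fully unmethylated spike-in control (the bisulfite and EM-Seq case). It assumes gap-free reads already aligned base-for-base to the control reference, which is a simplification; the function name and sequences are hypothetical.

```python
from typing import List


def conversion_efficiency(reference: str, reads: List[str]) -> float:
    """Fraction of reference cytosines read as T in an unmethylated control.

    Each read is assumed to align to the reference with no gaps or offset.
    """
    converted = 0
    total = 0
    for read in reads:
        for ref_base, read_base in zip(reference, read):
            # Only C (unconverted) or T (converted) calls at reference Cs count.
            if ref_base == "C" and read_base in ("C", "T"):
                total += 1
                converted += read_base == "T"
    if total == 0:
        raise ValueError("no cytosine positions were covered")
    return converted / total


control_reference = "ACCGTC"              # toy control sequence
control_reads = ["ATTGTT", "ATCGTT"]      # second read has one unconverted C
print(conversion_efficiency(control_reference, control_reads))
```

For a fully methylated control processed with TAPS, the same counting applies, since complete conversion again means every control cytosine in a methylated context is read as T.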
  • a 44kb ‘miniselector’ was designed that covered the DMR regions, control regions, and a few tissue-specific regions from the larger sequencing panel.
  • methylation sequencing libraries were prepared from 3 healthy donor cfDNA samples using either bisulfite conversion, EM-Seq, TAPS, or no conversion; the libraries were captured with the appropriate bait set and the samples were subjected to sequencing. It was found that TAPS had a higher mapping rate than EM-Seq or bisulfite (Fig. 8A). However, after deduplication, EM-Seq showed a higher unique molecule recovery than either of the other two conversion methods despite DNA input being equal (Fig. 8B). Furthermore, EM-Seq demonstrated significantly higher conversion efficiency than bisulfite as measured by unmethylated lambda control DNA (Fig. 8C). EM-Seq was selected as the preferred conversion method for the mCAPP-Seq protocol moving forward.
  • cfDNA consists of short, nucleosome-bound DNA molecules that exhibit a multimodal size distribution.
  • genomic DNA can end up in the ‘cfDNA’ eluate after DNA isolation.
  • the fragment analyzer results can be used to scale the DNA input by the percent of fragments in the typical cfDNA size range (50-450bp) to better control the input of molecules that will become sequenceable library.
  • the input was not scaled in this manner, and thus the 5, 10, and 30ng inputs included any gDNA present.
  • Fragment analysis of the healthy cfDNA sample used as the denominator for the spike showed that in fact the sample contained only about 50% of molecules in the 50-450bp size range. Therefore, the adjusted inputs for the spike experiment were closer to 2.5ng, 5ng, and 15ng DNA.
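The input adjustment above is simple arithmetic and can be written out as a short worked example. The function names are hypothetical; the masses and the approximately 50% in-range fraction are the values stated in the text.

```python
def sequenceable_input_ng(total_ng: float, fraction_in_range: float) -> float:
    """Mass of DNA expected to fall in the cfDNA size range (50-450 bp)."""
    return total_ng * fraction_in_range


def required_total_ng(target_ng: float, fraction_in_range: float) -> float:
    """Total DNA mass needed so that the in-range portion equals the target."""
    return target_ng / fraction_in_range


# Nominal inputs from the spike experiment, with about 50% of molecules in range.
nominal_inputs = [5, 10, 30]
adjusted = [sequenceable_input_ng(ng, 0.5) for ng in nominal_inputs]
print(adjusted)  # adjusted inputs in ng

# Going the other way: total mass to load for 15 ng of in-range cfDNA.
print(required_total_ng(15, 0.5))
```

Scaling in the second direction is what allows a fixed amount of in-range cfDNA to be loaded regardless of how much genomic DNA is present in the eluate.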
  • a goal with mCAPP-Seq was to test its ability to detect the presence of lung cancer DNA in real patient plasma samples. It was planned to test its sensitivity and specificity for detection in an early-stage (Stage I-III) lung cancer cohort and risk-matched controls. First, though, it was desired to confirm again, with the optimized method, that the expected signal was observed in a high tumor-burden setting in which one could expect substantial ctDNA content. The method was applied to a pilot cohort of healthy control and Stage IV NSCLC patient samples.
  • mCAPP-Seq libraries were made with the optimized EM-Seq protocol from the cfDNA of 12 healthy controls and 12 Stage IV NSCLC patient samples, captured with the targeted panel, and sequenced to a target of 100 million paired reads per sample. After mapping and deduplication, methylation calls were extracted for every CpG and DMRs above background were calculated as described above. Confirming prior observations using bisulfite sequencing (Fig. 7D), AFs in the DMR regions were higher in Stage IV patients compared to controls (Fig. 10A). Consequently, DMRs above background were also significantly higher in patients (Fig. 10B). This test case again confirmed that this method would work in patient plasma, beyond the contrived setting of admixture samples. However, detection of early-stage, low-ctDNA cases would need to be assessed.
  • cfDNA yield was primarily determined by available plasma volume, which varied by center and collection protocol. Across all control and patient samples extracted, a median of 6.7 ml of plasma was available for extraction (range 1.7-14.0, Fig. 11A). cfDNA isolation from all samples yielded a median of 59.9 ng for controls and 87.5 ng for NSCLC patients (Fig. 11 B). All samples were analyzed for genomic DNA contamination with Agilent Fragment Analyzer, showing high cfDNA fractions (50-450bp) for most samples (Fig. 11 C).
  • Samples were first analyzed from the training cohort. Libraries were prepared from 97 risk-matched controls, 10 granuloma controls, and 117 early-stage NSCLC cfDNA samples, using a fixed input of 15 ng cfDNA in the 50-450bp size range. Samples were ligated to UMI-containing methylated duplex adapters and unmethylated cytosines were converted to uracils via EM-Seq before amplification with 11 cycles on PCR. Libraries were captured with the targeted sequencing panel designed specifically for NSCLC detection. After capture, libraries were sequencing to a target of 80 million paired- end reads (i.e. 40 million read pairs).
  • methylation ‘allele fractions’ - or the percent of fragments having highly methylated states in a region - were calculated for each DMR of interest.
  • AFs were used as features in a machine learning model.
  • a leave-one-out (LOO) framework was developed in which a LASSO logistic regression classifier was trained on all but one sample with DMR AFs as features. The held-out sample was scored with the trained model, and the process was repeated for each sample.
  • Lung cancer screening has the potential to significantly improve patient outcomes, and blood-based assays represent an attractive complement to imaging.
  • a classifier was developed for determining Lung Cancer Likelihood in Plasma (Lung-CLiP) from genetic features of cell-free DNA
  • mCAPP-Seq methyl-CAPP-Seq
  • NEB Enzymatic Methyl-seq (EM-seq) outperformed the gold standard bisulfite conversion both in preserving DNA integrity and properly converting unmethylated cytosines to thymines.
  • a framework was developed to identify highly methylated reads, and the resulting signal was found to be associated with tumor DNA fraction in the plasma in both cell line admixture experiments and primary NSCLC patient samples.
  • logistic regression classifiers were developed to distinguish healthy plasma samples from NSCLC samples via their methylation signals; model performance was found to be better than prior methods and biologically plausible.
  • Illumina Infinium HumanMethylation450 (450k array) data were downloaded in processed form (beta values) from TCGA via the UCSC Xena Browser or from published datasets via GEO (accession numbers GSE32148, GSE41169, GSE54670, GSE73745, GSE53045, GSE107205, GSE35069 for blood samples; GSE52401 and GSE66836 for normal lung). Beta values were transformed to M-values (P. Du, et al., BMC Bioinformatics. 2010 Nov 30;11:587; the disclosure of which is incorporated herein by reference) before using limma (M. E. Ritchie, et al., Nucleic Acids Res.
  • EM-Seq conversion was performed with the NEB EM-seq conversion module (NEB #E7125) according to manufacturer’s instructions, using formamide as the denaturing agent.
  • PCR was performed as described in “Post-conversion grafting PCR” for 7 cycles. Libraries were further amplified with universal PCR for 4-7 cycles (depending on DNA input, 4 cycles in most cases.)
  • Libraries were ligated to standard (unmethylated) partial Y-adapters with KAPA HyperPrep overnight at 4C and purified with a 1X SPRI bead cleanup. Libraries were converted using a TAPS protocol as previously described (Y. Liu, et al., Nat Biotechnol. 2019 Apr; 37(4):424-429; the disclosure of which is incorporated herein by reference) with the following modifications. After conversion, PCR was performed with KAPA HiFi Uracil+ master mix as described in “Post-conversion grafting PCR,” using custom dual-index primers for 7 cycles. After grafting PCR, libraries were further amplified with 6 cycles of universal PCR.
  • Primers contained dual-indexed sample barcodes. Post-PCR samples were cleaned up with a 1X SPRI bead cleanup and eluted in 24ul nuclease-free water.
  • Sequencing data was demultiplexed using in-house scripts and adapter read-through was trimmed with fastp (S. Chen, et al., Bioinformatics. 2018 Sep 1;34(17):i884-i890; the disclosure of which is incorporated herein by reference). Samples were then mapped to the human genome with Bismark (F. Krueger and S. R. Andrews, Bioinformatics. 2011 Jun 1;27(11):1571-2; the disclosure of which is incorporated herein by reference). PCR duplicates were removed with in-house scripts. Methylation states of all CpGs were extracted with Bismark. CpG states were summarized at the fragment- and region-level using custom python scripts.
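
The input-scaling idea in the list above (scaling DNA input by the percent of fragments in the 50-450bp cfDNA size range) can be illustrated with a short calculation. This is only a sketch: the function names are hypothetical and not part of the disclosure, and the 50% cfDNA fraction and 5, 10, and 30 ng inputs are taken from the spike experiment described above.

```python
def cfdna_adjusted_input(total_ng, cfdna_fraction):
    """Nanograms of the input expected to fall in the cfDNA size range.

    cfdna_fraction is the fraction (0 to 1) of fragments in the
    50-450bp window, e.g. as reported by a fragment analyzer.
    """
    if not 0.0 <= cfdna_fraction <= 1.0:
        raise ValueError("cfdna_fraction must be between 0 and 1")
    return total_ng * cfdna_fraction


def total_input_for_target(target_cfdna_ng, cfdna_fraction):
    """Total DNA to load so that target_cfdna_ng is in the cfDNA size range."""
    if not 0.0 < cfdna_fraction <= 1.0:
        raise ValueError("cfdna_fraction must be in (0, 1]")
    return target_cfdna_ng / cfdna_fraction


# Nominal inputs from the spike experiment, with ~50% of molecules in range.
nominal_inputs_ng = [5, 10, 30]
adjusted = [cfdna_adjusted_input(ng, 0.5) for ng in nominal_inputs_ng]
print(adjusted)  # [2.5, 5.0, 15.0]
```

Run in the other direction, `total_input_for_target(15, 0.5)` gives the total mass (30 ng) that would have to be loaded to reach a fixed 15 ng of in-range cfDNA.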

Landscapes

  • Chemical & Material Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Organic Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Engineering & Computer Science (AREA)
  • Immunology (AREA)
  • Pathology (AREA)
  • Analytical Chemistry (AREA)
  • Zoology (AREA)
  • Genetics & Genomics (AREA)
  • Wood Science & Technology (AREA)
  • Physics & Mathematics (AREA)
  • Biotechnology (AREA)
  • Microbiology (AREA)
  • Molecular Biology (AREA)
  • Hospice & Palliative Care (AREA)
  • Biophysics (AREA)
  • Oncology (AREA)
  • Biochemistry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

Systems and methods for sequencing of cell-free nucleic acids to assess a condition are provided. Generally, a cell-free nucleic acid sample is utilized to perform methyl sequencing targeted to particular regions associated with aberrant methylation. Methylation of cell-free nucleic acid molecules can be assessed based on the sequencing result. Various features can be derived from the methylation assessment and utilized within a computational model to assess the cell-free nucleic acid sample for a condition.

Description

SYSTEMS AND METHODS FOR CELL-FREE NUCLEIC ACIDS METHYLATION ASSESSMENT
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] This application claims priority to U.S. Provisional Application Ser. No. 63/386,557, entitled “Systems and Methods for Cell-Free DNA Methylation Assessment,” filed December 8, 2022, the disclosure of which is incorporated herein by reference in its entirety.
STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT
[0002] This invention was made with Government support under contract NSF-1656518 awarded by the National Science Foundation. The Government has certain rights in the invention.
TECHNICAL FIELD
[0003] The disclosure provides description for assessment of methylated cell-free nucleic acids for the purpose of detecting a condition.
BACKGROUND
[0004] Lung cancer screening remains an unmet clinical need. Image-based screening is the most common current screening method but analysis of circulating tumor DNA (ctDNA) represents a promising alternative and complement. Previous studies leveraged features of genetic alterations (e.g. single nucleotide variants and somatic copy number variations) found in cell-free DNA (cfDNA) to predict the Lung Cancer Likelihood in Plasma (Lung-CLiP) of a given sample (J. J. Chabon, et al., Nature. 2020 Apr;580(7802):245-251, the disclosure of which is incorporated herein by reference). However, in addition to this genetic information, cfDNA also reflects the epigenome of the cells from which it originates. This means that tumor-derived cfDNA molecules (ctDNA) contain cancer-associated epigenetic signals that might be additionally leveraged for detection of malignancies.
[0005] DNA methylation represents a promising tumor biomarker. A stable and heritable covalent modification to cytosines in CG dinucleotides (CpGs), DNA methylation is found at millions of loci across the genome and is known to contribute to the regulation of chromatin conformation and gene expression. Importantly, this means that methylation patterns vary greatly across cell types; methylomes are cell-type specific. In fact, these tissue-specific methylation signatures have been used to deconvolute bulk DNA methylation data, an exercise of particular relevance to cell-free DNA, which has been shown to comprise DNA from blood cells, liver, colon, and to a smaller extent other tissues.
SUMMARY
[0006] In some embodiments, a method is for sequencing for identification of condition-related differentially methylated regions in cell-free nucleic acids.
[0007] In some embodiments, the method comprises obtaining a cell-free nucleic acid sample comprising cell-free nucleic acid molecules.
[0008] In some embodiments, the method comprises extracting a subset of the cell-free nucleic acid molecules from the cell-free nucleic acid sample using a panel of nucleic acid probes designed to hybridize to regions that are known to be differentially methylated in a condition.
[0009] In some embodiments, the method comprises converting nucleobases of the subset of the cell-free nucleic acid molecules.
[0010] In some embodiments, the conversion of a nucleobase is indicative of a methylated state of that nucleobase.
[0011] In some embodiments, the method comprises sequencing the subset of the cell-free nucleic acid molecules via high-throughput sequencing.
[0012] In some embodiments, the condition is cancer.
[0013] In some embodiments, the cancer is non-small cell lung cancer.
[0014] In some embodiments, the regions that are known to be differentially methylated in a condition comprise at least 5% of the regions in Table 2.
[0015] In some embodiments, the regions that are known to be differentially methylated in a condition comprise at least 50% of the regions in Table 2.
[0016] In some embodiments, the panel of nucleic acid probes excludes regions known to be associated with false discovery.
[0017] In some embodiments, the panel of nucleic acid probes excludes regions known to be differentially methylated in blood cells.
[0018] In some embodiments, the method comprises extracting a subset of the cell-free nucleic acid molecules from the cell-free nucleic acid sample using a panel of nucleic acid probes designed to hybridize to regions known to be correlated with factors associated with the condition.
[0019] In some embodiments, the condition is cancer and the regions known to be correlated with factors associated with the condition comprise regions in Table 1.
[0020] In some embodiments, the method comprises extracting a subset of the cell-free nucleic acid molecules from the cell-free nucleic acid sample using a panel of nucleic acid probes designed to hybridize to regions known to be invariably hypermethylated or invariably hypomethylated.
[0021] In some embodiments, the regions known to be invariably hypermethylated or invariably hypomethylated comprise regions in Table 1.
[0022] In some embodiments, converting nucleobases of the subset of the cell-free nucleic acid molecules comprises at least one of the following: bisulfite treatment, TET2 oxidation and APOBEC3A conversion, or TET2 oxidation and pyridine borane treatment.
[0023] In some embodiments, converting nucleobases of the subset of the cell-free nucleic acid molecules comprises TET2 oxidation and APOBEC3A conversion.
[0024] In some embodiments, cell-free nucleic acid sample is derived from a collection of: blood, plasma, saliva, urine, stool, mucus, lymph, or another bodily fluid.
[0025] In some embodiments, cell-free nucleic acid sample comprises at least 100,000 nucleic acid molecules.
[0026] In some embodiments, the cell-free nucleic acid of the cell-free nucleic acid sample is cell-free DNA.
[0027] In some embodiments, the cell-free nucleic acid of the cell-free nucleic acid sample comprises at least 1 ng of cell-free DNA.
[0028] In some embodiments, the cell-free nucleic acid of the cell-free nucleic acid sample comprises at least 15 ng of cell-free DNA.
[0029] In some embodiments, the method comprises attaching adapters to the cell-free nucleic acid molecules.
[0030] In some embodiments, the adapters are resistant to nucleobase conversion as performed in the step converting nucleobases of the subset of the cell-free nucleic acid molecules.
[0031] In some embodiments, the panel of nucleic acid probes comprises at least 50 unique probes.
[0032] In some embodiments, a method is for sequencing that enhances detection of differentially methylated regions for assessing a condition of an individual.
[0033] In some embodiments, the method comprises preparing a cell-free nucleic acid sample for targeted methyl sequencing.
[0034] In some embodiments, the prepared cell-free nucleic acid sample is collected from an individual and comprises at least 100,000 cell-free nucleic acid molecules that are derived from a plurality of regions that are known to be differentially methylated in a condition.
[0035] In some embodiments, the method comprises sequencing the cell-free nucleic acid sample via a high-throughput sequencer to yield a sequencing result of the cell-free nucleic acid molecules that are derived from a plurality of regions that are known to be differentially methylated in a condition.
[0036] In some embodiments, the method comprises computing, using a computational device and the sequencing result, a methylation metric, wherein the methylation metric is computed for one region of the plurality of regions that are known to be differentially methylated in a condition.
[0037] In some embodiments, the methylation metric indicates an amount of methylation of cell-free nucleic acid molecules that align to the region.
[0038] In some embodiments, the method comprises entering, using the computational device, the computed methylation metric as a feature into a computational model to yield an assessment of the cell-free nucleic acid sample.
[0039] In some embodiments, the assessment indicates the individual has the condition.
[0040] In some embodiments, the method comprises aligning each cell-free nucleic acid molecule sequencing result across a region.
[0041] In some embodiments, the region is one of the plurality of regions that are differentially methylated in a condition.
[0042] In some embodiments, the method comprises for a set of cell-free nucleic acid molecules that align across the region, determining an amount of methylation for each cell-free nucleic acid molecule of the set.
[0043] In some embodiments, the methylation metric is based on at least one cell-free molecule of the set.
[0044] In some embodiments, the method comprises determining a number of cell-free nucleic acid molecules within the set that are methylated more than a threshold.
[0045] In some embodiments, the method comprises computing a methylated molecule fraction (MMF) for the region.
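[0046] In some embodiments, the MMF is computed as:
$$\mathrm{MMF} = \frac{\#\,\text{molecules} > \text{methylation threshold}}{\text{total molecules assessed}}$$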
[0047] In some embodiments, #molecules > methylation threshold is the number of cell-free nucleic acid molecules within the set that are determined to be methylated more than a threshold.
[0048] In some embodiments, total molecules assessed is the total number of cell-free nucleic acid molecules within the set.
[0049] In some embodiments, the threshold is 60% of CpGs methylated.
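
As an illustration of paragraphs [0044] to [0049], the following is a minimal sketch of a methylated molecule fraction calculation for a single region. It assumes a hypothetical in-memory representation in which each molecule is a list of per-CpG booleans (True for a methylated CpG); the function names are illustrative and the disclosure does not prescribe this data structure. "More than a threshold" is read here as a strict inequality.

```python
def fraction_methylated(cpg_states):
    """Fraction of CpGs on one molecule that are methylated."""
    if not cpg_states:
        raise ValueError("molecule has no CpG calls")
    return sum(cpg_states) / len(cpg_states)


def methylated_molecule_fraction(molecules, threshold=0.60):
    """MMF for one region.

    molecules: list of molecules aligned across the region, each a list of
               booleans (True = methylated CpG). Hypothetical representation.
    threshold: a molecule counts as methylated when its fraction of
               methylated CpGs is strictly greater than this value.
    Returns None when no molecules were assessed.
    """
    if not molecules:
        return None
    n_above = sum(1 for m in molecules if fraction_methylated(m) > threshold)
    return n_above / len(molecules)


# Toy region with four molecules (made-up values for illustration only).
region = [
    [True, True, True, False, True],      # 80% methylated
    [False, False, True, False, False],   # 20% methylated
    [True, True, True, True, True],       # 100% methylated
    [False, False, False, False, False],  # 0% methylated
]
mmf = methylated_molecule_fraction(region)
print(mmf)  # 0.5
```

With a strict inequality, a molecule at exactly 60% methylated CpGs is not counted; whether the boundary is inclusive is an assumption of this sketch.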
[0050] In some embodiments, computing the methylation metric for a region further comprises identifying, within the set of cell-free nucleic acid molecules that align across the region, the cell-free nucleic acid molecule that is most methylated.
[0051] In some embodiments, the methylation metric computed is an amount of methylation of the cell-free nucleic acid molecule that is most methylated.
[0052] In some embodiments, each cell-free nucleic acid molecule of the set of cell-free nucleic acid molecules that align across the region has a number of CpGs greater than a threshold.
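
The most-methylated-molecule metric of paragraphs [0050] to [0052] can be sketched in the same way. The per-CpG boolean representation, the function name, and the `min_cpgs` parameter are assumptions made for illustration; the example values are made up.

```python
def max_molecule_methylation(molecules, min_cpgs):
    """Methylation level of the most methylated eligible molecule in a region.

    molecules: list of molecules, each a list of booleans (True = methylated
               CpG). Hypothetical representation.
    min_cpgs:  only molecules with more than this many CpGs are considered.
    Returns None when no molecule passes the CpG-count filter.
    """
    eligible = [m for m in molecules if len(m) > min_cpgs]
    if not eligible:
        return None
    return max(sum(m) / len(m) for m in eligible)


# Toy region: the first molecule is fully methylated but has too few CpGs.
region = [
    [True, True],                        # 2 CpGs, excluded by the filter
    [True, True, True, False],           # 4 CpGs, 75% methylated
    [True, False, False, False, False],  # 5 CpGs, 20% methylated
]
best = max_molecule_methylation(region, min_cpgs=3)
print(best)  # 0.75
```

The CpG-count filter keeps a short molecule with one or two methylated CpGs from dominating the metric.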
[0053] In some embodiments, the method comprises computing, using the computational device and the sequencing result, a methylation metric for each region of at least fifty percent of the plurality of regions that are known to be differentially methylated in a condition.
[0054] In some embodiments, the method comprises entering, using the computational device, each computed methylation metric as features into the computational model to yield an assessment of the cell-free nucleic acid sample.
[0055] In some embodiments, the method comprises computing, using the computational device and the sequencing result, a methylation metric for each region of the plurality of regions that are known to be differentially methylated in a condition.
[0056] In some embodiments, the method comprises entering, using the computational device, each computed methylation metric as features into the computational model to yield an assessment of the cell-free nucleic acid sample.
[0057] In some embodiments, the method comprises computing, using the computational device and the sequencing result, a plurality of methylation metrics, wherein each methylation metric is computed for one region of the plurality of regions that are known to be differentially methylated in a condition.
[0058] In some embodiments, the method comprises computing, using the computational device and the plurality of methylation metrics, a sample summary statistic that combines the plurality of methylation metrics.
[0059] In some embodiments, the method comprises entering, using the computational device, the sample summary statistic as a feature into a computational model to yield the assessment of the cell-free nucleic acid sample.
[0060] In some embodiments, the sample summary statistic is a percentile of a number of regions with nonzero MMFs.
[0061] In some embodiments, the sample summary statistic is a percentile of a number of regions where MMF is zero.
[0062] In some embodiments, the sample summary statistic is a percentile of a number of regions where MMF is greater than a threshold.
[0063] In some embodiments, the sample summary statistic is a percentile of the amount of methylation of the cell-free nucleic acid molecule that is most methylated.
[0064] In some embodiments, the sample summary statistic is a median of the amount of methylation of the cell-free nucleic acid molecule that is most methylated.
[0065] In some embodiments, the sample summary statistic is a skewness of the amount of methylation of the cell-free nucleic acid molecule that is most methylated.
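
One possible way to collapse per-region metrics into sample-level summary features, in the spirit of paragraphs [0058] to [0065], is sketched below. The choice of linear-interpolation percentiles and population skewness, the function names, and the toy values are all assumptions of this sketch; the disclosure does not specify these estimators.

```python
import statistics


def percentile(values, q):
    """Percentile with linear interpolation between order statistics (q in 0..100)."""
    xs = sorted(values)
    if not xs:
        raise ValueError("no values")
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def skewness(values):
    """Population skewness: third central moment divided by variance**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return 0.0 if m2 == 0 else m3 / m2 ** 1.5


def summarize_sample(mmfs, max_methylation, mmf_threshold):
    """Combine per-region metrics into a dictionary of sample-level features."""
    return {
        "n_regions_nonzero_mmf": sum(1 for v in mmfs if v > 0),
        "n_regions_zero_mmf": sum(1 for v in mmfs if v == 0),
        "n_regions_mmf_above_threshold": sum(1 for v in mmfs if v > mmf_threshold),
        "max_meth_median": statistics.median(max_methylation),
        "max_meth_p90": percentile(max_methylation, 90),
        "max_meth_skewness": skewness(max_methylation),
    }


# Made-up per-region values for one sample (four regions).
features = summarize_sample(
    mmfs=[0.0, 0.0, 0.1, 0.5],
    max_methylation=[0.2, 0.4, 0.6, 1.0],
    mmf_threshold=0.25,
)
print(features)
```

Each entry of the resulting dictionary could then serve as one feature in a computational model.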
[0066] In some embodiments, the plurality of regions that are known to be differentially methylated in a condition comprises at least 10 genomic regions associated with a condition.
[0067] In some embodiments, the plurality of regions that are known to be differentially methylated in a condition comprises at least 50 genomic regions associated with a condition.
[0068] In some embodiments, the condition is a cancer.
BRIEF DESCRIPTION OF THE DRAWINGS
[0069] The description and claims will be more fully understood with reference to the following figures and data graphs, which are presented as exemplary embodiments and should not be construed as a complete recitation of the scope of the disclosure.
[0070] Figure 1 provides an example of a method to perform sequencing of methylated cell-free nucleic acid samples.
[0071] Figure 2 provides an example of a method to classify a cell-free nucleic acid sample.
[0072] Figure 3 provides an example of a computational processing system for classification of cell-free nucleic acid samples.
[0073] Figures 4A to 4G provide differential methylation analysis to select CpGs for targeted NSCLC panel design. Figure 4A provides a volcano plot of average methylation difference (Δβ) vs. Benjamini-Hochberg adjusted limma P-value for the differential methylation analysis between lung adenocarcinoma (LUAD) and normal lung. Each point is a CpG in the Illumina 450k array; the 412 points highlighted in green were selected as differentially methylated CpGs (DMCs) for LUAD. Figure 4B provides a heatmap showing methylation states of DMCs selected for LUAD in TCGA-LUAD, blood, and normal lung. Columns are samples, rows are individual CpGs, and heat is methylation level (β value). Figure 4C provides a volcano plot of average methylation difference (Δβ) vs. Benjamini-Hochberg adjusted limma P-value for the differential methylation analysis between lung squamous cell carcinoma (LUSC) and normal lung. Each point is a CpG in the Illumina 450k array; the 370 points highlighted in green were selected as differentially methylated CpGs (DMCs) for LUSC. Figure 4D provides a heatmap showing methylation states of DMCs selected for LUSC in TCGA-LUSC, blood, and normal lung. Columns are samples, rows are individual CpGs, and heat is methylation level (β value). Figure 4E provides data showing the relationship between LUAD-Normal Lung Δβ and LUSC-Normal Lung Δβ for all selected DMCs. Points are colored by whether they were picked as DMCs for LUAD, LUSC, or both (‘LUAD; LUSC’). Figure 4F provides a bar plot of the CpG density annotations for the 651 selected DMCs. Figure 4G provides a bar plot of gene context annotations for the 651 selected DMCs. The total exceeds 651 due to several CpGs having annotations relative to multiple genes.
[0074] Figures 5A to 5D provide validation data of selector DMRs and selection of additional markers. Figure 5A provides a heatmap of 651 selected NSCLC DMRs in NSCLC cell lines, NSCLC primary tumors from TCGA (LUAD & LUSC), normal lung and blood. Figure 5B provides a heatmap of 651 selected NSCLC DMRs in additional TCGA cancer types. BLCA, bladder; BRCA, breast; COADREAD, colorectal; GBM, glioblastoma; PAAD, pancreas; PRAD, prostate. Figure 5C provides a heatmap of additional CpGs selected to distinguish NSCLC from other cancer types. Figure 5D provides a heatmap showing hypermethylated and hypomethylated control regions. For all heatmaps within Figs. 5A to 5D, columns are samples, CpGs are rows, and heat is methylation level (β value).
[0075] Figures 6A and 6B provide data showing methylation levels in NCI-H441 cell line admixtures. Figure 6A provides a heatmap of NSCLC-DMRs (rows) in cell line admixture samples (columns) of NCI-H441 cell line DNA mixed into healthy control cfDNA. Libraries were prepared using a preliminary mCAPP-Seq protocol with bisulfite conversion and captured with the targeted panel. Each spike level was prepared in triplicate. Figure 6B provides a violin plot of average methylation levels in hyper-DMRs across all three replicates for each cell line spike level. ****, P < 0.0001, Student’s T-test.
[0076] Figures 7A to 7E provide data on DMRs above background as a measure of ctDNA content. Figure 7A provides distribution data of methylated allele fractions (AFs) in hypermethylated NSCLC DMRs in cell line spike admixtures (n = 3 technical replicates each), healthy control cfDNA (n = 12), stage IV patient cfDNA (n = 12), and primary tumor tissue samples (n = 5). Libraries were prepared with a preliminary mCAPP-Seq protocol using bisulfite conversion and captured with the targeted sequencing panel. Significance denoted for cell line spikes reflects comparison to spike-0%. Figure 7B provides a heatmap of methylated AFs in NSCLC hyper-DMRs in the same samples as in Fig. 7A. Figure 7C depicts DMRs above background in NCI-H441 cell line admixture samples. Figure 7D depicts DMRs above background in 12 control and 12 Stage IV patient cfDNA libraries prepared with bisulfite sequencing. Figure 7E provides a linear fit for estimating the LOD of mCAPP-Seq using cell line spike samples. Horizontal line denotes the mean + 3 standard deviations of DMRs above background in undetected samples (0% and 0.005%). Vertical line denotes the spike AF (%) at which the mean + 3 standard deviations of undetected samples would be expected based on the linear fit (0.013%). All P-values calculated with Student’s T-test: **, P < 0.01; ****, P < 0.0001; ns, not significant.
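
The limit-of-detection construction described for Figure 7E (a cutoff at the mean plus 3 standard deviations of undetected samples, intersected with a linear fit of signal against spike level) can be sketched as follows. This is a hedged illustration only: it assumes an ordinary least-squares fit on untransformed values and the sample standard deviation, neither of which is stated in the source, and all numbers are made-up toy data rather than results from the disclosure.

```python
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_lod(background_signal, spike_af, spike_signal):
    """Return (cutoff, lod).

    cutoff: mean + 3 * sample SD of the signal in undetected samples.
    lod:    spike level at which the fitted line reaches the cutoff.
    """
    cutoff = statistics.mean(background_signal) + 3 * statistics.stdev(background_signal)
    slope, intercept = fit_line(spike_af, spike_signal)
    return cutoff, (cutoff - intercept) / slope


# Toy data (not from the disclosure): signal = 2 + 100 * spike level.
background = [1, 2, 3]      # DMRs above background in undetected samples
spike_af = [0.01, 0.1, 1.0]  # spike levels (%)
signal = [3, 12, 102]        # DMRs above background at each spike level
cutoff, lod = estimate_lod(background, spike_af, signal)
print(cutoff, lod)
```

For the toy data the cutoff is 5 and the fitted line crosses it at a spike level of about 0.03%.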
[0077] Figures 8A to 8C provide data showing comparison of conversion methods for detection of methylated cytosines. Figure 8A provides mapping rates for targeted sequencing libraries prepared with bisulfite, EM-seq, TAPS, or no conversion captured with a 44kb panel. Bisulfite and EM-seq samples were mapped with Bismark and TAPS and unconverted samples were mapped with bwa aln. Figure 8B provides median deduplicated depths after barcode deduping for each conversion method. *, P < 0.05, Student’s T-test. Figure 8C provides rate of conversion of unmethylated cytosines to thymine in lambda control DNA spiked into EM-Seq and bisulfite samples. ****, P < 0.0001, Student’s T-test.
[0078] Figures 9A to 9D provide a technical assessment of mCAPP-Seq with EM-seq conversion. Figure 9A provides median deduplicated depths for EM-Seq libraries prepared in duplicate for 5 different inputs of cfDNA. Figure 9B provides median deduplicated depths for cell line spike admixture samples prepared with 3 cfDNA inputs and 4 tumor fractions. P-values calculated with Student’s T-test. ****, P < 0.0001. Figure 9C depicts DMRs above background for each input and tumor fraction prepared in triplicate. P-values calculated for each spike level compared to 0% with Student’s T-test. *, P < 0.05; **, P < 0.01; ns, not significant. Figure 9D depicts DMRs above background for each input and tumor fraction in a repeated cell line spike experiment prepared in triplicate. P-values calculated for each spike level compared to 0% with Student’s T-test. *, P < 0.05; **, P < 0.01; ***, P < 0.001; ns, not significant.
[0079] Figures 10A and 10B provide data showing an application of EM-seq mCAPP-Seq to advanced stage NSCLC patients. Figure 10A provides boxplots of hyper-DMR methylated allele fractions for 12 control cfDNA samples and 12 stage IV cfDNA samples. Each boxplot represents one sample. T-test computed on all control DMRs vs. all patient DMRs. ****, P < 0.001, Student’s T-test. Figure 10B depicts DMRs above background for 12 control cfDNA samples and 12 stage IV cfDNA samples sequenced with EM-Seq. ****, P < 0.001, Student’s T-test.
[0080] Figures 11A to 11F provide cfDNA extraction metrics and availability for early-stage training and validation cohorts. Figure 11A provides data on the volume of plasma available for each cfDNA sample extracted from risk-matched controls and early-stage NSCLC patients in the mCAPP-Seq training and validation cohorts. Figure 11B provides total cfDNA yields extracted from each sample. Figure 11C provides data depicting percent of cfDNA in the 50-450bp size range as measured by Agilent fragment analyzer. Figure 11D provides data depicting total cfDNA extracted in the 50-450bp size range per sample. Figure 11E provides data on plasma cfDNA concentration (ng/mL) considering only cfDNA in the 50-450bp size range. Figure 11F provides data depicting percent of samples in each cohort with sufficient cfDNA in the 50-450bp size range (> 40ng) for both mCAPP-seq and regular CAPP-Seq.
[0081] Figures 12A to 12E provide resultant data from applying mCAPP-Seq to early-stage NSCLC patients and risk-matched controls. Figure 12A provides non-deduplicated total sequencing read pairs per sample in the early-stage training cohort NSCLC patients and controls. ns, not significant by Wilcoxon rank-sum test. Figure 12B provides data on median selector-wide unique depth after removing PCR duplicates using molecular barcodes for all early-stage training cohort NSCLC patients and controls. ns, not significant by Wilcoxon rank-sum test. Figure 12C provides data depicting the relationship between reads on the sequencer and median unique depth in all samples in the training cohort. R and p-value calculated using Spearman correlation. Figure 12D provides boxplots for each NSCLC patient and control depicting the distribution of hyper-DMR methylated allele fractions (AFs). Wilcoxon rank-sum test performed on all patients vs. all controls. ****, P < 0.0001. Figure 12E depicts DMRs above background for all early-stage NSCLC patients and controls in the training cohort. ***, P < 0.001, Wilcoxon rank-sum test.
[0082] Figures 13A to 13D provide data on optimization and feature identification for a statistical model for classification of early-stage patient plasma from cfDNA methylation. Figure 13A provides a heatmap displaying leave-one-out (LOO) detection sensitivity in early-stage cancer patient plasma samples in the training cohort at different thresholds for the minimum number of CpGs and minimum percent methylated CpGs required to consider a molecule ‘highly methylated.’ Figure 13B provides a heatmap displaying LOO specificity in risk-matched non-cancer controls for the analysis shown in Fig. 13A. Figure 13C provides data on distribution of highly methylated molecule fractions across all hyper-DMRs for every cancer patient and control in the early-stage training cohort. These distributions were summarized to higher-order summary features and used for cancer classification. Figure 13D provides data on distribution of number of methylated CpGs per subject in the set of most highly methylated fragments in the hyper-DMRs. These distributions were summarized to calculate the fragment methylation index (FMI).
[0083] Figures 14A to 14G provide data depicting that the fragment methylation index is a discriminatory feature for NSCLC detection. Figure 14A provides data showing association between fragment methylation index and stage. Figure 14B provides boxplots showing median fragment lengths per patient or control across all fragments in the hyper-DMR regions. Figure 14C provides boxplots showing fraction of all fragments in the hyper-DMR regions with lengths greater than 300bp for each patient or control. Figure 14D provides boxplots showing median fragment lengths per patient or control across only fragments meeting the minimum CpG content threshold (>= 12 CpGs). Figure 14E provides boxplots showing fraction of fragments meeting the minimum CpG content threshold (>= 12 CpGs) with lengths greater than 300bp for each patient or control. Figure 14F provides boxplots showing median fragment lengths per patient or control across fragments with the highest CpG count, regardless of methylation state, per hyper-DMR region for every patient and control. Figure 14G provides boxplots showing median number of CpGs across fragments with the highest CpG count, regardless of methylation state, per hyper-DMR region for every patient and control.
[0084] Figures 15A to 15F provide data on statistical models that show biological plausibility for detecting early-stage lung cancer. Figure 15A provides data on detection sensitivity by stage for a LASSO logistic regression classifier trained in a leave-one-out framework using DMR AFs as features. Figure 15B provides the number of non-zero coefficients for each leave-one-out model shown in Fig. 15A. Figure 15C provides the fraction of leave-one-out models in which a feature had a non-zero coefficient for the most recurrently selected DMR features. Figure 15D provides data on detection sensitivity by stage for a Ridge logistic regression classifier trained in a leave-one-out framework using quantiles of the DMR AF distribution as features. Figure 15E provides data on detection sensitivity by stage for a LASSO logistic regression classifier trained in a leave-one-out framework using DMR AF features and fragment-level CpG statistics. Figure 15F provides the fraction of leave-one-out models in which recurrently selected features from the LOO analysis in Fig. 15E had non-zero coefficients.
[0085] Figures 16A to 16E provide data on a detection model robustly detects early- stage NSCLC. Figure 16A provides data on detection sensitivity at 95% specificity by stage for a LASSO logistic regression classifier trained in a leave-one-out framework using DMR AF quantiles and fragment-level CpG statistics as features. Figure 16B provides data depicting recurrently selected features from the LOO analysis in Fig. 15F. Figure 16C provides data on model detection sensitivity by disease histology. Figure 16D provides data on Model detection sensitivity by stage for adenocarcinoma only. Figure 16E provides data on model detection sensitivity at 80% specificity.
DETAILED DESCRIPTION
[0086] Turning now to the drawings and data, systems and methods for performing sequencing and analysis of methylated nucleic acids are described in accordance with the various embodiments of the description. The systems and methods provide a means for detecting cancer-derived cell-free nucleic acids (cfNA) within an individual. The system and methods improve on prior methods of cfNA assessment, resulting in enhanced detection of early-stage cancer.
[0087] Cancer-derived methylation signals have been shown to be detectable in cell-free DNA (cfDNA). For example, aberrant promoter hypermethylation of a small number
of tumor-suppressor genes was first detected in serum of non-small cell lung cancer (NSCLC) patients in 1999 by methylation-specific PCR (M. Esteller, et al., Cancer Res. 1999 Jan 1;59(1):67-70, the disclosure of which is incorporated by reference), but more recently, high-throughput sequencing of broader swathes of the cell-free methylome has demonstrated success for cancer detection and localization (M. C. Liu, et al., Ann Oncol. 2020 Jun;31(6):745-759; and S. Y. Shen, et al., Nature. 2018 Nov;563(7732):579-583; the disclosures of which are incorporated herein by reference). Though evidence suggests some of the performance results for early lung cancer detection from published studies might be overstated, some of the largest studies to date have suggested that cfDNA methylation profiling has strong screening potential. However, despite promising existing methods, there is room for improved sensitivity, especially in a disease-focused manner. In looking at Grail’s results for lung cancer specifically, Stage I detection remains around 20%, and when split by histology, Stage I adenocarcinoma has a detection rate of 0-10% (X. Chen, et al., Clin Cancer Res. 2021 Aug 1;27(15):4221-4229, the disclosure of which is incorporated herein by reference). This disclosure describes a lung cancer-focused cfDNA methylation assay that integrates both epigenetic and genetic signals, improving noninvasive detection of small tumors.
[0088] The systems and methods of the disclosure provide a means of assessing methylation of a cfNA sample. Generally, the systems and methods can generate and sequence libraries sourced from a cfNA sample. The systems and methods can be utilized in any sequencing technique for identifying methylation patterns that are indicative of a condition. It is now appreciated that aberrant methylation patterns are biomarkers of various conditions, especially cancer. The various embodiments of the disclosure provide a means to take advantage of methylation patterns of various conditions to detect such conditions in a cfNA sample.
[0089] Provided in Fig. 1 is an example of a method to perform sequencing of cell-free nucleic acid samples to detect differentially methylated regions. In this example, a cfNA sample is obtained and processed for targeted methyl sequencing. The method can be useful for sequencing a cfNA sample collected from an individual for the purpose of detecting a condition in which the condition is marked by aberrant methylation patterns. For example, the method can be utilized to sequence a cfNA sample to detect aberrant
methylation patterns associated with a cancer. The method can be used for an early detection screen (e.g., prior to any cancer diagnosis), monitoring treatment (e.g., progress and success of an anti-cancer therapy), and/or detection of minimal residual disease (e.g., detection of cancer recurrence after completion of a treatment). Sequencing results can be utilized in downstream applications, such as performing computational analysis on the sequencing results in order to classify a sample for detecting a condition.
[0090] Method 100 can begin by obtaining (101) a cfNA sample. The cfNA sample can comprise cell-free RNA (cfRNA) and/or cell-free DNA (cfDNA). Generally, aberrant methylation patterns of cfDNA can be useful biomarkers of a condition such as cancer. cfNA can be collected from any extracellular source, such as (for example) blood, plasma, saliva, urine, stool, mucus, lymph and/or other bodily fluids. Sometimes a cfNA sample is described as a liquid biopsy or a biological sample, but any description of a sample that comprises cfNA molecules is applicable. cfNAs can be isolated and purified by any appropriate means.
[0091] A human plasma sample typically contains 0.5 to 10 ng per mL of cfDNA, corresponding to 150 to 3,000 copies of the haploid human genome. Some conditions, such as (for example) cancer and donor transplant rejection, will result in higher levels of circulating cfDNA, with levels of greater than 1000 ng per mL having been detected. cfDNA typically circulates in fragments ranging from 120 to 220 bp, with a maximum peak at about 167 bp. Thus, plasma can typically have from about 15 million to 400 million cfDNA copies per mL, and greater than 40 billion cfDNA copies per mL in some conditions. Other biological samples will vary in the amount of cfDNA, but generally have concentrations in the millions of copies per mL of sample, and the concentration is greater when the sample is affected by a condition marked by high necrosis and/or DNA leakage, such as cancer.
[0092] In the case of cfDNA, the extracted and isolated cfDNA fragments can be utilized as originating nucleic acid molecules. A collection of originating nucleic acid molecules for a sequencing reaction can have greater than 10,000 nucleic acid molecules, greater than 100,000 nucleic acid molecules, greater than 1,000,000 nucleic acid molecules, greater than 10,000,000 nucleic acid molecules, greater than 100,000,000 nucleic acid molecules, or greater than 1,000,000,000 nucleic acid molecules.
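The mass-to-copy conversion stated above can be checked with a short calculation. The following is a minimal sketch only; the value of about 3.3 pg per haploid human genome is an assumed constant that is not given in this disclosure, and the function name is hypothetical.

```python
# Assumption (not from the disclosure): one haploid human genome weighs
# approximately 3.3 picograms.
HAPLOID_GENOME_PG = 3.3


def haploid_genome_copies(cfdna_ng: float) -> float:
    """Convert a cfDNA mass in nanograms to approximate haploid genome copies."""
    if cfdna_ng < 0:
        raise ValueError("cfDNA mass must be non-negative")
    picograms = cfdna_ng * 1000.0  # 1 ng = 1000 pg
    return picograms / HAPLOID_GENOME_PG


# Typical plasma range described above: 0.5 to 10 ng of cfDNA per mL.
low_copies = haploid_genome_copies(0.5)
high_copies = haploid_genome_copies(10.0)
print(f"{low_copies:.0f} to {high_copies:.0f} haploid genome copies per mL")
```

Under this assumed constant, the 0.5 to 10 ng per mL range works out to roughly the 150 to 3,000 haploid genome copies per mL stated above.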
[0093] In some embodiments, a cfNA sample is obtained prior to any indication of cancer. In some embodiments, a cfNA sample is obtained to provide an early screen in order to detect a cancer prior to a diagnosis of cancer. In some embodiments, a cfNA sample is obtained to detect if residual cancer exists after a treatment. In some embodiments, a cfNA sample is obtained during treatment to determine whether the treatment is providing the desired response. Screening of any particular cancer can be performed. In some embodiments, screening is performed to detect a cancer that develops aberrant methylation patterns in stereotypical regions in the genome, such as (for example) lung cancer. In some embodiments, screening is performed to detect a cancer in which regions of aberrant methylation were discovered utilizing a prior extracted cancer biopsy, which may be useful for monitoring treatment or detecting minimal residual disease.
[0094] In some embodiments, a cfNA sample is obtained from an individual with a determined risk of developing cancer, such as those with a familial history of the disorder or those who have determined risk factors (e.g., exposure to carcinogens). In many embodiments, a cfNA sample is obtained from any individual within the general population. In some embodiments, a cfNA sample is obtained from individuals within a particular age group with higher risk of cancer, such as, for example, aging individuals above the age of 50. In some embodiments, a cfNA sample is obtained from an individual diagnosed with and treated for a cancer.
[0095] Method 100 can further generate (103) a sequencing library targeting differentially methylated regions. Generally, targeted sequencing can be performed by capturing and/or specifically amplifying particular regions of a genome. In some embodiments, adapters and/or primers are attached onto cell-free nucleic acids to facilitate sequencing.
[0096] Any appropriate amount of input cfNA can be utilized in library preparation. The limit of detection (LOD) can be affected by the amount of input cfNA. When assessing for deduplicated depth, it was found that as little as 1 ng of cfDNA could be utilized. But to improve sensitivity of detecting differences in methylation patterns, more cfDNA is useful. In various embodiments, the amount of input cfDNA for library preparation is at least
1 ng, at least 2.5 ng, at least 5 ng, at least 10 ng, at least 15 ng, at least 20 ng, at least 25 ng, or at least 30 ng.
[0097] In some embodiments, targeted sequencing of particular genomic loci is to be performed, and thus particular sequences corresponding to the particular loci are captured via hybridization prior to sequencing (e.g., capture sequencing). In some embodiments, capture sequencing is performed utilizing a panel of probes that pull down (or capture) regions that have been discovered to be differentially methylated for a particular cancer (e.g., lung cancer). In some embodiments, capture sequencing is performed utilizing a panel of probes that pull down (or capture) regions that have been discovered to be differentially methylated as determined prior by methyl sequencing a biopsy of the cancer.
[0098] In various embodiments, a panel of probes comprises at least 10 unique probes, at least 20 unique probes, at least 50 unique probes, at least 100 unique probes, at least 150 unique probes, at least 200 unique probes, at least 250 unique probes, at least 500 unique probes, or at least 1000 unique probes.
[0099] Provided in Table 2 is a set of genomic loci for detecting regions aberrantly methylated in non-small cell lung cancer (NSCLC), and in particular for lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC). All or some of these regions can be utilized for assessing NSCLC. Further, these regions may be utilized for differentiating between LUAD and LUSC, which may be useful for determining treatment options.
[0100] In various embodiments, a panel of capture nucleic acid probes can be designed to hybridize to at least 5%, at least about 10%, at least 20%, at least 30%, at least about 40%, at least 50%, at least 60%, at least 70%, at least 80%, at least 90%, at least 95%, at least 99%, or 100% of the genomic regions listed in Table 2. In various embodiments, the sequence of a probe is at least 50% complementary, at least 60% complementary, at least 70% complementary, at least 80% complementary, at least 90% complementary, at least 95% complementary, or at least 99% complementary to a sequence within a genomic region listed in Table 2. A standard genomic reference, such as hg19, can be utilized to retrieve sequences of genomic regions listed in Table 2.
[0101] If an individual is known to have a cancer, methyl sequencing of the cancer can be performed to identify regions that are differentially methylated in association with the
cancer tissue. Nucleic acid probes can be designed to hybridize to these identified regions such that methyl sequencing cfNA can better detect the presence of cancer-derived cfNAs in a biological sample of that individual. This personalized method of using probes designed to hybridize to identified regions that are differentially methylated can improve the ability to detect the presence of cancer when performing assessments of therapeutic progress and/or detection of minimal residual disease. In various embodiments, a panel of capture nucleic acid probes can be designed to hybridize to at least 10 genomic regions, at least 20 genomic regions, at least 50 genomic regions, at least 100 genomic regions, at least 150 genomic regions, at least 200 genomic regions, at least 250 genomic regions, at least 500 genomic regions, at least 750 genomic regions, or at least 1000 genomic regions. In various embodiments, the sequence of a probe is at least 50% complementary, at least 60% complementary, at least 70% complementary, at least 80% complementary, at least 90% complementary, at least 95% complementary, or at least 99% complementary to a sequence that has been identified to be differentially expressed in an individual’s cancer.
[0102] In some embodiments, capture sequencing is performed utilizing a panel of probes that pull down (or capture) regions related to other useful information, which may be useful for performing a diagnostic. For instance, certain methylated regions can be useful to provide indication of factors associated with cancer, such as regions in which methylation patterns are correlated with age, smoking history, and body mass index (BMI). Provided in Table 1 are many regions that are associated with various factors, including age, BMI, cell type (cibersortX-sites), tissue origin (miniselector), multi-cancer, pan-cancer, and smoking history.
[0103] In various embodiments, a panel of capture nucleic acid probes can be designed to hybridize to at least about 5%, at least about 10%, at least about 20%, at least about 30%, at least about 40%, at least about 50%, at least about 60%, at least about 70%, at least about 80%, at least about 90%, at least about 95%, at least about 99%, or about 100% of the genomic regions identified in Table 1 . In various embodiments, the sequence of a probe is at least 50% complementary, at least 60% complementary, at least 70% complementary, at least 80% complementary, at least 90% complementary, at least 95% complementary, or at least 99% complementary to a sequence within a
genomic region identified in Table 1. A standard genomic reference, such as hg19, can be utilized to retrieve sequences of genomic regions listed in Table 1.
[0104] In some embodiments, capture sequencing is performed utilizing a panel of probes that pull down (or capture) regions utilized as controls, such as regions that are invariably hypermethylated and/or invariably hypomethylated. In some embodiments, capture sequencing is performed utilizing a panel of probes that pull down (or capture) regions related to other forms of molecular information useful for diagnostics, such as single nucleotide variants, insertions, deletions, and copy number variations.
[0105] Certain regions that are differentially methylated in a certain condition may also be differentially methylated for other reasons unrelated to the condition. For instance, regions may be differentially methylated between healthy individuals, upon an environmental stimulus, in association with different cell types, etc. Differential methylation in blood cells is a special concern as these cells are a high source of cfNAs. Detection of differential methylation in these regions may yield false discovery. Accordingly, in some embodiments, a panel of probes excludes regions known to be associated with false discovery. And in some embodiments, a panel of probes excludes regions known to be differentially methylated in blood cells.
[0106] In some embodiments, adapters utilized for sequencing are resistant to conversion of nucleobases. For instance, some adapters include methylated cytosine, which resists conversion via bisulfite treatment.
[0107] Method 100 can further convert (105) nucleobases, which can differentiate nucleobases that are methylated from nucleobases that are unmethylated. Any method for converting nucleobases can be utilized. In some embodiments, a chemical conversion is performed. In some embodiments, an enzymatic conversion is performed. Various methodologies can be utilized to convert methylated nucleobases, including (but not limited to) bisulfite treatment, TET2 oxidation and APOBEC3A conversion, and TET2 oxidation and pyridine borane treatment. In experimentation performed and described in the Examples section, it was found that the TET2 oxidation and pyridine borane treatment and the TET2 oxidation and APOBEC3A conversion methods provided higher mappability of reads than bisulfite treatment. It was further found that TET2 oxidation and APOBEC3A conversion provided better unique molecule recovery. Accordingly, in some preferred
embodiments, TET2 oxidation and APOBEC3A conversion is utilized to convert methylated nucleobases. For more details on nucleobase conversion inclusive of methods, data and results, see the Examples section herein.
[0108] Although methylation is principally described throughout, the systems and methods can be used on a variety of modified nucleobases, including (but not limited to) 5-methylcytosine (5-mC), 5-hydroxymethylcytosine (5-hmC), 5-formylcytosine (5-fC), and 5-carboxylcytosine (5-caC), as dependent on the biomarkers of the condition. Sequencing and detection methods of the various modifications can be utilized, as has been reported, see, e.g., R. P. Darst, Curr Protoc Mol Biol. 2010 Jul;Chapter 7: Unit 7.9.1-17; C. X. Song, et al., Nat Biotechnol. 2011 Jan;29(1):68-72; J. Morrison, et al., Epigenetics Chromatin. 2021 Jun 19;14(1):28; F. Erger, et al., Genome Med. 2020 Jun 24;12(1):54; M. J. Booth, et al., Nat Chem. 2014 May;6(5):435-40; X. Lu, et al., J Am Chem Soc. 2013 Jun 26;135(25):9315-7; J. Xiong, et al., Chem Sci. 2022 Aug 11;13(34):9960-9972; and A. B. R. McIntyre, et al., Nat Commun. 2019 Feb 4;10(1):579; the disclosures of which are each incorporated by reference.
[0109] It is noted that some sequencing platforms can detect methylation without conversion, such as nanopore sequencing techniques. When a sequencing technique can detect methylation directly, the nucleobase conversion step can be excluded. Examples of sequencing platforms that can detect methylation include (but are not limited to) Oxford Nanopore Technologies PromethION, MinION, and GridION sequencing platforms (Oxford, UK) and Pacific Biosciences’ Single Molecule, Real-Time (SMRT) sequencing platform (Menlo Park, CA).
[0110] Method 100 further sequences (107) the generated and converted library to detect methylation status of differentially methylated regions. Any appropriate high-throughput sequencing technique can be utilized that can detect converted nucleobases. High-throughput sequencing techniques include (but are not limited to) 454 sequencing, Illumina sequencing, SOLiD sequencing, Ion Torrent sequencing, single-read sequencing, paired-end sequencing, etc. A high-throughput sequencing method can simultaneously sequence at least about 10,000, at least about 100,000, at least about 1 million, at least about 10 million, at least about 100 million, or at least about 1 billion cfNA molecules.
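How a conversion step distinguishes methylated from unmethylated cytosines can be illustrated with a simplified read-out sketch. It rests on assumptions not stated in this disclosure: that bisulfite treatment and TET2 oxidation with APOBEC3A conversion cause unmethylated cytosines to be read as T while methylated cytosines remain C, and that TET2 oxidation with pyridine borane treatment gives the opposite read-out. It also assumes a top-strand read aligned to the reference without gaps. All names are hypothetical.

```python
from typing import List, Optional

# Assumed read-out conventions (not taken from the disclosure).
UNMETHYLATED_C_READS_AS_T = {"bisulfite", "tet2_apobec3a"}
METHYLATED_C_READS_AS_T = {"tet2_pyridine_borane"}


def call_cpg_methylation(reference: str, read: str, chemistry: str) -> List[Optional[bool]]:
    """Return one call per reference CpG: True (methylated), False, or None."""
    if len(reference) != len(read):
        raise ValueError("this sketch requires a gapless, full-length alignment")
    if chemistry in UNMETHYLATED_C_READS_AS_T:
        methylated_base, unmethylated_base = "C", "T"
    elif chemistry in METHYLATED_C_READS_AS_T:
        methylated_base, unmethylated_base = "T", "C"
    else:
        raise ValueError(f"unknown chemistry: {chemistry}")

    calls: List[Optional[bool]] = []
    for i in range(len(reference) - 1):
        if reference[i:i + 2] != "CG":
            continue  # only cytosines in a CpG context are called here
        base = read[i]
        if base == methylated_base:
            calls.append(True)
        elif base == unmethylated_base:
            calls.append(False)
        else:
            calls.append(None)  # mismatch or no-call at this CpG
    return calls


reference = "ACGTCGACG"  # CpGs at positions 1, 4, and 7
read = "ACGTTGACG"       # the middle CpG cytosine is read as T
print(call_cpg_methylation(reference, read, "tet2_apobec3a"))
print(call_cpg_methylation(reference, read, "tet2_pyridine_borane"))
```

A production pipeline would additionally handle both strands, indels, base qualities, and conversion-efficiency controls, none of which are modeled here.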
[0111] Several embodiments are directed towards utilizing a computational model to detect the presence of a condition utilizing methyl sequencing data of a cfNA sample. Interpretation of methyl sequencing data results is difficult in cases in which the differentiation of methylation in regions is not readily appreciated. For example, detection of Stage I lung cancer via cfDNA is a difficult task due to the low amount of cfDNA markers present in a liquid biopsy. As described in the Examples Section herein, it was found that featurizing methyl sequencing results and utilizing these features within a computational classifier improved the ability to detect Stage I cancers in plasma samples. Upon classifying a cfNA sample to be derived from a cancer, a clinical intervention can be performed on the individual.
[0112] Provided in Fig. 2 is an example of a computational method to classify a cfNA sample based on a methyl sequencing result. Method 200 can begin by obtaining (201) a targeted methyl sequencing result of a cfNA sample. Generally, the sequencing result is obtained via high-throughput sequencing such that methylation of a high number of cfNA molecules is determined. The sequencing can be targeted to particular regions of a genome, especially regions that are known to be differentially methylated in a condition. In some embodiments, lung cancer is to be assessed and at least some of the genomic regions identified in Table 2 are targeted for methyl sequencing. In some embodiments, a cancer is sequenced to identify genomic regions that are differentially methylated such that subsequent cfNA sequencing is targeted to those regions. In some embodiments, sequencing is performed as described in reference to Fig. 1.
[0113] In various embodiments, the targeted sequence result covers at least 10 genomic regions, at least 20 genomic regions, at least 50 genomic regions, at least 100 genomic regions, at least 150 genomic regions, at least 200 genomic regions, at least 250 genomic regions, at least 500 genomic regions, at least 750 genomic regions, or at least 1000 genomic regions. The genomic regions can comprise regions associated with a condition (e.g., cancer), regions correlated with factors associated with a condition, and/or control regions.
[0114] Utilizing the sequencing result, Method 200 further can assess (203) methylation of cfNA molecules. Assessment of methylation can be done in a variety of ways, but generally assessment can comprise an amount of cfNA molecules that have at
least one methylated nucleobase and/or an amount of methylated nucleobases per cfNA molecule. When performing a methylation assessment, in various embodiments, at least 100 cfNA molecules of a sample are assessed, at least 1000 cfNA molecules of a sample are assessed, at least 10,000 cfNA molecules of a sample are assessed, at least 100,000 cfNA molecules of a sample are assessed, at least 1,000,000 molecules of a sample are assessed, or at least 10,000,000 molecules of a sample are assessed.
[0115] In some embodiments, methylation is assessed by computing a methylation metric that indicates an amount of methylation of cfNA molecules at a particular locus, considering the cfNA molecules that align to the locus. To perform this assessment, a set of cfNA molecules that align to the region are utilized. In some embodiments, the region comprises a CpG island. In various embodiments, each cfNA molecule to be used in the assessment comprises at least 2 CpGs, at least 4 CpGs, at least 6 CpGs, at least 8 CpGs, at least 10 CpGs, at least 15 CpGs, or at least 20 CpGs in the region. In some embodiments, a methylated molecule fraction (MMF) is computed, where the MMF is the number of cfNA molecules having an amount of methylation in that region over a threshold per the total cfNAs assessed for that region:
$$\mathrm{MMF} = \frac{\#\,\text{molecules} > \text{methylation threshold}}{\text{total molecules assessed}}$$
In various embodiments, the methylation threshold is at least 20% CpGs methylated, at least 30% CpGs methylated, at least 40% CpGs methylated, at least 50% CpGs methylated, at least 60% CpGs methylated, at least 70% CpGs methylated, or at least 90% CpGs methylated.
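The MMF definition above can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the implementation used in the Examples: each molecule is represented as a list of per-CpG calls (1 for methylated, 0 for unmethylated), the default thresholds of 10 CpGs and 70% methylation are example values chosen from the ranges listed above, and the function names are hypothetical.

```python
from typing import List, Optional, Sequence


def is_highly_methylated(cpg_calls: Sequence[int],
                         min_cpgs: int = 10,
                         min_fraction: float = 0.7) -> Optional[bool]:
    """Classify one molecule; None means it has too few CpGs to be assessed."""
    n_cpgs = len(cpg_calls)
    if n_cpgs < min_cpgs:
        return None
    # "At least X% CpGs methylated" is treated as an inclusive threshold.
    return sum(cpg_calls) / n_cpgs >= min_fraction


def methylated_molecule_fraction(molecules: Sequence[Sequence[int]],
                                 min_cpgs: int = 10,
                                 min_fraction: float = 0.7) -> Optional[float]:
    """MMF for one region: highly methylated molecules / molecules assessed."""
    calls: List[bool] = []
    for cpg_calls in molecules:
        call = is_highly_methylated(cpg_calls, min_cpgs, min_fraction)
        if call is not None:
            calls.append(call)
    if not calls:
        return None  # no molecule in this region met the CpG-count requirement
    return sum(calls) / len(calls)


region_molecules = [
    [1] * 10,            # fully methylated
    [1] * 7 + [0] * 3,   # 70% methylated, meets the inclusive threshold
    [0] * 10,            # unmethylated
    [1, 1],              # too few CpGs, excluded from the denominator
]
print(methylated_molecule_fraction(region_molecules))
```

Excluding molecules with too few CpGs from the denominator follows the statement that only molecules meeting the minimum CpG count are used in the assessment.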
[0116] The MMF can be computed for any or all regions sequenced in association with a particular trait (e.g., any or all regions differentially methylated in cancer). In various embodiments, an MMF is computed for at least 50% of regions, at least 60% of regions, at least 70% of regions, at least 80% of regions, at least 90% of regions, or all regions sequenced in association with a particular trait.
[0117] Method 200 optionally computes (205) a sample summary statistic. Having determined a methylation metric for a number of regions, an overall sample summary statistic can be generated by combining methylation metrics for a plurality of regions. In some embodiments, a sample summary statistic combines methylation metrics for all
regions associated with a trait that were sequenced. In some embodiments, a sample summary statistic combines methylation metrics for a subset of regions associated with a trait that were sequenced. For example, a number of the top informative regions can be combined to yield a sample summary statistic. A top informative region is a region that has been determined to have a greater association with or more predictive ability of a condition (e.g., cancer) when compared to the other regions assessed. In one hypothetical example, a sample summary statistic can be determined by combining a methylation metric for a number of regions that have the most association with or most predictive ability of a condition. In various embodiments, at least 10% of regions assessed are combined, at least 20% of regions assessed are combined, at least 30% of regions assessed are combined, at least 40% of regions assessed are combined, at least 50% of regions assessed are combined, at least 60% of regions assessed are combined, at least 70% of regions assessed are combined, at least 80% of regions assessed are combined, at least 90% of regions assessed are combined, or all regions assessed are combined to yield a sample summary statistic.
[0118] A sample summary statistic can be computed in a number of different ways. In some embodiments, any statistic that can combine a methylation metric for a plurality of regions can be utilized. In some embodiments, percentiles of a distribution of region MMFs for a sample are determined, where the percentile is determined by comparing to a cohort of samples. For example, the percentile of the number of regions with nonzero MMFs is determined, as compared to a cohort. In another example, the percentile of the number of regions where MMF is zero is determined, as compared to a cohort. In another example, the number of regions with an MMF > X% is determined, as compared to a cohort; in some embodiments X is any percentage between 0.01% and 1%; in some embodiments, X is 0.1%. In another example, for each region, the cfNA molecule with the most methylated CpGs is further considered; the median methylated CpG amount is determined; an X percentile methylated CpG amount is determined, where X is any percentage between 1% and 100%; the skewness of methylated CpG amount is determined. In some embodiments, a sample summary statistic is normalized based on length of cfNA molecule. As would be understood from these examples, many other sample statistics can be determined.
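Several of the summary statistics listed above can be sketched as follows. This is a hedged illustration only: the function names, the nearest-rank percentile convention, and the population-moment skewness formula are assumptions chosen for simplicity and are not specified in this disclosure.

```python
import math
import statistics
from typing import Dict, Sequence


def count_regions_above(mmf_by_region: Dict[str, float], threshold: float = 0.0) -> int:
    """Number of regions whose MMF exceeds a threshold (0.0 counts nonzero MMFs)."""
    return sum(1 for mmf in mmf_by_region.values() if mmf > threshold)


def percentile_rank(value: float, cohort_values: Sequence[float]) -> float:
    """Percent of cohort samples whose statistic is less than or equal to value."""
    if not cohort_values:
        raise ValueError("cohort must not be empty")
    return 100.0 * sum(1 for v in cohort_values if v <= value) / len(cohort_values)


def nearest_rank_percentile(values: Sequence[float], percent: float) -> float:
    """Nearest-rank percentile for 0 < percent <= 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(percent / 100.0 * len(ordered)))
    return ordered[rank - 1]


def skewness(values: Sequence[float]) -> float:
    """Population skewness: third central moment over variance to the 3/2 power."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return 0.0 if m2 == 0 else m3 / m2 ** 1.5


def top_fragment_summary(max_methylated_cpgs: Sequence[int], percent: float = 90.0) -> dict:
    """Summarize, across regions, the methylated CpG count of each region's top fragment."""
    return {
        "median": statistics.median(max_methylated_cpgs),
        "percentile": nearest_rank_percentile(max_methylated_cpgs, percent),
        "skewness": skewness(max_methylated_cpgs),
    }


sample_mmfs = {"region_1": 0.0, "region_2": 0.002, "region_3": 0.0005}
n_above = count_regions_above(sample_mmfs, threshold=0.001)  # X = 0.1%
print(n_above, percentile_rank(n_above, cohort_values=[0, 0, 1, 3]))
print(top_fragment_summary([12, 15, 20]))
```

Each returned value could serve as one entry of a per-sample feature vector; the cohort values in the usage lines are made-up illustrative numbers, not data from the Examples.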
[0119] Method 200 can further enter (207) one or more methylation features into a trained computational model to assess a sample, where the result indicates whether the sample is associated with a condition (e.g., cancer). Methylation features comprise computed assessments of methylation of cfNAs. A feature can be based on methylation of a particular region associated with the condition to be assessed. A feature can be based on a summary sample statistic of a plurality of regions associated with the condition to be assessed.
[0120] In some embodiments, a methylation assessment of a particular region is utilized as a feature. For example, a computed methylation metric for a particular region can be utilized as a feature. In some embodiments, each feature of a plurality of features are based on a particular region, where the features utilized within the model are based on the predictive ability of the feature or the association of methylation of its region with the condition.
[0121] In some embodiments, a sample summary statistic that combines a plurality of methylation assessments, each methylation assessment being of a particular region, is utilized as a feature. For example, any sample statistic as computed in step 205 can be used as a feature.
[0122] Any appropriate machine learning model and architecture can be utilized as a model. In some implementations, multiple trained machine learning models are utilized and/or combined (e.g., an ensemble model). ML models that can be implemented include (but are not limited to) regression-based and/or classification-based models. Generally, regression-based models provide a score that indicates a likelihood of the cancer whereas a classification-based model classifies a sample as likely to include or to not include cancer. Regression-based models include (but are not limited to) LASSO regression, ridge regression, k-nearest neighbors, elastic net, least angle regression (LAR), and random forest regression. Classification-based models include (but are not limited to) support vector machines (SVMs), decision trees, random forests, and naive Bayes. In some embodiments, a regression-based model or a classification-based model is regularized, while in various embodiments, a regression-based model or a classification-based model is gradient boosted.
[0123] Computational models can be trained using a cohort of cfNA samples. For example, to train a classifier for lung cancer, cfNA samples derived from a cohort can be
utilized. In some embodiments, a leave-one-out cross-validation (LOOCV) machine-learning framework is used to build and train a model. In each LOOCV round, the model is iteratively trained on all samples except for one sample that is left out. Model performance can be evaluated on the left-out sample. LOOCV training is attractive because it reduces overfitting and provides a more accurate assessment of the overall stability.
[0124] Method 200 can optionally perform (209) a clinical intervention when the ML model indicates that the cfDNA sample contains cfDNA molecules derived from a cancer. Clinical interventions can include further clinical evaluation of or administration of a treatment to an individual. In a number of embodiments, a clinical procedure is performed, such as (for example) a blood test, genetic test, medical imaging, physical exam, a tumor biopsy, or any combination thereof. In several embodiments, diagnostics are performed to determine the particular stage of cancer. In a number of embodiments, a treatment is performed, such as (for example) chemotherapy, radiotherapy, chemoradiotherapy, immunotherapy, hormone therapy, targeted drug therapy, surgery, transplant, transfusion, medical surveillance, or any combination thereof. In some embodiments, an individual is assessed and/or treated by a medical professional, such as a doctor, physician, physician’s assistant, nurse practitioner, nurse, caretaker, dietician, or similar.
[0125] In some embodiments, non-limiting examples of a treatment can include chemotherapy, radiotherapy, chemoradiotherapy, immunotherapy, adoptive cell therapy (e.g., chimeric antigen receptor (CAR) T cell therapy, CAR NK cell therapy, modified T cell receptor (TCR) T cell therapy, etc.), hormone therapy, targeted drug therapy, surgery, transplant, transfusion, or medical surveillance. A treatment for a condition of a subject can comprise administering one or more therapeutic agents to the subject. The one or more therapeutic agents can be administered to the subject by one or more of the following: orally, intraperitoneally, intravenously, intraarterially, transdermally, intramuscularly, liposomally, via local delivery by catheter or stent, subcutaneously, intraadiposally, and intrathecally.
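The leave-one-out procedure described above can be sketched generically. The loop below is independent of the model; the nearest-centroid classifier is a deliberately simple stand-in chosen so the example has no dependencies, and it is not the LASSO or ridge logistic regression referred to in the figures. All function names and the toy feature values are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Features = Sequence[float]
Model = Tuple[List[float], List[float]]


def leave_one_out_scores(features: List[Features], labels: List[int],
                         fit: Callable, score: Callable) -> List[float]:
    """Score each sample with a model trained on every other sample."""
    scores = []
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)  # the held-out sample is never seen here
        scores.append(score(model, features[i]))
    return scores


def fit_centroids(train_x: List[Features], train_y: List[int]) -> Model:
    """Stand-in model: per-class feature means (0 = control, 1 = cancer)."""
    centroids = []
    for label in (0, 1):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        if not rows:
            raise ValueError("each class needs at least two samples for LOOCV")
        centroids.append([sum(col) / len(rows) for col in zip(*rows)])
    return centroids[0], centroids[1]


def centroid_score(model: Model, x: Features) -> float:
    """Positive when the sample is closer to the cancer centroid than to controls."""
    control, cancer = model
    d_control = sum((a - b) ** 2 for a, b in zip(x, control))
    d_cancer = sum((a - b) ** 2 for a, b in zip(x, cancer))
    return d_control - d_cancer


# Toy one-feature cohort (made-up values): three controls, then three cancers.
X = [[0.0], [0.1], [0.2], [0.8], [0.9], [1.0]]
y = [0, 0, 0, 1, 1, 1]
loo_scores = leave_one_out_scores(X, y, fit_centroids, centroid_score)
print([s > 0 for s in loo_scores])
```

Because any feature selection or scaling would also need to be repeated inside each round to keep the held-out sample unseen, the `fit` callable is the natural place for such steps.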
Computational processing system
[0126] A computational processing system to assess differentially methylated regions in cfNA to detect a condition in accordance with the various methods of the disclosure
typically utilizes a processing system including one or more of a CPU, GPU and/or neural processing engine. In a number of implementations, methyl sequencing results of cfNA are processed and assessed to detect a condition based using a computational processing system. In some implementations, the computational processing system is housed within a computing device associated with a sequencer. In some implementations, the computational processing system is housed separately from the sequencer and receives the sequencing results. In certain embodiments, the computational processing system is implemented using a software application on a computing device such as (but not limited to) mobile phone, a tablet computer, a wearable device (e.g., watch), and/or portable computer.
[0127] A computational processing system in accordance with various embodiments of the disclosure is illustrated in Fig. 3. The computational processing system 300 includes a processor system 302, an I/O interface 304, and a memory system 306. As can readily be appreciated, the processor system 302, I/O interface 304, and memory system 306 can be implemented using any of a variety of components appropriate to the requirements of specific applications including (but not limited to) CPUs, GPUs, ISPs, DSPs, wireless modems (e.g., WiFi, Bluetooth modems), serial interfaces, depth sensors, IMUs, pressure sensors, ultrasonic sensors, volatile memory (e.g., DRAM, SRAM) and/or nonvolatile memory (e.g., NAND Flash). The memory system is capable of storing sequencing data 308, an application for feature generation 310, and a computational model to detect a condition 312. The application can be downloaded and/or stored in non-volatile memory. When executed, the application for feature generation and/or computational model to detect a condition is capable of configuring the processing system to implement computational processes including (but not limited to) the computational processes described above and/or combinations and/or modified versions of the computational processes described above. In several embodiments, the application for feature generation 310 utilizes the sequencing data 308 to generate features based on differentially methylated regions. The computational model to detect a condition 312 utilizes the generated features to determine whether a cfNA sample is derived from an individual with a condition such as cancer. Intermediate data and/or final results can be temporarily stored in the memory system during processing and/or saved for use in downstream applications.
[0128] While specific computational processing systems are described above with reference to Fig. 3, it should be readily appreciated that computational processes and/or other processes utilized in the provision of assessing differentially methylated regions of cfNA in accordance with various embodiments of the disclosure can be implemented on any of a variety of processing devices including combinations of processing devices. Accordingly, computational devices in accordance with the disclosure should be understood as not limited to specific computational processing systems, but can be implemented using any of the combinations of systems described herein and/or modified versions of the systems described herein to perform the processes, combinations of processes, and/or modified versions of the processes described herein.
Examples
[0129] The systems and methods of the disclosure will be better understood with the several examples provided. Validation results are also provided.
Establishing a cell-free DNA methylation detection assay
[0130] Cancer Personalized Profiling by deep Sequencing (CAPP-Seq), the method developed for detecting tumor variants in cfDNA in a disease-specific manner, served as a template from which to design a novel cfDNA methylation detection method. While most existing cfDNA methylation methods focus on broad genomic coverage, the current methodology aimed to target a relatively small portion of the genome. This would enable high depth of coverage at relatively low sequencing costs, which would allow incorporation of barcoding and error suppression as in CAPP-Seq. Furthermore, the assay was kept disease-specific with the understanding that initial blood-based detection tests would best serve a high-risk population. To achieve these goals, a targeted sequencing panel was designed that would cover regions of informative methylation status for lung cancer detection.
Designing a targeted methylation sequencing panel for NSCLC detection
[0131] In SNV-based CAPP-Seq, sequencing space is focused on regions that contain recurrent mutations across patients to maximize the number of variants observed per patient while keeping sequencing costs (total bases covered) relatively low. Similarly, in designing a targeted methylation panel, a goal was to enrich for regions that would distinguish lung tumor-derived ctDNA molecules from the background of healthy, primarily blood-derived DNA making up the rest of the cfDNA pool, and that would also distinguish lung cancer from normal lung tissue. To do this, publicly available Infinium HumanMethylation450 (450k) array data were downloaded for lung tumors from The Cancer Genome Atlas (TCGA) and for blood and normal lung samples from published datasets, and were used to identify differentially methylated regions. Comparison of lung adenocarcinoma (LUAD) to blood and normal lung identified numerous differentially methylated CpGs (Figs. 4A and 4B) (E. A. Collisson, et al., Nature. 2014 Jul 31;511(7511):543-50; P. S. Hammerman, et al., Nature. 2012 Sep 27;489(7417):519-25; A. Arpon, et al., Nutrients. 2017 Dec 23;10(1):15; M. V. Dogan, et al., BMC Genomics. 2014 Feb 22;15:151; L. E. Reinius, et al., PLoS One. 2012;7(7):e41361; S. A. Langie, et al., PLoS One. 2016 Mar 21;11(3):e0151109; S. Horvath, et al., Genome Biol. 2012 Oct 3;13(10):R97; W. P. Accomando, et al., Genome Biol. 2014 Mar 5;15(3):R50; R. A. Harris, et al., Inflamm Bowel Dis. 2012 Dec;18(12):2334-41; M. M. Bjaanaes, et al., Mol Oncol. 2016 Feb;10(2):330-43; and J. Shi, et al., Nat Commun. 2014 Feb 27;5:3365; the disclosures of which are incorporated herein by reference). Setting a maximum Benjamini-Hochberg adjusted P-value of 0.0001, the top 500 most differentially methylated CpGs (DMCs) as ranked by the absolute difference in average beta values (Δβ) between LUAD and normal tissue were selected.
Filtering this list to probes with low background in other normal tissues including brain, liver, and skin yielded 412 DMCs for LUAD to include in the panel design (Fig. 4B). Of these, 379 were hypermethylated in LUAD compared to blood and normal lung, while 33 were hypomethylated. A similar analysis for lung squamous cell carcinoma (LUSC) selected 370 differentially methylated CpGs (Figs. 4C and 4D). Of these, 223 were hypermethylated in LUSC compared to blood and normal lung, while 147 were hypomethylated. 131 differentially methylated CpGs were shared between LUAD and LUSC, and additional CpGs that were only
selected for one histology had methylation levels trending in the same direction in the other histology, suggesting that those sites may aid in detection for both histologies (Fig. 4E). In total, 651 CpGs were chosen as differentially methylated sites (NSCLC DMRs) for the sequencing panel. The majority of the 651 selected sites were in CpG islands, as might be expected based on the composition of the 450k array (Fig. 4F). Most NSCLC DMRs were associated with genes (481/651), and of those, many were located upstream or in promoters of genes (Fig. 4G).
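As an illustration only, the selection rule described above (a Benjamini-Hochberg adjusted P-value of at most 0.0001, followed by ranking on the absolute difference in mean beta values) could be sketched as follows. The function names, the tuple layout, and the toy probe identifiers and values are assumptions made for this sketch and are not taken from the disclosed analysis; the statistical test that produces the raw P-values is not specified in the text and is treated as given.

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted P-values (step-up procedure)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_dmcs(probes, max_adj_p=1e-4, top_n=500):
    """probes: list of (probe_id, raw_p, mean_beta_tumor, mean_beta_normal).

    Returns up to top_n (probe_id, delta_beta) pairs that pass the adjusted
    P-value cutoff, ranked by |delta_beta| in descending order.
    """
    adj = bh_adjust([p[1] for p in probes])
    passing = [
        (pid, beta_t - beta_n)
        for (pid, _, beta_t, beta_n), q in zip(probes, adj)
        if q <= max_adj_p
    ]
    passing.sort(key=lambda item: abs(item[1]), reverse=True)
    return passing[:top_n]


# Hypothetical toy input: four probes, keep the two with the largest |delta beta|.
toy = [
    ("cg_a", 1e-9, 0.85, 0.10),
    ("cg_b", 1e-8, 0.20, 0.70),
    ("cg_c", 0.20, 0.50, 0.45),
    ("cg_d", 1e-7, 0.40, 0.30),
]
selected = select_dmcs(toy, top_n=2)
```

A positive delta beta in this sketch corresponds to hypermethylation in tumor relative to normal tissue and a negative value to hypomethylation; the subsequent filtering against other normal tissues is not shown.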
[0132] The results of this differential methylation analysis confirmed prior reports. For example, differentially methylated CpGs were identified in several homeobox and homeobox-related genes, including HOXA10, HOXA11, HOXD12, OTX1, OTX2, PTX1, PAX6 and PAX9, consistent with prior NSCLC studies (M. Shiraishi, et al., Oncogene. 2002 May 16;21(22):3659-62; J. A. Hwang, et al., Oncotarget. 2013 Dec;4(12):2317-25; and T. Rauch, et al., Proc Natl Acad Sci U S A. 2007 Mar 27;104(13):5527-32; the disclosures of which are incorporated herein by reference). TP73 was identified as hypomethylated in LUSC as previously observed (A. Daskalos, et al., Cancer Lett. 2011 Jan 1;300(1):79-86; the disclosure of which is incorporated herein by reference). For further validation of the selected markers, 450k array methylation data from NSCLC cell lines were downloaded, and the cell lines were found to show methylation states similar to those of TCGA primary tumors at the selected sites (Fig. 5A) (K. Walter, et al., Clin Cancer Res. 2012 Apr 15;18(8):2360-73; the disclosure of which is incorporated herein by reference).
[0133] Intriguingly, when the methylation status of the 651 NSCLC DMRs was examined in other cancers from TCGA, it was found that many other cancer types shared a similar methylation state with NSCLC at these regions, suggesting that these marks may reflect a ‘cancer’ signature rather than a lung cancer-specific one (Fig. 5B). To preserve the ability to specifically distinguish lung cancer from other cancer types when evaluating a patient’s cfDNA, a subset of major cancer types was selected from TCGA, namely LUAD, LUSC, bladder (BLCA), breast (BRCA), colorectal (COADREAD), B-cell lymphoma (DLBCL), hepatocellular carcinoma (LIHC), pancreas (PAAD), and prostate (PRAD), and a differential methylation analysis was performed to identify cancer type-specific methylation signals. Setting a minimum average methylation difference (Δβ) of 0.3 between each cancer type and all other cancers and healthy blood (one vs. rest) yielded a varying number of differentially methylated CpGs for each cancer type (range 1-33, median 15.5) (Fig. 5C). Interestingly, comparison of NSCLC to the other cancer types failed to identify any hypermethylated sites that were specific to lung cancer. However, it was reasoned that evaluating the methylation status of the regions specific to the other cancers would allow one to confidently claim lung cancer detection via lack of signal in those regions.
[0134] In addition to regions that differed between lung tumors and a background of healthy blood and normal lung tissue, we wanted to include, as controls, regions in which the methylation state was consistent across these different tissue types. To do this, we selected the CpG sites in the 450k array data for which methylation levels were consistently either hyper- or hypomethylated with low variance across all sample types. This analysis generated a list of 683 control CpGs, with 449 hypomethylated controls and 234 hypermethylated controls (Fig. 5D, Table 1). CpGs whose methylation status in blood has been shown to be correlated with age, smoking history, and BMI were also included (Table 2) (A. T. Lu, et al., Nat Aging. 2023 Sep;3(9):1144-1166; S. Bocklandt, et al., PLoS One. 2011;6(6):e14821; R. Joehanes, et al., Circ Cardiovasc Genet. 2016 Oct;9(5):436-447; X. Gao, et al., Clin Epigenetics. 2015 Oct 16;7:113; and S. Wahl, et al., Nature. 2017 Jan 5;541(7635):81-86; the disclosures of which are incorporated herein by reference). Finally, regions containing the most commonly mutated genes in NSCLC (including TP53, KRAS, EGFR, and NFE2L2, among others) were added to preserve the possibility of performing SNV genotyping.
Validating panel design in cell-free DNA
[0135] The targeted sequencing panel was designed to detect the presence of lung tumor-derived DNA in a background of healthy cfDNA. Thus, to mimic this scenario in a controlled manner, admixtures were created by spiking sheared NCI-H441 lung cancer cell line DNA into cfDNA from a healthy donor at serially decreasing tumor fractions: 100% tumor (pure cell line), 5% tumor, 0.5% tumor, 0.05% tumor, 0.005% tumor, and 0% tumor (pure cfDNA). Sequencing libraries from each tumor fraction were generated in triplicate using a preliminary methylation-CAPP-Seq (mCAPP-Seq) protocol. In this protocol, conversion-resistant methylated Y-adapters were ligated to the DNA before performing bisulfite conversion and finally PCR. Libraries were captured with the targeted sequencing panel and sequenced on a HiSeq4000. Alignment to the human genome and methylation calling were performed with Bismark (F. Krueger and S. R. Andrews, Bioinformatics. 2011 Jun 1;27(11):1571-2; the disclosure of which is incorporated herein by reference).
[0136] It was found that regions hypermethylated in TCGA NSCLCs (hyper-DMRs) were also hypermethylated in NCI-H441 cell line DNA compared to healthy cfDNA (Fig. 6A). Furthermore, methylation beta values at hyper-DMRs were substantially different from healthy cfDNA at 5% tumor fraction (Fig. 6B).
Initial bioinformatic approach for the detection of tumor-derived methylation signals in cfDNA
[0137] DNA methylation states at neighboring CpGs are non-random and highly correlated. This observation has been leveraged for the deconvolution of plasma cell-free DNA, with the hypothesis that fragment-level CpG patterns would exhibit higher specificity than standard single-CpG beta values. It was therefore sought to develop a method that similarly utilized fragment-level methylation states. To do this, a threshold was initially set to require molecules to cover at least 10 CpGs. Focusing on hyper-DMRs, it was then required that at least 80% of those CpGs be methylated in a fragment for the fragment to be considered highly methylated (and thus likely tumor-derived). For each DMR, a methylation ‘allele fraction’ (AF) was calculated as the proportion of fragments in the region with highly methylated states.
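The fragment-level rule above can be sketched in a few lines. This is a minimal illustration under stated assumptions: each fragment is represented as a list of per-CpG booleans (True for a methylated call), fragments covering fewer than the minimum number of CpGs are excluded from the denominator, and the function names are hypothetical rather than part of the disclosed pipeline.

```python
def is_highly_methylated(cpg_calls, min_cpgs=10, min_frac=0.8):
    """cpg_calls: list of booleans, True = methylated CpG on the fragment.

    Returns None when the fragment covers too few CpGs to be evaluated.
    """
    if len(cpg_calls) < min_cpgs:
        return None
    return sum(cpg_calls) / len(cpg_calls) >= min_frac


def methylation_allele_fraction(fragments, min_cpgs=10, min_frac=0.8):
    """Proportion of evaluable fragments in a region that are highly methylated."""
    calls = [is_highly_methylated(f, min_cpgs, min_frac) for f in fragments]
    evaluable = [c for c in calls if c is not None]
    if not evaluable:
        return None  # no fragment in the region met the CpG coverage requirement
    return sum(evaluable) / len(evaluable)


# Hypothetical fragments overlapping one hyper-DMR.
toy_fragments = [
    [True] * 10,                # 100% methylated: highly methylated
    [True] * 8 + [False] * 2,   # 80% methylated: meets the threshold
    [True] * 7 + [False] * 3,   # 70% methylated: below the threshold
    [False] * 12,               # unmethylated
    [True] * 5,                 # fewer than 10 CpGs: not evaluated
]
toy_af = methylation_allele_fraction(toy_fragments)
```

Keeping the two thresholds as parameters makes it straightforward to re-evaluate the definition of a highly methylated fragment with other cutoffs.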
[0138] Looking at the distributions of these DMR AFs in the cell line spike experiment data, it was observed that methylation AFs were correlated with intended spike tumor fraction, confirming their biological plausibility (Figs. 7A and 7B). Next, it was desired to develop a methodology to determine whether a given DMR AF was elevated compared to what might be expected by chance. For a given DMR, fragments from the hypomethylated control regions of the selector were randomly sampled, equal in number to the fragments in the DMR of interest. The AF of highly methylated fragments among those control region fragments was calculated using the same thresholds as above (minimum 10 CpGs with minimum 80% methylation), and this process was repeated 1000 times to generate a distribution of background AFs. Comparing the AF of the DMR of interest to
that background generated an empirical P-value and a percentile at which the DMR fell in relation to the background. This process was repeated for each hyper-DMR, and DMRs falling at the 100th percentile were considered to be ‘above background’. The total DMRs above background was then summed for each sample. Applying this methodology to the cell line spike experiment revealed a dose-response relationship between the number of DMRs above background and tumor fraction, suggesting that this metric was a good measure of plasma tumor burden (Fig. 7C). All spike levels except for the 0.005% condition had DMRs above background counts that were significantly higher than the unspiked healthy cfDNA (0%) (Student’s T-test, Fig. 7C). As further confirmation of the method, 12 additional healthy control samples and 12 advanced stage (stage IV) patient samples were sequenced using the preliminary mCAPP-Seq protocol and their DMRs above background were calculated. Confirming the results of the spike experiment, patient cfDNA samples showed elevated DMRs above background compared to control cfDNA (Fig. 7D).
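The background comparison described above can be illustrated with a short resampling sketch. Several details are assumptions not stated in the text: control fragments are drawn without replacement, the empirical P-value uses a common add-one correction, and "the 100th percentile" is interpreted as the observed AF exceeding every background draw. Inputs are per-fragment booleans (True for a highly methylated fragment), and all names are hypothetical.

```python
import random


def af_of(fragment_flags):
    """Fraction of fragments flagged as highly methylated."""
    return sum(fragment_flags) / len(fragment_flags)


def background_percentile(dmr_flags, control_flags, n_iter=1000, seed=0):
    """Compare a DMR's AF to AFs of size-matched draws from control regions.

    Returns (observed_af, percentile, empirical_p).
    """
    rng = random.Random(seed)
    observed = af_of(dmr_flags)
    n = len(dmr_flags)
    # Assumption: sampling without replacement, so the control pool must hold
    # at least as many fragments as the DMR.
    background = [af_of(rng.sample(control_flags, n)) for _ in range(n_iter)]
    n_below = sum(b < observed for b in background)
    percentile = 100.0 * n_below / n_iter
    empirical_p = (n_iter - n_below + 1) / (n_iter + 1)
    return observed, percentile, empirical_p


def dmrs_above_background(dmr_to_flags, control_flags):
    """Count DMRs whose AF exceeds every background draw (100th percentile)."""
    return sum(
        background_percentile(flags, control_flags)[1] == 100.0
        for flags in dmr_to_flags.values()
    )


# Hypothetical data: 1% of control fragments are highly methylated.
control = [False] * 990 + [True] * 10
elevated = background_percentile([True] * 30 + [False] * 70, control)
null = background_percentile([False] * 100, control)
n_above = dmrs_above_background(
    {"dmr_1": [True] * 30 + [False] * 70, "dmr_2": [False] * 100}, control
)
```

In this toy case the elevated DMR is guaranteed to sit above all background draws, because at most 10 of any 100 sampled control fragments can be highly methylated.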
Establishing the limit of detection of the assay
[0139] The controlled nature of the tumor fraction in an admixture experiment is useful for estimating the true limit of detection (LOD) of the assay, providing an early sense of the assay’s performance in clinical samples. It was found that the sum of DMRs above background was significantly elevated in tumor fractions as low as 0.5% compared to 0% tumor (pure cfDNA) (Fig. 7C, P < 0.01, Student’s T-test). It was observed that there was a log-linear relationship between spike level and DMRs above background between 0.005% and 0.5% AF. Above that level, DMRs above background became saturated, and below that, 0% AF was equivalent to 0.005% AF. The LOD was therefore estimated from the linear portion of the curve. Since 0% and 0.005% were equivalent, those 6 samples (labeled as 0.005% for the purposes of LOD calculation) were combined, and the mean + 3 standard deviations was calculated as a conservative estimate of the upper end of the DMRs metric at background. The AF at which that DMR value would be achieved was then estimated to lie at approximately 0.013% (Fig. 7E). This analysis estimated that tumor fractions as low as 0.02% would be significantly detected above background. The previous tumor-informed analysis from the Lung-CLiP cohort showed that at least half of
Stage I NSCLCs have ctDNA levels below 0.01 % in plasma. This suggests that a good portion of the earliest stage tumors might remain undetectable with the mCAPP-Seq assay. However, the orthogonal nature of methylation signal (relative to SNV signal) might still enable detection of cases missed by CLiP due to low mutation count or low genomewide copy number alteration.
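The LOD estimate described above (a background threshold of mean + 3 standard deviations, then reading off the tumor fraction at which a log-linear fit reaches that threshold) could be sketched as follows. The numbers in the example are invented purely to exercise the code and are not the measured values; the use of the sample standard deviation and an ordinary least-squares fit on log10 of the tumor fraction are assumptions of this sketch.

```python
import math
import statistics


def lod_from_spike(background_counts, spike_points):
    """Estimate a limit of detection from an admixture series.

    background_counts: DMRs-above-background counts for the pooled background
        samples (here, the 0% and 0.005% replicates).
    spike_points: (tumor_fraction_percent, dmrs_above_background) pairs taken
        from the log-linear part of the curve.
    Returns (threshold, lod_percent).
    """
    threshold = statistics.mean(background_counts) + 3 * statistics.stdev(
        background_counts
    )
    xs = [math.log10(af) for af, _ in spike_points]
    ys = [count for _, count in spike_points]
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    # Invert count = intercept + slope * log10(AF) at the threshold.
    lod = 10 ** ((threshold - intercept) / slope)
    return threshold, lod


# Hypothetical values, for illustration only.
toy_background = [2, 3, 2, 4, 3, 4]
toy_spike = [(0.005, 3), (0.05, 13), (0.5, 23)]
toy_threshold, toy_lod = lod_from_spike(toy_background, toy_spike)
```

Fitting on the log scale reflects the log-linear relationship noted in the text; saturated spike levels would be left out of `spike_points`.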
Optimizing molecular biology for targeted cfDNA methylation sequencing
[0140] Given the limited amount of plasma that can be obtained from a given patient and the high depth of sequencing needed to resolve low tumor-fraction events, it was important to optimize our molecular biology protocol for this specific application.
Comparison of molecular conversion methods for detecting methylated cytosines
[0141] Sodium bisulfite conversion has long been the standard for detecting CpG methylation at base resolution. Bisulfite conversion works by deaminating unmethylated cytosines and converting them to uracils while leaving 5-methylcytosines (5mC) and 5-hydroxymethylcytosines (5hmC) intact. Thus, after sequencing, when comparing a given read to the reference genome, sites that have a C in the reference but are read as T can be inferred to have been unmethylated, whereas sites with a reference C that are read as C can be inferred to have been methylated.
[0142] However, bisulfite treatment is known to involve harsh temperature and pH conditions that can damage DNA and lead to loss of material. In recent years, enzymatic conversion approaches have emerged as alternatives to bisulfite that purport improved conversion and recovery. It was therefore sought to determine whether these alternatives might outperform bisulfite for cfDNA applications, and an experiment was designed to compare bisulfite to two enzymatic approaches.
[0143] The two enzymatic alternatives are Enzymatic Methyl-seq (EM-Seq) and TET-assisted pyridine borane sequencing (TAPS) (R. Vaisvila, et al., Genome Res. 2021 Jul;31(7):1280-1289; and Y. Liu, et al., Nat Biotechnol. 2019 Apr;37(4):424-429; the disclosures of which are incorporated herein by reference). EM-Seq employs the TET2 enzyme to first oxidize 5mC to 5caC, protecting it from further conversion by APOBEC3A. Subsequently, APOBEC3A deaminates C and 5mC (but not 5caC), converting them to Ts. Thus, unmethylated cytosines are converted to thymines and methylated cytosines are protected from conversion, resulting in the same sequence as bisulfite would produce. TAPS also begins with a TET-mediated oxidation of 5mC to 5caC. However, TAPS then proceeds with a chemical treatment with pyridine borane to reduce those 5caCs to dihydrouracil (DHU), which is then read as T after PCR. Therefore, TAPS converts methyl-C to T, unlike bisulfite or EM-Seq, resulting in a different final sequence. By converting only methylated Cs, the final TAPS sequence would have higher complexity. However, the chemical treatment may also prove harsh on the DNA. All three conversion methods were compared and evaluated for their performance using cfDNA.
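The different read-outs of the three conversion chemistries can be made concrete with a small in-silico sketch. It is a deliberate simplification and not a model of the laboratory protocols: it assumes complete conversion, considers only the original top strand, ignores 5hmC, and uses hypothetical function names.

```python
def convert(seq, methylated_positions, method):
    """Return the sequence read after conversion and PCR (top strand only).

    seq: original sequence; methylated_positions: set of indices carrying 5mC;
    method: 'bisulfite', 'em-seq', or 'taps'.
    """
    out = []
    for i, base in enumerate(seq):
        if base != "C":
            out.append(base)
        elif method in ("bisulfite", "em-seq"):
            # Unmethylated C is read as T; methylated C is protected.
            out.append("C" if i in methylated_positions else "T")
        elif method == "taps":
            # Only methylated C is converted and read as T.
            out.append("T" if i in methylated_positions else "C")
        else:
            raise ValueError(f"unknown method: {method}")
    return "".join(out)


def call_methylation(reference, read, method):
    """Infer methylation at each reference C from an aligned, gap-free read."""
    methylated_base = "T" if method == "taps" else "C"
    return {
        i: read[i] == methylated_base
        for i, ref_base in enumerate(reference)
        if ref_base == "C" and read[i] in "CT"
    }


reference = "ACGTCGAC"
methylated = {1}  # hypothetical: only the first CpG cytosine is methylated
reads = {m: convert(reference, methylated, m) for m in ("bisulfite", "em-seq", "taps")}
calls = {m: call_methylation(reference, reads[m], m) for m in reads}
```

Bisulfite and EM-Seq yield the same read, while the TAPS read retains unmethylated cytosines and so stays closer to the reference sequence, which is the higher-complexity property noted above.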
[0144] The main readouts of this experiment would be molecule recovery (measured as unique sequencing depth), conversion efficiency, and mapping rate. Conversion efficiency can be measured with non-human or synthetic spike-in control DNA that is either fully unmethylated (for bisulfite and EM-Seq) or fully methylated (for TAPS). Mapping rate can be measured with any reads distributed across the human genome. However, unique molecule recovery would be best measured in high-depth sequencing, necessitating a targeted sequencing approach. A 44kb ‘miniselector’ was designed that covered the DMR regions, control regions, and a few tissue-specific regions from the larger sequencing panel. As TAPS produces a different final sequence than bisulfite or EM-Seq, distinct probe pools for bisulfite/EM-seq and TAPS were required. We also hoped to compare the three conversion methods to unconverted DNA as a reference. Thus, a third bait set covering the same regions was designed for unconverted DNA.
[0145] Using a fixed DNA input, methylation sequencing libraries were prepared from 3 healthy donor cfDNA samples using either bisulfite conversion, EM-Seq, TAPS, or no conversion; the libraries were then captured with the appropriate bait set and sequenced. It was found that TAPS had a higher mapping rate than EM-Seq or bisulfite (Fig. 8A). However, after deduplication, EM-Seq showed a higher unique molecule recovery than either of the other two conversion methods despite DNA input being equal (Fig. 8B). Furthermore, EM-Seq demonstrated significantly higher conversion efficiency than bisulfite as measured by unmethylated lambda control DNA (Fig. 8C). EM-Seq was selected as the preferred conversion method for the mCAPP-Seq protocol moving forward.
Minimum input requirements of EM-Seq
[0146] Next, given the precious and limited nature of cfDNA material alluded to above, it was desired to optimize the DNA amount that should be input into library preparation. As a first step, it was desired to assess the lower limit of input required to make a successful sequencing library. To do this, libraries were prepared in duplicate with 5 different cfDNA inputs ranging from 30ng down to 1 ng. After sequencing, it was found that deduplicated depth was correlated with DNA input, and that libraries could be prepared with as little as 1 ng cfDNA (Fig. 9A). However, the loss in depth associated with this low input may not be conducive to low-AF detection problems. It was therefore desired to find the sweet spot at which sensitivity for detection would be maximized while preserving the remaining cfDNA to use for other applications.
Relationship between DNA input and LOD
[0147] Another cell line spike experiment was designed to investigate the relationship between DNA input and LOD. Using NCI-H441 cell line DNA, cell line-healthy cfDNA admixtures were generated at 4 different tumor fractions (0.5%, 0.1%, 0.05%, and 0%), and 3 different DNA input amounts (5ng, 10ng, and 30ng) of each AF were tested in triplicate (36 libraries total). After sequencing, it was found that unique molecule recovery was correlated with DNA input into library preparation (Fig. 9B). Furthermore, a relationship was observed between tumor fraction and DMRs above background as seen in the previous spike experiment (Fig. 9C). Importantly, it was found that while DMR signal in the 0.05% AF samples was obviously elevated over the 0% condition at 30ng input, it was not elevated in the 5ng or 10ng input conditions, demonstrating how LOD is dependent on DNA input.
[0148] However, an important caveat of this experiment was that our DNA inputs reflected the total DNA put into library preparation, including any genomic DNA that may have been contaminating the cfDNA. As described earlier, cfDNA consists of short, nucleosome-bound DNA molecules that exhibit a multimodal size distribution. When residual intact blood cells remain in the plasma after centrifugation of the blood, their genomic DNA can end up in the ‘cfDNA’ eluate after DNA isolation. These large gDNA
molecules are counted toward the total quantification of DNA when measured by fluorescence but are unlikely to be converted into sequenceable library molecules due to their size. To adjust for this, we typically perform fragment analysis on all cfDNA samples to view their fragment size distribution and check for gDNA contamination. The fragment analyzer results can be used to scale the DNA input by the percent of fragments in the typical cfDNA size range (50-450bp) to better control the input of molecules that will become sequenceable library. In the above spike experiment, the input was not scaled in this manner, and thus the 5, 10, and 30ng inputs included any gDNA present. Fragment analysis of the healthy cfDNA sample used as the denominator for the spike showed that in fact the sample contained only about 50% of molecules in the 50-450bp size range. Therefore, the adjusted inputs for the spike experiment were closer to 2.5ng, 5ng, and 15ng DNA.
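The input adjustment described above is simple arithmetic, sketched here with hypothetical function names. The only inputs are the total DNA mass and the fraction of the sample falling in the 50-450bp window as reported by fragment analysis.

```python
def adjusted_input_ng(total_ng, cfdna_fraction):
    """cfDNA mass actually contributed by a given total DNA input."""
    return total_ng * cfdna_fraction


def total_needed_ng(target_adjusted_ng, cfdna_fraction):
    """Total DNA mass to load so that the cfDNA portion reaches the target."""
    if not 0 < cfdna_fraction <= 1:
        raise ValueError("cfdna_fraction must be in (0, 1]")
    return target_adjusted_ng / cfdna_fraction


# With about 50% of the DNA in the cfDNA size range, the nominal 5, 10 and
# 30 ng inputs correspond to 2.5, 5 and 15 ng of cfDNA.
adjusted = [adjusted_input_ng(x, 0.5) for x in (5, 10, 30)]
needed_for_15 = total_needed_ng(15, 0.5)
```

The second helper runs the same relationship in reverse, giving the total mass to load for a desired adjusted input.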
[0149] To zoom in on the LOD further, and to test whether a higher adjusted DNA input would improve sensitivity, the spike experiment was repeated with different inputs and AFs, again with each combination in triplicate: 10ng adjusted, 15ng adjusted, and 30ng adjusted; and 0.1 %, 0.05%, 0.025%, and 0% (Fig. 9D). This time, we found that 0.05% had elevated DMRs detected compared to 0% for all three DNA inputs. While 0.025% did not reach statistical significance, 2 of 3 replicates of 0.025% were elevated above 0% for both 15ng and 30ng inputs. 15ng (in the 50-450bp size range) was selected as the appropriate input for mCAPP-Seq to maximize sensitivity while also preserving DNA for additional uses, including the Lung-CLiP assay.
Applying mCAPP-Seq to NSCLC patients and risk-matched controls
[0150] Ultimately, a goal with mCAPP-Seq was to test its ability to detect the presence of lung cancer DNA in real patient plasma samples. It was planned to test its sensitivity and specificity for detection in an early-stage (Stage I-III) lung cancer cohort and risk-matched controls. First, though, it was desired to confirm again that the optimized method yielded the expected signal in a high tumor-burden setting in which one could expect substantial ctDNA content. The method was applied to a pilot cohort of healthy control and stage IV NSCLC patient samples.
Targeted methylation signal in advanced disease
[0151] mCAPP-Seq libraries were made with the optimized EM-Seq protocol from the cfDNA of 12 healthy controls and 12 Stage IV NSCLC patient samples, captured with the targeted panel, and sequenced to a target of 100 million paired reads per sample. After mapping and deduplication, methylation calls were extracted for every CpG and DMRs above background were calculated as described above. Confirming prior observations using bisulfite sequencing (Fig. 7D), AFs in the DMRs were higher in stage IV patients compared to controls (Fig. 10A). Consequently, DMRs above background were also significantly higher in patients (Fig. 10B). This test case again confirmed that the method would work in patient plasma, beyond the contrived setting of admixture samples. However, detection of early-stage, low-ctDNA cases would need to be assessed.
Applying mCAPP-Seq to early-stage disease
Assembling training and validation cohorts of early-stage NSCLC patients and high-risk controls
[0152] Cohorts were assembled from NSCLC patients at Stanford, MGH, MSK, and the Mayo Clinic who were treated with curative intent and had plasma collected prior to treatment. Patients from Stanford and Mayo (n = 117) were assigned to the training cohort for the methylation assay. The control cohorts were collected from patients at Stanford and MGH who were undergoing preventative low-dose CT screening for lung cancer based on their significant smoking history, making them risk-matched to the patient cohort. Risk-matched healthy controls from both Stanford and MGH (n = 97), as well as patients with benign granulomas collected at the Mayo Clinic (n = 10), were assigned to the training cohort. Having an independently collected validation cohort was considered critical for testing the assay’s performance in a robust and reliable way. NSCLC patient samples collected at MGH and MSK (n = 84) were assigned to the validation cohort. Control samples collected at MGH after 2018 (n = 78) were assigned to the validation cohort.
cfDNA availability in patient samples
[0153] cfDNA yield was primarily determined by available plasma volume, which varied by center and collection protocol. Across all control and patient samples extracted, a median of 6.7 ml of plasma was available for extraction (range 1.7-14.0, Fig. 11A). cfDNA isolation from all samples yielded a median of 59.9 ng for controls and 87.5 ng for NSCLC patients (Fig. 11B). All samples were analyzed for genomic DNA contamination with Agilent Fragment Analyzer, showing high cfDNA fractions (50-450bp) for most samples (Fig. 11C). After adjusting for gDNA contamination in the plasma by considering only the cfDNA in the 50-450bp size range, total adjusted cfDNA yields remained higher on average in patients than in controls (Fig. 11D). Even after normalizing for plasma volume, plasma cfDNA concentration in ng/ml considering only 50-450bp cfDNA also remained higher in patients than controls (Fig. 11E).
Experimental plan
[0154] In addition to the goal of evaluating the performance of mCAPP-Seq to detect early-stage NSCLC noninvasively, it was also desired to compare mCAPP-Seq to the previously published method Lung-CLiP (J. J. Chabon, et al., Nature. 2020 Apr;580(7802):245-251; the disclosure of which is incorporated herein by reference). It was reasoned that while detection between the two methods might be correlated, there may be cases detected by one method and not the other, and that an integrated approach might outperform either method alone. To accomplish this, both assays would need to be run on the same samples, necessitating the 15ng adjusted cfDNA required for mCAPP-Seq plus an additional minimum of 20ng cfDNA as input for CAPP-Seq. Based on the plasma volumes available and resultant cfDNA yields, only 55-69% of the samples, depending on cohort, would have sufficient cfDNA for both assays (Fig. 11F). Still, this would provide a pilot cohort to test the integration of the two methods. The experimental plan would be to test the methylation-only assay on the full cohort as outlined above. Then, in patients for whom enough cfDNA was available to additionally run CAPP-Seq, a subset analysis would be performed to test the integration. In the end, 73 patients and 56 controls would be run with both methods for the training cohort, and 58 patients and 45 controls would be run with both methods for the validation cohort.
Initial Results
[0155] Samples from the training cohort were analyzed first. Libraries were prepared from 97 risk-matched controls, 10 granuloma controls, and 117 early-stage NSCLC cfDNA samples, using a fixed input of 15 ng cfDNA in the 50-450bp size range. Samples were ligated to UMI-containing methylated duplex adapters, and unmethylated cytosines were converted to uracils via EM-Seq before amplification with 11 cycles of PCR. Libraries were captured with the targeted sequencing panel designed specifically for NSCLC detection. After capture, libraries were sequenced to a target of 80 million paired-end reads (i.e., 40 million read pairs).
[0156] The quality of the sequencing data was assessed. Despite aiming for equal library representation on each lane, libraries received a wide range of total read counts (median 37.5M read pairs, range 14.8M - 63M), but read counts were not different between patients and controls (Fig. 12A). After removal of PCR duplicates using unique molecular identifiers (UMIs), median selector-wide depths were also similar between patients and controls (median 1256x across all training samples, range 263x - 2467x, Fig. 12B). There was a significant effect of non-deduped read count on deduped depth as might be expected, also suggesting that some low-depth samples could be rescued if necessary by adding more sequencing reads (Fig. 12C). Conversion efficiencies as measured by unmethylated lambda control DNA were high across all samples.
[0157] The DMRs above background analysis was applied to this cohort of patient and control samples. Methylated allele fractions in hyper-DMRs were higher in patients than controls (Fig. 12C). While DMRs above background also trended higher in patients than controls (Fig. 12D), it was found that the sensitivity of this metric at 95% specificity was insufficient for early detection (Fig. 12E). However, it was reasoned that DMRs above background was a simplistic measure of signal, and that it might be possible to improve detection performance through statistical learning.
Development of a methylation-based classifier for cancer detection
[0158] As described above, methylation ‘allele fractions’ (AFs), or the percent of fragments having highly methylated states in a region, were calculated for each DMR of interest. Instead of summing the regions above background as in previous analyses, these AFs were used as features in a machine learning model. For all samples in the training cohort, a leave-one-out (LOO) framework was developed in which a LASSO logistic regression classifier was trained on all but one sample, with DMR AFs as features. The held-out sample was scored with the trained model, and the process was repeated for each sample.
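The per-region allele fraction described above can be illustrated with a minimal sketch. It assumes each fragment has already been reduced to a pair (number of CpGs covered, number of methylated CpGs); the function name, the parameter names, the use of an at-or-above comparison, and the choice to count only fragments meeting the minimum CpG requirement in the denominator are assumptions made for illustration and are not taken from the disclosure.

```python
from typing import Iterable, Tuple


def methylated_allele_fraction(
    fragments: Iterable[Tuple[int, int]],
    min_cpgs: int,
    min_percent_methylated: float,
) -> float:
    """Fraction of assessed fragments in one region that are highly methylated.

    Each fragment is (n_cpgs, n_methylated_cpgs). Fragments covering fewer
    than `min_cpgs` CpGs are not assessed (an assumption of this sketch).
    """
    assessed = 0
    highly_methylated = 0
    for n_cpgs, n_methylated in fragments:
        if n_cpgs < min_cpgs:
            continue  # too few CpGs to judge the methylation state
        assessed += 1
        # Compare as n_methylated / n_cpgs >= percent / 100 without dividing.
        if n_methylated * 100 >= min_percent_methylated * n_cpgs:
            highly_methylated += 1
    return highly_methylated / assessed if assessed else 0.0


# Hypothetical region: three assessable fragments, two of them >= 80% methylated.
example_fragments = [(15, 14), (12, 6), (20, 16), (5, 5)]
example_af = methylated_allele_fraction(example_fragments, min_cpgs=12,
                                        min_percent_methylated=80)
```

Computing this value for every DMR of a sample yields the feature vector that is passed to the classifier.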
[0159] Because the initial definition of a ‘highly methylated molecule’ was based on heuristics, that definition was optimized to maximize LOO performance. A range of minimum CpGs per fragment and minimum percent methylated was tested in all pairwise combinations, and LOO sensitivity at 95% specificity was used to measure performance. In each iteration of the LOO, a threshold corresponding to 95% specificity was set in the training data, and the held-out sample was called as detected if its score exceeded that threshold. True specificity was calculated as the percent of controls that, when held out, were correctly classified as noncancer. From this analysis, it was found that in contrast to the 80% minimum methylation threshold that had been previously employed, 60% minimum methylation showed increased detection sensitivity across all minimum CpGs per fragment thresholds (Fig. 13A). True specificity was well-calibrated at approximately 96% in the held-out samples (Fig. 13B).
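The leave-one-out evaluation with a specificity threshold set in the training data can be sketched as follows. This is an illustrative outline only: the `fit` argument stands in for training the LASSO logistic regression classifier, and the placeholder `fit_mean_af` (which simply scores a sample by its mean AF) as well as all function names are hypothetical and not part of the disclosure.

```python
import math
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]         # (features, label); 1 = cancer, 0 = control
Scorer = Callable[[Sequence[float]], float]  # maps a feature vector to a score


def specificity_cutoff(control_scores: Sequence[float], specificity: float) -> float:
    """Lowest control score with at least `specificity` of controls at or below it."""
    ordered = sorted(control_scores)
    k = max(1, math.ceil(specificity * len(ordered)))
    return ordered[k - 1]


def leave_one_out(samples: List[Sample],
                  fit: Callable[[List[Sample]], Scorer],
                  specificity: float = 0.95) -> Tuple[float, float]:
    """Return (LOO sensitivity, true LOO specificity)."""
    detected_cases = n_cases = correct_controls = n_controls = 0
    for i, (features, label) in enumerate(samples):
        training = samples[:i] + samples[i + 1:]   # all but the held-out sample
        scorer = fit(training)
        # The threshold is set using training controls only.
        cutoff = specificity_cutoff(
            [scorer(x) for x, y in training if y == 0], specificity)
        detected = scorer(features) > cutoff
        if label == 1:
            n_cases += 1
            detected_cases += detected
        else:
            n_controls += 1
            correct_controls += not detected
    return detected_cases / n_cases, correct_controls / n_controls


def fit_mean_af(training: List[Sample]) -> Scorer:
    """Placeholder model that ignores the training data: score = mean AF."""
    return lambda features: sum(features) / len(features)
```

Repeating this evaluation for each combination of minimum CpGs per fragment and minimum percent methylated, and recomputing the AF features each time, gives the grid of LOO sensitivities used to select the definition of a highly methylated molecule.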
[0160] Next, it was considered what other features, beyond the DMR AFs themselves, could be extracted for inclusion in a classifier. First, methods to summarize the DMR AF distribution as higher-order summary statistics were explored. It was hypothesized that these descriptive summary statistics might create a more robust and generalizable model. To do this, the distribution of DMR AFs for each sample was visualized in the early-stage cohort as well as in the Stage IV patients that had previously been sequenced (Fig. 13C). In looking at the distribution of AFs, a clearly visible upward shift in the distribution in Stage IV patients compared to controls was observed. In the early-stage cohort, distributions were mixed, with some being clearly elevated and others looking more like control samples. The aim was to summarize the distribution of AFs and use those summary features as inputs to a machine learning model. To do this, for each patient the percentiles of the AF distribution ranging from the 99th percentile to the 50th percentile (median) were calculated and used as features in the logistic regression model.
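As a minimal sketch of this summarization, the percentile features for one sample could be computed as shown below. The specific percentiles in the default tuple, the linear interpolation rule, and the feature names are assumptions chosen for illustration; the disclosure states only that percentiles from the 50th to the 99th were used.

```python
import math
from typing import Dict, Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """q-th percentile (0-100) using linear interpolation between order statistics."""
    if not values:
        raise ValueError("percentile of an empty sequence is undefined")
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = math.floor(position)
    upper = math.ceil(position)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def af_percentile_features(
    region_afs: Sequence[float],
    percentiles: Sequence[int] = (50, 60, 70, 80, 90, 95, 99),  # hypothetical grid
) -> Dict[str, float]:
    """Summarize one sample's per-DMR AFs as percentile features."""
    return {f"af_p{q}": percentile(region_afs, q) for q in percentiles}


# Hypothetical sample with five DMR allele fractions.
example_features = af_percentile_features([0.0, 0.0, 0.1, 0.2, 0.4])
```

Because each sample is reduced to a handful of distribution-level values, the resulting model has far fewer inputs than one using an AF per region.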
[0161] To explore additional sources of signal that could further enhance detection sensitivity, fragment-level features of highly methylated molecules in DMR regions were investigated. It was found that patients tended to have fragments with more methylated CpGs in hyper-DMRs compared to controls (Fig. 13D). To featurize this observation, for each sample, the maximum number of methylated CpGs in any fragment was identified for every region. The distribution of those maximums across all DMR regions was summarized by taking the median, maximum, and 90th percentile of the distribution. It was found that the median maximum number of methylated CpGs, which was termed the Fragment Methylation Index at the 50th percentile (FMI50), was associated with stage, suggesting it was a tumor-associated signal consistent with ctDNA biology (Fig. 14A). To further validate this metric and ensure that the increase in FMI in patients was not due to gross fragment size disparities between patients and controls, the median fragment size was calculated across all DMR fragments, and no difference was found between patients and controls (Fig. 14B). There was also no difference in the fraction of all DMR fragments longer than 300bp (Fig. 14C). When considering only fragments with sufficient CpGs to meet the DMR AF filtering criteria (min. 12 CpGs), again no difference in median fragment length or fraction of fragments >300bp was observed between patients and controls (Figs. 14D and 14E). Finally, the fragments in the DMR regions with the most CpGs were considered, regardless of methylation state. The median length of these fragments was not different between patients and controls (Fig. 14F), and the median number of CpGs in these fragments was slightly higher in patients than in controls, although the distributions overlapped almost entirely (Fig. 14G). Together, these data suggested that the FMI signal was not driven by a gross difference in fragment size distribution between patients and controls.
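The fragment methylation index features can be sketched as below, assuming the per-fragment counts of methylated CpGs have already been grouped by DMR. Only the name FMI50 comes from the text above; the keys `FMI90` and `FMImax`, the interpolation rule, and the decision to skip regions without fragments are hypothetical choices made for this illustration.

```python
import math
import statistics
from typing import Dict, Mapping, Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """q-th percentile (0-100) with linear interpolation between order statistics."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower, upper = math.floor(position), math.ceil(position)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def fragment_methylation_indices(
    methylated_cpgs_by_region: Mapping[str, Sequence[int]],
) -> Dict[str, float]:
    """Summarize, across regions, the maximum methylated-CpG count seen in any fragment."""
    # One maximum per region; regions with no fragments are skipped (assumption).
    region_maxima = [max(counts) for counts in methylated_cpgs_by_region.values() if counts]
    if not region_maxima:
        raise ValueError("no region contains any fragments")
    return {
        "FMI50": statistics.median(region_maxima),
        "FMI90": percentile(region_maxima, 90),
        "FMImax": max(region_maxima),
    }


# Hypothetical sample: methylated-CpG counts per fragment, keyed by region name.
example_fmi = fragment_methylation_indices({
    "dmr_a": [3, 8, 5],
    "dmr_b": [12, 2],
    "dmr_c": [],
    "dmr_d": [7],
    "dmr_e": [10, 10],
    "dmr_f": [1],
})
```

In this example the per-region maxima are 8, 12, 7, 10, and 1, so the median maximum (FMI50) is 8.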
[0162] Using the different classes of identified features (AF features, AF distribution percentile features, and fragment methylation index features), a series of statistical models was trained in a leave-one-out (LOO) framework for classifying samples as cancer vs. control. Because this cohort was highly enriched for small, stage I tumors, making detection especially difficult, it was important to consider detection sensitivity by stage, where one would expect a correlation between stage and sensitivity that would be an important confirmation of the biological plausibility of the model. In a model using only DMR AFs, sensitivity for Stage IA tumors was 28%, with sensitivity then increasing with increasing stage (Fig. 15A). As a further validation of the model, a single model was trained using all of the Stage I-III samples and controls, and that model was used to score held-out Stage IV samples. Stage IV showed the highest sensitivity, again confirming biological plausibility (Fig. 15A). Examining the features that were selected by each LOO iteration of the LASSO, it was found that most models had 4 nonzero coefficients (Fig. 15B) and that a handful of regions were selected by most models, suggesting their feature importance (Fig. 15C). AFs of these regions were also correlated with patient stage (data not shown).
[0163] Next, a model was trained using the AF percentile features described above (Fig. 15D). Sensitivity for that model also increased with stage, although sensitivity was slightly lower than for the AF-based model (Fig. 15D). However, it had yet to be determined whether the AF-based model was more prone to overfitting than the summary statistics model. DMR AF features were then combined with the fragment methylation index features described above, and the resulting model again achieved stage-associated sensitivity (Fig. 15E). In this model, both AF features and fragment-based features were selected across iterations of the leave-one-out (Fig. 15F).
[0164] An additional model was trained using the AF percentile features and the fragment methylation index features. This model again showed stage-associated sensitivity, with specificity calibrated to a target of 95% (Fig. 16A) and a positive predictive value (PPV) of 89%. Interestingly, only fragment-level features were selected in these models (Fig. 16B). With this model, the detection rate for lung squamous cell carcinomas was found to be higher than for adenocarcinoma (Fig. 16C). To compare the results to the leading commercial method, performance was calculated by stage for adenocarcinoma alone, and detection of Stage I tumors was 22%, a 3x improvement over the leading method (Fig. 16D) (for more on the leading model, see X. Chen, et al., Clin Cancer Res. 2021 Aug 1;27(15):4221-4229; the disclosure of which is incorporated herein by reference). Finally, in thinking about how this method might be applied clinically, it was considered that in screening a high-risk population, where disease prevalence would be higher than in the general population, a lower specificity threshold could be tolerated while still maintaining a high PPV. Sensitivities at 80% specificity were therefore reported, and increased detection rates for stage I and stage II tumors were found (Fig. 16E). An increase in Stage III detection was not observed, but this may be due to low ctDNA levels for these patients.
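The reasoning about prevalence and PPV follows from the standard relationship PPV = (sensitivity × prevalence) / (sensitivity × prevalence + (1 − specificity) × (1 − prevalence)). The sketch below evaluates that relationship; the function name and the numerical inputs are hypothetical and are not results from this study.

```python
def positive_predictive_value(sensitivity: float, specificity: float,
                              prevalence: float) -> float:
    """PPV from sensitivity, specificity, and disease prevalence (all in [0, 1])."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1.0 - specificity) * (1.0 - prevalence)
    if true_positive_rate + false_positive_rate == 0:
        raise ValueError("PPV is undefined when no sample tests positive")
    return true_positive_rate / (true_positive_rate + false_positive_rate)


# Illustrative values only: the same test applied at two different prevalences.
ppv_low_prevalence = positive_predictive_value(0.5, 0.8, prevalence=0.02)
ppv_high_prevalence = positive_predictive_value(0.5, 0.8, prevalence=0.20)
```

At a fixed sensitivity and specificity, the PPV rises with prevalence, which is why a lower specificity threshold may be acceptable when screening a population enriched for disease.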
Conclusions
[0165] Lung cancer screening has the potential to significantly improve patient outcomes, and blood-based assays represent an attractive complement to imaging. Building on previous work in which a classifier was developed for determining Lung Cancer Likelihood in Plasma (Lung-CLiP) from genetic features of cell-free DNA, in this study a method was developed to test whether harnessing epigenetic cell-free DNA features would enhance early lung cancer detection. To do this, methyl-CAPP-Seq (mCAPP-Seq) was developed, a variant of the CAPP-Seq method that incorporates targeted, deep methylation sequencing and bioinformatic tools to extract tumor-associated methylation states from cfDNA. It was found that NEB Enzymatic Methyl-seq (EM-seq) outperformed the gold standard bisulfite conversion both in preserving DNA integrity and in properly converting unmethylated cytosines to thymines. A framework was developed to identify highly methylated reads, and it was found that this signal was associated with tumor DNA fraction in the plasma in both cell line admixture experiments and primary NSCLC patient samples. Finally, logistic regression classifiers were developed to distinguish healthy plasma samples from NSCLC samples via their methylation signals, and model performance was found to be better than that of prior methods and biologically plausible.
[0166] Although lung cancer early detection from the blood remains a technical challenge, this study represents a technological advance in the field. First, this study is singular in its targeted sequencing approach, which allows for high unique coverage over a small genomic space to efficiently utilize sequencing reads. From these data, features of cell-free DNA methylation were identified that are correlated with lung cancer tumor burden and which have not previously been described. Most significantly, the models developed herein have demonstrated significantly higher sensitivity compared to a leading published method, in which sensitivity for Stage I adenocarcinoma was ~7%. Even a 15% increase in sensitivity for these small tumors would mean significantly improved outcomes for those patients.
Methods
Differential methylation from 450K array data
[0167] Illumina Infinium HumanMethylation450 (450k array) data were downloaded in processed form (beta values) from TCGA via the UCSC Xena Browser or from published datasets via GEO (accession numbers GSE32148, GSE41169, GSE54670, GSE73745, GSE53045, GSE107205, GSE35069 for blood samples; GSE52401 and GSE66836 for normal lung). Beta values were transformed to M-values (P. Du, et al., BMC Bioinformatics. 2010 Nov 30;11:587; the disclosure of which is incorporated herein by reference) before using limma (M. E. Ritchie, et al., Nucleic Acids Res. 2015 Apr 20;43(7):e47; the disclosure of which is incorporated herein by reference) to identify differentially methylated CpG sites between either LUAD or LUSC samples and blood and normal lung samples (as one group). Limma P-values were adjusted for multiple hypothesis testing with the Benjamini-Hochberg method. CpG island annotations and gene context annotations for selected CpGs were performed with the R package ‘annotatr’ (R. G. Cavalcante and M.A. Sartor, Bioinformatics. 2017 Aug 1;33(15):2381-2383; the disclosure of which is incorporated herein by reference).
mCAPP-Seq library preparation
[0168] Libraries from cfDNA or sheared genomic DNA were prepared with the KAPA HyperPrep kit. Unmethylated sheared lambda phage DNA (~1 pg) and methylated pUC19 DNA (~0.3pg) were added to each sample, before bringing the sample volume to 50ul in nuclease-free water. End-repair and A-tailing (ER/AT) were performed per the KAPA HyperPrep protocol. After ER/AT, 100-fold molar excess of methylated partial Y-adapters was added to each sample. Adapters contained insert UMIs but were methylated at every cytosine position to be resistant to conversion. Samples were ligated overnight at 4°C. After ligation, a 1X SPRI bead cleanup was performed and samples were eluted in the volume of water required for either the bisulfite or EM-Seq protocol.
Bisulfite conversion
[0169] After ligation with methylated Y-adapters, bisulfite conversion was performed with the Qiagen Epitect kit using the “Sodium Bisulfite Conversion of Unmethylated Cytosines in Small Amounts of Fragmented DNA” protocol according to the manufacturer’s instructions.
EM-Seq conversion
[0170] After ligation to methylated adapters and bead cleanup, EM-Seq conversion was performed with the NEB EM-seq conversion module (NEB #E7125) according to the manufacturer’s instructions, using formamide as the denaturing agent. After conversion, PCR was performed as described in “Post-conversion grafting PCR” for 7 cycles. Libraries were further amplified with universal PCR for 4-7 cycles (depending on DNA input; 4 cycles in most cases).
TAPS conversion
[0171] Libraries were ligated to standard (unmethylated) partial Y-adapters with KAPA HyperPrep overnight at 4°C and purified with a 1X SPRI bead cleanup. Libraries were converted using a TAPS protocol as previously described (Y. Liu, et al., Nat Biotechnol. 2019 Apr;37(4):424-429; the disclosure of which is incorporated herein by reference) with the following modifications. After conversion, PCR was performed with KAPA HiFi Uracil+ master mix as described in “Post-conversion grafting PCR,” using custom dual-index primers for 7 cycles. After grafting PCR, libraries were further amplified with 6 cycles of universal PCR.
Post-conversion grafting PCR
[0172] After conversion with either bisulfite, EM-Seq, or TAPS, libraries were amplified and indices added through grafting PCR. Briefly, 25ul KAPA HiFi Uracil+ enzyme master mix and 2ul 12uM forward + reverse index primers were added to each sample, and amplified on a thermal cycler with the following protocol:
95°C for 2 min
Repeat for 7 cycles:
98°C for 30 sec
60°C for 30 sec
72°C for 4 min
72°C for 10 min
4°C hold
[0173] Primers contained dual-indexed sample barcodes. Post-PCR samples were cleaned up with a 1X SPRI bead cleanup and eluted in 24ul nuclease-free water.
Universal PCR
[0174] After index grafting PCR, universal PCR was performed to further amplify the libraries. Cycle number varied depending on DNA input and conversion method. 25ul KAPA HiFi HotStart ReadyMix and 1ul 100uM forward + reverse universal primers were added to the 24ul of library. Samples were amplified on a thermal cycler using the following protocol:
98°C for 45 sec
Repeat for 4-7 cycles:
98°C for 15 sec
60°C for 30 sec
72°C for 30 sec
72°C for 1 min
4°C hold
Sequencing
[0175] Samples were sequenced on an Illumina HiSeq or NovaSeq6000. All samples for the early-stage detection cohort were sequenced on a NovaSeq6000 targeting 40M pairs of reads per sample. Actual read counts ranged from 14.8-63M read-pairs, with a median of 37.5M read-pairs. Read counts were not significantly different between patients and controls.
Data processing, alignment, and deduplication
[0176] Sequencing data were demultiplexed using in-house scripts, and adapter read-through was trimmed with fastp (S. Chen, et al., Bioinformatics. 2018 Sep 1;34(17):i884-i890; the disclosure of which is incorporated herein by reference). Samples were then mapped to the human genome with Bismark (F. Krueger and S. R. Andrews, Bioinformatics. 2011 Jun 1;27(11):1571-2; the disclosure of which is incorporated herein by reference). PCR duplicates were removed with in-house scripts. Methylation states of all CpGs were extracted with Bismark. CpG states were summarized at the fragment- and region-level using custom python scripts.
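The custom scripts themselves are not disclosed. As one hedged illustration of a read-level CpG summary, the sketch below counts CpG calls in a Bismark methylation call string (the XM tag of an aligned read), in which `Z` marks a methylated cytosine in CpG context and `z` an unmethylated one. The function name is hypothetical, and merging the two mates of a read pair into a single fragment (including handling of overlapping positions) is deliberately left out.

```python
from typing import Tuple


def summarize_cpg_calls(xm_tag: str) -> Tuple[int, int]:
    """Return (total CpG calls, methylated CpG calls) for one Bismark XM string.

    In Bismark's XM encoding, 'Z' is a methylated C in CpG context and 'z' an
    unmethylated C in CpG context; other characters (CHG, CHH, unknown
    context, or non-cytosine positions) are ignored here.
    """
    methylated = xm_tag.count("Z")
    unmethylated = xm_tag.count("z")
    return methylated + unmethylated, methylated


# Hypothetical call string containing three CpG calls, two of them methylated.
example_total, example_methylated = summarize_cpg_calls("..Z..z.h.Z.x...H")
```

Pairs of this form, one per fragment, are the kind of input from which per-region allele fractions and fragment methylation indices can be computed.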
Statistical modeling and analysis
[0177] Regularized logistic regression models for predicting cancer status from cfDNA methylation were fit in R with glmnet (J. Friedman, T. Hastie, and R. Tibshirani, J Stat Softw. 2010;33(1):1-22; the disclosure of which is incorporated herein by reference) using the cv.glmnet function with family = “binomial” (logistic regression), alpha = 0 for ridge regression, alpha = 1 for LASSO regression, and otherwise default parameters. Model performance was summarized using pROC (X. Robin, et al., BMC Bioinformatics. 2011 Mar 17;12:77; the disclosure of which is incorporated herein by reference). All statistical analyses were performed in R 4.0.1 or 3.6.1.
TABLE 1
[Table 1 is provided as page images in the published document: Figure imgf000049_0001 through Figure imgf000148_0001.]
TABLE 2
[Table 2 is provided as page images in the published document: Figure imgf000149_0001 through Figure imgf000173_0001.]

Claims

WHAT IS CLAIMED IS:
1. A method of sequencing for identification of condition-related differentially methylated regions in cell-free nucleic acids, comprising: obtaining a cell-free nucleic acid sample comprising cell-free nucleic acid molecules; extracting a subset of the cell-free nucleic acid molecules from the cell-free nucleic acid sample using a panel of nucleic acid probes designed to hybridize to regions that are known to be differentially methylated in a condition; converting nucleobases of the subset of the cell-free nucleic acid molecules, wherein the conversion of a nucleobase is indicative of a methylated state of that nucleobase; and sequencing the subset of the cell-free nucleic acid molecules via high-throughput sequencing.
2. The method of claim 1, wherein the condition is cancer.
3. The method of claim 2, wherein the cancer is non-small cell lung cancer.
4. The method of claim 3, wherein the regions that are known to be differentially methylated in a condition comprise at least 5% of the regions in Table 2.
5. The method of claim 4, wherein the regions that are known to be differentially methylated in a condition comprise at least 50% of the regions in Table 2.
6. The method of any one of claims 1-5, wherein the panel of nucleic acid probes excludes regions known to be associated with false discovery.
7. The method of any one of claims 1-6, wherein the panel of nucleic acid probes excludes regions known to be differentially methylated in blood cells.
8. The method of any one of claims 1-7 further comprising extracting a subset of the cell-free nucleic acid molecules from the cell-free nucleic acid sample using a panel of nucleic acid probes designed to hybridize to regions known to be correlated with factors associated with the condition.
9. The method of claim 8, wherein the condition is cancer and the regions known to be correlated with factors associated with the condition comprise regions in Table 1.
10. The method of any one of claims 1-9 further comprising extracting a subset of the cell-free nucleic acid molecules from the cell-free nucleic acid sample using a panel of nucleic acid probes designed to hybridize to regions known to be invariably hypermethylated or invariably hypomethylated.
11. The method of claim 10, wherein the regions known to be invariably hypermethylated or invariably hypomethylated comprise regions in Table 1.
12. The method of any one of claims 1-11, wherein converting nucleobases of the subset of the cell-free nucleic acid molecules comprises at least one of the following: bisulfite treatment, TET2 oxidation and APOBEC3A conversion, or TET2 oxidation and pyridine borane treatment.
13. The method of claim 12, wherein converting nucleobases of the subset of the cell-free nucleic acid molecules comprises TET2 oxidation and APOBEC3A conversion.
14. The method of any one of claims 1-13, wherein the cell-free nucleic acid sample is derived from a collection of: blood, plasma, saliva, urine, stool, mucus, lymph, or another bodily fluid.
15. The method of any one of claims 1-14, wherein the cell-free nucleic acid sample comprises at least 100,000 nucleic acid molecules.
16. The method of any one of claims 1-15, wherein the cell-free nucleic acid of the cell-free nucleic acid sample is cell-free DNA.
17. The method of claim 16, wherein the cell-free nucleic acid of the cell-free nucleic acid sample comprises at least 1 ng of cell-free DNA.
18. The method of claim 16, wherein the cell-free nucleic acid of the cell-free nucleic acid sample comprises at least 15 ng of cell-free DNA.
19. The method of any one of claims 1-18 further comprising attaching adapters to the cell-free nucleic acid molecules; wherein the adapters are resistant to nucleobase conversion as performed in the step of converting nucleobases of the subset of the cell-free nucleic acid molecules.
20. The method of any one of claims 1-19, wherein the panel of nucleic acid probes comprises at least 50 unique probes.
21. A method of sequencing that enhances detection of differentially methylated regions for assessing a condition of an individual, comprising: preparing a cell-free nucleic acid sample for targeted methyl sequencing, wherein the prepared cell-free nucleic acid sample is collected from an individual and comprises at least 100,000 cell-free nucleic acid molecules that are derived from a plurality of regions that are known to be differentially methylated in a condition; sequencing the cell-free nucleic acid sample via a high-throughput sequencer to yield a sequencing result of the cell-free nucleic acid molecules that are derived from a plurality of regions that are known to be differentially methylated in a condition; computing, using a computational device and the sequencing result, a methylation metric, wherein the methylation metric is computed for one region of the plurality of regions that are known to be differentially methylated in a condition, wherein the methylation metric indicates an amount of methylation of cell-free nucleic acid molecules that align to the region; and entering, using the computational device, the computed methylation metric as a feature into a computational model to yield an assessment of the cell-free nucleic acid sample, wherein the assessment indicates the individual has the condition.
22. The method of claim 21, wherein computing a methylation metric for a region comprises: aligning each cell-free nucleic acid molecule sequencing result across a region, wherein the region is one of the plurality of regions that are differentially methylated in a condition; and for a set of cell-free nucleic acid molecules that align across the region, determining an amount of methylation for each cell-free nucleic acid molecule of the set; wherein the methylation metric is based on at least one cell-free nucleic acid molecule of the set.
23. The method of claim 22, wherein computing the methylation metric for a region further comprises: determining a number of cell-free nucleic acid molecules within the set that are methylated more than a threshold; and computing a methylated molecule fraction (MMF) for the region, wherein:
$$\mathrm{MMF} = \frac{\#\text{molecules} > \text{methylation threshold}}{\text{total molecules assessed}}$$
wherein #molecules > methylation threshold is the number of cell-free nucleic acid molecules within the set that are determined to be methylated more than a threshold; wherein total molecules assessed is the total number of cell-free nucleic acid molecules within the set.
24. The method of claim 23, wherein the threshold is 60% of CpGs methylated.
25. The method of claim 22, wherein computing the methylation metric for a region further comprises: identifying, within the set of cell-free nucleic acid molecules that align across the region, the cell-free nucleic acid molecule that is most methylated, wherein the methylation metric computed is an amount of methylation of the cell-free nucleic acid molecule that is most methylated.
26. The method of any one of claims 22-25, wherein each cell-free nucleic acid molecule of the set of cell-free nucleic acid molecules that align across the region has a number of CpGs greater than a threshold.
27. The method of any one of claims 22-26 further comprising: computing, using the computational device and the sequencing result, a methylation metric for each region of at least fifty percent of the plurality of regions that are known to be differentially methylated in a condition; and entering, using the computational device, each computed methylation metric as features into the computational model to yield an assessment of the cell-free nucleic acid sample, wherein the assessment indicates the individual has the condition.
28. The method of any one of claims 22-26 further comprising: computing, using the computational device and the sequencing result, a methylation metric for each region of the plurality of regions that are known to be differentially methylated in a condition; and entering, using the computational device, each computed methylation metric as features into the computational model to yield an assessment of the cell-free nucleic acid sample, wherein the assessment indicates the individual has the condition.
29. The method of any one of claims 21-28 further comprising: computing, using the computational device and the sequencing result, a plurality of methylation metrics, wherein each methylation metric is computed for one region of the plurality of regions that are known to be differentially methylated in a condition; computing, using the computational device and the plurality of methylation metrics, a sample summary statistic that combines the plurality of methylation metrics; and entering, using the computational device, the sample summary statistic as a feature into a computational model to yield the assessment of the cell-free nucleic acid sample, wherein the assessment indicates the individual has the condition.
30. The method of claim 29, wherein each methylation metric of the plurality of methylation metrics used to compute a sample summary statistic is computed as follows: determining a number of cell-free nucleic acid molecules within a set that are methylated more than a threshold; and computing a methylated molecule fraction (MMF) for the region, wherein:
$$\mathrm{MMF} = \frac{\#\text{molecules} > \text{methylation threshold}}{\text{total molecules assessed}}$$
wherein #molecules > methylation threshold is the number of cell-free nucleic acid molecules within the set that are determined to be methylated more than a threshold; wherein total molecules assessed is the total number of cell-free nucleic acid molecules within the set.
31. The method of claim 30, wherein the sample summary statistic is a percentile of a number of regions with nonzero MMFs.
32. The method of claim 30, wherein the sample summary statistic is a percentile of a number of regions where MMF is zero.
33. The method of claim 30, wherein the sample summary statistic is a percentile of a number of regions where MMF is greater than threshold.
34. The method of claim 30, wherein each methylation metric of the plurality of methylation metrics used to compute a sample summary statistic is computed as follows: identifying, within the set of cell-free nucleic acid molecules that align across the region, the cell-free nucleic acid molecule that is most methylated, wherein the methylation metric computed is an amount of methylation of the cell-free nucleic acid molecule that is most methylated.
35. The method of claim 34, wherein the sample summary statistic is a percentile of the amount of methylation of the cell-free nucleic acid molecule that is most methylated.
36. The method of claim 34, wherein the sample summary statistic is a median of the amount of methylation of the cell-free nucleic acid molecule that is most methylated.
37. The method of claim 34, wherein the sample summary statistic is a skewness of the amount of methylation of the cell-free nucleic acid molecule that is most methylated.
38. The method of any one of claims 21-37, wherein the plurality of regions that are known to be differentially methylated in a condition comprises at least 10 genomic regions associated with a condition.
39. The method of any one of claims 21-38, wherein the plurality of regions that are known to be differentially methylated in a condition comprises at least 50 genomic regions associated with a condition.
40. The method of any one of claims 21-39, wherein the condition is a cancer.
PCT/US2023/083236 2022-12-08 2023-12-08 Systems and methods for cell-free nucleic acids methylation assessment WO2024124207A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202263386557P 2022-12-08 2022-12-08
US63/386,557 2022-12-08

Publications (2)

Publication Number Publication Date
WO2024124207A2 (en) 2024-06-13
WO2024124207A3 (en) 2024-07-11

Family

ID=91380337

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2023/083236 WO2024124207A2 (en) 2022-12-08 2023-12-08 Systems and methods for cell-free nucleic acids methylation assessment

Country Status (1)

Country Link
WO (1) WO2024124207A2 (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6791598B2 (en) * 2015-01-22 2020-11-25 ザ ボード オブ トラスティーズ オブ ザ レランド スタンフォード ジュニア ユニバーシティー Methods and systems for determining the ratio of different cell subsets
CA3155073A1 (en) * 2019-10-18 2021-04-22 Aadel Chaudhuri Methods and systems for measuring cell states

Also Published As

Publication number Publication date
WO2024124207A3 (en) 2024-07-11

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23901702

Country of ref document: EP

Kind code of ref document: A2