WO2024056720A1 - Determination of health status and treatment monitoring with cell-free DNA

Determination of health status and treatment monitoring with cell-free DNA

Info

Publication number
WO2024056720A1
Authority
WO
WIPO (PCT)
Prior art keywords
nucleosomal
dyad
cfdna
computer
peaks
Prior art date
Application number
PCT/EP2023/075122
Other languages
English (en)
Inventor
Martin Kircher
Samantha HASENLEITHNER
Benjamin SPIEGL
Michael Speicher
Original Assignee
Medizinische Universität Graz
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Medizinische Universität Graz
Publication of WO2024056720A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • C12Q1/6886 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00 Oligonucleotides characterized by their use
    • C12Q2600/106 Pharmacogenomics, i.e. genetic variability in individual responses to drugs and drug metabolism
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems

Definitions

  • the present invention relates to a method of determining the health status of a subject, monitoring the treatment success of a patient, and/or determining the cell type by analyzing cell free DNA (cfDNA) extracted from a sample.
  • cfDNA: cell-free DNA
  • Liquid biopsy approaches have been developed for cancer diagnosis, cancer therapy monitoring, and detection of disease relapse. However, most liquid biopsy approaches have so far been of limited use in patients with cancer for the detection of minimal residual disease and for screening.
  • Circulating cell-free DNA is an informative biomarker in prenatal diagnostics and in cancer patients.
  • cfDNA consists of highly degraded DNA fragments, which are detectable in the peripheral blood of every warm-blooded subject.
  • cfDNA contains DNA derived from tumor cells, termed circulating tumor DNA (ctDNA), which can be utilized to obtain information about the tumor genome (Heitzer et al., 2019).
  • Most cfDNA is derived from the hematopoietic system, whereas the remaining cfDNA stems from organs such as the liver or from endothelial cells.
  • the nucleosome is the fundamental protein subunit of chromatin around which the DNA is wrapped to enable its packaging.
  • DNA is held in complexes with structural proteins in chromosomes. These proteins organize the DNA into a compact structure called chromatin. In eukaryotes, this structure involves DNA binding to histones. The histones form a disk-shaped complex called the histone octamer. A stretch of DNA wrapped around this histone octamer, together with the octamer itself, is called a nucleosome.
  • Chromatin accessibility profiling is helpful for various applications in biology and medicine because changes in chromatin accessibility are implicated in multiple diseases, where they reflect disease-linked changes in cell composition, gene regulation, and epigenetic cell states.
  • chromatin accessibility profiling of plasma samples may identify disease-linked changes in chromatin structure and transcription regulation.
  • nucleosome occupancy varies across tissues and cell types, so that it reveals information about the cfDNA tissue of origin.
  • Hall and colleagues provided a detailed map of histone-DNA interactions. To this end, they used a mechanical unzipping method, which allowed unzipping single molecules of DNA, which contained a single nucleosome, to map the locations of the histone-DNA interactions to near base-pair accuracy along the DNA (Hall et al., 2009). It was found that the histone-DNA interactions within a nucleosome are not uniform, and the nucleosomal dyad is the region where nucleosomal DNA was most tightly bound. As the central region has the strongest interaction, the nucleosome stability is most sensitive to DNA sequences near the dyad. Once a dyad region of interactions is disrupted, the nucleosome becomes unstable, and histones dissociate from the sequence (Hall et al., 2009).
  • Nucleosome positioning in cells is impacted by intrinsic factors such as DNA sequence, shape, and DNA bendability, as well as extrinsic factors such as chromatin remodelers and other cofactors (Michael and Thoma, 2021).
  • higher CG content may correspond to more stable nucleosomes.
  • DNA wrapped in nucleosomes is sterically occluded, creating obstacles for proteins that must bind to it for gene regulation, transcription, replication, recombination, and repair. Therefore, mechanisms are required to access buried stretches of nucleosomal DNA.
  • nucleosomes are highly dynamic structures and may transiently expose their DNA so that they are, in general, not roadblocks for DNA-binding proteins.
  • Nucleosomes may temporarily expose portions of their wrapped DNA through spontaneous unspooling from either end. This process by which DNA transiently disengages from the histone octamer is called “site exposure” or “nucleosome breathing”. During nucleosome breathing, nucleosomal DNA ends unwrap from the histone core partially and reversibly on a rapid time scale.
  • thermally driven dynamics represents the spontaneous “mobility” or “thermal sliding” of nucleosomes by which their center of mass repositions on the DNA in an unprompted longitudinal-like movement.
  • nucleosome-bound DNA is more methylated than flanking DNA.
  • methylation patterns may change.
  • placental DNA is globally hypomethylated, and therefore nucleosomal DNA in placental tissue has more open chromatin structures than the methylated maternal somatic tissue.
  • nucleosome-bound placental DNA has increased accessibility to endonucleases during apoptosis and hence alternative cleavage sites compared with maternal DNA, which may explain why placentally derived DNA is shorter than maternally derived DNA in the plasma of pregnant females (Sun et al., 2018).
  • DNA fragments from cancer cells have been reported to be shorter than DNA fragments from nonmalignant cells (Jiang et al., 2015; Mouliere et al., 2018).
  • the shorter tumor derived cfDNA fragments have been attributed to the genome-wide hypomethylation often observed in tumor genomes or other mechanisms such as cfDNA release during cell proliferation rather than apoptosis.
  • These physiological or pathological states impact the accessibility of nucleosomal DNA and may result in asymmetric DNA digestion where the dyad is displaced from the center of cfDNA fragments.
  • NGS: next-generation sequencing
  • WGS: whole-genome sequencing
  • US 2017/211143 A1 discloses methods of determining tissue and cell types contributing to cfDNA and methods of identifying a disease or disorder in a subject as a function of determined tissue and cell types contributing to cfDNA in a sample. Thereby, mapping of nucleosome positions is based on sequence coverages using the windowed protection score (WPS).
  • WPS: windowed protection score
  • the predominant local positions of nucleosomes in the tissue(s) contributing to cfDNA are inferred from the distribution of aligned cfDNA fragment endpoints. These cfDNA fragment endpoints should cluster adjacent to nucleosome core particle (NCP) boundaries while also being depleted on the NCP itself.
  • NCP: nucleosome core particle
  • Snyder et al. (2016) developed the windowed protection score (WPS), which is the number of DNA fragments completely spanning a 120 bp window centered at a given genomic coordinate minus the number of fragments with an endpoint within that same window (Fig 25 in US 2017/211143 A1 and Fig. 2A in Snyder et al., 2016). High WPS values indicate increased DNA protection from digestion; low values indicate that DNA is unprotected.
  • the inventors of the present invention surprisingly found that analyzing the positioning of the nucleosomal dyad highly increases the resolution of cfDNA analysis and that the position of the nucleosomal dyad can be obtained by using the methods described herein.
  • the position of a nucleosomal dyad indeed provides information on the health status of a subject and allows monitoring of the treatment success of a patient.
  • additional information is provided by obtaining the nucleosomal dyad, namely the cell type and/or tissue contribution of cfDNA can be determined.
  • the present invention provides the determination of nucleosomal dyad positions in cfDNAs resulting in an unprecedented increase in resolution.
  • the inventors of the present invention surprisingly found that the herein described method allows mapping of the relative position of nucleosome dyads to individual cfDNA fragments and, using this newly found information, mapping the location of nucleosome dyads back to the reference genome with unprecedented high resolution, whereas so far known methods estimated nucleosome dyad positions solely from coverage data.
  • the present invention provides a computer-implemented method for determining nucleosomal dyads from cell-free DNA (cfDNA) fragments from a sample comprising the steps of: i. receiving data representing the DNA sequences of cfDNA fragments acquired by sequencing of cfDNA fragments extracted from the sample; ii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from i. with a reference genome sequence; and iii. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments.
  • nucleosomal dyad positions are determined, specifically nucleosomal dyad positions in cfDNA fragments.
  • step iii. comprises mapping of nucleosomal dyads to cfDNA fragments within a coverage peak.
  • step iii. comprises establishing peak-specific and cfDNA-length-specific statistics.
  • step iii. comprises establishing a distribution of probabilities of the presence of a nucleosomal dyad.
  • step iv. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iii. and the fragmentation profile of ii.
  • determining the average probability of the presence of a nucleosomal dyad comprises Bayesian inference.
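As an illustration of how per-fragment dyad probabilities (step iii.) could be combined with the fragmentation profile into an average probability per reference position (step iv.), the following Python sketch may help. It is not the method of the application: the triangular per-fragment distribution `centered_profile` is a hypothetical stand-in for the peak-specific and length-specific statistics, and the coverage-normalized mean replaces the Bayesian inference mentioned above with plain averaging.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def centered_profile(length: int) -> List[float]:
    """Hypothetical dyad distribution over a fragment: triangular, peaking at the midpoint."""
    mid = (length - 1) / 2
    weights = [mid + 1 - abs(i - mid) for i in range(length)]
    total = sum(weights)
    return [w / total for w in weights]


def average_dyad_probability(
    fragments: Iterable[Tuple[int, int]],
    profile: Callable[[int], List[float]] = centered_profile,
) -> Dict[int, float]:
    """Mean per-fragment dyad probability at each covered reference position.

    `fragments` holds (start, length) pairs of aligned fragments (assumed input format).
    """
    prob_sum: Dict[int, float] = defaultdict(float)
    coverage: Dict[int, int] = defaultdict(int)
    for start, length in fragments:
        for offset, p in enumerate(profile(length)):
            prob_sum[start + offset] += p
            coverage[start + offset] += 1
    # Dividing by coverage turns the summed probabilities into an average.
    return {pos: prob_sum[pos] / coverage[pos] for pos in prob_sum}


example_average = average_dyad_probability([(100, 3), (100, 3)])
```

A length-specific table estimated from data could be passed as `profile` in place of the triangular placeholder.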
  • step vi. chaining the mapped peaks across the reference genome sequence into series of mapped peaks following rules for naturally occurring nucleosomal spacing.
  • chaining is grouping the peaks of the average probability of the presence of a nucleosomal dyad that occur consecutively along the reference genome.
  • peaks are chained if a distance of at least 100 bp is between the peaks.
  • peaks are chained if a distance of at least 115, 120, 125, 130, 135, 140, 145, or 146 bp is between the peaks.
  • each chain represents a specific cfDNA origin.
  • the specific cfDNA origin is a cell line or a tissue.
  • chaining is performed genome-wide.
  • chaining is performed in coding and non-coding regions. Specifically, comprising determining an index of fragment length and dyad position.
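The chaining rule described above can be sketched in Python as follows. The minimum spacing of 100 bp comes from the text; the upper bound `max_distance` and the choice to start a new chain whenever the spacing falls outside the allowed range are assumptions added for illustration, since the text only refers to "rules for naturally occurring nucleosomal spacing".

```python
from typing import Iterable, List


def chain_peaks(peaks: Iterable[int],
                min_distance: int = 100,
                max_distance: int = 250) -> List[List[int]]:
    """Group consecutive dyad peaks into chains based on their spacing."""
    chains: List[List[int]] = []
    current: List[int] = []
    for pos in sorted(peaks):
        if current and min_distance <= pos - current[-1] <= max_distance:
            # Spacing is compatible with a neighboring nucleosome: extend the chain.
            current.append(pos)
        else:
            # Spacing too small or too large: close the chain and start a new one.
            if current:
                chains.append(current)
            current = [pos]
    if current:
        chains.append(current)
    return chains


example_chains = chain_peaks([1000, 1180, 1370, 2500, 2690, 2740])
```

Here the large gap before 2500 and the 50 bp gap before 2740 each end a chain, giving three chains in total.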
  • the sample is a biological sample from a subject or from a cohort of subjects.
  • nucleosomal dyads further comprising comparing the determined nucleosomal dyads, mapped peaks and/or chained peaks with one or more standard nucleosomal dyads, standard maps of nucleosomal dyads, and/or standard maps of nucleosomal dyad chains.
  • comparing comprises determining the deviation of nucleosomal dyad positioning on cfDNA fragments, determining changes in the informative counts ratio, determining presence or absence of specific nucleosome positioning that is characteristic for a specific classification.
  • nucleosomal dyads further comprising screening for a correlation of determined nucleosomal dyads, mapped peaks and/or chained peaks with one or more standard nucleosomal dyads, standard maps of nucleosomal dyads, and/or standard maps of nucleosomal dyad chain peaks.
  • the one or more standard nucleosomal dyads, standard maps of nucleosomal dyads, and/or standard maps of nucleosomal dyad chain peaks is determined for one or more cohorts of subjects having a specific classification.
  • the specific classification is associated with a condition.
  • the condition is selected from the group consisting of health status, aging status, cell type, tissue type, and specific disease status.
  • markers for specific conditions are defined.
  • the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments for different lengths of cfDNA fragments is indicative for the health status of a subject.
  • the length of cfDNA fragments is obtained in the fragmentation profile.
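A fragment length distribution is one simple component of a fragmentation profile. The Python sketch below tallies fragment lengths from aligned intervals; the `(chrom, start, end)` tuple layout with half-open coordinates is an assumed input format, and the function name is hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def fragment_length_profile(alignments: Iterable[Tuple[str, int, int]]) -> Counter:
    """Count aligned fragments per fragment length (end is exclusive)."""
    return Counter(end - start for _chrom, start, end in alignments)


example_profile = fragment_length_profile([
    ("chr1", 100, 267),
    ("chr1", 500, 667),
    ("chr2", 10, 155),
])
```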
  • a health status deviating from a healthy status is indicated if the z-score of the set of cumulative deviations and/or informative counts ratios deviates from the set of distributions of cumulative deviations and/or informative counts ratios recorded from healthy subjects.
  • a health status deviating from a healthy status is cancer or pregnancy-associated complications.
  • the health status of a subject is determined.
  • the mapped peaks are compared with a standard map derived from healthy subjects, a standard map derived from unhealthy subjects, an outlier map of nucleosomal dyads derived from unhealthy subjects, and/or a standard map of nucleosomal dyad chains derived from healthy subjects.
  • comparing the mapped peaks comprises determining the deviation of nucleosomal dyad positioning on cfDNA fragments, determining changes in the informative counts ratio, determining presence or absence of specific nucleosome positioning that is characteristic for one of the healthy or unhealthy subject groups, and/or determining chains of nucleosomal dyad positioning.
  • a. congruence with the standard maps derived from healthy subjects and difference with the standard maps derived from unhealthy subjects is characteristic for a healthy status
  • b. congruence with the standard maps derived from unhealthy subjects and difference with the standard maps derived from healthy subjects is characteristic for an unhealthy status
  • c. congruence with the outlier maps of nucleosomal dyads derived from unhealthy subjects and difference with the standard maps derived from healthy and unhealthy subjects is characteristic for an unhealthy status
  • d. difference with a standard map of nucleosomal dyad chains derived from healthy subjects is characteristic for an unhealthy status.
  • the unhealthy subjects are subjects suffering from cancer, inflammation, coronary disease, acute tissue damage, chronic disease, complications during pregnancy, beginning sepsis, sepsis, and/or unhealthy aging.
  • the subject is considered unhealthy if the deviation of nucleosomal dyad positioning between the mapped nucleosomal dyads and standard maps of nucleosomal dyads characteristic for a healthy subject is more than the deviation of nucleosomal dyad positioning between the mapped nucleosome dyads and standard maps and outlier maps of nucleosomal dyads characteristic for an unhealthy subject; wherein the subject is considered healthy if the deviation of nucleosomal dyad positioning between the mapped nucleosomal dyads and standard maps of nucleosomal dyads characteristic for a healthy subject is less than the deviation of nucleosomal dyad positioning between the mapped nucleosome dyads and standard maps and outlier maps of nucleosomal dyads characteristic for an unhealthy subject; wherein a subject is considered unhealthy if the z-score of the changes of the informative counts ratio
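The comparison of deviations against healthy and unhealthy standard maps can be sketched in Python as follows. The deviation measure used here, the mean distance from each sample dyad to the nearest dyad in a standard map, is an assumption chosen for illustration, as are the function names and the "undetermined" outcome for ties.

```python
from bisect import bisect_left
from typing import List, Sequence


def mean_nearest_distance(sample: Sequence[int], reference: Sequence[int]) -> float:
    """Mean distance (bp) from each sample dyad to the nearest reference dyad."""
    if not sample or not reference:
        raise ValueError("sample and reference must be non-empty")
    ref: List[int] = sorted(reference)
    total = 0
    for pos in sample:
        i = bisect_left(ref, pos)
        # Only the neighbors on either side of the insertion point can be nearest.
        candidates = ref[max(i - 1, 0): i + 1]
        total += min(abs(pos - r) for r in candidates)
    return total / len(sample)


def classify_by_deviation(sample: Sequence[int],
                          healthy_map: Sequence[int],
                          unhealthy_map: Sequence[int]) -> str:
    """Label the sample by whichever standard map it deviates from less."""
    d_healthy = mean_nearest_distance(sample, healthy_map)
    d_unhealthy = mean_nearest_distance(sample, unhealthy_map)
    if d_healthy > d_unhealthy:
        return "unhealthy"
    if d_healthy < d_unhealthy:
        return "healthy"
    return "undetermined"


healthy_map = [1000, 1190, 1380]
unhealthy_map = [1040, 1230, 1420]
label_a = classify_by_deviation([1005, 1188, 1385], healthy_map, unhealthy_map)
label_b = classify_by_deviation([1042, 1228, 1415], healthy_map, unhealthy_map)
```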
  • the subject is a patient undergoing treatment of a health condition.
  • the one or more standard maps are mapped peaks of a previous result from said patient, a standard map of nucleosomal dyads characteristic for the treatment success, chained peaks of a previous result from said patient, and/or a standard map of nucleosomal dyad chains characteristic for the treatment success.
  • differences and/or congruences provide information on the treatment success of the patient.
  • the treatment success is monitored for the treatment of cancer, specifically of prostate cancer, colon cancer, breast cancer, bladder cancer, and/or lung cancer; and for the treatment of inflammatory diseases, specifically of inflammatory bowel disease, systemic lupus erythematosus, ulcerative colitis; chronic inflammatory diseases such as thyroiditis, Crohn's disease, chronic obstructive pulmonary disease; and/or asthma.
  • the standard map is a map of nucleosomal dyads of specific tissues or cell types, or a map of nucleosomal dyad chains of specific tissues or cell types.
  • the cell type and/or tissue contribution of cfDNA in a sample is determined.
  • the tissue and/or cell types are selected from the group consisting of cancer cells, specifically lung cancer, colorectal cancer, breast cancer, prostate cancer, and bladder cancer; and normal cells, specifically hematopoietic cells, liver cells, epithelial cells, and bone marrow.
  • the standard map is a map characteristic for an aging status.
  • the standard map is determined from a cohort of subjects having a specific aging status.
  • the cohort of subjects having a specific aging status is selected from healthy subjects older than 55 years, healthy subjects between 20 and 30 years, pregnant females, and subjects having a disease.
  • the disease is cancer, specifically selected from colorectal cancer and prostate cancer.
  • the aging status of a subject is determined.
  • the present invention further provides a data processing apparatus comprising means for carrying out the method described herein.
  • the present invention further provides a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method described herein.
  • the present invention further provides a computer-readable medium having stored thereon the computer program described herein.
  • the present invention further provides an in vitro method for analyzing cell-free DNA (cfDNA) fragments from a sample comprising the steps of: i. extracting cfDNA fragments from the sample; ii. performing whole genome sequencing on the extracted cfDNA fragments; iii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from ii. with a reference genome sequence; iv. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; v. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iv.
  • an in vitro method for determining the health status of a subject comprising the steps of: i. extracting cfDNA fragments from a sample from the subject; ii. performing whole genome sequencing on the extracted cfDNA fragments; iii. determining the cfDNA fragmentation profile by aligning the DNA sequences of the cfDNA fragments from ii.
  • iv. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; v. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iv. and the fragmentation profile of iii.; vi. mapping peaks of the average probability of the presence of a nucleosomal dyad of v. across the reference genome sequence; vii. chaining the mapped peaks obtained from step vi. across the reference genome sequence into series of mapped peaks following rules for naturally occurring nucleosomal spacing; and viii. comparing the mapped peaks obtained in vi.
  • comparing the mapped peaks obtained in vi. comprises determining the deviation of nucleosomal dyad positioning on cfDNA fragments, determining changes in the informative counts ratio, determining presence or absence of specific nucleosome positioning that is characteristic for one of the healthy or unhealthy subject groups, and/or determining chains of nucleosomal dyad positioning; wherein
  • a. congruence with the library of standard maps derived from healthy subjects and difference with the library of standard maps derived from unhealthy subjects is characteristic for a healthy status
  • b. congruence with the library of standard maps derived from unhealthy subjects and difference with the library of standard maps derived from healthy subjects is characteristic for an unhealthy status
  • c. congruence with the library of outlier maps of nucleosomal dyads derived from unhealthy subjects and difference with the standard maps of libraries from healthy and unhealthy subjects is characteristic for an unhealthy status
  • d. congruence with the library of standard maps derived from healthy subjects and difference with the library of standard maps derived from unhealthy subjects
  • said library of standard maps derived from unhealthy subjects and/or said library of outlier maps of nucleosomal dyads derived from unhealthy subjects is from subjects suffering from cancer, inflammation, coronary disease, acute tissue damage, chronic disease, complications during pregnancy, beginning sepsis, sepsis, and/or unhealthy aging.
  • the subject is considered unhealthy if the deviation of nucleosomal dyad positioning between the mapped nucleosomal dyads obtained in vi. and standard maps of nucleosomal dyads characteristic for a healthy subject is more than the deviation of nucleosomal dyad positioning between the mapped nucleosomal dyads obtained in vi. and standard maps and outlier maps of nucleosomal dyads characteristic for an unhealthy subject; wherein the subject is considered healthy if the deviation of nucleosomal dyad positioning between the mapped nucleosomal dyads obtained in vi. and standard maps of nucleosomal dyads characteristic for a healthy subject is less than the deviation of nucleosomal dyad positioning between the mapped nucleosomal dyads obtained in vi. and standard maps and outlier maps of nucleosomal dyads characteristic for an unhealthy subject; and/or wherein a subject is considered unhealthy if the z-score of the changes of the informative counts ratios between the sample set of informative counts ratios and the standard set of informative counts ratios of healthy subjects exceeds a limit that is characteristic for a specific group of unhealthy samples, preferably if the absolute value of the z-score exceeds 3; and/or wherein a subject is considered unhealthy if the z-score of the changes of the cumulative deviations between the sample set of cumulative deviations and the standard set of cumulative deviations exceeds a limit that is characteristic for a specific group of unhealthy samples, preferably if the absolute value of the z-score exceeds 3.
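The z-score criterion with the preferred limit of 3 can be illustrated with a short Python sketch. The informative counts ratio is treated here as a single number per sample that has already been computed; how it is derived is not covered by this excerpt. The function names and the example values are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence


def z_score(value: float, healthy_values: Sequence[float]) -> float:
    """Standard score of a sample value against a healthy reference set."""
    mu = mean(healthy_values)
    sigma = stdev(healthy_values)  # sample standard deviation, needs at least 2 values
    if sigma == 0:
        raise ValueError("reference values have zero variance")
    return (value - mu) / sigma


def deviates_from_healthy(value: float,
                          healthy_values: Sequence[float],
                          limit: float = 3.0) -> bool:
    """True if the absolute z-score exceeds the limit."""
    return abs(z_score(value, healthy_values)) > limit


# Made-up reference ratios for illustration only.
healthy_ratios = [0.48, 0.50, 0.52, 0.49, 0.51]
flag_high = deviates_from_healthy(0.56, healthy_ratios)
flag_normal = deviates_from_healthy(0.51, healthy_ratios)
```

The same check applies to cumulative deviations by passing those values in place of the ratios.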
  • an in vitro method for monitoring the treatment success of a patient comprising the steps of: i. extracting cfDNA fragments from a sample of said patient; ii. performing whole genome sequencing on the extracted cfDNA fragments; iii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from ii. with a reference genome sequence; iv. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; v. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iv.
  • comparing the mapped peaks comprises determining the deviation of nucleosomal dyad positioning on cfDNA fragments, determining changes in the informative counts ratio, and/or determining presence or absence of specific nucleosome positioning that is characteristic for one of the healthy or unhealthy subject groups, wherein differences and/or congruences obtained in viii. provide information on the treatment success of the patient.
  • the treatment success is monitored for the treatment of cancer, specifically of prostate cancer, colon cancer, breast cancer, bladder cancer, and/or lung cancer; and for the treatment of inflammatory diseases, specifically of inflammatory bowel disease, systemic lupus erythematosus, ulcerative colitis; chronic inflammatory diseases such as thyroiditis, Crohn's disease, chronic obstructive pulmonary disease; and/or asthma.
  • an in vitro method for determining the cell type and/or tissue contribution of cfDNA in a sample comprising the steps of: i. extracting cfDNA fragments from the sample; ii. performing whole genome sequencing on the extracted cfDNA fragments; iii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from ii. with a reference genome sequence; iv. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; v. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iv.
  • the tissue and/or cell types are selected from the group consisting of cancer cells, specifically lung cancer, colorectal cancer, breast cancer, prostate cancer, and bladder cancer; and normal cells, specifically hematopoietic cells, liver cells, epithelial cells, and bone marrow.
  • a computer-implemented method for determining the health status of a subject comprising the steps of: i. receiving data representing the DNA sequences of cfDNA fragments acquired by sequencing of cfDNA fragments extracted from a sample; ii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from i. with a reference genome sequence; iii. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; iv. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iii.
  • v. mapping peaks of the average probability of the presence of a nucleosomal dyad of iv. across the reference genome sequence; vi. chaining the mapped peaks obtained from step v. across the reference genome sequence into series of mapped peaks following rules for naturally occurring nucleosomal spacing, and vii. comparing the mapped peaks obtained in v. with a library comprising standard maps, comparing the mapped peaks obtained in v. with a library comprising outlier maps of nucleosomal dyads, and/or comparing the chained peaks obtained in vi.
  • comparing the mapped peaks obtained in v. comprises determining the deviation of nucleosomal dyad positioning on cfDNA fragments, determining changes in the informative counts ratio, determining presence or absence of specific nucleosome positioning that is characteristic for one of the healthy or unhealthy subject groups, and/or determining chains of nucleosomal dyad positioning wherein a. congruence with the library of standard maps derived from healthy subjects and difference with the library of standard maps derived from unhealthy subjects is characteristic for a healthy status; b. congruence with the library of standard maps derived from unhealthy subjects and difference with the library of standard maps derived from healthy subjects is characteristic for an unhealthy status; c.
  • a computer-implemented method for monitoring the treatment success of a patient comprising the steps of: i. receiving data representing the DNA sequences of cfDNA fragments acquired by sequencing of cfDNA fragments extracted from a sample; ii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from i. with a reference genome sequence; iii. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; iv. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iii.
  • v. mapping peaks of the average probability of the presence of a nucleosomal dyad of iv. across the reference genome sequence; vi. chaining the mapped peaks obtained from step v. across the reference genome sequence into series of mapped peaks following rules for naturally occurring nucleosomal spacing, and vii. comparing the mapped peaks obtained in v. with the mapped peaks of a previous result from said patient, comparing the mapped peaks obtained in v. with a standard map of nucleosomal dyads characteristic for the treatment success, comparing the chained peaks obtained in vi.
  • comparing the mapped peaks comprises determining the deviation of nucleosomal dyad positioning on cfDNA fragments, determining changes in the informative counts ratio, and/or determining presence or absence of specific nucleosome positioning that is characteristic for one of the healthy or unhealthy subject groups, wherein differences and/or congruences obtained in vii. provide information on the treatment success of the patient.
  • a computer-implemented method for determining the cell type and/or tissue contribution of cfDNA in a sample comprising the steps of: i. receiving data representing the DNA sequences of cfDNA fragments acquired by sequencing of cfDNA fragments extracted from a sample; ii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from i. with a reference genome sequence; iii. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; iv.
  • a library comprising mapped nucleosomal dyads of specific tissues or cell types and/or determining the cell type and/or tissue contribution of the cfDNA fragments by comparing the chained peaks obtained in vi. with a standard map of nucleosomal dyad chains of specific tissues or cell types.
  • a computer-implemented method in a method described herein, said method comprising the steps of: i. receiving data representing the DNA sequences of cfDNA fragments acquired by sequencing of cfDNA fragments extracted from a sample; ii. performing at least one of the steps iii. to viii. according to a method described herein.
  • an in vitro method for determining the health status of a subject comprising the steps of: i. extracting cfDNA fragments from a sample from the subject; ii. determining the sequence of the cfDNA fragments by performing whole genome sequencing on the extracted cfDNA fragments; iii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from ii. with a reference genome sequence; and iv. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; wherein the probability obtained from iv. for different fragment lengths of the cfDNA fragments as obtained from iii.
  • a health status deviating from a healthy status is indicated if the z-score of the set of cumulative deviations and/or informative counts ratios deviate from the set of distributions of cumulative deviations and/or informative counts ratios recorded from healthy subjects, preferably wherein a health status deviating from a healthy status is cancer, unhealthy aging, or pregnancy-associated complications.
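  • As a minimal sketch of the z-score comparison described above, the following Python snippet scores one scalar feature (for example, a cumulative deviation or an informative counts ratio) against values recorded from healthy subjects. The function names and the cutoff of 3.0 are illustrative assumptions and are not taken from the source.

```python
from statistics import mean, stdev


def z_score(value, healthy_values):
    """Standard score of `value` relative to a healthy reference sample."""
    mu = mean(healthy_values)
    sigma = stdev(healthy_values)  # sample standard deviation
    if sigma == 0:
        raise ValueError("healthy reference values have zero variance")
    return (value - mu) / sigma


def deviates_from_healthy(value, healthy_values, threshold=3.0):
    """True if |z| exceeds `threshold` (hypothetical cutoff, an assumption)."""
    return abs(z_score(value, healthy_values)) > threshold


healthy = [1.0, 2.0, 3.0]           # toy reference values
print(z_score(5.0, healthy))        # 3.0
print(deviates_from_healthy(6.0, healthy))
```

  • In practice a set of features would be scored, each against its own healthy distribution; the sketch shows a single feature only.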
  • Figure 1 Principle of liquid biopsy for circulating tumor DNA analysis
  • Figure 7 Overview heatmap of dyad count distributions with different distribution truncation strategies
  • Figure 10 Extraction and selection of features from biologically relevant genomic loci for machine learning
  • Figure 11 Machine learning classifiers detect pathophysiological states
  • Figure 12 New insights from nucleosome prior distributions and posterior nucleosome dyad position mapping
  • Figure 14 Kurtosis of dyad placement.
  • DNA refers to deoxyribonucleic acid. DNA is a type of nucleic acid.
  • nucleic acid generally refers to a polynucleotide comprising two or more nucleotides.
  • a nucleotide is a monomer composed of three components: a 5-carbon sugar, a phosphate group, and a nitrogenous base.
  • the four naturally occurring types of DNA nucleotides are: adenine (A), thymine (T), guanine (G), and cytosine (C).
  • cfDNA refers to “cell free DNA”, “cell-free DNA”, “circulating free DNA”, or “circulating-free DNA”.
  • cfDNA consists of highly degraded DNA fragments, which are detectable in the peripheral blood of every human. In healthy individuals, the vast majority of cfDNA is derived from the hematopoietic system. However, the preferential DNA contribution to the cfDNA pool may change under certain physiological or pathological conditions. Furthermore, cfDNA can also provide information about physiological processes such as aging.
  • cfDNA may comprise a footprint representative of its underlying chromatin organization, which may capture one or more of: expression-governing nucleosomal occupancy, RNA Polymerase II pausing, cell death-specific DNase hypersensitivity, and chromatin condensation during cell death.
  • a footprint may carry a signature of cell debris clearance and trafficking, e.g., DNA fragmentation may be carried out by caspase-activated DNase (CAD) in cells dying by apoptosis, but may also be carried out by lysosomal DNase II after the dying cells are phagocytosed, resulting in different cleavage patterns.
  • cfDNA represents an essential component of “liquid biopsies”, which refers to the analyses of non-solid biological sources (e.g., blood, urine, CSF, ascites) to obtain information similar to tissue biopsies. Analyses of cfDNA are of extraordinary relevance, particularly in oncology, since in patients with cancer, cfDNA contains circulating tumor DNA (ctDNA) shed from tumor cells into the circulation.
  • Mechanisms for DNA release into the bloodstream can be apoptosis, necrosis, and active release; specifically, cfDNA is released by apoptosis.
  • DNA is wrapped around histones to form nucleosomes, which are the basic structure of DNA packing.
  • typical cfDNA fragment lengths have a modal distribution of 167 bp. This length corresponds approximately to the size of DNA wrapped around a nucleosome (~147 bp) and a linker fragment (~20 bp).
  • This particular cfDNA size pattern corresponds to fragmentation patterns after enzymatic processing in apoptotic cells.
  • the cfDNA fragmentation patterns reflect the association between cfDNA with nucleosome core particles and linker histones, determining where nuclease cleavage may occur.
  • DNA is frequently cleaved between nucleosomes and only rarely within nucleosomes. The latter circumstance is also called “cleaving resistance” and is associated with the cfDNA fragments described herein.
  • the architecture of individual nucleosomes determines access to nucleosomal DNA.
  • the individual nucleosome core particle contains 147 bp of DNA wrapped in ~1.7 left-handed superhelical turns around a central octamer composed of two copies of each of the four core histones H2A, H2B, H3, and H4. These fundamental nucleosome units are connected with intervening linkers ranging from 20 to 100 bp (Michael and Thoma, 2021). Usually, the DNA is tightly wrapped around this histone octamer and sharply bent.
  • the nucleosome core particle architecture is pseudo-2-fold symmetric, with the DNA position at the symmetry axis.
  • the symmetry axis, i.e., the dyad, is designated as location 0.
  • the superhelix locations (SHLs) are labeled with ±1, ±2, and so on and denote where the minor grooves of the DNA double helix structure face away from the histone octamer (shown in Michael and Thoma, 2021).
  • the methods described herein are based on the analysis of the presence of one or more nucleosomal dyads in cfDNA fragments.
  • nucleosomal dyad is the region occupied by the center of the nucleosome or the base position of nucleosomal DNA that marks the midpoint of the nucleosomal base pair sequence (see Michael and Thoma, 2021).
  • With its two juxtaposed DNA gyres, the nucleosomal DNA itself places most DNA motifs directly adjacent to a second DNA strand on the neighboring gyre, except the dyad where only one DNA strand is present (shown in Michael and Thoma, 2021).
  • an in vitro method for analyzing cfDNA fragments from a sample comprising the steps of: i. extracting cfDNA fragments from the sample; ii. performing whole genome sequencing on the extracted cfDNA fragments; iii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from ii. with a reference genome sequence; iv. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; v.
  • sample generally refers to a biological sample obtained from or derived from a subject.
  • Biological samples may be cell-free biological samples or substantially cell-free biological samples, or may be processed or fractionated to produce cell-free biological samples.
  • the biological sample used in the method of the invention is a biofluid sample.
  • useful biofluid samples include, e.g., a blood sample, a serum sample, a plasma sample, a cerebrospinal fluid (CSF) sample, a lymph sample, an endometrial fluid sample, a urine sample, a saliva sample, a tear fluid sample, a synovial fluid sample, an amniotic fluid sample, and a sputum sample.
  • the biofluid sample is selected from a blood sample, a urine sample, a cerebrospinal sample, or an amniotic fluid sample.
  • cfDNA can, e.g., be obtained by a standard blood draw, i.e., a minimally invasive approach.
  • because the blood vial after the blood draw contains both the cellular components of blood and the cell-free fraction, which is referred to as plasma, extraction steps such as centrifugation steps may be required to separate these components.
  • extract in the context of extracting cfDNA fragments refers to the isolation of the cfDNA or cfDNA fragments from the sample. Isolation, extraction, and/or purification of cfDNA or cfDNA fragments may be performed through collection of bodily fluids using a variety of techniques. In some cases, collection may comprise aspiration of a bodily fluid from a patient using a syringe. In other cases, collection may comprise pipetting or direct collection of fluid into a collecting vessel. After collection of bodily fluid, cfDNA or cfDNA fragments may be isolated and extracted using a variety of techniques known in the art.
  • cfDNA may be isolated, extracted and prepared using commercially available kits such as the Qiagen QIAamp® Circulating Nucleic Acid Kit protocol.
  • Qiagen Qubit™ dsDNA HS Assay kit protocol, Agilent™ DNA 1000 kit, or TruSeq™ Sequencing Library Preparation; Low-Throughput (LT) protocol may be used.
  • the cfDNA or cfDNA fragments are pre-mixed with one or more additional materials, such as one or more reagents (e.g., ligase, protease, polymerase) prior to sequencing.
  • a cell-free fraction of a biological sample may be used as a sample in the methods described herein.
  • the term “cell-free fraction” of a biological sample generally refers to a fraction of the biological sample that is substantially free of cells.
  • the term “substantially free of cells” generally refers to a preparation from the biological sample comprising fewer than about 20,000 cells per mL, fewer than about 2,000 cells per mL, fewer than about 200 cells per mL, or fewer than about 20 cells per mL.
  • Genomic DNA may not be excluded from the acellular sample and typically comprises from about 50% to about 90% of the nucleic acids that are present in the sample.
  • liquid biopsy refers to a broad category of sampling and minimally invasive testing done on a biofluid (e.g., blood, blood plasma or blood serum) to look for fragments of, e.g., tumor-derived cfDNA that are in the blood.
  • the methods described herein may comprise a step of amplifying a nucleic acid.
  • amplifying and amplification generally refer to increasing the size or quantity of a nucleic acid molecule.
  • the nucleic acid molecule may be single-stranded or double-stranded.
  • Amplification may include generating one or more copies or “amplified product” of the nucleic acid molecule.
  • Amplification may be performed, for example, by extension (e.g., primer extension) or ligation.
  • Amplification may include performing a primer extension reaction to generate a strand complementary to a single-stranded nucleic acid molecule, and in some cases generate one or more copies of the strand and/or the single-stranded nucleic acid molecule.
  • DNA amplification generally refers to generating one or more copies of a DNA molecule or “amplified DNA product.”
  • a method comprises performing DNA sequencing, e.g., whole genome sequencing, Sanger sequencing, targeted next-generation sequencing (NGS), or whole-genome NGS.
  • whole genome sequencing is performed on extracted cfDNA fragments for obtaining the DNA sequence of the cfDNA fragment.
  • the result of this sequencing of the cfDNA fragment is also referred to herein as the “sequenced cfDNA fragment” or the “read”.
  • sequenced refers to a sequence read from a portion of a nucleic acid sample, i.e., is the result of the sequencing experiment.
  • a read represents a short sequence of contiguous base pairs in the sample.
  • the read may be represented symbolically by the base pair sequence (in ATCG) of the sample portion. It may be stored in a memory device and processed as appropriate to align the sequences with another sequence, to determine whether it matches a reference sequence, or if it meets other criteria.
  • a sequence or a read may be obtained directly from a sequencing apparatus or indirectly from stored sequence information concerning the sample.
  • sequenced fragment or “fragment sequence” as used herein refers to the combined sequence and length information of a DNA fragment which is gained, for example, from a pair of sequencing reads which were created by sequencing both ends of that DNA fragment, a process which is known as “paired-end read sequencing”, and subsequently aligning the obtained sequences to a reference genome.
  • the length information is obtained from start and end coordinates of the paired sequence alignments.
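  • The coordinate arithmetic described above can be sketched as follows. This is a minimal illustration that assumes 0-based, half-open alignment intervals for both reads of a pair on the same chromosome; the function names are hypothetical and the coordinate convention is an assumption, not something stated in the source.

```python
def fragment_length(start, end):
    """Length of a fragment given 0-based, half-open [start, end) coordinates."""
    if end <= start:
        raise ValueError("end must be greater than start")
    return end - start


def fragment_from_pair(read1, read2):
    """Combine two read alignments, each given as (start, end), into one fragment.

    The fragment spans from the leftmost aligned base to the rightmost one.
    """
    start = min(read1[0], read2[0])
    end = max(read1[1], read2[1])
    return start, end, fragment_length(start, end)


# Two 151 bp reads sequenced from both ends of one fragment.
print(fragment_from_pair((1000, 1151), (1016, 1167)))  # (1000, 1167, 167)
```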
  • This information can also be extracted from a single sequencing read of a DNA fragment which was created by exhaustive sequencing of a DNA fragment until an adjacent sequencing adapter is read during the sequencing process. This type of sequencing process is known as “single-end read sequencing”.
  • the adapter sequence is removed computationally from the read sequence afterwards.
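  • A minimal sketch of computational adapter removal for a single-end read is shown below. It assumes an exact match of the adapter sequence, which real trimming tools relax to tolerate sequencing errors; the function name and the example adapter string are illustrative assumptions.

```python
def trim_adapter(read, adapter):
    """Remove `adapter` and everything after it from `read`.

    Returns (trimmed_read, fragment_length). fragment_length is None when
    the adapter is not found, i.e. the fragment was not read through.
    """
    idx = read.find(adapter)  # index of first exact occurrence, or -1
    if idx == -1:
        return read, None
    return read[:idx], idx


# Hypothetical adapter prefix used only for illustration.
print(trim_adapter("ACGTACGT" + "AGATCGG", "AGATCGG"))  # ('ACGTACGT', 8)
```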
  • the DNA sequences of the cfDNA fragments have different lengths.
  • the length may vary from tens to hundreds of base pairs.
  • the sequence reads are about 25 bp, about 30 bp, about 35 bp, about 40 bp, about 45 bp, about 50 bp, about 55 bp, about 60 bp, about 65 bp, about 70 bp, about 75 bp, about 80 bp, about 85 bp, about 90 bp, about 95 bp, about 100 bp, about 110 bp, about 120 bp, about 130 bp, about 140 bp, about 150 bp, about 175 bp, about 200 bp, about 250 bp, about 300 bp, about 350 bp, about 400 bp, about 450 bp, or about 500 bp.
  • the sequence reads are 151 bp for each end of a DNA fragment that is sequenced in paired-end read sequencing mode.
  • paired-end reads are 50 bp, 75 bp, 100 bp, 101 bp, 150 bp, 151 bp, or 175 bp long.
  • aligning refers to the process of comparing a DNA sequence with a reference sequence. In other words, aligning means comparing a read or sequence obtained by sequencing to a reference sequence and thereby determining whether the reference sequence contains the read sequence, the location where the read sequence is aligned in the reference sequence, and/or how the read sequence aligns with the reference sequence.
  • the read may be mapped to the reference sequence or, in certain embodiments, to a particular location in the reference sequence. In some cases, alignment simply tells whether or not a read is a member of a particular reference sequence (i.e., whether the read is present or absent in the reference sequence).
  • a “reference sequence” or a “reference genome sequence” is a sequence of a biological molecule, which is frequently a nucleic acid such as a chromosome or genome. Typically, DNA sequences of multiple cfDNA fragments are members of a given reference sequence.
  • the reference sequence is significantly larger than the sequenced portions or reads that are aligned to it.
  • the reference sequence is the sequence of a full length genome of a subject, specifically it is a full length human genome. Such sequences may be referred to as reference genome sequences. Such sequences may also be referred to as chromosome reference sequences. Other examples of reference sequences include genomes of other species, as well as chromosomes, sub-chromosomal regions, e.g., strands of any species.
  • the reference sequence is a consensus sequence or other combination derived from multiple individuals. However, in certain applications, the reference sequence may be taken from a particular individual.
  • a “site” is a unique position in a reference sequence corresponding to a read or a DNA sequence of a cfDNA fragment.
  • a DNA sequence of a cfDNA fragment is aligned with a reference genome sequence in order to determine the cfDNA fragmentation profile.
  • fragmentation profile refers to evaluation of fragmentation patterns of cfDNA across the genome.
  • Such an evaluation can include cfDNA fragment lengths, positions of aligned fragments relative to the reference genome sequence, relative to a specific point on the reference genome, or alignment positions of multiple fragments relative to each other, the ratio between cfDNA fragments with different lengths (e.g., ratio between all cfDNA below a certain length (e.g., 150 bp) vs. all fragments above this length), or whether the nucleosome patterns computed from the cfDNA fragments correspond to nucleosome patterns of a particular cell type, such as white blood cells.
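  • One of the fragmentation profile features named above, the ratio between short and long fragments, can be sketched as follows. The 150 bp cutoff is the example value from the text; counting fragments of exactly the cutoff length as long is an assumption, as is the function name.

```python
def short_to_long_ratio(lengths, cutoff=150):
    """Ratio of fragments shorter than `cutoff` to fragments at or above it."""
    short = sum(1 for n in lengths if n < cutoff)
    long_ = sum(1 for n in lengths if n >= cutoff)  # assumption: cutoff counts as long
    if long_ == 0:
        raise ValueError("no fragments at or above the cutoff")
    return short / long_


lengths = [120, 140, 150, 167, 167, 180]  # toy fragment lengths in bp
print(short_to_long_ratio(lengths))       # 0.5
```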
  • the fragmentation profile of cfDNA fragments is used to generate a nucleosome map that identifies the position of nucleosomes in the sample.
  • the nucleosome map displays positions of nucleosome peaks, indicating open and closed chromatin regions in the subject’s genome.
  • Open chromatin regions indicate regions of the genome that do not contain nucleosomes. These open regions are able to be bound by various protein factors and regulatory elements and transcribed.
  • Closed chromatin regions are regions of the genome that surround nucleosomes and are inaccessible to protein factors, regulatory elements, and other molecules. These closed chromatin regions are not able to be transcribed.
  • the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments is determined by determining the dyad count distribution for specific fragment lengths, performing a fragment length-based truncation, determining probability density functions, and removing of the non-informative portion.
  • This probability is also termed “nucleosome dyad prior distribution”, “nucleosome prior distribution”, or “nucleosome prior” herein.
  • the step of determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments comprises mapping of nucleosomal dyads to cfDNA fragments within a coverage peak.
  • mapping of nucleosomal dyads to cfDNA fragments within a coverage peak refers to initial nucleosomal dyad mapping to all cfDNA fragments within a coverage peak, specifically from the cfDNA fragmentation profile.
  • mapping of nucleosomal dyads to cfDNA fragments within a coverage peak may be performed as follows: In each coverage peak, the position of maximum coverage overlap is mapped to each individual cfDNA fragment contributing to the peak, i.e., within each cfDNA fragment, the relative position of the dyad is inferred. Specifically, this is illustrated in the enlarged panel of Figure 5, left side, where the putative localization of the nucleosome dyad is indicated as a dashed line, which meets some cfDNA fragments within the peak. Specifically, the relative position of the nucleosome dyad may map to the center of a cfDNA fragment or off the center or may not be determinable.
  • initial nucleosomal dyad mapping to all cfDNA fragments within a coverage peak is depicted in Figure 5, right side, left panel (“nucleosomal fragments”), where the cfDNA fragments from the coverage peak from the enlarged panel of Figure 5, left side, are displayed sorted by size and where an arrow on the respective cfDNA fragment indicates the location of the nucleosomal dyad.
  • the step of determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments comprises establishment of a peak-specific and cfDNA length-specific statistic.
  • establishment of a peak-specific and cfDNA length-specific statistic may be also referred to as establishment of a locus-specific and cfDNA length-specific statistic.
  • establishment of a peak-specific and cfDNA length-specific statistic allows a detailed cfDNA fragment length-specific dyad statistic for each peak (locus).
  • for all cfDNA fragments that map to the same locus and have the same length, the inferred nucleosome dyad positions are recorded (see e.g., Figure 5, right side, center panel (Fragment Length Specific Dyad Statistics)).
  • the step of determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments comprises establishing a distribution of probability of the presence of a nucleosomal dyad.
  • establishing a distribution of probability of the presence of a nucleosomal dyad may be also referred to as establishment of the nucleosome prior distribution (yi).
  • Figure 6 depicts the normalization based on cfDNA counts. The sum of counts is needed to generate an AUC of 1 for the entire region within the range defined in the previous step. Next, the area of random position signals is determined, which involves an update of the ‘0’ axis. Then, the random area is subtracted from the entire area, resulting in an AUC of ≤1 for the respective areas. This facilitates comparing the relative height between priors to relate different priors to each other. The result is cfDNA fragment length-specific high-confidence information about dyad positioning. These nucleosome dyad prior distributions contain the data required to calculate the posterior nucleosome localization probability from cfDNA fragmentation.
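  • The normalization and baseline subtraction described above can be sketched as follows. This is a simplified illustration, not the procedure of the source: it assumes that the random position signal is a uniform background whose level equals the lowest normalized value, which is an assumption introduced here because the text does not specify how the random area is estimated. The function name is hypothetical.

```python
def dyad_prior(counts):
    """Turn dyad counts per relative fragment position into a prior.

    1. Divide by the sum of counts so the area under the curve is 1.
    2. Subtract an assumed uniform background (the minimum value), so the
       remaining area is at most 1 and only the informative part is kept.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("no dyad counts recorded")
    density = [c / total for c in counts]   # area under curve = 1
    baseline = min(density)                 # assumed random-position level
    return [d - baseline for d in density]  # area under curve <= 1


prior = dyad_prior([1, 1, 4, 1, 1])  # toy counts for one fragment length
print(prior)                         # [0.0, 0.0, 0.375, 0.0, 0.0]
```

  • Because every prior starts from an area of 1 before subtraction, the remaining peak heights are comparable between fragment lengths.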
  • the prior localizations are retrieved and overlaid with the BAM file, and the original fragments from the BAM file are used for calculations.
  • the overlaid nucleosome prior signals are then used to calculate the nucleosome localization posterior probability. After this step, the nucleosome dyad posterior distribution is available.
  • the step of determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments comprises mapping of nucleosomal dyads to cfDNA fragments within a coverage peak, establishment of a peak-specific and cfDNA length-specific statistic, and establishing a distribution of probability of the presence of a nucleosomal dyad.
  • the nucleosome fragment-specific prior distributions allow calculation of a per-base average across these distributions, resulting in the nucleosome posterior signal (see e.g., Figure 8).
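  • The per-base averaging mentioned above can be sketched as follows. The snippet only illustrates the averaging of overlaid prior distributions at each genomic position and does not reproduce the full inference of the source. It assumes each fragment is given as a start coordinate plus a list of prior values, one per base; the function name and data layout are hypothetical.

```python
from collections import defaultdict


def posterior_signal(fragments):
    """Average overlaid fragment priors at every covered genomic position.

    fragments: iterable of (start, prior), where prior[i] is the dyad
    probability assigned to genomic position start + i.
    """
    total = defaultdict(float)  # sum of prior values per position
    n = defaultdict(int)        # number of fragments covering each position
    for start, prior in fragments:
        for offset, p in enumerate(prior):
            total[start + offset] += p
            n[start + offset] += 1
    return {pos: total[pos] / n[pos] for pos in sorted(total)}


fragments = [(10, [0.0, 0.5, 0.0]), (11, [0.25, 0.25])]  # toy priors
print(posterior_signal(fragments))  # {10: 0.0, 11: 0.375, 12: 0.125}
```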
  • the average probability of the presence of a nucleosomal dyad at certain base positions in the reference genome sequence is determined based on the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments and the fragmentation profile obtained from aligning DNA sequences of the cfDNA fragments with a reference genome sequence.
  • This average probability also termed “nucleosome dyad posterior distribution”, “nucleosome posterior distribution”, or “nucleosome posterior” herein, is determined by Bayesian inference as described herein.
  • Bayesian inference is used to compute the positions of nucleosome dyads based on coverage maxima and cfDNA fragmentation by using Bayes' theorem.
  • Bayes' theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event.
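  • For reference, the general form of Bayes' theorem is given below. The symbols are chosen here for illustration and are assumptions: D stands for the event that a nucleosomal dyad is present at a given base position and F for the observed cfDNA fragmentation data, so that P(D) plays the role of the nucleosome prior and P(D | F) that of the nucleosome posterior.

```latex
\[
  P(D \mid F) = \frac{P(F \mid D)\, P(D)}{P(F)}
\]
```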
  • two signals are generated and evaluated in the methods described herein: the first is based on sequencing coverage, i.e., the number of sequencing reads aligned to a specific locus in a reference genome (see e.g., Figure 6, left side: assumption 1); the second signal is the “posterior nucleosome signal” based on Bayesian inference.
  • the methods described herein comprise the step of mapping the peaks of the average probability of the presence of a nucleosomal dyad across a reference genome sequence.
  • said reference genome sequence may be the same as the reference genome sequence used for aligning DNA sequences of cfDNA fragments.
  • peaks refers to local maxima of said average probability along the reference genome sequence. Specifically, these local maxima must be at least 2 bp, 3 bp, 4 bp, 5 bp, 7 bp, 10 bp, 12 bp, 15 bp, or more apart from each other and must be supported by more than 1, 2, 3, 4, 5, 6, 7, 8, or more cfDNA fragments. Higher minimum distance values yield stricter peak calling and peak grouping results, whereas lower values allow for more permissive peak calling and grouping. The number of required supporting fragments must also take into account the target sequencing depth of the sequencing dataset. A cfDNA fragment supports a peak if one of the highest local maxima of the fragment’s associated nucleosome prior distribution is located within 20 bp of the local maximum of the nucleosome posterior distribution or within a smaller base range.
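  • The peak criteria above (local maximum, minimum distance, minimum fragment support) can be sketched as follows. This is an illustrative sketch under assumptions: the posterior signal and the number of supporting fragments are given as per-base lists, the support values are precomputed (the 20 bp support rule is not implemented), and competing maxima are resolved greedily in favor of the higher one. Function and parameter names are hypothetical.

```python
def call_peaks(signal, support, min_distance=10, min_support=3):
    """Return positions of local maxima that satisfy distance and support rules.

    signal:  posterior value per base position
    support: number of supporting fragments per base position (precomputed)
    A peak needs more than `min_support` fragments, and any two peaks must be
    at least `min_distance` bases apart.
    """
    candidates = [
        i for i in range(1, len(signal) - 1)
        if signal[i] > signal[i - 1]
        and signal[i] >= signal[i + 1]
        and support[i] > min_support
    ]
    peaks = []
    # Greedy selection: visit the highest maxima first, drop close neighbors.
    for i in sorted(candidates, key=lambda k: signal[k], reverse=True):
        if all(abs(i - j) >= min_distance for j in peaks):
            peaks.append(i)
    return sorted(peaks)


signal = [0, 1, 3, 1, 0, 2] + [0] * 9 + [5, 0]  # toy posterior signal
support = [3] * 17
support[15] = 2                                  # too few fragments here
print(call_peaks(signal, support, min_distance=5, min_support=2))  # [2]
```

  • Lowering `min_distance` to 3 in the toy example also admits the weaker maximum at position 5, which mirrors the more permissive behavior the text describes for lower distance values.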
  • nucleosomal dyads or the peaks of the average nucleosomal dyad probability are mapped for the whole genome or for sub-regions thereof.
  • the methods described herein may further comprise analyzing the depth of coverage.
  • depth of coverage refers to the number of fragment sequences that align with a particular site of the reference genome.
  • coverage describes whether or not any fragment sequence aligns with a particular site or region of a reference genome. In another embodiment, it is also used to describe the average target coverage across an entire reference genome.
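  • The two notions defined above can be sketched for a single site as follows, assuming aligned fragments are given as 0-based, half-open (start, end) intervals on one chromosome. The function names and the coordinate convention are assumptions.

```python
def depth_of_coverage(fragments, site):
    """Number of fragments whose alignment interval contains `site`."""
    return sum(1 for start, end in fragments if start <= site < end)


def is_covered(fragments, site):
    """True if at least one fragment aligns over `site`."""
    return depth_of_coverage(fragments, site) > 0


fragments = [(0, 167), (100, 267), (300, 467)]  # toy alignments
print(depth_of_coverage(fragments, 120))        # 2
print(is_covered(fragments, 280))               # False
```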
  • coverage pattern generally refers to a spatial arrangement of fragment sequences after alignment of read sequences to a reference genome.
  • the coverage pattern identifies the extent and depth of coverage of next-generation sequencing methods.
  • the methods described herein may further comprise determining the fragment support of inferred nucleosome dyads.
  • the method for analyzing cfDNA described herein further comprises step vii. of chaining the mapped peaks obtained from step vi. across the reference genome sequence into series of mapped peaks following rules for naturally occurring nucleosomal spacing.
  • chain as used herein in the context of chaining nucleosomal dyads refers to the grouping of peaks of the posterior probability that occur consecutively along a reference genome following rules of naturally occurring nucleosomal spacing and regularity of fragment support.
  • rules for naturally occurring nucleosomal spacing and regularity of fragment support are as follows. Peaks may only be chained if there is a minimum distance of 100 bp, 115 bp, 120 bp, 125 bp, 130 bp, 135 bp, 140 bp, 145 bp, or 146 bp between them, whereas chaining becomes more stringent if a higher minimum distance is chosen. Based on these possible lower distance bounds and by using fragment support information, one or more low-fragment-support sub-chains may be identified at the site of an already established nucleosome chain that has higher average fragment support.
  • nucleosome dyads may be arbitrarily far apart from each other because the association of DNA with histone octamer cores is not strictly necessary for the existence of a DNA molecule.
  • a formal definition of chain termination conditions is nevertheless used to obtain chains that are confined in genomic space, which equates to working with a higher nucleosome chain resolution.
  • Well-covered stretches of DNA of at least 50% of the data set’s target x-fold coverage that exceed a length of 471 bp and that are found to be devoid of nucleosome dyad peaks are not expected to be observed in natural chromatin. This criterion is used for termination of all nucleosome chains and sub-chains neighboring the nucleosome-deserted reference stretch.
  • Shorter distances can be used to establish a more stringent chain termination behavior.
  • Such more stringent termination distances can be around 450 bp, 430 bp, 410 bp, 390 bp, 370 bp, 350 bp, 340 bp, 330 bp, 320 bp, 310 bp, 300 bp, 290 bp, 280 bp, 270 bp, 260 bp, 250 bp, 240 bp, 230 bp, 220 bp, 210 bp, 200 bp, 190 bp, or 185 bp.
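  • The spacing-based chaining and gap-based termination described above can be sketched as follows. This is a simplified illustration under assumptions: peaks are given as sorted genomic positions, the sequencing coverage condition on the peak-free stretch is ignored, and peaks that are too close to the previous chained peak are simply skipped instead of being assigned to a sub-chain. Function and parameter names are hypothetical; the defaults are values listed in the text.

```python
def chain_peaks(peaks, min_spacing=146, max_gap=471):
    """Group peak positions into chains by nucleosomal spacing rules.

    A peak joins the current chain if it lies at least `min_spacing` bp and
    at most `max_gap` bp after the last chained peak. A larger gap terminates
    the chain and starts a new one.
    """
    chains = []
    current = []
    for pos in sorted(peaks):
        if not current:
            current = [pos]
            continue
        gap = pos - current[-1]
        if gap < min_spacing:
            continue  # too close; would belong to a sub-chain (not modeled)
        if gap > max_gap:
            chains.append(current)  # peak-free stretch ends the chain
            current = [pos]
        else:
            current.append(pos)
    if current:
        chains.append(current)
    return chains


peaks = [1000, 1050, 1180, 1370, 2000, 2190]  # toy dyad peak positions
print(chain_peaks(peaks))  # [[1000, 1180, 1370], [2000, 2190]]
```

  • Choosing a smaller `max_gap` from the listed values splits the peaks into more and shorter chains, which corresponds to the more stringent termination behavior described above.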
  • Another chain termination condition is defined by diminished fragment support of consecutive nucleosome peaks that would otherwise fulfill spacing constraints.
  • A sudden change in fragment support for the next peak that is to be chained of 2 standard deviations of the previous average fragment support as estimated from the current chain, or a reduction below 40% of the average fragment support of chained peaks, also indicates the termination of a nucleosome chain.
  • Higher percentage values/smaller number of standard deviations of the fragment support drop can be used to achieve a more stringent chain termination. Such values are 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, and 99% of the average fragment support of chained peaks.
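  • The fragment support condition can be sketched as a predicate that is evaluated before a peak is appended to a chain. The sketch assumes the fragment support of each chained peak is available as a number and interprets "a change of 2 standard deviations" as an absolute difference from the chain average of at least that size; this interpretation and the names used are assumptions.

```python
from statistics import mean, stdev


def support_terminates_chain(chain_support, next_support,
                             n_sd=2.0, min_fraction=0.40):
    """True if the next peak's fragment support should end the current chain.

    chain_support: fragment support of the peaks already in the chain
    next_support:  fragment support of the candidate peak
    """
    avg = mean(chain_support)
    if next_support < min_fraction * avg:
        return True  # dropped below the allowed fraction of the average
    if len(chain_support) >= 2:
        sd = stdev(chain_support)
        if sd > 0 and abs(next_support - avg) >= n_sd * sd:
            return True  # sudden change relative to the chain's variability
    return False


chain = [10, 12, 11, 9, 8]  # toy support values, average 10
print(support_terminates_chain(chain, 12))  # False
print(support_terminates_chain(chain, 3))   # True
```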
  • the herein provided method allows determining nucleosome positions with higher precision; thus, adjacent signals that appeared to be just one signal with previously known methods can be resolved (see e.g., Figures 9 and 10). This makes it possible to determine which nucleosomal signals belong together, i.e., whether the nucleosomes form one chain (only one cell lineage caused the pattern) or more than one chain (several cell lineages or tissues caused the nucleosome pattern).
  • biological knowledge and well-established average distances between nucleosomes are included following determining which signals belong together, e.g., were contributed by a specific cell type.
  • adjacent signals are not only chained, but due to the increased resolution, the number of cell lineages that make up the nucleosome signals in any region of the genome can be inferred, which includes unprecedented resolution at the individual gene level.
  • one or more chains of mapped peaks are obtained in the methods described herein.
  • each chain represents a cell lineage/tissue of origin of the cfDNA.
  • mapping or chaining may result in one or more nucleosome maps.
  • this is made possible by the superior resolution of the herein described methods, which allows a nucleosome peak to be resolved into several peaks and hence allows determining, for each region in the genome, how many cell lineages/tissues have contributed.
  • nucleosome positions are evaluated for a specific region; distance evaluations between these nucleosome peaks are applied, where these distances are derived from “biology knowledge” (“naturally occurring nucleosomal spacing”).
  • chaining is performed genome-wide.
  • chaining is performed in coding and non-coding regions.
  • the herein described methods allow chaining for established regulatory regions, such as TSSs or TFBSs, and genome-wide. Genome-wide chaining may include coding and non-coding regions.
  • the herein described methods allow chaining in non-coding regions, e.g., introns.
  • chaining refers to analysis of nucleosome occupancies.
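By way of illustration only, the chaining of mapped peaks described above may be sketched as follows. The sketch assumes that peaks are given as integer base positions and that the rules for naturally occurring nucleosomal spacing can be reduced to an allowed distance range between consecutive dyads of one chain. The function name `chain_peaks` and the bounds of 150 bp and 230 bp are hypothetical values chosen for the example and are not taken from the description.

```python
def chain_peaks(peaks, min_spacing=150, max_spacing=230):
    """Greedily group dyad peak positions into chains.

    A peak joins the first chain whose last peak lies at a distance within
    [min_spacing, max_spacing]; otherwise it starts a new chain.
    The spacing bounds are illustrative assumptions.
    """
    chains = []
    for pos in sorted(peaks):
        for chain in chains:
            if min_spacing <= pos - chain[-1] <= max_spacing:
                chain.append(pos)
                break
        else:
            # No existing chain is compatible with this peak.
            chains.append([pos])
    return chains


# Two interleaved series of peaks, each with 190 bp spacing.
chains = chain_peaks([1000, 1080, 1190, 1270, 1380, 1460])
print(len(chains), chains)
```

In this example two chains are obtained for the region, which under the interpretation given above would correspond to two contributing cell lineages or tissues. A greedy assignment is only one possible strategy; it is used here because it is short.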
  • an in vitro method for determining the health status of a subject comprising the steps of: i. extracting cfDNA fragments from a sample from the subject; ii. performing whole genome sequencing on the extracted cfDNA fragments; iii. determining the cfDNA fragmentation profile by aligning the DNA sequences of the cfDNA fragments from ii. with a reference genome sequence; iv. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; v. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iv.
  • a reference nucleosome map is not necessarily needed.
  • from the nucleosome positions it can be deduced which genes and pathways are active or silent in the cells that release their DNA into the circulation. Specifically, from these gene and pathway activities, it can be directly inferred which cells contribute to the cfDNA pool, as gene expression and signal pathways are highly cell and tissue specific.
  • the in vitro method for determining the health status of a subject further comprises, after step vi., the step of chaining the mapped peaks obtained from step vi. across the reference genome sequence into series of mapped peaks following rules for naturally occurring nucleosomal spacing. Specifically, the further step of chaining is performed after mapping in step vi. and before comparing in step vii. Thereby, the step of chaining is performed as step vii. and comparing is performed as step viii.
  • comparing in step viii. comprises comparing the mapped peaks obtained in vi. with a library comprising standard maps, comparing the mapped peaks obtained in vi. with a library comprising outlier maps of nucleosomal dyads, and/or comparing the chained peaks obtained in vii. with a standard map of nucleosomal dyad chains, wherein a. congruence with the library of standard maps derived from healthy subjects and difference with the library of standard maps derived from unhealthy subjects is characteristic for a healthy status; b. congruence with the library of standard maps derived from unhealthy subjects and difference with the library of standard maps derived from healthy subjects is characteristic for an unhealthy status; c. congruence with the library of outlier maps of nucleosomal dyads derived from unhealthy subjects and difference with the standard maps of libraries from healthy and unhealthy subjects is characteristic for an unhealthy status; and d. difference with a standard map of nucleosomal dyad chains obtained from healthy subjects is characteristic for an unhealthy status.
  • a congruence of 50, 60, 65, 70, 75, 80, 85, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, or 100% with standard maps derived from healthy subjects is characteristic for a healthy status.
  • a difference of 50, 60, 65, 70, 75, 80, 85, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, or 100% with standard maps derived from unhealthy subjects is characteristic for a healthy status.
  • a congruence of 50, 60, 65, 70, 75, 80, 85, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, or 100% with standard maps derived from unhealthy subjects is characteristic for an unhealthy status.
  • a difference of 50, 60, 65, 70, 75, 80, 85, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, or 100% with standard maps derived from healthy subjects is characteristic for an unhealthy status.
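The description does not fix a formula for the congruence percentages listed above. Purely as an assumption for illustration, congruence may be taken as the percentage of peaks of a standard map for which the sample shows a peak within a given tolerance. The function name `congruence` and the tolerance of 10 bp are hypothetical.

```python
import bisect


def congruence(sample_peaks, standard_peaks, tolerance=10):
    """Percentage of standard-map peaks matched by a sample peak within `tolerance` bp.

    This definition of congruence is an assumption made for illustration.
    """
    if not standard_peaks:
        return 0.0
    sample = sorted(sample_peaks)
    matched = 0
    for ref in standard_peaks:
        # Index of the first sample peak at or right of (ref - tolerance).
        i = bisect.bisect_left(sample, ref - tolerance)
        if i < len(sample) and sample[i] <= ref + tolerance:
            matched += 1
    return 100.0 * matched / len(standard_peaks)


print(congruence([100, 290, 505], [105, 300, 480, 700]))
```

Under this assumed definition, the corresponding difference is simply 100% minus the congruence.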
  • the library comprises standard maps derived from healthy subjects and/or standard maps derived from unhealthy subjects.
  • comparing the mapped nucleosomal dyads with a library comprising standard maps of nucleosomal dyads comprises determining the deviation of nucleosomal dyad positioning on cfDNA fragments, determining changes in the informative counts ratio (e.g. Figure 21), determining presence or absence of specific nucleosome positioning that is characteristic for one of the healthy or unhealthy subject groups, and/or determining chains of nucleosomal dyad positioning.
  • the subject is considered healthy if the deviation of nucleosomal dyad positioning on cfDNA fragments between the mapped nucleosomal dyads obtained in vi. and standard maps of nucleosomal dyads characteristic for a healthy subject is less than the deviation of nucleosomal dyad positioning between the mapped nucleosomal dyads obtained in vi. and standard maps of nucleosomal dyads characteristic for an unhealthy subject. Specifically, said deviation of nucleosomal dyad positioning between the mapped nucleosomal dyads obtained in vi. and standard maps of nucleosomal dyads characteristic for a healthy subject is 99, 98, 97, 96, 95, 94, 93, 92, 91, 90, 89, 88, 87, 86, 85, 84, 83, 82, 81, 80, 75, 70, 65, 60, 55, 50, 45, 40, 35, 30, 25, 20, 15, 10, or 5% of the deviation of nucleosomal dyad positioning between the mapped nucleosomal dyads obtained in vi. and standard maps of nucleosomal dyads characteristic for an unhealthy subject.
  • congruence of a subject's nucleosomal dyad chains with a standard map of nucleosomal dyad chains obtained from healthy subjects is characteristic for a healthy status.
  • deviation of a subject's nucleosomal dyad chains from a standard map of nucleosomal dyad chains obtained from unhealthy subjects is characteristic for a healthy status.
  • congruence of a subject's nucleosomal dyad chains with a standard map of nucleosomal dyad chains obtained from unhealthy subjects is characteristic for an unhealthy status.
  • a machine learning model for binary classification between healthy and unhealthy regarding a specific disease group or pregnancy can be trained on the set of standard dyad chains from samples of both groups to learn patterns of dyad chains that signify an unhealthy sample. Multiple such models can be combined to achieve multi-class classification.
  • an in vitro method for determining the health status of a subject comprising the steps of: i. extracting cfDNA fragments from a sample from the subject; ii. performing whole genome sequencing on the extracted cfDNA fragments; iii. determining the cfDNA fragmentation profile by aligning the DNA sequences of the cfDNA fragments from ii. with a reference genome sequence; iv. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; v. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iv.
  • the library comprises standard maps derived from healthy subjects and/or standard maps and outlier maps derived from unhealthy subjects.
  • comparing the mapped nucleosomal dyads obtained in vi. with a library comprising standard maps and outlier maps of nucleosomal dyads comprises determining the deviation of nucleosomal dyad positioning on cfDNA fragments, determining changes in the informative counts ratio (e.g. Figure 21), determining presence or absence of specific nucleosome positioning that is characteristic for one of the healthy or unhealthy subject groups, and/or determining chains of nucleosomal dyad positioning.
  • the subject is considered unhealthy if the deviation of nucleosomal dyad positioning on cfDNA fragments between the mapped nucleosomal dyads obtained in vi. and standard maps of nucleosomal dyads characteristic for a healthy subject is more than the deviation of nucleosomal dyad positioning between the mapped nucleosome dyads obtained in vi. and standard maps of nucleosomal dyads characteristic for an unhealthy subject.
  • the subject is considered unhealthy if the deviation of nucleosomal dyad positioning between the mapped nucleosomal dyads obtained in vi. and standard maps of nucleosomal dyads characteristic for a healthy subject is 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5, 10, 15, or 20-fold, or even higher, than the deviation of nucleosomal dyad positioning between the mapped nucleosomal dyads obtained in vi. and standard maps of nucleosomal dyads characteristic for an unhealthy subject.
  • changes of the congruence between the sample dyad map and the standard dyad map of healthy subjects are expressed as z-score and a subject is identified as unhealthy, if said z-score exceeds a limit that is characteristic for a specific group of unhealthy samples, preferably if the absolute value of the z-score exceeds 3.
  • changes of the congruence between the sample dyad map and the standard dyad map of healthy subjects are expressed as z-score and a subject is identified as unhealthy, if the z-score exceeds a limit that is characteristic for a specific group of unhealthy samples, preferably if the absolute value of the z-score exceeds 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, or even more.
  • changes of the congruence between the sample dyad map and the standard dyad map of unhealthy subjects are expressed as z-score and a subject is identified as healthy, if the z-score exceeds a limit that is characteristic for a specific group of unhealthy samples, preferably if the absolute value of the z-score exceeds 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, or even more, more preferably if the absolute value of the z-score exceeds 2.
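The z-score criterion above may be illustrated with the following minimal sketch. It assumes that a single feature value (for example a congruence value) has been recorded for each subject of a reference group and that the sample standard deviation of that group is used. The function names and the numerical values are hypothetical; the limit of 3 corresponds to the preferred threshold named above.

```python
from statistics import mean, stdev


def z_score(value, reference_values):
    """z-score of `value` relative to the distribution of a reference group."""
    mu = mean(reference_values)
    sigma = stdev(reference_values)  # sample standard deviation
    if sigma == 0:
        raise ValueError("reference values show no variation")
    return (value - mu) / sigma


def exceeds_limit(value, reference_values, limit=3.0):
    """True if the absolute z-score exceeds the given limit."""
    return abs(z_score(value, reference_values)) > limit


# Hypothetical congruence values (in %) of a healthy reference group.
healthy = [90, 92, 94, 96, 98]
print(exceeds_limit(80, healthy), exceeds_limit(93, healthy))
```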
  • a machine learning model may be used to learn classification from multiple algorithms.
  • Callable regions are defined as regions that exceed the lower bound for minimum fragment support for calling main peaks from the nucleosome posterior signal.
  • the term “subject” generally refers to an individual, entity or a medium that has or is suspected of having testable or detectable genetic information or material.
  • a subject can be a person, individual, or patient.
  • the subject can be a vertebrate, such as, for example, a mammal.
  • Non-limiting examples of mammals include humans, simians, farm animals, sport animals, rodents, and pets.
  • the subject may be displaying a symptom(s) indicative of a health or physiological state or condition of the subject, such as a cancer or a stage of a cancer of the subject.
  • the subject can be asymptomatic with respect to such health or physiological state or condition.
  • the standard maps of nucleosomal dyads characteristic for a healthy subject are derived from healthy subjects.
  • Healthy subjects may be understood as subjects not having the symptoms that the subject to be tested is suffering from.
  • healthy subjects are not suffering from cancer, inflammation, coronary disease, acute tissue damage, chronic disease, complications during pregnancy, beginning sepsis, sepsis, and/or unhealthy aging.
  • the term “cohort” or “cohort of subjects” shall refer to a group of subjects having a specific classification and may specifically refer to the samples received from said subjects.
  • the number of subjects of a cohort can vary, i.e. it may comprise 2, 3, 4, 5, 6, 7 or more subjects; however, it may also be a larger group of subjects, such as, but not limited to, 10, 50, 100 or more subjects.
  • the cohort may also comprise large cohorts of 500 or more subjects.
  • the cohort of subjects as described herein shall refer to a group of subjects being associated with or having a condition. These subjects of a cohort can thereby be assigned to a specific classification or status, e.g. a certain condition such as a clinical, physiologic, or pathologic condition, specifically, selected from but not limited to health status, aging status, cell type, tissue type, and specific disease status.
  • the cohort of subjects shall refer to a group of subjects being healthy, unhealthy, of a certain age, and/or having a specific disease.
  • Markers for specific conditions may be, but are not limited to, patterns of dyad positions indicating a specific condition of a subject or a cohort of subjects.
  • Aging is a combination of processes of deterioration that follow the period of development of an organism. Aging is generally characterized by a declining adaptability to stress, increased homeostatic imbalance, increase in senescent cells, and increased risk of disease. Because of this, death is the ultimate consequence of aging.
  • Unhealthy aging may be induced by stress conditions including, but not limited to, chemical, physical, and biological stresses. Unhealthy aging is also referred to as “inflammaging”. For example, accelerated aging can be induced by stresses caused by UV and IR irradiation; drugs and other chemicals; chemotherapy; intoxicants, such as but not limited to DNA intercalating and/or damaging agents, oxidative stressors etc.; mitogenic stimuli; oncogenic stimuli; toxic compounds; hypoxia; oxidants; caloric restriction; exposure to environmental pollutants, for example, silica; or exposure to an occupational pollutant, for example, dust, smoke, asbestos, or fumes.
  • the standard maps of nucleosomal dyads characteristic for an unhealthy subject are derived from subjects suffering from cancer, inflammation, coronary disease, acute tissue damage, chronic disease, complications during pregnancy, beginning sepsis, sepsis, and/or unhealthy aging.
  • described herein is the establishment of a library comprising standard maps of nucleosomal dyads.
  • the standard maps of nucleosomal dyads are established from samples of healthy and/or unhealthy subjects as described herein.
  • the preparation of standard maps of nucleosomal dyads comprises analyzing cfDNA as described herein. Specifically, recurring peaks are integrated into a standard map of peak positions for a specific group of samples. Peak positions are regarded as recurring in a homogeneous group of samples, if a peak of the nucleosome posterior distribution is called within a region of 5 bp, 10 bp, 15 bp, or 20 bp at a specific site for 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, or 95% of the samples, provided that the samples are homogeneous regarding a specific health status or health characteristic or a particular pathology.
  • Samples may be excluded from the computation if the candidate site is not sufficiently covered, e.g. depth of coverage is below the minimum fragment support required for peak calling from the posterior nucleosome signal. Highly pronounced singular peaks are recorded for every non-healthy sample group in a separate outlier map of nucleosomal dyads as described later on. Standard maps of non-healthy groups may also include locations of recurring peak locations from the healthy standard map of nucleosomal dyads, if these are recurrently absent in the non-healthy group.
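The integration of recurring peaks into a standard map may be sketched as follows, under simplifying assumptions. Peaks of all samples of a group are pooled, peaks lying within a region of `window` bp are treated as one candidate site, and a site is recorded if at least `min_fraction` of the samples support it. The function name `standard_map`, the clustering by a fixed window starting at the leftmost peak, and the use of the median as the recorded position are assumptions for illustration. The exclusion of insufficiently covered samples and the handling of outlier peaks are not modelled.

```python
from statistics import median


def standard_map(sample_peak_lists, window=10, min_fraction=0.8):
    """Return positions of peaks that recur in at least `min_fraction` of samples."""
    n_samples = len(sample_peak_lists)
    pooled = sorted(
        (pos, idx)
        for idx, peaks in enumerate(sample_peak_lists)
        for pos in peaks
    )
    # Group pooled peaks into candidate sites spanning at most `window` bp.
    clusters = []
    for pos, idx in pooled:
        if clusters and pos - clusters[-1][0][0] <= window:
            clusters[-1].append((pos, idx))
        else:
            clusters.append([(pos, idx)])
    recurring = []
    for cluster in clusters:
        supporting_samples = {idx for _, idx in cluster}
        if len(supporting_samples) / n_samples >= min_fraction:
            recurring.append(round(median(pos for pos, _ in cluster)))
    return recurring


samples = [[100, 300], [104, 298], [98, 500], [102, 301], [101, 650]]
print(standard_map(samples, window=10, min_fraction=0.8))
```

Lowering `min_fraction` to 0.6 in this example additionally records the site near position 300, which is supported by three of the five samples.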
  • a machine learning model for binary classification between healthy and unhealthy regarding a specific disease group or pregnancy can be trained on the set of standard dyad peaks from samples of both groups to learn patterns of dyad peaks that signify an unhealthy sample. Multiple such models can be combined to achieve multi-class classification.
  • the outlier maps of nucleosomal dyads characteristic for an unhealthy subject are derived from subjects suffering from cancer, inflammation, coronary disease, acute tissue damage, chronic disease, complications during pregnancy, beginning sepsis, sepsis, and/or unhealthy aging.
  • described herein is the establishment of a library comprising outlier maps of nucleosomal dyads for non-healthy groups.
  • the outlier peak maps of nucleosomal dyads are established from samples of healthy and/or unhealthy subjects as described herein.
  • the preparation of outlier maps of nucleosomal dyads comprises analyzing cfDNA as described herein. Specifically, outlier peaks occurring only in a subset of samples of a specific non-healthy group, i.e. not among recurring peaks of the same group, are integrated into a map of outlier peak positions for that specific sample group. Peaks that qualify as outliers must carry a trait or multiple traits that indicate a pronounced character that supports their presence in order to be regarded in the outlier map of nucleosomal dyads.
  • Pronounced peaks can not only be defined by high prominence values, but also by a combination of high confidence values, high fragment support values, low peak dilation values, high prominence values, and/or high phasedness values.
  • Outlier maps of nucleosomal dyads of non-healthy groups may also include locations of recurring peak locations of the healthy standard map of nucleosomal dyads, if these are recurrently absent only in a subgroup of the same non-healthy sample group.
  • Outlier maps may be created for specific subgroups of samples from a non-healthy group if the subgroup is sufficiently homogeneous in terms of outlier peaks (i.e., a number of outliers are common among samples of the subgroup) and in terms of at least one pathological characteristic of these samples or of signals obtained from these samples through methods described herein.
  • a method for monitoring the treatment success.
  • an in vitro method for monitoring the treatment success of a patient comprising the steps of: i. extracting cfDNA fragments from a sample of said patient; ii. performing whole genome sequencing on the extracted cfDNA fragments; iii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from ii. with a reference genome sequence; iv. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; v.
  • the in vitro method for monitoring the treatment success of a patient further comprises the step of chaining the mapped peaks obtained from step vi. across the reference genome sequence into series of mapped peaks following rules for naturally occurring nucleosomal spacing. Specifically, the further step of chaining is performed after mapping in step vi. and before comparing in step vii. Thereby, the step of chaining is performed as step vii. and comparing is performed as step viii.
  • comparing in step viii. comprises comparing the mapped peaks obtained in vi. with the mapped peaks of a previous result from said patient and/or a standard map of nucleosomal dyads characteristic for the treatment success, and/ or comparing the chained peaks obtained in vii. with the chained peaks of a previous result from said patient and/or a standard map of chained peaks characteristic for the treatment success, wherein differences and/or congruences obtained in the step of comparing provide information on the treatment success of the patient.
  • the step of comparing may comprise determining the deviation of nucleosomal dyad positioning on cfDNA fragments, determining changes in the informative counts ratio (compare with raw dyad count distributions from Figure 21), determining presence or absence of specific nucleosome positioning that is characteristic for one of the healthy or unhealthy subject groups, and/or determining chains of nucleosomal dyad positioning.
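One simple way of comparing the mapped peaks of a current sample with a previous result from the same patient is to list peaks that were lost and peaks that were gained. The following sketch assumes that two peaks correspond to each other if they lie within a tolerance of each other. The function name `peak_changes` and the tolerance of 10 bp are hypothetical and not taken from the description.

```python
def peak_changes(previous_peaks, current_peaks, tolerance=10):
    """Return (lost, gained) peaks between two time points of one patient."""

    def has_match(pos, others):
        return any(abs(pos - other) <= tolerance for other in others)

    lost = [p for p in previous_peaks if not has_match(p, current_peaks)]
    gained = [c for c in current_peaks if not has_match(c, previous_peaks)]
    return lost, gained


lost, gained = peak_changes([100, 300, 500], [104, 510, 700])
print(lost, gained)
```

Whether lost or gained peaks indicate treatment success depends on the standard maps used for comparison; the sketch only reports the differences.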
  • the treatment success is monitored for the treatment of cancer, specifically of prostate cancer, colon cancer, breast cancer, bladder cancer, and/or lung cancer; and for the treatment of inflammatory diseases, specifically of inflammatory bowel disease, systemic lupus erythematosus, ulcerative colitis; chronic inflammatory diseases such as thyroiditis, Crohn's disease, chronic obstructive pulmonary disease; and/or asthma.
  • a method is described for determining the cell type and/or tissue contribution of cfDNA in a sample comprising the steps of: i. extracting cfDNA fragments from the sample; ii. performing whole genome sequencing on the extracted cfDNA fragments; iii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from ii. with a reference genome sequence; iv. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; v. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iv. and the fragmentation profile of iii.; vi. mapping peaks of the average nucleosomal dyad probability of v. across the reference genome sequence; and vii. determining the cell type and/or tissue contribution of the cfDNA fragments by comparing the mapped peaks obtained from vi. with a library comprising mapped nucleosomal dyads of specific tissues or cell types.
  • the method for determining the cell type and/or tissue contribution of cfDNA in a sample further comprises the step of chaining the mapped peaks obtained from step vi. across the reference genome sequence into series of mapped peaks following rules for naturally occurring nucleosomal spacing. Specifically, the further step of chaining is performed after mapping in step vi. and before determining in step vii. Thereby, the step of chaining is performed as step vii. and determining is performed as step viii.
  • determining in step viii. comprises comparing the mapped peaks obtained from vi. with a library comprising mapped nucleosomal dyads of specific tissues or cell types, and/ or comparing the chained peaks obtained in vii. with a library comprising chained peaks of specific tissues or cell types.
  • the tissue and/or cell types are selected from the group consisting of cancer cells, specifically lung cancer, colorectal cancer, breast cancer, prostate cancer, and bladder cancer; and normal cells, specifically hematopoietic cells, liver cells, epithelial cells, and bone marrow.
  • an in vitro method for determining the health status of a subject comprising the steps of: i. extracting cfDNA fragments from a sample from the subject; ii. determining the sequence of the cfDNA fragments by performing whole genome sequencing on the extracted cfDNA fragments; iii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from ii. with a reference genome sequence; and iv. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; wherein the probability obtained from iv. for different fragment lengths of the cfDNA fragments as obtained from iii. provides information on the health status of said subject, preferably wherein a health status deviating from a healthy status is indicated if the z-scores of the set of cumulative deviations and/or informative counts ratios deviate from the set of distributions of cumulative deviations and/or informative counts ratios recorded from healthy subjects.
  • changes of the informative counts ratios as obtained in step iv between the sample set of informative counts ratios and the standard set of informative counts ratios of healthy subjects are expressed as z-score and a subject is identified as unhealthy, if said z-score exceeds a limit that is characteristic for a specific group of unhealthy samples, preferably if the absolute value of the z-score exceeds 3.
  • changes of the informative counts ratios between the sample set of informative counts ratios and the standard set of informative counts ratios of healthy subjects are expressed as z-score and a subject is identified as unhealthy, if the z-scores exceed a limit that is characteristic for a specific group of unhealthy samples, preferably if the absolute value of the z-score exceeds 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, or even more.
  • a subset of the least varying informative counts ratios may be selected to reduce the complexity of the task or a machine learning model may be trained to learn a binary classification based on all informative counts ratios across a set of frequently occurring fragment lengths, such as fragments with a length between 120 bp and 180 bp, and between 290 bp and 320 bp.
  • changes of the cumulative deviations as obtained in step iv between the sample set of cumulative deviations and the standard set of cumulative deviations of healthy subjects are expressed as z-score and a subject is identified as unhealthy, if said z-score exceeds a limit that is characteristic for a specific group of unhealthy samples, preferably if the absolute value of the z-score exceeds 3.
  • changes of the cumulative deviations between the sample set of cumulative deviations and the standard set of cumulative deviations of healthy subjects are expressed as z-score and a subject is identified as unhealthy, if the z-score exceeds a limit that is characteristic for a specific group of unhealthy samples, preferably if the absolute value of the z-score exceeds 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, or even more.
  • a subset of the lowest cumulative deviations may be selected to reduce the complexity of the task or a machine learning model may be trained to learn a binary classification based on all cumulative deviations across a set of frequently occurring fragment lengths, such as fragments with a length between 120 bp and 180 bp, and between 290 bp and 320 bp.
  • a health status deviating from a healthy status is cancer, unhealthy aging, or pregnancy-associated complications.
  • a sample is considered healthy if the ratio of the non-random fragment counts inside a centered 41 bp window to those outside of said window, for a specific nucleosome dyad distribution such as that of 167 bp fragments, does not deviate abnormally from the same count ratio obtained from healthy subjects, e.g., with a z-score of 1.
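The count ratio for a centered 41 bp window may be computed as in the following sketch. It assumes that the counts are given per base position along fragments of one length (for example 167 bp) and that any correction for random fragment counts has already been applied to the input; how that correction is performed is not modelled here. The function name `informative_counts_ratio` is hypothetical.

```python
def informative_counts_ratio(counts, window=41):
    """Ratio of counts inside a centered window to counts outside of it.

    `counts` holds one (already background-corrected) count per base
    position of fragments of a single length.
    """
    n = len(counts)
    if window > n:
        raise ValueError("window is longer than the count profile")
    start = (n - window) // 2
    inside = sum(counts[start:start + window])
    outside = sum(counts) - inside
    if outside == 0:
        raise ValueError("no counts outside the window")
    return inside / outside


# Hypothetical profile for 167 bp fragments with enrichment at the centre.
profile = [1] * 63 + [3] * 41 + [1] * 63
print(informative_counts_ratio(profile))
```

The resulting ratio of a sample can then be compared with the distribution of the same ratio in healthy subjects, for example by means of a z-score.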
  • a health status may be diagnosed.
  • Such a health status can be an unhealthy status.
  • a certain disease, health condition, or also a predisposition may be diagnosed.
  • the term “diagnose” or “diagnosis” of a status or outcome generally refers to predicting or diagnosing the status or outcome, determining predisposition to a status or outcome, monitoring treatment of a subject, diagnosing a therapeutic response of a subject, and prognosis of status or outcome, progression, and response to particular treatment.
  • Non-limiting examples of the diagnosed, monitored, or treated diseases include neurodegenerative diseases, cancers, chemotherapy-related toxicities, irradiation induced toxicities, organ failures, organ injuries, organ infarcts, ischemia, acute vascular events, a stroke, graft-versus-host-disease (GVHD), graft rejections, sepsis, systemic inflammatory response syndrome (SIRS), cytokine releasing syndrome (CRS), multiple organ dysfunction syndrome (MODS), traumatic injuries, aging, diabetes, atherosclerosis, autoimmune disorders, eclampsia, preeclampsia, infertility, pregnancy-associated complications, coagulation disorders, asphyxia, drug intoxication, poisoning, and infections.
  • the disease is a cancer.
  • Cancer cells, like most cells, can be characterized by a rate of turnover, in which old cells die and are replaced by newer cells. Generally, dead cells in contact with vasculature in a given patient may release DNA or fragments of DNA into the bloodstream. This is also true of cancer cells during various stages of the disease. This phenomenon may be used to detect the presence or absence of cancers in individuals using the methods described herein.
  • certain cancers may present cfDNA fragment patterns and features that are unique to them.
  • the method may detect the presence of cancerous cells in the body, despite the absence of symptoms or other hallmarks of disease.
  • the method may also help to detect different subtypes of cancer based on the features of the cfDNA fragments detected in the patient sample.
  • the types and number of cancers that are detected, monitored, or treated include, but are not limited to, blood cancers, brain cancers, lung cancers, skin cancers, nose cancers, throat cancers, liver cancers, bone cancers, lymphomas, pancreatic cancers, bowel cancers, rectal cancers, thyroid cancers, bladder cancers, kidney cancers, mouth cancers, stomach cancers, solid state tumors, heterogeneous tumors, homogeneous tumors and the like.
  • the methods provided herein may be used to monitor already known cancers or other diseases in a particular patient. This allows a practitioner to adapt treatment options in accordance with the progress of the disease.
  • the methods described herein may track cfDNA or ctDNA in a particular patient over the course of the disease.
  • cancers can progress, i.e. become more aggressive and genetically unstable. In other examples, cancers remain benign, inactive, dormant or in remission.
  • the methods of this disclosure may be useful in determining disease progression, remission or recurrence and the appropriate adjustments in treatment that are required for the disease state.
  • Biological samples are collected longitudinally over time from a single patient and comparison of the cfDNA profiles in all of the different samples collected illustrates how the cancer or disease is progressing or diminishing.
  • Bayesian inference is used to compute the positions of nucleosome dyads based on coverage maxima and cfDNA fragmentation by using Bayes' theorem.
  • the theorem is shown in equation I: $P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}$ (I)
  • H is the hypothesis
  • E is the evidence.
  • $P(H)$ is the prior probability
  • $P(E \mid H)$ is the likelihood
  • $P(E)$ is called the model evidence or marginal likelihood
  • $P(H \mid E)$ is the posterior probability, which is computed according to the methods described herein.
  • the hypothesis is that the position of a nucleosome, represented by the position of its dyad, can be derived from the location of an observed cfDNA fragment, which originates from that very same nucleosome, by taking into account the length of the fragment and prior knowledge about the relationship between the dyad’s location and the fragment length.
  • the evidence E is the combined information about cfDNA fragments gained from read alignment against the reference sequence, e.g., a high-quality human reference genome, after sequencing.
  • the sequence alignment step produces the length and position information for each fragment.
  • the evidence E at a specific locus will also be called “observed fragmentation” or “fragmentation evidence”.
  • the fragment length-specific prior probability $P(H)$ gives the probability of a nucleosome, which is represented by its dyad in our model, being positioned relative to each base of the fragment. Based on the knowledge that nucleosome dyads confer by far the highest cleaving resistance to cfDNA fragments, the probability distribution of the dyad location across a fragment can be approximated by the associated cleaving resistance distribution. The maximum, or the most pronounced local maxima, of this cleaving resistance distribution give the expected location of the nucleosome dyad, or the locations of multiple nucleosome dyads from multiple DNA-associated histone complexes (i.e. di-nucleosomal fragments), relative to the fragment before all of the cfDNA fragmentation evidence at the alignment locus of that fragment has been taken into account.
  • the likelihood $P(E \mid H)$ is the probability of observing a cfDNA fragment locally under the hypothesis that nucleosomal DNA in immediate genomic vicinity was the origin of the fragment before degradation.
  • the likelihood reduces to the observed local fragmentation after taking into account that observing unprotected fragments by chance is highly unlikely. Observation of cfDNA fragments in bodily fluids of living mammals can only be justified by DNA being in a protective nucleosomal structure before fragmentation that hinders rapid clearance and recycling of cellular debris.
  • the denominator P(E) is either called marginal likelihood or model evidence.
  • the posterior probability is only proportional to the combination of observed fragmentation and prior knowledge (equation II).
  • the posterior probability P(H|E) is what is of interest in the methods described herein, i.e., the probability of the hypothesis H being true after observing E.
  • Finding local maxima (calling peaks) of this signal yields positions that show a relatively high probability of harboring a nucleosome dyad in at least one of the contributing tissues, since cleaving resistance maxima are considered to be conferred by nucleosomes, i.e., each maximum is the resulting average expected location of the nucleosomal dyad at that locus.
  • local peaks of the posterior probability in equation II refer to the base positions in the reference genome sequence where a nucleosomal dyad is most likely to be present as determined in the methods described herein.
  • the observed fragmentation refers to the cfDNA fragmentation profile obtained by aligning the DNA sequences of the cfDNA fragments with a reference genome sequence.
  • the prior knowledge refers to the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments.
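  • The relationship in equation II can be illustrated with a minimal sketch, which is not the implementation of the methods described herein: each aligned fragment adds its length-specific dyad prior to the genomic positions it covers, and local maxima of the accumulated signal are taken as candidate dyad positions. The function names `accumulate_posterior` and `call_peaks`, the toy prior values, and the simple strict-maximum peak rule are assumptions made for illustration only.

```python
def accumulate_posterior(fragments, priors, genome_length):
    """Sum each fragment's dyad prior over the positions it covers.

    fragments: list of (start, length) tuples from the alignment (0-based).
    priors: dict mapping fragment length -> list of per-base prior values.
    Returns an unnormalized signal that is proportional to the posterior
    (the constant denominator P(E) is omitted, as in equation II).
    """
    signal = [0.0] * genome_length
    for start, length in fragments:
        prior = priors.get(length)
        if prior is None:  # no prior available for this fragment length
            continue
        for offset, p in enumerate(prior):
            pos = start + offset
            if 0 <= pos < genome_length:
                signal[pos] += p
    return signal


def call_peaks(signal):
    """Return positions that are strictly higher than both neighbors."""
    return [
        i
        for i in range(1, len(signal) - 1)
        if signal[i] > signal[i - 1] and signal[i] > signal[i + 1]
    ]


# Toy data (hypothetical): 5 bp fragments whose prior peaks at the center.
toy_priors = {5: [0.1, 0.2, 0.4, 0.2, 0.1]}
toy_fragments = [(0, 5), (0, 5), (10, 5)]
toy_signal = accumulate_posterior(toy_fragments, toy_priors, 20)
toy_peaks = call_peaks(toy_signal)
```

  • In this toy case the two fragments at position 0 reinforce each other, so the first called peak is higher than the second; real data would additionally require smoothing and handling of plateaus, which the sketch leaves out.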
  • the method described herein is used to assess the health status of a person using features based on native cfDNA characteristics which describe either parameters of the fragmentome of an individual or specifically the nucleosome positioning along the genome of cfDNA contributing cells of this individual, and, thus, indirectly the chromatin state at the time of cfDNA shedding of these cells (chromatin-associated parameters).
  • the informative character of a feature is defined based on its distribution among homogeneous groups of samples sharing an identical or at least similar health/disease status and its ability to distinguish between different sample groups based on these distributions.
  • the deviation of a feature from its expected distribution may be expressed as a z-score (using the standard deviation from the normal group).
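  • As a minimal sketch of the z-score mentioned above, assuming that the mean and the sample standard deviation are both taken from the normal group, the deviation of a single feature value could be computed as follows. The function name `feature_z_score` and the example values are hypothetical.

```python
from statistics import mean, stdev


def feature_z_score(value, normal_values):
    """Deviation of a feature value from the normal group, in units of
    the normal group's sample standard deviation."""
    mu = mean(normal_values)
    sigma = stdev(normal_values)  # requires at least two normal samples
    if sigma == 0:
        raise ValueError("normal group shows no variation for this feature")
    return (value - mu) / sigma


# Hypothetical feature values from three normal samples and one test sample.
example_z = feature_z_score(4.0, [1.0, 2.0, 3.0])
```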
  • fragmentome refers to all aspects related to cfDNA fragmentation pattern analysis, such as the cfDNA length and the frequency of these at specific loci, the relationships between DNA sequence and cfDNA fragment end locations, whether the cfDNA fragment ends are jagged or blunt, relationship to nucleosomes and open chromatin regions, fragment strandedness (i.e., single stranded vs. double stranded DNA) and observation frequency, their coordinates compared to the reference genome, and directional information of the cfDNA ending locations.
  • the parameters describing an individual’s fragmentome that are extracted from a sequencing dataset are the dyad count distribution, parameters derived from dyad count distributions, and parameters derived from approximate probability density functions of nucleosome dyads (priors).
  • the minimum number of occurrences (e.g., 100000, 125000, 150000, 175000, 200000, 225000, or 250000) depends on whether the distribution of approximate location is derived from a single sample or a pool of samples and on the achieved sequencing depth of the sequencing experiment.
  • a parameter derived from dyad count distributions is the fragment lengths and length ranges for which the informative counts ratio deviates significantly from the expected healthy distribution (“fragments showing aberrant dyad distributions”). Significance may be obtained from analytical derivations (e.g. confidence interval of the proportion) or empirical methods (e.g. bootstrapping or jackknifing).
  • a parameter derived from dyad count distributions is the fragment length with the highest informative count ratio (“most informative fragment length”).
  • a parameter derived from dyad count distributions is based on a specific dyad count distribution: e.g., certain fragment lengths selected for their biological relevance (e.g., 147 bp nucleosome core, 167 bp nucleosome and linker - with different linker lengths being described across tissues).
  • a parameter derived from dyad count distributions is the approximated probability density functions of nucleosome dyad placement over fragments of identical length (bp) with a minimum number of occurrences in the sequencing data set (“dyad probability densities” or also “nucleosome dyad prior probability distribution” or abbreviations thereof).
  • a parameter derived from approximate probability density functions of nucleosome dyads is the cumulative deviation of a sample’s dyad probability density function from the one extracted from the healthy control group for each fragment length (with a minimum number of occurrences in the data set) in a centralized window of predefined size per fragment length (“cumulative deviation of positioning”).
  • the cumulative deviation of positioning is illustrated in Figure 15.
  • a parameter derived from approximate probability density functions of nucleosome dyads is the fragment lengths and length ranges with a significant “cumulative deviation of positioning”, e.g., z-score greater than 2.
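  • The “cumulative deviation of positioning” can be sketched as follows for one fragment length. The text does not state how the per-base deviation is measured, so the absolute difference used here is an assumption, as are the function name `cumulative_deviation` and the toy densities.

```python
def cumulative_deviation(sample_density, control_density, window):
    """Sum of absolute per-base differences between a sample's dyad
    probability density and the healthy control density, restricted to a
    window centered on the fragment (assumed deviation measure)."""
    if len(sample_density) != len(control_density):
        raise ValueError("densities must describe the same fragment length")
    n = len(sample_density)
    window = min(window, n)
    start = (n - window) // 2  # left edge of the centered window
    pairs = zip(sample_density[start:start + window],
                control_density[start:start + window])
    return sum(abs(s - c) for s, c in pairs)


# Hypothetical densities for a 5 bp fragment; only the middle 3 bp are used.
toy_deviation = cumulative_deviation(
    [0.1, 0.2, 0.4, 0.2, 0.1],
    [0.1, 0.3, 0.2, 0.3, 0.1],
    window=3,
)
```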
  • the dyad signal-to-noise ratio is illustrated in Figure 16.
  • the features describing parameters of the chromatin might be computed for the whole genome or sub regions thereof.
  • Sub regions might be continuous in their genomic coordinates, or derive from sets of regions.
  • “Informative features” may be used to derive a set of sub regions. These regions can be single loci or certain functional regions of the genome that appear homogeneous with respect to one or more of these features (see “Types of regions”).
  • predefined compositions of (also different types of) regions to which same or similar biological functionality/meaning can be attributed or which contain entities (genes, cis-regulatory elements) that were described to interact in molecular signaling pathways.
  • An example would be the regions that belong to a molecular pathway known to have an important role in diseases (e.g.
  • regions used in the methods described herein are the inferred nucleosome locus, gene-specific regions, regions of the genome around cis-regulatory elements, and/or other potentially relevant loci.
  • the inferred nucleosome locus is the immediate vicinity of a locus for which the presence of a nucleosome dyad was predicted or experimentally derived. This usually relates to a region spanning up to 300 bp centered on an inferred nucleosome dyad signified by a nucleosome posterior peak.
  • Existing data, e.g., ATAC-seq, MNase-seq, ChIP-seq, etc.
  • the gene-specific regions contain multiple nucleosomes, such as transcription start sites (TSSs), transcription termination sites (TTSs), intron/exon borders of the gene, the “gene body” (whole CDS region), adjacent cis-regulatory elements like the upstream promoter region.
  • a symmetric window of several kbp encompassing that region is usually used to describe local nucleosome positioning characteristics.
  • the regions of the genome around cis-regulatory elements, specifically transcription factor binding sites (TFBSs), that may not be immediately assignable to certain genes and can show long-range regulatory effects on multiple genes, like enhancers, silencers, and insulators.
  • DHSs DNase hypersensitive sites
  • these regions may be linked to adjacent genes based on genomic distance, published literature and databases, and based on detected TAD boundaries surrounding both the region and the gene.
  • the links may be used to inform a selection of regions (e.g., those associated with tumor suppressors or oncogenes).
  • chromatin-associated features may be used in the methods described herein. Chromatin-associated features are mainly based on inferred nucleosome loci and the fragment support of these inferred dyad positions. Inference of nucleosome dyad positions is based on the determination of local maxima of cleaving resistance, i.e., the nucleosome posterior signal. These local peaks are called from the nucleosome dyad posterior signal. Subsets of fragments may be used to reflect different nucleosomal associations which are linked to fragment length. To this end, only mononucleosomal fragments (e.g. 82 bp - 270 bp), di-nucleosomal fragments (e.g. >270 bp), or sub-mononucleosomal fragments (e.g. <82 bp) may be used to compute a feature.
  • Features computed from different sets of fragments may be combined in the computation of new features. Fragment length ranges may be optimized based on fragment occurrence in the sequencing data set and therefore might be smaller or larger than indicated.
  • nucleosome dyad positions are used as chromatin-associated features and nucleosome dyad positions are inferred from the genomic origin-resolved cfDNA fragment pool and prior information about the nucleosome dyad location relative to fragments of different length.
  • the nucleosome dyad’s fragment support is used as a chromatin-associated feature.
  • An inferred nucleosome dyad’s fragment support is calculated from the locally observed fragmentation or the subset of fragments which was used during inference, and the proximity of the maximum of each individual fragment’s nucleosome prior distribution. The maximum of each such prior distribution must be in immediate vicinity of the inferred nucleosome dyad (e.g. within 35 bp, 30 bp, 25 bp, 20 bp, 15 bp, 10 bp, 7 bp, 5 bp, or 3 bp up- and downstream) to be regarded as supporting it. Lower distance values result in a more stringent estimation of the dyad fragment support.
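  • The fragment support of an inferred dyad can be sketched as a simple count, assuming that the offset of the prior maximum within a fragment is known for each fragment length. The function name `dyad_fragment_support`, the lookup table `prior_max_offset`, and all coordinates below are hypothetical.

```python
def dyad_fragment_support(dyad_pos, fragments, prior_max_offset, max_dist=10):
    """Count fragments whose prior maximum lies within max_dist bp
    (up- or downstream) of the inferred dyad position.

    fragments: list of (start, length) tuples (0-based genomic start).
    prior_max_offset: dict mapping fragment length -> offset of the
        prior maximum from the fragment start.
    """
    support = 0
    for start, length in fragments:
        offset = prior_max_offset.get(length)
        if offset is None:  # fragment length without a usable prior
            continue
        if abs(start + offset - dyad_pos) <= max_dist:
            support += 1
    return support


# Hypothetical offsets and fragments; prior maxima fall at 183, 193 and 383.
toy_offsets = {167: 83, 147: 73}
toy_frags = [(100, 167), (120, 147), (300, 167)]
loose_support = dyad_fragment_support(185, toy_frags, toy_offsets, max_dist=10)
strict_support = dyad_fragment_support(185, toy_frags, toy_offsets, max_dist=5)
```

  • As stated above, the lower distance value gives the more stringent estimate: the same dyad is supported by two fragments at 10 bp but by only one at 5 bp. The count could be replaced by a sum of per-fragment GC-bias correction weights.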
  • features describing inferred nucleosome dyads and peak groups may be used in the methods described herein. These features refer to the occurrence of one main peak and no or multiple side peaks.
  • An alternative name for the peak group category is “peak core”. These features are peak prominence, core-peak prominence, peak dilation, raw fragment support, main fragment support, detailed fragment support, main support ratio, naive phase, main phase, detailed phase, peak dispersion, peak dispersion distance, peak confidence, main peak confidence and detailed peak confidence, side peaks, chained, primary chain, secondary chain, pathologic chain, relative genomic distance, upstream/downstream end of primary chain, and upstream/downstream end of secondary chain.
  • peak prominence refers to the height of a main peak (called from the dyad posterior signal) compared to surrounding main peaks in the vicinity or how it scores against a distribution of peaks at that locus from a group of normal samples (e.g. rank in list of peaks from normal sample plus current peak).
  • core-peak prominence refers to a feature like peak prominence but relative to the side peaks of the group of peaks which was assigned to the main peak during peak grouping.
  • peak dilation refers to the distance in bp which describes the stretch of bases that the flat posterior signal is covering if it was limited to a certain maximum height around the peak (based on peak height); only for main peaks.
  • the dilation feature of a posterior peak is illustrated in Figure 17.
  • raw fragment support refers to the number of fragments that support a specific peak call based on maxima of nucleosome priors for fragments in close vicinity of the peak call (e.g. within 75 bp or depending on dilation of peak). This metric disregards any other peak calls (main peaks and side peaks) at the locus. The number of fragments can be replaced by the sum over their GC-bias correction weights.
  • main fragment support is the number of fragments supporting a main peak call. Fragments can be assigned only to one main peak. This metric disregards side peaks. The number of fragments can be replaced by the sum over their GC-bias correction weights.
  • detailed fragment support refers to the number of fragments supporting a peak call. Fragments can be assigned only to one peak inside a peak group. This metric resolves fragment support inside peak groups. The number of fragments can be replaced by the sum over their GC-bias correction weights. Can be computed for side peaks.
  • main support ratio refers to the detailed fragment support of the main peak of a peak group over the sum of detailed fragment support of all side peaks of that group.
  • the number of fragments can be replaced by the sum over their GC-bias correction weights.
  • naive phase refers to the “phasedness” of all signals belonging to the same peak group; similarity of nucleosome placement across tissues with similar dyad placement. Computationally, this is the average of the metric described below over all N fragments with midpoint inside a window of certain size around the peak (e.g. a 149 bp symmetric window, or a window based on the peak dilation metric, which includes more fragments for broader peaks).
  • the maximum possible distance of a prior’s maximum is half the chosen window size w.
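  • The per-fragment metric itself is not reproduced in this text. Purely as an assumption, the sketch below uses a linear proximity score that is 1 when a fragment’s prior maximum coincides with the peak and 0 at the maximum possible distance of w/2, and averages it over the fragments in the window. For simplicity, fragments are selected by the position of their prior maximum rather than by their midpoint. The function name `naive_phase` is hypothetical.

```python
def naive_phase(peak_pos, prior_max_positions, window=149):
    """Average assumed proximity score of prior maxima around a peak.

    prior_max_positions: genomic positions of the prior maxima of the
        fragments near the peak.
    The score 1 - d / (w / 2) is an illustrative assumption, not the
    metric of the methods described herein.
    """
    half = window / 2
    scores = [
        1 - abs(pos - peak_pos) / half
        for pos in prior_max_positions
        if abs(pos - peak_pos) <= half
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical prior maxima; the one at 300 lies outside the 100 bp window.
toy_phase = naive_phase(100, [100, 125, 75, 300], window=100)
```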
  • main phase refers to a feature like naive phase, but the fragments are filtered for those for which the current main peak is the closest main peak. Only for main peaks.
  • detailed phase refers to a feature like main phase, but the fragments are additionally divided among the closest side peaks in the peak group. It can be computed for side peaks.
  • peak dispersion distance refers to the distance in bases between the most upstream peak and the most downstream peak of the peak group.
  • peak confidence refers to the approximate average contribution of a fragment’s prior signal to the nucleosome dyad peak (posterior signal).
  • the proximity factor is multiplied (i.e., weighted) with the non-random signal strength of the prior (which is represented by the absolute height of the maximum of the prior, p_max; this could also be replaced by the non-random fraction of the prior) before forming the sum.
  • the result is divided by the maximal possible sum of signal contributions. This feature combines information about signal strength of a fragment’s prior distribution and the relative location of the fragment to a specific dyad call.
  • the confidence feature of a posterior peak is illustrated in Figure 19.
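  • A hedged sketch of the peak confidence computation follows. The weighting of a proximity factor by the prior maximum and the division by the maximal possible sum follow the description above; the linear form of the proximity factor, the choice of the sum of all prior maxima as the maximal possible sum, and the function name `peak_confidence` are assumptions.

```python
def peak_confidence(peak_pos, fragment_priors, window=149):
    """Approximate average contribution of fragment priors to a dyad peak.

    fragment_priors: list of (prior_max_position, p_max) tuples, where
        p_max is the height of the maximum of the fragment's prior.
    """
    half = window / 2
    weighted = 0.0  # sum of proximity * p_max
    maximal = 0.0   # sum reached if every prior maximum sat on the peak
    for pos, p_max in fragment_priors:
        dist = abs(pos - peak_pos)
        if dist > half:
            continue
        proximity = 1 - dist / half  # assumed linear proximity factor
        weighted += proximity * p_max
        maximal += p_max
    return weighted / maximal if maximal > 0 else 0.0


# Hypothetical fragments: one prior maximum on the peak, one 25 bp away.
toy_confidence = peak_confidence(100, [(100, 0.4), (125, 0.2)], window=100)
```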
  • main peak confidence and detailed peak confidence are computed from different sets of fragments, similar to those used for main phase and detailed phase computations.
  • side peaks refer to the number of sub-peaks that are called around a main peak.
  • the chained feature describes whether or not a peak was chained together with other peaks to form a chain of inferred nucleosome dyads (1 if in a chain, 0 otherwise).
  • pathologic chain (such as a “CRC chain”) indicates that the peak is part of a chain (primary or secondary) that was associated with a certain pathology.
  • a certain type e.g. EVX2 TFBS
  • a specific locus e.g. TSS of TP53 gene
  • the upstream/downstream end of primary chain refers to the first/last inferred nucleosome dyad of a primary chain; relative to the + strand.
  • the upstream/downstream end of secondary chain refers to the first/last inferred nucleosome dyad of a secondary chain; relative to the + strand.
  • features describing multiple inferred nucleosome dyads and/or chained nucleosomes are nucleosome repeat length, chain length, chain regularity, nucleosome density, and positioning diversity (might be expressed as a ratio).
  • the nucleosome repeat length (NRL; or inter-nucleosome distance) describes an average or median distance between a main peak and a specific number of surrounding peaks or the average over all main peaks found in a certain region or set of regions or in a primary or secondary chain.
  • the chain length is the length of a nucleosome chain as number of chained dyads and/or as absolute distance in bp.
  • the chain regularity is a metric describing how regular the inter-nucleosome distance of inferred nucleosomes in a chain or a region is.
  • the nucleosome density is the number of inferred nucleosome dyads per kilobase across a defined region or set of regions, primary chain or secondary chain; might exclude inferred dyads from secondary chains wherever applicable.
  • the positioning diversity is the average number of inferred dyads including peaks from secondary chains divided by the nucleosome density of the region or set of regions.
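  • Several of these multi-dyad features can be sketched from a sorted list of inferred dyad positions. Expressing chain regularity as the standard deviation of the inter-dyad distances is an assumption, since no formula is given here; the function name `chain_features` and the example positions are hypothetical.

```python
from statistics import mean, pstdev


def chain_features(dyad_positions, region_length_bp):
    """Summarize inferred dyad positions of one chain or region."""
    positions = sorted(dyad_positions)
    distances = [b - a for a, b in zip(positions, positions[1:])]
    return {
        # chain length as number of dyads and as absolute distance in bp
        "chain_length_dyads": len(positions),
        "chain_length_bp": positions[-1] - positions[0] if positions else 0,
        # nucleosome repeat length as mean inter-dyad distance
        "nrl": mean(distances) if distances else None,
        # assumed regularity measure: 0 means perfectly regular spacing
        "regularity_sd": pstdev(distances) if distances else None,
        # inferred dyads per kilobase of the region
        "density_per_kb": 1000 * len(positions) / region_length_bp,
    }


# Hypothetical, perfectly regular chain with 190 bp spacing in a 1 kb region.
toy_chain = chain_features([100, 290, 480, 670], region_length_bp=1000)
```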
  • a feature using nucleosome chains is “secondary chains”, describing the number of secondary chains in a region.
  • a feature using nucleosome chains is the relative genomic distance of a peak or chain to the next instance of a certain type (e.g. EVX2 TFBSs) or to a specific instance (e.g. TSS of TP53 gene).
  • compound features can be created by mathematically combining features in a meaningful way.
  • Features that are not derived from posterior signal such as depth of fragment coverage across specific regions or at specific loci can be combined with features that are inferred from nucleosome positions and/or fragment support and/or other features that are derived from the underlying nucleosome peaks. If features are combined over regions, typical descriptive statistics like mean, median, standard deviation and other metrics and derivatives of these are used.
  • a compound feature is the chromatin condensation if concordant or higher than a chosen threshold, e.g. 0.5, or if a nucleosome chain with certain regularity and NRL characteristics is present: 30 nm chromatin fibril (dense), 10 nm “beads on a string” conformation (loose) based on depth of fragment coverage, support metrics for peak(s) of a chain, inter-peak distance and possibly further features to refine prediction of the state.
  • a compound feature is the DNA accessibility of a locus or average across loci: this feature is derived from nucleosome positions, derivatives, and depth of coverage. If at least one chain spans the locus, instead of coverage, the average fragment support across chain peaks, the fragment support for neighboring peaks and/or overlapping peaks (e.g. within 47 bp) and the internucleosome distance and chain regularity can be used to compute accessibility or to predict the accessibility of the locus for each chain.
  • classifiers and predictors may be used in the methods described herein.
  • Models for classification of the health status of a patient and prediction of certain health parameters (e.g. response to therapy, development of tumor resistance to treatment, recurrence free survival, time to recurrence, tumor metastasized or not, time to sepsis, etc.) are trained on features and feature sets using machine learning methods.
  • PCA principal component analysis
  • NMF non-negative matrix factorization
  • random forests or gradient boosting machines also to limit the number of allowed decisions
  • auto-encoders to reduce feature space to important hyperparameters (also de-noising) or similar methods and/or any combination of these.
  • Suppression of batch effects on the feature selection procedure is achieved by applying standard controlling procedures involving computing of correlation metrics, regression analysis, (hierarchical) clustering of features based on similarity and testing resulting clusters against known possible confounding variables like sequencing batch, sequencing technology, depth of coverage, age of sample, sample sex (wherever applicable) and any other such variable.
  • tissue deconvolution is used or performed in a method described herein.
  • Tissue deconvolution refers to the inference of the cfDNA contribution by individual tissues and/or cell types to the cfDNA pool; it uses a reference catalog of tissue-/cell type-specific feature signatures.
  • the catalog is created from existing sequencing data sets. Signatures may consist of single features or combinations of features and sets of these as described above. Features and sets may be restricted to certain regions or sets of regions of the genome, especially in the case of chromatin-associated features.
  • NMF can also be used to compute a “best fit” linear combination of signatures from a sequencing dataset. Signatures may not scale linearly with the abundance of their corresponding cfDNA releasing cells. Therefore, methods other than NMF might be used to achieve a more accurate deconvolution. Tissue deconvolution yields estimated contributions of the tissues/cell types that are described by the reference catalog as values between 0 and 1. Minor contributions might be ignored and only a ranked list of top contributing tissues used for further model training and/or use in regression or classification tasks.
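  • A “best fit” non-negative linear combination of fixed signatures can be sketched with the multiplicative update rule known from NMF, here applied with the signature matrix held constant so that only the mixture weights are fitted. This is one possible technique under the linearity assumption questioned above, not necessarily the one used in the methods described herein. The function name `deconvolve`, the tissue labels, and the signature values are hypothetical.

```python
def deconvolve(signatures, observed, iterations=500):
    """Estimate non-negative tissue contributions that sum to 1.

    signatures: dict mapping tissue name -> list of non-negative
        feature values (the reference catalog).
    observed: list of the same features measured in the cfDNA sample.
    """
    tissues = list(signatures)
    n = len(observed)
    weights = [1.0 / len(tissues)] * len(tissues)
    eps = 1e-12  # guards against division by zero
    for _ in range(iterations):
        # current reconstruction of the observed feature vector
        fitted = [
            sum(w * signatures[t][i] for w, t in zip(weights, tissues))
            for i in range(n)
        ]
        # multiplicative update keeps every weight non-negative
        for j, t in enumerate(tissues):
            num = sum(signatures[t][i] * observed[i] for i in range(n))
            den = sum(signatures[t][i] * fitted[i] for i in range(n)) + eps
            weights[j] *= num / den
    total = sum(weights)
    return {t: (w / total if total > 0 else 0.0)
            for t, w in zip(tissues, weights)}


# Hypothetical catalog and a synthetic 70/30 mixture of its two signatures.
toy_catalog = {"tissue_a": [1.0, 0.0, 0.5, 0.2], "tissue_b": [0.0, 1.0, 0.5, 0.8]}
toy_mixture = [0.7 * a + 0.3 * b
               for a, b in zip(toy_catalog["tissue_a"], toy_catalog["tissue_b"])]
toy_contributions = deconvolve(toy_catalog, toy_mixture)
```

  • The returned values lie between 0 and 1 and can be sorted to obtain the ranked list of top contributing tissues mentioned above.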
  • the methods described herein comprise determining an index of fragment length and dyad position. Specifically, for each cfDNA fragment length, it is determined how often the dyad is in the center of the cfDNA fragments.
  • determining an index of fragment length and dyad position may be used for determining the health status of a subject.
  • the dyad is in the center of a cfDNA fragment.
  • an asymmetric distribution may be an indication of disease.
  • This disease can be cancer, for example.
  • Non-limiting examples of causes for such a change of fragment length and dyad position are mutations in histone genes, altered composition of nucleosomes, and mutations in the cfDNA, e.g., due to the degradation machinery.
  • a nucleosome prior fragmentation index is determined. Specifically, the index is based on fragment length and dyad position.
  • for each cfDNA fragment length, it is determined how often the dyad is in the center or off the center. Specifically, the healthier a person, the more often the dyad is in the center of cfDNA fragments.
  • p% have a length of 167 bp; in q% of those, the dyad is in the center.
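  • The index of fragment length and dyad position can be sketched as two fractions per fragment length: the share of all fragments with that length (p) and the share of those with a centered dyad (q). How the dyad offset of a single fragment is obtained, the tolerance around the center, and the function name `dyad_center_index` are assumptions.

```python
from collections import defaultdict


def dyad_center_index(fragments, tolerance=0):
    """Return {length: (p, q)} with p = fraction of all fragments having
    this length and q = fraction of those with the dyad at the center.

    fragments: iterable of (length, dyad_offset) tuples, where dyad_offset
        is the 0-based dyad position within the fragment.
    """
    counts = defaultdict(int)
    centered = defaultdict(int)
    total = 0
    for length, dyad_offset in fragments:
        total += 1
        counts[length] += 1
        center = (length - 1) / 2  # central base of the fragment
        if abs(dyad_offset - center) <= tolerance:
            centered[length] += 1
    return {
        length: (counts[length] / total, centered[length] / counts[length])
        for length in counts
    }


# Hypothetical fragments: three of 167 bp (two centered), one of 147 bp.
toy_index = dyad_center_index([(167, 83), (167, 83), (167, 60), (147, 73)])
```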
  • the methods described herein comprise characterization of genomic regions where the dyad is preferentially in the center of the cfDNA and genomic regions where the dyad has more variable positions. Specifically, regions with preferentially dyad-centered cfDNAs are more important for regulation than regions where the dyad is more frequently off the center.
  • nucleosome phasing means “strict phasing”, e.g., +1 nucleosomes vs. “fuzziness”.
  • information on a single gene level can be deduced. Specifically, absence of a nucleosome at the TSS is compatible with being expressed. Specifically, the presence of a nucleosome at the TSS is incompatible with being expressed. According to a specific embodiment, further nucleosome positions can be included. Specifically, in an active gene, the peak downstream of the TSS reflecting the +1 nucleosome is high and located at approx. position +50 bp, the peak upstream of the TSS (-1 nucleosome) is increased, and its maximum is located between -175 bp and -225 bp. Specifically, the distances of the downstream nucleosomes to each other may further be included.
  • stable genes are referred to as genes which always show the same nucleosome pattern.
  • unstable genes are referred to as genes having various nucleosome patterns.
  • the methods described herein may comprise determining for specific gene sets the number of genes which are without a nucleosome and/or the variability of the genes.
  • the nucleosome position for all HKs may be determined. Specifically, the number of genes without nucleosome at TSS may be determined. Specifically, the variability for HK genes may be determined.
  • the nucleosome position for PAU genes may be determined. Specifically, the number of genes without nucleosome at TSS may be determined. Specifically, the variability for PAU genes may be determined.
  • the methods described herein may comprise the determination of the aging status of a subject.
  • for each genomic region, it can be determined whether the genomic region has a regulatory function. For example, it can be determined whether it is a TSS or TFBS or a region with another regulatory function. This step may be further facilitated by including reference genome annotations. Then, the cfDNA fragmentation pattern may be determined for each regulatory region/TF binding site, e.g., the number and the lengths of cfDNA fragments covering the respective regulatory regions and their relative positioning in the region or to the binding site.
  • cfDNA fragmentation patterns within their regulatory context allow estimations about the age of a subject, as for example, gene transcription changes in an age-dependent way.
  • the present invention provides a computer-implemented method.
  • all of the features described herein and above relating to an in vitro method also apply to the computer-implemented methods described herein.
  • a computer-implemented method for analyzing cell-free DNA (cfDNA) fragments from a sample comprising the steps of: i. receiving data representing the DNA sequences of cfDNA fragments acquired by sequencing of cfDNA fragments extracted from a sample; ii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from i. with a reference genome sequence; iii. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; iv.
  • the computer-implemented method described herein for analyzing cell-free DNA (cfDNA) fragments from a sample further comprises the step vi. of chaining the mapped peaks obtained from step v. across the reference genome sequence into series of mapped peaks following rules for naturally occurring nucleosomal spacing.
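  • The chaining step vi. can be sketched as a greedy pass over the sorted peak positions, where a peak extends the current chain only if its distance to the previous peak is within an allowed range. The rules for naturally occurring nucleosomal spacing are not spelled out in this passage, so the bounds `min_spacing` and `max_spacing`, their default values, and the function name `chain_peaks` are assumptions.

```python
def chain_peaks(peak_positions, min_spacing=147, max_spacing=250):
    """Group mapped peaks into chains of consecutive peaks whose spacing
    lies within [min_spacing, max_spacing] bp (assumed spacing rule).
    Peaks that cannot be linked to a neighbor are not reported."""
    chains = []
    current = []
    for pos in sorted(peak_positions):
        if current and min_spacing <= pos - current[-1] <= max_spacing:
            current.append(pos)  # spacing is plausible: extend the chain
        else:
            if len(current) > 1:
                chains.append(current)
            current = [pos]  # start a new candidate chain
    if len(current) > 1:
        chains.append(current)
    return chains


# Hypothetical peak calls: two chains and one isolated peak at 3000.
toy_chains = chain_peaks([100, 290, 480, 1500, 1690, 3000])
```

  • In this simplified form, a peak that is closer than `min_spacing` to its predecessor starts a new candidate chain; distinguishing primary from secondary chains would need additional rules.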
  • a computer-implemented method for determining the health status of a subject comprising the steps of: i. receiving data representing the DNA sequences of cfDNA fragments acquired by sequencing of cfDNA fragments extracted from a sample; ii. determining the cfDNA fragmentation profile by aligning the DNA sequences of the cfDNA fragments from i. with a reference genome sequence; iii. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; iv. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iii.
  • v. mapping peaks of the average probability of the presence of a nucleosomal dyad of iv. across the reference genome sequence; and vi. comparing the mapped peaks obtained in v. with a library comprising standard maps and outlier maps of nucleosomal dyads; wherein congruence with the library of standard maps derived from healthy subjects and difference with the library of standard maps derived from unhealthy subjects is characteristic for a healthy status, wherein congruence with the library of standard maps derived from unhealthy subjects and difference with the library of standard maps derived from healthy subjects is characteristic for an unhealthy status, and/or wherein congruence with the library of outlier maps of nucleosomal dyads derived from unhealthy subjects and difference with the standard maps of libraries from healthy and unhealthy subjects indicates an association with an unhealthy status.
  • a computer-implemented method for determining the health status of a subject comprising the steps of: i. receiving data representing the DNA sequences of cfDNA fragments acquired by sequencing of cfDNA fragments extracted from a sample; ii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from i. with a reference genome sequence; iii. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; iv.
  • comparing the mapped peaks obtained in v. comprises determining the deviation of nucleosomal dyad positioning on cfDNA fragments, determining changes in the informative counts ratio, determining presence or absence of specific nucleosome positioning that is characteristic for one of the healthy or unhealthy subject groups, and/or determining chains of nucleosomal dyad positioning, wherein a. congruence with the library of standard maps derived from healthy subjects and difference with the library of standard maps derived from unhealthy subjects is characteristic for a healthy status;
  • b. congruence with the library of standard maps derived from unhealthy subjects and difference with the library of standard maps derived from healthy subjects is characteristic for an unhealthy status;
  • c. congruence with the library of outlier maps of nucleosomal dyads derived from unhealthy subjects and difference with the standard maps of libraries from healthy and unhealthy subjects is characteristic for an unhealthy status;
  • d. difference with a standard map of nucleosomal dyad chains obtained from healthy subjects is characteristic for an unhealthy status.
  • the library of standard maps derived from unhealthy subjects and/or said library of outlier maps of nucleosomal dyads derived from unhealthy subjects is from subjects suffering from cancer, inflammation, coronary disease, acute tissue damage, chronic disease, complications during pregnancy, beginning sepsis, sepsis, and/or unhealthy aging.
  • a computer-implemented method for determining the health status of a subject comprising the steps of: i. receiving data representing the DNA sequences of cfDNA fragments acquired by sequencing of cfDNA fragments extracted from a sample; ii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from i. with a reference genome sequence; iii. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; iv. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iii.
  • a computer-implemented method for monitoring the treatment success of a patient comprising the steps of: i. receiving data representing the DNA sequences of cfDNA fragments acquired by sequencing of cfDNA fragments extracted from a sample; ii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from i. with a reference genome sequence; iii. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; iv. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iii.
  • a computer-implemented method for monitoring the treatment success of a patient comprising the steps of: i. receiving data representing the DNA sequences of cfDNA fragments acquired by sequencing of cfDNA fragments extracted from a sample; ii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from i. with a reference genome sequence; iii. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; iv. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iii. and the fragmentation profile of ii.; v. mapping peaks of the average probability of the presence of a nucleosomal dyad of iv. across the reference genome sequence; vi. chaining the mapped peaks obtained from step v.
  • comparing the mapped peaks comprises determining the deviation of nucleosomal dyad positioning on cfDNA fragments, determining changes in the informative counts ratio, and/or determining presence or absence of specific nucleosome positioning that is characteristic for one of the healthy or unhealthy subject groups, wherein differences and/or congruences obtained in vii. provide information on the treatment success of the patient.
  • a computer-implemented method for determining the cell type and/or tissue contribution of cfDNA in a sample comprising the steps of: i. receiving data representing the DNA sequences of cfDNA fragments acquired by sequencing of cfDNA fragments extracted from a sample; ii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from i. with a reference genome sequence; iii. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; iv.
  • a library comprising mapped nucleosomal dyads of specific tissues or cell types and/or determining the cell type and/or tissue contribution of the cfDNA fragments by comparing the chained peaks obtained in vi. with a standard map of nucleosomal dyad chains of specific tissues or cell types.
  • a computer-implemented method for determining the health status of a subject comprising the steps of: i. receiving data representing the DNA sequences of cfDNA fragments acquired by sequencing of cfDNA fragments extracted from a sample; ii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from i. with a reference genome sequence; and iii. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; wherein the probability obtained from iii. for different fragment lengths of the cfDNA fragments as obtained from i.
  • a health status deviating from a healthy status is indicated if the z-score of the informative counts ratios as obtained in step iii. between the sample set of informative counts ratios and the standard set of informative counts ratios of healthy subjects are expressed as z-score and a subject is identified as unhealthy if said z-scores deviate from the distributions of ratios and/or cumulative deviations recorded from healthy subjects, preferably wherein a health status deviating from a healthy status is cancer or pregnancy-associated complications.
  • a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the computer-implemented method described herein.
  • a computer-readable medium having stored thereon the computer program described herein for performing the computer-implemented method.
  • the computer-implemented method described herein comprises the step of receiving data representing the DNA sequences of cfDNA fragments acquired by sequencing of cfDNA fragments extracted from a sample of a subject.
  • said data may be generated using a sequencing-device connected to the computer or apparatus used for performing the computer-implemented method described.
  • a data processing apparatus comprising means for carrying out the computer-implemented methods described herein is provided by the invention.
  • said data processing apparatus may be connected to an apparatus or device capable of sequencing cfDNA fragments.
  • said data processing apparatus may be connected to an apparatus or device capable of extracting cfDNA from a sample. Specifically, said data processing apparatus is further connected to an apparatus or device capable of sequencing cfDNA fragments.
  • a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out a computer-implemented method described herein.
  • said computer program may be combined with a computer program comprising instructions to cause the device capable of extracting cfDNA from a sample to execute its function of extracting cfDNA from a sample.
  • said computer program may be further combined with a computer program comprising instructions to cause the device capable of sequencing cfDNA fragments to execute its function of sequencing cfDNA fragments.
  • said computer program may be combined with a computer program comprising instructions to cause the device capable of sequencing cfDNA fragments to execute its function of sequencing cfDNA fragments.
  • an apparatus for performing a method described herein.
  • Such apparatus may be characterized by the following features: (a) a sequencer configured to (i) receive DNA extracted from a sample of the bodily fluid comprising DNA, and (ii) sequence the extracted DNA under conditions that produce DNA fragment sequences; and (b) a computational apparatus configured to (e.g., programmed to) instruct one or more processors to perform various operations such as those described with two or more of the method operations described herein.
  • the computational apparatus is configured to perform one or more of the steps of the computer-implemented method described herein.
  • the apparatus also includes a tool for extracting DNA from the sample under suitable conditions.
  • the apparatus includes a module configured to extract cfDNA obtained from plasma for sequencing in the sequencer.
  • the apparatus includes a database of reference genome sequences and/or a library comprising standard maps and outlier maps of nucleosomal dyads.
  • the computational apparatus may be further configured to instruct the one or more processors to map the cfDNA fragments obtained from the blood of the individual to the database of reference genome sequences.
  • the computational apparatus may be further configured to instruct the one or more processors to map the nucleosomal dyads obtained from the analysis of cfDNA in a sample as described herein to the database of reference genome(s).
  • Said mapped nucleosomal dyads or peaks of the average probability of the presence of a nucleosomal dyad may be compared by the apparatus with the library comprising standard maps and outlier maps of nucleosomal dyads.
  • the computational apparatus may perform all steps of the method described herein that can be performed by such an apparatus.
  • Embodiments of the invention also relate to an apparatus for performing these operations.
  • This apparatus may be specially constructed for the required purposes, or it may be a general-purpose computer (or a group of computers) selectively activated or reconfigured by a computer program and/or data structure stored in the computer.
  • a group of processors performs some or all of the recited analytical operations collaboratively (e.g., via a network or cloud computing) and/or in parallel.
  • a processor or group of processors for performing the methods described herein may be of various types including microcontrollers and microprocessors such as programmable devices (e.g., CPLDs and FPGAs) and other devices such as gate array ASICs, digital signal processors, and/or general purpose microprocessors.
  • certain embodiments relate to tangible and/or non-transitory computer readable media or computer program products that include program instructions and/or data (including data structures) for performing various computer-implemented operations.
  • Examples of computer-readable media include, but are not limited to, semiconductor memory devices, magnetic media such as disk drives, magnetic tape, optical media such as CDs, magneto-optical media, and hardware devices that are specially configured to store and perform program instructions, such as read-only memory devices (ROM) and random access memory (RAM).
  • the computer readable media may be directly controlled by an end user or the media may be indirectly controlled by the end user. Examples of directly controlled media include the media located at a user facility and/or media that are not shared with other entities.
  • Examples of indirectly controlled media include media that is indirectly accessible to the user via an external network and/or via a service providing shared resources such as the "cloud."
  • Examples of program instructions include both machine code, such as produced by a compiler, and files containing higher level code that may be executed by the computer using an interpreter.
  • a computer-implemented method is described herein, wherein said computer-implemented method is used in a method described herein, specifically in an in vitro method described herein.
  • a computer-implemented method comprising the steps of: i. receiving data representing the DNA sequences of cfDNA fragments acquired by sequencing of cfDNA fragments extracted from a sample; and ii. performing at least one of the steps of a method described herein, specifically of an in vitro method described herein, more specifically performing at least one of the steps iii. to viii. of an in vitro method described herein.
  • An in vitro method for analyzing cell-free DNA (cfDNA) fragments from a sample comprising the steps of: i. extracting cfDNA fragments from the sample; ii. performing whole genome sequencing on the extracted cfDNA fragments; iii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from ii. with a reference genome sequence; iv. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; v. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iv.
  • An in vitro method for determining the health status of a subject comprising the steps of: i. extracting cfDNA fragments from a sample from the subject; ii. performing whole genome sequencing on the extracted cfDNA fragments; iii. determining the cfDNA fragmentation profile by aligning the DNA sequences of the cfDNA fragments from ii. with a reference genome sequence; iv. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; v. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iv.
  • comparing the mapped peaks obtained in vi. comprises determining the deviation of nucleosomal dyad positioning on cfDNA fragments, determining changes in the informative counts ratio, determining presence or absence of specific nucleosome positioning that is characteristic for one of the healthy or unhealthy subject groups, and/or determining chains of nucleosomal dyad positioning; wherein a. congruence with the library of standard maps derived from healthy subjects and difference with the library of standard maps derived from unhealthy subjects is characteristic for a healthy status; b. congruence with the library of standard maps derived from unhealthy subjects and difference with the library of standard maps derived from healthy subjects is characteristic for an unhealthy status; c.
  • nucleosomal dyads characteristic for a healthy subject is less than the deviation of nucleosomal dyad positioning between the mapped nucleosome dyads obtained in vi. and standard maps and outlier maps of nucleosomal dyads characteristic for an unhealthy subject; and/or wherein a subject is considered unhealthy if the z-score of the changes of the informative counts ratios between the sample set of informative counts ratios and the standard set of informative counts ratios of healthy subjects exceeds a limit that is characteristic for a specific group of unhealthy samples, preferably if the absolute value of the z-score exceeds 3; and/or wherein a subject is considered unhealthy if the z-score of the changes of the cumulative deviations between the sample set of cumulative deviations and the standard set of cumulative deviations exceeds a limit that is characteristic for a specific group of unhealthy samples, preferably if the absolute value of the z-score exceeds 3.
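The z-score criterion above can be illustrated with a minimal sketch. It assumes that the reference set of healthy subjects is summarized by its mean and sample standard deviation; the text does not state how the z-score is computed, so the function names, the use of the sample standard deviation, and the example values are illustrative assumptions. Only the limit of 3 on the absolute z-score is taken from the text.

```python
from statistics import mean, stdev


def z_score(sample_value, healthy_values):
    """z-score of one sample value against a healthy reference set.

    Assumption: the healthy reference is summarized by its mean and its
    sample standard deviation.
    """
    mu = mean(healthy_values)
    sigma = stdev(healthy_values)
    if sigma == 0:
        raise ValueError("healthy reference values have zero spread")
    return (sample_value - mu) / sigma


def deviates_from_healthy(sample_value, healthy_values, limit=3.0):
    """True if the absolute z-score exceeds the limit (3 in the text)."""
    return abs(z_score(sample_value, healthy_values)) > limit


# Made-up informative counts ratios, for illustration only.
healthy_ratios = [0.48, 0.50, 0.52, 0.49, 0.51]
flag_far = deviates_from_healthy(0.60, healthy_ratios)
flag_near = deviates_from_healthy(0.51, healthy_ratios)
```

The same comparison could be applied to cumulative deviations instead of informative counts ratios by passing a different reference set.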
  • An in vitro method for monitoring the treatment success of a patient comprising the steps of: i. extracting cfDNA fragments from a sample of said patient; ii. performing whole genome sequencing on the extracted cfDNA fragments; iii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from ii. with a reference genome sequence; iv. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; v. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iv. and the fragmentation profile of iii.; vi. mapping peaks of the average probability of the presence of a nucleosomal dyad of v. across the reference genome sequence; vii. chaining the mapped peaks obtained from step vi. across the reference genome sequence into series of mapped peaks following rules for naturally occurring nucleosomal spacing; and viii. comparing the mapped peaks obtained in vi. with the mapped peaks of a previous result from said patient, comparing the mapped peaks obtained in vi. with a standard map of nucleosomal dyads characteristic for the treatment success, comparing the chained peaks obtained in vii. with the chained peaks of a previous result from said patient, and/or comparing the chained peaks obtained in vii.
  • comparing the mapped peaks comprises determining the deviation of nucleosomal dyad positioning on cfDNA fragments, determining changes in the informative counts ratio, and/or determining presence or absence of specific nucleosome positioning that is characteristic for one of the healthy or unhealthy subject groups, wherein differences and/or congruences obtained in viii. provide information on the treatment success of the patient.
  • An in vitro method for determining the cell type and/or tissue contribution of cfDNA in a sample comprising the steps of: i. extracting cfDNA fragments from the sample; ii. performing whole genome sequencing on the extracted cfDNA fragments; iii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from ii. with a reference genome sequence; iv. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; v. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iv.
  • tissue and/or cell types are selected from the group consisting of cancer cells, specifically lung cancer, colorectal cancer, breast cancer, prostate cancer, and bladder cancer; and normal cells, specifically hematopoietic cells, liver cells, epithelial cells, and bone marrow.
  • a computer-implemented method for determining the health status of a subject comprising the steps of: i. receiving data representing the DNA sequences of cfDNA fragments acquired by sequencing of cfDNA fragments extracted from a sample; ii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from i. with a reference genome sequence; iii. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; iv. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iii. and the fragmentation profile of ii.; v. mapping peaks of the average probability of the presence of a nucleosomal dyad of iv. across the reference genome sequence; vi. chaining the mapped peaks obtained from step v. across the reference genome sequence into series of mapped peaks following rules for naturally occurring nucleosomal spacing, and vii. comparing the mapped peaks obtained in v. with a library comprising standard maps, comparing the mapped peaks obtained in v. with a library comprising outlier maps of nucleosomal dyads, and/or comparing the chained peaks obtained in vi. with a standard map of nucleosomal dyad chains, preferably wherein said comparing the mapped peaks obtained in v.
  • nucleosomal dyad positioning comprises determining the deviation of nucleosomal dyad positioning on cfDNA fragments, determining changes in the informative counts ratio, determining presence or absence of specific nucleosome positioning that is characteristic for one of the healthy or unhealthy subject groups, and/or determining chains of nucleosomal dyad positioning wherein a. congruence with the library of standard maps derived from healthy subjects and difference with the library of standard maps derived from unhealthy subjects is characteristic for a healthy status; b. congruence with the library of standard maps derived from unhealthy subjects and difference with the library of standard maps derived from healthy subjects is characteristic for an unhealthy status; c.
  • a computer-implemented method for monitoring the treatment success of a patient comprising the steps of: i. receiving data representing the DNA sequences of cfDNA fragments acquired by sequencing of cfDNA fragments extracted from a sample; ii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from i. with a reference genome sequence; iii. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; iv. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iii. and the fragmentation profile of ii.; v. mapping peaks of the average probability of the presence of a nucleosomal dyad of iv. across the reference genome sequence; vi. chaining the mapped peaks obtained from step v. across the reference genome sequence into series of mapped peaks following rules for naturally occurring nucleosomal spacing, and vii. comparing the mapped peaks obtained in v. with the mapped peaks of a previous result from said patient, comparing the mapped peaks obtained in v. with a standard map of nucleosomal dyads characteristic for the treatment success, comparing the chained peaks obtained in vi. with the chained peaks of a previous result from said patient, and/or comparing the chained peaks obtained in vi.
  • comparing the mapped peaks comprises determining the deviation of nucleosomal dyad positioning on cfDNA fragments, determining changes in the informative counts ratio, and/or determining presence or absence of specific nucleosome positioning that is characteristic for one of the healthy or unhealthy subject groups, wherein differences and/or congruences obtained in vii. provide information on the treatment success of the patient.
  • a computer-implemented method for determining the cell type and/or tissue contribution of cfDNA in a sample comprising the steps of: i. receiving data representing the DNA sequences of cfDNA fragments acquired by sequencing of cfDNA fragments extracted from a sample; ii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from i. with a reference genome sequence; iii. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; iv. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iii.
  • v. mapping peaks of the average probability of the presence of a nucleosomal dyad of iv. across the reference genome sequence; vi. chaining the mapped peaks obtained from step v. across the reference genome sequence into series of mapped peaks following rules for naturally occurring nucleosomal spacing, and vii. determining the cell type and/or tissue contribution of the cfDNA fragments by comparing the mapped peaks obtained from v. with a library comprising mapped nucleosomal dyads of specific tissues or cell types and/or determining the cell type and/or tissue contribution of the cfDNA fragments by comparing the chained peaks obtained in vi. with a standard map of nucleosomal dyad chains of specific tissues or cell types.
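The chaining step can be pictured with a small sketch. The text refers to "rules for naturally occurring nucleosomal spacing" without stating them here, so the sketch substitutes a single hypothetical rule: two neighboring peaks belong to the same chain if their distance lies within an allowed range. The function name and the default bounds of 150 and 250 bp are assumptions for illustration, not values from the text.

```python
def chain_peaks(peak_positions, min_spacing=150, max_spacing=250):
    """Group mapped dyad peaks into chains of regularly spaced peaks.

    Assumption: a peak extends the current chain only if its distance to
    the previous peak lies in [min_spacing, max_spacing]; otherwise it
    starts a new chain. The bounds are illustrative placeholders.
    """
    chains = []
    current = []
    for pos in sorted(peak_positions):
        if current and min_spacing <= pos - current[-1] <= max_spacing:
            current.append(pos)
        else:
            if current:
                chains.append(current)
            current = [pos]
    if current:
        chains.append(current)
    return chains


# Made-up peak coordinates on one chromosome.
peaks = [1000, 1190, 1375, 2000, 2185, 5000]
chains = chain_peaks(peaks)
```

Chains produced this way could then be compared against a standard map of dyad chains, for example by counting how many chained positions coincide within a tolerance.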
  • a data processing apparatus comprising means for carrying out the method of any one of items 8 to 10.
  • a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method of any one of items 8 to 10.
  • a computer-readable medium having stored thereon the computer program of item 12.
  • An in vitro method for determining the health status of a subject comprising the steps of: i. extracting cfDNA fragments from a sample from the subject; ii. determining the sequence of the cfDNA fragments by performing whole genome sequencing on the extracted cfDNA fragments; iii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from ii. with a reference genome sequence; and iv. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; wherein the probability obtained from iv. for different fragment lengths of the cfDNA fragments as obtained from iii.
  • a health status deviating from a healthy status is indicated if the z-score of the set of cumulative deviations and/or informative counts ratios deviate from the set of distributions of cumulative deviations and/or informative counts ratios recorded from healthy subjects, preferably wherein a health status deviating from a healthy status is cancer or pregnancy-associated complications.
  • Tumor cells may release their DNA into the circulation. Tumors often have an increased vascularization due to their demand for nutrition because of the accelerated growth and higher number of tumor cell divisions. If tumor cells are in direct contact with blood, the likelihood that their DNA will be shed into the circulation increases.
  • Apoptosis may eventually result in the death and destruction of the respective cell and the release of DNA and associated molecules into the circulation.
  • As a minimally invasive sampling method, a standard blood draw contains blood cells, such as white blood cells (WBCs) and erythrocytes. Furthermore, circulating tumor cells (CTCs) may be detectable in the blood of patients with cancer. In addition to cells, multiple other molecules are in the bloodstream. For this patent application, cfDNA molecules are the most relevant ones.
  • GE diploid genome equivalents
  • 10 ng of DNA is used as input and PCR is performed to selectively amplify library fragments containing adapters for subsequent sequencing. Libraries are then quantified and sequenced in paired-end mode (150 bp x 2 or 100 bp x 2) at high coverage (approximately 30x). For humans, 30x coverage can be achieved with 600 million reads of 150 bp (i.e., 300 million read pairs).
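The coverage figure quoted above follows from simple arithmetic: mean depth of coverage is the total number of sequenced bases divided by the genome size. The sketch below reproduces the calculation, assuming a haploid human genome size of roughly 3.0 Gb (an assumption, since the text does not state the value used).

```python
def mean_coverage(num_reads, read_length_bp, genome_size_bp):
    """Expected mean depth of coverage: sequenced bases / genome size."""
    return num_reads * read_length_bp / genome_size_bp


# Assumption: haploid human genome of about 3.0 Gb.
HUMAN_GENOME_BP = 3.0e9

# 600 million reads of 150 bp, i.e. 300 million read pairs.
coverage = mean_coverage(600_000_000, 150, HUMAN_GENOME_BP)
print(f"{coverage:.0f}x")
```

This estimate ignores reads lost to adapter trimming, quality filtering, and duplicates, so the usable coverage is typically somewhat lower.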
  • quality control and pre-processing of sequencing reads e.g. adapter trimming, base quality filtration, GC-correction
  • the fragment coverage defined as the number of nucleosome prior distributions per base, is extracted from each sample dataset.
  • Within the blood vial, the blood cells must be separated from the cell-free or liquid component of whole blood, which is referred to as blood plasma. This separation is done by centrifugation. After centrifugation, the different blood fractions are visible in the blood vial, where the yellow fraction corresponds to the plasma (approximately 55% of whole blood).
  • the thin white layer is the buffy coat containing WBCs and platelets (<1% of whole blood).
  • the red fraction (approximately 45% of whole blood) comprises mainly erythrocytes.
  • cfDNA is isolated from the yellow part, i.e., the blood plasma, and kept in Eppendorf tubes for subsequent analysis.
  • cfDNA fragments are subjected to library preparation for subsequent NGS, including ligating adapters to the cfDNA fragments.
  • NGS sequencer may be the Illumina NextSeq; however, other NGS sequencers from this manufacturer or other vendors can equally be used for this purpose.
  • Bioinformatics tools are applied to infer from the sequence reads the depth of coverage (red line) and distinguish between well-protected loci, which should correspond to nucleosome-bound DNA, moderately protected loci, or unprotected regions, which should reflect increased enzymatic digestion due to the lack of protective nucleosomes. Peak regions of protected loci may represent nucleosome dyad positions (black dashed lines), i.e., regions occupied by the center of a nucleosome.
  • Plasma DNA depth of coverage analysis of distinct genomic regions may provide biologically relevant information.
  • the left coverage plot illustrates a drop in average depth of coverage across many TSSs of genes that are likely to be expressed.
  • nucleosomes are removed to create an NDR over the promoter, allowing transcription factors to bind.
  • the flanking regions upstream and downstream show periodic oscillations, reflecting the organization of nucleosomes adjacent to the TSS. From such a pattern, it can be inferred that the genes used in the compound depth computation are likely expressed in the cells that shed their DNA into the circulation.
  • the right coverage plot reflects the average across regions with increased, uniform coverage, indicating densely packed nucleosomes with less defined positioning, suggesting that these genes are unlikely to be expressed.
  • An example is shown in Figure 4. This figure is intended to compare the relative occurrence of fragment lengths within a single dataset rather than to give absolute counts, since these may vary greatly depending on the sequencing platform, targeted sequencing depth, and other factors. Hence, the linear y-axis shows no values.
  • The fragment lengths in an untargeted whole genome sequencing dataset from isolated double-stranded cfDNA are shown. Fragments were sequenced paired-end so that the length could be deduced from the start and end positions of reads from both ends. Only fragments falling in the local windows around hypothetical nucleosome dyads were counted during the computation of the nucleosome prior distribution.
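Deducing a fragment length from a read pair can be sketched as follows. The sketch assumes that both reads of a pair are already aligned to the same chromosome and that their alignments are given as 0-based, half-open intervals; the function names and the example coordinates are hypothetical.

```python
def fragment_span(fwd_start, fwd_end, rev_start, rev_end):
    """Outer coordinates of the fragment covered by an aligned read pair.

    Assumption: coordinates are 0-based and half-open, and both reads
    map to the same chromosome.
    """
    return min(fwd_start, rev_start), max(fwd_end, rev_end)


def fragment_length(fwd_start, fwd_end, rev_start, rev_end):
    """Fragment length as the distance between the outer read ends."""
    start, end = fragment_span(fwd_start, fwd_end, rev_start, rev_end)
    return end - start


# Made-up pair: two 150 bp reads that overlap in the fragment center.
length = fragment_length(1000, 1150, 1017, 1167)
```

Aligners usually report an equivalent value per read pair, so in practice this quantity is often read from the alignment records rather than recomputed.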
  • the lower coverage plot illustrates three regions where different numbers of cfDNA fragments map. To the left is a locus with high coverage of sequencing reads, indicating high resistance to enzymatic digestion. The region on the right side has fewer sequencing reads, suggesting moderate resistance to cleaving. In contrast, the region in the center hardly overlaps with any cfDNA sequencing reads, which indicates no protection from enzymatic digestion.
  • nucleosomes offer protection from enzymatic digestion during apoptosis and as the nucleosomal dyad is the region where nucleosomal DNA is most tightly bound, it is possible to translate the sequencing depths into nucleosome position maps where the position of maximum coverage overlaps with the nucleosome dyad (dashed light grey line in the upper coverage plot, the position of the inferred nucleosome dyad axis). Hence, nucleosomal dyad positions are inferred from sequencing read depth analysis for each cfDNA fragment in this step.
  • the individual cfDNA fragment nucleosomal dyad information is then used to infer within each cfDNA fragment the relative position of the dyad (block arrows; left panel “Nucleosomal Fragments”).
  • the nucleosomal dyad may map in the center of a cfDNA fragment, be somewhere off the center, or may not be determinable.
  • the next step involves fragment length-specific dyad statistics. The inferred nucleosome dyad positions are recorded for all fragments that map to the same locus and have the same length (center panel, black triangles).
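The fragment length-specific dyad statistics can be sketched as a table of counts keyed by fragment length and by the offset of an inferred dyad from the fragment center. The window size of 150 bp, the integer definition of the fragment center, and all names are assumptions made for illustration; the text only states that dyad positions are recorded per fragment length and may be counted beyond fragment ends.

```python
from collections import Counter, defaultdict


def dyad_offset_counts(fragments, dyad_positions, window=150):
    """Count inferred dyads relative to fragment centers, per length.

    fragments: iterable of (start, end), 0-based half-open coordinates.
    dyad_positions: inferred dyad coordinates on the same chromosome.
    window: maximum absolute offset that is still counted (assumed value).
    Returns {fragment_length: Counter({offset: count})}.
    """
    counts = defaultdict(Counter)
    for start, end in fragments:
        length = end - start
        center = start + length // 2
        for dyad in dyad_positions:
            offset = dyad - center
            if abs(offset) <= window:
                counts[length][offset] += 1
    return counts


# Made-up fragments around one inferred dyad at position 185.
counts = dyad_offset_counts([(100, 267), (110, 277), (90, 240)], [185])
```

For genome-scale data the inner loop over all dyads would be replaced by a sorted lookup, but the resulting table has the same shape.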
  • nucleosome prior distribution yi
  • the initial dyad count distribution for a specific fragment length is first truncated according to a certain strategy (step 2; the strategy shown is fragment length-based truncation), then normalized to an area under the curve of one to resemble a probability density function (step 3) and finally, the non-informative constant portion of counts is removed by adjusting the zero level (step 4).
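The three processing steps described above (truncation, normalization to unit area, zero-level adjustment) can be sketched for one fragment length. The sketch assumes unit spacing between offsets, so that the sum of the values approximates the area under the curve, and it assumes that the constant portion is removed by subtracting the minimum; neither detail is stated in the text, and the function name is hypothetical.

```python
def prior_from_counts(offsets, counts, fragment_length):
    """Turn a dyad count distribution into a prior-like distribution.

    Step 2: fragment length-based truncation (keep offsets inside the
            fragment, measured from its center).
    Step 3: normalize to an area of one (sum of values, unit spacing).
    Step 4: adjust the zero level by subtracting the minimum (assumption).
    Returns (kept_offsets, normalized_density, zero_adjusted_prior).
    """
    half = fragment_length // 2
    kept = [(o, c) for o, c in zip(offsets, counts) if -half <= o <= half]
    kept_offsets = [o for o, _ in kept]
    values = [float(c) for _, c in kept]
    total = sum(values)
    if total == 0:
        raise ValueError("no counts left after truncation")
    density = [v / total for v in values]
    baseline = min(density)
    prior = [v - baseline for v in density]
    return kept_offsets, density, prior


# Made-up counts for a toy fragment length of 6 bp.
offsets = list(range(-4, 5))
counts = [9, 2, 3, 5, 8, 5, 3, 2, 9]
kept_offsets, density, prior = prior_from_counts(offsets, counts, 6)
```

Note that after the zero-level adjustment the values no longer sum to one; whether a further rescaling is applied is not stated in the text.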
  • Count distributions are shown relative to the central base(s) (medium gray vertical line) of fragments. Counts for fragment lengths from 50bp to 300bp are depicted. Inferred dyads were counted beyond fragment ends, which are indicated by the medium gray dashed lines. The distributions shown in the figure are used for computing prior distributions of nucleosome dyads. Medium gray areas indicate low counts. The transition from medium gray to darker gray and further to lighter grays up to white, as in the center of the figure, indicates increasing counts. The darker spots to the far left and right of the center mark an increase in counts that can be attributed to neighboring nucleosomes of the observed fragment.
  • the degree of how precisely neighboring nucleosomes are positioned relative to the nearest one can be derived from the spread of the spot for that fragment length.
  • the most accurate neighboring dyad positioning seems to exist for fragments between 200bp and 220bp.
  • Horizontal bands of different grays appearing approximately every 10bp vertically for smaller fragments up to a length of about 160bp originate from the increased cleaving resistance caused by the DNA helix twisting with an approximate 10bp periodicity, which sterically hinders the cleaving process by making the DNA backbone face towards the histone complex in the same periodic manner.
  • White lines indicate the count minima that are closest to the fragment center, which would be likely chosen by the short-range truncation method.
  • Fragment length truncation would terminate count distributions at the fragment ends instead (medium gray dashed lines).
  • the uniform truncation strategy would use an identical distance from the center of each fragment to the truncated bases at both sides (end-to-end distance here: 170bp; white dotted lines).
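The three truncation strategies mentioned here (short-range, fragment length, uniform) can be compared as functions that return the left and right truncation offsets relative to the fragment center. This is a sketch under simplifying assumptions: the short-range variant walks outward from the center while counts do not increase, which presumes a central peak, and half-lengths are rounded down. All function names are hypothetical; only the 170 bp end-to-end distance is taken from the text.

```python
def short_range_bounds(offsets, counts):
    """Offsets of the count minima closest to the fragment center.

    Assumption: counts fall from a central peak to the first minimum on
    each side, so walking outward until counts rise finds those minima.
    """
    center = min(range(len(offsets)), key=lambda i: abs(offsets[i]))
    left = center
    while left > 0 and counts[left - 1] <= counts[left]:
        left -= 1
    right = center
    while right < len(counts) - 1 and counts[right + 1] <= counts[right]:
        right += 1
    return offsets[left], offsets[right]


def fragment_length_bounds(fragment_length):
    """Truncate at the fragment ends (half-length rounded down)."""
    half = fragment_length // 2
    return -half, half


def uniform_bounds(end_to_end=170):
    """Same distance from the center for every fragment length."""
    half = end_to_end // 2
    return -half, half


# Made-up counts with minima at offsets -3 and +3.
offsets = list(range(-4, 5))
counts = [9, 2, 3, 5, 8, 5, 3, 2, 9]
bounds = short_range_bounds(offsets, counts)
```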
  • An unmarked version of the count heatmap is shown in the small panel on the top right.
  • A: Display of a coverage-based signal (fine broken light gray line) in comparison to a nucleosome priors-based signal (medium gray line) for a selected human reference DNA sequence (black bar). On the left, a region is shown where the nucleosome-prior position (black dotted vertical line) differs from the highest read coverage (gray dotted line). The difference between the two inferred dyad positions is indicated (Δ).
  • our new calculation method maps nucleosome dyad positions with an increased resolution.
  • the top panel illustrates how the coverage-based signal (fine broken light gray line) is generated from sequencing reads (gray horizontal bars), which map to this particular human reference DNA sequence.
  • the coverage-based analysis results in the identification of one peak region (gray arrow (C)).
  • a nucleosome priors-based analysis reveals that the large coverage-based signal consists of two closely positioned nucleosomes (dashed light gray line, indicating the respective dyad axes), with partially overlapping signals.
  • an interpretation based only on the coverage-based signal can result in wrong conclusions.
  • the left side illustrates how the combination of depth of coverage (fine broken light gray line) and nucleosome priors-based signal (medium gray line) can be leveraged to calculate the local DNA accessibility (ac) from the depth of coverage (d) and the spacing between nucleosome dyads (Δnd).
  • the accurate calculation of nucleosome dyad positions allows for determining genomic regions where nucleosomes are highly phased or less well organized (“variability of placement” in the second column of this panel).
  • the local organization of nucleosomes can be investigated in more detail using the number and spread of side-peaks of a main peak, which would be the highest local maximum, and the width of the main peak.
  • Machine learning classifiers detect pathophysiological states.
  • Examples 2-5 and Example 11 illustrate the layout or implementation of the methods described herein.
  • The L-WPS signal (Snyder et al., 2016), another nucleosome positioning signal, is visualized in some of the examples.
  • Another low-resolution positioning signal is the central 31 fragment bases depth of coverage (c31 DoC).
  • the c31 DoC computes the depth of coverage using only the central 31 bp of a cfDNA fragment as determined by the start and end positions of a sequence read pair that was aligned to the reference genome.
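The c31 DoC computation described above can be sketched in a few lines. This is a minimal illustration, not the implementation used for the figures: the function name `central_window_depth`, the 0-based end-exclusive coordinates, and the choice of the left-of-center base for even fragment lengths are all assumptions made here.

```python
def central_window_depth(fragments, region_start, region_end, window=31):
    """Per-base depth counting only the central `window` bases of each fragment.

    fragments: iterable of (start, end) reference coordinates (0-based,
    end-exclusive), e.g. derived from an aligned read pair.
    Returns a list of counts for positions region_start .. region_end - 1.
    """
    depth = [0] * (region_end - region_start)
    half = window // 2
    for start, end in fragments:
        # Midpoint base of the fragment (left of center for even lengths).
        center = (start + end - 1) // 2
        lo = max(center - half, region_start)
        hi = min(center + half + 1, region_end)
        for pos in range(lo, hi):
            depth[pos - region_start] += 1
    return depth


# Two overlapping 167 bp fragments whose centers are 10 bp apart.
depth = central_window_depth([(100, 267), (110, 277)], 0, 300)
```

Restricting the count to the central bases narrows the footprint of every fragment, which is why the resulting peaks are sharper than those of ordinary read coverage.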
  • Knowledge of the nucleosome dyad, its position within individual cfDNA fragments, and nucleosome mapping with unprecedented resolution increase the accuracy of nucleosome positioning analysis by several orders of magnitude and allow for new applications of cfDNA tests, such as high-accuracy single-locus nucleosome positioning analysis from liquid biopsy samples and deconvolution of multiple positioning signals at any locus.
  • this approach provides sufficient coverage to simultaneously capture other critical and clinically relevant features from cfDNA besides nucleosome positioning and open chromatin mapping, such as somatic copy number alterations, tumor fraction represented in plasma, and germline variants. Because all this information can be harvested from a whole-genome sequencing dataset, there is no requirement for a separate library preparation or sequencing procedure, making this workflow more time- and cost-efficient compared to previous approaches for which special assays had to be developed. As such, our high-resolution nucleosome dyad mapping approach maximizes the genomic insights obtained from whole-genome sequencing in an unprecedented fashion.
  • Rows are as follows: central 31 fragment bases depth of coverage (c31 DoC), nucleosome majority positioning peak calls created using the c31 DoC signal, L-WPS positioning signal, nucleosome majority positioning peak calls created using the L-WPS signal, aligned read depth of coverage as computed by IGV program, and the read alignments from the BAM file which was used to create all signals and majority calls above.
  • the c31 DoC signal is visualized as colored bars (dark grey).
  • the L-WPS signal is displayed as a line plot with the value zero as a thin horizontal line. It can be seen that L-WPS is mainly negative due to the signal being primarily determined by the local presence of fragment ends, which reduces the L-WPS value.
  • The height of the majority positioning calls is the prominence of the peaks called by the find_peaks() function of SciPy (scipy.signal.find_peaks) in Python.
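To make the notion of peak prominence concrete, the following is a pure-Python stand-in, not SciPy's implementation. It takes the prominence of a peak as its height above the higher of the two valleys that separate it from the nearest higher signal on either side. The function names and the threshold value are illustrative assumptions.

```python
def local_maxima(signal):
    """Indices that are strictly higher than both direct neighbors."""
    return [i for i in range(1, len(signal) - 1)
            if signal[i - 1] < signal[i] > signal[i + 1]]


def prominence(signal, peak):
    """Height of `peak` above the higher of its two surrounding valleys."""
    height = signal[peak]
    left_min = height
    i = peak - 1
    # Walk left until the signal rises above the peak or the border is hit.
    while i >= 0 and signal[i] <= height:
        left_min = min(left_min, signal[i])
        i -= 1
    right_min = height
    j = peak + 1
    # Same walk to the right.
    while j < len(signal) and signal[j] <= height:
        right_min = min(right_min, signal[j])
        j += 1
    return height - max(left_min, right_min)


def call_peaks(signal, min_prominence):
    """(index, prominence) for every local maximum passing the threshold."""
    calls = []
    for i in local_maxima(signal):
        p = prominence(signal, i)
        if p >= min_prominence:
            calls.append((i, p))
    return calls


calls = call_peaks([0, 2, 1, 5, 1, 3, 0], min_prominence=2)
```

With SciPy available, `scipy.signal.find_peaks(signal, prominence=...)` returns the peak indices together with a `prominences` entry in its properties dictionary, which would replace the helper functions above.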
  • the peaks from both signals show a high concordance, indicating both signals can be used as a source to compute the majority positioning of nucleosomes.
  • Two nucleosome chains are marked in the case of the c31 DoC signal: a more prominent A chain of majority positions and a less pronounced B chain in between the A chain majority call positions. No such second chain can be seen in the L-WPS panel of the figure. This shows the low usability of the L-WPS signal for determining secondary nucleosome positions and conducting multiple signature deconvolution. Judging from the c31 DoC signal, it is unclear if only one secondary nucleosome chain is present or multiples because of the wide spread of the c31 DoC peaks.
  • Panels A and B each show two positioning signals computed for the TSS or the TTS of a gene. Genes are shown in 5’ to 3’ direction left to right, according to the strand on which the gene is located. This causes the left end of the x-axis to equal “upstream of the TSS/TTS” and the right end to equal “downstream of the TSS/TTS”.
  • the novel posterior nucleosome dyad signal (gray)
  • the L-WPS signal (light gray broken line).
  • The L-WPS signal was computed according to the algorithm published by Snyder et al. in 2016. The latter is only defined for a fragment length range of 120-180 bp.
  • the dark gray line tracks fragment coverage for each position of the genome and can be seen to be insufficient for the bp-accurate determination of the DNA’s resistance to cleaving by DNases based on its local maxima.
  • the medium gray lines show nucleosome positioning patterns that are compatible with the TSS/TTS of an unexpressed gene.
  • The positioning signals are shown separately for the intermediate fraction (fragments between 120 bp and 180 bp) and the long fraction (fragments longer than 180 bp). Although the intermediate fraction is highly over-represented in the cfDNA pool, these two fragment groups show high concordance of posterior dyad signal maxima positioning for both loci.
  • TSSs of genes expected to be not or only occasionally expressed in hematopoietic cells are provided in Figures 24A,B, and D.
  • The patterns of main peak positioning at the TSS of PLA2G2E and the TTS of ZNF648 show high similarity, with ZNF648 exhibiting more positioning variability around the main peaks of the intermediate fraction.
  • the nucleosome repeat distance (NRD) for both loci is relatively consistent within each locus and falls inside the expected interval of possible NRD values that occur in a 30 nm chromatin fibril of densely packed nucleosomes (i.e. closed chromatin).
  • the L-WPS signal is only defined for fragments in the intermediate fraction and generally shows a lower resolution in its ability to discern close neighboring peaks compared to the posterior signal. Repeating positioning patterns occur more often in the posterior signal than in the L-WPS signal (e.g. intermediate fraction of ZNF648). Further examples of TTSs of inactive genes are provided in Figures 24F and G.
  • The dark gray depth of coverage signals show large NDRs upstream of the TSS of LPGAT1 in the proximal promoter region and downstream of the TTS of PFKFB2.
  • The gray posterior nucleosome positioning signals of the intermediate fragment length fraction show well-positioned -1 and +1 nucleosomes around the NDR for the TSS and well-positioned -1 and 0 nucleosomes for the TTS (the low-support peak in the TTS NDR is omitted during major peak assessment).
  • The positioning signals are shown separately for the intermediate fraction (fragments between 120 bp and 180 bp) and the long fraction (fragments longer than 180 bp).
  • TSS of LPGAT1 and L-WPS signal around the masked nucleosome dyad posterior peak in the NDR of PFKFB2) because of its dependence on the depth of coverage (compare to L-WPS positioning signal amplitude loss in the region upstream of the ATP1A1 TSS in Figure 22B). It also increases in a non-intuitive way (i.e., the region of lowest resistance to cleaving shows high values of windowed protection score), which is indicated for the regions marked with “L-WPS zero-drift”.
  • NDRs can be detected more easily using the distance between the main peaks of the posterior nucleosome dyad signal after applying the minimum fragment support criterion to the set of main peaks in the region of interest.
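The NDR detection rule stated above can be sketched as follows. This is only an illustration under stated assumptions: peaks are given as hypothetical `(position, supporting_fragments)` pairs relative to the TSS or TTS, and both thresholds (`min_support`, `min_gap`) are placeholder values, not values from the description.

```python
def find_ndr_candidates(peaks, min_support, min_gap):
    """Gaps between neighboring main peaks that are wide enough to be an NDR.

    peaks: iterable of (position, supporting_fragments) pairs.
    A peak counts as a main peak only if it meets the minimum fragment
    support criterion; gaps are measured between consecutive main peaks.
    """
    main = sorted(pos for pos, support in peaks if support >= min_support)
    return [(left, right) for left, right in zip(main, main[1:])
            if right - left >= min_gap]


# Hypothetical peaks around a TSS at position 0; the peak at -60 has low support.
peaks = [(-380, 40), (-190, 35), (-60, 3), (150, 42), (340, 38)]
ndrs = find_ndr_candidates(peaks, min_support=10, min_gap=300)
```

Applying the support criterion first matters: without it, the low-support peak inside the depleted region would split the gap and hide the NDR.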
  • H1 is a histone which can loosely bind linker stretches of mononucleosomal DNA. Neighboring nucleosomes are visible as increased counts upstream and downstream of the cfDNA fragment ends (darker gray). The darkest gray counts mark the level of random dyad positioning signals. The proportion of random positioning is determined by the amount of counted fragments that did not agree with the predominant nucleosome positioning at all counting loci of hypothetical nucleosome positions. These fragments either had their actual nucleosomal dyad in a randomly close vicinity or, based on the amount of these observations, overlapped with the hypothetical nucleosome dyad position only by chance.
  • (D) Nucleosome-conferred DNA cleaving resistance: The x-axis displays the values for observed fragment lengths, i.e., from 50-350 bp.
  • the y-axis shows the “conferred cleavage resistance”, i.e., a metric related to a specific transformation of the dyad count distributions, which is displayed in Figure 12E.
  • The limited dyad count distribution, including the random fraction, is normalized to an area under the curve of 1.
  • the end frequency distribution is expressed relative to an expected nucleosome dyad position on a cfDNA fragment of a specific length which is equal to the position of the maximum of the nucleosome prior distribution.
  • the normalized end frequency distribution is the fragment’s mirrored normalized dyad count distribution, with the mirrored distribution shifted towards the fragment end of interest such that the mirrored fragment end overlaps with the expected dyad position on the fragment (which is equal to the maximum of the mirrored distribution overlapping with the fragment end of interest).
  • the cleaving resistance is computed from the ratio of the maximum observed end-frequency (either 5’, darkest gray; or 3’, medium gray) and the end’s frequency at the dyad. For example, an unweighted cleaving resistance of 2 means that the maximum end frequency for a given fragment length is twice the end frequency at the dyad.
  • the plot can also be interpreted as the confidence in the determined fragment end locations relative to the dyad. This equals our confidence in deciding on the location of the dyad relative to a fragment of a certain size. For DNA fragments with a length of 149bp, for example, we are most confident in locating the relative dyad position. It is centered as expected.
  • DNA fragments of 167 bp already show a higher uncertainty of dyad placement, which is in accordance with the 10 bp flanking unprotected linker segments being randomly cleaved (i.e., the cleaving resistance is lower).
  • Weighted versions take into account how the distance between the end count maximum and the dyad relates to the total fragment length.
  • (E) Fragment end frequency transformation: The expected dyad position on a fragment is determined by the position of the maximum of the empirical dyad count distribution. The count distribution is normalized to an area under the curve of 1 after fragment-length-dependent truncation. The resulting distribution is mirrored and placed such that the maximum overlaps with the fragment end of interest. If the maximum of the count distribution is well pronounced, like in the case of 149 bp fragments, the resulting fragment end distribution is representative of where to expect fragments to end relative to the nucleosome dyad for fragments of a specific length.
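The transformation in (E) and the unweighted cleaving resistance in (D) can be sketched together. This is one possible reading of the text, not the authors' code: the dyad count distribution is assumed to be a list indexed by position on the fragment (index 0 is the 5' end), the expected dyad is taken as the mode, and the names `end_frequency` and `cleaving_resistance` are introduced here for illustration only.

```python
def normalize(counts):
    """Scale a discrete count distribution to a total of 1."""
    total = sum(counts)
    return [c / total for c in counts]


def end_frequency(dyad_counts, end="5p"):
    """Mirror the normalized dyad distribution onto a fragment end.

    Returns (expected_dyad, freq), where freq maps a position x (fragment
    coordinates, dyad held at its expected position) to the frequency of
    the chosen fragment end falling on x.
    """
    p = normalize(dyad_counts)
    length = len(p)
    dyad = max(range(length), key=p.__getitem__)  # mode = expected dyad
    freq = {}
    for d, prob in enumerate(p):
        # A dyad observed at d instead of `dyad` shifts the end by (dyad - d).
        x = dyad - d if end == "5p" else dyad + (length - 1 - d)
        freq[x] = prob
    return dyad, freq


def cleaving_resistance(dyad_counts, end="5p"):
    """Maximum end frequency divided by the end frequency at the dyad."""
    dyad, freq = end_frequency(dyad_counts, end)
    at_dyad = freq[dyad]
    if at_dyad == 0:
        return float("inf")  # no end was ever observed at the dyad
    return max(freq.values()) / at_dyad


# Toy 5-position distribution; real distributions span the fragment length.
resistance_5p = cleaving_resistance([1, 2, 4, 2, 3], end="5p")
```

By construction the maximum of the mirrored distribution lands on the fragment end itself (position 0 for the 5' end, the last position for the 3' end), which matches the placement rule described in (E).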
  • Figures 23 and 24 show a nucleosome positioning signal on top, as described in Figure 12, either at the TSS or the TTS of an actively transcribed (Figure 23) or an inactive (Figure 24) gene.
  • the data obtained by the invention are in line with known expression patterns.
  • genes with a nucleosome positioned at the TSS and arrays of nucleosomes upstream and downstream with similar phasing are examples of genes with a per-gene positioning track in cfDNA from healthy individuals ( Figure 12A).
  • PLA2G2E and ZNF648 are examples of genes with a per-gene positioning track in cfDNA from healthy individuals (Figure 12A).
  • Because the nucleosomes at positions -1 and 0 block the binding of the bulky transcription machinery, this nucleosome pattern is in concordance with the expectation for an unexpressed gene.
  • the interpretation “unexpressed” is based on a high likelihood of unexpressed genes having nucleosomes at typical NDR locations (e.g.
  • NDR nucleosome-depleted region
  • LPGAT1 or PFKFB2: the pattern is reversed for the TTS, since upstream corresponds to the gene body and downstream corresponds to transcription factor binding sites, in contrast to the TSS example.
  • RNA polymerase II proximal promoter regions at the TSS containing a nucleosome-free region of ~200 bp around the TSS, mostly upstream, flanked by well-positioned nucleosomes at both sides.
  • This correlates well with the expected expression of a gene, as the differences of nucleosome occupancy patterns between Figures 23 and 24 clearly show. Additionally, regulatory information can be obtained by the bp-accurate measurement of the +1 and the -1 nucleosome positions.
  • A gene’s promoter region could be in a poised chromatin state, i.e., bearing simultaneously both activation-associated and repression-associated histone modifications, which enable the gene to switch rapidly between an active and repressive state.
  • the difference between poised state and active transcription can be determined by a slight shift of the +1 nucleosome and a change in the fragmentation pattern beneath it due to nucleosome unwrapping and rewrapping (the latter happening upstream of the NDR) in the case of active transcription which does not occur in the poised state.
  • the NDR pattern corresponds to bound transcription factors of the transcription termination machinery.
  • Example 3 mainly refers to results depicted in panels A-F of Figure 22, which are described in the following:
  • the posterior nucleosome dyad signal (gray line) shows two distinct nucleosome dyad positioning series, supposedly coming from two separate nucleosome positioning patterns.
  • Capital “A” and “B” characters at peak positions mark the dyad positions of these two manually chained posterior dyad peak series.
  • the L-WPS signal is shown to illustrate the high accuracy of the posterior signal, even for a lower depth of coverage.
  • the L-WPS signal shows a strong upward drift and signal amplitude loss with reduced depth of coverage.
  • the low depth of coverage around the TSS region suggests that the gene might be active, which is in line with known gene expression levels of normal human tissues.
  • the low support of peaks around the TSS causes the central three A chain peaks and the central two B chain peaks to be disregarded during the main peak assessment. Therefore, these peaks are not part of a major chain and do not represent the main regulatory state of the RIT1 gene.
  • (B) Nucleosome occupancy plot based on 120-180 bp fragments (“intermediate fraction”) for the TSS of the ATP1A1 gene shown for comparison of L-WPS and nucleosome dyad posterior probability with signals shown in (A).
  • the posterior signal is visible as a series of repeating peaks with the highest signal amplitude in the region with almost no reduction in signal amplitude compared to L-WPS.
  • the peak situation in the upstream region of the TSS of ATP1A1 is very similar to the one described in (A) for the RIT1 gene, except there is no second chain upstream to the TSS. Peaks between -200 bp and +550 bp are excluded from the main peak assessment because of the low support of only 3 fragments. Therefore, these peaks likely stem from a few fragments from a closed chromatin state.
  • the signal amplitude loss of L-WPS is even more extreme than in (A).
  • three chains can be created from local peaks.
  • The A chain was started from the high posterior peak at the TTS, grouping peaks with an internucleosomal distance very close to the distance to the highest upstream peak. Chaining can be done at least up to the -2 nucleosome (slightly downstream of position -400 bp).
  • the second chain (“B chain”) started from the highest peak between the -2 and the -1 nucleosome of the A chain. Both the A and B chains are subject to a substantial decrease in the height of chained peaks after the TTS peak.
  • the nucleosome repeat length of this chain is smaller than for the A chain.
  • a third chain (“C chain”) can be started from the central peak at the TTS and continued downstream for at least three additional posterior peaks.
  • the C chain exhibits a nucleosome repeat length similar to the B chain.
  • the prominent height of the posterior peak which is chosen for seeding the chaining process for the A and C chain here, is in concordance with the regulatory importance of the location where the posterior peak is located, namely the TTS of the NOTO gene.
  • This example shows a chaining algorithm that uses the posterior probability signal created from the long fraction of cfDNA fragments. Since the long fraction is thought to originate from one or more closed chromatin states, the local maxima of this signal can be used to extract nucleosome chains from the intermediate fraction posterior probability signal. Intermediate fraction peaks close to or overlapping with peaks of the long fraction can be chained together to form a chain of nucleosomes that form less accessible chromatin.
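The chaining idea described above can be sketched as follows. This is a simplified illustration, not the algorithm used in the example: peak positions are plain integers, and the matching tolerance of 15 bp is a placeholder assumption rather than a value from the description.

```python
def chain_by_reference(intermediate_peaks, long_peaks, tolerance=15):
    """Split intermediate-fraction peaks by proximity to long-fraction peaks.

    Peaks within `tolerance` bp of any long-fraction peak form the chain
    attributed to less accessible chromatin; all others are returned
    separately for further grouping.
    """
    chain, rest = [], []
    for pos in sorted(intermediate_peaks):
        if any(abs(pos - ref) <= tolerance for ref in long_peaks):
            chain.append(pos)
        else:
            rest.append(pos)
    return chain, rest


def repeat_lengths(chain):
    """Dyad-to-dyad distances between consecutive peaks of a chain."""
    return [b - a for a, b in zip(chain, chain[1:])]


# Hypothetical peak positions (bp) for both fragment length fractions.
a_chain, remaining = chain_by_reference(
    [5, 100, 205, 290, 402, 610], long_peaks=[0, 200, 400, 600])
spacing = repeat_lengths(a_chain)
```

The spacing list gives a quick check of whether a chain has a consistent nucleosome repeat length, which is the property the examples use to compare chains.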
  • the A chain contains posterior peaks compatible with closed chromatin, which is in concordance with known OR4F5 gene expression for normal human tissues. Other peaks between A chain peaks are grouped to form the B chain.
  • the B chain also contains peaks that would be incompatible with each other but are interpreted as options for nucleosome positioning within the B chain (e.g., B1 and B4 positions).
  • the A chain and the B chain nicely separate.
  • the consistent nucleosome repeat length for both chains suggests that both chains come from the same chromatin state. Because of the derivation of the A chain from the long fraction, this means that both the A and B chains originate from closed chromatin.
  • the high nucleosome repeat length of about 200 bp indicates that the region might be in solenoid chromatin structure, which is a model for the less-accessible 30 nm DNA fiber compared to the most accessible 11 nm chromatin conformation termed “beads on a string”.
  • the variability of the alternative positioning peaks within the B chain is much higher upstream to the TSS than downstream, indicating active transcription because of extensive nucleosome repositioning in the promoter region and not a poised state in which nucleosomes would remain in place even though they have been displaced.
  • the high internucleosomal distance of about 200 bp of the A chain suggests a less accessible chromatin state than the typical “beads on a string” conformation.
  • the location of the seeding peak and the upstream B chain peak suggests the absence of the transcription termination machinery (lack of an NDR), which is in concordance with known expression levels of the gene in normal human tissue.
  • Example 4 Improved pathway analyses and tumor subclassification
  • The high-resolution, single-gene nucleosome positioning tracking and gene regulatory analysis mentioned above critically facilitates the identification of altered pathways and tumor subclassification.
  • Current views of cancer biology suggest that cancer driver genes can be classified into 12 signaling pathways that regulate three core cellular processes, i.e., cell fate, cell survival, and genome maintenance. Therefore, a common and limited set of driver genes and pathways is responsible for most common forms of cancer. We can classify the activity status of these genes and pathways with far-reaching consequences, for example, for early cancer detection and therapy decisions.
  • Example 5 A nucleosome prior fragmentation index: fragment length and dyad position
  • a clinical example would be the cfDNA analysis of a healthy individual, where the majority of cfDNA fragments has a clearly symmetrical pattern, i.e., the dyad can be positioned in the center of the cfDNA fragments.
  • a similar symmetrical pattern is expected from cfDNA fragments of pregnant women, although the fetal-derived cfDNA increases the number of shorter cfDNA fragments.
  • DNA digestion in the apoptotic cells is mostly symmetrical.
  • canonical digestion is disturbed in many cells resulting in an increase of cfDNA fragments with an asymmetric dyad position ( Figure 20).
  • The degree of well-positioned ends relative to the dyad enables us to establish detailed symmetry statistics. For example, several studies found that plasma DNA samples from patients with cancer are enriched for smaller cfDNA fragments (<150 bp) and that the size distribution of small cfDNA fragments (100-150 bp) to larger cfDNA fragments (151-220 bp) can distinguish between healthy cfDNA patterns and those from patients with cancer (Mouliere et al., 2018).
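One simple way to summarize the two size fractions mentioned above is a short-to-long count ratio. The sketch below assumes a plain list of fragment lengths and inclusive interval bounds; the function name and the decision to express the comparison as a ratio are illustrative choices, not details taken from the cited study.

```python
def short_to_long_ratio(lengths, short=(100, 150), long=(151, 220)):
    """Ratio of fragment counts in the short and the long size fraction.

    lengths: iterable of fragment lengths in bp.
    short, long: inclusive (min, max) bounds of the two fractions.
    """
    lengths = list(lengths)
    n_short = sum(short[0] <= n <= short[1] for n in lengths)
    n_long = sum(long[0] <= n <= long[1] for n in lengths)
    if n_long == 0:
        raise ValueError("no fragments in the long fraction")
    return n_short / n_long


ratio = short_to_long_ratio([120, 145, 150, 151, 167, 167, 180, 230])
```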
  • we can add the nucleosomal dyad position for each fragment in addition to the fragment size fractions.
  • multiple processes such as nucleosome breathing, nucleosome sliding, and specific physiological and pathological states, affect the accessibility of nucleosomal DNA.
  • the dyad position within cfDNA fragments is highly informative about an individual's health status. For example, cfDNA fragments of cancer patients will have an increased variability of dyad positions per cfDNA fragment.
  • cfDNA fragments with the dyad at the center may indicate nucleosome stability, whereas cfDNA fragments with dyads off the center may reflect nucleosome instability. Since ctDNA has different fragmentation patterns, modeling cfDNA fragment length and dyad positions will facilitate distinguishing plasma samples from healthy donors from patients with cancer.
  • Nucleosome prior-based fragmentation index: the fraction of 167 bp fragments that have a dyad counted inside a defined central portion of the fragment over all observed 167 bp fragments, which we named “kurtosis of dyad placement” (Figure 14). This can be extended to all mono-nucleosomal fragments. Hence, the more DNA fragments deviate from a center position of the nucleosome dyad (indicating aberration from the canonical apoptosis process), the smaller the index/the value of kurtosis will be.
  • the count of all fragments of fragment lengths that show a clear preference for relative dyad positioning is divided by the number of fragments where no such preference can be established.
  • a minimum threshold for the informative fraction of an empiric dyad count distribution is used here to define the existence of a preference for dyad placement.
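The index defined above for 167 bp fragments can be sketched as a simple fraction. The size of the “defined central portion” is not given in the text, so the half-width of 10 bp used here is a placeholder assumption, as are the function name and the representation of dyads as offsets from the fragment start.

```python
def central_dyad_fraction(dyad_offsets, fragment_length=167, central_halfwidth=10):
    """Fraction of fragments whose dyad lies in the central portion.

    dyad_offsets: dyad position on each fragment, counted from the
    fragment start (0-based), for fragments of `fragment_length` bp.
    """
    if not dyad_offsets:
        raise ValueError("no fragments given")
    center = (fragment_length - 1) / 2
    inside = sum(abs(offset - center) <= central_halfwidth
                 for offset in dyad_offsets)
    return inside / len(dyad_offsets)


# Seven hypothetical 167 bp fragments; four have a near-central dyad.
index = central_dyad_fraction([83, 80, 90, 60, 120, 93, 94])
```

A sample in which dyads scatter away from the fragment center yields a smaller value, in line with the interpretation given for the index.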
  • Example 6 Plasma DNA tissue deconvolution: special handling of mixed signals (tissue deconvolution)
  • Tissue deconvolution: In individuals with a disease, the contribution of DNA from tissues may change if the diseased organ releases its DNA into the bloodstream. Our algorithms capture such changes and determine the tissue of origin (tissue deconvolution). Deciphering a “mixed” nucleosome pattern can reveal whether the DNA was released from cells where the respective nucleosomes had different positions. The number of fragments supporting different local nucleosome peaks could be used to estimate the percentage of different tissues contributing fragments to the coverage of a specific genomic locus. By analyzing multiple tissue-specific loci with extracted patterns, the accuracy of the estimated tissue contribution can be increased. To explain the relevance of tissue deconvolution: Several studies described diverse cellular and tissue origins of cfDNA.
  • the bloodstream serves as a heterogeneous reservoir of cfDNA fragments that vary from individual to individual as well as with age and other underlying physiological conditions.
  • cfDNA circulating hematopoietic DNA
  • coDNA circulating organ DNA
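The estimate of tissue contributions from peak support described in this example can be sketched naively. The code below assumes that every local peak has already been assigned to exactly one tissue and that pooling fragment counts over loci is an acceptable estimator; both are simplifying assumptions, and the tissue labels and counts are made up for illustration.

```python
def estimate_tissue_fractions(locus_support):
    """Pooled share of supporting fragments per tissue.

    locus_support: list with one dict per tissue-specific locus, mapping a
    tissue label to the number of fragments supporting that tissue's peak.
    """
    totals = {}
    for support in locus_support:
        for tissue, n_fragments in support.items():
            totals[tissue] = totals.get(tissue, 0) + n_fragments
    grand_total = sum(totals.values())
    return {tissue: n / grand_total for tissue, n in totals.items()}


# Two hypothetical loci with fragment support per tissue-specific peak.
fractions = estimate_tissue_fractions([
    {"hematopoietic": 90, "colon": 10},
    {"hematopoietic": 80, "colon": 20},
])
```

Pooling over more loci reduces the influence of any single locus, which reflects the statement that analyzing multiple tissue-specific loci increases accuracy.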
  • Example 7 Applications in patients with cancer
  • Figure 13 shows a comparison of the relevant loci for different patient groups (healthy, lung cancer, and colorectal cancer).
  • the accessibility of genomic regions may change. If the diseased organ releases its DNA into the bloodstream, the composition of the cfDNA pool will change accordingly.
  • Our algorithms capture such changes and determine the organs with increased DNA release.
  • Early detection of disease, i.e., screening for the presence of cancer in healthy individuals, aims at the detection of diseases in specific organs before the manifestation of clinical symptoms so that therapies can be started as early as possible.
  • early disease detection refers to issues, such as patient selection and monitoring, evaluation of ctDNA evolution, or clearance as a surrogate endpoint, and as such to the detection of minimal residual disease (MRD) (Moding et al., 2021).
  • MRD minimal residual disease
  • Both periods, i.e., early detection and early disease, have in common that disease-associated changes (modifications) in plasma DNA may be hard to find in the blood.
  • The ctDNA might account for just 0.1% or even less.
  • a tumor-informed method can be used for MRD detection.
  • tumor fraction with a limit of detection of ⁇ 0.01 % can be discovered by screening for numerous mutations (Moding et al., 2021).
  • Our approach offers the opportunity to screen for thousands of targets, i.e., nucleosome positions and open chromatin regions, and should have a great potential of identifying minor traces of alterations from tumor cells.
  • a nucleosome/open chromatin-based strategy does not require any knowledge about alterations in the tumor; hence, it is a tumor-agnostic approach.
  • ctDNA analyses: In patients with advanced disease, the primary purpose of ctDNA analyses is treatment selection via biomarkers and the monitoring of patients in remission for recurrence, metastasis, or resistance. As our improved nucleosome mapping options pave the way for novel gene expression/pathway analyses, they will affect the medical treatment of patients through targeted treatment selection and identification of resistance mechanisms.
  • Example 8 Applications in patients with chronic diseases, e.g., inflammatory bowel disease
  • IBD inflammatory bowel diseases
  • CD Crohn’s disease
  • UC ulcerative colitis
  • Symptoms include frequent bloody bowel movements, abdominal pain, weight loss, and fatigue.
  • Complications include stricture formation, abscesses, fistulas, extra-intestinal manifestations, and colorectal cancer.
  • Current therapy consists of 5-aminosalicylates (5-ASA), corticosteroids, immunosuppressives, and biological treatment options.
  • Example 10 Applications in patients with syndromes, e.g., Coffin-Siris syndrome
  • nucleosome position changes likely occur in at least a subset of syndromes.
  • SWI/SNF complex is an ATP-dependent chromatin remodeler that regulates the spacing of nucleosomes and thereby controls gene expression.
  • Heterozygous mutations in genes encoding subunits of the SWI/SNF complex have been reported in individuals with Coffin-Siris syndrome (CSS), with most of the mutations in ARID1B.
  • CSS is a rare congenital disorder characterized by facial dysmorphisms, digital anomalies, and variable intellectual disability. Mutations in genes encoding subunits of the ubiquitously expressed SWI/SNF complex may alter the nucleosome profiles in different cell types.
  • Example 11 Identification of physiological states, e.g., age effect of nucleosome dyad position on cfDNA fragments
  • cfDNA analyses have been mainly used to characterize disease states. However, provided that cfDNA can be interrogated with increased resolution limits, it should be possible to derive “physiological” information about the health condition of an individual, e.g., whether someone ages well or not (“healthy aging”). Our technologies may be suitable to estimate the age of an individual.
  • Aging leads to multiple changes, for example, profound alterations in the immune system and increased susceptibility to chronic, infectious, and autoimmune diseases.
  • Figure 25 depicts exemplary comparisons between low-resolution nucleosome positioning signals of a small number of healthy males below the age of 30, healthy males above the age of 55, and prostate cancer patients in pericentromeric regions of chromosome 12 which are known to harbor chains of well-positioned nucleosomes.
  • a prior probability is used, independent of the genomic locus of interest, by making two assumptions. Thereby, a locus is defined as a continuous region spanning about 10kbp.
  • the distribution of nucleosomes per 10kb does not deviate tremendously from a uniform distribution along the mappable human genome with an average dyad-to-dyad distance of 167bp (e.g., no kbp or larger stretches are without nucleosomes, and the minimum dyad-to-dyad distance is much closer to the 167bp average than to zero).
  • the average distance gives a per-base probability of observing a nucleosome of one over 167.
  • the DNA fragmentation process acts equally on all cfDNA fragments and is independent of their origin locus and, in particular, the nucleosome of origin. It follows that the relative frequencies at which fragment lengths are observed at a locus are similar to those seen across the whole dataset. As a result, the probability of observing a specific local fragmentation pattern is consistent across the entire mappable genome. Thus, the marginal likelihood which acts only as a scaling factor can be omitted when computing the local maxima of the posterior probability.
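The two assumptions can be written compactly. As a sketch in notation that does not appear in the original text, let $N_x$ denote the event that a nucleosome dyad sits at base $x$ and $F$ the fragments observed around $x$:

```latex
P(N_x \mid F) \;=\; \frac{P(F \mid N_x)\, P(N_x)}{P(F)},
\qquad P(N_x) = \frac{1}{167}

\operatorname*{arg\,max}_{x} \; P(N_x \mid F)
\;=\; \operatorname*{arg\,max}_{x} \; P(F \mid N_x)\, P(N_x)
```

Because $P(F)$ is assumed to be the same across the mappable genome, it rescales the posterior without moving its local maxima, which is why it can be dropped when only peak positions are needed.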
  • This number must be divided among all fragment lengths according to their relative frequency of occurrence in the whole dataset to get an estimate of how many counts are expected to be obtained for every fragment length.
  • A 30x depth-of-coverage WGS dataset (i.e., on average, every base of the genome is covered by 30 sequenced DNA fragments) yields multiple million data points for each fragment length if rare fragment lengths are excluded by carrying out an in silico size selection.
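The division of a total count among fragment lengths, combined with an in silico size selection, can be sketched as below. The total, the length histogram, and the rarity cutoff `min_fraction` are hypothetical inputs chosen for illustration; the text does not specify how rare a fragment length must be to be excluded.

```python
def expected_counts_per_length(total_counts, length_histogram, min_fraction=0.0):
    """Split `total_counts` among fragment lengths by relative frequency.

    length_histogram: dict mapping fragment length (bp) to its number of
    occurrences in the whole dataset. Lengths whose relative frequency is
    below `min_fraction` are dropped first (in silico size selection).
    """
    n_all = sum(length_histogram.values())
    kept = {length: n for length, n in length_histogram.items()
            if n / n_all >= min_fraction}
    n_kept = sum(kept.values())
    return {length: total_counts * n / n_kept for length, n in kept.items()}


# Toy histogram; the 100 bp fragments are rare and get excluded.
expected = expected_counts_per_length(
    999_000, {100: 1, 149: 99, 167: 600, 180: 300}, min_fraction=0.01)
```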
  • Two signals are generated and evaluated in the method described herein.
  • the first is based on sequencing coverage, i.e., the number of sequencing reads aligned to a specific locus in a reference genome ( Figure 5, left side: assumption 1).
  • the second signal is the “posterior nucleosome signal”, and its generation is based on Bayesian inference as outlined in the following.
  • Nucleosomes offer protection from enzymatic digestion during apoptosis, with the nucleosomal dyad being the region where nucleosomal DNA is most tightly bound, and, thus most resistant to being cleaved by DNases. Therefore, it is possible to translate the sequencing depth into approximate nucleosome position maps (Figure 5).
  • Nucleosomal dyad positions are inferred from sequencing read depth. These dyad positions are subsequently mapped across the respective cfDNA fragments (Figure 5). Because of the limited resolution of this approach, information is gained primarily for the most pronounced nucleosome positioning, with little to no information for smaller signals in the same region. Consequently, mainly one series of nucleosomes is observed along the genome (“single nucleosome series assumption”).
  • the nucleosomal dyad may map inside or outside a given fragment, and mapped dyad positions are summarized for all observed fragment lengths.
  • a fragment length-dependent maximum distance between the inferred nucleosome dyad and fragment is used to restrict dyad counting to the nearest nucleosome. This ensures that neighboring nucleosomes are excluded while relevant fragments mapping to regions between nucleosomes are included. This proximity restriction can be lifted to create prior knowledge about positioning neighboring nucleosomes following the same procedure.
  • Summary statistics are transformed into probability density functions by truncating the empirical distribution and normalizing the area under the discrete count distribution to one afterward.
  • fragment length truncation. The maximum, or the most pronounced maxima, of the resulting distribution of cleaving resistances over fragments of a specific length are expected to be the preferred location(s) of the nucleosomal dyad on fragments of that length. Hence, we call this distribution the “nucleosome prior distribution”. Other factors might increase cleaving resistance, but it is primarily the strong interaction between DNA and the histone complex that confers cleaving resistance to cfDNA.
  • The cleaving resistance of each fragment end (5’ and 3’) can be computed by multiplying the distance from that end to the closest preferred dyad position on the fragment by the signal strength of the cleaving resistance at this position (Figure 12).
  • The base positions where prior distributions are truncated according to the strategies mentioned above are depicted in the overview heatmap of dyad prior distributions in Figure 7.
  • White lines indicate where local minima next to fragment midpoints are expected.
  • Orange dashed lines indicate the location of fragment ends.
  • A third option would be to extend all priors equally to a total length of, e.g., 170 bp (i.e., 85 bp left and right of the fragment center; “uniform truncation”), so that, for each position on the genome, only fragments whose center maps inside a window of fixed length around that position can affect the posterior signal there. Although this might help avoid coverage-dependent distortions in regions with large depth-of-coverage variations or atypical fragmentation, it has not been tested yet.
  • The base positions of truncation in this exemplary case of uniform 170 bp truncation are indicated in Figure 7 by green dotted lines.
  • The simplification of assuming that only one chromatin state is present during prior knowledge creation results in completely random dyad positions being recorded in the summary statistics.
  • The noise level depends on the fragment length because different fragment lengths occur at different frequencies and thus contribute more or less to the predominant nucleosome positioning state (see Figure 4).
  • nucleosome prior distribution (yi).
  • nucleosome posterior distribution or “posterior nucleosome signal” based on the Bayesian inference procedure ( Figure 8).
  • A map of posterior nucleosomal dyad positions can be computed across the mappable, non-homologous human genome by calling peak positions from the posterior nucleosome dyad signal y. Based on signal amplitude and peak proximity, peaks can be grouped hierarchically into main peaks and associated side peaks. The depth of coverage at the site of a posterior peak can be used as an additional source of information to incorporate the absolute fragment support for a peak in the subsequent “chaining” analysis (e.g., peaks called at locations with 30x depth of coverage are well supported, in contrast to peaks called from only five or fewer fragments).
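Peak calling with depth-based support and a simple main/side grouping could look like the sketch below. The local-maximum rule, the `max_gap` proximity threshold, and the function names are assumptions made for illustration; the default `min_depth=6` only mirrors the 6x example threshold mentioned in this description.

```python
def call_peaks(signal, depth, min_depth=6):
    """Return (position, amplitude, well_supported) for each local maximum."""
    peaks = []
    for i in range(1, len(signal) - 1):
        if signal[i - 1] < signal[i] >= signal[i + 1]:
            peaks.append((i, signal[i], depth[i] >= min_depth))
    return peaks


def group_peaks(peaks, max_gap=40):
    """Map each main peak position to the positions of its side peaks.

    Peaks are visited from highest to lowest amplitude; a peak within
    max_gap of an existing main peak becomes a side peak of the nearest one.
    """
    groups = {}
    for pos, _amp, _supported in sorted(peaks, key=lambda p: -p[1]):
        near = [m for m in groups if abs(m - pos) <= max_gap]
        if near:
            groups[min(near, key=lambda m: abs(m - pos))].append(pos)
        else:
            groups[pos] = []
    return groups
```

The support flag can then be carried into the chaining step so that peaks backed by only a few fragments are treated with lower confidence.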
  • The chaining analysis creates a series of peaks following rules for naturally occurring nucleosomal spacing to extract the most likely and possible chromatin states from the observed individual dyad positioning in a region. Examples of manually chained posterior dyad peaks are depicted in Figure 22. Main peaks, as well as side peaks, can be chained. The more dyad series are present at a genomic locus, the harder it is to tell them apart accurately.
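As a rough illustration of chaining, the following sketch greedily links peaks whose spacing falls within an allowed range. The greedy strategy and the 150 to 220 bp spacing bounds are assumptions chosen for the example; the description itself only shows manually chained peaks.

```python
def chain_peaks(positions, min_spacing=150, max_spacing=220):
    """Greedily link peak positions into chains with plausible spacing."""
    chains = []
    unused = sorted(positions)
    while unused:
        chain = [unused.pop(0)]
        # Iterate over a copy because linked peaks are removed from unused.
        for pos in list(unused):
            if min_spacing <= pos - chain[-1] <= max_spacing:
                chain.append(pos)
                unused.remove(pos)
        chains.append(chain)
    return chains
```

For two interleaved dyad series, e.g. `chain_peaks([100, 190, 280, 370, 460, 560])`, this yields `[[100, 280, 460], [190, 370, 560]]`. A greedy pass like this one becomes ambiguous as soon as several candidate peaks fall inside the allowed spacing window, which reflects the difficulty of separating many overlapping dyad series.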
  • A region with the highest coverage-based signal may match the nucleosome posterior signal; however, the prior-based signal can differ significantly from the sequencing read coverage signal (Figure 9A).
  • A decisive advantage is that the posterior nucleosome dyad signal can resolve nearby signals from different chromatin states in the cfDNA-shedding cell population, which is impossible based on sequencing depth analyses alone (Figure 9B, C). Hence, interpretations derived solely from coverage-based signals, as in cfDNA assays, can result in wrong conclusions.
  • The posterior nucleosome signal can be interpreted within the context of the coverage-based signal.
  • Analyzing posterior nucleosome positions in low depth-of-coverage regions is challenging.
  • The posterior nucleosome signal may not be representative in regions with a low depth of coverage (e.g., below 6x coverage). Therefore, such regions can either be masked and excluded from further analysis (Figure 9D) or simply rescaled relative to a target coverage of, e.g., 10x. The latter primarily supports visual inspection of loci.
  • In this example, the posterior signal at a locus with 1x coverage would be divided by ten, the signal at a locus with 5x coverage would be divided by two, and the signal at a locus with 20x coverage would be amplified by a factor of two. Only the posterior signal for bases with 10x coverage would remain unchanged.
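The rescaling in this example is equivalent to multiplying the posterior signal by the ratio of local coverage to target coverage. A minimal sketch, in which the function name and the use of `None` as a mask marker are assumptions:

```python
def scale_posterior(posterior, depth, target=10, mask_below=None):
    """Scale each posterior value by depth / target; optionally mask low depth."""
    scaled = []
    for value, d in zip(posterior, depth):
        if mask_below is not None and d < mask_below:
            scaled.append(None)  # masked: excluded from further analysis
        else:
            scaled.append(value * d / target)
    return scaled
```

With a constant posterior of 1.0 and depths of 1x, 5x, 10x, and 20x, the scaled values are 0.1, 0.5, 1.0, and 2.0, matching the factors stated above. Passing `mask_below=6` masks the 1x and 5x positions instead of scaling them.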
  • Examples of low confidence peaks being masked out are the peak in the NDR of the PFKFB2 gene TTS which is supported only by a single fragment (Figure 12B), the peaks in the central -200 bp to +200 bp region of the RIT gene TSS of Figure 22A, the peaks around the TSS of the ATP1A1 gene between position -200 bp and +500 bp as shown in Figure 22B, and the peak slightly upstream to position +200 of the TTS of the NOTO gene which is displayed in Figure 22C.
  • The combination of depth of coverage and posterior nucleosome signal can be leveraged not only to characterize nucleosome dyad peaks but also to calculate particular metrics like the local DNA accessibility (ac) from the depth of coverage (d) and the spacing between nucleosome dyads (Δnd) (Figure 10).
  • The accurate calculation of nucleosome dyad spacing (Δnd) allows for determining genomic regions where nucleosomes are highly phased or less well organized. Furthermore, it enables the reliable identification of closed or open chromatin (Figure 10). Another way of detecting closed chromatin regions (not shown) is to compute the ratio of fragment midpoints in the long fraction vs.
  • Open chromatin regions indicate locations within the human genome with regulatory functions, such as transcription start sites, i.e., regions with extraordinary biological relevance.
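The dyad spacing Δnd referred to above can be derived from consecutive dyad positions, as in the sketch below. Using the standard deviation of the spacing as a proxy for how well nucleosomes are phased is an assumption made here for illustration; the accessibility metric ac is not reproduced because its formula is not given in this section.

```python
from statistics import mean, pstdev


def dyad_spacing(dyads):
    """Return distances between consecutive dyad positions."""
    ordered = sorted(dyads)
    return [b - a for a, b in zip(ordered, ordered[1:])]


def summarize_spacing(dyads):
    """Mean spacing and its spread; a small spread suggests regular phasing."""
    gaps = dyad_spacing(dyads)
    return {"mean": mean(gaps), "sd": pstdev(gaps)}
```

Evenly spaced dyads such as `[100, 290, 480, 670]` give a spread of zero, whereas irregular positions such as `[100, 250, 520]` give a mean spacing of 210 bp with a spread of 60 bp.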
  • Machine learning approaches, such as training random forest models in a supervised learning setting, can help select informative features (Figures 10, 11).
  • One typical application uses these features to investigate whether a plasma sample was derived from a healthy donor or an individual with a particular disease.
  • The search for disease-associated signals would be the classic example of a diagnostic liquid biopsy application for the early detection of diseases based on the epigenetic features of cfDNA-shedding tissues.
  • Trained models could be updated based on orthogonal diagnoses and other technologies like magnetic resonance imaging (MRI).
  • The updated ground truth labels of samples can be used to fine-tune a model and to extract additional informative features as the number of samples grows. Additionally, the machine learning-assisted epigenetic characterization of disease states could deepen our understanding of disease-causing processes and mechanisms of disease progression if the approach were frequently applied.

Landscapes

  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Engineering & Computer Science (AREA)
  • Analytical Chemistry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Organic Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Biotechnology (AREA)
  • Zoology (AREA)
  • Molecular Biology (AREA)
  • Wood Science & Technology (AREA)
  • Pathology (AREA)
  • Immunology (AREA)
  • Hospice & Palliative Care (AREA)
  • Theoretical Computer Science (AREA)
  • Oncology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Microbiology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Biology (AREA)
  • Biochemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The present invention relates to a method for analyzing or determining cell-free DNA (cfDNA) fragments from a sample, comprising the steps of: i. extracting cfDNA fragments from the sample; ii. performing whole-genome sequencing on the extracted cfDNA fragments; iii. determining the cfDNA fragmentation profile by aligning the DNA sequences of the cfDNA fragments from ii. to a reference genomic sequence; and iv. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments.
PCT/EP2023/075122 2022-09-13 2023-09-13 Determination of health status and treatment monitoring with cell-free DNA WO2024056720A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP22195313.6 2022-09-13
EP22195313 2022-09-13

Publications (1)

Publication Number Publication Date
WO2024056720A1 true WO2024056720A1 (fr) 2024-03-21

Family

ID=83319277

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2023/075122 WO2024056720A1 (fr) 2022-09-13 2023-09-13 Determination of health status and treatment monitoring with cell-free DNA

Country Status (1)

Country Link
WO (1) WO2024056720A1 (fr)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170211143A1 (en) 2014-07-25 2017-07-27 University Of Washington Methods of determining tissues and/or cell types giving rise to cell-free dna, and methods of identifying a disease or disorder using same

Non-Patent Citations (14)

* Cited by examiner, † Cited by third party
Title
"Vogel and Motulsky's Human Genetics: Problems and Approaches", 2010, SPRINGER
ALBERTS, B. ET AL.: "Molecular Biology of the Cell.", 2022, W.W. NORTON & CO
HALL, M.A. ET AL.: "High-resolution dynamic mapping of histone-DNA interactions in a nucleosome", NAT STRUCT MOL BIOL, vol. 16, 2009, pages 124 - 129
HEITZER, E. ET AL.: "Current and future perspectives of liquid biopsies in genomics-driven oncology", NATURE REVIEWS GENETICS, vol. 20, 2019, pages 71 - 88, XP036675874, DOI: 10.1038/s41576-018-0071-5
JIANG, P. ET AL.: "Lengthening and shortening of plasma DNA in hepatocellular carcinoma patients", PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, vol. 112, 2015, pages E1317 - 1325, XP055223840, DOI: 10.1073/pnas.1500076112
MICHAEL, A.K.; THOMA, N.H.: "Reading the chromatinized genome", CELL, vol. 184, 2021, pages 3599 - 3611
MODING, E.J. ET AL.: "Detecting Liquid Remnants of Solid Tumors: Circulating Tumor DNA Minimal Residual Disease", CANCER DISCOVERY, 2021
MOULIERE, F. ET AL.: "Enhanced detection of circulating tumor DNA by fragment size analysis", SCIENCE TRANSLATIONAL MEDICINE, vol. 10, 2018, pages eaat4921, XP055669959, DOI: 10.1126/scitranslmed.aat4921
SNYDER, M.W. ET AL.: "Cell-free DNA Comprises an In Vivo Nucleosome Footprint that Informs Its Tissues-Of-Origin", CELL, vol. 164, 2016, pages 57 - 68
SUN, K. ET AL.: "Size-tagged preferred ends in maternal plasma DNA shed light on the production mechanism and show utility in noninvasive prenatal testing", PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, vol. 115, 2018, pages E5106 - E5114, XP055612386, DOI: 10.1073/pnas.1804134115
ULZ, P. ET AL.: "Inference of transcription factor binding from cell-free DNA enables tumor subtype prediction and early detection", NATURE COMMUNICATIONS, vol. 10, 2019, pages 4666, XP055892459, DOI: 10.1038/s41467-019-12714-4
ULZ, P. ET AL.: "Inferring expressed genes by whole-genome sequencing of plasma DNA", NATURE GENETICS, vol. 48, 2016, pages 1273 - 1278
WEINBERG, A.: "The Biology of Cancer", 2013, W. W. NORTON & COMPANY
WINOGRADOFF, D.; AKSIMENTIEV, A.: "Molecular Mechanism of Spontaneous Nucleosome Unraveling", JOURNAL OF MOLECULAR BIOLOGY, vol. 431, 2019, pages 323 - 335, XP085576956, DOI: 10.1016/j.jmb.2018.11.013

Similar Documents

Publication Publication Date Title
Rodin et al. The landscape of somatic mutation in cerebral cortex of autistic and neurotypical individuals revealed by ultra-deep whole-genome sequencing
AU2019277698A1 (en) Convolutional neural network systems and methods for data classification
US11581062B2 (en) Systems and methods for classifying patients with respect to multiple cancer classes
US20210065842A1 (en) Systems and methods for determining tumor fraction
KR20180031742A (ko) 무세포 dna의 단편화 패턴 분석
Parry et al. Genomic evaluation of multiparametric magnetic resonance imaging-visible and-nonvisible lesions in clinically localised prostate cancer
US11869661B2 (en) Systems and methods for determining whether a subject has a cancer condition using transfer learning
AU2016293025A1 (en) System and methodology for the analysis of genomic data obtained from a subject
KR20220012849A (ko) 단일 세포 유전 구조 변이의 포괄적인 검출
US20220106642A1 (en) Multiplexed Parallel Analysis Of Targeted Genomic Regions For Non-Invasive Prenatal Testing
US20200340064A1 (en) Systems and methods for tumor fraction estimation from small variants
US20210238668A1 (en) Biterminal dna fragment types in cell-free samples and uses thereof
US20210102262A1 (en) Systems and methods for diagnosing a disease condition using on-target and off-target sequencing data
US20210358626A1 (en) Systems and methods for cancer condition determination using autoencoders
US20190073445A1 (en) Identifying false positive variants using a significance model
Kasimatis et al. Evaluating human autosomal loci for sexually antagonistic viability selection in two large biobanks
Zwemer et al. RNA‐Seq and expression microarray highlight different aspects of the fetal amniotic fluid transcriptome
US20200157602A1 (en) Enrichment of targeted genomic regions for multiplexed parallel analysis
US20180300450A1 (en) Systems and methods for performing and optimizing performance of dna-based noninvasive prenatal screens
WO2024056720A1 (fr) Determination of health status and treatment monitoring with cell-free DNA
US20240055073A1 (en) Sample contamination detection of contaminated fragments with cpg-snp contamination markers
WO2024022529A1 (fr) Analyse épigénétique d'adn acellulaire
US20240136018A1 (en) Component mixture model for tissue identification in dna samples
Simpson Jr Investigating Disease Mechanisms and Drug Response Differences in Transcriptomics Sequencing Data
Tankard et al. Detecting tandem repeat expansions in cohorts sequenced with short-read sequencing data

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23769208

Country of ref document: EP

Kind code of ref document: A1