WO2023288275A2

WO2023288275A2 - Systems and methods for assessment of nucleobase modifications

Info

Publication number: WO2023288275A2
Application number: PCT/US2022/073737
Authority: WO
Inventors: Alex CHIALASTRI; Siddharth S. DEY
Original assignee: The Regents Of The University Of California
Priority date: 2021-07-14
Filing date: 2022-07-14
Publication date: 2023-01-19
Also published as: WO2023288275A3; WO2023288275A9

Abstract

Methods to detect modified nucleobases in a nucleic acid molecule are described. In some instances, methods are utilized to detect modified nucleobases in both strands of a nucleic acid molecule. In some instances, a modification-dependent restriction nuclease and a nucleobase conversion reaction is utilized to detect modified nucleobases in both strands of a nucleic acid molecule.

Description

SYSTEMS AND METHODS FOR ASSESSMENT OF NUCLEOBASE MODIFICATIONS

CROSS REFERENCE TO RELATED APPLICATIONS

[0001] This application claims priority to U.S. Provisional Application Ser. No. 63/221,643, entitled “Systems and Methods for Assessment of Nucleobase Modifications,” filed July 14, 2021, which is herein incorporated by reference in its entirety.

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT

[0002] This application was made with Government support under contracts R01HG011013 and R01HD099517 awarded by the National Institutes of Health. The Government has certain rights in the invention.

SEQUENCE LISTING

[0003] This application hereby incorporates by reference the material of the electronic Sequence Listing filed concurrently herewith. The material in the electronic Sequence Listing is submitted as an XML (.xml) file entitled “07383PCT_SeqList_ST26.xml” created on July 14, 2022, which has a file size of 93 KB, and is herein incorporated by reference in its entirety.

TECHNICAL FIELD

[0004] The disclosure is generally directed to methods and systems to assess modifications of DNA biomolecules, and more specifically to methods and systems that identify modifications of DNA nucleobases on both strands of a DNA biomolecule.

BACKGROUND

[0005] A sub-discipline within the field of epigenetics is the study of modifications to nucleic acids that do not involve changes to the nucleic acid sequence. One form of nucleic acid modification is covalent modification of nucleobases of nucleic acids such as DNA and RNA, which can be modified with functional groups such as methyl, hydroxymethyl, carboxyl, formyl and other groups. These functional groups can provide various functions. For instance, methylation of DNA in prokaryotes signals for DNA replication, chromosome segregation, mismatch repair, packing of bacteriophage genomes, transposase activity, and regulation of gene transcription. Methylation in eukaryotic genomes most often occurs on cytosines within CpG dinucleotides, especially within CpG islands. Methylation of CpG dinucleotides located within or near promoters and/or transcription start sites is highly involved in gene regulation. High methylation of CpG islands typically correlates with low expression or silencing of nearby genes.

SUMMARY OF THE DISCLOSURE

[0006] Many embodiments are directed to methods of assaying for modifications of double-stranded nucleic acid nucleobases. In many of these embodiments, an assay is performed that is able to detect nucleobase modifications on both strands of a double-stranded nucleic acid, including (but not limited to) 5-methylcytosine, 5-hydroxymethylcytosine, 5-glucosylhydroxymethylcytosine, 5-formylcytosine, 5-carboxylcytosine, N4-methylcytosine and N6-methyladenine. Various embodiments are also directed towards kits for performing assays to detect modified nucleobases, which can be scaled down to a few picograms of input material and to single-cell resolution.

BRIEF DESCRIPTION OF THE DRAWINGS

[0007] The description and claims will be more fully understood with reference to the following figures and data graphs, which are presented as exemplary embodiments and should not be construed as a complete recitation of the scope of the disclosure.

[0008] Fig. 1 provides a flowchart of an exemplary method for detecting nucleobase modifications in accordance with various embodiments.

[0009] Fig. 2A provides a schematic of an exemplary method for detecting 5-methylcytosine modifications in accordance with various embodiments.

[0010] Fig. 2B provides a schematic of an exemplary method for performing RNA transcription analysis and nucleobase modification analysis in accordance with various embodiments.

[0011] Fig. 3 provides a data chart indicating the percent maintenance of 5mCpG, generated in accordance with various embodiments.

[0012] Fig. 4 provides a data chart indicating the percent of 5mCpG, generated in accordance with various embodiments.

[0013] Fig. 5 provides a dot plot of single cells of 5mCpG maintenance and 5mCpG methylation percentage, generated in accordance with various embodiments.

[0014] Fig. 6 provides a data graph indicating the ability to detect 5mC utilizing various extraction and experimental conditions in single cells in accordance with various embodiments.

[0015] Fig. 7 provides a schematic describing four different versions of Dyad-seq, as assessed in accordance with various embodiments. M-M-Dyad-seq profiles 5mC on one strand and whether the cytosine on the opposing strand is methylated. M-H-Dyad-seq profiles 5mC on one strand and whether the cytosine on the opposing strand is hydroxymethylated. H-H-Dyad-seq profiles 5hmC on one strand and whether the cytosine on the opposing strand is hydroxymethylated. H-M-Dyad-seq profiles 5hmC on one strand and whether the cytosine on the opposing strand is methylated.

[0016] Fig. 8 provides a schematic of an exemplary method for detecting 5-hydroxymethylcytosine modifications in accordance with various embodiments.

[0017] Fig. 9 provides a data graph depicting 5mCpHpG maintenance methylation detected by M-M-Dyad-seq, generated in accordance with various embodiments.

[0018] Figs. 10A, 10B, and 10C provide data graphs depicting 5mC and 5hmC maintenance, generated in accordance with various embodiments. Fig. 10A shows (left panel) 5mCpG maintenance, quantified as the percentage of CpG dinucleotides that are symmetrically methylated, for mESCs grown under different conditions, with M-M-Dyad-seq used to estimate 5mCpG maintenance; (middle panel) the percentage of 5mC that are paired with 5hmC at CpG dyads, as shown by M-H-Dyad-seq; and (right panel) the percentage of 5hmC that are paired with 5hmC at CpG dyads, as shown by H-H-Dyad-seq. Fig. 10B shows (left panel) the percentage of 5hmC that are paired with 5mC at CpG dyads, as shown by H-M-Dyad-seq; (middle panel) genome-wide 5mCpG levels quantified using M-M-Dyad-seq; and (right panel) genome-wide 5hmCpG levels quantified using M-H-Dyad-seq. Fig. 10C depicts (left panel) genome-wide 5mCpG levels quantified using H-M-Dyad-seq for mESCs grown under different conditions and (right panel) genome-wide 5hmCpG levels quantified using H-H-Dyad-seq for mESCs grown under different conditions.

[0019] Fig. 10D provides a data graph depicting that loss of DNA methylation after culturing mESCs in 2i conditions for 48 hours is associated with a reduction in 5mCpG maintenance levels, generated in accordance with various embodiments. Each dot represents a genomic tile of 100 kb.

[0020] Fig. 11 depicts data graphs generated in accordance with various embodiments. Left panel depicts the first two principal components show distinct transcriptomes of mESCs grown in different conditions. Bulk RNA-seq was performed in triplicate. Right panel depicts a heatmap of expression level of genes related to de novo methylation, maintenance methylation, and demethylation pathways.

[0021] Figs. 12A and 12B provide data generated in accordance with various embodiments. Fig. 12A depicts a heat map of differentially expressed genes with a putative role in regulating DNMT1-mediated maintenance fidelity. Fig. 12B depicts gene pathway enrichment analysis for differentially expressed genes performed using Metascape. Left panel shows gene sets associated with specific pathways that are highly expressed in the 2i and M condition, lowly expressed in No, and not differentially expressed across SL, BL, and G. Right panel shows gene sets associated with specific pathways that are highly expressed in the No condition, lowly expressed in 2i and M, and not differentially expressed across SL, BL, and G.

[0022] Figs. 13A and 13B provide data generated in accordance with various embodiments. Fig. 13A depicts bar plots that show 5mCpG levels estimated using M-M-Dyad-seq at various repetitive elements after 48 hours in the indicated media conditions. Fig. 13B depicts bar plots that show 5mCpG maintenance fidelity estimated using M-M-Dyad-seq at various repetitive elements for mESCs grown under different conditions.

[0023] Fig. 14 provides data generated in accordance with various embodiments. (Top left panel) Box plot of 5mCpG maintenance levels in 1 kb genomic bins categorized based on the number of CpGs in the bin and the absolute methylation levels. Low 5mC indicates methylation levels lower than 20%, medium 5mC indicates methylation levels between 20% and 80%, and high 5mC indicates methylation levels greater than 80%. N.D. stands for “Not detected”. (Top right panel) Box plot of 5mCpG maintenance levels in 1 kb genomic bins over varying absolute methylation levels. 1 kb regions in which at least 5 unique CpG dyads are detected were included. (Bottom left panel) 5mCpG maintenance levels at fully methylated regions (FMR), lowly methylated regions (LMR), and unmethylated regions (UMR). (Bottom right panel) 5mCpG maintenance levels at CpG islands separated based on absolute methylation levels. ‘Methylated’ indicates CpG islands with greater than 20% methylation, ‘mixed’ indicates methylation levels between 10 and 20%, and ‘unmethylated’ indicates CpG islands with less than 10% methylation. mESCs grown in SL condition and profiled using M-M-Dyad-seq are used to make the panels in this figure.

[0024] Fig. 15 provides a heatmap of 5mCpG maintenance fidelity in serum grown mESCs at genomic regions enriched for various histone marks, generated in accordance with various embodiments. Numbers within parenthesis indicate the total number of regions analyzed in the meta-region.

[0025] Figs. 16A and 16B provide box plots of 5mCpG maintenance levels as a function of absolute 5mCpG levels at individual loci enriched for a histone mark (T) or a meta-region (M) containing all enriched loci corresponding to a histone mark, generated in accordance with various embodiments. Distributions for the meta-regions were obtained using bootstrapping, where resampling was performed 1,000 times per histone mark. Blue dots indicate average values found in genome-wide 1 kb bins (same as data presented in panel.

[0026] Fig. 17 provides data showing accuracy of scDyad-seq, generated in accordance with various embodiments. 5mCpHpG maintenance levels of single cells treated with or without 0.6 mM Decitabine for 24 hours.

[0027] Fig. 18 provides a data graph depicting the coverage of CpG dinucleotides that provide information on maintenance methylation (5mCpG dyad coverage), and coverage of CpG sites that enable quantification of absolute methylation levels (CpG coverage), together with the number of unique transcripts detected in individual cells, generated in accordance with various embodiments. The total number of CpG sites detected in a cell is the sum of 5mCpG dyad coverage and CpG coverage.

[0028] Fig. 19A provides data graphs depicting an example of two cells, P7L4.78 and P7L3.69, that show very similar levels of 5mCpG maintenance computed using scDyad&T-seq but display substantial differences when MspJI-based quantification is used to estimate strand-specific methylation, generated in accordance with various embodiments. A low Pearson’s correlation indicates deviations from a strand bias score of 0.5. Color of the data points indicates 5mCpG maintenance percent of individual chromosomes estimated using scDyad&T-seq.

[0029] Fig. 19B provides a heatmap comparing 5mCpG maintenance over individual chromosomes in mESCs computed using scDyad&T-seq with the strand bias metric that can be estimated from techniques such as scMspJI-seq from the same single cells, generated in accordance with various embodiments. The heatmap shows that the 5mCpG maintenance estimated from scDyad&T-seq displays increased sensitivity in quantifying strand-specific DNA methylation compared to the strand bias metric obtained from scMspJI-seq. The transcriptional group individual cells belong to (top) and their genome-wide 5mCpG methylation levels (bottom) are also reported in this panel.

[0030] Fig. 19C provides data comparing scDyad&T-seq with scMspJI-seq, generated in accordance with various embodiments. (Top left panel) Similar levels of 5mCpG detected on the plus and minus strand of each chromosome by the enzyme MspJI in cell P7L3.67 are in agreement with the high levels of 5mCpG maintenance estimated using scDyad&T-seq. The color of the data points corresponds to the 5mCpG maintenance percent estimated using scDyad&T-seq. (Top right panel) Comparison of chromosome-wide 5mCpG strand bias scores, estimated using techniques such as scMspJI-seq, to 5mCpG maintenance percent estimated using scDyad&T-seq. The color of the data points corresponds to the absolute methylation levels estimated using scDyad&T-seq. (Bottom panel) Comparison of genome-wide concordance of methylation calls to 5mCpG maintenance percent estimated using scDyad&T-seq for single cells. Concordance was defined as the fraction of reads (with at least 5 CpG sites covered) where 90% or more of the sites were methylated. The color of the data points corresponds to the absolute methylation levels estimated using scDyad&T-seq.

[0031] Fig. 20 provides a bar plot depicting DNA methylation and 5mCpG maintenance levels at different genomic regions classified as fully methylated regions (FMR), lowly methylated regions (LMR), and unmethylated regions (UMR), generated in accordance with various embodiments. Data points represent individual cells.

[0032] Figs. 21A and 21B provide data showing heterogeneity of mESCs, generated in accordance with various embodiments. Fig. 21A depicts UMAP visualization of serum grown mESCs based on the single-cell transcriptomes obtained from scDyad&T-seq. Fig. 21B depicts single-cell transcriptomes obtained from scDyad&T-seq showing the expression levels of pluripotency related genes NANOG, REX1, and ESRRB in the two clusters (NANOG high and NANOG low) in serum grown mESCs.

[0033] Fig. 22 provides (left panel) 5mCpG levels in regions marked by specific histone modifications and (right panel) 5mCpG maintenance levels in regions marked by specific histone modifications, generated in accordance with various embodiments. Data points represent individual cells.

[0034] Figs. 23A and 23B provide analysis of DNA methylation and 5mCpG maintenance levels based on NANOG expression, generated in accordance with various embodiments. Fig. 23A depicts DNA methylation levels at regions marked by different histone modifications. Fig. 23B depicts 5mCpG maintenance at regions marked by different histone modifications. Data points represent individual cells.

[0035] Fig. 24A provides data graphs depicting analysis of mESC methylation and transcription, generated in accordance with various embodiments. (Left panel) Genome-wide methylation and (Middle panel) maintenance levels of individual mESCs cultured in serum or in 2i conditions for 3, 6 or 10 days. (Right panel) Genome-wide 5mCpG methylation and maintenance levels of single cells as they transition from serum to 2i conditions. Cells transition from a highly methylated and highly maintained state to a lowly methylated and lowly maintained or lowly methylated and highly maintained state.

[0036] Fig. 24B provides data graphs depicting analysis of mESC methylation and transcription, generated in accordance with various embodiments. (Left panel) As cells transition from serum to 2i conditions, they are classified as lowly/highly methylated and lowly/highly maintained based on hierarchical clustering. (Middle panel) Genome-wide methylation and (Right panel) maintenance levels at regions identified to be highly methylated in mESCs grown in long term 2i culture (denoted as ‘5mC high’) and at all other genomic regions (denoted as ‘Other loci’), with each dot denoting a single cell.

[0037] Fig. 25A provides a representation of hierarchical clustering based on genome-wide 5mCpG levels, generated in accordance with various embodiments. The clustering shows that cells can be classified into two major groups - a 5mCpG low (mC^Lo) or a 5mCpG high (mC^Hi) state.

[0038] Fig. 25B provides a representation of hierarchical clustering based on genome-wide 5mCpG maintenance levels, generated in accordance with various embodiments. The clustering shows that cells can be classified into two major groups - a low maintenance (Mnt^Lo) or a high maintenance (Mnt^Hi) state.

[0039] Fig. 26 provides UMAP visualization of cells transitioning from serum to 2i conditions, based on the single-cell transcriptomes obtained from scDyad&T-seq, showing that cells can be classified into two broad transcriptional clusters, generated in accordance with various embodiments. The cluster names, 2i-like and Serum-like, were assigned based on expression of key marker genes in mESCs grown in 2i or SL conditions, respectively.

[0040] Fig. 27 provides UMAP visualization of serum and 2i cells based on the single cell transcriptomes obtained from scDyad&T-seq, and classified by culture conditions (left panel) or by the transcriptome-based clustering (right panel), generated in accordance with various embodiments.

[0041] Fig. 28 provides data analysis of epigenetic and transcriptional features of 2iD3 cells, generated in accordance with various embodiments. Genome-wide DNA methylation (top left panel) and 5mCpG maintenance (top right panel) levels of 2iD3 cells in the two broad transcriptional groups - 2i-like and Serum-like. Expression levels of the pluripotency marker POU5F1 (also known as OCT4) (bottom left panel) and early neuroectoderm lineage marker SOX1 (bottom right panel).

[0042] Fig. 29 provides a heatmap of differentially expressed genes between the 2i-1 and 2i-2 populations, generated in accordance with various embodiments.

[0043] Fig. 30A provides data graphs of expression levels of select genes and transposable elements, such as DPPA3, KHDC3, RLTR45, and RLTR45-int, that were found to be highly expressed in the 2i-2 population, generated in accordance with various embodiments.

[0044] Fig. 30B provides genome-wide methylation and maintenance levels of single cells in different transcriptional clusters, generated in accordance with various embodiments.

[0045] Fig. 31 provides absolute DNA methylation levels and the corresponding 5mCpG maintenance levels for 100 kb bins for cells in population 2i-1 (left panel) or 2i-2 (right panel), generated in accordance with various embodiments.

[0046] Fig. 32A provides a bar plot depicting the percentage of 2i-1 and 2i-2 cells in the four groups classified based on the genome-wide methylation and maintenance levels, generated in accordance with various embodiments. Numbers within parenthesis indicate the total number of cells in the transcriptional clusters 2i-1 and 2i-2.

[0047] Fig. 32B provides a bar plot depicting how cells cultured in 2i condition for varying number of days are distributed between the 2i-1 and 2i-2 populations, generated in accordance with various embodiments. The number in the parenthesis indicates the total number of cells in that sub-population.

[0048] Fig. 33A provides a data graph depicting the coverage of CpG sites providing information on 5mCpG maintenance (dyad coverage), and the coverage of CpG sites providing information on the absolute levels of DNA methylation in single cells (coverage), generated in accordance with various embodiments. The shading of the data points indicates the total number of unique transcripts detected in single cells grown in SL and 2i conditions.

[0049] Fig. 33B provides a heatmap of 5mCpG maintenance for individual chromosomes in single cells, indicating increased sensitivity in quantifying DNMT1-mediated maintenance fidelity and demethylation compared to the strand bias score obtained from methods such as scMspJI-seq, generated in accordance with various embodiments. The data also show the culture conditions and genome-wide 5mCpG methylation levels for the same cells.

DETAILED DESCRIPTION

[0050] Turning now to the drawings and data, systems and methods to detect nucleobase modification and applications thereof are described, in accordance with the various embodiments described herein. Various embodiments are directed to detecting nucleobase modifications (or lack thereof) on both strands of a double stranded nucleic acid molecule, which can be achieved via restriction nuclease cleavage patterns, nucleobase conversion, and sequencing. A modification-dependent restriction nuclease can be utilized to identify nucleobase modification on at least one strand of a double stranded nucleic acid molecule. In some embodiments, however, a restriction nuclease that is blocked by modification is utilized to identify unmodified nucleobases on at least one strand. In some embodiments, a nucleobase conversion reaction is performed to identify nucleobase modification on at least one strand of a double stranded nucleic acid molecule. And in some embodiments, a modification-dependent restriction nuclease is utilized to identify nucleobase modification on a first strand of a double stranded nucleic acid molecule and a nucleobase conversion reaction is performed to identify nucleobase modification on a second strand of a double stranded nucleic acid molecule. Sequencing can be performed to identify modified and/or unmodified nucleobases.

[0051] Double stranded nucleic acids (e.g., DNA, dsRNA) are composed of two antiparallel strands containing complementary bases. Since each antiparallel strand is complementary, there is little benefit to performing experimentation to obtain sequence data about each of the antiparallel strands, and thus traditional detection assays analyze one of the two strands but either cannot distinguish which strand the readout came from or infer the data on the other strand based on the experimental measurement. Modifications to nucleobases, such as (for example) 5-methylcytosine, 5-hydroxymethylcytosine, 5-glucosylhydroxymethylcytosine, 5-formylcytosine, 5-carboxylcytosine, N4-methylcytosine, and N6-methyladenine, however, are not patterned in a complementary fashion. Thus, simultaneous detection of nucleobase modification (or lack thereof) on both strands provides additional insight. Described herein are various systems and methods that allow for simultaneous detection of nucleobase modifications on both strands of a double stranded nucleic acid molecule at single nucleotide resolution, which can utilize nucleic acid sequencing as a readout. Also described herein are experimental data that validate these methodologies, which are broadly applicable to all nucleobase modifications of nucleic acid molecules.
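
As an illustration of why modification status cannot be inferred from one strand alone, the following minimal Python sketch classifies a CpG dyad by the methylation state observed on each strand and summarizes a set of dyads. The function names are hypothetical, and the summary statistic is an assumed working definition (the percentage of dyads methylated on at least one strand that are methylated on both strands); it is offered only as a sketch and is not asserted to be the computation used to generate the data described herein.

```python
from typing import Iterable, Tuple


def classify_dyad(top_methylated: bool, bottom_methylated: bool) -> str:
    """Name the state of one CpG dyad from the two per-strand calls."""
    if top_methylated and bottom_methylated:
        return "symmetric"
    if top_methylated or bottom_methylated:
        return "hemimethylated"
    return "unmethylated"


def maintenance_percent(dyads: Iterable[Tuple[bool, bool]]) -> float:
    """Assumed definition: among dyads methylated on at least one strand,
    the percentage that are methylated on both strands."""
    methylated = [d for d in dyads if d[0] or d[1]]
    if not methylated:
        return float("nan")  # no informative dyads were observed
    symmetric = sum(1 for top, bottom in methylated if top and bottom)
    return 100.0 * symmetric / len(methylated)


# Hypothetical observations: (top strand call, bottom strand call)
example = [(True, True), (True, False), (False, False), (False, True)]
print([classify_dyad(t, b) for t, b in example])
print(round(maintenance_percent(example), 1))
```

A single-strand assay would report the second and fourth dyads of this example differently depending on which strand happened to be read, which is the ambiguity that a two-strand readout removes.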

Assays to Identify Nucleobase Modifications

[0052] Several embodiments are directed towards systems, methods and reagents to detect sites of nucleobase modification on one or both strands of a double stranded nucleic acid molecule. In many embodiments, a modification-dependent restriction nuclease is used to digest a double stranded nucleic acid biomolecule. In many embodiments, a nucleobase conversion reaction is performed, which can be utilized with sequencing to detect modified nucleobases. In some embodiments, a modification-dependent restriction nuclease is utilized to identify nucleobase modification on a first strand of a double stranded nucleic acid molecule and a nucleobase conversion reaction and/or direct sequencing is performed to identify nucleobase modification on a second strand of a double stranded nucleic acid molecule. In many embodiments, nucleic acid sequencing is performed on modification-dependent restriction nuclease digested, nucleobase converted nucleic acid molecules such that the sites of nucleobase modification are identified at single-base resolution.

[0053] Provided in Fig. 1 is a flowchart of an exemplary method to detect nucleobase modifications in a double stranded nucleic acid molecule. The method generally utilizes modification-dependent restriction nuclease digestion to detect nucleobases on a first strand of a double stranded nucleic acid and nucleobase conversion reaction to detect nucleobases on a second strand of a double stranded nucleic acid, which are revealed via nucleic acid sequencing at single-base resolution. The method of Fig. 1 can be performed upon a population of biological cells or upon an individualized single biological cell. To perform the method on single biological cells, a population of cells can be individualized and the double stranded nucleic acid molecules (e.g., genomic DNA) can be examined for nucleobase modification detection. As discussed in the Exemplary Data section, nucleobase modifications can be detected on a single-cell level with high efficiency. Furthermore, other biomolecules (e.g., RNA) can be extracted from the same single cells to perform other assessments to gain a more complete understanding of the cell’s biological activity.

[0054] The method of Fig. 1 can begin by digesting 101 a double stranded nucleic acid molecule with a modification-dependent restriction enzyme. The digestion can be performed on any appropriate double stranded nucleic acid molecule, which may have one or more modified nucleobases. Modified nucleobases include (but are not limited to) 5-methylcytosine, 5-hydroxymethylcytosine, 5-glucosylhydroxymethylcytosine, 5-formylcytosine, 5-carboxylcytosine, N4-methylcytosine and N6-methyladenine. In some embodiments, the double stranded nucleic acid molecule is derived from a population of cells (e.g., nucleic acid extraction from tissue or cell culture). In some embodiments, the double stranded nucleic acid molecule is derived from a single cell (e.g., cells are sorted into single cells and nucleic acid is extracted from an individual single cell). In some embodiments, the double stranded nucleic acid molecule is derived from a biological source, such as (for example) prokaryotes, plants, fungi, or animals. Certain modified bases are common in some species types and either uncommon or nonexistent in other species types. 5-methylcytosine can be found throughout double stranded nucleic acid molecules in most (if not all) species of prokaryotes, plants, fungi, and animals. 5-hydroxymethylcytosine can be found throughout double stranded nucleic acid molecules in mammals (especially in the brain, germ cells, and embryonic cells) and bacteriophages. 5-formylcytosine and 5-carboxylcytosine can be found throughout double stranded nucleic acid molecules in mammals (especially in the brain, germ cells, and embryonic cells). 5-glucosylhydroxymethylcytosine can be found throughout double stranded nucleic acid molecules in bacteriophages. N4-methylcytosine and N6-methyladenine can be found throughout double stranded nucleic acid molecules in prokaryotes.
In some embodiments, the double stranded nucleic acid molecule is synthesized with a protocol that incorporates one or more types of modified bases. Double stranded nucleic acid molecules include (but are not limited to) double stranded DNA, double stranded RNA, and double stranded hybrid DNA/RNA molecules.

[0055] Modification-dependent restriction nucleases include (but are not limited to) Type IIM and Type IV restriction endonucleases. Modification-dependent restriction nucleases that detect and digest double stranded nucleic acids having 5-methylcytosine include (but are not limited to) MspJI, FspEI, LpnPI, AspBHI, RlaI, SgrTI, SgeI, SguI, AoxI, BisI, BlsI, GlaI, GluI, KroI, MteI, PcsI, PkrI, SauUSI, SauNewI, EcoKMcrA, ScoA3McrA, BanUMcrB, BanUMcrB3, EcoKMrr, BanUMrr, SepRPMcrR, ScoA3I, McrBC, mcrA, ScoA3II+III, YenY4I, MsiJI, McaZI, BwiMMI, EfaL9I, ScoA3IV, AbaUMB2I, Alai 76121, AspTB23I, Bce1273I, Bce95I, BceLI, BceYI, Bth171I, CbuDI, Dde51507I, Dsp20I, EcoBLMcrX, ElmI, Esp638I, KpnW2I, MspAK21I, NhoI, PaePS50I, Pam7902I, Pan13I, Pfl8569I, Pps170I, Pru4541I, PspJDRII, PsuGI, RdeR2I, Rfl17I, Sde240I, Sve396I, ScoA3V, and engineered SRA-nicking domain fusion proteins. Modification-dependent restriction nucleases that detect and digest double stranded nucleic acids having 5-hydroxymethylcytosine include (but are not limited to) AbaSI, PvuRts1I, PpeHI, AbaAI, AbaBGI, AbaCI, AbaDI, AbaHI, AbaTI, AbaUI, AcaPI, BbiDI, BmeDI, CfrCI, EsaMMI, EsaNI, Mte37I, PatTI, PfrCI, PxyI, YkrI, MspJI, FspEI, LpnPI, AspBHI, RlaI, SgrTI, SauUSI, McrBC, CmeDI, PspR81I, TspA15I, VcaM4I, YenY4I, MsiJI, VcaCI, MfoEI, MmaNI, RrhNI, Vsi48I, Vvu009I, McaZI, BwiMMI, and EfaL9I. Modification-dependent restriction nucleases that detect and digest double stranded nucleic acids having 5-glucosylhydroxymethylcytosine include (but are not limited to) AbaSI, PvuRts1I, PpeHI, AbaAI, AbaBGI, AbaCI, AbaDI, AbaHI, AbaTI, AbaUI, AcaPI, BbiDI, BmeDI, CfrCI, EsaMMI, EsaNI, Mte37I, PatTI, PfrCI, PxyI, YkrI, GmrSD, CmeDI, PspR81I, TspA15I, and VcaM4I. Modification-dependent restriction nucleases that detect and digest double stranded nucleic acids having N6-methyladenine include (but are not limited to) DpnI, ScoA3Mrr, MalI, CfuI, FtnUIV, Hsa13891I, Mph110311, Nani 957311, NgoAVI, NgoDXIV, NmeAII, NmeBL859I, NmuDI, NmuEI, NsuDI, SbgI, TdeI, and ScoA3V. Modification-dependent restriction nucleases that detect and digest double stranded nucleic acids having N4-methylcytosine include (but are not limited to) McrBC.

[0056] Various embodiments are directed to detecting putative sites of nucleobase modification that are unmodified. Accordingly, a restriction nuclease that is blocked by a modification can be utilized. Restriction endonucleases blocked by 5-methylcytosine, 5-hydroxymethylcytosine, and 5-glucosylhydroxymethylcytosine include (but are not limited to) AatII, AciI, AclI, AfeI, AgeI, AscI, AsiSI, AvaI, BceAI, BmgBI, BsaAI, BsaHI, BsiEI, BsiWI, BsmBI-v2, BspDI, BsrFI-v2, BssHII, BstBI, BstUI, ClaI, EagI, Esp3I, FauI, FseI, FspI, HaeII, HgaI, HhaI, HinP1I, HpaII, HpyCH4IV, Hpy99I, KasI, MluI, NaeI, NarI, NgoMIV, NotI, NruI, Nt.BsmAI, Nt.CviPII, PaeR7I, PluTI, PmlI, PvuI, RsrII, SacII, SalI, SfoI, SgrAI, SmaI, SnaBI, SrfI, TspMI, and ZraI. Restriction endonucleases blocked by N6-methyladenine include (but are not limited to) AlwI, BclI, DpnII, HphI, MboI, and Nt.AlwI. Restriction endonucleases blocked by 5-glucosylhydroxymethylcytosine, 5-formylcytosine and 5-carboxylcytosine include (but are not limited to) MspI. Restriction endonucleases blocked by N4-methylcytosine include (but are not limited to) HpaII, SmaI, and XmaI.

[0057] To increase specificity of modification-dependent restriction nucleases, in some embodiments, certain modified nucleobases are further modified. For example, to increase the specificity of modification-dependent restriction nucleases that recognize and digest 5-methylcytosine and 5-hydroxymethylcytosine, a double stranded nucleic acid molecule can be treated with a T4 phage beta-glucosyltransferase or T4 phage alpha-glucosyltransferase to further modify 5-hydroxymethylcytosine such that the modification-dependent restriction nuclease is incapable of recognizing and digesting at such sites. In some embodiments including (but not limited to) those using the MspJI family of restriction enzymes, treatment with T4 phage beta-glucosyltransferase prior to nuclease digestion would block 5-hydroxymethylcytosine and thus only 5-methylcytosine sites would be digested. In some embodiments including (but not limited to) those using the PvuRts1I family of restriction enzymes, treatment with T4 phage beta-glucosyltransferase prior to nuclease digestion would strongly increase selectivity towards 5-hydroxymethylcytosine detection over 5-methylcytosine.

[0058] It may be desired to detect nucleobase modifications on a single cell level. To do so, individual biological cells can be isolated and the double stranded nucleic acid molecules (e.g., genomic DNA) of each individual cell examined.

[0059] Provided in Fig. 2A is a schematic of an exemplary method to detect nucleobase modifications in a double stranded nucleic acid molecule. As can be seen in the schematic, the modification-dependent restriction endonuclease MspJI is utilized to detect and digest 201 the double stranded nucleic acid molecules. The MspJI endonuclease recognizes individual 5-methylcytosines and its recognition sequence allows for the recognition of a high number of CpG and CHG sites. After the recognition of an individual 5-methylcytosine within its recognition sequence, MspJI cuts 12 nucleotides downstream of the 5-methylcytosine on the same strand as the 5-methylcytosine, and 16 nucleotides downstream on the opposing DNA strand, leaving a random 4 nucleotide 5’ overhang.
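By way of illustration only, the cut geometry described for MspJI can be sketched in Python as follows. The function name `mspji_cut_sites`, the 0-based coordinate convention, and the treatment of each offset as a simple shift toward the 3’ end of the modified strand are assumptions made for this sketch; the exact boundary convention (cut before or after the counted nucleotide) is not specified here and would need to be checked against the enzyme documentation.

```python
# Offsets taken from the description: 12 nt downstream on the modified strand,
# 16 nt downstream on the opposing strand.
SAME_STRAND_OFFSET = 12
OPPOSITE_STRAND_OFFSET = 16


def mspji_cut_sites(mc_pos: int, strand: str) -> dict:
    """Return hypothetical cut coordinates for a 5mC at `mc_pos`.

    `mc_pos` is a 0-based forward-strand coordinate (assumption).
    `strand` is "+" or "-" and names the strand carrying the 5mC.
    "Downstream" is taken to mean toward the 3' end of that strand.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    direction = 1 if strand == "+" else -1
    same = mc_pos + direction * SAME_STRAND_OFFSET
    opposite = mc_pos + direction * OPPOSITE_STRAND_OFFSET
    return {
        "same_strand_cut": same,
        "opposite_strand_cut": opposite,
        # The staggered cuts differ by 4 nt, matching the 4 nt 5' overhang.
        "overhang_length": abs(opposite - same),
    }


if __name__ == "__main__":
    print(mspji_cut_sites(100, "+"))
```

Because the overhang sequence depends on the local genomic sequence rather than on the enzyme, it is effectively random, which is why the adapter in such a scheme would carry a degenerate overhang.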

[0060] As noted, the method of Fig. 1 can be performed on single cells and/or combined with RNA transcriptional analysis. To exemplify this ability, Fig. 2B is a schematic of an exemplary method to perform RNA transcriptional analysis and detect nucleobase modifications from single cells. A population of cells is sorted 251 into single cells, where each cell can be fixed and/or lysed to release the nucleic acid biomolecules. A reverse transcriptase and poly-T primer can be added to the nucleic acid solution to perform reverse transcription 253 on poly-A RNA molecules. The poly-T primer can further include a primer sequence, an amplification sequence, a sample and/or cell barcode, or a unique molecular identifier (UMI). In addition, the modification-dependent restriction endonuclease MspJI is utilized to detect and digest 255 double stranded nucleic acid molecules at sites of 5-methylcytosine.

[0061] Referring back to Fig. 1, once the double stranded nucleic acid molecule is digested, adapter nucleic acid molecules are ligated 103 to the digested nucleic acid molecule fragments. An adapter nucleic acid molecule, in accordance with various embodiments, is a single or double stranded nucleic acid molecule with one or more sequences, each sequence having a particular function. In some embodiments utilizing a double stranded nucleic acid, one or both nucleic acids may be phosphorylated at the 5’ end. An adapter nucleic acid molecule will include an overhang compatible with the overhang on the digested nucleic acid molecule fragments. In some embodiments, a blunt ended adapter is utilized to ligate with a blunt end digestion (e.g., DpnI results in blunt ends) or when overhangs are excised or are extended to become blunt. In some embodiments, the ends of the digested double stranded nucleic acid molecule are modified prior to ligation. Furthermore, an adapter nucleic acid molecule can include a polymerase chain reaction primer sequence or other amplification specific sequences, a cell and/or sample barcode, and/or a unique molecular identifier. Further sequences, such as spacers and/or various nucleotides may also be incorporated in an adapter. In some embodiments in which a modified cytosine is to be detected, at least one strand of the adapter nucleic acid molecule is devoid of cytosines or only includes modified cytosines, which may help during the steps involving nucleobase conversion (see description of step 105 below). In some of these embodiments, the strand of the adapter nucleic acid molecule that is devoid of cytosines or only includes modified cytosines is ligated to the strand opposite of the strand containing the modified nucleobase recognized by the modification-dependent restriction endonuclease in the digestion reaction.
In some embodiments in which modified adenosines are to be detected, at least one strand of the adapter nucleic acid molecule is devoid of adenosines or only includes modified adenosines, which may help during the steps involving nucleobase conversion (see description of step 105 below). Barcoded molecules signifying particular samples and/or cells can be pooled for further treatment and/or multiplexed analysis.

[0062] As can be seen in the schematic of the exemplary method of Fig. 2A, MspJI digested nucleic acid molecules are ligated 203 with an adapter nucleic acid molecule having a 5’-overhang of 4 random bases to complement the 5’-overhang left by the MspJI digestion. Further, as shown, the ligated adapter molecule can further include a primer sequence, a sample and/or cell barcode, or a unique molecular identifier (UMI). In some embodiments, nucleic acid molecules containing modified nucleobases are enriched. In some embodiments, enrichment is performed using an antibody specific to a modified nucleobase, or through biotinylation strategies coupled with streptavidin pulldown.

[0063] The exemplary method of Fig. 2B further shows ligation 255 of an adapter molecule, which can further include a primer sequence, an amplification sequence, a sample and/or cell barcode, and a unique molecular identifier (UMI). The individual cell lysate solutions of reverse transcribed RNA and ligated digested molecules are pooled 257. Molecules from individual cells or pooled molecules from many cells can be amplified. To separate the reverse transcribed RNA from the digested double stranded nucleic acids, the RNA can be pulled down 259 and isolated, leaving the digested double stranded nucleic acids in the flowthrough 261. It should be understood, however, that in some embodiments, the digested double stranded nucleic acids are pulled down and the reverse transcribed RNA is left in the flowthrough. After separation of the RNA and the digested double stranded nucleic acids, the RNA can be prepped and analyzed 263 in various molecular assessments, such as (for example) RNA-seq, quantitative PCR, and cDNA cloning. The digested double stranded nucleic acids can be further analyzed 265 in accordance with the descriptions of Figs. 1 and 2A. In some embodiments, digested double stranded nucleic acids and reverse transcribed RNA are not separated before prepping and analyzing.

[0064] Returning to Fig. 1, in some embodiments, the digested double stranded nucleic acid molecule fragments are denatured and nucleobases of nucleic acid molecule fragments are converted 105. Prior to conversion, in some embodiments, the double stranded nucleic acid molecule fragments are denatured into single stranded nucleic acid molecule fragments. Denaturing of double stranded nucleic acid fragments into single stranded fragments can be performed by any appropriate method, including (but not limited to) a denaturing heat treatment and/or a denaturing chemical treatment. In some embodiments, denaturing of the digested double stranded fragments with adapter results in the dissociation of the adapter sequence from the nucleic acid fragment strand that was recognized by the modification-dependent restriction enzyme. In some embodiments, prior to conversion, nucleobases are altered to adjust their susceptibility to nucleotide conversion. Alterations include but are not limited to oxidation of modified cytosines by enzymatic or chemical means, for example by the Ten-eleven translocation family of enzymes (TET), TET1, TET2 and TET3, or with potassium perruthenate or potassium ruthenate. Alterations include but are not limited to reduction of modified cytosines by enzymatic or chemical means, for example by sodium borohydride.
Alterations include but are not limited to protection or deprotection of nucleobases by enzymatic or chemical means, for example by DNA or RNA methyltransferases including (but not limited to) the DNA methyltransferase family (DNMT), M.SssI, M.CviPI, DNA adenine methyltransferase (Dam), EcoGII methyltransferase, AluI Methyltransferase, BamHI Methyltransferase, EcoRI Methyltransferase, HaeIII Methyltransferase, HhaI Methyltransferase, HpaII Methyltransferase, MspI Methyltransferase, TaqI Methyltransferase or by glucosyltransferases including T4 phage beta-glucosyltransferase and T4 phage alpha-glucosyltransferase, or by 1-ethyl-3-[3-dimethylaminopropyl]carbodiimide hydrochloride, or O-ethylhydroxylamine.

[0065] Conversion of nucleobases serves to distinguish a modified nucleobase from an unmodified nucleobase, as can be detected in a subsequent sequencing reaction. For examination of cytosine modifications, conversion reactions include (but are not limited to) bisulfite treatment, pyridine borane treatment, malononitrile treatment, chemical labeling of modified cytosines or an enzymatic treatment utilizing a cytosine deaminase. Bisulfite treatment converts unmodified cytosine, 5-formylcytosine and 5-carboxylcytosine containing residues into uracil but does not have an effect on modified cytosines including 5-methylcytosine, 5-hydroxymethylcytosine, and 5-glucosylhydroxymethylcytosine. In some embodiments of bisulfite treatment, reaction conditions can be tuned to have no effect on N4-methylcytosine. Pyridine borane treatment converts 5-formylcytosine and 5-carboxylcytosine into dihydrouracil, but does not have an effect on unmodified cytosine, 5-methylcytosine, and 5-hydroxymethylcytosine. Malononitrile treatment selectively converts 5-formylcytosine. Other chemical labeling of modified cytosines to induce a nucleobase conversion upon sequencing includes but is not limited to the selective labeling of 5fC using an azido derivative of 1,3-indandione.
Cytosine deaminase treatment converts unmodified cytosine, 5-methylcytosine, and 5-hydroxymethylcytosine into uracil, thymine, and 5-hydroxymethyluracil, respectively, but does not have an effect on modified cytosines including 5-glucosylhydroxymethylcytosine, 5-formylcytosine, and 5-carboxylcytosine. Cytosine deaminases can also be tethered to antibodies, proteins, or domains of proteins that identify modified or unmodified nucleotides, so that the deaminase marks nucleotide bases in proximity to the nucleotide identified by such antibody, protein, or protein domain. Cytosine deaminases that can be used in an enzymatic treatment include (but are not limited to) the AID/APOBEC family of enzymes and cytidine deaminases (CDA). The human AID/APOBEC family of enzymes includes APOBEC1, APOBEC2, APOBEC3A, APOBEC3B, APOBEC3C, APOBEC3D, APOBEC3F, APOBEC3G, APOBEC3H, APOBEC4, and activation-induced cytidine deaminase (AID). Accordingly, when the resultant molecule is sequenced, the sequencing read will provide an indication on whether the cytosine was modified (i.e., the sequencing read is “C”) or the cytosine was unmodified (i.e., the sequence read is “U/T”). For examination of adenosine modifications, conversion reactions include but are not limited to sodium nitrite treatment, adenine deaminase treatment, N6-methyladenine deaminase treatment, and antibody detection followed by cross linking. Sodium nitrite treatment deaminates unmethylated adenosines to hypoxanthine but does not have an effect on N6-methyladenine. Adenine deaminase treatment converts unmethylated adenosine into inosine but does not have an effect on N6-methyladenine.
Adenine deaminases that can be used in an enzymatic treatment include (but are not limited to) adenosine deaminases acting on dsRNA (ADAR), adenosine deaminases acting on tRNA (ADAT), ADAT homologs such as ecTadA, adenosine deaminases (also known as adenosine aminohydrolases) (ADA), and evolved derivatives of such enzymes such as ABE6.3, ABE7.8, ABE7.9 and ABE7.10. N6-methyladenine deaminase treatment converts N6-methyladenine to hypoxanthine but does not have an effect on unmethylated adenosine. N6-methyladenine deaminases that can be used in an enzymatic treatment include (but are not limited to) Bh0637. Accordingly, when the resulting molecule is sequenced, the sequencing read will provide an indication on whether the adenosine was modified or unmodified, where bases converted to hypoxanthine or inosine are sequenced as “G” and unconverted adenine is sequenced as “A”. Antibody detection of N6-methyladenine followed by cross linking results in mutation of the cytosine base (if present) one nucleobase upstream of the antibody detected N6-methyladenine site; the resulting mutation is sequenced as “T”. For examination of other modified nucleobases, the use of an appropriate enzymatic or chemical treatment resulting in an associated nucleobase change that can be detected by sequencing can be used.
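The cytosine conversion behaviors recited above can be summarized, purely as an illustrative sketch, in a small Python lookup. The names `CONVERTED`, `PROTECTED`, and `sequenced_as`, as well as the short labels used for each base, are hypothetical and introduced only for this example; the table encodes only the bisulfite and cytosine deaminase behaviors stated in the text, and bases not mentioned for a treatment are deliberately left undefined.

```python
# Bases that each treatment converts so that they are read as "T".
CONVERTED = {
    "bisulfite": {"C", "5fC", "5caC"},
    "cytosine_deaminase": {"C", "5mC", "5hmC"},
}

# Bases that each treatment leaves unchanged so that they are read as "C".
PROTECTED = {
    "bisulfite": {"5mC", "5hmC", "5ghmC"},
    "cytosine_deaminase": {"5ghmC", "5fC", "5caC"},
}


def sequenced_as(base: str, treatment: str) -> str:
    """Return the letter expected in the sequencing read for `base`."""
    if treatment not in CONVERTED:
        raise ValueError(f"unknown treatment: {treatment}")
    if base in CONVERTED[treatment]:
        return "T"
    if base in PROTECTED[treatment]:
        return "C"
    # The description does not state the outcome for this combination.
    raise ValueError(f"{base} is not specified for {treatment}")


if __name__ == "__main__":
    for b in ("C", "5mC", "5hmC", "5ghmC", "5fC", "5caC"):
        print(b, sequenced_as(b, "bisulfite"), sequenced_as(b, "cytosine_deaminase"))
```

Read side by side, the two columns show why combining a conversion chemistry with a protecting or oxidizing pretreatment can change which modification a "C" in the read reports.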

[0066] As can be seen in the schematic of the exemplary method in Fig. 2A, the double stranded DNA molecules are denatured into single strands and the unmodified cytosines are converted 205 into uracil. The modified cytosines are left unperturbed. Further, denaturation causes the adapter sequence to dissociate from the strand recognized by the MspJI enzyme.

[0067] In some embodiments, a nucleobase conversion reaction is not performed when utilizing a sequencing system that detects modified nucleobases directly. Such sequencing systems that can detect nucleobase modification include (but are not limited to) Pacific Biosciences’ Single Molecule, Real-Time (SMRT) sequencing platform (Menlo Park, CA) and Oxford Nanopore Technologies’ PromethION, MinION, and GridION sequencing platforms (Oxford, UK).

[0068] The nucleic acid molecule fragments are prepared 107 for sequencing in accordance with the sequencing platform utilized. Generally, in accordance with various embodiments, another primer is annealed to the strand opposing the strand recognized by the modification-dependent restriction enzyme, and the strand is then linearly amplified and further amplified prior to sequencing. Amplification can be performed by any appropriate means, including (but not limited to) polymerase chain reaction (PCR), whole genome amplification (WGA), in vitro transcription (IVT), or any combination of amplification techniques. In some embodiments, a single or double stranded adapter is ligated to the strand opposing the strand recognized by the modification-dependent restriction enzyme, and the strand is then further amplified by PCR. In some embodiments, Adaptase (Swift Biosciences, Ann Arbor, MI) is used to simultaneously tail and ligate an adapter, and the strand is then further amplified by PCR.

[0069] In some embodiments, to add a second primer to the nucleic acid molecule, a Klenow fragment (3’ -> 5’ exo-) is used to linearly amplify and create a double stranded molecule. The second primer can include a primer for the sequencing reaction and a number of nucleotides to anneal with the nucleic acid molecule fragment. In some embodiments, random nucleotides are utilized in the primer for annealing. In some embodiments, specific sequences matching nucleobase converted or unconverted genomic regions of interest are utilized in the primer for annealing. After creating a complementary strand, polymerase chain reaction (PCR) is utilized to amplify the double stranded molecule. Then the digested and converted double stranded nucleic acid molecule fragments are sequenced 109 to detect sites of nucleobase modification. Any appropriate sequencing platform can be utilized, such as (for example) Illumina’s sequencing platform (Illumina, Inc., La Jolla, CA).

[0070] As can be seen in Fig. 2A, a primer molecule is annealed to the single stranded nucleic acid fragment, which is then used to linearly amplify 207 and to recreate a double stranded fragment. The primer molecule includes nine random nucleotides to anneal to each single stranded nucleic acid fragment, with a portion of the primer molecule overhanging the 5’ end containing primer sequence for PCR and/or sequencing. After linear amplification, PCR is performed 209 to amplify the double stranded nucleic acid fragments. In some embodiments, PCR primers in this step contain a 5’ overhang to incorporate sequences useful for sequencing or molecule identification. As shown, the final molecule can contain sequencing specific sequences, including sequences to bind the flow cell and sequencing primer sites. The amplified double stranded nucleic acid fragments are then sequenced 211 utilizing an appropriate sequencing platform.

[0071] To detect sites of nucleobase modification, the sequencing results are analyzed and compared to a reference sequence. Nucleobase modification on the strand that was digested with a modification-dependent restriction nuclease can be detected by the expected distance from the adapter sequence as compared to a reference sequence. In some embodiments, the resulting sequencing library is strand-specific, allowing the results to be compared to a specific strand of a reference sequence or genome.

[0072] To detect sites of 5-methylcytosine utilizing MspJI, the expected locations of CpG or CpHpG sites on the sequenced DNA fragments are known, and the sequencing results can be used to identify CpG dyads that are fully- or hemi-methylated. In this case, this is done by identifying the cut site of MspJI, indicated in the sequencing results as a G on approximately the 17^th non-defined base (17^th nucleotide from the original fragment of DNA). If traditional bisulfite nucleotide conversion was utilized, the sequence on approximately the 16^th non-defined base indicates if the original DNA fragment was fully- or hemi-methylated in a CpG context, given by a C or T, respectively. The sequence on approximately the 15^th non-defined base indicates if the original DNA fragment was fully- or hemi-methylated in a CpHpG context, given by a C or T, respectively. Notably, identification of the nucleotide is approximate because, at low frequency, MspJI exhibits a wobble and thus the precise location of the nucleotide can be one or a few base pairs away.
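As a non-limiting sketch of the read interpretation just described, the following Python function classifies a single bisulfite-converted read. It assumes that the adapter has already been trimmed so that the input contains only the non-defined bases, that positions are counted from 1, and that no MspJI wobble has occurred; the function name `classify_mspji_read` and its return labels are hypothetical. The caller is assumed to know, from the reference sequence, whether the site is in a CpG or CpHpG context.

```python
# 1-based positions within the non-defined (genomic) part of the read.
CUT_SITE_G_POSITION = 17
OPPOSING_C_POSITION = {"CpG": 16, "CpHpG": 15}


def classify_mspji_read(insert: str, context: str = "CpG") -> str:
    """Classify a dyad from one read, ignoring enzyme wobble (assumption)."""
    if context not in OPPOSING_C_POSITION:
        raise ValueError("context must be 'CpG' or 'CpHpG'")
    seq = insert.upper()
    # The G at position 17 pairs with the 5mC recognized by MspJI.
    if len(seq) < CUT_SITE_G_POSITION or seq[CUT_SITE_G_POSITION - 1] != "G":
        return "no_call"
    base = seq[OPPOSING_C_POSITION[context] - 1]
    if base == "C":
        return "fully_methylated"  # opposing cytosine survived conversion
    if base == "T":
        return "hemi_methylated"   # opposing cytosine was converted
    return "no_call"


if __name__ == "__main__":
    print(classify_mspji_read("A" * 15 + "CG"))
    print(classify_mspji_read("A" * 15 + "TG"))
```

A practical implementation would likely also tolerate a shift of one or a few bases around the expected position to account for the wobble noted above.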

[0073] To detect sites of 5-hydroxymethylcytosine utilizing AbaSI, the expected locations of CpG sites on the sequenced DNA fragments are known, and the sequencing results can be used to identify CpG dyads that are hemi-hydroxymethylated or that are opposing hemi-hydroxymethylated/hemi-methylated CpG sites. In this case, this is done by identifying the cut site of AbaSI, indicated in the sequence results as a G on approximately the 14^th non-defined base (14^th nucleotide from the original fragment of DNA). If traditional bisulfite nucleotide conversion was utilized, the sequence on approximately the 13^th non-defined base indicates if the original DNA fragment was hemi-hydroxymethylated/hemi-methylated or only hemi-hydroxymethylated in a CpG context, given by a C or T, respectively. Notably, identification of the nucleotide is approximate because there is some wobble and thus the precise location of the nucleotide can be one or a few base pairs away.

[0074] To detect sites of N6-methyladenine, the presence of the DpnI cut site is identified by a GATC site in the genome at the location where the original fragment of DNA maps. This confirms the 6mA on at least one strand, but notably DpnI has a high preference for cutting GATC sites which have 6mA on both strands. If traditional bisulfite nucleotide conversion was utilized, the occurrence of methylated or unmethylated cytosines on the opposing strand is indicated directly from the read sequence as a C or T, respectively.
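The reference check described for DpnI can be sketched in Python as shown below. This is a simplified illustration that assumes a forward-strand reference string, a 0-based mapping coordinate for the first base of the fragment, and a blunt cut between the A and the T of GATC, so that the two reference bases before the fragment start and the two bases at its start together spell GATC. The function name `maps_to_dpni_site` is hypothetical.

```python
def maps_to_dpni_site(reference: str, fragment_start: int) -> bool:
    """Return True if the fragment start coincides with a GATC cut site.

    `fragment_start` is the 0-based reference index of the first base of the
    mapped fragment (assumption). The reference, not the read, is inspected
    because cytosines in the read may have been altered by conversion.
    """
    if fragment_start < 2 or fragment_start + 2 > len(reference):
        return False
    return reference[fragment_start - 2:fragment_start + 2].upper() == "GATC"


if __name__ == "__main__":
    print(maps_to_dpni_site("AAGATCTT", 4))
```

Fragments mapping to the reverse strand would need the mirrored check, which is omitted here for brevity.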

[0075] While specific examples of methods for detecting sites of nucleobase modification on both strands of a double stranded nucleic acid molecule are described above, one of ordinary skill in the art can appreciate that various steps of the method can be performed in different orders and that certain steps may be optional according to some embodiments. As such, it should be clear that the various steps of the method could be used as appropriate to the requirements of specific applications.

Kits for Identification of Nucleobase Modification

[0076] In several embodiments, kits are utilized for identification of nucleobase modification (or lack thereof). Kits can be used to detect sites of nucleobase modification on one or both strands of a double stranded nucleic acid molecule as described herein. For example, the kits can be used to detect any one or more of modified bases, including (but not limited to) 5-methylcytosine, 5-hydroxymethylcytosine, 5-glucosylhydroxymethylcytosine, 5-formylcytosine, 5-carboxylcytosine, N4-methylcytosine and N6-methyladenine. The kit may include one or more agents for performing endonuclease digestion, one or more agents for modifying nucleobases, one or more agents for performing nucleobase conversion, one or more agents for nucleic acid sequencing, reagents for nucleic acid preparation from biological cells including appropriate means for lysing, stripping nucleic acids of proteins, and preparing the biological sample, and printed instructions for reacting agents with the biological sample to detect nucleobase modifications (or lack thereof) within the sample. Accordingly, a kit may contain one or more restriction nucleases described herein, one or more agents (e.g., potassium perruthenate) or enzymes (e.g., T4 phage beta-glucosyltransferase) for modifying nucleobases described herein, one or more agents (e.g., sodium bisulfite) or enzymes (e.g., AID/APOBEC) for nucleobase conversion, bisulfite sequencing reagents, adapter sequences for amplification and/or sequencing, enzymes and reagents for ligation, and/or reagents for nucleic acid purification. The agents may be packaged in separate containers. The kit may further comprise one or more control reference samples and reagents for performing an endonuclease digestion, nucleobase conversion, and/or sequencing assay.

[0077] A kit can include one or more containers for compositions contained in the kit. Compositions can be in liquid form or can be lyophilized. Suitable containers for the compositions include, for example, bottles, vials, syringes, and test tubes. Containers can be formed from a variety of materials, including glass or plastic. The kit can also comprise a package insert containing written instructions for methods of detecting nucleobase modifications.

EXEMPLARY DATA

[0078] The embodiments of the description will be better understood with the experimental data as provided herein. Validation results are also provided.

5mCpG Maintenance

[0079] Provided in Fig. 3 is a sequencing result for detecting 5-methylcytosine in a sample derived from mouse embryonic stem cells. The data graph shows the 5mCpG maintenance percent in 100 kilobase bins for E14TG2a (E14) mouse embryonic stem cells grown in serum (WT) or in serum supplemented with 0.05 uM Decitabine for 24 hours (Decitabine Treated). A 5mCpG maintenance percent of 100% indicates that, for every read in which a 5mCpG was identified indirectly through the cut site of MspJI, the direct read of the opposing cytosine at the same CpG site was also methylated. Decitabine is a well characterized small molecule known for its ability to directly demethylate the genome by interacting with DNMT1, which is directly responsible for creating fully methylated CpG sites from hemi-methylated CpG sites during DNA replication.
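One way the binned maintenance percent described above might be computed is sketched below in Python. The input format (one tuple per dyad-informative read giving chromosome, position, and whether the directly read cytosine was methylated) and the function name `maintenance_percent_by_bin` are assumptions made for illustration; only the 100 kilobase bin size and the definition of the percentage come from the description.

```python
from collections import defaultdict

BIN_SIZE = 100_000  # 100 kilobase bins, as used in Fig. 3


def maintenance_percent_by_bin(calls, bin_size: int = BIN_SIZE) -> dict:
    """Compute percent of dyad reads whose opposing CpG cytosine is methylated.

    `calls` is an iterable of (chrom, pos, opposing_methylated) tuples, one per
    read in which a 5mCpG was identified through the MspJI cut site.
    Returns {(chrom, bin_start): percent}.
    """
    counts = defaultdict(lambda: [0, 0])  # [methylated reads, total reads]
    for chrom, pos, opposing_methylated in calls:
        key = (chrom, (pos // bin_size) * bin_size)
        counts[key][0] += int(bool(opposing_methylated))
        counts[key][1] += 1
    return {key: 100.0 * m / n for key, (m, n) in counts.items()}


if __name__ == "__main__":
    example = [("chr1", 10, True), ("chr1", 20, False), ("chr1", 150_000, True)]
    print(maintenance_percent_by_bin(example))
```

Bins with very few reads give noisy percentages, so a minimum read count per bin would be a reasonable additional filter in practice.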

Non-dyad 5mCpG Detection

[0080] Provided in Fig. 4 is a sequencing result for detecting 5-methylcytosine in a CpG context in a sample derived from mouse embryonic stem cells. The data graph shows the corresponding 5mCpG methylation percent at all non-dyad detected sites in 100 kilobase bins for E14 mouse embryonic stem cells grown in serum (WT) or in serum supplemented with 0.05 uM Decitabine for 24 hours (Decitabine Treated). Non-dyad detected sites are those CpG sites for which the corresponding 5mC status on the opposing DNA strand was not identified through the cut site of MspJI.

5mCpG Detection in Single Cells

[0081] Provided in Fig. 5 is a sequencing result for detecting 5-methylcytosine in single cell samples derived from K562 cell line culture. The dot plot shows that this methodology works down to the single-cell level. It depicts the genome wide 5mCpG methylation levels for non-dyad detected sites and the corresponding genome wide 5mCpG maintenance percent of 5mCpG dyads for single K562 cells. The K562 cells were grown either under standard conditions (WT) or under standard conditions supplemented with 0.6 uM of Decitabine (DAC) for 24 hours.

Reaction and amplification conditions for 5mCpG Detection

[0082] Provided in Fig. 6 is a sequencing result for detecting 5-methylcytosine in K562 single cell samples undergoing a variety of experimental conditions. Each dot represents a single K562 cell, where total sequencing depth is the same in all conditions. In regards to the ligation step, limited differences are seen between when the adapter top strand is phosphorylated or when it is not (P vs. U). In regards to cytosine conversion, bisulfite conversion (on column (C) or on beads (B)) and enzymatic conversion (E) worked well, with enzymatic conversion resulting in slightly higher efficiency. In regards to linear amplification, various conditions to perform linear amplification had little to no effect (M vs 0).

[0083] In conclusion, with varying efficiency, all various experimental conditions resulted in robust detection of 5-methylcytosine in K562 single cell samples. From this data, enzymatic-based cytosine conversion provided slightly better efficiency of detection over bisulfite-based cytosine conversion. The differences between all other experimental factors tested were negligible.

Combinatorial quantification of 5mC and 5hmC at individual CpG dyads and the transcriptome in single cells reveals modulators of DNA methylation maintenance fidelity

[0084] Inheritance of the epigenetic mark DNA methylation (5-methylcytosine or 5mC) during cell division is critical to ensure that cellular identity is transmitted from mother to daughter cells. While inheritance of DNA methylation is primarily performed by the maintenance DNA methyltransferase 1 (DNMT1) protein by copying methylated cytosines in a CpG sequence context (5mCpG) from the old to new DNA strand, recent work has suggested that DNMT1 displays imprecise maintenance activity. However, it remains unclear if the fidelity of DNMT1 varies at different genomic regions as well as when cells transition from one state to another. For example, one of the most dramatic illustrations of this differential DNMT1 activity is the genome-wide erasure of 5mC while methylation is maintained at imprinted loci during mammalian preimplantation embryogenesis. Further, genome-wide loss of 5mC during mouse and human preimplantation development is heterogeneous, with a subset of cells experiencing global passive demethylation, where 5mC is not faithfully copied to the newly synthesized DNA strands. Therefore, more generally, it remains unknown how cell-to-cell heterogeneity in DNMT1 maintenance methylation activity impacts cellular phenotypes. Investigating these questions remains a challenge due to a lack of experimental techniques. While the methylation status of CpG dinucleotides, a readout of DNMT1 maintenance activity, can be investigated using hairpin-bisulfite sequencing or extensions of this method, where complementary DNA strands are physically linked, these techniques typically have low efficiency and are challenging to scale down to a single-cell resolution. Further, physically linking the two opposing strands using a hairpin prevents direct investigation of 5mC on one strand and the oxidized derivative 5-hydroxymethylcytosine (5hmC) on the other strand of a single DNA molecule.
Therefore, to investigate how the chromatin landscape and cellular state tune DNMT1 maintenance methylation fidelity and DNA methylation dynamics, a new technology (Dyad-seq) was developed that integrates enzymatic detection of modified cytosines with traditional nucleobase conversion techniques to quantify all combinations of 5mC and 5hmC at individual CpG dyads. Finally, Dyad-seq was scaled down and integrated with simultaneous quantification of the transcriptome from the same cell to gain deeper insights into how DNA methylation and DNMT1-mediated maintenance methylation regulate gene expression.

[0085] To address the above questions, four different versions of Dyad-seq in bulk were developed: two where 5mC is selected on one strand through the digestion of DNA with MspJI followed by the interrogation of 5mC or 5hmC on the CpG site of the opposing strand (M-M-Dyad-seq and M-H-Dyad-seq, respectively), and two where 5hmC is selected on one strand through the digestion of DNA with AbaSI followed by the interrogation of 5mC or 5hmC on the CpG site of the opposing strand (H-M-Dyad-seq and H-H-Dyad-seq, respectively) (Fig. 7). After digestion of DNA with either MspJI or AbaSI, the bottom strand of the fragmented molecules is captured by ligation to a double-stranded adapter containing the corresponding overhang, a sample barcode, a unique molecular identifier (UMI), and a PCR amplification sequence. Next, to detect unmodified or methylated cytosines on the opposing DNA strand, samples are treated enzymatically with APOBEC3A or with sodium bisulfite to convert unmodified cytosine to uracil while methylated cytosines remain unchanged (M-M-Dyad-seq and H-M-Dyad-seq). Similarly, slight modification to the conversion reaction can result in the detection of 5hmC on the opposing DNA strand (M-H-Dyad-seq and H-H-Dyad-seq). The bottom strand of the adapter is unaffected by these conversion reactions as it is devoid of cytosines. Next, extension using a random nonamer primer is used to incorporate part of the Illumina read 2 adapter sequence, and the resulting molecules are then PCR amplified and sequenced on an Illumina platform. From the sequencing data, the location of the methylated or hydroxymethylated cytosine on the non-amplified strand, detected by the endonuclease MspJI or AbaSI, can be inferred based on its distance from the adapter, while the methylation/hydroxymethylation status of the opposing CpG site, as well as other cytosines on this strand, can be determined directly from the sequencing results of the conversion reaction (Figs. 2&8).
Thus, Dyad-seq not only enables measurement of the percentage of 5mC or 5hmC at a single-base resolution, similar to that obtained from bisulfite sequencing-based approaches, but also enables quantification of the percentage of 5mC or 5hmC maintenance at individual CpG dyads. Finally, M-H-Dyad-seq and H-M-Dyad-seq allow for the direct detection of two different epigenetic marks at individual CpG dyads, measurements that are not possible with hairpin bisulfite-based techniques.

[0086] To validate M-M-Dyad-seq, mouse embryonic stem cells (mESC) grown with or without Decitabine were compared. Decitabine is a cytosine analog known to directly inhibit DNMT1 activity. Treatment with Decitabine for 24 hours resulted in a global loss of DNA methylation as well as a dramatic reduction in 5mCpG maintenance, quantified as the fraction of CpG sites that are symmetrically methylated, demonstrating that M-M-Dyad-seq can be used to measure genome-wide DNA methylation levels and the fidelity of DNMT1-mediated maintenance methylation (Figs. 3 and 4). In addition, CpHpG maintenance methylation was very low in both conditions, consistent with the known preference of DNMT1 to maintain methylation only at CpG sites in mammalian cells (Fig. 9).

[0087] After validating the technique, Dyad-seq was applied to an in vitro model of epigenetic reprogramming by transitioning mESCs cultured in serum containing media supplemented with leukemia inhibitory factor (LIF) (denoted by ‘SL’) to a serum-free media (basal media) containing LIF and two inhibitors, GSK3i (CHIR99021) and MEKi (PD0325901) (denoted by ‘2i’) (Figs. 10A & 10B). These two states of mESCs are interconvertible, with SL mESCs displaying high levels of methylation that become hypomethylated in 2i. After transitioning SL mESCs to 2i for 48 hours, a global loss of 5mCpG was observed that is accompanied by a corresponding reduction in DNMT1-mediated maintenance methylation activity (Figs. 10A & 10B). Similarly, when the genome was binned into 100 kb regions, it was observed that in greater than 95% of the bins, reduced 5mCpG in 2i conditions is associated with reduced maintenance methylation, consistent with previous observations that passive demethylation is the main contributor to this demethylation process (Fig. 10D). To further investigate the role of different modes of demethylation during this global erasure of the methylome, SL mESCs were transitioned to different media conditions for 48 hours and all four variants of Dyad-seq were performed (Figs. 10A-10C). In the basal media containing neither of the two inhibitors nor LIF (denoted by ‘No’), cells spontaneously differentiated with a rapid increase in both the absolute levels of 5mCpG as well as DNMT1-mediated maintenance methylation (Figs. 10A-10C). In the case where LIF alone (denoted by ‘BL’) or a combination of LIF and GSK3i (denoted by ‘G’) were added to the basal media, limited changes in both the absolute levels of 5mCpG and maintenance methylation activity were observed (Figs. 10A-10C). In contrast, basal media containing LIF and MEKi (denoted by ‘M’) induced an even larger decrease in both the global levels of 5mCpG and DNMT1-mediated maintenance methylation fidelity than 2i (Figs. 10A-10C).
Interestingly, although the maintenance methylation fidelity decreased in both M and 2i conditions compared to SL and No, M-H-Dyad-seq showed that the fraction of 5mC-containing dyads that were paired with 5hmC on the opposing strand was similar for No, SL, 2i and M (Fig. 10A). Similarly, the genome-wide levels of 5hmC were also mostly unchanged between these different conditions (Figs. 10B & 10C). Overall, these results highlight that the global gain or loss of methylation in this system is not dependent on TET-mediated activity but is closely linked to the fidelity of DNMT1-mediated methylation maintenance.
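By way of non-limiting illustration, the 100 kb bin comparison described above can be sketched as follows. The function names, the per-bin data layout, and the toy values are assumptions introduced for illustration only and are not taken from the disclosure.

```python
from typing import Dict, Tuple

BIN_SIZE = 100_000  # 100 kb bins, as described in the text

# Hypothetical layout: bin index -> (5mCpG level, maintenance fraction)
BinStats = Dict[int, Tuple[float, float]]


def bin_index(position: int, bin_size: int = BIN_SIZE) -> int:
    """Return the index of the genomic bin containing a 0-based position."""
    return position // bin_size


def fraction_concordant_loss(sl: BinStats, two_i: BinStats) -> float:
    """Among bins that lose 5mCpG from SL to 2i, return the fraction
    that also show reduced maintenance methylation."""
    shared = [b for b in sl if b in two_i]
    reduced_me = [b for b in shared if two_i[b][0] < sl[b][0]]
    if not reduced_me:
        return 0.0
    both = [b for b in reduced_me if two_i[b][1] < sl[b][1]]
    return len(both) / len(reduced_me)


# Toy values (illustrative only, not measured data)
sl_bins = {0: (0.80, 0.70), 1: (0.75, 0.72), 2: (0.60, 0.50), 3: (0.20, 0.30)}
two_i_bins = {0: (0.30, 0.40), 1: (0.35, 0.50), 2: (0.25, 0.55), 3: (0.25, 0.20)}
concordant = fraction_concordant_loss(sl_bins, two_i_bins)
```

In this toy input, three bins lose 5mCpG and two of those also lose maintenance, so the reported fraction is two thirds.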

[0088] The transition of SL mESCs to the No condition was associated with an increase in both 5mCpG and DNMT1-mediated maintenance methylation, while the 2i and M conditions were associated with a decrease in both 5mCpG and DNMT1-mediated maintenance methylation. It was therefore hypothesized that quantifying the transcriptome in these different conditions could potentially be used to identify new factors that are involved in regulating the DNA maintenance methylation machinery. For example, 2i and M conditions involve the inhibition of the MAPK/ERK pathway, which has previously been shown to reduce protein levels of DNMT1 in multiple systems, including during the transition of mESCs from SL to 2i. Therefore, RNA-seq was performed on all conditions and, as expected, each condition was found to be transcriptionally distinct (Fig. 11). As DNMT1 displayed reduced maintenance methylation fidelity in the M and 2i conditions, but an increase in the No condition, it was reasoned that putative genes involved in tuning maintenance methylation could be identified as those that are upregulated or downregulated in M and 2i when compared to No, but are expressed at intermediate levels in SL, G, and BL conditions. Using these combinatorial criteria, 61 differentially expressed genes were identified, 39 of which were highly expressed in the 2i and M conditions with enrichment in pathways associated with pluripotency, negative cell cycle regulation, and blastocyst development, while 22 genes were highly expressed in the No condition with enrichment in pathways associated with the negative regulation of ERK1 and ERK2 cascade and mesenchymal cell differentiation (Figs. 12A & 12B). Notably, the screen identified Dppa3 (Developmental pluripotency associated 3) as one of the hits that is highly expressed in the M and 2i conditions (Figs. 12A & 12B).
Previous studies have found that ectopic expression of DPPA3 leads to global hypomethylation, while Dppa3 knockout leads to global hypermethylation. Further, DPPA3 has been shown to directly bind the PHD domain of UHRF1 (Ubiquitin like with PHD and ring finger domains 1), a critical partner of DNMT1 necessary for 5mCpG maintenance, and to displace it from chromatin, thus inhibiting methylation maintenance. The identification of a previously well-characterized factor involved in DNA methylation maintenance suggests that several of the other 60 genes identified in our screen, by combining Dyad-seq with RNA-seq, could potentially reveal novel regulators of DNA methylation maintenance fidelity and pluripotency.
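As a non-limiting sketch, the combinatorial selection criteria described above (extreme expression in M and 2i relative to No, with intermediate expression in SL, G, and BL) could be expressed as a simple ordering test. The function name, the per-gene dictionary layout, and the expression values below are hypothetical, and the sketch omits the statistical testing for differential expression that an actual analysis would require.

```python
from typing import Dict, Optional

INTERMEDIATE = ("SL", "G", "BL")


def classify_gene(expr: Dict[str, float]) -> Optional[str]:
    """Classify one gene from its mean expression per condition.

    Returns "high_in_2i_M" when No < {SL, G, BL} < {M, 2i},
    "high_in_No" when {M, 2i} < {SL, G, BL} < No, and None otherwise.
    """
    inter = [expr[c] for c in INTERMEDIATE]
    low_of_2i_m = min(expr["M"], expr["2i"])
    high_of_2i_m = max(expr["M"], expr["2i"])
    no = expr["No"]
    if all(no < v < low_of_2i_m for v in inter):
        return "high_in_2i_M"
    if all(high_of_2i_m < v < no for v in inter):
        return "high_in_No"
    return None


# Hypothetical expression values (arbitrary units), for illustration only
gene_a = {"No": 1.0, "SL": 5.0, "G": 6.0, "BL": 4.0, "M": 20.0, "2i": 15.0}
gene_b = {"No": 30.0, "SL": 10.0, "G": 12.0, "BL": 9.0, "M": 2.0, "2i": 3.0}
gene_c = {"No": 5.0, "SL": 5.0, "G": 5.0, "BL": 5.0, "M": 5.0, "2i": 5.0}
labels = [classify_gene(g) for g in (gene_a, gene_b, gene_c)]
```

The strict inequalities are a design choice made for this sketch; a practical screen would more likely apply fold-change and significance thresholds.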

[0089] These experiments were also used to further investigate the dynamics and relationship between DNA methylation and DNA hydroxymethylation at individual CpG dyads. While M-M-Dyad-seq showed, as expected, that a majority of 5mC existed as symmetrically methylated dyads instead of hemimethylated dyads, H-H-Dyad-seq showed that 5hmC infrequently occurred as symmetrically hydroxymethylated dyads (Fig. 10A). In contrast, H-M-Dyad-seq showed that 5hmC sites had high levels of 5mC on the CpG site of the opposing DNA strand, which showed similar trends to the global levels of 5mC among conditions (Figs. 10B & 10C). This observation is in agreement with single-molecule fluorescence resonance energy transfer experiments, which, while lacking locus-specific information, globally identified that approximately 60% of 5hmC sites exist in a 5hmC/5mC dyad state in mESCs. These measurements strongly suggest that in mESCs TET proteins hydroxymethylate only one of the two 5mC sites in a symmetrically methylated dyad and do not sequentially convert both 5mC to 5hmC. This result is consistent with previous in vitro work showing that TET proteins have a stronger binding affinity for symmetrically methylated dyads over hemimethylated dyads, and a crystal structure that indicates the non-reactive cytosine is not involved in protein-DNA contacts. In summary, these experiments show that the combination of the four Dyad-seq variants can provide insights into DNA methylation/hydroxymethylation turnover at a single-base and genome-wide resolution that was previously not possible.

[0090] As the genome-wide methylation levels were strongly related to the overall maintenance methylation activity for mESCs grown under different conditions, it was next explored how methylation levels were linked to DNMT1-mediated maintenance methylation fidelity for different genomic regions within the same cell state (Figs. 10A & 10B).
For example, as has been shown previously, compared to other genomic loci within the same cell state, intracisternal A particles (IAPs) display high 5mCpG levels in SL as well as in the globally hypomethylated 2i and M conditions (Fig. 13A). Interestingly, these elevated 5mCpG levels at IAPs are associated with higher maintenance methylation activity, suggesting that high levels of methylation locally could be linked to increased DNMT1-mediated maintenance methylation fidelity (Fig. 13B). More generally, it was found that genomic regions with increasing methylation levels are correlated with higher maintenance methylation fidelity, with a more pronounced effect in regions with higher CpG density, such as CpG islands (Fig. 14). Consistent with recent findings, the results from M-M-Dyad-seq suggest that there exists a tight coupling and a positive feedback loop between local methylation density and DNMT1-mediated maintenance methylation activity.

[0091] Based on this discovery that 5mCpG maintenance can be tuned by the methylation levels at different genomic loci, it was hypothesized that the genome-wide landscape of histone modifications could also impact DNMT1-mediated maintenance methylation. It was found that highly methylated genomic regions are associated with high 5mCpG maintenance, independent of the histone modification at those loci (Fig. 15). Surprisingly, however, we discovered that for regions that are lowly methylated, the presence of a particular histone mark can dramatically alter the maintenance methylation fidelity (Figs. 15 and 16A & 16B). For example, regions enriched for the repressive mark H3K9me2 were found to be associated with higher maintenance methylation fidelity than a randomly selected bin at similar methylation levels (Figs. 15 and 16A). This is consistent with previous observations that UHRF1 can specifically bind H3K9me2 with high affinity, providing a mechanistic rationale for the recruitment of DNMT1 and higher maintenance seen in these regions. Interestingly, enhancers marked by H3K4me1 or H3K27ac, and active promoters/enhancers marked by H3K9ac also have increased DNMT1-mediated maintenance methylation fidelity (Figs. 15 and 16A & 16B). In contrast, H3K4me3, a mark found at active gene promoters, had lower levels of 5mCpG maintenance than the average genome-wide maintenance (Figs. 15 and 16B). Overall, these results demonstrate that for regions of the genome that are lowly methylated, histone modifications can significantly alter maintenance methylation fidelity from approximately 40% to over 70% of the dyads being symmetrically methylated.

[0092] In addition to the dramatic heterogeneity observed in maintenance methylation across different genomic contexts, we next wanted to understand how cell-to-cell variability in DNMT1-mediated maintenance methylation fidelity could impact cellular phenotypes within a population. Addressing this question has been challenging as current hairpin bisulfite techniques have not successfully been applied to single cells. To overcome this limitation, we scaled down M-M-Dyad-seq to single-cell resolution (termed scDyad-seq). As proof-of-concept, cells treated with Decitabine for 24 hours experienced a global loss of DNA methylation and a dramatic reduction in the fraction of CpG dyads that were symmetrically methylated, while showing no changes in 5mCpHpG maintenance (Figs. 5 and 17). Further, to directly link heterogeneity in DNA methylation and DNMT1-mediated maintenance methylation fidelity to gene expression variability, the method was extended to simultaneously capture the transcriptome (scDyad&T-seq) using an amplification strategy. scDyad&T-seq was applied to serum grown mESCs to detect up to 75,835 unique transcripts per cell, and the methylation status of up to 1,118,393 CpG sites per cell, together with the additional detection of the maintenance methylation status of up to 203,620 CpG dyads per cell (with an average of 25,066 unique transcripts per cell (5,825 genes/cell), covering the methylation status of 328,967 CpG sites on average per cell and the maintenance methylation status of an additional 51,650 CpG dyads on average per cell) (Fig. 18). Further, scDyad&T-seq was compared to scMspJI-seq, a method recently developed for strand-specific quantification of 5mC. While scMspJI-seq does not have the resolution of individual CpG dyads, it can be used to estimate the extent of asymmetry in DNA methylation between two strands of DNA over a large genomic region.
This strand-specific DNA methylation was previously quantified using a metric known as strand bias, defined as the number of methylated cytosines on the plus strand divided by the total number of methylated cytosines on both DNA strands, with deviations from a score of 0.5 indicating asymmetric DNA methylation between the two strands of DNA. Therefore, the individual-CpG-dyad (or 5mCpG maintenance) resolution afforded by scDyad&T-seq was directly compared to the strand bias score that can be obtained from both scDyad&T-seq and scMspJI-seq. Overall, agreement was found between both measurements, with high 5mCpG maintenance associated with no strand bias (Figs. 19A, 19B & 19C). Notably, however, it was found that the strand bias score in general can only identify cells that experience high levels of strand-specific asymmetry in DNA methylation and, unlike the individual-CpG-dyad resolution of scDyad&T-seq, is not able to capture the wide gradient in strand-specific DNA methylation in single cells (Figs. 19A & 19B). Further, the limited accuracy when using the strand bias score of scMspJI-seq possibly also arises from a lack of allele-specific resolution of this technique. For example, while low levels of 5mCpG maintenance (mean 5mCpG maintenance = 25.4%) in cell P7L3.69 were associated with strand bias scores that substantially deviated from 0.5 as expected, cell P7L4.78, displaying similarly low levels of 5mCpG maintenance (mean 5mCpG maintenance = 30.7%), showed very little strand bias (Figs. 19A & 19B). Similarly, scDyad&T-seq was used to test the performance of a recent computational method for estimating maintenance methylation from nucleobase conversion-based 5mC sequencing data (see F. Krueger and S. R. Andrews, Bioinformatics 27, 1571-1572 (2011)).
While the computational metric performed reasonably well for cells that displayed low 5mCpG maintenance, its accuracy was lower for cells with high levels of 5mCpG maintenance (Spearman correlation, p = 0.42) (Fig. 19C). Together these results demonstrate that scDyad&T-seq is a highly sensitive method for assessing the maintenance methylation status at individual CpG dyads in single cells, and when combined with the readout of absolute levels of methylation and the transcriptome from the same cell, this method could potentially allow direct linkage of DNA methylation dynamics and heterogeneity to cellular phenotypes.
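For illustration only, the strand bias score defined above and a dyad-level maintenance estimate can be contrasted in a short sketch. The strand bias formula follows the definition in the text; the maintenance estimate (symmetrically methylated dyads divided by all dyads carrying at least one methylated cytosine) and the data layout are simplifying assumptions made for this example.

```python
from typing import Iterable, List, Tuple

# Hypothetical layout: one (plus_methylated, minus_methylated) pair per CpG dyad
Dyad = Tuple[bool, bool]


def strand_bias(dyads: Iterable[Dyad]) -> float:
    """Methylated cytosines on the plus strand divided by methylated
    cytosines on both strands (0.5 means no net strand asymmetry)."""
    dyad_list: List[Dyad] = list(dyads)
    plus = sum(1 for p, _ in dyad_list if p)
    minus = sum(1 for _, m in dyad_list if m)
    if plus + minus == 0:
        raise ValueError("no methylated cytosines observed")
    return plus / (plus + minus)


def maintenance_fraction(dyads: Iterable[Dyad]) -> float:
    """Assumed estimate: fraction of methylated dyads that are
    symmetrically methylated rather than hemimethylated."""
    methylated = [(p, m) for p, m in dyads if p or m]
    if not methylated:
        raise ValueError("no methylated dyads observed")
    symmetric = sum(1 for p, m in methylated if p and m)
    return symmetric / len(methylated)


# Toy region: every dyad is hemimethylated, split evenly between strands
hemi_balanced = [(True, False), (False, True)] * 2
bias = strand_bias(hemi_balanced)
maintenance = maintenance_fraction(hemi_balanced)
```

In this toy region the strand bias is 0.5 even though no dyad is symmetrically methylated, which mirrors the observation that a cell with low 5mCpG maintenance can still show very little strand bias.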

[0093] After establishing that scDyad&T-seq can accurately quantify maintenance methylation fidelity, in addition to DNA methylation and the transcriptome in single cells, it was next attempted to estimate the cell-to-cell heterogeneity in the methylome and its relationship to transcriptional variability and cellular phenotypes. Surprisingly, even when considering regions of similar 5mC content in the genome based on previous bulk measurements, substantial heterogeneity was observed in both the methylation levels and 5mCpG maintenance fidelity between individual cells (Fig. 20). Furthermore, we also noted that the relationship of higher 5mCpG maintenance fidelity at regions of the genome with higher methylation density is preserved at the single-cell level. To systematically correlate heterogeneity in the epigenome to gene expression variability and the occurrence of distinct cell states, the transcriptome was used to identify two subpopulations in the serum grown mESCs - one high in NANOG, REX1, and ESRRB (referred to as NANOG high or ‘Nan^Hi’) and one low in the expression of these genes (referred to as NANOG low or ‘Nan^Lo’) (Figs. 21A & 21B). While these two well-established subpopulations in serum grown mESCs are known to be transcriptionally heterogeneous with bimodal expression of key pluripotency genes, how these cell states are linked to the methylome and DNMT1-mediated maintenance methylation fidelity remains less well studied. Cells in the Nan^Hi state are globally hypomethylated with reduced genome-wide 5mCpG maintenance compared to the Nan^Lo cell state (Mann-Whitney U test, p = 8 × 10⁻⁶ and p = 3.6 × 10⁻⁴, respectively), suggesting that DNMT1-mediated maintenance methylation fidelity could play an important role in tuning the cell state in serum grown mESCs (Fig. 22).

[0094] It was next explored how histone modifications impact 5mCpG maintenance in the context of different cell states. Consistent with bulk findings, H3K9me2/3, H3K4me1, H3K27ac, and H3K9ac enriched genomic regions showed higher maintenance methylation fidelity while H3K4me3 marked regions showed reduced maintenance methylation fidelity (Figs. 22 and 23A & 23B). Interestingly though, across all the histone modifications investigated, cells in the Nan^Hi state consistently showed reduced methylation levels and correspondingly lower 5mCpG maintenance compared to Nan^Lo cells, suggesting an intrinsic and mechanistic relationship between DNA methylation density and DNMT1-mediated maintenance methylation fidelity that is independent of cell state. Finally, these results also demonstrate that the cell state can play a key role in impacting the DNA methylation maintenance machinery and the resulting genome-wide methylation landscape (Fig. 22).

[0095] To further investigate how maintenance methylation fidelity is influenced as cells transition from one state to another, scDyad&T-seq was applied to mESCs that were switched from serum-containing media (Serum) to 2i media for 3, 6, or 10 days (2iD3, 2iD6, and 2iD10, respectively). While transition from Serum to 2i resulted in dramatic erasure of DNA methylation and loss of maintenance methylation activity, it was surprisingly also found that the epigenetic reprogramming to the 2i state is highly heterogeneous, with a fraction of cells still retaining high levels of methylation and 5mCpG maintenance even after 10 days in 2i (Fig. 24A). Using hierarchical clustering, cells were classified as either highly or lowly methylated (mC^Hi and mC^Lo), and highly or lowly maintained (Mnt^Hi and Mnt^Lo), leading to 4 distinct categories of epigenetic states (Figs. 25A and 25B). Superimposing the time-course data on these epigenetic states shows that cells generally start off in a highly methylated and highly maintained state, with passive demethylation thereafter resulting in the loss of 5mC until they reach a lowly methylated and lowly maintained state. Surprisingly, a fraction of cells subsequently moves towards a lowly methylated but highly maintained state to establish a globally hypomethylated genomic landscape that is maintained at high fidelity (Figs. 24A and 24B). Consistent with this result, it was found that regions previously identified as retaining high methylation in the 2i state also showed higher methylation, and correspondingly displayed higher maintenance methylation fidelity at all timepoints, including in serum grown cells (Fig. 24B). These results highlight the intrinsic relationship between the local methylation density and maintenance methylation fidelity that is conserved through transitions in cell states.
Further, these highly methylated regions were previously shown to correlate with H3K9me3 peaks, which are found at similar genomic regions for both serum and 2i grown mESCs. This, together with the observation that H3K9me3-marked regions show high 5mCpG maintenance fidelity, suggests that the relationship between methylation density and maintenance methylation fidelity across cell states in this system could be coupled to H3K9me3, and that the global impairment to the 5mCpG maintenance machinery in 2i grown mESCs is partially restored at H3K9me3-enriched regions.

[0096] The gene expression patterns from the same cells were analyzed, showing that most cells in 2i were transcriptionally similar and distinct from serum grown cells, as expected (Fig. 26). Further, the 2i cells did not strongly separate by the time spent in culture, suggesting that the transcriptional reprogramming is quickly activated once cells are transitioned from serum to 2i conditions (Fig. 27). Beyond the two broad transcriptional groups of serum-like and 2i-like cells, further clustering revealed 4 distinct populations, with two clusters transcriptionally similar to serum grown cells - Nanog low like-cells and Nanog high like-cells (referred to as ‘Nan^Lo_LC’ and ‘Nan^Hi_LC’, respectively) - that primarily included serum grown mESCs, and two similar to 2i cells (referred to as ‘2i-1’ and ‘2i-2’) (Fig. 27). Interestingly, a small group of cells from the 2iD3 condition clustered with the Nan^Lo_LC cell population. Previous work has shown that serum grown cells in the Nan^Lo state are less likely to survive the transition to 2i when compared to those in the Nan^Hi state. This low survival rate of cells in the Nan^Lo state could possibly arise due to an epigenetic barrier, and comparison of the 2iD3 cells that cluster with the serum-like cells shows that this group is characterized by high 5mCpG maintenance and correspondingly high methylation levels relative to the 2iD3 cells that successfully transition and cluster with the 2i-like cells (Fig. 28). Further, consistent with recent observations, the 2iD3 cells that clustered with the serum-like cells expressed low levels of Pou5f1 (Fig. 28). Finally, these cells also express Sox1, suggesting that they begin to differentiate towards the neuroectoderm lineage, but do not survive long-term culture in 2i conditions and are not present by day 10 (Figs. 27 and 28).

[0097] In contrast, cells that successfully transition their transcriptional program to 2i-like cells could be split into two distinct clusters, 2i-1 and 2i-2, with a small group of genes and transposable elements that are differentially expressed between these two subpopulations (Figs. 29 and 30A). Notably, cells in the 2i-2 cluster expressed Dppa3 at higher levels than the 2i-1 cluster (Figs. 29 and 30A). Consistent with the known role of DPPA3 as an inhibitor of maintenance methylation, 2i-2 cells displayed dramatically lower DNMT1-mediated maintenance methylation fidelity, resulting in genome-wide demethylation compared to 2i-1 cells (Fig. 30B). These results highlight the underlying intrinsic relationship between DNA methylation levels and maintenance methylation fidelity across the genome and how cell states preferentially occupy specific regions within the methylome density vs. maintenance methylation fidelity landscape (Fig. 31). Further, the globally hypomethylated state in the 2i-2 population correlates with higher expression of endogenous retroviral elements RLTR45 and RLTR45-int, as well as higher expression of Khdc3 (also known as Filia), a factor known to be involved in safeguarding genomic integrity in mESCs and preimplantation embryos (Fig. 30A). Taken together, these observations suggest that reduced DNMT1-mediated maintenance fidelity in the 2i-2 population results in genome-wide DNA demethylation, an associated loss of retroviral silencing, and genome instability. Finally, between the two 2i subpopulations, the 2i-2 cluster is dominated by cells in the mC^Lo and Mnt^Lo state and is weakly biased towards cells that have been cultured longer in 2i conditions (Figs. 32A & 32B).

[0098] In summary, Dyad-seq is a generalized genome-wide approach for profiling all combinations of 5mC and 5hmC at individual CpG dyads. Using M-M-Dyad-seq, it was discovered that DNMT1-mediated maintenance methylation fidelity is directly tied to local methylation levels, and for regions of the genome that have low methylation, specific histone marks can significantly modulate the maintenance methylation activity. Further, when combined with RNA-seq, well-characterized factors were identified, such as DPPA3, as well as other putative factors that are potentially involved in regulating the maintenance methylation fidelity of DNMT1. Similarly, using H-M-Dyad-seq and H-H-Dyad-seq, it was found that 5hmCs are more commonly duplexed with 5mC in mESCs. These results highlight that variants of Dyad-seq, such as H-M-Dyad-seq and M-H-Dyad-seq, enable measurements that are currently not possible with hairpin bisulfite techniques, thereby providing deeper insights into the regulation of methylation and demethylation pathways in mammalian systems. To understand how changes in maintenance methylation fidelity during cell state transitions reprogram the methylome and transcriptome, scDyad&T-seq was developed to simultaneously quantify genome-wide methylation levels, maintenance methylation and mRNA from the same cells. Applying scDyad&T-seq to mESCs transitioning from serum to 2i, it was found that the epigenetic reprogramming was highly heterogeneous, with the emergence of four distinct cell populations that dramatically differ in 5mCpG maintenance, DNA methylation levels, as well as gene expression. These results showed that in addition to cell identity, DNA methylation levels and histone modifications are closely tied to how faithfully the DNA methylation maintenance machinery copies methylated cytosines from one cell generation to another.
Overall, scDyad-seq is an enhancement over both scMspJI-seq and single-cell bisulfite sequencing techniques, enabling high-resolution quantification of both genome-wide 5mC levels and maintenance methylation in thousands of single cells, and when extended to scDyad&T-seq, the method can also be used to simultaneously obtain the transcriptome from the same cells (Figs. 33A and 33B).

Methods

Cell culture

[0099] All cells were maintained in incubators at 37°C and 5% CO2. The mouse embryonic stem cell line ES-E14TG2a (E14) was grown on gelatin (Millipore Sigma, ES-006-B) coated tissue culture plates with media containing high glucose DMEM (Gibco, 10569044), 1% non-essential amino acid (Gibco, 11140050), 1% Glutamax (Gibco, 35050061), 1x Penicillin-Streptomycin (Gibco, 15140122), and 15% stem cell qualified serum (Millipore Sigma, ES-009-B). The media was frozen in aliquots and, after thawing, was stored at 4°C and used for a maximum of 2 weeks. Once thawed, 1 µL of beta-mercaptoethanol (Gibco, 21985023) and 1 µL of LIF (Gibco, A35933) were added for every 1 mL of thawed media. Cells were washed with 1x DPBS (Gibco, 14190250) and the media was exchanged daily. Cells were routinely passaged 1:6 once they reached 75% confluency using 0.25% trypsin-EDTA (Gibco, 25200056). E14 cells grown under these conditions constitute the SL experimental group. For FACS sorting, a single-cell suspension was made using 0.25% trypsin-EDTA. The trypsin was then inactivated using serum-containing medium. Afterwards, the cells were washed with 1x DPBS before being passed through a cell strainer and sorted for single cells into 384-well plates.

[0100] K562 cells were grown in RPMI (Gibco, 61870036) with 10% serum (Gibco, 10437028) and 1x Penicillin-Streptomycin. When the culture reached a density of approximately 1 million cells per mL, the cells were split and resuspended at a density of 200,000 cells per mL. Cells were washed and FACS sorted as described for E14 cells.

Decitabine treatment

[0101] E14 mouse embryonic stem cells were cultured as described above. Upon passage of the E14 cells, SL media was supplemented with 0.05 mM of Decitabine. After 24 hours, cells were harvested using 0.25% trypsin-EDTA. The trypsin was then inactivated using serum-containing medium. The cells were washed with 1x DPBS and then resuspended in 200 µL of DPBS. Genomic DNA was extracted using the DNeasy kit (Qiagen, 69504) according to the manufacturer’s recommendations.

[0102] K562 cells were cultured as described above. Upon passage, the media was supplemented with 0.6 mM of Decitabine or DMSO (as a control). After 24 hours the cells were washed and single-cell FACS sorting was performed as described above.

Transition from serum to 2i and other media conditions

[0103] E14 mouse embryonic stem cells were cultured in SL conditions as described above. Upon passage, cells were resuspended in the following media depending on the condition studied. Commercial 2i media containing LIF (Millipore, SF016-200) was used for BL, G, 2i, and M experiments. For 2i, all components were used according to the manufacturer's recommendations. For G and M conditions, only the GSK3B inhibitor or MEK1/2 inhibitor was added, respectively. For the BL condition, no inhibitors were added. For the No condition, commercial 2i media without LIF (Millipore, SF002-100) was used with no inhibitors added. After 24 hours, the cells were washed with 1x DPBS and the media was exchanged. 48 hours after the initial media switch, the cells were collected using 0.25% trypsin-EDTA, quenched using serum-containing media, washed in 1x DPBS and finally resuspended in 1x DPBS. The sample was then split in half. One half was resuspended in 200 µL of DPBS for genomic DNA extraction, as described above. The other half was resuspended in 500 µL of TRIzol reagent (Invitrogen, 15596018) and total RNA was extracted according to the manufacturer’s recommendations. Experiments for each condition were performed in triplicate.

ChIP-seq data processing

[0104] The following published serum grown E14 ChIP datasets were used in this study (GEO accessions): GSM1000123 (H3K9ac), GSE74055 (H3K9me1 and H3K27ac), GSE23943 (H3K4me3, H3K9me3, H3K27me3, and H3K36me3), and GSE77420 (H3K9me2). For all these datasets, the processed data file was downloaded from GEO and further processed if needed. For GSE74055, the bigwigCompare tool on Galaxy (version 2.1.1.20160309.6) was used with 1 kb bins to identify enriched regions compared to the input data; bins with a log2 enrichment score greater than 2 were considered enriched regions. For GSE23943, peak calling was performed using MACS2 on Galaxy, and the resulting narrow peaks file was used as enriched regions. For GSE77420, the enrichment score for H3K9me2 in serum grown conditions was compared to the input score within 2 kb bins. Regions were considered enriched if the H3K9me2 score was greater than the input score for both replicates. For all samples, contiguous enriched regions were combined into a single region. When applicable, enriched regions were converted from mm9 to mm10 using the UCSC genome browser LiftOver tool.
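As a non-limiting sketch of the binned enrichment logic described above, the following example calls bins with a log2 enrichment score greater than 2 and merges contiguous enriched bins into single regions. It does not reproduce the bigwigCompare or MACS2 tools; the function name, the list-based input, and the pseudocount are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def enriched_regions(
    signal: Sequence[float],
    control: Sequence[float],
    bin_size: int = 1000,      # 1 kb bins, as in the text
    threshold: float = 2.0,    # log2 enrichment cutoff, as in the text
    pseudocount: float = 1.0,  # assumption: avoids division by zero
) -> List[Tuple[int, int]]:
    """Return merged (start, end) intervals of enriched bins."""
    regions: List[Tuple[int, int]] = []
    for i, (s, c) in enumerate(zip(signal, control)):
        score = math.log2((s + pseudocount) / (c + pseudocount))
        if score <= threshold:
            continue
        start, end = i * bin_size, (i + 1) * bin_size
        if regions and regions[-1][1] == start:
            # Contiguous with the previous enriched bin: extend that region
            regions[-1] = (regions[-1][0], end)
        else:
            regions.append((start, end))
    return regions


# Toy per-bin coverage values (illustrative only)
chip = [10, 50, 60, 1, 40]
inp = [10, 5, 5, 5, 5]
regions = enriched_regions(chip, inp)
```

With these toy values, the second and third bins are merged into one 2 kb region and the fifth bin forms a separate region.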

Dyad-seq Adapters

[0105] The double-stranded Dyad-seq adapters are designed to be devoid of cytosines on the bottom strand. They contain a PCR sequence, a 4-base pair UMI, and a 10-base pair cell-specific barcode. For Dyad-seq variants that use MspJI as a restriction enzyme (M-M-Dyad-seq and M-H-Dyad-seq), the adapters contain a random 4 base pair 5’ overhang.

Top oligo: 5’- NNNN [10 bp barcode] HHWHCCAAACCCACTACACC -3’ (SEQ ID No. 1)

Bottom oligo: 5’- GGTGTAGTGGGTTTGGDWDD [10 bp barcode] -3’ (SEQ ID No. 2)

The sequences of the 10 bp cell-specific barcode for scDyad&T-seq are provided in Table 1. For bulk Dyad-seq, cell-specific barcodes were used as sample-specific barcodes.

For the M-M-Dyad-seq experiments described in the manuscript, a prototype of the above design was used consisting of a 3 bp UMI and an 8 bp sample-specific barcode (Table 2).

Top oligo: 5’- NNNN [8 bp barcode] HHHCCAAACCCACTACACC -3’ (SEQ ID No. 3)

Bottom oligo: 5’- GGTGTAGTGGGTTTGGDDD [8 bp barcode] -3’ (SEQ ID No. 4)

The sequences of the 8 bp sample-specific barcode for M-M-Dyad-seq are provided in Table 2.

For Dyad-seq variants that use AbaSI as a restriction enzyme (H-H-Dyad-seq and H-M-Dyad-seq), the adapters contain a random 2 base pair 3’ overhang as shown below:

Top oligo: 5’- [10 bp barcode] HHWHCCAAACCCACTACACC -3’ (SEQ ID No. 5)

Bottom oligo: 5’- GGTGTAGTGGGTTTGGDWDD [10 bp barcode] NN -3’ (SEQ ID No. 6)

The sequences of the 10 bp sample-specific barcode used for H-M-Dyad-seq and H-H-Dyad-seq are provided in Table 3.
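For illustration only, the relationship between the top and bottom oligos of the MspJI adapter design can be checked with a short reverse-complement sketch using IUPAC codes (H = A/C/T, D = A/G/T, W = A/T). The function names and the example barcode are hypothetical and are not taken from Tables 1-3. The requirement that the top-strand barcode contain no G is an inference from the statement that the bottom strand is devoid of cytosines.

```python
# Complements for the IUPAC codes used in the adapter design
IUPAC_COMPLEMENT = {
    "A": "T", "T": "A", "C": "G", "G": "C",
    "W": "W", "H": "D", "D": "H", "N": "N",
}

TOP_CONSTANT = "HHWHCCAAACCCACTACACC"     # UMI (HHWH) + PCR sequence
BOTTOM_CONSTANT = "GGTGTAGTGGGTTTGGDWDD"  # as given for SEQ ID No. 2


def reverse_complement(seq: str) -> str:
    """Reverse complement of a sequence written with IUPAC codes."""
    return "".join(IUPAC_COMPLEMENT[base] for base in reversed(seq))


def build_mspji_adapter(barcode: str, overhang: str = "NNNN"):
    """Return (top, bottom) oligos, both written 5' to 3'."""
    if "G" in barcode:
        # A G on the top strand would pair with a C on the bottom strand
        raise ValueError("barcode would place a cytosine on the bottom strand")
    top = overhang + barcode + TOP_CONSTANT
    # The 5' overhang is single-stranded, so it is excluded from the bottom
    bottom = reverse_complement(barcode + TOP_CONSTANT)
    return top, bottom


# Hypothetical barcode, for illustration only
top_oligo, bottom_oligo = build_mspji_adapter("ACTATCACTA")
```

The constant portion of the computed bottom strand reproduces the sequence given for the bottom oligo, and the H and W positions on the top strand guarantee that the UMI contributes no cytosine to the bottom strand.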

Table 1.

Table 2

Table 3

Bulk Dyad-seq

[0106] For all bulk Dyad-seq experiments, 100 ng of purified genomic DNA was resuspended in 20 µL of glucosylation mix (1x CutSmart buffer (NEB, B7204S), 2.5x UDP-glucose, 10 U T4-BGT (NEB, M0357L)) and the samples were incubated at 37°C for 16 hours. Afterwards, 10 µL of protease mix (100 µg protease (Qiagen, 19155), 1x CutSmart buffer) was added to each sample, and the samples were heated to 50°C for 5 hours, 75°C for 20 minutes, and 80°C for 5 minutes. After this, processing of the samples differed based on the version of Dyad-seq used.

[0107] For M-M-Dyad-seq and M-H-Dyad-seq, 10 µL of MspJI digestion mix (2 U MspJI, 1x enzyme activator solution, 1x CutSmart buffer) was added to each sample and the samples were heated to 37°C for 5 hours, and 65°C for 20 minutes. Next, 1 µL of barcoded 1 µM double-stranded adapter was added. Then 9 µL of ligation mix (1.11x T4 ligase reaction buffer, 4.44 mM ATP (NEB, P0756L), 2000 U T4 DNA ligase (NEB, M0202M)) was added to each sample, and the samples were incubated at 16°C for 16 hours.

[0108] For H-M-Dyad-seq and H-H-Dyad-seq, 10 µL of AbaSI digestion mix (10 U AbaSI (NEB, R0665S), 1x CutSmart buffer) was added to each sample and the samples were incubated at 25°C for 2 hours, and then heated to 65°C for 20 minutes. Next, 1 µL of barcoded 1 µM double-stranded adapter was added. Then 9 µL of ligation mix (1.11x T4 ligase reaction buffer, 4.44 mM ATP (NEB, P0756L), 2000 U T4 DNA ligase (NEB, M0202M)) was added to each sample, and the samples were incubated at 16°C for 16 hours.

[0109] After ligation, up to three barcoded libraries of the same type were pooled and all Dyad-seq versions were subjected to a 1x AMPure XP bead cleanup (Beckman Coulter, A63881), and eluted in 40 µL of water.

[0110] M-M-Dyad-seq and H-M-Dyad-seq samples were then concentrated to a volume of 28 µL and subjected to nucleobase conversion using the NEBNext enzymatic methyl-seq conversion module (NEB, E7125S) according to the manufacturer’s recommendations, except that the final elution step was performed in 40 µL of water. For M-H-Dyad-seq and H-H-Dyad-seq samples, nucleobase conversion was performed using the NEBNext enzymatic methyl-seq conversion module. Briefly, samples were first concentrated to a volume of 17 µL. Then 4 µL of formamide (Sigma-Aldrich, F9037-100ML) was added and the samples were heated to 85°C for 10 minutes before being quenched on ice. APOBEC nucleobase conversion was performed as described by the manufacturer except for two minor changes. Samples were incubated at 37°C for 16 hours, and the final elution step was performed using 40 µL of water.

[0111] For all Dyad-seq versions, the nucleobase-converted samples were subjected to one round of linear amplification. To do this, 9 µL of amplification mix was added (5.56x NEBuffer 2.1 (NEB, B7202S), 2.22 mM dNTPs (NEB, N0447L), and 2.22 µM Linear amplification 9-mer (5’-GCCTTGGCACCCGAGAATTCCANNNNNNNNN-3’ (SEQ ID No. 109))) and the samples were heated to 95°C for 45 seconds before being quenched on ice. Once cold, 100 U of high concentration Klenow DNA polymerase (3’-5’ Exo-) (Fisher Scientific, 50-305-912) was added. The samples were then quickly vortexed, centrifuged, and incubated at 4°C for 5 minutes, followed by an increase of 1°C every 15 seconds at a ramp rate of 0.1°C per second until the samples reached 37°C, which was then held for an additional 1.5 hours. Afterwards, a 1.1x AMPure XP bead cleanup was performed, and the samples were eluted in 40 µL of water before being concentrated down to 10 µL. The entire sample was then used in a linear PCR reaction by adding 15 µL of PCR mix (1.67x high-fidelity PCR mix (NEB, M0541L) and 0.67 mM Extended RPI primer (5’-AATGATACGGCGACCACCGAGATCTACACGTTCAGAGTTCTACAGTCCGACGATCGGTGTAGTGGGTTTGG-3’ (SEQ ID No. 110))) and performing PCR as follows: initial denaturing at 98°C for 30 seconds, followed by 16 cycles of 98°C for 10 seconds, 59°C for 30 seconds, and 72°C for 30 seconds, and a final extension step at 72°C for 1 minute. Next, 5 µL of the linear PCR product was amplified further in a standard Illumina library PCR reaction, incorporating a uniquely indexed i7 primer. The remaining linear PCR product was stored at -20°C. Two 0.825x AMPure XP bead cleanups were performed on the final sequencing library, with a final elution volume of 15 µL in water. The libraries were then quantified on an Agilent Bioanalyzer and Qubit fluorometer. Finally, libraries were subjected to paired-end 150 bp Illumina sequencing on a HiSeq platform.
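The stepped temperature ramp used for the Klenow extension in the linear amplification can be written out explicitly. The following Python sketch is illustrative only: the function name is hypothetical, and it assumes that each 15-second step includes the time spent ramping at 0.1°C per second, which the protocol does not state.

```python
def klenow_ramp_schedule(start_c=4, end_c=37, step_c=1, step_s=15,
                         initial_hold_s=5 * 60, final_hold_s=90 * 60):
    """Return a list of (temperature_C, duration_s) steps for the extension."""
    schedule = [(start_c, initial_hold_s)]  # 4 C for 5 minutes
    temp = start_c
    while temp < end_c:
        temp += step_c
        # Assumption: each 1 C increment occupies one 15-second step.
        schedule.append((temp, step_s))
    schedule.append((end_c, final_hold_s))  # hold at 37 C for 1.5 hours
    return schedule


schedule = klenow_ramp_schedule()
# Time spent in the stepped ramp, excluding the initial and final holds.
ramp_seconds = sum(duration for _, duration in schedule[1:-1])
print(len(schedule) - 2, "ramp steps,", ramp_seconds, "seconds")
```

Under these assumptions the ramp from 4°C to 37°C consists of 33 steps and takes 495 seconds, a little over 8 minutes, before the 1.5-hour hold.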

Bulk RNA-seq

[0112] Total RNA was extracted using TRIzol (Ambion, 15596018). 50 ng of total RNA was heated to 65°C for 5 minutes and returned to ice. Thereafter, it was combined with 9 µL of reverse transcription mix (20 U RNAseOUT (Invitrogen, 10777-019), 1.11x first strand buffer, 11.11 mM DTT, 0.56 mM dNTPs (NEB, N0447S), 100 U Superscript II (Invitrogen, 18064-071), and 25 ng of barcoded reverse transcription primer) and the sample was incubated at 42°C for 75 minutes, 4°C for 5 minutes, and 70°C for 10 minutes. Each replicate received a different barcoded reverse transcription primer. Afterwards, 50 µL of second strand synthesis mix (1.2x second strand buffer (Invitrogen, 10812-014), 0.24 mM dNTPs (NEB, N0447S), 4 U E. coli DNA Ligase (Invitrogen, 18052-019), 15 U E. coli DNA Polymerase I (Invitrogen, 18010-025), 0.8 U RNase H (Invitrogen, 18021-071)) was added to each sample and the samples were incubated at 16°C for 2 hours. The barcoded replicates were then pooled, and a 1x AMPure XP bead (Beckman Coulter, A63881) cleanup was performed, eluting in 30 µL of water, which was subsequently concentrated to 6.4 µL. The molecules were amplified with IVT and an Illumina sequencing library was prepared as described in CEL-seq2⁴². Libraries were sequenced on an Illumina HiSeq platform, obtaining 150 bp reads from both ends.

Bulk RNA-seq analysis

[0113] Bulk RNA-seq data reads were mapped to the RefSeq gene model based on the mouse genome release mm10, along with the set of 92 ERCC spike-in molecules (Ambion, 4456740).

[0114] DESeq2 was used for normalization and differential gene expression calling. Gene expression differences between each condition were evaluated using adaptive shrinkage to adjust the observed log fold change. For differential gene expression calling, an adjusted p-value cutoff of 0.01 and a shrunken log fold change cutoff of 0.75 were used. For visualization and clustering, variance stabilizing transformation was performed and batch effects from different reverse transcription primer barcodes were removed using the removeBatchEffect function in the limma package.

scDyad&T-seq

[0115] 4 µL of Vapor-Lock (QIAGEN, 981611) was manually dispensed into each well of a 384-well plate using a 12-channel pipette. All downstream dispensing into 384-well plates was performed using the Nanodrop II liquid handling robot (BioNex Solutions). To each well, 100 nL of uniquely barcoded reverse transcription primers (7.5 ng/µL), each containing a 6-nucleotide UMI, was added. Next, 100 nL of lysis buffer (0.175% IGEPAL CA-630, 1.75 mM dNTPs (NEB, N0447S), 1:1,250,000 ERCC RNA spike-in mix (Ambion, 4456740), and 0.19 U RNase inhibitor (Clontech, 2313A)) was added to each well. Single cells were sorted into individual wells of a 384-well plate using FACS and stored at -80°C. To begin processing, plates were heated to 65°C for 3 minutes and returned to ice. Next, 150 nL of reverse transcription mix (0.7 U RNAseOUT (Invitrogen, 10777-019), 2.33x first strand buffer, 23.33 mM DTT, and 3.5 U Superscript II (Invitrogen, 18064-071)) was added to each well and the plates were incubated at 42°C for 75 minutes, 4°C for 5 minutes, and 70°C for 10 minutes. Thereafter, 1.5 µL of second strand synthesis mix (1.23x second strand buffer (Invitrogen, 10812-014), 0.25 mM dNTPs (NEB, N0447S), 0.14 U E. coli DNA Ligase (Invitrogen, 18052-019), 0.56 U E. coli DNA Polymerase I (Invitrogen, 18010-025), 0.03 U RNase H (Invitrogen, 18021-071)) was added to each well and the plates were incubated at 16°C for 2 hours. Following this step, 650 nL of protease mix (6 µg protease (Qiagen, 19155), 3.85x NEBuffer 4 (NEB, B7004S)) was added to each well, and the plates were heated to 50°C for 15 hours, 75°C for 20 minutes, and 80°C for 5 minutes. Next, 500 nL of glucosylation mix (1 U T4-BGT (NEB, M0357L), 6x UDP-glucose, 1x NEBuffer 4) was added to each well and the plates were incubated at 37°C for 16 hours.
Thereafter, 500 nL of protease mix (2 µg protease, 1x NEBuffer 4) was added to each well, and the plates were incubated at 50°C for 3 hours, 75°C for 20 minutes, and 80°C for 5 minutes. Next, 500 nL of MspJI endonuclease mix (1x NEBuffer 4, 8x enzyme activator solution, 0.1 U MspJI (NEB, R0661L)) was added to each well and the plates were incubated at 37°C for 4.5 hours, and then heated to 65°C for 25 minutes. To each well, 280 nL of uniquely barcoded 250 nM unphosphorylated double-stranded Dyad-seq adapters was added. Next, 720 nL of ligation mix (1.39x T4 ligase reaction buffer, 5.56 mM ATP (NEB, P0756L), 140 U T4 DNA ligase (NEB, M0202M)) was added to each well, and the plates were incubated at 16°C for 16 hours. After ligation, uniquely barcoded reaction wells were pooled using a multichannel pipette, and the oil phase was discarded. The aqueous phase was incubated for 30 minutes with 1x AMPure XP beads (Beckman Coulter, A63881), and then subjected to standard bead cleanup with the DNA eluted in 30 µL of water. After vacuum concentrating the eluate to 6.4 µL, in vitro transcription (IVT) was performed as previously described in the scAba-seq and scMspJI-seq protocols. The entire IVT product was used for enrichment. For this, 4 µL of 1 µM biotinylated polyA primer (5’-AAAAAAAAAAAAAAAAAAAAAAAA/3BioTEG/-3’ (SEQ ID No. 111)) and 8 µL of Dynabeads MyOne Streptavidin C1 beads (Invitrogen, 65001) were used, with the beads resuspended in 24 µL of 2x B&W solution after establishing RNase-free conditions. In addition, the supernatant was saved for additional processing.

[0116] The supernatant from the RNA enrichment process contains unamplified barcoded scDyad-seq DNA molecules. A 1x AMPure XP bead cleanup was performed by incubating the samples with beads for 30 minutes and eluting in 40 µL of water. Samples were then concentrated to 28 µL and nucleobase conversion was performed as described above for bulk M-M-Dyad-seq.
Samples were then subjected to four rounds of linear amplification. The first round was the same as described for bulk Dyad-seq. In subsequent rounds, samples were first heated to 95°C for 45 seconds before being quenched on ice. Once cold, 5 µL of amplification mix was added (1x NEBuffer 2.1 (NEB, B7202S), 2 mM dNTPs (NEB, N0447L), 2 µM Linear amplification 9-mer, and 10 U of high concentration Klenow DNA polymerase (3’-5’ Exo-) (Fisher Scientific, 50-305-912)). Samples were then quickly vortexed and centrifuged, and the same thermocycler conditions were used as in the first round of linear amplification. After four rounds of linear amplification, sequencing libraries were prepared the same way as described for bulk Dyad-seq. Finally, 150 bp paired-end Illumina sequencing was performed on a HiSeq platform.
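Because reagents are dispensed sequentially into the same well, the fold concentration of each buffer in its mix can be checked against the volume accumulated in the well. The Python sketch below is a worked calculation using only the volumes and fold concentrations stated for the reverse transcription, second strand synthesis, and first protease steps of scDyad&T-seq. The function name is hypothetical, and the calculation assumes that the sorted cell, the Vapor-Lock oil phase, and evaporation contribute no aqueous volume.

```python
def final_fold(additions, index):
    """Fold concentration of the buffer added at `index` after dilution
    into everything dispensed up to and including that step."""
    total_nl = sum(volume for _, volume, _ in additions[: index + 1])
    _, volume_nl, fold = additions[index]
    return fold * volume_nl / total_nl


# (label, volume in nL, fold concentration of the named buffer in the mix)
additions = [
    ("RT primers", 100, 0.0),
    ("lysis buffer", 100, 0.0),
    ("RT mix, first strand buffer", 150, 2.33),
    ("second strand mix, second strand buffer", 1500, 1.23),
    ("protease mix, NEBuffer 4", 650, 3.85),
]

for i, (label, _, fold) in enumerate(additions):
    if fold:
        print(f"{label}: {final_fold(additions, i):.2f}x")
```

Under these assumptions each of the three buffers works out to approximately 1x in the well at the step where it is added (for example, 2.33x in 150 nL diluted into 350 nL).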

[0117] scDyad-seq is performed similarly to scDyad&T-seq, except that the initial reverse transcription and second strand synthesis steps are replaced with the equivalent volume of 1x NEBuffer 4. In addition, as the transcriptome is not captured, IVT is not performed and steps involving RNA enrichment and processing are omitted.

Dyad-seq analysis

[0118] Dyad-seq provides information on methylation or hydroxymethylation levels as well as information on 5mCpG or 5hmCpG maintenance levels. These two outputs of Dyad-seq were analyzed separately. To quantify 5mCpG maintenance levels, read 1 was trimmed to 86 nucleotides, and then exact duplicates were removed using Clumpify from BBTools. Next, reads containing the correct PCR amplification sequence and correct barcode were extracted. These reads were then trimmed using the default settings of TrimGalore. For mapping, Bismark was used in conjunction with Bowtie2 v2.3.5 to map to the mm10 build of the mouse genome. For experiments using K562 cells, the hg19 build of the human genome was used. After mapping, Bismark was used to further deduplicate samples based on UMI, cell barcode and mapping location. For libraries that were prepared using MspJI, a custom Perl script was used to identify 5mC positions based on the cutting preference of MspJI, and the methylation status of the opposing cytosine in a CpG or CpHpG dyad context was inferred from the nucleobase conversion. For libraries that were prepared using AbaSI, a custom Perl script was used to identify 5hmC positions based on the cutting preference of AbaSI, and the methylation status of the opposing cytosine in a CpG dyad context was inferred from the nucleobase conversion. To quantify absolute methylation or hydroxymethylation levels, the cell barcode and UMI were transferred from read 1 to read 2. Read 1 was trimmed using TrimGalore in paired-end mode. The 5’ end of read 1 was clipped by 20 bases and the 3’ end of read 2 was hard clipped 34 bases after detection of the PCR amplification sequence to remove potential bias arising from enzymatic digestion and to avoid recounting unmethylated, methylated or hydroxymethylated cytosines detected at CpG dyads. The 5’ end of read 2 was clipped by 9 bases to minimize potential bias arising from the linear amplification random 9-mer primer.
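As a rough illustration of the dyad-level inference described earlier in this paragraph, the Python sketch below classifies each CpG dyad from two observations: a 5mC call on one strand, taken from the MspJI cut position, and the base read at the opposing cytosine after nucleobase conversion. It is not the custom Perl script referred to here. The function names, the input layout, the convention that a retained C indicates a protected (methylated) cytosine while a T indicates a converted (unmethylated) one, and the definition of the maintenance level as the symmetric fraction of informative dyads are all assumptions made for illustration.

```python
from collections import Counter


def classify_dyad(opposing_base):
    """Classify a CpG dyad whose first strand carries an enzyme-detected 5mC.

    opposing_base: base read at the opposing cytosine after conversion
    (assumed: 'C' = protected/methylated, 'T' = converted/unmethylated).
    """
    base = opposing_base.upper()
    if base == "C":
        return "symmetric"
    if base == "T":
        return "hemimethylated"
    return "uninformative"  # for example, a sequencing error at the site


def maintenance_level(opposing_bases):
    """Fraction of informative dyads that are symmetrically methylated."""
    counts = Counter(classify_dyad(base) for base in opposing_bases)
    informative = counts["symmetric"] + counts["hemimethylated"]
    if informative == 0:
        return None  # no informative dyads covered
    return counts["symmetric"] / informative


calls = ["C", "C", "T", "C", "N"]
print(maintenance_level(calls))  # 3 symmetric of 4 informative dyads
```

The same structure would apply to AbaSI libraries, with the enzyme-detected mark being 5hmC instead of 5mC.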
The 3’ end of read 1 was likewise hard clipped 9 bases after the Illumina adapter was detected. Each read was mapped separately to mm10 using Bismark, and both resulting SAM files were deduplicated further using UMI, cell barcode and mapping location. The bismark_methylation_extractor tool was then used to extract the methylation status of detected cytosines. Next, a custom Perl script was used to demultiplex detected cytosines to the respective single cells based on the associated cell barcode. Thereafter, for cytosines detected in a CpG context, information from read 1 and read 2 was merged. Then, using UMIs, duplicate cytosine coverage resulting from overlapping paired-end reads or generated during the random priming step was removed. Cells for which fewer than 25,000 CpG sites were covered were discarded from downstream DNA methylation analysis. To cluster cells based on the methylome, hierarchical clustering was used and the optimal number of clusters was assigned using silhouette scores.

scDyad&T-seq gene expression analysis

[0119] Read 2 was trimmed using the default settings of TrimGalore. After trimming, STARsolo (STAR aligner version 2.7.8a) was used to map the reads to mm10 using the gene annotation file from Ensembl. The reads were again mapped to mm10 using the transposable elements annotation file described in TEtranscripts. Transcripts with the same UMI were deduplicated, and genes or transposable elements that were not detected in at least one cell were removed from any downstream analysis. The combined counts from genes and transposable elements for each cell were considered the expression profile of that cell and were used in downstream analysis.
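The step of combining gene and transposable element counts into a single per-cell expression profile can be sketched as follows. This Python example is a minimal illustration, not the pipeline used here: the function name, the nested-dictionary layout, and the feature names in the usage example are hypothetical, and the counts are assumed to be already UMI-deduplicated.

```python
def combine_counts(gene_counts, te_counts):
    """Merge per-cell gene and TE counts and drop undetected features.

    Both inputs map cell barcode -> {feature name: deduplicated UMI count}.
    """
    combined = {}
    for cell in set(gene_counts) | set(te_counts):
        profile = dict(gene_counts.get(cell, {}))
        for feature, count in te_counts.get(cell, {}).items():
            profile[feature] = profile.get(feature, 0) + count
        combined[cell] = profile
    # Keep only features detected (count > 0) in at least one cell.
    detected = {feature
                for profile in combined.values()
                for feature, count in profile.items() if count > 0}
    return {cell: {f: c for f, c in profile.items() if f in detected}
            for cell, profile in combined.items()}


# Illustrative input only; names and counts are made up.
genes = {"cell1": {"GeneA": 3, "GeneB": 0}, "cell2": {"GeneA": 1, "GeneB": 0}}
tes = {"cell1": {"TE_1": 2}, "cell2": {"TE_1": 0}}
profiles = combine_counts(genes, tes)
```

In this example GeneB is removed because it is not detected in any cell, while TE_1 is retained for both cells because it is detected in cell1.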

[0120] The standard analysis pipeline in Seurat (version 3.1.5) was used for single cell RNA expression normalization and analysis. Cells containing more than 500 genes and more than 2,000 unique transcripts were used for downstream analysis. The default NormalizeData function was used to log normalize the data. The top 1,000 most variable genes were used for computing principal components, and the elbow method was used to determine the optimal number of principal components for clustering. UMAP-based clustering was performed by running the following functions: FindNeighbors, FindClusters, and RunUMAP. To identify DEGs, the FindAllMarkers or FindMarkers function was used. The Wilcoxon rank sum test was used to classify a gene as differentially expressed, requiring a natural log fold change of at least 0.1 and an adjusted p-value of less than 0.05.
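The numeric cutoffs in this paragraph can be expressed as two simple predicates. The Python sketch below is only an illustration of the thresholds and does not reproduce Seurat, which is an R package. The function names are hypothetical, and applying the fold change cutoff to the absolute value (so that both up- and downregulated genes qualify) is an assumption.

```python
def passes_qc(n_genes, n_transcripts, min_genes=500, min_transcripts=2000):
    """Keep cells with more than 500 genes and more than 2,000 transcripts."""
    return n_genes > min_genes and n_transcripts > min_transcripts


def is_deg(ln_fold_change, adj_p, min_ln_fc=0.1, max_adj_p=0.05):
    """Apply the natural log fold change and adjusted p-value cutoffs.

    Assumption: the fold change cutoff is applied to the absolute value.
    """
    return abs(ln_fold_change) >= min_ln_fc and adj_p < max_adj_p


# Illustrative values only.
cells = {"cell1": (1200, 5400), "cell2": (480, 2500), "cell3": (900, 2000)}
kept = sorted(cell for cell, (g, t) in cells.items() if passes_qc(g, t))
print(kept)
```

Here cell2 fails the gene cutoff and cell3 fails the transcript cutoff because both comparisons are strict, so only cell1 is retained.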

Claims

WHAT IS CLAIMED IS:

1. A method to detect nucleobase modification on both strands of a double stranded nucleic acid molecule, the method comprising: digesting a double stranded nucleic acid molecule with a modification-dependent restriction nuclease to yield one or more double stranded nucleic acid molecule fragments; ligating an adapter nucleic acid molecule to at least one of the one or more double stranded nucleic acid molecule fragments; preparing the one or more ligated single stranded nucleic acid molecule fragments for sequencing, yielding one or more amplified double stranded nucleic acid molecule fragments; and sequencing the one or more amplified double stranded nucleic acid molecule fragments.

2. The method of claim 1, further comprising converting one or more nucleobases of the single stranded nucleic acid molecule fragments, wherein the conversion reaction results in conversion of only select modified or unmodified nucleobases.

3. The method of claim 1, further comprising denaturing the at least one double stranded nucleic acid molecule fragments to yield at least one single stranded nucleic acid molecule fragments.

4. The method of claim 1, wherein the double stranded nucleic acid molecule is double stranded DNA, double stranded RNA, or double stranded DNA/RNA hybrid.

5. The method of claim 1, wherein the modification-dependent restriction nuclease is Type IIM or Type IV.

6. The method of claim 1, wherein the modification-dependent restriction nuclease is MspJI, FspEI, LpnPI, AspBHI, RlaI, SgrTI, SgeI, SguI, AoxI, BisI, BlsI, GlaI, GluI, KroI, MteI, PcsI, PkrI, SauUSI, SauNewI, EcoKMcrA, ScoA3McrA, BanUMcrB, BanUMcrB3, EcoKMrr, BanUMrr, SepRPMcrR, ScoA3I, McrBC, mcrA, ScoA3II+III, YenY4I, MsiJI, McaZI, BwiMMI, EfaL9I, ScoA3IV, AbaUMB2I, Alai 76121, AspTB23I, Bce1273I, Bce95I, BceLI, BceYI, Bth171I, CbuDI, Dde51507I, Dsp20I, EcoBLMcrX, ElmI, Esp638I, KpnW2I, MspAK21I, NhoI, PaePS50I, Pam7902I, Pan13I, Pfl8569I, Pps170I, Pru45411, PspJDRII, PsuGI, RdeR2I, Rfl17I, Sde240I, Sve396I, ScoA3V, or engineered SRA-nicking domain fusion proteins.

7. The method of claim 1, wherein the modification-dependent restriction nuclease is AbaSI, PvuRts1I, PpeHI, AbaAI, AbaBGI, AbaCI, AbaDI, AbaHI, AbaTI, AbaUI, AcaPI, BbiDI, BmeDI, CfrCI, EsaMMI, EsaNI, Mte37I, PatTI, PfrCI, PxyI, YkrI, MspJI, FspEI, LpnPI, AspBHI, RlaI, SgrTI, SauUSI, McrBC, CmeDI, PspR81I, TspA15I, VcaM4I, YenY4I, MsiJI, VcaCI, MfoEI, MmaNI, RrhNI, Vsi48I, Vvu009I, McaZI, BwiMMI, or EfaL9I.

8. The method of claim 1, wherein the modification-dependent restriction nuclease is AbaSI, PvuRts1I, PpeHI, AbaAI, AbaBGI, AbaCI, AbaDI, AbaHI, AbaTI, AbaUI, AcaPI, BbiDI, BmeDI, CfrCI, EsaMMI, EsaNI, Mte37I, PatTI, PfrCI, PxyI, YkrI, GmrSD, CmeDI, PspR81I, TspA15I, or VcaM4I.

9. The method of claim 1, wherein the modification-dependent restriction nuclease is DpnI, ScoA3Mrr, MalI, CfuI, FtnUIV, Hsa13891I, Mph110311, Nani 957311, NgoAVI, NgoDXIV, NmeAII, NmeBL859I, NmuDI, NmuEI, NsuDI, SbgI, TdeI, or ScoA3V.

10. The method of claim 1, wherein the nucleobase modification to be detected is one or more of: 5-methylcytosine, 5-hydroxymethylcytosine, 5-glucosylhydroxymethylcytosine, 5-formylcytosine, 5-carboxylcytosine, N4-methylcytosine and N6-methyladenine.

11. The method of claim 1, wherein the double stranded nucleic acid molecule is derived from a single cell.

12. The method of claim 1, further comprising: extracting the double stranded nucleic acid molecule from a biological source; extracting RNA from the biological source; and sequencing the extracted RNA.

13. The method of claim 12, wherein the biological source is a population of cells.

14. The method of claim 12, wherein the biological source is a single cell.

15. A method to detect unmodified putative sites of nucleobase on at least one strand of a double stranded molecule, the method comprising: digesting a double stranded nucleic acid molecule with a restriction nuclease that is blocked by nucleobase modification to yield one or more double stranded nucleic acid molecule fragments; ligating an adapter nucleic acid molecule to at least one of the one or more double stranded nucleic acid molecule fragments; preparing the one or more ligated single stranded nucleic acid molecule fragments for sequencing, yielding one or more amplified double stranded nucleic acid molecule fragments; and sequencing the one or more amplified double stranded nucleic acid molecule fragments.

16. The method of claim 15, further comprising converting one or more nucleobases of the single stranded nucleic acid molecule fragments, wherein the conversion reaction results in conversion of only select modified or unmodified nucleobases.

17. The method of claim 15, further comprising denaturing the at least one double stranded nucleic acid molecule fragments to yield at least one single stranded nucleic acid molecule fragments.

18. The method of claim 15, wherein the double stranded nucleic acid molecule is double stranded DNA, double stranded RNA, or double stranded DNA/RNA hybrid.

19. The method of claim 15, wherein the restriction nuclease is AatII, AciI, AclI, AfeI, AgeI, AscI, AsiSI, AvaI, BceAI, BmgBI, BsaAI, BsaHI, BsiEI, BsiWI, BsmBI-v2, BspDI, BsrFI-v2, BssHII, BstBI, BstUI, ClaI, EagI, Esp3I, FauI, FseI, FspI, HaeII, HgaI, HhaI, HinP1I, HpaII, HpyCH4IV, Hpy99I, KasI, MluI, NaeI, NarI, NgoMIV, NotI, NruI, Nt.BsmAI, Nt.CviPII, PaeR7I, PluTI, PmlI, PvuI, RsrII, SacII, SalI, SfoI, SgrAI, SmaI, SnaBI, SrfI, TspMI, or ZraI.

20. The method of claim 15, wherein the restriction nuclease is AlwI, BclI, DpnII, HphI, MboI, or Nt.AlwI.

21. The method of claim 15, wherein the restriction nuclease is MspI.

22. The method of claim 15, wherein the restriction nuclease is HpaII, SmaI, or XmaI.

23. The method of claim 15, wherein the putative sites of nucleobase are CpG sites in nucleic acids derived from mammalian cells or GATC sites in nucleic acids derived from prokaryotic cells.

24. The method of claim 15, wherein the double stranded nucleic acid molecule is derived from a single cell.

25. The method of claim 15, further comprising: extracting the double stranded nucleic acid molecule from a biological source; extracting RNA from the biological source; and sequencing the extracted RNA.

26. The method of claim 25, wherein the biological source is a population of cells.

27. The method of claim 25, wherein the biological source is a single cell.

28. A kit for identification of nucleobase modification, comprising: one or more modification-dependent restriction nucleases or one or more restriction nucleases that are blocked by nucleobase modification or both; one or more enzymes for nucleobase conversion; and reagents for nucleic acid sequencing.

29. The kit of claim 28, wherein the one or more modification-dependent restriction nucleases is a Type IIM or Type IV nuclease.

30. The kit of claim 28, wherein the one or more enzymes for nucleobase conversion is an enzyme of the AID/APOBEC family.