WO2020009665A1

WO2020009665A1 - Method for single-cell transcriptome and accessible regions sequencing

Info

Publication number: WO2020009665A1
Application number: PCT/SG2019/050335
Authority: WO
Inventors: Jonathan Yuin Han LOH; Qiaorui XING
Original assignee: Agency For Science, Technology And Research
Priority date: 2018-07-06
Filing date: 2019-07-08
Publication date: 2020-01-09

Abstract

The invention relates to a method for generating genome-wide accessible chromatin region and transcriptome information from a single cell comprising the steps of: (a) providing a nucleic acid sample from an individual cell, the nucleic acid sample comprising genomic DNA and messenger RNA (mRNA); (b) producing transposed DNA from the sample, preferably by treating the sample with Tn5 transposase and EDTA for inactivation of Tn5; (c) producing complementary DNA (cDNA) from the sample; (d) separating transposed DNA from cDNA; (e) amplifying transposed DNA, preferably by PCR; (f) amplifying cDNA, preferably by PCR and preparing a cDNA library from amplified cDNA; and (g) optionally sequencing transposed DNA and cDNA.

Description

METHOD FOR SINGLE-CELL TRANSCRIPTOME AND ACCESSIBLE REGIONS SEQUENCING

FIELD

This invention relates to the fields of medicine, cell biology, molecular biology and genetics. This invention relates to the field of medicine.

BACKGROUND

The development of single cell technologies to understand cellular heterogeneity has exploded in recent years. There are many techniques which describe genome¹, transcriptome², chromatin accessibility³, DNA methylation⁴, chromatin conformation⁵, copy number variation⁶, lineage⁷ and cell surface protein detection⁸ at single cell level. scRNA-seq is a powerful technique which generates comprehensive information on the whole transcriptome within single cells, while scATAC-seq predicts novel cis- and trans- regulatory elements and transcription factors binding sites, providing new insights into cellular heterogeneity of the regulome³.

Methods have been developed for the combined detection of genome and transcriptome⁹,¹⁰, and of transcriptome and DNA methylome¹¹,¹², within a single cell. A recent study describes a method, named scNMT-seq¹³, which concurrently measures DNA methylation, chromatin accessibility and gene expression from a single cell.

However, as the chromatin accessibility data is extracted from the bisulfite sequencing libraries, it suffers from an extremely low mapping percentage, and mutations are acquired during the bisulfite treatment process. Moreover, due to the inherent nature of the technology, the isolation of genomic DNA and RNA content of a cell is required at the beginning of the protocol, which makes it difficult to adapt onto automated platforms.

Furthermore, scNMT-seq determines accessibility based on methylation levels rather than actual enrichment of reads. Hence, integrative analysis of transcriptomic and accessibility profiles using a pipeline such as NMF cannot be performed. Additionally, dependence on methylation levels makes the prediction of regulatory elements using software such as CICERO (cite) unattainable.

Another recent study describes a method, sci-CAR¹⁴, a combinatorial indexing-based assay jointly measuring the chromatin accessibility and gene expression in each of thousands of single cells. Due to its high throughput, the library suffers from sparse sequencing reads.

Buenrostro et al (2013) and Buenrostro et al (2015) described bulk and single cell ATAC-seq methods.

Goetz et al (2012) and Picelli et al (2015) developed single cell RNA-seq by which full-length transcripts of each single cell can be profiled.

SUMMARY

According to a 1st aspect of the present invention, we provide a method. The method may comprise (a) providing a nucleic acid sample from an individual cell. The nucleic acid sample may comprise genomic DNA and messenger RNA (mRNA). The method may comprise (b) obtaining chromatin associated DNA from genomic DNA in the sample. The method may comprise (c) obtaining complementary DNA (cDNA) from mRNA in the sample.

Steps (b) and (c) may be conducted simultaneously. The method may further comprise step (d) in which chromatin associated DNA is separated from complementary DNA (cDNA).

The method may further comprise step (e) in which chromatin associated DNA is amplified, for example by PCR.

The method may further comprise step (f) in which cDNA is amplified, for example by PCR.

Step (b) may comprise generating transposed DNA from chromatin associated DNA in the nucleic acid sample. The transposed DNA may be generated by cutting DNA with transposase. The transposase comprises a prokaryotic transposase. The transposase may comprise a hyperactive transposase. The transposase may comprise Tn5. Step (c) may comprise reverse transcription-polymerase chain reaction (RT-PCR).

Where chromatin associated DNA is separated from cDNA, the cDNA may be labelled with biotin. This may be achieved using biotinylated primers during polymerase chain reaction. The biotinylated cDNA may be separated from transposed DNA by binding to streptavidin, such as by binding to streptavidin beads.

The chromatin associated DNA may be sequenced. The cDNA may be sequenced. Both may be sequenced.

There is provided, according to a 2nd aspect of the present invention, a method of generating nucleic acid libraries from a plurality of individual cells. The method comprises performing a method as set out above on each of a plurality of individual cells.

The method may be performed independently and in parallel on each individual cell.

The method may comprise a step of separating cells in a sample comprising multiple cells to obtain a plurality of individual cells. The separation may be conducted by flow cytometry sorting. The separation may be conducted by manual picking to multi-well plates. The separation may be conducted by microfluidic chips to multi-chambers.

The plurality of individual cells may be comprised in a tissue sample.

We provide, according to a 3rd aspect of the present invention, a method as set out above for generating genome-wide accessible chromatin region and transcriptome information from a single cell.

As a 4th aspect of the present invention, there is provided a method for generating transcriptome and chromatin accessibility profiles of a cell or a plurality of cells. The method may comprise a method as set out above.

We provide, according to a 5th aspect of the present invention, a method comprising the steps of: (a) generating transcriptome and chromatin accessibility profiles of a first cell using a method as set out above; and (b) comparing the transcriptome and chromatin accessibility profiles so generated with transcriptome and chromatin accessibility profiles of a second cell, such as by using a method as set out above. The method may be used for detecting a disease state of a cell. The transcriptome and chromatin accessibility profiles of a first cell may be compared to those of a second cell known to be diseased. The disease may be cancer. The method may be used for identification of a cellular subgroup of a cell. The transcriptome and chromatin accessibility profiles of a first cell may be compared to those of a second cell known to be of a particular cellular subgroup.

The present invention, in a 6th aspect, provides a method of diagnosis of a disease in a patient. The method may comprise generating transcriptome and chromatin accessibility profiles of a first cell from a sample of the patient using a method as set out above. The method may comprise comparing the transcriptome and chromatin accessibility profiles so generated with transcriptome and chromatin accessibility profiles of a second cell known to be diseased. The method may comprise diagnosing a disease where the transcriptome and chromatin accessibility profiles of the first cell are similar to those of the second cell.
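The comparison of profiles described above can be illustrated with a minimal sketch. The text does not specify how similarity between profiles is measured; the use of Pearson correlation, the 0.8 threshold, the function names and the toy values below are all assumptions made purely for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("a profile with zero variance cannot be correlated")
    return cov / (sd_x * sd_y)


def profiles_similar(first, second, threshold=0.8):
    """Return True when BOTH profiles of two cells correlate above threshold.

    Each cell is a dict with hypothetical keys:
      "rna"  - expression values over a shared gene list
      "atac" - accessibility values over a shared peak list
    The threshold is an assumed, illustrative cut-off.
    """
    rna_r = pearson(first["rna"], second["rna"])
    atac_r = pearson(first["atac"], second["atac"])
    return rna_r >= threshold and atac_r >= threshold


# Toy values, invented for illustration only.
patient_cell = {"rna": [5.0, 0.1, 3.2, 0.0], "atac": [1, 0, 1, 0, 1]}
reference_cell = {"rna": [4.8, 0.3, 3.0, 0.1], "atac": [1, 0, 1, 0, 1]}
unrelated_cell = {"rna": [0.2, 4.9, 0.1, 3.5], "atac": [0, 1, 0, 1, 0]}
```

Requiring both layers to agree reflects the fact that the method yields paired transcriptome and accessibility profiles from the same cell; any other similarity measure could be substituted.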

In a 7th aspect of the present invention, there is provided a method which comprises the following steps in order: (a) providing a nucleic acid sample from an individual cell comprising genomic DNA and messenger RNA (mRNA); (b) producing transposed DNA from the sample, preferably by treating the sample with Tn5 transposase; (c) producing complementary DNA (cDNA) from the sample; (d) separating transposed DNA from cDNA; (e) amplifying transposed DNA, preferably by PCR; (f) amplifying cDNA, preferably by PCR and preparing a cDNA library from amplified cDNA; and (g) optionally sequencing transposed DNA and cDNA.

According to an 8th aspect of the present invention, we provide a kit comprising: (a) reagents for providing a nucleic acid sample from an individual cell comprising chromatin associated DNA and messenger RNA (mRNA); (b) reagents for obtaining chromatin associated DNA from genomic DNA in the sample, preferably comprising reagents for generating transposed DNA from chromatin associated DNA in the nucleic acid sample, such as transposase, for example a prokaryotic transposase, such as a hyperactive transposase, such as Tn5; (c) reagents for generating complementary DNA (cDNA) from mRNA in the sample; (d) reagents for separating transposed DNA from cDNA; (e) reagents for amplifying transposed DNA and/or cDNA, preferably by polymerase chain reaction (PCR) and reagents for preparing libraries from amplified nucleic acid; and (f) instructions for use.

Here, we describe an integrative Assay for Single-cell Transcriptome and Accessible Regions sequencing (ASTAR-seq) for the multi-omic analysis of whole transcriptome and epigenome accessibility at single-cell resolution. Multilayers of information collected by ASTAR-seq allow for the identification of regulatory regions and the genes being regulated, which contribute to the heterogeneity observed in a population, as well as the interaction of the regulatory elements with their surrounding genomic regions.

We developed ASTAR-seq (Assay for Single-cell Transcriptome and Accessibility Regions) and further integrated it with automated microfluidic chips for the parallel sequencing of transcriptome and chromatin accessibility within the same single cell. We profiled 192 mESCs and 2i cells, and 384 human cells including BJ, K562, JK1 and Jurkat. Coupled NMF analysis clustered single cells from various lineages distinctively based on both the transcriptome and chromatin accessibility profile. Analysis of epigenetic regulomes further identified key regulatory transcription factors responsible for the heterogeneity observed.

We disclose a method which enables the capture of both genome-wide accessible chromatin region and transcriptome information from the same single cell, on-bench and on automated microfluidic chips. The technique can be carried out on multiple platforms, depending on the single-cell isolation methods used. For example, single cells can be isolated by flow cytometry or manually picked using a mouth pipette and then transferred to a single tube or multi-well plate.

In addition, the technique was also adapted for automated and integrated microfluidic platforms, which enables us to profile 96 cells in a microfluidic chip.

In brief, single cell suspension was loaded onto microfluidic chips and individual cells were captured at different wells of the microfluidic chip.

Next, the wells were imaged under a microscope, and the wells containing a live single cell were noted for inclusion in downstream analysis. Following that, cell lysis and transposition of open chromatin regions were performed before reverse transcription of the mRNA to the stable cDNA, which was then labeled with biotin and separated from the transposed open chromatin by streptavidin beads. Lastly, ASTAR ATACseq libraries were derived by incorporating unique barcodes into the transposed DNA of each cell, and ASTAR RNAseq libraries of the same 96 cells were prepared in parallel.

The practice of this invention will employ, unless otherwise indicated, conventional techniques of chemistry, molecular biology, microbiology, recombinant DNA and immunology, which are within the capabilities of a person of ordinary skill in the art. Such techniques are explained in the literature. See, for example, J. Sambrook, E. F. Fritsch, and T. Maniatis, 1989, Molecular Cloning: A Laboratory Manual, Second Edition, Books 1-3, Cold Spring Harbor Laboratory Press; Ausubel, F. M. et al. (1995 and periodic supplements; Current Protocols in Molecular Biology, ch. 9, 13, and 16, John Wiley & Sons, New York, N.Y.); B. Roe, J. Crabtree, and A. Kahn, 1996, DNA Isolation and Sequencing: Essential Techniques, John Wiley & Sons; J. M. Polak and James O’D. McGee, 1990, In Situ Hybridization: Principles and Practice, Oxford University Press; M. J. Gait (Editor), 1984, Oligonucleotide Synthesis: A Practical Approach, IRL Press; D. M. J. Lilley and J. E. Dahlberg, 1992, Methods of Enzymology: DNA Structure Part A: Synthesis and Physical Analysis of DNA Methods in Enzymology, Academic Press; Using Antibodies: A Laboratory Manual: Portable Protocol NO. I by Edward Harlow, David Lane, Ed Harlow (1999, Cold Spring Harbor Laboratory Press, ISBN 0-87969-544-7); Antibodies: A Laboratory Manual by Ed Harlow (Editor), David Lane (Editor) (1988, Cold Spring Harbor Laboratory Press, ISBN 0-87969-314-2), 1855. Handbook of Drug Screening, edited by Ramakrishna Seethala, Prabhavathi B. Fernandes (2001, New York, NY, Marcel Dekker, ISBN 0-8247-0562-9); and Lab Ref: A Handbook of Recipes, Reagents, and Other Reference Tools for Use at the Bench, Edited Jane Roskams and Linda Rodgers, 2002, Cold Spring Harbor Laboratory, ISBN 0-87969-630-3. Each of these general texts is herein incorporated by reference.

BRIEF DESCRIPTION OF THE FIGURES

Figures 1A to 1E are drawings showing a workflow of ASTAR and its validation on-bench.

Figure 1A. Schematic of ASTAR technology. The top panel indicates the reagents required for each step. Bottom panel indicates the steps.

Figure 1B. Measurement of total RNA integrity on agarose gel. Left lane: total RNA without Tn5 treatment. Right lane: total RNA treated with Tn5 for 30 min, after which Tn5 was inactivated by EDTA.

Figure 1C. Barchart showing the relative enrichment of GAPDH and ACTB in the samples processed with the pipeline in Figure 1A, with or without addition of Tn5. The cDNA amount in the Tn5 treated sample was comparable to the non-treated sample. Error bar indicates SD, n=2.

Figure 1D. Barchart showing the relative expression of GAPDH when starting with the indicated number of cells. Error bar indicates SD, n=2.

Figure 1E. Barchart showing the relative enrichment of GAPDH (left) and total amount of ATAC-DNA (right) in supernatant and eluent of 1000 BJ cells processed with the pipeline in Figure 1A. Error bar indicates SD, n=2.

Figures 2A to 2D are drawings showing that ASTAR RNAseq library is of good quality.

Figure 2A. Quality control of ASTAR RNAseq libraries. Dotplot demonstrates the exon mapping percentage (x-axis) of each ASTAR RNAseq library, along with its corresponding detected gene rates (y-axis). Axes cross each other at the cut-off values that filter the libraries. Blue dots represent libraries that pass the QC filters.

Figure 2B. t-SNE clustering of BJ and K562 ASTAR RNAseq libraries shows clear distinction of the two cell lines.

Figure 2C. Reference Component Analysis (RCA) of ASTAR RNAseq libraries shows the correlation of BJ to fibroblast foreskin or smooth muscle cells, and K562 to leukemia or bone marrow cells.

Figure 2D. Principal Component Analysis (PCA) of K562 bulk RNAseq (dark blue, arrow head), K562 ASTAR RNAseq (light blue), BJ single cell RNAseq (light red) and BJ ASTAR RNAseq (dark red) libraries. BJ and K562 ASTAR RNAseq libraries display similar transcriptomic profile to conventional BJ and K562 RNAseq libraries respectively.

Figures 3A to 3D are drawings showing that analysis of ASTAR ATACseq library reveals cell-type specific chromatin architecture.

Figure 3A. Quality control of ASTAR ATACseq libraries. Dotplot demonstrates the library size (x-axis) of each ASTAR ATACseq library, along with its contribution to the ensembled HARs (y-axis). Black dots represent the cells that passed the QC filters.

Figure 3B. Histograms of insert size metrics of a BJ ASTAR ATACseq library revealing a nucleosomal pattern which is characteristic of a well prepared ATAC-Seq library. The histogram was generated using the “CollectInsertSizeMetrics” tool of Picard.

Figure 3C. Left: Correlation between K562 ASTAR ATACseq and published K562 scATAC-seq (Jason et al., 2015), R² is 0.894. Right: Correlation between BJ ASTAR ATACseq and published BJ scATAC-seq (Jason et al., 2015), R² is 0.804. The axes represent the accessibility level in each sample.

Figure 3D is a drawing showing a correlation heatmap of BJ single cell ATACseq libraries (JL-1 and JL-2), BJ ASTAR ATACseq libraries and K562 ASTAR ATACseq libraries based on calculated JASPAR motifs deviations in the HARs peaks. The side color bar (y-axis) indicates the samples that each ATACseq library belongs to.

Figures 4A to 4D are drawings showing an Assay for Single-cell Transcriptome and Accessible chromatin Region (ASTAR-seq).

Figure 4A is a drawing showing a Dotplot revealing the detected gene rate (%) in each mouse ASTAR RNAseq (y-axis) plotted against the rate of mapping to exons (x-axis). Blue dots represent the libraries which passed the QC whereas the grey dots represent the libraries that need to be filtered out. Source data are provided as a Source Data file.

Figure 4B is a drawing showing a line plot representing the coverage ratio (y-axis) of the mouse ASTAR RNAseq libraries over the genebodies of housekeeping genes at the indicated locations (x-axis).

Figure 4C is a drawing showing a dotplot revealing the proportion of fragments of each mouse ASTAR ATACseq library that falls in the HARs (y-axis) plotted against the size of each library (x-axis). The red dotted lines represent the threshold for each criterion. Source data are provided as a Source Data file.

Figure 4D is a drawing showing a histogram demonstrating the frequency (y-axis) of fragments that have the indicated insert size (x-axis).

Figures 5A to 5J are drawings showing transcriptomic and epigenetic heterogeneity of mouse 2i cells and E14 mESCs.

Figure 5A is a drawing showing a heatmap demonstrating the correlation levels of the mES and 2i ASTAR ATACseq libraries among themselves.

Figure 5B is a drawing showing an RNA velocity analysis of the mESC and 2i ASTAR RNAseq libraries based on the activity of the transcript inferred by the ratio of spliced and unspliced residuals. The velocity field is projected onto the tSNE plot. Arrows show the local average velocity. The color represents the velocity score and ranges from light blue (mESC-low velocity) to blue (mESC-high velocity), and from light red (2i-low velocity) to red (2i-high velocity).

Figure 5C is a drawing showing a heatmap revealing the correlation levels of the mESCs ASTAR RNAseq libraries with the indicated lineages in the mouse cell atlas. Colour ranges from dark grey (low correlation) to dark red (high correlation). 2-cell like mESCs are boxed with dotted line. Source data are provided as a Source Data file.

Figure 5D is a drawing showing a tSNE clustering of mES and 2i ASTAR ATACseq libraries (left) and ASTAR RNAseq (right) libraries based on the differential accessible chromatin regions and differentially expressed genes identified by NMF (Non-negative Matrix Factorization) clustering. NMF clusters cells based on the gene expression, chromatin accessibility, and the correlation between them.

Figure 5E is a drawing showing a heatmap demonstrating the highly accessible regions (left) and highly expressed genes (right) of each cluster identified by NMF clustering. The expression heatmap (right) demonstrates the corresponding genes whose expression correlates with the accessibility of the regions in the accessibility heatmap (left). The highly accessible and expressed genes of each cluster are indicated on the right. Expression and accessibility levels range from blue (low) to red (high). Source data are provided as a Source Data file.

Figure 5F is a drawing showing line plots indicating the Cicero co-accessibility links between the regions highlighted in red and the distal sites in the surrounding region. The height indicates the Cicero co-accessibility score between the connected peaks. The top sets of links are constructed from cells in cluster 1, while the bottom sets are built from cells in cluster 2. Dnmt3l (left) is highly expressed and accessible and has more interaction with the surrounding regions in cluster 1, whereas Scd1 (middle), GdfS and Dppa3 (right) are highly active and interactive in cluster 2.

Figures 5G and 5H are drawings showing interactome analysis indicating the top pathways enriched by the cluster 1 (Figure 5G) and cluster 2 (Figure 5H) specific genes, identified by NMF analysis.

Figures 5I and 5J are drawings showing a heatmap demonstrating the transcription factors motifs enriched on the identified cluster 1 (Figure 5I) and cluster 2 (Figure 5J) specific accessible regions. The expression fold change of the TFs in each cluster relative to the other is indicated on the right. Source data are provided as a Source Data file.

Figures 6A to 6F are drawings showing application of ASTAR-seq on Human Cell Lines.

Figure 6A is a drawing showing a correlation heatmap of BJ, JK1, K562 and Jurkat ASTAR ATACseq libraries based on the calculated JASPAR motifs deviations in the HARs peaks. The value of correlation ranges from blue (no correlation) to red (high correlation). The side color bar (y-axis) indicates the samples that each ATACseq library belongs to.

Figure 6B is a drawing showing a plot indicating the significantly variable motif sequences in terms of the accessibility from the BJ, JK1, K562 and Jurkat ASTAR ATACseq libraries. Y-axis represents the variability score assigned to each JASPAR motif whereas the x-axis represents the motif rank. The motifs are classified based on its enrichment among 4 cell lines. The box color is the same as cell line color code. Source data are provided as a Source Data file.

Figure 6C is a drawing showing a PCA clustering of BJ, JK1, K562 and Jurkat ASTAR ATACseq libraries based on the deviation scores of JASPAR motifs.

Figure 6D is a drawing showing a super-imposition of the motif enrichment scores for FOSL1, GATA1, ZBTB33, and TEAD3 on the PCA cluster in Figure 6C. Colour ranges from blue (no enrichment) to red (highly enriched).

Figure 6E is a drawing showing the following. Left: PCA clustering of ASTAR RNAseq libraries based on their correlation to the RCA panel. Right: Heatmap showing the lineages to which each cell correlates. Reference Component Analysis (RCA) of ASTAR RNAseq libraries shows the correlation of BJ to fibroblast foreskin or smooth muscle cells, K562 to leukemia or bone marrow cells, JK1 to erythroblast cells, and Jurkat to leukemia lymphoblast cells. Source data are provided as a Source Data file.

Figure 6F is a drawing showing UCSC screenshots indicating the chromatin accessibility levels (top panels) and the expression (bottom panels) of GATA1 (left) and SP1 (right) across the BJ, JK1, K562 and Jurkat cell lines.

Figures 7A to 7C are drawings showing validation of ASTAR on bench and quality control of ASTAR-Seq libraries.

Figure 7A is a drawing showing a heatmap showing the enrichment of ACTB after inactivating the Tn5 activity by different dosages of EDTA and quenching excess EDTA by variable amounts of MgCl₂. The schematic of the experimental design is shown on top. Source data are provided as a Source Data file.

Figure 7B is a drawing showing a boxplot showing the percentage of the fragments of each mouse ASTAR ATACseq library contributing to HAR (Highly Accessible Regions). Source data are provided as a Source Data file.

Figure 7C is a drawing showing an average enrichment profile of reads count per million mapped reads around all transcription start sites (TSS) of the genome with a window of -5K to 5K in a mESC.

Figures 8A to 8G are drawings showing heterogeneity of mESC and 2i cells in terms of chromatin accessibility and transcriptome.

Figure 8A is a drawing showing line plot showing the variability level (y-axis) of all mouse JASPAR motifs on the determined HARs of the mES and 2i cells. Source data are provided as a Source Data file.

Figure 8B is a drawing showing tSNE clustering of mouse ASTAR ATACseq libraries (left) and the superimposition of the deviation scores of Klf4 (middle) and Zfx (right) motifs on the tSNE cluster.

Figure 8C is a drawing showing PCA clustering of the mESCs ASTAR RNAseq libraries based on MCA analysis correlation levels. The dotted ellipse surrounds the 2C-like mESCs.

Figure 8D is a drawing showing a schematic of the 2C reporter (top). Bright field (top - left) and fluorescence (top - right) images of mESCs transfected with the 2C reporter are shown. The white arrows indicate the cells in which the 2C reporter was activated. Fluorescence images of siNT control mESCs (bottom - left) and GPa-depleted mESCs (bottom - right) transfected with the 2C reporter.

Figure 8E is a drawing showing UCSC screenshots demonstrating the chromatin accessibility and expression levels of the genes that are differentially accessible and expressed between NMF clusters of the mouse ASTAR-seq libraries. Dnmt3l is highly expressed and accessible in mESCs, while Scd1 is highly expressed and accessible in 2i cells.

Figures 8F and 8G are drawings showing the interactome analysis showing the strong interaction among the cluster 1 (Figure 8F) and cluster 2 (Figure 8G) specific genes, identified by NMF analysis, through the specified pathways.

Figures 9A to 9E are drawings showing quality control of human ASTAR-seq libraries.

Figure 9A is a drawing showing a dotplot revealing the detected gene rate (%) in each human ASTAR RNAseq (y-axis) plotted against the rate of mapping to exons (x-axis). Blue dots represent the libraries which passed the QC whereas the grey dots represent the libraries that need to be filtered out. Source data are provided as a Source Data file.

Figure 9B is a drawing showing a line plot representing the coverage ratio (y-axis) of the human ASTAR RNAseq libraries over the genebodies of housekeeping genes at the indicated locations (x-axis).

Figure 9C is a drawing showing a boxplot showing the percentage of the fragments of each human ASTAR ATACseq library contributing to HAR (Highly Accessible Regions). Source data are provided as a Source Data file.

Figure 9D is a drawing showing a dotplot revealing the proportion of fragments of each human ASTAR ATACseq library that falls in the HARs (y-axis) plotted against the size of each library (x-axis). The red dotted lines represent the threshold for each criterion. Source data are provided as a Source Data file.

Figure 9E is a drawing showing a super-imposition of the motif enrichment scores of NFYA and NFE2 on the PCA cluster in Figure 6C. Colour ranges from blue (no enrichment) to red (highly enriched).

DETAILED DESCRIPTION

When cells go through a developmental process or are exposed to external stimuli, heterogeneity is commonly observed. In such cases, studies on bulk populations provide only the average measurement of the heterogeneous populations, which is biased towards the dominant populations and masks the signals from the rare cells.
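The masking effect of bulk measurement can be made concrete with a small arithmetic sketch. The population sizes, the two gene labels and the expression values are hypothetical and chosen only to make the averaging visible.

```python
def bulk_average(cells):
    """Mean expression per gene across all cells, as a bulk assay would report."""
    n_genes = len(cells[0])
    return [sum(cell[g] for cell in cells) / len(cells) for g in range(n_genes)]


# Hypothetical population of 100 cells measured for two genes (A, B).
dominant = [[10.0, 0.0]] * 98  # 98 cells: gene A on, gene B off
rare = [[0.0, 10.0]] * 2       # 2 rare cells: gene A off, gene B on
population = dominant + rare

bulk = bulk_average(population)  # gene A averages 9.8, gene B averages 0.2

# Per-cell data still shows that some cells express gene B strongly.
cells_with_high_b = [cell for cell in population if cell[1] >= 5.0]
```

In the bulk value, gene B appears almost silent, although two cells express it at the same level at which the dominant cells express gene A; only the per-cell view recovers them.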

Others have used scRNA-seq for studying heterogeneity of gene transcription and scATAC-seq for studying the chromatin accessibility.

However, both techniques could only provide limited information of either transcriptome or chromatin accessibility alone. Combination of both techniques would enable multiple omics information from a single cell, and hence will be invaluable for the identification of rare cell types with distinct transcriptomic profile, and unique upstream transcription factors or chromatin modifiers driving it.

Due to the scarcity of material from a single cell, techniques which are concurrently able to evaluate the transcriptome and cis-regulatory accessible regions suffer from sparse reads. To address this technological gap, we developed the ASTAR platform, by which mRNA-Seq and ATAC-Seq libraries of high sensitivity may be obtained from the same single cell. Our method is described in more detail in the Examples.

The method described is adapted for automated high-throughput microfluidics chips. This has the additional advantage of minimizing the common undesired effects associated with manual cell picking or flow cytometry separation.

Moreover, our methods minimize cross contamination among samples to the lowest level and require small volumes of reagents for the entire process.

We disclose a method for the simultaneous profiling of transcriptome and accessible chromatin within the same single cell. Our method may be used as a diagnostic tool for the identification of cellular subgroups (e.g. cancer subtypes) from a mixed heterogeneous population, based on transcriptomic and chromatin accessibility profile.

We disclose a method for identifying the transcription factors or chromatin modifiers contributing to epigenetic heterogeneity within the mixed population (e.g. cancer cells). With the knowledge of the key transcription factors or chromatin modifiers driving each subtype, drugs for targeting these factors can be designed to eliminate each sub-population of cancer cells.

We call our method “Assay for Single-cell Transcriptome and Accessibility Regions (ASTAR-seq)”. Our method has a number of advantages.

Our method enables RNAseq libraries and ATACseq libraries with good quality to be obtained from one individual cell. Cross-contamination between ATACseq and RNAseq libraries is low. Finally, rare cell types can be identified based on transcriptomic profile and chromatin accessibility.

SIMULTANEOUS SEQUENCING OF TRANSCRIPTOME AND ACCESSIBLE CHROMATIN REGIONS

Our method provides for simultaneous sequencing of genome-wide transcriptome and accessible chromatin regions from the same single cell.

The method described here generates a nucleic acid library from a single cell. The nucleic acid library comprises two components: a first component which represents transcriptionally active (and hence accessible) chromatin regions of genomic DNA and a second component which represents the transcriptome of the single cell.

The method comprises first providing single cells. Libraries corresponding to (a) transcriptionally active chromatin and (b) expressed mRNA are prepared.

In step (a), DNA sequences corresponding to transcriptionally active chromatin are identified and isolated. This step involves isolating DNA sequences corresponding to accessible DNA regions of the genome of the single cell. At step (b), messenger RNA (mRNA) may be reverse transcribed into cDNA. A cDNA library may be prepared from expressed mRNA. Each of these sets of sequences may be amplified. In particular, the method may comprise the following in order: providing a nucleic acid sample from an individual cell, the nucleic acid sample comprising chromatin associated DNA and messenger RNA (mRNA); generating transposed DNA from chromatin associated DNA in the nucleic acid sample, such as by cutting DNA with transposase; generating cDNA, preferably by RT-PCR; separation of transposed DNA and cDNA; amplifying transposed DNA, such as by PCR; and further amplification of cDNA and preparation of cDNA library.
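The ordering of the steps listed above can be sketched as a simple sequence check. This is an illustrative model only: the `Step` names and the `follows_protocol_order` helper are hypothetical labels introduced here, not part of the disclosed method.

```python
from enum import IntEnum


class Step(IntEnum):
    """Hypothetical labels for the steps, numbered in the stated order."""
    PROVIDE_SAMPLE = 1           # nucleic acid sample from an individual cell
    GENERATE_TRANSPOSED_DNA = 2  # e.g. cutting DNA with transposase
    GENERATE_CDNA = 3            # preferably by RT-PCR
    SEPARATE_DNA_AND_CDNA = 4    # separation of transposed DNA and cDNA
    AMPLIFY_TRANSPOSED_DNA = 5   # such as by PCR
    AMPLIFY_CDNA_AND_LIBRARY = 6 # further cDNA amplification and library prep


def follows_protocol_order(performed):
    """Return True only if `performed` is every step, once, in the stated order."""
    return list(performed) == list(Step)


in_order = [Step(i) for i in range(1, 7)]
# Reverse transcription before transposition departs from the stated order.
swapped = [
    Step.PROVIDE_SAMPLE,
    Step.GENERATE_CDNA,
    Step.GENERATE_TRANSPOSED_DNA,
    Step.SEPARATE_DNA_AND_CDNA,
    Step.AMPLIFY_TRANSPOSED_DNA,
    Step.AMPLIFY_CDNA_AND_LIBRARY,
]
```

A record of a processing run could be passed through such a check before its libraries are accepted for analysis.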

SINGLE CELL ISOLATION

The method comprises preparing libraries from a single cell. Single cells may be prepared using any suitable method. For example, cells from a tissue sample may be dissociated and sorted into individual cells. Cultured cells may be separated by for example trypsinisation and sorted for example into individual chambers of a multi-well plate or by microfluidic chips to multi-chambers. Sorting may be done manually or by automated means, such as by FACS. Dead cells may be identified by for example staining with Trypan Blue, and discarded prior to the sorting and separating step.

The tissue sample may comprise a sample from any part of a body of an organism. Examples include tissues such as brain, breast, ovary, lung, colon, pancreas, testes, liver, muscle and bone tissues or from neoplastic growths derived from such tissues. For example, a sample may comprise a breast tissue, such as a breast tissue from an individual suspected to be suffering from breast cancer.

Single cells may be lysed for further processing by means known in the art, for example by detergent or by physical disruption such as sonication.

Cells may be derived from any suitable source. For example, the method may be conducted on animal cells, such as mammalian cells, mouse cells or human cells.

TRANSCRIPTIONALLY ACTIVE CHROMATIN

Sequences corresponding to transcriptionally active chromatin are identified. Such sequences may be isolated and preferably amplified.

Transcriptionally active chromatin may be known as accessible chromatin, active chromatin sequences, exposed chromatin or euchromatin. Eukaryotic chromatin has a dynamic, complex hierarchical structure and active gene transcription takes place on only a small proportion of it at a time. It is believed that active genes are packaged in an altered nucleosome structure and are associated with domains of chromatin that are less condensed or more open than inactive domains. Active genes are more sensitive to nuclease digestions and probably contain specific nonhistone proteins which may establish and/or maintain the active state. Controlled accessibility to regions of chromatin and specific sequences of DNA may be one of the primary regulatory mechanisms by which higher cells establish potentially active chromatin domains. Other control mechanisms may include compartmentalization of active chromatin to certain regions within the nucleus, perhaps to the nuclear matrix. Topological constraints and DNA supercoiling may influence the active regions of chromatin and be involved in eukaryotic genomic functions. Further, the chromatin structure of various DNA regulatory sequences, such as promoters, terminators and enhancers, appears to partially regulate transcriptional activity. [Adapted from Reeves R. (1984) Transcriptionally active chromatin. Biochim Biophys Acta. 1984 Sep 10;782(4):343-93]

Methods of determining genome-wide chromatin accessibility are known in the art. Such methods include for example MNase-seq (sequencing of micrococcal nuclease sensitive sites), FAIRE-seq, DNAse-seq or ATAC-seq (Buenrostro, Jason D; Giresi, Paul G; Zaba, Lisa C; Chang, Howard Y; Greenleaf, William J (2013). Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nature Methods. 10 (12): 1213-1218).

Any of these methods may be used to identify, isolate and amplify transcriptionally active chromatin.

For example, transposase may be employed to identify and isolate exposed chromatin. The transposase may comprise for example a prokaryotic transposase, such as a hyperactive transposase, such as Tn5.

ATAC-SEQ

Transcriptionally active chromatin may be identified by ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing). ATAC-seq identifies accessible DNA regions by probing open chromatin with hyperactive mutant Tn5 transposase that inserts sequencing adapters into open regions of the genome.

The mutant Tn5 transposase excises any sufficiently long DNA in a process called tagmentation. Tagmentation comprises simultaneous fragmentation and tagging of DNA performed by Tn5 transposase pre-loaded with sequencing adaptors.

An example protocol follows. Cells are lysed and open chromatin transposed by transposase Tn5. EDTA is added to release Tn5 from chromatin and to inactivate transposase activity. Tagged DNA fragments may be further processed, for example, by removing contaminants (dNTPs, salts, primers, primer dimers, etc) and further amplification by for example PCR and sequencing.

The library of tagged DNA fragments may be referred to as ATAC-DNA.

cDNA PREPARATION

The method includes a step of preparing a cDNA library from the single cell. This step may be performed at any stage, such as after identification and isolation of transcriptionally active chromatin.

Methods of preparing cDNA libraries from expressed messenger RNA are known in the art, such as RT-PCR. For example, reverse transcription may be initiated from poly-T primers. Double stranded cDNA may be amplified by for example polymerase chain reaction (PCR). The RNA strand may be digested using any suitable RNAse such as RNAse H.

The cDNA preparation step may incorporate modifications which enable or ease separation of cDNA from other components in the sample, for example from sequences corresponding to transcriptionally active chromatin (e.g., ATAC-DNA where ATAC-seq is employed). Such modifications may involve for example the use of tagged primers such as biotinylated primers. Amplified cDNA may then be separated by binding of tags to a suitable binding agent, such as by binding of biotinylated cDNA to streptavidin (such as streptavidin beads).

The separated cDNA may be further processed, such as by being sequenced.

SIMULTANEOUS PROCESSING

The steps of identifying transcriptionally active chromatin and cDNA preparation may be carried out in any order.

For example, the individual steps of the transcriptionally active chromatin identification phase and the cDNA preparation phase may be mixed and integrated.

The steps may comprise the following, in order:

• single cell isolation;

• cell lysis and identification of transcriptionally active chromatin (e.g., ATAC);

• transposase release and quenching of EDTA (where ATAC is employed);

• reverse transcription and amplification of cDNA (e.g., by PCR);

• separation of cDNA and transcriptionally active chromatin (i.e., transposed DNA where ATAC is employed); and

• further amplification (such as by PCR).

In an example protocol, the method may involve the following steps. Single cells are prepared such as described above and lysed.

Open chromatin of lysed cells is transposed by transposase Tn5. Transposase activity is inactivated and Tn5 released from chromatin, for example using EDTA.

Messenger RNA is bound to poly-T primers and reverse transcription carried out. Magnesium ions may be added to quench any excess EDTA from the Tn5 inactivation step. Double-stranded cDNA is amplified using polymerase chain reaction (PCR). The PCR makes use of biotinylated primers to incorporate biotin into amplified products. RNAse H is added to digest the RNA in the cDNA-RNA hybrid. ATAC-DNA and amplified cDNA are then harvested.

Biotinylated cDNA is then separated from open chromatin regions. This is done by adding streptavidin beads and allowing these to bind to the biotinylated cDNA. Supernatant containing chromatin accessible DNA (ATAC-DNA) is collected. The isolated open chromatin regions are amplified using suitable primers (corresponding to the tags in tagged-transposase) in a polymerase chain reaction. The amplified products are cleaned to remove contaminating nucleotides, salts, primers, primer dimers, etc. cDNA may be further amplified, tagmented and barcoded. Sequencing of isolated open chromatin regions and cDNA may also be carried out.

SEPARATION OF CHROMATIN ASSOCIATED DNA FROM CDNA

The method may comprise a step in which chromatin associated DNA is separated from complementary DNA (cDNA). The separation step may be carried out after obtaining chromatin associated DNA from genomic DNA in the sample and after obtaining complementary DNA (cDNA) from mRNA in the sample. The separation step may be carried out before any further amplification.

Separation of chromatin associated DNA from cDNA may be carried out by any suitable means, for example by tagging chromatin associated DNA or tagging cDNA or both. As an example, cDNA may be tagged with biotin and separated by exposing biotinylated cDNA to streptavidin, such as streptavidin beads. This may be achieved by use of biotinylated primers for reverse transcription and PCR, for example biotinylated poly-T primers or biotinylated PCR primers. Biotinylated cDNA may then be pulled down by use of streptavidin beads, as known in the art.

GENERATION OF NUCLEIC ACID LIBRARIES

We disclose a method of generating nucleic acid libraries from a number of individual cells.

The method comprises isolating the cells from each other, and performing a method as set out above on each of the individual cells. The methods as applied to individual cells may be carried out independently on each cell and in parallel.

The individual cells may be separated into wells of a multi-well plate. The separation may be carried out by flow cytometry sorting or manual picking to multi-well plates or by microfluidic chips to multi-chambers. In one example, the plurality of individual cells may be comprised in a tissue sample. The tissue may comprise any suitable tissue from a patient or subject, such as connective tissue, muscular tissue, nervous tissue, epithelial tissue, meristematic tissues, permanent tissues etc.

CELL PROFILING

The methods described in this document may be used to generate genome-wide accessible chromatin region and transcriptome information from a single cell. Such information from one cell may be compared to that of another cell, for various purposes. Information from multiple cells may be gathered and compared. Methods of cell profiling and comparison may comprise generating transcriptome and chromatin accessibility profiles of a first cell using the methods set out in this document. Transcriptome and chromatin accessibility profiles of a second cell may be obtained (again using the methods set out in this document). The transcriptome and chromatin accessibility profiles of the first cell may be compared to the transcriptome and chromatin accessibility profiles of the second cell.

The first cell or the second cell may be a reference cell, for example a cell of a known state. The other of the second cell or the first cell may be one of an unknown state.

The state may comprise a disease state, such as a cancer state. For example, the transcriptome and chromatin accessibility profiles of a first cell of unknown disease state may be compared to those of a second cell known to be diseased, such as cancerous. Where the transcriptome and chromatin accessibility profiles of the first cell are similar to those of the second diseased cell, then an inference may be drawn that the first cell is similarly diseased.

The state may comprise a cell type state, such as a particular cell type such as a particular cellular subgroup. For example, the transcriptome and chromatin accessibility profiles of a first cell of unknown cell type may be compared to those of a second cell known to be of a particular cell type or cellular subgroup. Where the transcriptome and chromatin accessibility profiles of the first cell are similar to those of the second cell, then an inference may be drawn that the first cell is of a similar cell type or cellular subgroup as the second cell.

We disclose a method of diagnosis of a disease in a patient, the method comprising: (a) generating transcriptome and chromatin accessibility profiles of a first cell from a sample of the patient using a method as set out above; and (b) comparing the transcriptome and chromatin accessibility profiles so generated with transcriptome and chromatin accessibility profiles of a second cell known to be diseased; and diagnosing a disease where the transcriptome and chromatin accessibility profiles of the first cell are similar to those of the second cell.

FURTHER ASPECTS

Further aspects and embodiments of the invention are now set out in the following numbered paragraphs; it is to be understood that the invention encompasses these aspects

Paragraph 1. A method of generating nucleic acid libraries from a cell, the method comprising the steps of: (a) providing a nucleic acid sample from an individual cell, the nucleic acid sample comprising chromatin associated DNA and messenger RNA (mRNA); (b) generating a library representing accessible regions of chromatin from DNA in the sample; and (c) generating a cDNA library from mRNA in the sample.

Paragraph 2. A method according to Paragraph 1, in which step (b) comprises a step (b1) of generating transposed DNA from chromatin associated DNA in the nucleic acid sample.

Paragraph 3. A method according to Paragraph 2, in which transposed DNA is generated by cutting DNA with transposase.

Paragraph 4. A method according to Paragraph 3, in which the transposase comprises a prokaryotic transposase, such as a hyperactive transposase, such as Tn5.

Paragraph 5. A method according to any preceding Paragraph, in which step (b) further comprises a step (b2) of amplifying transposed DNA, preferably by polymerase chain reaction (PCR).

Paragraph 6. A method according to any preceding Paragraph, in which step (c) comprises generating complementary DNA (cDNA) from messenger RNA by for example reverse transcription-polymerase chain reaction (RT-PCR).

Paragraph 7. A method according to any preceding Paragraph, which comprises the following steps in order: (b1) generating transposed DNA from chromatin associated DNA in the nucleic acid sample, preferably by cutting DNA with transposase; (c) generating cDNA, preferably by RT-PCR; and (b2) amplifying transposed DNA, preferably by PCR.

Paragraph 8. A method according to Paragraph 7, in which step (b1) comprises generating transposed DNA by cutting chromatin associated DNA with transposase and in which transposase is inactivated prior to step (c), such as by chelating magnesium ions, preferably by EDTA.

Paragraph 9. A method according to Paragraph 7 or 8, which further comprises a step of separating cDNA from transposed DNA between step (c) and step (b2).

Paragraph 10. A method according to Paragraph 9, in which cDNA is labelled with biotin and separated from transposed DNA by binding to streptavidin.

Paragraph 11. A method according to any preceding Paragraph, in which both the library from (b) and the library from (c) are sequenced.

Paragraph 12. A method of generating nucleic acid libraries from a plurality of individual cells, comprising performing a method according to any preceding Paragraph on each of a plurality of individual cells.

Paragraph 13. A method according to Paragraph 12, in which the method is performed independently and in parallel on each individual cell.

Paragraph 14. A method according to Paragraph 12 or 13, which further comprises a step of separating cells in a sample comprising multiple cells to obtain a plurality of individual cells, such as by flow cytometry sorting or manual picking to multi-well plates or by microfluidic chips to multi-chambers.

Paragraph 15. A method according to Paragraph 12, 13 or 14, in which the plurality of individual cells is comprised in a tissue sample.

Paragraph 16. A method according to any preceding Paragraph for generating genome-wide accessible chromatin region and transcriptome information from a single cell.

Paragraph 17. A method for generating transcriptome and chromatin accessibility profiles of a cell or a plurality of cells, the method comprising a method according to any preceding Paragraph.

Paragraph 18. A method comprising the steps of: (a) generating transcriptome and chromatin accessibility profiles of a first cell using a method according to any preceding Paragraph; and (b) comparing the transcriptome and chromatin accessibility profiles so generated with transcriptome and chromatin accessibility profiles of a second cell.

Paragraph 19. A method according to Paragraph 18 for detecting a disease state of a cell, in which the transcriptome and chromatin accessibility profiles of a first cell are compared to those of a second cell known to be diseased, preferably in which the disease is cancer.

Paragraph 20. A method according to Paragraph 18 for identification of a cellular subgroup of a cell, in which the transcriptome and chromatin accessibility profiles of a first cell are compared to those of a second cell known to be of a particular cellular subgroup.

Paragraph 21. A kit for generating nucleic acid libraries from a cell, the kit comprising (a) reagents for providing a nucleic acid sample from an individual cell comprising chromatin associated DNA and messenger RNA (mRNA); (b) reagents for generating a library representing accessible regions of chromatin from DNA in the sample; preferably comprising: (b1) reagents for generating transposed DNA from chromatin associated DNA in the nucleic acid sample, such as transposase, for example a prokaryotic transposase, such as a hyperactive transposase, such as Tn5 and (b2) reagents for amplifying transposed DNA, preferably by polymerase chain reaction (PCR); (c) reagents for generating a cDNA library from mRNA in the sample; and (d) instructions for use.

EXAMPLES

Example 1. Materials and Methods - ASTAR Step-by-step Protocol

Reagents

Trypsin-EDTA (0.25%), phenol red (Thermo Fisher Scientific; Catalog No.: 25200056)

ACCUTASE™ (STEMCELL Technologies; Catalog No.: #07920)

Trypan Blue Solution, 0.4% (Thermo Fisher Scientific; Catalog No.: 15250061)

LIVE/DEAD™ Viability/Cytotoxicity Kit, for mammalian cells (Thermo Fisher Scientific; Catalog number: L3224)

10% NP40 (Sigma; Catalog No.: 11332473001)

RNasin® Plus RNase Inhibitor (Promega; Catalog No.: N2115)

dNTP Mix (25 mM each) (Thermo Fisher Scientific; Cat. No.: R1121)

SuperScript™ IV Reverse Transcriptase (Thermo Fisher Scientific; Cat. No.: 18090010)

0.5M EDTA, pH 8.0, Biotechnology Grade, 100ml (1st BASE; Cat. No.: BUF-1053-100ml-pH8.0)

MgCl₂ (1 M) (Ambion; Cat. No.: AM9530G)

NEBNext® Ultra™ II Q5® Master Mix (NEB; Cat. No.: M0544L)

Dynabeads™ MyOne™ Streptavidin C1 (Thermo Fisher Scientific; Cat. No.: 65001)

Agencourt AMPure XP, 60 mL (Beckman Coulter; Cat. No.: A63881)

3.0M Sodium Acetate Solution, pH 5.2, Biotechnology Grade, 1L (1st BASE; BUF-H5l-lL-pH5.2)

Nuclease-Free Water (Ambion; Cat. No.: AM9937)

Open App™ Reagent Kit (Catalog No.: 100-8920)

Nextera® DNA Sample Preparation Kit (Illumina; Cat. No.: FC-121-1031)

Nextera® XT DNA Sample Preparation Kit (96 Samples) (Illumina; Cat. No.: FC-131-1096)

Nextera® XT Index Kit v2 Set B (96 Indices, 384 Samples) (Illumina; Cat. No.: FC-131-2002)

Nextera® XT Index Kit v2 Set C (96 Indices, 384 Samples) (Illumina; Cat. No.: FC-131-2003)

MinElute PCR Purification Kit (250) (QIAGEN; Cat. No.: 28006)

High Sensitivity DNA Kit (Chips & Reagents) (Agilent Technologies; Cat. No.: 5067-4626)

Reagent Setup

Bio-C1-P2-T31: Order the oligo in HPLC grade, and dissolve in Nuclease-Free H₂O at a concentration of 100µM. The sequence is as follows:

/5BiosG/GGCGACAACACCGATTGATCACGTTTTTTTTTTTTTTTTTTTTTTTTT

C1-P2-RNA-TSO: Order the TSO oligo in HPLC grade, and dissolve in Nuclease-Free H₂O at a concentration of 100µM. The sequence is as follows:

rGrGrCrGrArCrArArCrArCrCrGrArUrUrGrArUrCrArGrGrG

Bio-C1-P2-PCR2: Order the oligo in HPLC grade, and dissolve in Nuclease-Free H₂O at a concentration of 100µM. The sequence is as follows:

/5BiosG/GGCGACAACACCGATTGATCA

C1-P2-PCR2: Order the oligo in PCR grade or HPLC grade, and dissolve in Nuclease-Free H₂O at a concentration of 100µM. The sequence is as follows:

/GGCGACAACACCGATTGATCA
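The oligos listed above appear to share a common 5' handle, which would explain how a single PCR primer can amplify cDNA primed at one end by the poly-T oligo and extended at the other end by the template-switching oligo. The following Python sketch checks this using only the sequences printed above (5' modifications omitted); the function names are illustrative and not part of the protocol.

```python
import re

# Sequences copied from the Reagent Setup list (5' modifications omitted).
BIO_C1_P2_T31 = "GGCGACAACACCGATTGATCACGTTTTTTTTTTTTTTTTTTTTTTTTT"
C1_P2_RNA_TSO = "rGrGrCrGrArCrArArCrArCrCrGrArUrUrGrArUrCrArGrGrG"
C1_P2_PCR2 = "GGCGACAACACCGATTGATCA"


def rna_notation_to_dna(seq: str) -> str:
    """Convert 'rGrGrC...' ribonucleotide notation to a DNA-alphabet string."""
    bases = re.findall(r"r([ACGU])", seq)
    return "".join(bases).replace("U", "T")


def shares_handle(oligo: str, handle: str) -> bool:
    """Return True if the oligo starts with the given 5' handle."""
    return oligo.startswith(handle)


tso_as_dna = rna_notation_to_dna(C1_P2_RNA_TSO)

# Both the poly-T primer and the TSO begin with the PCR primer sequence.
print(shares_handle(BIO_C1_P2_T31, C1_P2_PCR2))
print(shares_handle(tso_as_dna, C1_P2_PCR2))

# What remains of the TSO after the shared handle (its 3' end).
print(tso_as_dna[len(C1_P2_PCR2):])
```

Under this reading, the biotinylated and non-biotinylated PCR primers differ only in the 5' biotin, so either can prime both ends of the cDNA.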

2X Binding and Washing buffer (2X B&W buffer): 2M NaCl, 10mM Tris-HCl (pH 7.5), 0.02% Tween-20.

1X Binding and Washing Buffer (1X B&W buffer): 1M NaCl, 5mM Tris-HCl (pH 7.5), 0.01% Tween-20.

1X TE buffer: 10mM Tris-HCl (pH 7.5), 0.02% Tween-20.
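The 1X B&W buffer listed above is the 2X buffer at half strength, so beads resuspended in 2X B&W buffer and mixed with an equal volume of sample end up at the 1X binding condition. A minimal Python sketch of that arithmetic follows; the dictionary keys and the 10 + 10 volume example are illustrative assumptions, not values prescribed by the buffer recipe itself.

```python
import math

# Compositions as listed in the Reagent Setup.
B_AND_W_2X = {"NaCl_M": 2.0, "Tris_HCl_mM": 10.0, "Tween20_pct": 0.02}
B_AND_W_1X = {"NaCl_M": 1.0, "Tris_HCl_mM": 5.0, "Tween20_pct": 0.01}


def dilute(stock: dict, stock_volume: float, diluent_volume: float) -> dict:
    """Concentrations after mixing stock with a diluent that contributes
    none of the listed components (a simplifying assumption)."""
    factor = stock_volume / (stock_volume + diluent_volume)
    return {name: value * factor for name, value in stock.items()}


# Equal volumes of 2X buffer and sample (volumes in any consistent unit).
mixed = dilute(B_AND_W_2X, stock_volume=10.0, diluent_volume=10.0)
matches_1x = all(math.isclose(mixed[k], B_AND_W_1X[k]) for k in B_AND_W_1X)
print(mixed, matches_1x)
```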

Procedure

I. Isolation of single cells

1). Cells were treated with dissociation enzymes, such as Trypsin and Accutase, to obtain single cells. Note: conditions of enzymatic treatments, such as temperature and duration, need to be optimized for the cells of your interest.

2). Cell pellets were washed once with 1X PBS.

3). Single cells were then distributed to multi-wells/chambers (96-well PCR plate or 384-well PCR plate; and microfluidic chips) at a volume lower than 0.27 µL. Distribution of single cells to multi-wells (96-well/384-well) can be achieved by flow cytometry sorting or manual picking by mouth pipette. Distribution to multi-chambers can be performed on a microfluidic chip. Details are as follows:

3.1). Cell pellets were resuspended at a concentration of 300-600 cells/µL.

3.2). Optional: Cell viability was measured by staining with trypan blue dye on a TC20 Automated Cell Counter.

3.3). Chips were primed using the ‘ASTAR: Prime (1861c/1862c/1863c)’ script.

3.4). Single cell suspensions at a concentration of 300-600 cells/µL were mixed with Cell Suspension Reagent at a ratio of 3:2.

3.5). Cell mix and reagents were loaded onto the respective wells of the microfluidic chip according to the loading map of the ‘ASTAR: Cell Load and stain (1861c/1862c/1863c)’ script.

Note: If you don’t intend to stain cells with the Live/Dead Viability dyes on chip, load cell mix and reagents onto the respective wells of the microfluidic chip according to the loading map of the ‘ASTAR: Cell Load (1861c/1862c/1863c)’ script.

II. ASTAR reactions

4). Lysis and tagmentation

Mix 0.54 µL lysis/ATAC mix (0.15% NP40, 1.5x Tagment DNA Buffer, 1.5x Nextera® Tagment DNA Enzyme 1, 1.5 U RNasin® Plus RNase Inhibitor, 1.5x C1 No Salt Loading Reagent (if using microfluidic chip)) with 0.27 µL single cell solution, and incubate at 37°C for 30 min. In this step, cell lysis and tagmentation of the open chromatin were carried out.

Note: 2xTD and Tn5 transposase can be found in Nextera® DNA Sample Preparation Kit.

5). EDTA quench

Mix with 0.54 µL EDTA inactivation mix (8.75mM dNTP mix, 8.5µM Bio-C1-P2-T31 primer (Table E1 below), 2.6U RNasin® Plus RNase Inhibitor, 41.25mM DTT, 18.75mM EDTA, 1x C1 No Salt Loading Reagent (if using microfluidic chip)), and incubate at 50 °C for 30 min for inactivation of Tn5, and 72 °C for 3 min, 4°C for 10 min, and 25°C for 1 min for priming of poly T primers for reverse transcription.

Table E1. Primers

6). Reverse transcription

Mix with 0.54 µL reverse transcription mix (2.93x SSIV Buffer (Invitrogen), 3.5U RNasin® Plus RNase Inhibitor, 35U SuperScript™ IV Reverse Transcriptase (Invitrogen), 7.995µM C1-P2-RNA-TSO primer (Table E1 above), 22.05mM MgCl₂, 1x C1 No Salt Loading Reagent (if using microfluidic chip)). Reverse transcription was carried out at 50°C for 60 min, followed by heat inactivation of SSIV reverse transcriptase at 80°C for 10 min. In the reverse transcription mix, MgCl₂ was added to quench excess EDTA from the previous inactivation step.

7). cDNA amplification

Mix with 8.1 µL cDNA-PCR mix (1.233x NEBNext® Ultra™ II Q5 PCR master mix, 1.042µM Bio-C1-P2-PCR-2 primer (Table E1 above), 1x C1 No Salt Loading Reagent (if using microfluidic chip)). cDNA was amplified using the following conditions: 98°C for 3 min; and 5 cycles at 98°C for 20s, 58°C for 4min, and 68°C for 6min, with a final extension step at 72°C for 10 min. The use of the biotinylated primer (Bio-C1-P2-PCR-2) in the cDNA-PCR mix ensured that the cDNA was labelled with biotin, which enabled the segregation of cDNA from the ATAC-DNA by pulldown using streptavidin beads.

Note: If using microfluidic chip, load the above reagent mixes (steps 4-7) onto the respective wells of the microfluidic chips with the required volumes according to the loading map of the ‘ASTAR: ASTAR (1861c/1862c/1863c)’ script.

SAFE STOPPING POINT
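The reagent additions in steps 4 to 7 are cumulative, so the concentration factors quoted for each mix only make sense relative to the running reaction volume. The Python sketch below tallies the volumes and checks a few of the quoted factors. It assumes all volumes are in microlitres and that the EDTA and MgCl₂ figures are millimolar; it also ignores any magnesium contributed by the Tagment DNA Buffer or SSIV Buffer, so the EDTA comparison is indicative only.

```python
import math
from dataclasses import dataclass


@dataclass
class Addition:
    name: str
    volume_ul: float


# Volumes as given in steps 4 to 7 (assumed to be microlitres).
ADDITIONS = [
    Addition("single cell solution", 0.27),
    Addition("lysis/ATAC mix", 0.54),
    Addition("EDTA inactivation mix", 0.54),
    Addition("reverse transcription mix", 0.54),
    Addition("cDNA-PCR mix", 8.1),
]


def running_volumes(additions):
    """Return {addition name: total reaction volume after that addition}."""
    total = 0.0
    totals = {}
    for addition in additions:
        total += addition.volume_ul
        totals[addition.name] = total
    return totals


def final_concentration(stock, added_volume, total_volume):
    """Concentration of a component after it is diluted into the reaction."""
    return stock * added_volume / total_volume


volumes = running_volumes(ADDITIONS)
total_ul = volumes["cDNA-PCR mix"]

# 1.5x Tagment DNA Buffer becomes ~1x once mixed with the cell solution.
td_buffer_x = final_concentration(1.5, 0.54, volumes["lysis/ATAC mix"])
# 1.233x Q5 master mix becomes ~1x in the final reaction.
q5_x = final_concentration(1.233, 8.1, total_ul)

# mM x uL = nmol: magnesium added in step 6 versus EDTA added in step 5.
edta_nmol = 18.75 * 0.54
mg_nmol = 22.05 * 0.54

print(f"final volume: {total_ul:.2f} uL")
print(f"Tagment DNA Buffer: {td_buffer_x:.2f}x, Q5 master mix: {q5_x:.2f}x")
print(f"EDTA {edta_nmol:.2f} nmol vs added Mg {mg_nmol:.2f} nmol")
```

The final volume of about 10 µL agrees with the 10 µL of step 7 product used in the bead separation step, and the added magnesium slightly exceeds the EDTA, consistent with its stated role of quenching excess EDTA.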

8). Separation of cDNA and ATAC-DNA

8.1). Streptavidin Beads Preparation: streptavidin beads (5µl per cell) were aliquoted into 1.5mL micro-centrifuge tubes. Wash the streptavidin beads twice with 1mL 2X B&W buffer, each for 5min at RT, and resuspend with 2X volumes of 2X B&W buffer (10µl per cell).

8.2). Separation:

8.2.1). Mix the beads with 10 µL of step 7 product, and incubate at RT for 20min.

8.2.2). Transfer supernatant to a new plate and proceed to step 19.

8.2.3). Wash the beads twice with 100 µL 1X B&W buffer, each for 5min at RT.

8.2.4). Wash with 100 µL 1X TE buffer and immediately remove as much as possible.

8.2.5). Add 10 µL of Nuclease-Free H₂O to each well and proceed to step 9.

9). cDNA amplification:

9.1). Mix 15 µL cDNA PCR mix with 10 µL step 8.2.5 product.

9.2). Run the following PCR program:

SAFE STOPPING POINT

10). Purification of cDNA:

10.1). Mix 25 µL step 9 product with 22.5 µL AMPure XP beads, and incubate at RT for 10min.

10.2). Wash AMPure XP beads twice with 180 µL freshly prepared 75% ethanol, each for 30s.

10.3). Air dry for 10-15 min.

10.4). Elute the AMPure XP beads with 10 µL Nuclease-Free H₂O.

III. mRNA-Seq Library Preparation:

11). Measure the cDNA concentration by Agilent High Sensitivity Bioanalyzer chip.

12). Calculate the average concentration of the cDNA and the dilution factor to reach the average concentration between 150-200 pg/µL.

13). Dilute the cDNA from step 10 by the calculated dilution factor.

14). Mix 3.75 µL tagmentation mix with 1.25 µL of the diluted cDNA, and incubate at 55 °C for 10min.

15). Once the reaction finished, immediately add 1.25 µL NT Buffer to each well to neutralize the tagmentation.

16). Add 6.25 µL amplification mix to each well, and run the following PCR program.

Note: Nextera XT Index kit A, B, C, D are available. Each Index kit contains 12 N7 indexes and 8 S5 indexes (12x8=96). Load the same N7 index per column, and load the same S5 index per row.
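The index loading rule in the note above (one N7 index per column, one S5 index per row) gives every well of a 96-well plate a unique index pair. A short Python sketch of that layout follows; the index labels are placeholders and do not correspond to actual kit index names.

```python
# Placeholder labels; real kits use their own index names.
N7_INDEXES = [f"N7_{i:02d}" for i in range(1, 13)]  # 12 indexes, one per column
S5_INDEXES = [f"S5_{i:02d}" for i in range(1, 9)]   # 8 indexes, one per row
ROWS = "ABCDEFGH"


def plate_layout(n7_indexes, s5_indexes):
    """Map each well of a 96-well plate to its (N7, S5) index pair."""
    layout = {}
    for row_number, row_letter in enumerate(ROWS):
        for column_number in range(12):
            well = f"{row_letter}{column_number + 1}"
            layout[well] = (n7_indexes[column_number], s5_indexes[row_number])
    return layout


layout = plate_layout(N7_INDEXES, S5_INDEXES)
unique_pairs = set(layout.values())
print(len(layout), len(unique_pairs))
print(layout["A1"], layout["H12"])
```

Because each pair is unique, libraries from all 96 wells can be pooled before purification and sequencing and still be assigned back to their wells.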

SAFE STOPPING POINT

17). mRNAseq Library Purification 1 (AMPure XP beads purification)

17.1). Pool all the mRNAseq libraries to a 1.5mL micro-centrifuge tube.

17.2). Mix with 1.5X volume of AMPure XP and incubate at RT for 10min.

17.3). Wash AMPure XP beads twice with 180 µL freshly prepared 75% ethanol, each for 30s.

17.4). Air dry for 10-15 min.

17.5). Elute the AMPure XP beads with 50 µL Nuclease-Free H₂O.

SAFE STOPPING POINT

18). mRNAseq Library Purification 2 (AMPure XP beads purification)

18.1). Mix 50 µL mRNAseq library (step 17) with 75 µL AMPure XP beads, and incubate at RT for 10min.

18.2). Wash AMPure XP beads twice with 180 µL freshly prepared 75% ethanol, each for 30s.

18.3). Air dry for 10-15 min.

18.4). Elute the AMPure XP beads with 23 µL Nuclease-Free H₂O.
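The AMPure XP clean-ups above are specified either as absolute volumes or as a bead-to-sample ratio. The Python sketch below converts between the two, assuming the volumes in steps 10.1 and 18.1 are microlitres; the pooled volume in the last example is hypothetical.

```python
import math


def bead_ratio(bead_volume_ul: float, sample_volume_ul: float) -> float:
    """Bead-to-sample volume ratio, e.g. 1.5 for a '1.5X' clean-up."""
    return bead_volume_ul / sample_volume_ul


def bead_volume(ratio: float, sample_volume_ul: float) -> float:
    """Bead volume needed to reach a given ratio for a sample volume."""
    return ratio * sample_volume_ul


cdna_ratio = bead_ratio(22.5, 25.0)     # step 10.1
library_ratio = bead_ratio(75.0, 50.0)  # step 18.1
# Step 17.2 gives a ratio only; 200 uL is a hypothetical pooled volume.
pooled_beads_ul = bead_volume(1.5, 200.0)

print(cdna_ratio, library_ratio, pooled_beads_ul)
```

On this reading the cDNA clean-up uses a lower ratio (0.9X) than the library clean-ups (1.5X).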

IV. ATAC-Seq Library Preparation:

19). Mix 90 µL ATAC PCR Mix with 10 µL of step 8.2.2 product, and run the following PCR program.

SAFE STOPPING POINT

20). ASTAR-ATACseq Library Purification 1 (Ethanol Precipitation)

20.1). Collect the ASTAR ATACseq libraries (step 19) to a 50 mL Falcon Tube and mix with the ethanol precipitation mix.

20.2). Keep the Falcon tube at -80 °C overnight.

20.3). Centrifuge at 14000 RPM for 20min at 4°C.

20.4). Wash the DNA pellet twice with 1mL 75% ethanol, and spin at 14000 RPM for 5min at 4 °C.

20.5). Air dry for 20min.

20.6). Elute the DNA pellet with 200 µL Nuclease-Free H₂O by pipetting until the solution is clear.

Note: To increase the dissolution of DNA pellet, ATACseq library can be further incubated at 37 °C for 10min.

SAFE STOPPING POINT

21). ASTAR-ATACseq Library Purification 2 (MinElute column purification)

21.1). Purify the ATACseq library (step 20.6) using the Qiagen MinElute PCR Purification Kit.

21.2). Elute ATACseq library in 50 µL elution buffer.

SAFE STOPPING POINT

22). ASTAR-ATACseq Library Purification 3 (AMPure XP beads purification)

22.1). Mix 50 µL ATACseq library (step 21.2) with 75 µL AMPure XP beads, and incubate at RT for 10min.

22.2). Wash AMPure XP beads twice with 180 µL freshly prepared 75% ethanol, each for 30s.

22.3). Air dry for 10-15 min.

22.4). Elute the AMPure XP beads with 50 µL Nuclease-Free H₂O.

SAFE STOPPING POINT

23). Repeat step 22, and elute with 23 µL Nuclease-Free H₂O.

Example 2. Derivation of ATAC-seq Library & mRNA-seq Library from Individual Cells

We developed the ASTAR-seq (Assay for Single-cell Transcriptome and Accessibility Regions) platform, in which a single cell’s tagmented open chromatin (ATAC-DNA) was separated from its biotinylated cDNA by streptavidin beads, and both the ATAC-DNA and cDNA were further amplified and sequenced in parallel (Figure 1A). The optimized ASTAR pipeline was tested with 1000 cells on benchtop. Tn5 treatment before reverse transcription of mRNA did not affect the quality of the RNA or the total amount of cDNA obtained (Figure 1B and Figure 1C). We further tested the pipeline with fewer starting cells. The open chromatin and cDNA were detectable even when starting with as few as 2 cells (Figure 1D).

Moreover, the ATAC-DNA and cDNA were well separated by streptavidin beads (Figure 1E).

As a proof of concept, we applied ASTAR to BJ foreskin fibroblast and K562 chronic myeloid leukemia cells, both of which were previously characterized by ATAC-seq and mRNA-seq on microfluidic chip. Of the 192 libraries analyzed, most of the ASTAR RNA-seq libraries passed the quality control (Figure 2A). t-SNE analysis based on ASTAR RNA-seq libraries showed distinct clusters of BJ and K562 cells (Figure 2B).

Furthermore, RCA analysis of ASTAR RNA-seq libraries demonstrated distinct correlation for BJ and K562 cells with cells of fibroblast foreskin and muscle lineages or leukemia and bone marrow cells, respectively (Figure 2C).

Furthermore, our BJ and K562 ASTAR RNA-seq libraries cluster together with their respective published RNA-seq libraries, further confirming their reliability (Figure 2D).

Similarly, most of the ASTAR ATAC-seq libraries passed the quality control and exhibited characteristic nucleosomal patterns (Figure 3A and Figure 3B), which highly correlated with the published single cell ATAC-seq libraries (Figure 3C).

PCA clustering of ASTAR ATAC-seq libraries showed BJ and K562 to have distinct chromatin architecture (Figure 3D), which can be attributed to the differential expression and binding of their respective regulatory transcription factors.

Example 3. Materials and Methods - Cell Line Culture

mES-E14TG2a mouse embryonic stem cells were maintained in DME+4500 mg/l medium (HyClone) supplemented with 15% Fetal Bovine Serum (Gibco), MEM Non-Essential Amino Acids Solution (100X, Gibco), 200mM L-Glutamine (100X, Gibco), 0.1mM β-mercaptoethanol, 100 U/ml Penicillin-Streptomycin (100X, Gibco), 10^7 unit/ml ESGRO® Recombinant Mouse LIF Protein (10000X, Merck).

2i cells were derived by culturing mES-E14TG2a cells with N2B27 based 2i medium and harvested for ASTAR-seq at passage 3. The recipe of N2B27 based 2i medium is as follows: DMEM/F12 medium (Gibco), Neurobasal medium (Gibco), N2 Supplement (200X, Gibco), B27 Supplement (100X, Gibco), 200mM L-Glutamine (200X, Gibco), 0.1mM β-mercaptoethanol (Gibco), 7.5% BSA (1500X, HyClone), 3uM CHIR99021 (STEMCELL Technologies), 1uM PD0325901 (STEMCELL Technologies), 10^7 unit/ml ESGRO® Recombinant Mouse LIF Protein (10000X, Merck). Cells were routinely propagated on 0.1% gelatin coated plates. Cells were passaged with accutase or trypsin every three to four days.

Human neonatal fibroblast cell line, BJ (Stemgent, Cambridge, MA), were maintained in BJ medium, which is composed of DME+4500 mg/l medium (HyClone) supplemented with 10% Fetal Bovine Serum (Gibco), MEM Non-Essential Amino Acids Solution (100X, Gibco), 200mM L-Glutamine (100X, Gibco), 10000 U/ml Penicillin-Streptomycin (100X, Gibco).

K562 (ATCC® CCL-243™) were maintained in K562 medium, which is composed of RPMI1640 medium with L-Glutamine (HyClone) supplemented with 10% Fetal Bovine Serum (Gibco), 200mM L-Glutamine (100X, Gibco), 10000 U/ml Penicillin-Streptomycin (100X, Gibco). Jurkat, Clone E6-1 (ATCC® TIB-152™) were maintained in RPMI1640 medium (Hyclone) supplemented with 10% Fetal Bovine Serum (Gibco). JK-1 (ACC 347) were maintained in RPMI 1640 medium (Hyclone) supplemented with 20% Fetal Bovine Serum (Gibco).

Example 4. Materials and Methods - ASTAR-seq Method -ASTAR-seq Library

Sequencing 200 ASTAR-mRNAseq libraries were sequenced on a lane of HiSeq 4000 sequencer by 101 bp pair-end sequencing. 200 ASTAR-ATACseq libraries were sequenced on a lane of HiSeq 4000 sequencer by 50 bp pair-end sequencing. Example 5. Materials and Methods - Bioinformatic Analysis - RNA-seq Analysis:

Mapping of scRNA-seq Libraries

The scRNA-seq libraries were mapped to mm9 (for mouse libraries) and hg19 (for human libraries) using the STAR aligner²⁸. We allowed up to 2 mismatches and removed reads that map to more than one locus. The option “--outSAMstrandField intronMotif” was used to make the bam outputs of STAR compatible with subsequent analyses.

Example 6. Materials and Methods - Bioinformatic Analysis - RNA-seq Analysis:

Filtering of scRNA-seq Libraries

The bam outputs were uploaded to SeqMonk (https://www.bioinformatics.babraham.ac.uk/projects/seqmonk/). The RNA-seq QC plot was generated. For cell lines, libraries with a gene detection rate below 15% and/or an exon mapping percentage below 70% were filtered out.
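The filtering rule can be illustrated with a short Python sketch. It is not the script used for the analysis (the metrics were read from the SeqMonk RNA-seq QC plot); the `LibraryQC` record, its field names and the example values are hypothetical. The cut-offs are exposed as parameters because the Results sections quote different values for different datasets.

```python
from dataclasses import dataclass


@dataclass
class LibraryQC:
    """Per-library QC metrics (hypothetical record, e.g. exported from a QC table)."""
    name: str
    gene_detection_pct: float  # percentage of annotated genes detected
    exon_mapping_pct: float    # percentage of mapped reads falling in exons


def passes_qc(lib, min_gene_pct=15.0, min_exon_pct=70.0):
    """Keep a library only if it meets both cut-offs; failing either one removes it."""
    return (lib.gene_detection_pct >= min_gene_pct
            and lib.exon_mapping_pct >= min_exon_pct)


def filter_libraries(libraries, min_gene_pct=15.0, min_exon_pct=70.0):
    """Return the libraries that pass QC, preserving input order."""
    return [lib for lib in libraries if passes_qc(lib, min_gene_pct, min_exon_pct)]


# Illustrative values only, not data from the study.
libraries = [
    LibraryQC("cell_01", 32.0, 91.0),
    LibraryQC("cell_02", 9.5, 88.0),   # too few genes detected
    LibraryQC("cell_03", 21.0, 55.0),  # too few exonic reads
]
kept = filter_libraries(libraries)
print([lib.name for lib in kept])
```

Passing `min_exon_pct=75.0` would reproduce the stricter exon-mapping cut-off quoted in the Results.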

Example 7. Materials and Methods - Bioinformatic Analysis - RNA-seq Analysis: Gene Coverage

We merged the bam files belonging to each category of cells using samtools²⁹ merge. Then we used the geneBody_coverage.py module of RSeQC³⁰ to determine the distribution of the mapped fragments over the gene bodies of house-keeping genes.

Example 8. Materials and Methods - Bioinformatic Analysis - RNA-seq Analysis:

Generation of hg19 and mm9 GTF Files

The GTF files were generated using the “genePredToGtf” tool provided by UCSC.

Example 9. Materials and Methods - Bioinformatic Analysis - RNA-seq Analysis:

Quantification and Normalization of the scRNA-seq Libraries

The bam files and the generated GTF files were used as inputs for Cuffquant³¹. Options -u and -m were included in the cuffquant script. The abundance files were then used as inputs for Cuffnorm³¹. The classic-fpkm normalization style was used.

Example 10. Materials and Methods - Bioinformatic Analysis - RNA-seq Analysis:

Correlation with the Mouse Cell Atlas

The FPKM table output of cuffnorm (mouse libraries) was uploaded to the Mouse Cell Atlas¹⁸ (http://bis.zju.edu.cn/MCA/blast.html). The output obtained after the MCA analysis was used to create PCA graphs using the FactoMineR³² package in R and to categorize the populations of the 2i and mESC ASTAR RNA-seq libraries. The threshold used to consider a cell to belong to a particular lineage was 0.414.
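As an illustration of how such a threshold could be applied, the Python sketch below assigns lineages to one cell from a table of per-lineage scores. The dictionary layout, the lineage names and the scores are hypothetical, and treating the threshold as inclusive (score >= 0.414) is an assumption, since the text does not say whether the comparison is strict.

```python
LINEAGE_THRESHOLD = 0.414  # value stated in the text


def assign_lineages(scores, threshold=LINEAGE_THRESHOLD):
    """Return the lineages whose score reaches the threshold, best match first.

    `scores` maps a reference lineage name to the score of one cell against it.
    An empty list means the cell is not assigned to any lineage.
    """
    hits = [(lineage, score) for lineage, score in scores.items() if score >= threshold]
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return [lineage for lineage, _ in hits]


# Illustrative scores only, not data from the study.
cell_scores = {"ESC": 0.61, "ICM-like": 0.43, "2C-like": 0.12}
print(assign_lineages(cell_scores))
```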

Example 11. Materials and Methods - Bioinformatic Analysis - RNA-seq Analysis: RCA Analysis

The FPKM table output of cuffnorm (human libraries) was used as an input for RCA³³. The human ASTAR RNA-seq libraries were clustered using the Global panel mode of RCA with default parameters.

Example 12. Materials and Methods - Bioinformatic Analysis - RNA-seq Analysis:

UCSC Genome Browser Screenshots

The libraries belonging to each cell type were merged using samtools merge. This was followed by the creation of tag directories using the “makeTagDirectory” script of HOMER³⁵. Finally, the “makeUCSCfile” script was used, with the options -style and -fragLength set to rnaseq and given, respectively.

Example 13. Materials and Methods - Bioinformatic Analysis - RNA-seq Analysis:

Meta-Analysis (RNA-seq)

For meta-analysis, FPKM values for bulk K562 RNA-seq, K562 ASTAR RNAseq, BJ ASTAR RNAseq and BJ scRNA-seq were quantified using SeqMonk (RNA-seq Quantification pipeline). The option to correct for gene length was chosen. Values were not log-transformed. The values obtained were then used to calculate the Spearman correlation among the libraries (cor function in R). The correlation values were then subjected to PCA using the “FactoMineR”³² (http://factominer.free.fr/) package in R.

Example 14. Materials and Methods - Bioinformatic Analysis - RNA-seq Analysis:

RNA-Velocity

Mapped bam files were used as input for Velocyto³⁶, which was run using the run_smartseq2 mode to generate loom files. The R module of Velocyto was used to determine the spliced/unspliced ratio of genes. The spliced values for genes were used to perform the tSNE clustering by Pagoda2³⁷. Then velocity was estimated and embedded on the clusters identified by Pagoda2 using “gene.relative.velocity.estimates” and “show.velocity.on.embedding.cor”.

Example 15. Materials and Methods - Bioinformatic Analysis - ATAC-seq Analysis: Mapping of the scATAC-seq Libraries:

Libraries were mapped to the mm9 and hg19 genomes using the STAR aligner²⁸, similar to the mapping of the scRNA-seq libraries, with the addition of the options --alignIntronMax 1 and --alignEndsType EndToEnd.
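The relationship between the two mapping runs can be sketched as a small Python helper that builds the STAR argument list for either library type. Only --outSAMstrandField intronMotif, --alignIntronMax 1 and --alignEndsType EndToEnd are named in the text. The flags chosen here for "up to 2 mismatches" (--outFilterMismatchNmax 2) and for discarding multi-mapping reads (--outFilterMultimapNmax 1) are assumptions about how those settings could be expressed, and the index directory, file names and output prefixes are hypothetical.

```python
import shlex


def star_command(genome_dir, read1, read2, out_prefix, atac=False):
    """Build a STAR argument list for one paired-end single-cell library."""
    cmd = [
        "STAR",
        "--genomeDir", genome_dir,
        "--readFilesIn", read1, read2,
        "--outFileNamePrefix", out_prefix,
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--outSAMstrandField", "intronMotif",  # named in the text
        "--outFilterMismatchNmax", "2",        # assumed flag for "up to 2 mismatches"
        "--outFilterMultimapNmax", "1",        # assumed flag for dropping multi-mappers
    ]
    if atac:
        # Additions for ATAC libraries named in the text: no spliced
        # alignments and no soft-clipping of read ends.
        cmd += ["--alignIntronMax", "1", "--alignEndsType", "EndToEnd"]
    return cmd


rna_cmd = star_command("star_index/mm9", "cell_01_rna_R1.fastq",
                       "cell_01_rna_R2.fastq", "rna/cell_01.")
atac_cmd = star_command("star_index/mm9", "cell_01_atac_R1.fastq",
                        "cell_01_atac_R2.fastq", "atac/cell_01.", atac=True)
print(shlex.join(rna_cmd))
print(shlex.join(atac_cmd))
```

The lists can be passed to `subprocess.run(cmd, check=True)` on a machine where STAR and a genome index are available.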

Example 16. Materials and Methods - Bioinformatic Analysis - ATAC-seq Analysis: Determination of Highly Accessible Regions (HARs)

Human and mouse libraries were merged independently using samtools merge²⁹. Duplicates were removed using the MarkDuplicates module of PICARD. Peak calling was performed using MACS2³⁸ with the --nomodel --nolambda --keep-dup all --call-summits options. The narrowPeak output of MACS2 was considered as the HARs.
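A minimal Python sketch of the peak-calling invocation is given below. The four options named in the text are passed as stated; the input BAM name, the run name, the output directory and the use of the -g genome-size shortcut ("hs" or "mm") are assumptions added for illustration.

```python
def macs2_callpeak_command(bam_path, name, out_dir, genome_size="hs"):
    """Build a `macs2 callpeak` argument list for one merged, de-duplicated BAM file."""
    return [
        "macs2", "callpeak",
        "-t", bam_path,      # treatment file (merged ATAC-seq alignments)
        "-n", name,          # prefix of the output files
        "--outdir", out_dir,
        "-g", genome_size,   # assumed: "hs" for human, "mm" for mouse
        "--nomodel",
        "--nolambda",
        "--keep-dup", "all",
        "--call-summits",
    ]


mouse_cmd = macs2_callpeak_command("mouse_merged_dedup.bam", "mouse_merged",
                                   "peaks", genome_size="mm")
print(" ".join(mouse_cmd))
```

With these arguments MACS2 writes the peak list to `peaks/mouse_merged_peaks.narrowPeak`, which corresponds to the file treated as the HARs.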

Example 17. Materials and Methods - Bioinformatic Analysis - ATAC-seq Analysis: chromVAR Analysis

Duplicates were removed from the single-cell libraries using the MarkDuplicates module of PICARD. Reads mapping to chromosomes M and Y were removed. These libraries were then uploaded to chromVAR³⁹ along with the narrowPeak file as an input for the getCounts function of chromVAR. QC was performed using the filterSamples function. Motif variability over these HARs was measured using the computeDeviations function followed by the computeVariability function. The scATAC-seq libraries were correlated using the getSampleCorrelation function. t-SNE clustering of the scATAC-seq libraries was carried out using the deviationsTSNE function with a perplexity setting of 30. For meta-analysis, the published scATAC-seq libraries were processed similarly and included in the analysis.

Example 18. Materials and Methods - Bioinformatic Analysis - ATAC-seq Analysis: Further Processing of scATAC-seq Libraries

Group information was added to each de-duplicated library using the AddOrReplaceReadGroups module of PICARD. This was followed by lexicographical sorting of each library using the ReorderSam module of PICARD. The option ALLOW_INCOMPLETE_DICT_CONCORDANCE was set to TRUE. The sequence dictionaries for hg19 and mm9 that were used for sorting lack ChrM and other ambiguous chromosomes. Each library was further indexed using samtools index.

Example 19. Materials and Methods - Bioinformatic Analysis - ATAC-seq Analysis: Nucleosomal Pattern Determination

The histograms were generated using the CollectInsertSizeMetrics module of PICARD.

Example 20. Materials and Methods - Bioinformatic Analysis - ATAC-seq Analysis: Depth of Coverage Detection

Coverage of each processed library over the HARs was quantified using the DepthOfCoverage module of GATK⁴⁰ tools v3.46. The option COUNT_FRAGMENTS was used. The values indicated in the total_cvg columns of the interval_summary outputs of GATK were used for downstream analysis.

For correlation with published scATAC-seq libraries, the single cells belonging to each study were merged using samtools merge. Then the merged bam files were subjected to further processing and filtering as mentioned above. The coverage values were then used to generate the correlation dotplots. The correlation values were calculated using EXCEL.
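For readers who prefer a scripted equivalent of the spreadsheet step, the following Python sketch computes a correlation coefficient between two coverage vectors measured over the same HARs. It assumes the Pearson coefficient, which is the statistic reported in the Results; the function name and the example values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Illustrative coverage values over five HARs, not data from the study.
astar_cov = [120, 45, 300, 10, 80]
published_cov = [100, 60, 280, 5, 95]
print(round(pearson(astar_cov, published_cov), 3))
```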

Example 21. Materials and Methods - Bioinformatic Analysis - ATAC-seq Analysis: Motif Analysis

The findMotifsGenome.pl script of HOMER³⁵ was executed in order to identify the known motifs enriched in the differentially accessible regions.

Example 22. Materials and Methods - Bioinformatic Analysis - ATAC-seq Analysis: UCSC Genome Browser

The single-cell libraries belonging to each individual cell type were merged using samtools merge. Tag directories were created using makeTagDirectory module of HOMER. This was followed by the implementation of makeUCSCfile script³⁵.

Example 23. Materials and Methods - Bioinformatic Analysis - ATAC-seq Analysis: Average Enrichment Profile

The average enrichment of the ASTAR ATACseq libraries over TSSs was determined using ngsplot⁴³.

Example 24. Materials and Methods - Bioinformatic Analysis - ATAC-seq and RNA-seq Analysis: Integrative Analysis

CoupleNMF²⁰ was used to cluster the cells based on the integration of both the scATAC-Seq and the scRNA-Seq data. The K setting was determined by running the script starting with K=5 and then determining the NMF score. We kept reducing the K value and re-running the script until the NMF score obtained was > 1 (for human, K=3; for mouse, K=2). The motifs enriched in the peaks specific to each NMF cluster were determined by findMotifsGenome.pl. For the human libraries, the genes identified as significantly expressed in each cluster were submitted to CTen²⁶ (http://www.influenza-x.org/~jshoemaker/cten/) to identify the lineages they enrich. The mouse libraries were clustered using Seurat⁴⁴ to ensure correlation of the ASTAR RNA-Seq with the CoupleNMF clusters. For human libraries, the NMF clusters were superimposed on the pseudotime trajectory. For the mouse ASTAR ATAC-Seq, the cluster-specific regions were used to calculate deviations instead of JASPAR motifs using the “getAnnotations” function of chromVAR. The heatmaps of ASTAR ATAC NMF clusters were created by obtaining the read counts over the NMF cluster-specific peaks using featureCounts⁴⁵ with the -F SAF option. The values obtained were used to generate the heatmaps using the heatmap.2 function of gplots in R.

Example 25. Materials and Methods - Bioinformatic Analysis - ATAC-seq Analysis: Cis-Regulatory Interactions Prediction

ASTAR ATACseq libraries belonging to each NMF cluster were merged using samtools merge to determine the HARs of each cluster as described above. Then the coverage of each library belonging to a particular cluster over the HARs of its corresponding cluster was measured using DepthOfCoverage as described above. The raw coverage table was used as an input for Cicero²¹. Cicero CDS objects were created using make_cicero_cds and then “run_cicero” was performed (with default settings). The plot_connections function was used to visualize the predicted regulatory elements of NMF-identified genes. The coordinates used were 10X zoomed-out from the precise coordinates of the gene of interest.
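The text does not specify how the 10X zoom-out was computed. One possible reading, sketched below in Python, is a window centred on the gene whose width is ten times the gene length; the function name, the centring choice and the example coordinates are all assumptions made for illustration.

```python
def zoomed_out_window(start, end, zoom=10, chrom_length=None):
    """Return a window centred on [start, end) whose width is `zoom` times the gene length.

    The window is clipped at position 0 and, if given, at the chromosome length.
    """
    if end <= start:
        raise ValueError("end must be greater than start")
    gene_length = end - start
    centre = (start + end) // 2
    half_width = (gene_length * zoom) // 2
    window_start = max(0, centre - half_width)
    window_end = centre + half_width
    if chrom_length is not None:
        window_end = min(window_end, chrom_length)
    return window_start, window_end


# Hypothetical gene coordinates (2 kb gene), not taken from the study.
print(zoomed_out_window(100_000, 102_000))
```

The resulting start and end positions would then be supplied as the viewing range when plotting the connections around the gene.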

Example 26. Materials and Methods - Bioinformatic Analysis - RNA-seq Analysis:

Network Analysis

Genes specific to each mouse NMF cluster were uploaded to STRING⁴⁶. The text-mining option was disabled. The network formed was downloaded in tabular format and then the tables were uploaded to Cytoscape⁴⁷ for visual formatting. Moreover, each network was subjected to MCODE clustering⁴⁸ to identify the most significant complexes within each overall network. The names of the members of each significant complex were uploaded to Metascape⁴⁹ to identify the biological process in which this complex is involved.

Example 27. Materials and Methods - Bioinformatic Analysis - ATAC-seq Analysis: Code Availability

Codes for the bioinformatic analysis are available.

Example 28. Results

In ASTAR-seq, single cells were isolated and distributed to separate reaction tubes, wells, or chambers of a microfluidic chip. The open chromatin of a single cell was then transposed with Tn5 transposase to generate accessible-region fragments (ATAC-DNA).

Next, the reverse transcription of mRNA was carried out, and the cDNA was labeled with biotin during the PCR amplification process, which enabled the separation of the cDNA from the ATAC-DNA by pulling it down with streptavidin beads. The separated ATAC-DNA and cDNA fractions were further processed for library preparation and sequenced in parallel (Figure 1A).

The current ASTAR-seq protocol was first optimized and tested with 1000 cells on the benchtop, where clear separation of the ATAC-DNA and cDNA was observed (Figure 7A, Figure 1C and Figure 1E). We further tested the pipeline with fewer starting cells. The open chromatin and cDNA were detectable even when starting with as few as 2 cells (Figure 1D).

Example 29. Results - Quality Control for ASTAR-seq Libraries of Mouse Cells

As a proof of concept, we applied ASTAR-seq to 96 E14 mESCs and 96 2i cells, induced from the E14 mESCs. Of the 192 libraries analyzed, 155 (80.7%) ASTAR RNAseq libraries passed the quality control based on their detected gene rate (>=15%) and exon mapping rate (>=75%) (Figure 4A).

The sequencing reads of ASTAR RNAseq libraries spread across the entire gene body, without bias towards either end of the mRNA (Figure 4B). The quality of the ASTAR RNAseq libraries was assessed by comparison with existing scRNA-seq datasets. A median of 771822 de-duplicated sequencing reads was detected in each ASTAR RNAseq library, which is comparable to the multimodal scRNA-seq libraries. Among them, a median of 90% of reads mapped to exons. In addition, 86-88% of the HARs (highly accessible regions) of ASTAR-ATACseq libraries showed more than 3-fold enrichment in ATACseq over RNAseq libraries. This evidence suggests that the ASTAR-RNAseq libraries had minimal contamination from the ATACseq libraries.
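The enrichment check can be illustrated with a short Python sketch that reports the fraction of HARs whose ATAC coverage exceeds RNA coverage by more than a given fold. The inputs are assumed to be per-HAR coverage values already normalised for sequencing depth; the pseudocount (used to avoid division by zero), the function name and the example values are assumptions and are not described in the text.

```python
def fraction_enriched(atac_cov, rna_cov, fold=3.0, pseudocount=1.0):
    """Fraction of regions where (ATAC + pseudocount) / (RNA + pseudocount) > fold."""
    if len(atac_cov) != len(rna_cov) or not atac_cov:
        raise ValueError("need two non-empty coverage lists of equal length")
    enriched = sum(
        1
        for atac, rna in zip(atac_cov, rna_cov)
        if (atac + pseudocount) / (rna + pseudocount) > fold
    )
    return enriched / len(atac_cov)


# Illustrative coverage over four HARs, not data from the study.
atac = [40, 12, 5, 30]
rna = [3, 10, 0, 9]
print(fraction_enriched(atac, rna))
```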

Meanwhile, all the sequenced mouse ASTAR ATACseq libraries passed the QC thresholds for chromVAR. Of note, a median of 87781 peaks were called, and 25% of the fragments of each ATACseq library were in the peaks (Figure 4C, Figure 7B). The insert-size distribution of ASTAR ATACseq libraries also demonstrated a clear periodicity of 200 bp, a characteristic nucleosomal pattern of an ATAC-seq library³ (Figure 4D). The reads were highly enriched at Transcription Start Sites (TSS), an indication of high-quality library preparation³ (Figure 7C).

Example 30. Results - Heterogeneity Observed from Transcriptomic and Accessible Chromatin Profile

We next clustered the mESCs and 2i ASTAR ATACseq libraries based on the enrichment of Mouse JASPAR motifs on the HARs. Although the mESC and 2i cells were mostly clustered separately, a certain degree of overlapping was observed (Figure 5A).

The motifs of Klf4, Rarg, Zfx, Klf12, and Mlxip showed significant variability in terms of chromatin accessibility (P-value < 0.05) (Figure 8A).

Among the variable motifs, Klf4 had higher deviation scores in 2i cells, whereas Zfx showed the opposing trend (Figure 8B). RNA velocity analysis using the ASTAR RNAseq demonstrated that 2i cells are in a more stable state as compared to mESCs, which are known to be in a metastable state¹⁷ (Figure 5B).

Hence, to study the heterogeneity within the mESCs, we correlated their ASTAR RNAseq libraries with the Mouse Cell Atlas (MCA) panel¹⁸. The analysis revealed 3 types of cells, including the bulk mESCs and small populations of ICM-like mESCs and 2C-like (2-cell) mESCs, in agreement with previous reports¹⁹ (Figure 5C).

PCA clustering of mESCs also revealed that the minority 2C-like cells clustered away from the other mESCs (Figure 8C). To confirm the presence of the 2C-like cells in our mESC culture, we utilized the previously constructed 2C::tdTomato reporter¹⁹. Indeed, around 1-2% of cells consistently exhibited activation of the 2C reporter in culture. Moreover, the depletion of G9a, an H3K9me3 methyltransferase, resulted in increased activation of the 2C reporter, as previously described¹⁹ (Figure 8D).

Example 31. Results - Coupled Analysis Identifies Sub-population Specific Regulatory DNA Regions and Their Target Genes

To correlate the two-dimensional information within the same cell, we applied coupled Non-negative Matrix Factorization (NMF)²⁰ to cluster mESCs and 2i cells based on both ASTAR RNAseq and ATACseq libraries. Using NMF analysis, the differentially accessible regions and the differentially expressed genes of each NMF cluster were identified. We observed two distinct clusters of cells. Cluster 1 comprised mostly mESCs and cluster 2 comprised 2i cells (Figure 5D). Further, the correlation between accessibility and gene expression helped to identify the set of genes specific to each cluster (Figure 5E).

For example, Dnmt3l was highly accessible and highly expressed in cells of cluster 1. Cicero²¹ co-accessibility analysis showed that the Dnmt3l gene and its surrounding loci display an open chromatin architecture in cluster 1, highly suggestive of the existence of cluster-specific putative regulatory elements governing its expression.

On the other hand, Scd1, Gdf3, and Dppa3 were highly expressed, with greater chromatin accessibility, in cells of cluster 2 than in cluster 1. Consistently, more 3D genomic interactions were observed (Figure 5E, Figure 5F and Figure 8E).

In addition, the genes of each NMF cluster interacted extensively with each other (Figure 8F, Figure 8G). Cluster 2 genes were involved in pathways related to ribosome, mRNA splicing, and glutathione metabolism, whereas cluster 1 genes were associated with nodal inhibition and the spliceosome pathway (Figure 5G and Figure 5H). To further examine the transcription factors regulating the genes of each cluster, we performed motif enrichment analysis. Klf3, Ctcf, Sp1 and Maz motifs were enriched in cluster 1 specific accessible regions. Consistently, these transcription factors were also highly expressed. Interestingly, an earlier study reported lower genomic looping frequencies in 2i cells as compared to mESCs²². In contrast, Tcfp2l1, Nfe2l2, Klf6, Tcf7 and Sp5 motifs were enriched in cluster 2 specific accessible regions (Figure 5I, Figure 5J).

Example 32. Results - Quality Control for ASTAR-seq Libraries of Human Cells

Next, to expand the applicability of ASTAR-seq to human cells, we prepared ASTAR-seq libraries of multiple human cell lines, including the adherent BJ cells and the suspension cultures of JK1, K562, and Jurkat. Of the 384 ASTAR RNAseq libraries, 296 libraries (77.1%) passed the quality control (Threshold: 6% detected gene rate vs. 75% exon mapping rate) (Figure 9A).

The sequencing reads covered the full transcript, without bias towards either end of the mRNA (Figure 9B). Meta-analysis of the K562 ASTAR RNAseq, K562 bulk RNA-seq, BJ ASTAR RNAseq and BJ scRNA-seq libraries illustrated the expected clustering according to the respective cell types²³. Hence, the human ASTAR RNAseq libraries are comparable to the bulk RNAseq or unimodal scRNA-seq libraries previously generated (Figure 2D). Furthermore, 375 of the 384 ASTAR ATACseq libraries passed the QC thresholds of chromVAR. A median of 55193 peaks were called and 35% of fragments were within the peaks (Figure 9C and Figure 9D).
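The "fragments within the peaks" metric can be illustrated with a small Python sketch that counts the fraction of fragments overlapping at least one peak. It assumes half-open (start, end) coordinates, as in BED-style files, and peaks that do not overlap one another within a chromosome; the function names and example intervals are hypothetical, and this is not the routine used to produce the reported values.

```python
from bisect import bisect_right
from collections import defaultdict


def build_peak_index(peaks):
    """Group (chrom, start, end) peaks by chromosome and sort them by start."""
    index = defaultdict(list)
    for chrom, start, end in peaks:
        index[chrom].append((start, end))
    for intervals in index.values():
        intervals.sort()
    return index


def overlaps_peak(index, chrom, start, end):
    """True if the half-open fragment [start, end) overlaps any peak on `chrom`."""
    intervals = index.get(chrom, [])
    # Last peak that starts before the fragment ends; with non-overlapping
    # peaks it is the only one that can still reach the fragment.
    i = bisect_right(intervals, (end - 1, float("inf"))) - 1
    return i >= 0 and intervals[i][1] > start


def fraction_in_peaks(fragments, peaks):
    """Fraction of fragments that overlap at least one peak."""
    if not fragments:
        raise ValueError("no fragments given")
    index = build_peak_index(peaks)
    hits = sum(1 for chrom, start, end in fragments
               if overlaps_peak(index, chrom, start, end))
    return hits / len(fragments)


# Illustrative intervals only, not data from the study.
peaks = [("chr1", 100, 200), ("chr1", 500, 650), ("chr2", 50, 120)]
fragments = [
    ("chr1", 150, 180),  # inside a peak
    ("chr1", 190, 260),  # overlaps the end of a peak
    ("chr1", 300, 400),  # between peaks
    ("chr2", 120, 180),  # starts exactly where a peak ends: no overlap
    ("chr3", 10, 60),    # chromosome without peaks
]
print(fraction_in_peaks(fragments, peaks))
```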

The insert-size distribution of ASTAR ATACseq libraries also exhibited the characteristic nucleosomal pattern of an ATAC-seq library (Figure 3B). Of note, the ASTAR ATACseq libraries showed a Pearson correlation of 0.8-0.89 with the published single cell ATACseq libraries³, indicative of their robustness in comparison to the unimodal datasets (Figure 3C).

Example 33. Results - Integrative Analysis of ASTAR-Seq Identifies the Cell Type Specific Regulatory Networks Driven by Transcription Factors

We next clustered the ASTAR ATACseq libraries based on the enrichment of human JASPAR motifs on the HARs of these cell lines. There were 2 major clusters and 4 sub-clusters. Notably, BJ cells clustered distinctly from the blood lineage cell lines (Figure 6A). Variability analysis identified the transcription factors critical for the specification of each cell type (Figure 6B).

For example, FOS-JUN motifs were uniquely enriched in BJ cells, whereas GATA-TAL motifs were only enriched in the myeloid K562 and JK1 cells. Some motifs were, however, differentially enriched between K562 and JK1, such as ETS1 and ZBTB33 in K562 and NFE2 in JK1. Common motifs, such as KLF5 and KLF14, SP1 and SP2, and the nuclear transcription factors NFYA and NFYB, were enriched in the 3 cell lines other than Jurkat. On the other hand, TEAD motifs were enriched in BJ and Jurkat cells (Figure 6B, Figure 6C, Figure 6D and Figure 9E).

Notably, RCA analysis demonstrated distinct correlation for BJ cells with cells of fibroblast foreskin and muscle lineages, and Jurkat cells with leukemia lymphoblast.

As expected, K562 cells showed the highest correlation to leukemia K562 and erythroblast progenitors, whereas JK1 exhibited correlation to erythroblast progenitors and early and late erythroblasts (Figure 6E). Consistently, GATA1 chromatin was only accessible and expressed in JK1 and K562 cells, but not in Jurkat and BJ cells, whereas SP1 was accessible and expressed in BJ, JK1, and K562 cells (Figure 6F).

Our work demonstrates that parallel profiling of transcriptome and chromatin accessibility from the same single cell is an enabling technology which yields data of high quality comparable to conventional scRNA-seq or scATAC-seq. ASTAR-seq is a powerful integrated approach to understand the connectivity between transcription and epigenetic regulation. We expect the technology to have attractive applications in early embryonic testing, identification of rare cancer populations, and atlasing of whole tissues or cell types.

All raw data have been deposited in the GEO database under accession GSE113418.

REFERENCES

Buenrostro, J. D., Giresi, P. G., Zaba, L. C., Chang, H. Y. & Greenleaf, W. J. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nature Methods 10, 1213-1218 (2013).

Buenrostro, J. D., et al. (2015). “Single-cell chromatin accessibility reveals principles of regulatory variation.” Nature 523(7561): 486-490.

Goetz, J. J. and J. M. Trimarchi (2012). “Transcriptome sequencing of single cells with Smart-Seq.” Nat Biotechnol 30(8): 763-765.

Picelli, S., et al. (2014). “Full-length RNA-seq from single cells using Smart-seq2.” Nat Protoc 9(1): 171-181.

1. Navin, N. et al. Tumour evolution inferred by single-cell sequencing. Nature 472, 90-95 (2011).

2. Tang, F. et al. RNA-Seq analysis to capture the transcriptome landscape of a single cell. Nat. Protoc. 5, 516-535 (2010).

3. Buenrostro, J. D. et al. Single-cell chromatin accessibility reveals principles of regulatory variation. Nature 523, 486-490 (2015).

4. Lorthongpanich, C. et al. Single-cell DNA-methylation analysis reveals epigenetic chimerism in preimplantation embryos. Science 341, 1110-1112 (2013).

5. Nagano, T. et al. Single-cell Hi-C reveals cell-to-cell variability in chromosome structure. Nature 502, 59-64 (2013).

6. Zong, C., Lu, S., Chapman, A. R. & Xie, X. S. Genome-wide detection of single-nucleotide and copy-number variations of a single human cell. Science 338, 1622-1626 (2012).

7. Raj, B. et al. Simultaneous single-cell profiling of lineages and cell types in the vertebrate brain. Nat. Biotechnol. (2018). doi:10.1038/nbt.4103

8. Stoeckius, M. et al. Simultaneous epitope and transcriptome measurement in single cells. Nat. Methods 14, 865-868 (2017).

9. Macaulay, I. C. et al. G&T-seq: Parallel sequencing of single-cell genomes and transcriptomes. Nat. Methods 12, 519-522 (2015).

10. Dey, S. S., Kester, L., Spanjaard, B., Bienko, M. & Van Oudenaarden, A. Integrated genome and transcriptome sequencing of the same cell. Nat. Biotechnol. 33, 285-289 (2015).

11. Cheow, L. F. et al. Single-cell multimodal profiling reveals cellular epigenetic heterogeneity. Nat. Methods 13, 833-836 (2016).

12. Angermueller, C. et al. Parallel single-cell sequencing links transcriptional and epigenetic heterogeneity. Nat. Methods 13, 229-232 (2016).

13. Clark, S. J. et al. scNMT-seq enables joint profiling of chromatin accessibility DNA methylation and transcription in single cells. Nat. Commun. 9, (2018).

14. Cao, J. et al. Joint profiling of chromatin accessibility and gene expression in thousands of single cells. Science (2018). doi:10.1126/science.aau0730

15. Hou, Y. et al. Single-cell triple omics sequencing reveals genetic, epigenetic, and transcriptomic heterogeneity in hepatocellular carcinomas. Cell Res. (2016). doi:10.1038/cr.2016.23

16. Satpathy, A. T. et al. Transcript-indexed ATAC-seq for precision immune profiling. Nat. Med. (2018). doi:10.1038/s41591-018-0008-8

17. Marks, H. et al. The transcriptional and epigenomic foundations of ground state pluripotency. Cell (2012). doi:10.1016/j.cell.2012.03.026

18. Han, X. et al. Mapping the Mouse Cell Atlas by Microwell-Seq. Cell 172, 1091-1097.e17 (2018).

19. Macfarlan, T. S. et al. Embryonic stem cell potency fluctuates with endogenous retrovirus activity. Nature 487, 57-63 (2012).

20. Duren, Z. et al. Integrative analysis of single-cell genomics data by coupled nonnegative matrix factorizations. Proc. Natl. Acad. Sci. (2018). doi:10.1073/pnas.1805681115

21. Pliner, H. A. et al. Cicero Predicts cis-Regulatory DNA Interactions from Single-Cell Chromatin Accessibility Data. Mol. Cell (2018). doi:10.1016/j.molcel.2018.06.044

22. Krijger, P. H. L. et al. Cell-of-origin-specific 3D genome structure acquired during somatic cell reprogramming. Cell Stem Cell 18, 597-610 (2016).

23. Bernstein, B. E. et al. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57-74 (2012).

24. Trapnell, C. et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat. Biotechnol. 32, 381-386 (2014).

25. Qiu, X. et al. Reversed graph embedding resolves complex single-cell trajectories. Nat. Methods 14, 979-982 (2017).

26. Shoemaker, J. E. et al. CTen: a web-based platform for identifying enriched cell types from heterogeneous microarray data. BMC Genomics (2012). doi:10.1186/1471-2164-13-460

27. Koeffler, H. & Golde, D. Human myeloid leukemia cell lines: a review. Blood (1980). doi: 10.3233/JAD-2012-120580

28. Dobin, A. et al. STAR: Ultrafast universal RNA-seq aligner. Bioinformatics 29, 15-21 (2013).

29. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078-2079 (2009).

30. Benjamini, Y. & Speed, T. P. RSeQC: Quality Control of RNA-seq experiments. Bioinformatics 40, e72 (2012).

31. Trapnell, C. et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511-515 (2010).

32. Le, S., Josse, J. & Husson, F. FactoMineR: An R Package for Multivariate Analysis. J. Stat. Softw. 25, 253-258 (2008).

33. Li, H. et al. Reference component analysis of single-cell transcriptomes elucidates cellular heterogeneity in human colorectal tumors. Nat. Genet. 49, 708-718 (2017).

34. Dennis, G. et al. DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol. 4, R60 (2003).

35. Heinz, S. et al. Simple Combinations of Lineage-Determining Transcription Factors Prime cis-Regulatory Elements Required for Macrophage and B Cell Identities. Mol. Cell 38, 576-589 (2010).

36. La Manno, G. et al. RNA velocity of single cells. Nature (2018). doi:10.1038/s41586-018-0414-6

37. Fan, J. et al. Characterizing transcriptional heterogeneity through pathway and gene set overdispersion analysis. Nat. Methods (2016). doi:10.1038/nmeth.3734

38. Zhang, Y. et al. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 9, (2008).

39. Schep, A. N., Wu, B., Buenrostro, J. D. & Greenleaf, W. J. ChromVAR: Inferring transcription-factor-associated accessibility from single-cell epigenomic data. Nat. Methods 14, 975-978 (2017).

40. McKenna, A. et al. The genome analysis toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297-1303 (2010).

41. Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, (2014).

42. Ignatiadis, N., Klaus, B., Zaugg, J. B. & Huber, W. Data-driven hypothesis weighting increases detection power in genome-scale multiple testing. Nat. Methods 13, 577-580 (2016).

43. Shen, L., Shao, N., Liu, X. & Nestler, E. Ngs.plot: Quick mining and visualization of next-generation sequencing data by integrating genomic databases. BMC Genomics 15, (2014).

44. Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. (2018). doi:10.1038/nbt.4096

45. Liao, Y., Smyth, G. K. & Shi, W. FeatureCounts: An efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics (2014). doi:10.1093/bioinformatics/btt656

46. Szklarczyk, D. et al. STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res. (2018). doi:10.1093/nar/gky1131

47. Shannon, P. et al. Cytoscape: A software Environment for integrated models of biomolecular interaction networks. Genome Res. (2003). doi:10.1101/gr.1239303

48. Bader, G. D. & Hogue, C. W. V. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics (2003). doi:10.1186/1471-2105-4-2

49. Tripathi, S. et al. Meta- and Orthogonal Integration of Influenza “OMICs” Data Defines a Role for UBR4 in Virus Budding. Cell Host Microbe (2015). doi:10.1016/j.chom.2015.11.002

In this document and in its claims, the verb “to comprise” and its conjugations are used in their non-limiting sense to mean that items following the word are included, but items not specifically mentioned are not excluded. In addition, reference to an element by the indefinite article “a” or “an” does not exclude the possibility that more than one of the element is present, unless the context clearly requires that there be one and only one of the elements. The indefinite article “a” or “an” thus usually means “at least one”.

Each of the applications and patents mentioned in this document, and each document cited or referenced in each of the above applications and patents, including during the prosecution of each of the applications and patents (“application cited documents”) and any manufacturer’s instructions or catalogues for any products cited or mentioned in each of the applications and patents and in any of the application cited documents, are hereby incorporated herein by reference. Furthermore, all documents cited in this text, and all documents cited or referenced in documents cited in this text, and any manufacturer’s instructions or catalogues for any products cited or mentioned in this text, are hereby incorporated herein by reference.

Various modifications and variations of the described methods and system of the invention will be apparent to those skilled in the art without departing from the scope and spirit of the invention. Although the invention has been described in connection with specific preferred embodiments, it should be understood that the invention as claimed should not be unduly limited to such specific embodiments. Indeed, various modifications of the described modes for carrying out the invention which are obvious to those skilled in molecular biology or related fields are intended to be within the scope of the claims.

Claims

1. A method comprising the steps of:

(a) providing a nucleic acid sample from an individual cell, the nucleic acid sample comprising genomic DNA and messenger RNA (mRNA);

(b) obtaining chromatin associated DNA from genomic DNA in the sample; and

(c) obtaining complementary DNA (cDNA) from mRNA in the sample.

2. A method according to Claim 1, in which steps (b) and (c) are conducted simultaneously.

3. A method according to Claim 1 or 2, which further comprises step (d) in which chromatin associated DNA is separated from complementary DNA (cDNA).

4. A method according to Claim 1, 2 or 3, which further comprises step (e) in which chromatin associated DNA is amplified, for example by PCR.

5. A method according to any preceding claim, which further comprises step (f) in which cDNA is amplified, for example by PCR.

6. A method according to any preceding claim, in which step (b) comprises generating transposed DNA from chromatin associated DNA in the nucleic acid sample.

7. A method according to Claim 6, in which transposed DNA is generated by cutting DNA with transposase.

8. A method according to Claim 7, in which the transposase comprises a prokaryotic transposase, such as a hyperactive transposase, such as Tn5.

9. A method according to any preceding claim, in which step (c) comprises reverse transcription-polymerase chain reaction (RT-PCR).

10. A method according to Claim 3 or any of Claims 4 to 9 as dependent on Claim 3, in which cDNA is labelled with biotin and separated from transposed DNA by binding to streptavidin.

11. A method according to any preceding claim, in which the chromatin associated DNA and/or the cDNA are sequenced.

12. A method of generating nucleic acid libraries from a plurality of individual cells, comprising performing a method according to any preceding claim on each of a plurality of individual cells.

13. A method according to Claim 12, in which the method is performed independently and in parallel on each individual cell.

14. A method according to Claim 12 or 13, which further comprises a step of separating cells in a sample comprising multiple cells to obtain a plurality of individual cells, such as by flow cytometry sorting or manual picking to multi-well plates or by microfluidic chips to multi-chambers.

15. A method according to Claim 12, 13 or 14, in which the plurality of individual cells is comprised in a tissue sample.

16. A method according to any preceding claim for generating genome-wide accessible chromatin region and transcriptome information from a single cell.

17. A method for generating transcriptome and chromatin accessibility profiles of a cell or a plurality of cells, the method comprising a method according to any preceding claim.

18. A method comprising the steps of:

(a) generating transcriptome and chromatin accessibility profiles of a first cell using a method according to any preceding claim; and

(b) comparing the transcriptome and chromatin accessibility profiles so generated with transcriptome and chromatin accessibility profiles of a second cell, such as generated using a method according to any preceding claim.

19. A method according to Claim 18 for detecting a disease state of a cell, in which the transcriptome and chromatin accessibility profiles of a first cell are compared to those of a second cell known to be diseased, preferably in which the disease is cancer.

20. A method according to Claim 18 for identification of a cellular subgroup of a cell, in which the transcriptome and chromatin accessibility profiles of a first cell are compared to those of a second cell known to be of a particular cellular subgroup.

21. A method of diagnosis of a disease in a patient, the method comprising:

(a) generating transcriptome and chromatin accessibility profiles of a first cell from a sample of the patient using a method according to any preceding claim; and

(b) comparing the transcriptome and chromatin accessibility profiles so generated with transcriptome and chromatin accessibility profiles of a second cell known to be diseased; and

diagnosing a disease where the transcriptome and chromatin accessibility profiles of the first cell are similar to those of the second cell.

22. A method which comprises the following steps in order:

(a) providing a nucleic acid sample from an individual cell comprising genomic DNA and messenger RNA (mRNA);

(b) producing transposed DNA from the sample, preferably by treating the sample with Tn5 transposase;

(c) producing complementary DNA (cDNA) from the sample;

(d) separating transposed DNA from cDNA;

(e) amplifying transposed DNA, preferably by PCR;

(f) amplifying cDNA, preferably by PCR and preparing a cDNA library from amplified cDNA; and

(g) optionally sequencing transposed DNA and cDNA.

23. A kit comprising:

(a) reagents for providing a nucleic acid sample from an individual cell comprising chromatin associated DNA and messenger RNA (mRNA);

(b) reagents for obtaining chromatin associated DNA from genomic DNA in the sample, preferably comprising reagents for generating transposed DNA from chromatin associated DNA in the nucleic acid sample, such as transposase, for example a prokaryotic transposase, such as a hyperactive transposase, such as Tn5;

(c) reagents for generating complementary DNA (cDNA) from mRNA in the sample;

(d) reagents for separating transposed DNA from cDNA;

(e) reagents for amplifying transposed DNA and/or cDNA, preferably by polymerase chain reaction (PCR) and reagents for preparing libraries from amplified nucleic acid; and

(f) instructions for use.
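
By way of illustration only, and forming no part of the claims, the comparison step recited in Claims 18 to 21 may be sketched in Python as below. The claims do not specify how similarity between profiles is measured. The sketch assumes, purely for the example, that each profile is a numeric vector (expression values per gene, accessibility values per region), that similarity is scored by Pearson correlation, and that a cut-off of 0.8 is applied. The function names `pearson` and `profiles_similar`, the threshold and the sample values are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_x * var_y)


def profiles_similar(rna_first, atac_first, rna_second, atac_second,
                     threshold=0.8):
    """Return True when BOTH the transcriptome and the accessibility
    profiles of the first cell correlate with those of the second cell
    at or above the (assumed, illustrative) threshold."""
    return (pearson(rna_first, rna_second) >= threshold
            and pearson(atac_first, atac_second) >= threshold)


# Hypothetical example values, not data from this document.
first_rna = [5.0, 0.0, 2.0, 8.0]     # expression per gene, first cell
second_rna = [4.5, 0.2, 2.1, 7.5]    # expression per gene, second cell
first_atac = [1, 0, 1, 1, 0]         # accessibility per region, first cell
second_atac = [1, 0, 1, 1, 0]        # accessibility per region, second cell

similar = profiles_similar(first_rna, first_atac, second_rna, second_atac)
```

In this sketch both profile types must pass the cut-off before the cells are treated as similar. Other similarity measures or decision rules could equally be used, and any threshold would need to be established empirically.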