WO2023194331A1

WO2023194331A1 - CONSTRUCTION OF SEQUENCING LIBRARIES FROM A RIBONUCLEIC ACID (RNA) USING TAILING AND LIGATION OF cDNA (TLC)

Info

Publication number: WO2023194331A1
Application number: PCT/EP2023/058731
Authority: WO
Inventors: Christina ERNST; Didier Trono
Original assignee: Ecole Polytechnique Federale De Lausanne (Epfl)
Priority date: 2022-04-04
Filing date: 2023-04-04
Publication date: 2023-10-12

Abstract

The present invention provides a method for preparing a sequencing library from a ribonucleic acid (RNA) sample.

Description

CONSTRUCTION OF SEQUENCING LIBRARIES FROM A RIBONUCLEIC ACID (RNA) USING TAILING AND LIGATION OF cDNA (TLC)

FIELD OF THE INVENTION

The invention provides a method for preparing a sequencing library from a ribonucleic acid (RNA) sample.

BACKGROUND OF THE INVENTION

Massively parallel (or “next generation”) and long-read sequencing platforms are rapidly transforming data collection and analysis in genome, epigenome, transcriptome and epitranscriptome research. All current short-read sequencing (SRS) platforms, such as those marketed by Illumina®, Ion Torrent™, Roche™ and Life Technologies™, as well as long-read sequencing (LRS) platforms from Pacific Biosciences and Oxford Nanopore Technologies (ONT), require the addition of known adapter sequences to each end of a target polynucleotide.

When constructing high-throughput sequencing libraries from a ribonucleic acid (RNA), the generation of double-stranded cDNA is a crucial and often limiting step of available technologies. The most common strategies involve random priming of the second strand, relying on the RNase H activity of reverse transcriptases (RTs), which leaves short RNA fragments hybridized to the first-strand cDNA that can prime second-strand synthesis. However, the location from which second-strand synthesis is initiated is not controlled in this approach, leading to cDNA molecules of variable length. In many instances, preservation of, and priming from, the original 3’ end of the first-strand cDNA molecule is desirable to preserve the full length of the original target RNA molecule. Such instances include: i) increased transcript coverage during RNA-Seq applications, including single-cell approaches; ii) profiling of 5’ ends of RNA molecules to identify transcription start sites (e.g., rapid amplification of cDNA ends (RACE-Seq)); iii) full-length cDNA sequencing using long-read sequencing platforms; iv) profiling of short RNA species; v) ribosome footprinting; and vi) characterization of short RNA fragments co-purified with RNA-binding proteins or enriched for RNA modifications of interest following UVC crosslinking, in which case the RT termination site needs to be preserved to obtain information regarding the exact crosslinking position.

Current approaches to initiate second-strand synthesis from the 3’ end of cDNA molecules rely on template switch oligos (TSOs) for applications related to i)-iii); adapter ligation to both ends of the RNA target molecule for applications related to iv); and single-stranded DNA (ssDNA) ligation or circularization of first-strand cDNA molecules for applications related to v)-vi).

The main limitations of TSOs are a bias towards capped RNA molecules; a restriction in possible RT conditions (e.g., limitation to low RT temperatures, which are not ideal for structured templates or for processivity on long templates); the potential for intramolecular priming leading to truncated cDNA molecules; and high levels of contamination with adapter concatemers.

Limitations of ssDNA ligation and circularization of cDNA are mainly related to inefficient enzymatic reactions leading to permanent loss of cDNA molecules from the library pool and the large number of steps involved including extensive purifications to avoid unwanted amplification of unligated adapters, which otherwise can block subsequent amplification reactions through hybridisation with amplification primers.

There is a need for new methods of preparing a sequencing library that represents the full length of the original RNA template for sequence analysis.

SUMMARY OF THE INVENTION

An aspect of the present invention provides a method for preparing a sequencing library from a ribonucleic acid (RNA) sample, the method comprising:

(a) obtaining a test sample comprising a plurality of template RNA or RNA precursors,

(b) providing a set of oligonucleotide adapters and primers, the set comprising a plurality of first adapters, a plurality of first strand cDNA synthesis primers, a plurality of second adapters, and a plurality of amplification primers, wherein each of the first adapters comprises (i) a 5’ primer binding domain and a 3’ poly A domain or (ii) a 5’ poly A domain and a 3’ primer binding domain; each of the first strand cDNA synthesis primers comprises an RNA hybridization domain complementary to the template RNA or to the first adapters (e.g., oligo(dT)), and said each of the first strand cDNA synthesis primers is covalently linked to magnetic beads; each of the second adapters comprises primer binding sites, and each of the amplification primers comprises sequencing platform adapter constructs;

(c) ligating the plurality of first adapters to the 3' end of the plurality of RNA precursors to generate template RNA;

(d) generating a plurality of solid-phase first strand cDNA through reverse transcription primed by the first strand cDNA synthesis primer starting from the plurality of template RNA of step (c) or of step (a);

(e) separating the solid-phase first strand cDNA from the plurality of template RNA;

(f) tailing the 3’ ends of the plurality of solid-phase first strand cDNA with non-template ribonucleotides;

(g) ligating the plurality of second adapters to the 3' end of the plurality of solid-phase first strand cDNA;

(h) amplifying the plurality of solid-phase cDNA with amplification primers to generate a plurality of double stranded cDNA that are processed into a sequencing library through addition of sequencing platform adapter constructs.

BRIEF DESCRIPTION OF THE FIGURES

Figure 1 shows a schematic representation of the tailing and ligation of cDNA (TLC) strategy from full-length polyadenylated mRNA to obtain and amplify full-length cDNA. The first strand cDNA synthesis primer consists of a 3’ RNA hybridization domain (RHD) (e.g., oligo(dT)) and a 5’ primer binding site (PBS) containing any desirable sequence.

Figure 2 shows a schematic representation of the tailing and ligation of cDNA (TLC) strategy from full-length polyadenylated mRNA to obtain and amplify full-length cDNA, followed by tagmentation to generate fragments compatible with short-read sequencing platforms (a). The first strand cDNA synthesis primer consists of a 3’ RNA hybridization domain (RHD) (e.g., oligo(dT)), an adjacent post-tagmentation amplification primer binding domain and a 5’ pre-tagmentation amplification primer binding domain containing any desirable sequence. Variations in primer binding domains provided during tagmentation enable specific 5’ end-capture (b), or 3’ end-capture (c). TnRP1 and TnRP2 represent post-tagmentation amplification primer binding domains.

Figure 3 shows a schematic representation of the tailing and ligation of cDNA (TLC) strategy from fragmented RNA (a) or short RNA species (b), employing a random priming approach during reverse transcription (step 3). The first strand cDNA synthesis primer consists of a 3’ RNA hybridization domain (RHD) (e.g., a randomer depicted as N) and a 5’ primer binding site (PBS) containing any desirable sequence.

Figure 4 shows a schematic representation of the tailing and ligation of cDNA (TLC) strategy from fragmented RNA (a) or short RNA species (b), employing polyadenylation of the precursor RNA prior to reverse transcription (step 3) to allow priming from the 3’ end of template RNA using oligo(dT) reverse transcription primers. The first strand cDNA synthesis primer consists of a 3’ RNA hybridization domain (RHD) (e.g., oligo(dT)) and a 5’ primer binding site (PBS) containing any desirable sequence.

Figure 5 shows a schematic representation of the tailing and ligation of cDNA (TLC) strategy from fragmented RNA (a) or short RNA species (b), employing a ligation approach prior to reverse transcription (step 3), which adds a first adapter oligonucleotide to the 3’ end of the RNA precursor, introducing primer binding domains and hybridisation domains (e.g., a poly A stretch). Reverse transcription can then be initiated using first strand cDNA synthesis primers that contain a complementary sequence to the first adapter oligonucleotide (e.g., oligo(dT)).

Figure 6 shows a timecourse of capturing a poly-adenylated first adapter oligonucleotide on oligo(dT)25 Dynabeads.

Figure 7 shows denaturing 10% TBE-Urea PAGE of tailing reaction performed with different Terminal Transferases in the presence of ATP or dATP showing the addition of only a short ribotail in the presence of ribonucleotides.

Figure 8 shows denaturing 10% TBE-Urea PAGE of mock ligation, testing different experimental conditions. TLC ligation conditions are highlighted with a black rectangle; eCLIP ligation conditions are highlighted with a grey rectangle.

Figure 9 shows agarose gels of optimisation reactions to amplify cDNA from long template RNA. Efficiency to generate molecules of 4 kb and 8 kb length is shown for different reverse transcriptases and different reaction temperatures.

Figure 10 shows the percentage of usable reads out of total read fraction for different public CLIP libraries compared to TLC-CLIP libraries.

Figure 11 shows TLC libraries do not form concatemers in the absence of RNA input.

Figure 12 shows per base sequence content before (top) and after (bottom) homopolymer trimming of 1-2 T bases to remove the overrepresentation of T nucleotides resulting from the ribotailing approach.

Figure 13 shows the fraction of overlap between CLIP libraries produced with TLC and public eCLIP libraries, when restricting the comparison to peaks present on genes that have similar expression levels between 293T (TLC) and HepG2 (eCLIP) cells.

Figure 14 shows the result of de novo motif discovery on RBFOX2 peaks, recapitulating the known binding motif and showing a larger fraction of peaks with motifs for CLIP libraries prepared with TLC compared to eCLIP.

Figure 15 shows the motif density at RBFOX2 peaks, comparing CLIP libraries prepared with TLC and eCLIP.

Figure 16 shows the percentage of reads carrying deletions in CLIP libraries prepared with TLC and eCLIP.

Figure 17 shows a high correlation of crosslink-induced deletions at single nucleotide resolution between biological replicates.

Figure 18 shows the nucleotide resolution of crosslink-induced deletions when centred on the consensus motif.

Figure 19 shows that crosslink-induced deletions increase the specificity of CLIP libraries prepared with TLC by distinguishing between crosslinked fragments and co-purifying, non-crosslinked fragments.

Figure 20 shows the percentage of peaks that harbour the consensus motif, depending on the ratio of crosslink-induced deletions per peak.

Figure 21 shows that CLIP libraries prepared with TLC capture high-resolution position-dependent enrichment of RBPs from as little as 500 cells.

Figure 22 shows a representative example of the enrichment of modified RNA, in this case N6-methyladenosine (m6A), over input at specific transcript regions.

Figure 23 shows the dependency of deletions on UV crosslinking conditions (A) and their precise location at single-nucleotide positions (B), which allows the detection of the m6A core motif ‘GGAC’ using de novo motif discovery (C).

DETAILED DESCRIPTION OF THE INVENTION

All publications, patent applications, patents, and other references mentioned herein are incorporated by reference in their entirety. The publications and applications discussed herein are provided solely for their disclosure prior to the filing date of the present application. Nothing herein is to be construed as an admission that the present invention is not entitled to antedate such publication by virtue of prior invention. In addition, the materials, methods, and examples are illustrative only and are not intended to be limiting.

In the case of conflict, the present specification, including definitions, will control. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as is commonly understood by one of ordinary skill in the art to which the subject matter herein belongs. As used herein, the following definitions are supplied in order to facilitate the understanding of the present invention.

The term “comprise” is generally used in the sense of include, that is to say permitting the presence of one or more features or components. Also as used in the specification and claims, the language “comprising” can include analogous embodiments described in terms of “consisting of” and/or “consisting essentially of”.

As used in the specification and claims, the singular form “a”, “an” and “the” include plural references unless the context clearly dictates otherwise.

As used in the specification and claims, the term “and/or” used in a phrase such as “A and/or B” herein is intended to include “A and B”, “A or B”, “A”, and “B”.

A domain refers to a stretch of length of nucleic acid made up of a plurality of nucleotides, where the stretch of length provides a defined function to the nucleic acid. Examples of domains include primer binding domains, hybridization domains, barcode domains (such as source barcode domains), unique molecular identifier domains, sequencing adaptor domains, sequencing indexing domains, etc. While the length of a given domain may vary, in some instances the length ranges from 1 to 100 nt, such as 5 to 50 nt.

Amplification primer binding domains are domains that are configured to bind via hybridization to an amplification primer.

Tagmentation involves fragmentation of double-stranded DNA and simultaneous tagging with primer binding domains and is employed in many next generation sequencing protocols. Pre-tagmentation amplification primer binding domains are domains which are configured to bind to pre-tagmentation amplification primers during an amplification that occurs before a tagmentation step, e.g., a cDNA amplification protocol which occurs prior to a tagmentation step. Post-tagmentation amplification primer binding domains are domains which are configured to bind to post-tagmentation amplification primers during an amplification that occurs after a tagmentation step, e.g., a tagmented sample amplification protocol which occurs after a tagmentation step.

A barcode domain is a domain that serves as an identifier of a nucleic acid. Barcode domains may vary, wherein examples include RNA source barcode domains, e.g., cell barcode domains, host barcode domains, etc.; container barcode domains, such as plate or well barcode domains; in-line barcode domains, indexing barcode domains, etc.

Unique Molecular Identifiers are employed in many next generation sequencing applications. Unique Molecular Identifiers (i.e., UMIs) are randomers of varying length, e.g., ranging in length in some instances from 6 to 12 nts, that can be used for counting of individual molecules of a given molecular species. Counting is achieved by attaching UMIs from a diverse pool of UMIs to individual molecules of a target of interest such that each individual molecule receives a unique UMI. By counting individual transcript molecules, PCR bias can be reduced during NGS library preparation and a more quantitative understanding of the sample population can be achieved. See e.g., U.S. Pat. No. 8,835,358; Fu et al., “Molecular Indexing Enables Quantitative Targeted RNA Sequencing and Reveals Poor Efficiencies in Standard Library Preparations,” PNAS (2014) 5: 1891-1896 and Fu et al., “Digital Encoding of Cellular mRNAs Enabling Precise and Absolute Gene Expression Measurement by Single-Molecule Counting,” Anal. Chem. (2014) 86:2867-2870.

The term “complementary” as used herein refers to a nucleotide sequence that basepairs by non-covalent bonds to all or a region of a target nucleic acid (e.g., a template RNA or other region of the double stranded product nucleic acid). In the canonical Watson-Crick basepairing, adenine (A) forms a basepair with thymine (T), as does guanine (G) with cytosine (C) in DNA. In RNA, thymine is replaced by uracil (U). As such, A is complementary to T and G is complementary to C. In RNA, A is complementary to U and vice versa. Typically, “complementary” refers to a nucleotide sequence that is at least partially complementary. The term “complementary” may also encompass duplexes that are fully complementary such that every nucleotide in one strand is complementary to every nucleotide in the other strand in corresponding positions.
In certain cases, a nucleotide sequence may be partially complementary to a target, in which not all nucleotides are complementary to every nucleotide in the target nucleic acid in all the corresponding positions. For example, a primer may be perfectly (i.e., 100%) complementary to the target nucleic acid, or the primer and the target nucleic acid may share some degree of complementarity which is less than perfect (e.g., 70%, 75%, 85%, 90%, 95%, 99%). The percent identity of two nucleotide sequences can be determined by aligning the sequences for optimal comparison purposes (e.g., gaps can be introduced in the sequence of a first sequence for optimal alignment). The nucleotides at corresponding positions are then compared, and the percent identity between the two sequences is a function of the number of identical positions shared by the sequences (i.e., % identity = # of identical positions / total # of positions × 100). When a position in one sequence is occupied by the same nucleotide at the corresponding position in the other sequence, then the molecules are identical at that position. A non-limiting example of a mathematical algorithm for comparing two sequences is described in Karlin et al., Proc. Natl. Acad. Sci. USA 90:5873-5877 (1993). Such an algorithm is incorporated into the NBLAST and XBLAST programs (version 2.0) as described in Altschul et al., Nucleic Acids Res. 25:3389-3402 (1997). When utilizing BLAST and Gapped BLAST programs, the default parameters of the respective programs (e.g., NBLAST) can be used. In one aspect, parameters for sequence comparison can be set at score=100, wordlength=12, or can be varied (e.g., wordlength=5 or wordlength=20).
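The percent identity calculation described above can be illustrated with a minimal Python sketch. It assumes that the two sequences have already been aligned to equal length, with gaps written as "-". The function name percent_identity and the choice to count gap columns towards the total number of positions are illustrative assumptions and are not taken from the BLAST programs.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity of two pre-aligned sequences of equal length.

    Assumption: gap columns ('-') count towards the total number of
    positions but never count as identical positions.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("aligned sequences must not be empty")
    identical = sum(
        1
        for x, y in zip(aligned_a.upper(), aligned_b.upper())
        if x == y and x != "-"
    )
    # % identity = identical positions / total positions x 100
    return 100 * identical / len(aligned_a)


if __name__ == "__main__":
    # One mismatch in ten aligned positions.
    print(percent_identity("ACGTACGTAC", "ACGTTCGTAC"))
```

A primer that differs from its target at one of ten aligned positions would thus be described as sharing less than perfect identity with the target.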

As used herein, the term “hybridization conditions” means conditions in which a primer specifically hybridizes to a region of the target nucleic acid (e.g., a template RNA or other region of the double stranded product nucleic acid). Whether a primer specifically hybridizes to a target nucleic acid is determined by such factors as the degree of complementarity between the primer and the target nucleic acid and the temperature at which the hybridization occurs, which may be informed by the melting temperature (T_m) of the primer. The melting temperature refers to the temperature at which half of the primer-target nucleic acid duplexes remain hybridized and half of the duplexes dissociate into single strands. The T_m of a duplex may be experimentally determined or predicted using the following formula: T_m = 81.5 + 16.6 (log10[Na⁺]) + 0.41 (fraction G+C) - (60/N), where N is the chain length and [Na⁺] is less than 1 M. See Sambrook and Russell (2001; Molecular Cloning: A Laboratory Manual, 3rd ed, Cold Spring Harbor Press, Cold Spring Harbor N.Y., Ch. 10). Other more advanced models that depend on various parameters may also be used to predict T_m of primer/target duplexes depending on various hybridization conditions. Approaches for achieving specific nucleic acid hybridization may be found in, e.g., Tijssen, Laboratory Techniques in Biochemistry and Molecular Biology-Hybridization with Nucleic Acid Probes, part I, chapter 2, “Overview of principles of hybridization and the strategy of nucleic acid probe assays.” Elsevier (1993).
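As a minimal sketch, the formula above can be transcribed into Python as written. The function name melting_temperature is hypothetical. Taking “fraction G+C” literally as a value between 0 and 1 is an assumption of this sketch; other published variants of this type of formula express the G+C term as a percentage and use a different length correction, so the result should be treated as illustrative only.

```python
import math


def melting_temperature(primer: str, na_molar: float) -> float:
    """Predicted T_m from the formula as stated in the text.

    T_m = 81.5 + 16.6 * log10([Na+]) + 0.41 * (fraction G+C) - 60 / N

    Assumption: 'fraction G+C' is a value between 0 and 1.
    """
    if not primer:
        raise ValueError("primer must not be empty")
    if not 0 < na_molar < 1:
        raise ValueError("[Na+] must be greater than 0 and less than 1 M")
    seq = primer.upper()
    n = len(seq)  # chain length N
    gc_fraction = (seq.count("G") + seq.count("C")) / n
    return 81.5 + 16.6 * math.log10(na_molar) + 0.41 * gc_fraction - 60 / n


if __name__ == "__main__":
    # Hypothetical 20-nt primer at 0.1 M sodium.
    print(round(melting_temperature("ATGC" * 5, 0.1), 3))
```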

A “poly(A)” is a polyA-sequence. The poly(A) sequence is commonly known as a tail that consists of multiple adenosine monophosphates; in other words, it is a stretch of RNA or DNA that has only adenine bases. In eukaryotes, polyadenylation is part of the process that produces mature messenger RNA (mRNA) for translation.

A “template RNA” or “RNA template” refers to a ribonucleic acid (RNA) molecule which serves as template during reverse transcription, during which an RNA-dependent DNA polymerase, or reverse transcriptase, synthesizes a complementary DNA (cDNA). The template RNA needs to contain a known or desired sequence which can hybridize with a first strand cDNA synthesis primer that is then extended in the 5’ to 3’ direction during the reverse transcription reaction, resulting in a first strand cDNA with the first strand cDNA synthesis primer at its 5’ end, followed by a domain complementary to the template RNA at its 3’ end.

A “precursor RNA” or “RNA precursor” refers to a ribonucleic acid (RNA) molecule, or fragments thereof, which requires the addition of nucleotides to its 3’ ends, e.g., through polyadenylation or adapter ligation, before it can serve as template RNA during reverse transcription.

Enzymatic ligation of oligonucleotides is a standard procedure in numerous protocols for oligonucleotide manipulation and is required for sequencing, cloning and many other DNA- and RNA-based technologies. The enzymes that catalyze the ligation reaction form phosphodiester bonds between 5’-phosphate ends of DNA or RNA and 3’-hydroxyl ends. The ligation reaction can join any 5’-phosphate with any 3’-hydroxyl end, since the reaction is not sequence specific. This lack of substrate specificity is a major advantage for a broad general application of enzymatic ligations and has contributed to its wide application.

As used herein, the term "RNA modifications" refers to a broad class of chemical modifications that can occur on RNA molecules, including messenger RNA (mRNA), transfer RNA (tRNA), and ribosomal RNA (rRNA). These modifications can include methylation, acetylation, phosphorylation, oxidation, and other chemical changes that alter the structure, stability, and function of RNA. The modifications can occur at various sites within RNA molecules, including the base, sugar, and phosphate moieties, and can affect various aspects of RNA biology, such as gene expression regulation, RNA processing, and translation.

As used herein, the term "antibody" refers to a type of protein that can specifically recognize and bind to a particular antigen, such as a protein, peptide, or specific nucleic acid modification.

As used herein, the term “non-template ribonucleotides” refers to ribonucleotides that are added by a terminal transferase to the 3’ end of solid-phase first strand cDNA without base-pairing to a template strand.

As used herein, the term “homonucleotide stretch” refers to a stretch of length of nucleic acid made up of the same nucleotide (e.g., all dCTP, all dGTP, all dTTP, all dATP, all CTP, all GTP, all UTP, or all ATP).

As used herein, the term “heteronucleotide stretch” refers to a stretch of length of nucleic acid made up of a plurality of nucleotides.
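The two definitions above can be expressed as a short Python sketch operating on sequence strings. The function names are hypothetical, and reading “a plurality of nucleotides” as “more than one distinct nucleotide” is an assumption of this sketch.

```python
def is_homonucleotide_stretch(seq: str) -> bool:
    """True if the stretch is non-empty and made up of one nucleotide only."""
    return len(seq) > 0 and len(set(seq.upper())) == 1


def is_heteronucleotide_stretch(seq: str) -> bool:
    """True if the stretch contains more than one distinct nucleotide.

    Assumption: 'a plurality of nucleotides' means at least two
    different nucleotides.
    """
    return len(set(seq.upper())) > 1


if __name__ == "__main__":
    print(is_homonucleotide_stretch("TTTTTTTT"))    # oligo(dT)-like stretch
    print(is_heteronucleotide_stretch("ACGTTGCA"))  # mixed sequence
```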

Where a range of values is provided, it is understood that each intervening value, to the tenth of the unit of the lower limit unless the context clearly dictates otherwise, between the upper and lower limit of that range and any other stated or intervening value in that stated range, is encompassed within the methods. The upper and lower limits of these smaller ranges may independently be included in the smaller ranges and are also encompassed within the methods, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in the methods. Certain ranges are presented herein with numerical values being preceded by the term “about.” The term “about” is used herein to provide literal support for the exact number that it precedes, as well as a number that is near to or approximately the number that the term precedes. In determining whether a number is near to or approximately a specifically recited number, the near or approximating unrecited number may be a number which, in the context in which it is presented, provides the substantial equivalent of the specifically recited number.

It is appreciated that certain features of the methods, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the methods, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable subcombination. All combinations of the embodiments are specifically embraced by the present invention and are disclosed herein just as if each and every combination was individually and explicitly disclosed, to the extent that such combinations embrace operable processes and/or devices/systems/kits. In addition, all sub-combinations listed in the embodiments describing such variables are also specifically embraced by the present methods and are disclosed herein just as if each and every such sub-combination was individually and explicitly disclosed herein. As will be apparent to those of skill in the art upon reading this disclosure, each of the individual embodiments described and illustrated herein has discrete components and features which may be readily separated from or combined with the features of any of the other several embodiments without departing from the scope or spirit of the present methods. Any recited method can be carried out in the order of events recited or in any other order which is logically possible.

As disclosed herein, the invention provides methods of preparing a sequencing library from a ribonucleic acid (RNA) sample. Sequencing libraries produced by methods of the invention are those whose nucleic acid members include a partial or complete sequencing platform adapter sequence at their termini useful for sequencing using a sequencing platform of interest. Sequencing platforms of interest include, but are not limited to, HiSeq, MiSeq, NextSeq and NovaSeq sequencing systems from Illumina®; the PACBIO RS II Sequel systems from Pacific Biosciences; the SOLiD sequencing systems from Life Technologies™; the MinION™, GridION™ and PromethION™ systems from Oxford Nanopore; or any other sequencing platform of interest. Methods of preparing sequencing libraries from a ribonucleic acid (RNA) sample are provided. Aspects of the present invention include combining the RNA sample with a first strand cDNA synthesis primer under first strand cDNA synthesis conditions, where in some embodiments the first strand cDNA synthesis primer contains primer binding domains and is complementary to sequences within the RNA sample itself, whereas in other embodiments the first strand cDNA synthesis primer is complementary to specific sequences that contain primer binding domains and were added to the 3’ end of an RNA precursor prior to cDNA synthesis. The resultant first-strand cDNA is combined with one type of nucleoside triphosphate (NTP) and a second adapter oligonucleotide under conditions that allow 3’ tailing and ligation of first-strand cDNA to the second adapter oligonucleotide, which contains primer binding domains.
The resultant product is sufficient to produce double-stranded cDNA, which in one embodiment can then be subjected to amplification conditions, e.g., PCR amplification, using first and second amplification primers that include sequencing adaptor constructs to generate libraries for the desired sequencing platform, whereas in other embodiments, the resultant double-stranded cDNA is subjected to tagmentation prior to amplification. Aspects of the invention further include compositions produced by the methods and kits that find use in practicing the methods.

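An aspect of the present invention provides a method for preparing a sequencing library from a ribonucleic acid (RNA) sample, the method comprising:

(a) obtaining a test sample comprising a plurality of template RNA or RNA precursors;

(b) providing a set of oligonucleotide adapters and primers, the set comprising a plurality of first adapters, a plurality of first strand cDNA synthesis primers, a plurality of second adapters, and a plurality of amplification primers, wherein each of the first adapters comprises (i) a 5’ primer binding domain and a 3’ poly A domain or (ii) a 5’ poly A domain and a 3’ primer binding domain; each of the first strand cDNA synthesis primers comprises an RNA hybridization domain complementary to the template RNA or to the first adapters (e.g., oligo(dT)), and said each of the first strand cDNA synthesis primers is covalently linked to magnetic beads; each of the second adapters comprises primer binding sites, and each of the amplification primers comprises sequencing platform adapter constructs;

(c) ligating the plurality of first adapters to the 3' end of the plurality of RNA precursors to generate template RNA;

(d) generating a plurality of solid-phase first strand cDNA through reverse transcription primed by the first strand cDNA synthesis primer starting from the plurality of template RNA of step (c) or of step (a);

(e) separating the solid-phase first strand cDNA from the plurality of template RNA;

(f) tailing the 3’ ends of the plurality of solid-phase first strand cDNA with non-template ribonucleotides;

(g) ligating the plurality of second adapters to the 3' end of the plurality of solid-phase first strand cDNA;

(h) amplifying the plurality of solid-phase cDNA with amplification primers to generate a plurality of double stranded cDNA that are processed into a sequencing library through addition of sequencing platform adapter constructs.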

In some embodiments of the method for preparing a sequencing library disclosed herein, each template RNA of step (a) already contains a known sequence, such as 3’ polyA domain or specific target sequences of interest as primer binding domain, to serve as hybridization domain; therefore it does not require the ligation of first adapters to the 3’ end of each template RNA of step (c).
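The distinction drawn above between template RNA that can be used directly and RNA precursors that first need nucleotides added to their 3’ end can be sketched in Python. The sketch considers only the 3’ poly A case, not other target sequences of interest. The function name needs_3prime_extension and the minimum poly A length of 10 are hypothetical values chosen for illustration and are not specified in this disclosure.

```python
def needs_3prime_extension(rna: str, min_poly_a: int = 10) -> bool:
    """Return True if the RNA lacks a 3' poly A domain.

    An RNA for which this returns True is treated as an RNA precursor,
    i.e., it would need polyadenylation or first adapter ligation
    before reverse transcription with an oligo(dT) primer.
    Assumption: min_poly_a is an illustrative threshold only.
    """
    seq = rna.upper()
    # Length of the run of A residues at the 3' end of the sequence.
    tail_length = len(seq) - len(seq.rstrip("A"))
    return tail_length < min_poly_a


if __name__ == "__main__":
    polyadenylated = "GGCUACGU" + "A" * 20
    fragment = "GGCUACGUCC"
    print(needs_3prime_extension(polyadenylated))
    print(needs_3prime_extension(fragment))
```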

In some embodiments of the method for preparing a sequencing library disclosed herein, the precursor RNA is fragmented.

In some embodiments of the method for preparing a sequencing library disclosed herein, nucleotides are added to the 3’ end of the precursor RNA through polyadenylation or ligation.

In other embodiments of the method for preparing a sequencing library disclosed herein, each of the first adapters and/or each of the first strand cDNA synthesis primers further comprise a sample barcode and/or unique molecular identifier.

In other embodiments of the method for preparing a sequencing library disclosed herein, each of the second adapters further comprises a sample barcode and/or unique molecular identifier.
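To illustrate how a unique molecular identifier carried in an adapter or primer can later be used for counting individual molecules, the following minimal Python sketch counts distinct UMIs per transcript. It assumes that each read has already been assigned to a transcript and that its UMI has already been extracted; the function name count_molecules and the input layout are hypothetical, and no correction for sequencing errors within UMIs is attempted.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def count_molecules(reads: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count distinct UMIs per transcript.

    reads: iterable of (transcript_id, umi) pairs. Reads sharing both
    the transcript and the UMI are treated as PCR duplicates of one
    original molecule and are counted once.
    """
    umis_per_transcript = defaultdict(set)
    for transcript_id, umi in reads:
        umis_per_transcript[transcript_id].add(umi.upper())
    return {t: len(umis) for t, umis in umis_per_transcript.items()}


if __name__ == "__main__":
    example_reads = [
        ("geneA", "ACGTAC"),
        ("geneA", "ACGTAC"),  # duplicate of the read above
        ("geneA", "TTGCAA"),
        ("geneB", "GGATCC"),
    ]
    print(count_molecules(example_reads))
```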

In some embodiments of the method for preparing a sequencing library disclosed herein, each of the second adapters further comprises a sequencing read primer domain.

In some embodiments of the method for preparing a sequencing library disclosed herein, each of the first strand cDNA synthesis primers is not covalently linked to magnetic beads.

In some further embodiments of the method for preparing a sequencing library disclosed herein, the method further comprises tagmenting the plurality of the double stranded cDNA of step (h) with transposomes to generate a tagmented sample, wherein the transposomes comprise a transposase and a transposon nucleic acid; and wherein the transposon nucleic acid comprises a transposon end domain and a second post-tagmentation amplification primer binding domain.

In some embodiments of the method for preparing a sequencing library disclosed herein, the RNA hybridization domain comprises a heteronucleotide stretch.

In some embodiments of the method for preparing a sequencing library disclosed herein, any of the provided oligonucleotide adapters comprise one or more nucleotide analogs.

In some embodiments of the method for preparing a sequencing library disclosed herein, the template RNA or the RNA precursor is messenger RNA.

In some embodiments of the method for preparing a sequencing library disclosed herein, the RNA hybridization domain of each of the first strand cDNA synthesis primers is primed using randomers.

In some embodiments of the method for preparing a sequencing library disclosed herein, the method further comprises pooling the plurality of first adapters ligated to the plurality of RNA precursors.

In some embodiments of the method for preparing a sequencing library disclosed herein, the method further comprises pooling the plurality of solid-phase first strand cDNA.

In some embodiments of the method for preparing a sequencing library disclosed herein, the test sample comprising a plurality of template RNA or precursor RNA is obtained from a single cell.

In some embodiments of the method for preparing a sequencing library disclosed herein, the method further comprises subjecting the sequencing library to a sequencing protocol.

In some embodiments of the method for preparing a sequencing library disclosed herein, the method further comprises quantitating one or more RNA species of the test sample.

In some embodiments, the methods of the disclosure can be performed according to the schematic diagrammed in FIG. 1. As illustrated in FIG. 1, an RNA sample (squiggly line) can be combined with a reverse transcriptase (not shown), dNTPs (not shown), and a first strand cDNA synthesis primer covalently linked to a magnetic bead, in a reaction mixture under first strand cDNA synthesis conditions, e.g., conditions sufficient to produce a double stranded product nucleic acid that includes the template RNA hybridized to the first strand complementary deoxyribonucleic acid (cDNA), where the first strand cDNA is covalently linked to the magnetic bead and includes the first strand cDNA synthesis primer containing a primer binding domain at its 5’ end and a newly synthesized length or portion that is complementary to domains found in the template RNA. The resultant first-strand cDNA is separated from the template RNA and is contacted with one source of nucleotide triphosphates (e.g., ATP) (not shown), a terminal transferase (not shown), T4 RNA ligase (not shown) and a second adapter oligonucleotide which includes a primer binding domain (PBS) under tailing and ligation conditions, e.g., conditions sufficient to ribotail and ligate the 3’ end of cDNA, which includes the addition of non-template ribonucleotides to the 3’ end of cDNA (depicted as (rA)i-s) followed by ligation of the second adapter oligonucleotide to the 3’ end of cDNA. The resulting first-strand cDNA can be amplified with a primer that binds the primer binding domain of the second adapter oligonucleotide, generating full-length double-stranded cDNA that can be amplified and uncoupled from magnetic beads with primers that bind the primer binding domains at both ends of the cDNA, which can include additional sequencing adapter sequences, such as the P5 and P7 sequences, as well as the forward and reverse indexes (e.g., i5, i7) for sequencing, as desired.
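By way of illustration only, the order of domains in the tailed and ligated first strand cDNA described with reference to FIG. 1 may be modelled with plain strings. The following is a minimal sketch and not part of the disclosed method: the function names, the placeholder sequences, the use of lowercase "a" to mark ribonucleotides, and the tail length passed as n_ribo are all assumptions made for the example.

```python
# Illustrative string model only; all sequences and the tail length are hypothetical.
DNA_COMPLEMENT_OF_RNA = {"A": "T", "C": "G", "G": "C", "U": "A"}


def first_strand_cdna(template_rna: str, primer_binding_domain: str) -> str:
    """Return the first strand cDNA written 5'->3'.

    The 5' primer binding domain does not hybridize to the RNA; the rest is
    the DNA reverse complement of the template RNA.
    """
    copied = "".join(DNA_COMPLEMENT_OF_RNA[base] for base in reversed(template_rna))
    return primer_binding_domain + copied


def tail_and_ligate(cdna: str, second_adapter: str, n_ribo: int) -> str:
    """Append a ribo-A tail (lowercase) and then the second adapter to the 3' end."""
    if n_ribo < 1:
        # Assumption of this sketch: ligation follows the addition of at least one rA.
        raise ValueError("n_ribo must be at least 1")
    return cdna + "a" * n_ribo + second_adapter


template = "GGCAUAAAA"  # hypothetical polyadenylated template RNA, 5'->3'
cdna = first_strand_cdna(template, primer_binding_domain="ACGT")
product = tail_and_ligate(cdna, second_adapter="CTGAC", n_ribo=3)
print(product)
```

In this model the original 3’ end of the first strand cDNA remains identifiable as the position immediately upstream of the lowercase ribo-A tail.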

In another embodiment of the method for preparing a sequence library of the invention, each of the first adapters and/or each of the second adapter oligonucleotides further comprises a sample barcode and/or unique molecular identifier. In a preferred embodiment of the method of preparing a sequence library of the invention, the cDNA synthesis primers are covalently linked to magnetic beads generating solid-phase first-strand cDNA. In another embodiment, the cDNA synthesis primers are uncoupled, requiring additional purification procedures in-between aspects of the invention.
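For illustration, a sample barcode and a unique molecular identifier (UMI) carried by an adapter may be used after sequencing to assign reads to samples and to collapse duplicate reads. The following is a minimal sketch under stated assumptions: the barcode and UMI are assumed to sit at the start of each read with fixed, hypothetical lengths, and the function names are not taken from the disclosure.

```python
from collections import defaultdict


def split_read(read: str, barcode_len: int, umi_len: int) -> tuple[str, str, str]:
    """Split a read into (sample barcode, UMI, insert); the layout is an assumption."""
    prefix = barcode_len + umi_len
    if len(read) < prefix:
        raise ValueError("read is shorter than barcode plus UMI")
    return read[:barcode_len], read[barcode_len:prefix], read[prefix:]


def count_unique_molecules(reads: list[str], barcode_len: int, umi_len: int) -> dict[str, int]:
    """Count distinct (UMI, insert) pairs per sample barcode."""
    seen: dict[str, set[tuple[str, str]]] = defaultdict(set)
    for read in reads:
        barcode, umi, insert = split_read(read, barcode_len, umi_len)
        seen[barcode].add((umi, insert))
    return {barcode: len(molecules) for barcode, molecules in seen.items()}


# Hypothetical reads: 4-nt sample barcode, 3-nt UMI, then the cDNA insert.
reads = ["AAAACCCTTTT", "AAAACCCTTTT", "AAAAGGGTTTT", "CCCCAAATTTT"]
counts = count_unique_molecules(reads, barcode_len=4, umi_len=3)
print(counts)
```

Reads sharing a barcode, UMI and insert are counted once, so amplification duplicates do not inflate the per-sample molecule count.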

By “conditions sufficient to produce a double stranded product nucleic acid” is meant reaction conditions that permit hybridization of the first strand cDNA synthesis primer to the template RNA and polymerase-mediated extension of its 3’ end. Achieving suitable reaction conditions may include selecting reaction mixture components, concentrations thereof, and a reaction temperature to create an environment in which the polymerase is active and the relevant nucleic acids in the reaction interact (e.g., hybridize) with one another in the desired manner. For example, in addition to the template RNA, the polymerase, the first strand cDNA synthesis primer and dNTPs, the reaction mixture may include buffer components that establish an appropriate pH, salt concentrations (e.g., KCl concentration), metal cofactor concentration (e.g., Mg²⁺ or Mn²⁺ concentration), and the like, for the extension reaction to occur. Other components may be included, such as one or more nuclease inhibitors (e.g., an RNase inhibitor and/or a DNase inhibitor), one or more additives for facilitating amplification/replication of GC-rich sequences (e.g., betaine, DMSO, ethylene glycol, 1,2-propanediol, or combinations thereof), one or more molecular crowding agents (e.g., polyethylene glycol, Ficoll, dextran or the like), one or more enzyme-stabilizing components (e.g., DTT, or TCEP, present at a final concentration ranging from 0.1 to 10 mM (e.g., 1 mM)), and/or any other reaction mixture components useful for facilitating polymerase-mediated extension reaction. The reaction mixture can have a pH suitable for the primer extension reaction, which in certain embodiments can range from 5 to 9, such as from 7 to 9, including from 8 to 9, e.g., 8 to 8.5. In some instances, the reaction mixture includes a pH adjusting agent. pH adjusting agents of interest include, but are not limited to, sodium hydroxide, hydrochloric acid, phosphoric acid buffer solution, citric acid buffer solution and the like.
For example, the pH of the reaction mixture can be adjusted to the desired range by adding an appropriate amount of the pH adjusting agent. The temperature range suitable for production of the double stranded product nucleic acid may vary according to factors such as the particular polymerase employed, the melting temperatures of any optional primers employed, etc. According to one embodiment, the polymerase is a reverse transcriptase (e.g., an MMLV mutant such as SuperScript® IV reverse transcriptase from ThermoFisher®) and the reaction mixture conditions sufficient to produce the double stranded product nucleic acid include bringing the reaction mixture to a temperature ranging from 4 °C to 72 °C, such as from 16 °C to 70 °C, e.g., 37 °C to 50 °C, including 50 °C.
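As an illustration of how the broadest ranges recited above for first strand cDNA synthesis might be checked when planning a reaction, the following minimal sketch flags planned conditions that fall outside those ranges. The dictionary keys, the function name and the example values are assumptions made for the example; only the numeric ranges (pH 5 to 9, 4 °C to 72 °C, and 0.1 to 10 mM for a reducing agent such as DTT or TCEP) are taken from the text.

```python
# Broadest ranges recited in the description; key names are hypothetical.
FIRST_STRAND_RANGES = {
    "pH": (5.0, 9.0),
    "temperature_c": (4.0, 72.0),
    "reducing_agent_mm": (0.1, 10.0),
}


def out_of_range(conditions: dict[str, float],
                 ranges: dict[str, tuple[float, float]]) -> list[str]:
    """Return the names of planned conditions that fall outside the given ranges."""
    flagged = []
    for name, value in conditions.items():
        if name not in ranges:
            continue  # no range recited for this parameter, so nothing to check
        low, high = ranges[name]
        if not low <= value <= high:
            flagged.append(name)
    return flagged


planned = {"pH": 8.3, "temperature_c": 50.0, "reducing_agent_mm": 1.0}
print(out_of_range(planned, FIRST_STRAND_RANGES))
```

A check of this kind only confirms that the values lie inside the recited ranges; it does not indicate whether a particular polymerase is active under those conditions.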

By “conditions sufficient to ribotail and ligate the 3’ end of cDNA” is meant reaction conditions that permit the terminal transferase-mediated extension of the 3’ end of cDNA with non-template NTPs (e.g., ATP), followed by ligation of the second adapter oligonucleotide to the 3’ end of cDNA. Achieving suitable reaction conditions may include selecting reaction mixture components, concentrations thereof, and a reaction temperature to create an environment in which the terminal transferase and RNA ligase are active in the desired manner. For example, in addition to the first strand cDNA, the terminal transferase, the RNA ligase, the second adapter oligonucleotide and one source of NTPs (e.g., ATP), the reaction mixture may include buffer components that establish an appropriate pH, salt concentrations (e.g., KCl concentration), metal cofactor concentration (e.g., Mg²⁺ or Mn²⁺ concentration), and the like, for the extension and ligation reaction to occur. Other components may be included, such as nuclease inhibitors (e.g., a DNase inhibitor), one or more additives that inhibit secondary structures (e.g., betaine, DMSO, ethylene glycol, 1,2-propanediol, or combinations thereof), one or more molecular crowding agents (e.g., polyethylene glycol, Ficoll, dextran or the like), and/or any other reaction mixture components useful for facilitating tailing and ligation. The reaction mixture can have a pH suitable for the ligation reaction, which in certain embodiments can range from 5 to 9, such as from 7 to 9, including from 8 to 9, e.g., 7 to 8. In some instances, the reaction mixture includes a pH adjusting agent. pH adjusting agents of interest include, but are not limited to, sodium hydroxide, hydrochloric acid, phosphoric acid buffer solution, citric acid buffer solution and the like. For example, the pH of the reaction mixture can be adjusted to the desired range by adding an appropriate amount of the pH adjusting agent.
The temperature range suitable for tailing and ligation may vary and include a temperature ranging from 4 °C to 37 °C. According to one embodiment, the terminal transferase is a terminal deoxynucleotidyl transferase (e.g., TdT from Takara®), which catalyzes the template-independent incorporation of ribonucleotides into the 3’-OH termini of single strand cDNA and is added to the reaction mixture to a final concentration from 0.1 to 10 units/µl (U/µl). The RNA ligase (e.g., T4 RNA Ligase 1) catalyzes the ligation of a 5’ phosphoryl-terminated nucleic acid donor (e.g., the second adapter oligonucleotide) to the 3’-OH termini of single strand cDNA through the formation of a 3’-5’ phosphodiester bond with hydrolysis of ATP to AMP and PPi, and is added to the reaction mixture to a final concentration from 1 to 50 units/µl (U/µl, e.g., 2.25 U/µl).

The template ribonucleic acid (RNA) or RNA precursor within the RNA sample or the test sample may be a polymer of any length composed of ribonucleotides. The template RNA or precursor RNA may be any type of RNA (or sub-type thereof), including but not limited to, a messenger RNA (mRNA), a microRNA (miRNA), a small interfering RNA (siRNA), a transacting small interfering RNA (ta-siRNA), a natural small interfering RNA (nat-siRNA), a ribosomal RNA (rRNA), a transfer RNA (tRNA), a small nucleolar RNA (snoRNA), a small nuclear RNA (snRNA), a long non-coding RNA (lncRNA), a non-coding RNA (ncRNA), a transfer-messenger RNA (tmRNA), a precursor messenger RNA (pre-mRNA), a small Cajal body-specific RNA (scaRNA), a piwi-interacting RNA (piRNA), an endoribonuclease-prepared siRNA (esiRNA), a small temporal RNA (stRNA), a signal recognition RNA, a telomere RNA, a ribozyme, or any combination of RNA types thereof or subtypes thereof.

The template RNA or RNA precursor may be subject to a variety of chemical modifications that can alter its structure, function, stability, or interactions with other molecules. Such modifications include, but are not limited to, methylation, acetylation, phosphorylation, oxidation, deamination, ribose methylation, uridine isomerization, pseudouridylation, and many others. These modifications can occur at various positions of the RNA molecule, including the bases, sugars, or phosphate backbone, and can be catalyzed by various enzymes or chemical reagents.

The RNA sample or test sample that includes the template RNA or RNA precursor may be combined into the reaction mixture in an amount sufficient for producing the product nucleic acid. In some embodiments, the RNA sample or test sample that includes the template RNA or RNA precursor is isolated from 1 or more, 10 or more, 20 or more, 50 or more, 100 or more, 500 or more cells, such as 750 or more, 1000 or more, 2000 or more cells, including 5000 or more cells.

The template RNA or RNA precursor may be present in any nucleic acid sample of interest, including but not limited to, a nucleic acid sample isolated from a single cell, a plurality of cells (e.g., cultured cells), a tissue, an organ, a body fluid, and/or an organism (e.g., bacteria, yeast, or higher eukaryotic organisms, such as a plant, a mouse, a worm, or the like). In certain aspects, the nucleic acid sample is isolated from a cell(s), tissue, organ, and/or the like of a mammal (e.g., a human, a rodent (e.g., a mouse), or any other mammal of interest). In other aspects, the sample may be isolated from a bodily compartment suitable for use in diagnosis, such as blood, urine, saliva, platelets, microvesicles, exosomes, serum, or other bodily fluids. In other aspects, the nucleic acid sample is isolated from a source other than a mammal, such as bacteria, yeast, insects (e.g., drosophila), amphibians (e.g., frogs (e.g., Xenopus)), viruses, plants, or any other non-mammalian nucleic acid sample source.

In some embodiments, the test sample is a biological sample, such as a tissue and/or body fluid sample or a combination thereof. Biological samples in accordance with embodiments of the invention can be collected in any clinically acceptable manner. In some embodiments, a biological sample can comprise a tissue, a body fluid, or a combination thereof. In some embodiments, a biological sample is collected from a healthy subject. In some embodiments, a biological sample is collected from a subject who is known to have a particular disease or disorder (e.g., a particular cancer or tumor). In some embodiments, a biological sample is collected from a subject who is suspected of having a particular disease or disorder.

As used herein, the term "tissue" refers to a mass of connected cells and/or extracellular matrix material(s). Non-limiting examples of tissues that are commonly used in conjunction with the present methods include skin, hair, fingernails, endometrial tissue, nasal passage tissue, central nervous system (CNS) tissue, neural tissue, eye tissue, liver tissue, kidney tissue, placental tissue, mammary gland tissue, gastrointestinal tissue, musculoskeletal tissue, genitourinary tissue, bone marrow, and the like, derived from, for example, a human or nonhuman mammal. Tissue samples in accordance with embodiments of the invention can be prepared and provided in the form of any tissue sample types known in the art, such as, for example and without limitation, formalin-fixed paraffin-embedded (FFPE), fresh, and fresh frozen (FF) tissue samples.

As used herein, the term "body fluid" refers to a liquid material derived from a subject, e.g., a human or non-human mammal. Non-limiting examples of body fluids that are commonly used in conjunction with the present methods include mucous, blood, plasma, serum, serum derivatives, synovial fluid, lymphatic fluid, bile, phlegm, saliva, sweat, tears, sputum, amniotic fluid, menstrual fluid, vaginal fluid, semen, urine, cerebrospinal fluid (CSF), such as lumbar or ventricular CSF, gastric fluid, a liquid sample comprising one or more material(s) derived from a nasal, throat, or buccal swab, a liquid sample comprising one or more materials derived from a lavage procedure, such as a peritoneal, gastric, thoracic, or ductal lavage procedure, and the like. In some embodiments, a biological sample can comprise a fine needle aspirate or biopsied tissue. In some embodiments, a biological sample can comprise media containing cells or biological material. In some embodiments, a biological sample can comprise a blood clot, for example, a blood clot that has been obtained from whole blood after the serum has been removed. In some embodiments, a biological sample can comprise stool. In one preferred embodiment, a biological sample is drawn whole blood. In one aspect, only a portion of a whole blood sample is used, such as plasma, red blood cells, white blood cells, and platelets. In some embodiments, a biological sample is separated into two or more component parts in conjunction with the present methods. For example, in some embodiments, a whole blood sample is separated into plasma, red blood cell, white blood cell, and platelet components.

In some embodiments, a sample includes a plurality of nucleic acids not only from the subject from which the sample was taken, but also from one or more other organisms, such as viral or bacterial DNA/RNA that is present within the subject at the time of sampling.

Approaches, reagents and kits for isolating RNA from such sources are known in the art. For example, kits for isolating RNA from a source of interest are commercially available. In certain aspects, the RNA is isolated from a fixed biological sample, e.g., formalin-fixed, paraffin-embedded (FFPE) tissue. RNA from FFPE tissue may be isolated using commercially available kits.

In some embodiments as depicted in FIG. 3a, the subject methods include producing the template RNA from a precursor RNA. For example, when it is desirable to control the size of the template RNA that is combined into the reaction mixture, an RNA sample containing RNA precursors from a source of interest may be subjected to shearing/fragmentation, e.g., to generate a sample that includes template RNAs that are shorter in length as compared to non-sheared precursor RNAs (e.g., full-length mRNAs) in the original sample. In some embodiments, the RNA may be used directly from the lysed cell by placing the cell in a suitable buffer (e.g., a hypotonic solution), optionally in the presence of detergent (e.g., Tween-20, Triton X-100, NP-40, and/or Igepal CA-630), so as to lyse the cell. RT reaction components may then be added directly to the lysate without further isolation to generate cDNA from the cellular RNA. The template RNA may be generated by a shearing/fragmentation strategy including, but not limited to, passing the sample one or more times through a micropipette tip or fine-gauge needle, nebulizing the sample, sonicating the sample (e.g., using Bioruptor, Branson, or Covaris sonicator), bead-mediated shearing, enzymatic shearing (e.g., using one or more RNA-shearing enzymes, or by enzymatic digestions, e.g., with restriction enzymes or other endonucleases appropriate for the polynucleotides of interest, including, but not limited to, RNase A, RNase I, RNase T1, and MNase), chemical-based fragmentation, e.g., using divalent cations, fragmentation buffer (which may be used in combination with heat) or any other suitable approach for shearing/fragmenting a precursor RNA to generate a shorter template RNA. In certain aspects, the template RNA generated by shearing/fragmentation of a starting nucleic acid sample has a particular length, as appropriate for the sequencing platform chosen.

Additional strategies for producing a template RNA from a precursor RNA may be employed as depicted in FIG. 4. For example, producing a template RNA may include adding nucleotides to an end of the precursor RNA. In certain aspects, the precursor RNA is a non-polyadenylated RNA (e.g., a microRNA, small RNA, or the like), and producing the template RNA includes adenylating (e.g., polyadenylating) the precursor RNA. Adenylating the precursor RNA may be performed using any convenient approach. According to certain embodiments, the adenylation is performed enzymatically, e.g., using Poly(A) polymerase or any other enzyme suitable for catalyzing the incorporation of adenine residues at the 3’ terminus of the precursor RNA. Reaction mixtures for carrying out the adenylation reaction may include any useful components, including but not limited to, a polymerase, a buffer (e.g., a Tris-HCl buffer), one or more metal cations (e.g., MgCl2, MnCl2, or combinations thereof), a salt (e.g., NaCl), one or more enzyme-stabilizing components (e.g., DTT), ATP, and any other reaction components useful for facilitating the adenylation of a precursor RNA. The adenylation may be carried out at a temperature (e.g., 30 °C to 50 °C, such as 37 °C) and pH (e.g., pH 7 to pH 8.5, such as pH 7.9) compatible with the polymerase being employed, e.g., Poly(A) polymerase.

In another embodiment, illustrated in FIG. 5, approaches for adding nucleotides to a precursor RNA include ligation-based strategies, where an RNA target can be combined with an RNA ligase (e.g., T4 RNA ligase) and a first adapter oligonucleotide, which contains an amplification primer and a hybridization sequence domain complementary to the cDNA synthesis primer (e.g., polyA) under ligation conditions, e.g., conditions sufficient to produce a product chimeric nucleic acid that includes the template RNA with the first adapter oligonucleotide at its 3’ end. By “conditions sufficient to produce a product chimeric nucleic acid” is meant reaction conditions that permit the ligation of the first adapter oligonucleotide to the 3’ end of the RNA precursor, catalyzed by the RNA ligase. Achieving suitable reaction conditions may include selecting reaction mixture components, concentrations thereof, and a reaction temperature to create an environment in which the RNA ligase is active in the desired manner. For example, in addition to the RNA precursor, first adapter oligonucleotide, and the RNA ligase, the reaction mixture may include buffer components that establish an appropriate pH, salt concentrations (e.g., KCl concentration), metal cofactor concentration (e.g., Mg²⁺ or Mn²⁺ concentration), and the like, for the ligation reaction to occur. Other components may be included, such as nuclease inhibitors (e.g., an RNase inhibitor), one or more additives that inhibit secondary structures (e.g., betaine, DMSO, ethylene glycol, 1,2-propanediol, or combinations thereof), one or more molecular crowding agents (e.g., polyethylene glycol, Ficoll, dextran or the like), and/or any other reaction mixture components useful for facilitating ligation. The reaction mixture can have a pH suitable for the ligation reaction, which in certain embodiments can range from 5 to 9, such as from 7 to 9, including from 8 to 9, e.g., 7 to 8.
In some instances, the reaction mixture includes a pH adjusting agent. pH adjusting agents of interest include, but are not limited to, sodium hydroxide, hydrochloric acid, phosphoric acid buffer solution, citric acid buffer solution and the like. For example, the pH of the reaction mixture can be adjusted to the desired range by adding an appropriate amount of the pH adjusting agent. The temperature range suitable for ligation may vary and include a temperature ranging from 4 °C to 37 °C. According to one embodiment, the RNA ligase is T4 RNA Ligase I, which catalyzes the ligation of a 5’ pre-adenylated nucleic acid donor (e.g., the first adapter oligonucleotide) to the 3’-OH termini of the RNA precursor through the formation of a 3’-5’ phosphodiester bond, and is added to the reaction mixture to a final concentration from 1 to 50 units/µl (U/µl, e.g., 2.25 U/µl).
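For illustration, the volume of enzyme stock needed to reach a stated final concentration follows from simple dilution arithmetic (final concentration multiplied by reaction volume, divided by stock concentration). The following is a minimal sketch: the function name, the 30 U/µl stock concentration and the 20 µl reaction volume are hypothetical values chosen for the example, and only the 2.25 U/µl final concentration is taken from the text.

```python
def enzyme_volume_ul(stock_u_per_ul: float, final_u_per_ul: float,
                     reaction_volume_ul: float) -> float:
    """Volume of enzyme stock (µl) giving the requested final concentration."""
    if min(stock_u_per_ul, final_u_per_ul, reaction_volume_ul) <= 0:
        raise ValueError("all quantities must be positive")
    if final_u_per_ul > stock_u_per_ul:
        raise ValueError("final concentration cannot exceed the stock concentration")
    return final_u_per_ul * reaction_volume_ul / stock_u_per_ul


# Hypothetical stock (30 U/µl) and reaction volume (20 µl); 2.25 U/µl is the recited example.
volume = enzyme_volume_ul(stock_u_per_ul=30.0, final_u_per_ul=2.25, reaction_volume_ul=20.0)
print(f"{volume:.2f} µl of ligase stock")
```

The same arithmetic applies to the other enzymes for which a final concentration in U/µl is recited.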

In such embodiment, the test sample is obtained by a method for purifying ribosome nascent-chain complexes of a biological sample of interest to obtain ribosome-coated mRNA fragments that serve as RNA precursors for the methods described herein.

In another embodiment, the test sample is obtained by a method for purifying an RNA molecule from a biological sample, where the RNA molecule carries a particular modification of interest, comprising:

(I) cleaving the RNA molecule by contacting the biological sample with an agent capable of cleaving the phosphodiester bond, thereby generating a fragment of the RNA molecule, wherein the majority of fragments are around 100 nucleotides in length;

(II) contacting the RNA fragment in said biological sample with a molecule that specifically interacts with a particular modification of interest, wherein said molecule can be a protein such as an antibody;

(III) contacting the biological sample with an agent that creates a covalent bond between the RNA molecule and the molecule that specifically interacts with the modification of interest, thereby generating a covalently bound complex containing the RNA with the modification of interest;

(IV) purifying the complex obtained in step (III) to provide RNA fragments containing the modification of interest, wherein said RNA fragments are used as precursor RNA in the method for preparing a sequence library disclosed herein.

In an embodiment, the agent capable of cleaving a phosphodiester bond in step (I) is a chemical agent, such as divalent cations (e.g., zinc, magnesium), which can catalyze the cleavage of the phosphodiester bond under specific conditions. The use of divalent cations is advantageous in that they can be easily removed from the reaction mixture, minimizing potential interference with downstream analysis. In another embodiment, enzymatic fragmentation is used, where a ribonuclease or other suitable enzyme is added to the biological sample to cleave the RNA molecule at the site of interest. This approach allows for site-specific cleavage of the RNA molecule and can be optimized to achieve high specificity and efficiency. In another embodiment, heat fragmentation, which involves heating the RNA molecule to high temperatures, can also be used to break the phosphodiester bond and generate RNA fragments for downstream analysis.

In another embodiment, the test sample is obtained by a method for purifying an RNA molecule interacting with an RNA binding protein (RBP) of interest in a biological sample, comprising:

(1) contacting the biological sample with an agent that creates a covalent bond between the RNA molecule and the RBP of interest, thereby generating a covalently bound RBP-RNA complex containing the RNA molecule;

(2) cleaving the RNA molecule by contacting the RBP-RNA complex with an agent capable of cleaving a bond thereof, thereby generating a fragment of the RNA molecule, wherein the fragment is at least 22 nucleotide bases in length;

(3) selecting the RBP-RNA fragment complex in said biological sample with a molecule that specifically interacts with a component of the RBP-RNA fragment complex; and

(4) purifying the RBP-RNA fragment complex obtained in step (3) to provide RNA fragments interacting with the RBP of interest, wherein said RNA fragments are used as precursor RNA in the method for preparing a sequence library disclosed herein.

In an embodiment, the agent capable of cleaving a bond in step (2) is a nuclease, including but not limited to, RNase A, RNase I, RNase T1 and/or MNase.

In an embodiment, purifying the RNA complex of step (IV) and (4) of the above-disclosed methods for obtaining test samples comprises a chromatographic method.

In another embodiment, purifying the RNA-protein complex of step (IV) and (4) of the above-disclosed methods for obtaining test samples is performed under stringent conditions comprising:

(i) washing the complexes with buffer at least 5 times;

(ii) boiling the complexes in a denaturing ionic detergent;

(iii) separating the complexes by SDS-PAGE;

(iv) transferring said complexes to a substrate that preferentially binds RNA covalently crosslinked to protein over RNA not covalently crosslinked to protein; and

(v) digesting said protein with a protease to liberate said fragments of RNA from said RNA-protein complexes.

In another embodiment, purifying the RNA complex of step (IV) and (4) of the above-disclosed methods for obtaining test samples is performed using hybridization or affinity capture of nucleotides.

In one embodiment, the covalent bond of step (1) and (III) of the above-disclosed methods for obtaining test samples is formed with irradiation. The source of irradiation may emit, in one embodiment, radiation of a discrete wavelength. In another embodiment, the source may emit radiation dispersed throughout a region of the electromagnetic radiation spectrum. In another embodiment, the source may emit a mixture of radiation, some of which is of a discrete wavelength, and some of which is dispersed throughout a region of the electromagnetic radiation spectrum.

In one embodiment, the irradiation may result from a polychromatic irradiation source. Polychromatic refers, in one embodiment, to a source that emits radiation of various wavelengths. Such wavelengths may be anywhere in the electromagnetic radiation spectrum. The radiation emission spectra of various types of irradiation sources are known in the art.

In another embodiment, the irradiation may result from a monochromatic irradiation source. Monochromatic refers, in one embodiment, to a source that emits radiation of a single wavelength. In another embodiment, monochromatic refers to a source that emits radiation primarily of a single wavelength.

In another embodiment, the irradiation may result from a mercury light. Mercury lamps emit radiation of 254 nm, and may also have polychromatic background emissions at other discrete wavelengths, e.g., 313 nm, 365 nm, 405 nm, 436 nm, 546 nm, 579 nm, 1015 nm and 1140 nm. This is a fairly unique characteristic of these types of lamps (see U.S. Pat. No. 6,611,375).

In another embodiment, the irradiation may result from a two-photon excitation apparatus (So P T et al, Cell Mol Bio (Noisy le grand) 44:771). In this technique, small structures are formed by multiple photon-induced polymerization or cross-linking of a precursor composition. “Multiple photon” as used herein means, in one embodiment, the simultaneous absorption of multiple photons by a reactive molecule. This method is described in detail in U.S. Pat. No. 6,316,153 and references therein.

In one embodiment, the irradiation used to form the covalent bond of step (1) and (III) of the above-disclosed methods for obtaining test samples is ultraviolet irradiation. Ultraviolet radiation, in one embodiment, is a form of energy that occupies a portion of the electromagnetic radiation spectrum (the electromagnetic radiation spectrum ranges from cosmic rays to radio waves). Ultraviolet radiation can come from many natural and artificial sources. Depending on the source of ultraviolet radiation, it may be accompanied by other (non-ultraviolet) types of electromagnetic radiation (e.g., visible light). Particular types of ultraviolet radiation are herein described in terms of wavelength. Wavelength is herein described in terms of nanometers (“nm”). In one embodiment, ultraviolet radiation extends from approximately 180 nm to 400 nm. In another embodiment, the ultraviolet radiation has a wavelength of about 254 nm. In another embodiment, the ultraviolet radiation has a different wavelength. When a radiation source, by virtue of filters or other means, does not allow radiation below a particular wavelength (e.g., 320 nm), it is said to have a low end “cutoff” at that wavelength (e.g., “a wavelength cutoff at 300 nanometers”). Similarly, when a radiation source allows only radiation below a particular wavelength (e.g., 360 nm), it is said to have a high end “cutoff” at that wavelength (e.g., “a wavelength cutoff at 360 nanometers”). In another embodiment, the source of ultraviolet radiation is a fluorescent source. All of these sources represent separate embodiments of the present invention. In one embodiment, the device of the present invention comprises an additional filtering means. In one embodiment, the filtering means comprises a liquid filter solution that transmits only a specific region of the electromagnetic spectrum.
The use of sources of irradiation is well known to those skilled in the art (see, for example Diffey, B L, Methods 28:4-13; and Chen J et al, Cancer J. 8: 154-63). Each type of radiation represents a separate embodiment of the present invention.
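For illustration, the effect of a low end and a high end cutoff on a source with discrete emission lines may be sketched as follows. This is a minimal sketch under stated assumptions: the function name is hypothetical, the emission lines are those listed above for mercury lamps, and a wavelength exactly at the low end cutoff is treated as transmitted while a wavelength exactly at the high end cutoff is treated as blocked, following the wording of the definitions above.

```python
from typing import Optional

# Mercury lamp emission lines (nm) as listed in the description.
MERCURY_LINES_NM = [254, 313, 365, 405, 436, 546, 579, 1015, 1140]


def transmitted(lines_nm: list[int],
                low_cutoff_nm: Optional[float] = None,
                high_cutoff_nm: Optional[float] = None) -> list[int]:
    """Return the lines that pass a low end and/or a high end cutoff."""
    passed = []
    for wavelength in lines_nm:
        if low_cutoff_nm is not None and wavelength < low_cutoff_nm:
            continue  # a low end cutoff does not allow radiation below it
        if high_cutoff_nm is not None and wavelength >= high_cutoff_nm:
            continue  # a high end cutoff allows only radiation below it
        passed.append(wavelength)
    return passed


print(transmitted(MERCURY_LINES_NM, high_cutoff_nm=360))
print(transmitted(MERCURY_LINES_NM, low_cutoff_nm=300, high_cutoff_nm=400))
```

With a high end cutoff at 360 nm, only the 254 nm and 313 nm lines of this source remain.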

In one embodiment, a chemical group such as, for example, puromycin is added to RNA to facilitate formation of the covalent bond of step (1). This method is described in Rodriguez-Fonseca C et al (RNA 6:744-54).

In some embodiments, a photoreactive nucleoside (e.g., 4-thiouridine and 6-thioguanosine) can be added to the biological sample of interest to increase crosslinking efficiency at a wavelength which is significantly absorbed by the photoreactive nucleoside such that covalent cross-links are formed between the modified RNA transcript and a protein and the RNA is not damaged.

In one embodiment, the covalent bond of step (1) or (III) of the above-disclosed methods for obtaining test samples is formed with a chemical. In one embodiment, the chemical is formaldehyde. In another embodiment, the chemical is a derivative of formaldehyde. In another embodiment, the chemical is paraformaldehyde. In another embodiment, the chemical is glutaraldehyde. In another embodiment, the chemical is osmium tetroxide. In another embodiment, the chemical is acetone. In another embodiment, the chemical is an alcohol. In another embodiment, the chemical is an NHS ester. In another embodiment, the chemical is a maleimide. In another embodiment, the chemical is a haloacetyl. In another embodiment, the chemical is a pyridyl disulfide. In another embodiment, the chemical is a sulfhydryl modifier such as SATA, SPDP or Traut's Reagent. In another embodiment, the chemical is hydrazide. In another embodiment, the chemical is 1-Ethyl-3-(3-Dimethylaminopropyl)-Carbodiimide Hydrochloride. In another embodiment, the chemical is an aryl azide or a derivative thereof. In another embodiment, the chemical is any other cross-linking compound known in the art. The cross-linking compound may, in one embodiment, be applied over a broad range of concentrations. Each type of chemical represents a separate embodiment of the present invention.

The methods of the present disclosure include combining a polymerase with the plurality of template RNA to generate solid-phase first strand cDNA. A variety of polymerases may be employed when practicing the subject methods. In certain aspects, the polymerase combined into the reaction mixture is a reverse transcriptase (RT). Reverse transcriptases suitable for the invention do not need to have template-switch capability and can include, but are not limited to, retroviral reverse transcriptase, retrotransposon reverse transcriptase, retroplasmid reverse transcriptase, retron reverse transcriptases, bacterial reverse transcriptases, group II intron-derived reverse transcriptase, and mutants, variants, derivatives, or functional fragments thereof, e.g., RNase H minus or RNase H reduced enzymes (e.g., Maxima H Minus RT (ThermoFisher) or Superscript RT (ThermoFisher)) (Figure 9). For example, the reverse transcriptase may be a Moloney Murine Leukemia Virus reverse transcriptase (MMLV RT). In certain aspects, a mix of two or more different polymerases is added to the reaction mixture, e.g., for improved processivity, proof-reading, and/or the like. In some instances, the polymerase is one that is heterologous relative to the template, or source thereof. The polymerase is combined into the reaction mixture such that the final concentration of the polymerase is sufficient to produce a desired amount of the product nucleic acid. In certain aspects, the polymerase (e.g., Superscript IV) is present in the reaction mixture at a final concentration from 0.1 to 20 units/µl (U/µl), e.g., 2 U/µl.

As summarized above, the first strand reaction mixture further includes a first strand cDNA synthesis primer. In a preferred embodiment of the method, the first strand cDNA synthesis primer is covalently linked to a magnetic bead, and includes one, two or more domains. For example, the primer may include a first (e.g., 3’) domain that hybridizes to the template RNA and a second (e.g., 5’) domain that does not hybridize to the template RNA. The sequence of the first and second domain may be independently defined or arbitrary. In certain aspects, the first domain has a defined sequence (e.g., an oligo dT sequence or an RNA specific sequence) or an arbitrary sequence (e.g., a random sequence, such as a random hexamer sequence) and the sequence of the second domain is defined, e.g., a pre-tagmentation amplification primer binding domain or an amplification primer binding domain and may have any convenient sequence such as a sequencing primer binding domain.

In addition to the first and second domains described above, in which the second domain contains a pre-tagmentation amplification primer binding domain, the first strand cDNA synthesis primer may further include a first post-tagmentation amplification, e.g., PCR amplification, primer binding domain, which may have any convenient sequence such as a sequencing primer binding domain.

In certain aspects, the first strand cDNA synthesis primer includes a barcode domain for identification of the sample after pooling post reverse transcription. In certain aspects, the first strand cDNA synthesis primer may include a unique molecular identifier or other barcode to mark each RNA molecule converted to cDNA individually. In some instances, the sequence includes all or a component of a sequencing platform adapter construct. By “sequencing platform adapter construct” is meant a nucleic acid construct that includes at least a portion of a nucleic acid domain (e.g., a sequencing platform adapter nucleic acid sequence) utilized by a sequencing platform of interest, such as a sequencing platform provided by Illumina® (e.g., the HiSeq, MiSeq, NextSeq or NovaSeq); Pacific Biosciences (e.g., the PACBIO RS II sequencing system); or any other sequencing platform of interest. In certain aspects, a sequencing platform adapter construct includes one or more nucleic acid domains selected from: a domain (e.g., a “capture site” or “capture sequence”) that specifically binds to a surface-attached sequencing platform oligonucleotide (e.g., the P5 or P7 oligonucleotides attached to the surface of a flow cell in an Illumina® sequencing system); a sequencing primer binding domain (e.g., a domain to which the Read 1 or Read 2 primers of the Illumina® platform may bind); a barcode domain (e.g., a domain that uniquely identifies the sample source of the nucleic acid being sequenced to enable sample multiplexing by marking every molecule from a given sample with a specific barcode or “tag”); a barcode sequencing primer binding domain (a domain to which a primer used for sequencing a barcode binds); a molecular identification domain (e.g., a molecular index tag, such as a randomized tag of 4, 6, or other number of nucleotides) for uniquely marking molecules of interest to determine expression levels based on the number of instances a unique tag is sequenced; or any combination of such domains. In certain aspects, a barcode domain (e.g., sample index tag) and a molecular identification domain (e.g., a molecular index tag) may be included in the same nucleic acid.

A sequencing platform adapter domain, when present, may include one or more nucleic acid domains of any length and sequence suitable for the sequencing platform of interest. The nucleic acid domains may have a length and sequence that enables a polynucleotide (e.g., an oligonucleotide) employed by the sequencing platform of interest to specifically bind to the nucleic acid domain, e.g., for solid phase amplification and/or sequencing by synthesis of the cDNA insert flanked by the nucleic acid domains. Example nucleic acid domains include the P5 (5’-AATGATACGGCGACCACCGA-3’) (SEQ ID NO: 1), P7 (5’-CAAGCAGAAGACGGCATACGAGAT-3’) (SEQ ID NO: 2), Read 1 sequencing primer (5’-ACACTCTTTCCCTACACGACGCTCTTCCGATCT-3’) (SEQ ID NO: 3) and Read 2 sequencing primer (5’-GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT-3’) (SEQ ID NO: 4) domains employed on the Illumina®-based sequencing platforms. For example, the first strand cDNA synthesis primer may include, from 3’ to 5’, a first domain that hybridizes to the template RNA, e.g., an oligo dT domain, a barcode domain, a molecular identifier, a sequencing platform adapter domain, such as a sequencing read primer domain, and an amplification primer binding domain. In some aspects, the amplification primer binding domain will resemble a pre-tagmentation amplification primer binding domain and the first strand cDNA synthesis primer will also include a post-tagmentation amplification primer binding domain, which may be a unique domain or partially or completely overlap with another domain of the primer, such as the sequencing read primer domain, so long as that domain is compatible with respect to the amplification protocol being performed.
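The 3’ to 5’ domain order described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the primer design of the disclosure: the oligo dT length, the barcode sequence, the UMI length and the use of the P5 sequence as a stand-in amplification primer binding domain are all hypothetical choices, and only the P5 and Read 1 sequences are taken from the text (SEQ ID NO: 1 and 3). Each domain sequence is written 5’ to 3’, so listing the domains from 3’ to 5’ and joining them in reverse order yields the conventional 5’ to 3’ primer string.

```python
# Sequences taken from the text (written 5'->3').
P5 = "AATGATACGGCGACCACCGA"                  # SEQ ID NO: 1
READ1 = "ACACTCTTTCCCTACACGACGCTCTTCCGATCT"  # SEQ ID NO: 3


def build_primer(domains_3_to_5):
    """Join (name, sequence) domains listed 3'->5' into one 5'->3' string."""
    return "".join(seq for _name, seq in reversed(domains_3_to_5))


# Domain order follows the text (3' to 5'); the values marked hypothetical
# are illustrative placeholders only.
domains = [
    ("oligo_dT", "T" * 30),      # hypothetical length
    ("barcode", "ACGTAC"),       # hypothetical sample barcode
    ("umi", "N" * 8),            # hypothetical 8-nt randomized identifier
    ("read_primer", READ1),      # sequencing read primer domain
    ("amplification", P5),       # stand-in amplification primer binding domain
]

primer = build_primer(domains)
print(primer)
```

Keeping the domains as a named list rather than a single literal string makes it easy to swap one domain, for example a different barcode per sample, without touching the others.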

In some aspects, the first adapter oligonucleotide may be pre-adenylated at its 5’ end and include from 3’ to 5’, a first domain, e.g., an oligo dT domain, a barcode domain, a molecular identifier, a sequencing platform adapter domain, such as a sequencing read primer domain, an amplification primer binding domain, and a chain terminator at its 3’ end, e.g., a near-infrared fluorescent dye.

The nucleotide sequence of nucleic acid domains useful for sequencing on a sequencing platform of interest may vary and/or change over time. Adapter sequences are typically provided by the manufacturer of the sequencing platform (e.g., in technical documents provided with the sequencing system and/or available on the manufacturer’s website). Based on such information, the sequence of any sequencing platform adapter domains of the first strand cDNA synthesis primer, first or second adapter oligonucleotide, amplification primers, and/or the like, may be designed to include all or a portion of one or more nucleic acid domains in a configuration that enables sequencing the nucleic acid insert (corresponding to the template RNA) on the platform of interest.

The first strand cDNA synthesis primer and first adapter oligonucleotide may include one or more nucleotides (or analogs thereof) that are modified or otherwise non-naturally occurring. For example, the primer may include one or more nucleotide analogs (e.g., LNA, FANA, 2’-O-Me RNA, 2’-fluoro RNA, or the like), linkage modifications (e.g., phosphorothioates, 3’-3’ and 5’-5’ reversed linkages), 5’ and/or 3’ end modifications (e.g., 5’ and/or 3’ amino, biotin, DIG, phosphate, thiol, dyes, quenchers, etc.), one or more fluorescently labelled nucleotides, a near-infrared fluorescent dye (e.g., LiCOR IR800), or any other feature that provides a desired functionality to the primer that primes cDNA synthesis.

It is desirable to prevent any subsequent extension reactions which use the double stranded product nucleic acid as a template from extending beyond a particular position in the region of the double stranded product nucleic acid corresponding to the primer. For example, according to certain embodiments, the first strand cDNA synthesis primer includes a polymerase blocking modification that prevents a polymerase using the region corresponding to the primer as a template from polymerizing a nascent strand beyond the modification. Useful modifications include, but are not limited to, an abasic lesion (e.g., a tetrahydrofuran derivative), a nucleotide adduct, an iso-nucleotide base (e.g., isocytosine, isoguanine, and/or the like), or in a preferred embodiment the covalent linkage to a solid surface (e.g., paramagnetic beads). Blocking modifications may be included in any of the nucleic acid reagents used when practicing the methods of the present disclosure, including the first strand cDNA synthesis primer, the first and second adapter oligonucleotides, the first and second amplification, e.g., PCR, primers used for amplifying the first-strand cDNA to produce the product double stranded cDNA, the amplification primers used for PCR amplification of tagmentation products, and any combination thereof.

The use of first strand cDNA synthesis primers covalently linked to magnetic beads in step (d) simplifies all downstream procedures due to the ease of working on magnetic beads, which allows easy purification of cDNA and separation from RNA in step (e) using heat denaturation. Oligonucleotides linked to a bead surface are inert to harsh experimental conditions such as high concentrations of proteinase K and denaturing agents, thus enabling the capture and purification of target RNA molecules from a wide range of biological samples, eliminating the need for time-consuming RNA precipitations that are prone to sample loss, especially at low concentrations. Capture of complementary nucleic acid domains on oligo(dT) beads is highly efficient, occurs within minutes (Figure 6) and allows stringent washes to remove traces of proteinase K or other inhibiting agents prior to reverse transcription. Solid-phase cDNA also enables simple purification of the resultant cDNA molecule, which offers more flexibility in the optimization of reverse transcription conditions, as any components that are adverse for subsequent enzymatic reactions can be efficiently removed. Solid-phase cDNA can directly serve as the acceptor molecule in the second adapter ligation without any additional purification procedures, thus minimizing sample loss.

As set forth above, the subject methods of the present disclosure include combining a terminal transferase with the first strand cDNA molecule, where the terminal transferase (TT) is capable of catalyzing template-independent addition of deoxyribonucleotides or ribonucleotides to the 3’ hydroxyl terminus of the cDNA molecule. Terminal transferases are highly processive in the presence of deoxynucleotide triphosphates (dNTPs), but self-terminate after incorporating only a few nucleotide triphosphates (NTPs). The terminal transferase may be capable of incorporating 1 or more additional deoxyribonucleotides at the 3’ end of the nascent DNA strand, in a time-dependent fashion (FIG. 7). In certain aspects, a terminal transferase incorporates 10 or fewer (e.g., 1-3) additional ribonucleotides at the 3’ end of the first strand cDNA. All of the nucleotides may be the same (e.g., creating a homonucleotide stretch at the 3’ end of the first strand cDNA) or at least one of the nucleotides may be different from the other(s). In certain aspects, the terminal transferase activity results in the addition of a homoribonucleotide stretch of 1, 2, 3, 4, or more of the same ribonucleotides (e.g., all ATP, all GTP, all CTP, all UTP). These additional ribonucleotides are useful for increasing the efficiency of the subsequent ligation reaction, by effectively mimicking the 3’ end of an RNA molecule rather than a DNA molecule, which increases the affinity of T4 RNA ligase to join the second adapter molecule to the 3’ end of the first strand cDNA molecule that was extended by a short stretch of non-template ribonucleotides (Bullard and Bowater, 2006; Miura et al., 2019). The terminal transferase is combined into the reaction mixture such that the final concentration of the terminal transferase is sufficient to produce a desired amount of the product nucleic acid. In certain aspects, the terminal transferase (e.g., Terminal Deoxynucleotidyl Transferase) is present in the reaction mixture at a final concentration from 0.1 to 20 units/ul (U/ul), e.g., 0.35 U/ul.

The methods of the present disclosure include combining an RNA ligase (e.g., T4 RNA ligase I) with the first strand cDNA molecule, where the RNA ligase is capable of catalysing the ligation of a 5’ phosphoryl-terminated nucleic acid donor to a 3’ hydroxyl-terminated nucleic acid acceptor through the formation of a 3’ - 5’ phosphodiester bond with the hydrolysis of ATP to AMP and PPi. Substrates for RNA ligases include single-stranded RNA and DNA, with high substrate affinity for RNA, but lower substrate specificity toward DNA. The inventors improved sensitivity and efficiency of the ssDNA ligation reaction joining the ssDNA second adapter oligonucleotide with the first-strand cDNA through the prior addition of a short stretch (e.g., 1-3 nts) of non-template ribonucleotides as described elsewhere (FIG. 8). The RNA ligase is combined into the reaction mixture such that the final concentration of the RNA ligase is sufficient to produce a desired amount of the product nucleic acid. In certain aspects, the RNA ligase (e.g., T4 RNA ligase I) is present in the reaction mixture at a final concentration from 0.1 to 200 units/ul (U/ul), e.g., 2.25 U/ul.

As set forth above, the subject methods include combining a second adapter oligonucleotide into the tailing and ligation reaction mixture. By “second adapter oligonucleotide” is meant an oligonucleotide which can serve as a donor during the ligation reaction. In this regard, the first strand cDNA molecule, after addition of non-template ribonucleotides, may be referred to as an “acceptor molecule” and the second adapter oligonucleotide may be referred to as a “donor molecule”. As used herein, an “oligonucleotide” can refer to a single-stranded multimer of nucleotides from 2 to 500 nts, e.g., 2 to 200 nts. Oligonucleotides may be synthetic or may be made enzymatically, and, in some embodiments are 10 to 50 nts in length.

The reaction mixture includes the second adapter oligonucleotides at a concentration sufficient to generate the desired ligation product. For example, the second adapter oligonucleotide may be added to the reaction mixture at a final concentration from 0.001 to 100 uM, including 1 uM. The second adapter oligonucleotide may include one or more nucleotides (or analogs thereof) that are modified or otherwise non-naturally occurring. For example, the second adapter oligonucleotide may include one or more nucleotide analogs (e.g., LNA, FANA, 3’-O-Me RNA, 2’-fluoro RNA, or the like), linkage modifications (e.g., phosphorothioates, 3’-3’ and 5’-5’ reversed linkages), 5’ and/or 3’ end modifications (e.g., 5’ and/or 3’ amino, biotin, DIG, phosphate, thiol, dyes, quenchers, etc.), one or more fluorescently labelled nts, or any other feature that provides a desired functionality to the second adapter oligonucleotide. Any desired nucleotide analogs, linkage modifications and/or end modifications may be included in any of the nucleic acid reagents used when practicing the methods of the present disclosure, including the first strand cDNA synthesis primer, the first and second adapter oligonucleotide, the primers used for amplification, e.g., PCR amplifying the first-strand cDNA to produce the product double stranded cDNA, and the post-tagmentation primers used for amplification.

The second adapter oligonucleotide includes a pre-tagmentation primer binding or amplification primer binding domain (which may also be referred to as a second strand synthesis amplification primer binding domain). For example, the second adapter oligonucleotide may include a sequence, where subsequent to ligation, second strand synthesis is performed using a primer that is complementary to that sequence. The second strand synthesis produces a second strand DNA complementary to the first strand cDNA. Alternatively, or additionally, the product nucleic acid may be amplified using a primer pair in which one of the primers has a complementary sequence. According to certain embodiments, the second adapter oligonucleotide includes a first post-tagmentation (e.g., PCR) primer binding domain, e.g., for use in amplification of a tagmented product.

According to some embodiments, the second adapter oligonucleotide includes a number of additional components or domains, such as but not limited to: barcode domains, unique molecular identifier domains, a first post-tagmentation amplification primer binding domain (e.g., in those embodiments where such a domain is not present on the first strand cDNA synthesis primer), a sequencing platform adapter construct domain, etc., where these domains may be as described above.

As described above, the subject methods include combining NTPs into the tailing and ligation reaction mixture. In the preferred embodiment, a single NTP (e.g., ATP) is added to the reaction mixture and both serves as substrate during the tailing reaction catalyzed by terminal transferase and is hydrolyzed during the ligation reaction catalyzed by T4 RNA ligase 1. For example, ATP may be added to the reaction mixture such that the final concentration is from 0.01 to 100 mM, e.g., 1 mM.
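The final concentrations given for the tailing and ligation reaction mixture translate into pipetting volumes through the usual dilution relation, stock concentration times stock volume equals final concentration times total volume. The following Python sketch illustrates that arithmetic only and is not part of the disclosed method: the 20 ul reaction volume and every stock concentration are hypothetical values chosen for illustration, while the final concentrations (0.35 U/ul terminal transferase, 2.25 U/ul T4 RNA ligase 1, 1 uM second adapter oligonucleotide, 1 mM ATP) are the example values stated in the text.

```python
def stock_volume_ul(final_conc, stock_conc, reaction_volume_ul):
    """Volume of stock (ul) needed to reach final_conc in the reaction.

    final_conc and stock_conc must be expressed in the same unit
    (e.g., both in U/ul, both in uM, or both in mM).
    """
    if stock_conc <= 0 or final_conc < 0 or reaction_volume_ul <= 0:
        raise ValueError("concentrations and volume must be positive")
    if final_conc > stock_conc:
        raise ValueError("final concentration exceeds stock concentration")
    return final_conc * reaction_volume_ul / stock_conc


REACTION_VOLUME_UL = 20.0  # hypothetical reaction volume

# name: (final concentration from the text, hypothetical stock concentration)
COMPONENTS = {
    "terminal transferase (U/ul)": (0.35, 20.0),
    "T4 RNA ligase 1 (U/ul)": (2.25, 30.0),
    "second adapter oligonucleotide (uM)": (1.0, 10.0),
    "ATP (mM)": (1.0, 10.0),
}

volumes = {
    name: stock_volume_ul(final, stock, REACTION_VOLUME_UL)
    for name, (final, stock) in COMPONENTS.items()
}

for name, volume in volumes.items():
    print(f"{name}: {volume:.2f} ul of stock")
```

The same function applies to the reverse transcription mixture, provided the stock and final concentrations are given in matching units.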

Any nucleic acids that find use in practicing the methods of the present disclosure (e.g., the first strand cDNA synthesis primer, the first and second adapter oligonucleotide, a second strand synthesis primer, one or more primers for amplifying the double stranded product nucleic acid, and/or the like) may include any useful nucleotide analogue and/or modification, including any of the nucleotide analogues and/or modifications described herein.

Once the ligated product nucleic acid, e.g., that includes first strand cDNA linked to the second adapter oligonucleotide, is produced, the methods include using the product nucleic acid as a template for second-strand synthesis and/or amplification (e.g., for subsequent sequencing of the amplicons). According to one embodiment, the methods include contacting the product nucleic acid with primers that hybridize to primer binding domains present on both ends of the cDNA, under amplification conditions, such as PCR amplification conditions, sufficient to produce a product double stranded cDNA. Amplification conditions that may be employed include the addition of one or more primers (e.g., as described above) and dNTPs. The conditions may include combining a thermostable polymerase (e.g., a Taq, Pfu, Tfl, Tth, Tli, and/or other thermostable polymerase) into the reaction mixture. Amplification, e.g., PCR amplification, results in the production of a product double stranded cDNA.

A method of producing a product double stranded cDNA according to one embodiment of the present disclosure is schematically illustrated in FIG. 1. As illustrated in FIG. 1, an RNA sample that includes an mRNA is combined with a first strand cDNA synthesis primer (in this example, an oligo(dT) primer covalently linked to a magnetic bead), a reverse transcriptase (not shown) and dNTPs (not shown) to produce a product first strand cDNA. The resultant cDNA:RNA hybrids are then separated via heat denaturation and solid phase first strand cDNA is retained through immobilisation on a magnet. The first strand cDNA is combined with a second adapter oligonucleotide, an NTP (in this example ATP (not shown)), a terminal transferase (not shown) and an RNA ligase (not shown). Tailing occurs at the 3’ end of the first strand cDNA molecule through the addition of non-templated nucleotides (indicated by (rA)s), which increases the sensitivity of the subsequent ligation reaction, which joins the 5’ end of the second adapter oligonucleotide to the 3’ end of the first strand cDNA. In this example, the 5’ end of the mRNA is captured, allowing for downstream amplification and enrichment of full-length cDNA, e.g., by LD PCR (Long Distance PCR). The components are included in a reaction mixture under conditions sufficient to produce a ligated nucleic acid product. Product double stranded cDNA is produced by contacting the ligated single stranded nucleic acid with amplification primers complementary to PCR primer binding domains present in the first strand cDNA synthesis primer and second adapter oligonucleotide.

Following production of the product double stranded cDNA, in one embodiment, product double stranded cDNA is prepared for full-length sequencing on a long-read sequencing platform of interest.

In another embodiment schematically illustrated in FIG. 2, product double stranded cDNA is tagmented with one or more transposomes including a transposase and a transposon nucleic acid, where the transposon nucleic acid includes a transposon end domain for binding to the transposon protein and a second post-tagmentation amplification primer binding domain (e.g., a post-tagmentation PCR amplification primer binding domain), to produce a tagmented sample. In certain aspects, the second post-tagmentation amplification primer binding domain comprises a sequencing read primer domain, e.g., a sequencing read primer domain that is different from any sequencing read primer domain present in the first strand cDNA synthesis primer, or second adapter oligonucleotide. The resultant tagmented sample is then subjected to amplification conditions, e.g., PCR amplification conditions, using post-tagmentation first and second amplification, e.g., PCR, primers. These post-tagmentation first and second amplification primers may vary, and in some instances include sequencing platform adapter domains, e.g., a first primer including a first post-tagmentation amplification primer domain, a first sequencing indexing domain and a first sequencing adapter domain; and a second primer including a second post-tagmentation amplification primer domain, a second sequencing indexing domain and a second sequencing adapter domain, to produce a sequencing library. The sequencing platform adapter construct(s) may include any of the nucleic acid domains described elsewhere herein (e.g., a domain that specifically binds to a surface-attached sequencing platform oligonucleotide, a sequencing primer binding domain, a barcode domain, a barcode sequencing primer binding domain, a molecular identification domain, or any combination thereof). 
Such embodiments find use, e.g., where nucleic acids of the tagmented sample do not include all of the adapter domains useful or necessary for sequencing in a sequencing platform of interest, and the remaining adapter domains are provided by the primers used for the amplification of the nucleic acids of the tagmented sample.

According to certain embodiments, the methods of preparing sequencing libraries are end-capture methods for quantifying RNA (e.g., mRNA transcripts), e.g., for differential expression analysis as schematically illustrated in FIG. 2 b-c. In certain aspects, the end-capture methods capture the 3’ ends of RNAs, e.g., where end-capture is facilitated by the presence of a first post-tagmentation amplification primer binding site in the first strand cDNA synthesis primer and a second post-tagmentation PCR primer binding site introduced by tagmentation. It will be understood that numerous variations to the above example of end-capture methods are possible. Instead of capturing 3’ ends of RNAs, for example, the method may be used to capture 5’ ends of RNAs, e.g., where end-capture is facilitated by the presence of a first post-tagmentation amplification primer binding site in the second adapter oligonucleotide and a 3’ second post-tagmentation PCR primer binding site introduced by tagmentation. Capturing the 5’ ends of RNAs finds use, e.g., for 5’ end mutation or splice variant analysis, etc. 5’ end capture may be carried out, e.g., by including a post-tagmentation primer binding domain (e.g., an RP2 sequence) in the second adapter oligonucleotide, rather than in the first strand cDNA synthesis primer. According to this variation, post-tagmentation amplification may be carried out using a post-tagmentation amplification primer that binds to the first post-tagmentation primer binding domain originally present in the second adapter oligonucleotides, in conjunction with a post-tagmentation amplification primer that binds to a post-tagmentation primer binding domain, e.g., a TnRP1 or TnRP2 sequence, added during a tagmentation step. Other variations include, e.g., replacing Illumina® specific sequencing domains in the various primers/oligonucleotides with sequencing domains required by sequencing systems from, e.g., Pacific Biosciences (e.g., the PACBIO RS II sequencing system), or any other sequencing platform of interest.

In some instances, following production of first strand cDNA and prior to tailing and ligation, the method includes pooling the plurality of solid-phase first strand cDNA with one or more additional first strand cDNAs (e.g., obtained from a different starting RNA source, e.g., cell) to produce a pooled cDNA sample. For example, the combining and contacting steps described above may be performed in parallel for different starting RNA sources, which in some cases can be single cells (e.g., circulating tumour cells or any other single cell of interest). The single cells may be obtained from the same individual or different individuals. According to certain embodiments, the different starting RNA sources are RNA samples obtained from different individuals, e.g., different human patients or other human individuals from whom it is desirable to obtain nucleic acid (e.g., RNA or DNA) sequence information. In certain aspects, the first strand cDNAs are tagged during their production with a unique source identifier (e.g., a cell barcode) corresponding to the starting RNA sample from which the plurality of solid-phase first strand cDNA were generated. The resultant first strand cDNAs produced in parallel may then be pooled prior to tailing and ligation. Such a pooling step may include combining each first strand cDNA sample (or aliquot thereof) to be pooled into a single container (e.g., a single tube or other container, e.g., well, microfluidic chamber, droplet, nanowell, etc.). The pooled solid-phase first strand cDNA sample is then tailed and ligated, e.g., as described above. Upon sequencing the pooled sample, individual sequencing reads can be traced back to particular starting RNA samples using the source identifier, e.g., cell barcode, enabling multiplexed sequencing. Details regarding barcode-based multiplexed sequencing are described, e.g., in Wong et al. (2013) Curr. Protoc. Mol. Biol. Chapter 7:Unit 7.11.
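Tracing pooled reads back to their starting RNA samples can be sketched as a simple lookup on the barcode portion of each read. The Python example below is a minimal illustration under stated assumptions, not the analysis pipeline of the disclosure: it assumes a hypothetical read layout in which the barcode occupies the first bases of the read, exact barcode matching with no error correction, and made-up barcode and sample names.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample, barcode_len):
    """Group read inserts by the sample whose barcode starts the read.

    Reads whose leading bases match no known barcode go to "unassigned".
    """
    by_sample = defaultdict(list)
    for read in reads:
        barcode, insert = read[:barcode_len], read[barcode_len:]
        sample = barcode_to_sample.get(barcode, "unassigned")
        by_sample[sample].append(insert)
    return dict(by_sample)


# Hypothetical 6-nt barcodes and sample names, for illustration only.
barcodes = {"ACGTAC": "cell_1", "TGCATG": "cell_2"}
pooled_reads = [
    "ACGTACGGATTC",
    "TGCATGCCAAGT",
    "ACGTACTTTGGA",
    "GGGGGGAAAAAA",  # no matching barcode
]

assigned = demultiplex(pooled_reads, barcodes, barcode_len=6)
for sample, inserts in assigned.items():
    print(sample, inserts)
```

In practice, demultiplexing tools usually tolerate one or more mismatches in the barcode; exact matching is used here only to keep the sketch short.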

In some aspects of the invention, the methods include the step of obtaining single cells. Obtaining single cells may be done according to any convenient protocol. A single cell suspension can be obtained using standard methods known in the art including, for example, enzymatically using trypsin or papain to digest proteins connecting cells in tissue samples or releasing adherent cells in culture, or mechanically separating cells in a sample. Single cells can be placed in any suitable reaction vessel in which single cells can be treated individually, for example, a 96-well plate, a 384-well plate, or a plate with any number of wells. The multiwell plate can be part of a chip and/or device. The present disclosure is not limited by the number of wells in the multiwell plate.

Following obtainment of single cells, e.g., as described above, mRNA can be released from the cells by lysing the cells. Lysis can be achieved by, for example, heating or freeze-thaw of the cells, or by the use of detergents or other chemical methods, or by a combination of these. However, any suitable lysis method can be used. A mild lysis procedure can advantageously be used to prevent the release of nuclear chromatin, thereby avoiding genomic contamination of the cDNA library, and to minimize degradation of mRNA. For example, heating the cells at 72°C for 2 minutes in the presence of Tween-20 is sufficient to lyse the cells while resulting in no detectable genomic contamination from nuclear chromatin. Alternatively, cells can be heated at 65°C for 10 minutes in water, or at 70°C for 90 seconds in PCR buffer II (Applied Biosystems) supplemented with 0.5% NP-40 (Kurimoto et al., Nucleic Acids Res 34(5):e42 (2006)); or lysis can be achieved with a protease such as Proteinase K or by the use of chaotropic salts such as guanidine isothiocyanate (U.S. Publication No. 2007/0281313).

Synthesis of solid-phase first strand cDNA from template nucleic acid mRNA in the methods described herein can be performed directly on cell lysates, such that a reaction mix for reverse transcription is added directly to cell lysates. Alternatively, mRNA can be purified after its release from cells. This can help to reduce mitochondrial and ribosomal contamination. mRNA purification can be achieved by any method known in the art, for example, by binding the mRNA to a solid phase. Commonly used purification methods include paramagnetic beads (e.g., Dynabeads). Alternatively, specific contaminants, such as ribosomal RNA can be selectively removed using affinity purification.

Where desired, a given single cell workflow may include a pooling step where a cDNA product composition, e.g., made up of synthesized first strand cDNAs or synthesized double stranded cDNAs, is combined or pooled with the cDNA product compositions obtained from one or more additional cells. The number of different cDNA product compositions produced from different cells that are combined or pooled in such embodiments may vary. Prior to, or after pooling, the product cDNA composition(s) can be amplified, e.g., by polymerase chain reaction (PCR), such as described above.

As indicated above, in protocols that include a pooling step, the pooling step can be performed after first adapter oligonucleotide ligation or after first strand cDNA synthesis. As such, in certain embodiments of the methods described herein, RNA precursors are obtained from different samples of interest and a first ligation reaction mixture is added to the RNA precursors, resulting in ligated product RNA templates that include a sample barcode as described above. The barcoded RNA templates are then pooled and purified as desired and subjected to reverse transcription followed by tailing and ligation and finally amplification to produce sequencing libraries. In another embodiment, RNA templates are obtained from different cells or samples of interest and reverse transcription reaction mix is added, resulting in first strand cDNA including a cell or sample barcode. The tagged cDNA samples are then pooled and amplified to produce sequencing libraries. The sequencing libraries produced according to the methods of the present disclosure may exhibit a desired complexity (e.g., high complexity). The “complexity” of a sequencing library relates to the proportion of redundant sequencing reads (e.g., sharing identical start sites) obtained upon sequencing the library. Complexity is inversely related to the proportion of redundant sequencing reads. In a low complexity library, certain target sequences are over-represented, while other targets suffer from little or no coverage. In a high complexity library, the sequencing reads more closely track the known distribution of target nucleic acids in the starting nucleic acid sample, and will include coverage, e.g., for targets known to be present at relatively low levels in the starting sample. The complexity of a library may be determined by mapping the sequencing reads to a reference genome or transcriptome. 
In combination with the incorporation of unique molecular identifiers, the number of PCR duplicates can be determined according to sequencing reads that have the same genomic starting position and identical unique molecular identifiers. High complexity libraries retain a larger number of sequencing reads after the removal of such PCR duplicates, which is increased using TLC over other methodologies (FIG. 9).
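The duplicate-removal criterion described above, identical genomic starting position together with an identical unique molecular identifier, can be sketched as follows. This Python example is a minimal illustration and not the analysis used to produce FIG. 9: the read records, field names and values are hypothetical, and real pipelines typically also consider strand and tolerate sequencing errors in the identifier.

```python
def deduplicate(reads):
    """Keep the first read for each (reference, start, UMI) combination."""
    seen = set()
    unique = []
    for read in reads:
        key = (read["ref"], read["start"], read["umi"])
        if key not in seen:
            seen.add(key)
            unique.append(read)
    return unique


def retained_fraction(reads):
    """Fraction of reads kept after duplicate removal (0.0 if no reads)."""
    if not reads:
        return 0.0
    return len(deduplicate(reads)) / len(reads)


# Hypothetical mapped reads: reference name, start position, UMI sequence.
mapped_reads = [
    {"ref": "chr1", "start": 100, "umi": "AAAA"},
    {"ref": "chr1", "start": 100, "umi": "AAAA"},  # PCR duplicate
    {"ref": "chr1", "start": 100, "umi": "CCCC"},  # same start, different UMI
    {"ref": "chr2", "start": 50, "umi": "AAAA"},
    {"ref": "chr2", "start": 50, "umi": "AAAA"},   # PCR duplicate
]

print(len(deduplicate(mapped_reads)), retained_fraction(mapped_reads))
```

A higher retained fraction at a given sequencing depth corresponds to a higher complexity library in the sense used above.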

In certain aspects, the methods of the present disclosure further include subjecting the sequencing library to a sequencing protocol. The protocol may be carried out on any suitable sequencing platform. Sequencing platforms of interest include, but are not limited to, a sequencing platform provided by Illumina® (e.g., the HiSeq, MiSeq, NextSeq, NovaSeq sequencing systems); Pacific Biosciences (e.g., the PACBIO RS II Sequel sequencing system); or any other sequencing platform of interest. The sequencing protocol will vary depending on the particular sequencing system employed. Detailed protocols for sequencing a library, e.g., which may include further amplification (e.g., solid phase amplification), sequencing the amplicon, and analyzing the sequencing data, are available from the manufacturer of the sequencing system employed.

In certain embodiments, the subject methods may be used to generate sequencing libraries corresponding to mRNAs for downstream sequencing on a sequencing platform of interest. According to certain embodiments, the subject methods may be used to generate a sequencing library corresponding to non-polyadenylated RNAs for downstream sequencing on a sequencing platform of interest. For example, microRNAs may be polyadenylated and then used as templates for reverse transcription followed by tailing and ligation of cDNA described elsewhere herein. Random or gene-specific priming may also be used, depending on the goal of the researcher. The library may be mixed 50:50 with a control library (e.g., Illumina’s PhiX control library) and sequenced on the sequencing platform (e.g., an Illumina® sequencing system). The control library sequences may be removed and the remaining sequences mapped to the transcriptome of the source of the mRNAs (e.g., human, mouse, or any other mRNA source).

Aspects of the invention described herein can be performed using any type of computing device, such as a computer, that includes a processor, e.g., a central processing unit, or any combination of computing devices where each device performs at least part of the process or method. In some embodiments, systems and methods described herein may be performed with a handheld device, e.g., a smart tablet, or a smart phone, or a specialty device produced for the system.

Methods of the invention can be performed using software, hardware, firmware, hardwiring, or combinations of any of these. Features implementing functions can also be physically located at various positions, including being distributed such that portions of functions are implemented at different physical locations (e.g., imaging apparatus in one room and host workstation in another, or in separate buildings, for example, with wireless or wired connections).

Processors suitable for the execution of computer programs include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random-access memory, or both. The essential elements of a computer are a processor for executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. Information carriers suitable for embodying computer program instructions and data include all forms of non-volatile memory, including, by way of example, semiconductor memory devices (e.g., EPROM, EEPROM, solid state drive (SSD), and flash memory devices); magnetic disks (e.g., internal hard disks or removable disks); magneto-optical disks; and optical disks (e.g., CD and DVD disks). The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.

To provide for interaction with a user, the subject matter described herein can be implemented on a computer having an I/O device, e.g., a CRT, LCD, LED, or projection device for displaying information to the user and an input or output device such as a keyboard and a pointing device (e.g., a mouse or a trackball), by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well. For example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user can be received in any form, including acoustic, speech, or tactile input.

The subject matter described herein can be implemented in a computing system that includes a back-end component (e.g., a data server), a middleware component (e.g., an application server), or a front-end component (e.g., a client computer having a graphical user interface or a web browser through which a user can interact with an implementation of the subject matter described herein), or any combination of such back-end, middleware, and front-end components. The components of the system can be interconnected through a network by any form or medium of digital data communication, e.g., a communication network. For example, a reference set of data may be stored at a remote location and a computer can communicate across a network to access the reference data set for comparison purposes. In other embodiments, however, a reference data set can be stored locally within the computer, and the computer accesses the reference data set within the CPU for comparison purposes. Examples of communication networks include, but are not limited to, cell networks (e.g., 3G, 4G or 5G), a local area network (LAN), and a wide area network (WAN), e.g., the Internet.

The subject matter described herein can be implemented as one or more computer program products, such as one or more computer programs tangibly embodied in an information carrier (e.g., in a non-transitory computer-readable medium) for execution by, or to control the operation of, a data processing apparatus (e.g., a programmable processor, a computer, or multiple computers). A computer program (also known as a program, software, software application, app, macro, or code) can be written in any form of programming language, including compiled or interpreted languages (e.g., C, C++, Perl), and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. Systems and methods of the invention can include instructions written in any suitable programming language known in the art, including, without limitation, C, C++, Perl, Java, ActiveX, HTML5, Visual Basic, or JavaScript.

A computer program does not necessarily correspond to a file. A program can be stored in a file or a portion of a file that holds other programs or data, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network. A file can be a digital file, for example, stored on a hard drive, SSD, CD, or other tangible, non-transitory medium. A file can be sent from one device to another over a network (e.g., as packets being sent from a server to a client, for example, through a Network Interface Card, modem, wireless card, or similar).

Suitable computing devices typically include mass memory, at least one graphical user interface, at least one display device, and typically include communication between devices. The mass memory illustrates a type of computer-readable media, namely computer storage media. Computer storage media may include volatile, non-volatile, removable, and nonremovable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules, or other data. Examples of computer storage media include RAM, ROM, EEPROM, flash memory, or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, Radiofrequency Identification (RFID) tags or chips, or any other medium that can be used to store the desired information, and which can be accessed by a computing device. Functions described herein can be implemented using software, hardware, firmware, hardwiring, or combinations of any of these. Any of the software can be physically located at various positions, including being distributed such that portions of the functions are implemented at different physical locations.

Also provided by the present disclosure are compositions. Compositions of embodiments of the invention may include, e.g., one or more of any of the reaction mixture components described above with respect to the subject methods. For example, the compositions may include one or more of an RNA (e.g., a control RNA), a first adapter oligonucleotide in some instances, a polymerase (e.g., a reverse transcriptase, or the like), a first-strand cDNA synthesis primer having any of the domains described above, a second adapter oligonucleotide having any of the domains described above, dNTPs, NTPs, a terminal transferase, an RNA ligase, a second strand cDNA primer having any of the domains described above, amplification primers having any of the domains described above, a salt, a metal cofactor, one or more nuclease inhibitors (e.g., an RNase inhibitor), one or more enzyme-stabilizing components (e.g., DTT), or any other desired reaction mixture component(s). In certain aspects, the subject compositions include a first adapter oligonucleotide in addition to the components listed above.

The subject compositions may be present in any suitable environment. According to one embodiment, the composition is present in a reaction tube (e.g., a 0.2 mL tube, a 0.6 mL tube, a 1.5 mL tube or the like), or a well, or microfluidic chamber, or droplet, or other suitable container.

In certain aspects, the composition is present in two or more (e.g., a plurality of) reaction tubes or wells (e.g., a plate, such as a 96-well plate, a multi-well plate, e.g., containing about 1000, 5000, or more wells). The tubes and/or plates may be made of any suitable material, e.g., polypropylene, PDMS, aluminum, or the like. The containers may also be treated to reduce adsorption of nucleic acids to the walls of the container. In certain aspects, the tubes and/or plates in which the composition is present provide for efficient heat transfer to the composition (e.g., when placed in a heat block, water bath, thermocouples, and/or the like), so that the temperature of the composition may be altered within a short period of time, e.g., as necessary for a particular enzymatic reaction to occur. According to certain embodiments, the composition is present in a thin-walled polypropylene tube, or a plate having thin-walled polypropylene wells or materials such as aluminum having high heat conductance. In some instances, the compositions of the disclosure may be present in droplets. In certain embodiments it may be convenient for the reaction to take place on a solid surface or a bead; in such a case, the first strand cDNA synthesis primer may be attached to the solid support or bead by methods known in the art, such as biotin linkage or covalent linkage, and the reaction allowed to proceed on the support. Alternatively, the oligos may be synthesized directly on the solid support (e.g., as described in Macosko, E Z et al., Cell 161, 1202-1214, May 21, 2015).

Other suitable environments for the subject compositions include, e.g., a microfluidic chip (e.g., a “lab-on-a-chip device”, e.g., a microfluidic device comprising channels and inlets). The composition may be present in an instrument configured to bring the composition to a desired temperature, e.g., a temperature-controlled water bath, heat block, heat block adaptor, or the like. The instrument configured to bring the composition to a desired temperature may be configured to bring the composition to a series of different desired temperatures, each for a suitable period of time (e.g., the instrument may be a thermocycler).

Aspects of the present disclosure also include kits. The kits may include, e.g., one or more of any of the reaction mixture components described above with respect to the subject methods. For example, the kits may include: a first strand cDNA synthesis primer including a 3’ oligo(dT) domain and a 5’ amplification primer binding domain; a second adapter oligonucleotide including an amplification primer binding domain, e.g., as described above. In other instances, the kits may include a first adapter oligonucleotide including a 5’ amplification primer binding domain and a 3’ polyA domain, a first strand cDNA synthesis primer including an oligo dT domain, and a second adapter oligonucleotide including an amplification primer binding domain, as described above.

The kits may further include amplification primers which may include any of the domains/features described above in the section relating to the methods of the present disclosure.

The kits may further include one or more of a template ribonucleic acid (RNA), components for producing a template RNA from a precursor RNA (e.g., a poly(A) polymerase and associated reagents for polyadenylating a non-polyadenylated precursor RNA), components for purifying RNA-protein complexes of interest, a polymerase (e.g., a reverse transcriptase), a terminal transferase, an RNA ligase (e.g., T4 RNA ligase I), dNTPs, NTPs, a salt, a metal cofactor, one or more nuclease inhibitors (e.g., an RNase inhibitor and/or a DNase inhibitor), one or more molecular crowding agents (e.g., polyethylene glycol, or the like), one or more enzyme-stabilizing components (e.g., DTT), or any other desired kit component(s), such as solid supports, e.g., tubes, beads, microfluidic chips, etc.

In certain embodiments, the kits may include reagents for isolating RNA from a source of RNA. The reagents may be suitable for isolating nucleic acid samples from a variety of RNA sources including single cells, cultured cells, tissues, organs, or organisms. The subject kits may include reagents for isolating a nucleic acid sample from a fixed cell, tissue or organ, e.g., formalin-fixed, paraffin-embedded (FFPE) tissue. Such kits may include one or more deparaffinization agents, one or more agents suitable to de-cross link nucleic acids, and/or the like.

Components of the kits may be present in separate containers, or multiple components may be present in a single container. In certain embodiments, it may be convenient to provide the components in a lyophilized form, so that they are ready to use and can be stored conveniently at room temperature.

In addition to the above-mentioned components, a subject kit may further include instructions for using the components of the kit, e.g., to practice the subject method. The instructions are generally recorded on a suitable recording medium. For example, the instructions may be printed on a substrate, such as paper or plastic, etc. As such, the instructions may be present in the kits as a package insert, in the labelling of the container of the kit or components thereof (i.e., associated with the packaging or sub-packaging) etc. In other embodiments, the instructions are present as an electronic storage data file present on a suitable computer readable storage medium, e.g., CD-ROM, diskette, Hard Disk Drive (HDD), portable flash drive, etc. In yet other embodiments, the actual instructions are not present in the kit, but means for obtaining the instructions from a remote source, e.g., via the internet, are provided. An example of this embodiment is a kit that includes a web address where the instructions can be viewed and/or from which the instructions can be downloaded. As with the instructions, this means for obtaining the instructions is recorded on a suitable substrate.

The methods of the present disclosure find use in a variety of applications, including those that require the presence of particular nucleotide sequences at both ends of nucleic acids of interest. Such applications exist in the areas of basic research and diagnostics (e.g., clinical diagnostics) and include, but are not limited to, the generation of sequencing libraries. Such libraries may include adapter sequences that enable sequencing of the library members using any convenient sequencing platform, including: the HiSeq, MiSeq, NextSeq and NovaSeq sequencing systems from Illumina®, the PACBIO RS II Sequel sequencing system from Pacific Biosciences, the MinION, GridION or PromethION from Oxford Nanopore Technologies, or any other convenient sequencing platform. The methods of the present disclosure find use in generating sequencing libraries corresponding to any RNA starting material of interest (e.g., mRNA) and are not limited to polyadenylated RNAs. For example, the subject methods may be used to generate sequencing libraries from non-polyadenylated RNAs, including microRNAs, small RNAs, siRNAs, and/or any other type of non-polyadenylated RNAs of interest such as ribosome-associated mRNAs or RNA fragments associated with an RNA-binding protein of interest that were purified with appropriate methods (e.g., CLIP). The methods also find use in generating strand-specific information, which can be helpful in determining allele-specific expression or in distinguishing overlapping transcripts in the genome.

An aspect of the subject methods is that - utilizing a template RNA - a cDNA species having sequencing platform adapter sequences at one or both of its ends is generated, by employing tailing and ligation of first strand cDNAs (TLC) that improves on traditional approaches for generating chimeric nucleic acid molecules and provides an alternative strategy to generate full-length cDNA, preserving the original 5’ end of the template RNA molecule.

Prior art that documents the generation of sequencing libraries from RNA frequently relies on the use of Template Switch Oligos (TSOs) to introduce sequencing platform adapter sequences to the 3’ end of first-strand cDNA molecules, e.g., WO 2021/208036, WO 2020/136438, and WO 2017/048993. These methods rely on the addition of non-templated nucleotides to the 3’ end of the first-strand cDNA (e.g., typically CCC) during reverse transcription, which allow hybridization of the TSO that contains a short complementary sequence (e.g., typically GGG) followed by a non-complementary sequence (e.g., typically sequencing platform adapter sequences) that is added to the first-strand cDNA molecule through extension by the reverse transcriptase.

The methods of the present disclosure (TLC) also rely on the incorporation of a short stretch of non-template nucleotides to the 3’ end of first-strand cDNA molecules, but differ in a number of important aspects: i) TLC incorporates ribonucleotides instead of deoxyribonucleotides to mimic the 3’ end of an RNA molecule for subsequent ligation reaction; ii) the non-template overhang is used as a ligation acceptor instead of anchoring sites for the TSO and greatly increases ligation efficiency; iii) TLC uncouples the terminal transferase reaction from reverse transcription, giving higher flexibility in RT conditions such as higher reaction temperatures that are beneficial for long and/or structured molecules; iv) TLC is not dependent on the presence of a 5’ cap structure of the template RNA as observed for TSOs (Wulf, M G et al, J Biol Chem, 294, 18220-18231, 2019), making it less restrictive in terms of the RNA molecules that can serve as template RNA. Uncoupling of the tailing reaction from the reverse transcription reaction as described in point (iii) and 5’-cap independence in point (iv) are crucial for the applicability of TLC to generate sequencing libraries from a more varied source of input materials. This includes but is not limited to, uncapped RNA molecules, such as specific small RNA species, viral RNA, or RNA fragments obtained after purification of modified RNA or RNA-protein complexes following cross-linking and immunoprecipitation (e.g., CLIP), in which case reverse transcription frequently terminates prematurely at crosslinking sites, preventing the addition of non-template nucleotides by the reverse transcriptase rendering template switching unfeasible.

Prior art that relies on non-enzymatic joining of oligonucleotides rather than ligation-based approaches for the generation of sequencing libraries from RNA, such as WO 2019/063803 and WO 2021/130151, uses the click chemistry concept, a class of highly specific and efficient chemical reactions that occur rapidly under mild conditions, such as the copper-catalyzed reaction of azides with alkynes to give 1,2,3-triazoles (R. Huisgen, 1,3-Dipolar Cycloaddition Chemistry (Ed.: A. Padwa), Wiley, New York, 1984). However, the need to incorporate artificial nucleotides during reverse transcription can make the resulting cDNA incompatible with standard library preparation methods, and the repeated purification procedures required between individual steps to avoid chemical inhibition of downstream enzymatic reactions (e.g., copper-induced inhibition of polymerases) lead to sample loss throughout the workflow and large input requirements of 1 µg of total RNA (ClickTech Library Kit full-length mRNA Seq V2.0).

The methods of the present disclosure instead employ an enzymatic ligation-based approach for the generation of sequencing libraries from RNA that is fully compatible with standard enzymes in conventional library workflows without extensive purification procedures which minimizes sample loss and lowers input requirements to as few as 500-1000 cells (e.g., equivalent to 5-20 ng of total RNA assuming a concentration of 10-20 pg of RNA/cell). Furthermore, TLC is not prone to concatemerization, and performing TLC library preparation without input RNA does not yield fragments larger than the expected size without any insert (e.g., amplicons consisting purely of amplification primers and first and second adapter oligonucleotides) (Figure 11), which can be easily removed through size-selection using standard techniques known in the art, such as, for example and without limitation, AMPure size selective chemistry or purification via gel electrophoresis. This greatly reduces the amount of non-specific background and, by extension, sequencing cost. Accordingly, the methods of the present disclosure are more efficient, versatile, cost-effective, and provide more flexibility than traditional approaches.
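The input figures given above can be checked with a short calculation. The following is a minimal sketch using only the values stated in the text (500-1000 cells, 10-20 pg of RNA per cell, and the 1 µg input cited for the click-chemistry kit); the function and constant names are hypothetical.

```python
PG_PER_NG = 1000.0  # picograms per nanogram


def total_rna_ng(n_cells: int, pg_per_cell: float) -> float:
    """Estimated total RNA mass in ng for a given number of cells."""
    return n_cells * pg_per_cell / PG_PER_NG


low_ng = total_rna_ng(500, 10.0)     # lower bound: 500 cells at 10 pg/cell
high_ng = total_rna_ng(1000, 20.0)   # upper bound: 1000 cells at 20 pg/cell
print(low_ng, high_ng)               # 5.0 20.0

# Comparison with the 1 ug (1000 ng) input cited for the click-chemistry kit.
CLICK_INPUT_NG = 1000.0
print(CLICK_INPUT_NG / high_ng, CLICK_INPUT_NG / low_ng)  # 50.0 200.0
```

Under these stated assumptions, the 5-20 ng range corresponds to roughly 50-fold to 200-fold less total RNA than the 1 µg input.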

Those skilled in the art will appreciate that the invention described herein is susceptible to variations and modifications other than those specifically described. It is to be understood that the invention includes all such variations and modifications without departing from the spirit or essential characteristics thereof. The invention also includes all of the steps, features, compositions and compounds referred to or indicated in this specification, individually or collectively, and any and all combinations or any two or more of said steps or features. The present disclosure is therefore to be considered as in all aspects illustrated and not restrictive, the scope of the invention being indicated by the appended claims, and all changes which come within the meaning and range of equivalency are intended to be embraced therein.

The foregoing description will be more fully understood with reference to the following Examples. Such Examples are, however, exemplary of methods of practising the present invention and are not intended to limit the application and the scope of the invention.

EXAMPLES

RNA-binding proteins are instrumental for post-transcriptional gene regulation and play an active part in numerous human pathologies, including neurodegenerative diseases, cancer, as well as infection. Despite their crucial role in regulating all aspects of RNA metabolism, transcriptome-wide methods to profile RNA-protein interactions remain technically challenging. Protein-centric approaches to study RNA-protein interactions mainly rely on cross-linking and immunoprecipitation (CLIP) of RNA-binding proteins (RBPs) and generation of sequencing libraries from co-purified RNA. Over the years, several variations of this technique emerged, most prominently iCLIP along with derivations such as enhanced CLIP (eCLIP), infrared CLIP (irCLIP) and more recent improvements including iCLIP2 and improved iCLIP (iiCLIP). These techniques enable the mapping of RNA binding sites at nucleotide resolution and, while individual steps differ between protocols, they follow the same overall strategy: cells are cross-linked with UVC light followed by lysis and partial RNA digestion before or after immunoprecipitation of the RBP of interest. Co-purified RNA is then 3’ adapter ligated prior to SDS polyacrylamide gel electrophoresis (SDS-PAGE) and transfer onto nitrocellulose, from where RNA is liberated, purified and reverse transcribed into cDNA prior to second adapter ligation and PCR amplification to generate sequencing-compatible libraries. Major bottlenecks, particularly during library preparation, include extensive purification steps and suboptimal enzymatic reactions, such as the second adapter ligation, that lead to sample loss, low complexity libraries and the requirement for large amounts of starting material (~20M cells) and sequencing depth.

The feasibility of TLC was demonstrated by preparing sequencing libraries from RNA fragments that co-precipitated with RNA-binding proteins during crosslinking and immunoprecipitation (CLIP). The increased sensitivity of the library preparation reduces input requirements by a factor of up to 40,000 compared to eCLIP, with high quality libraries obtained from as few as 500 cells. Despite drastically lowered input material, TLC libraries require less PCR amplification compared to eCLIP libraries, increasing library complexity and resulting in a higher number of sequencing reads retained for downstream analysis after the removal of PCR duplicates, which lowers sequencing requirements (Figure 10).
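As a consistency check, and assuming that the approximately 20 million cells mentioned above for conventional CLIP protocols is the reference input for eCLIP, the stated reduction factor follows from:

```latex
\[
\frac{2 \times 10^{7}\ \text{cells}}{5 \times 10^{2}\ \text{cells}} = 4 \times 10^{4}
\]
```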

When combined with CLIP, TLC follows the procedure outlined in FIG. 5 (Steps 2-9), with RNA precursors resulting from nuclease digestion (Step 2) following the purification of an RNA-binding protein of interest (not pictured). 3’ ends of RNA precursors of interest are then ligated to the first adapter oligonucleotide, containing a primer binding domain (PBS) and a polyA stretch (Step 3). Ligated RNA molecules are then captured on oligo(dT) beads and reverse transcribed into first strand cDNA, with the oligo(dT) serving as first strand cDNA synthesis primer (Step 4). Solid-phase cDNA is then separated from template RNA through heat denaturation and used as acceptor molecule for a subsequent ligation reaction (Steps 5-7). To increase the efficiency of ssDNA ligation, a tailing strategy is used that results in the addition of a few (e.g., 1-3) (Figure 12) non-template ribonucleotides (e.g., ATP) at the 3’ end of solid-phase first strand cDNA (Step 6). This increases the affinity of T4 RNA Ligase 1 for the 3’ end of the first strand cDNA molecule, which it joins to the 5’ phosphorylated second adapter oligonucleotide, containing a sample barcode, a unique molecular identifier and a primer binding site containing the sequence of the read 1 sequencing primer. Following ligation, solid-phase first strand cDNA can be directly amplified via PCR with the addition of necessary sequencing adapters and simultaneously eluted off the magnetic beads. In some aspects, amplification is performed with amplification primers fully complementary to the primer binding sites present on both ends of the cDNA, resulting in short amplicons that may be desirable for additional size selection of the insert. Following size selection, further PCR amplification can be performed to add additional sequencing platform adapter domains to complete the preparation of sequencing libraries.
In this example, the nucleic acids in the library are suitable for sequencing on an Illumina® sequencing system and include the P5 adapter sequence; a Read 1 sequencing primer sequence; a unique molecular identifier surrounding a sample barcode; an insert corresponding to the template RNA of interest; a Read 2 sequencing primer sequence; a reverse index sequence; and a P7 adapter sequence. Such sequencing libraries are compatible with single-end sequencing protocols, with the first 15 nucleotides of the reads corresponding to a 9 nt unique molecular identifier (UMI) for deduplication, that is split around a 6 nt sample barcode for greater multiplexing capacity.
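By way of illustration only, the read layout described above (a 15 nt prefix made up of a 9 nt UMI split around a 6 nt sample barcode, followed by the insert) may be parsed as sketched below. The text does not state how many UMI bases precede the barcode, so the `umi_5p_len` parameter and its default of 5 are assumptions made purely for this example, as are the function name and the example sequence.

```python
from typing import NamedTuple

UMI_LEN = 9                          # total UMI length stated in the text
BARCODE_LEN = 6                      # sample barcode length stated in the text
PREFIX_LEN = UMI_LEN + BARCODE_LEN   # first 15 nt of the read


class ParsedRead(NamedTuple):
    umi: str
    barcode: str
    insert: str


def parse_read(seq: str, umi_5p_len: int = 5) -> ParsedRead:
    """Split a read into UMI, sample barcode and insert.

    umi_5p_len is the ASSUMED number of UMI bases located before the
    barcode; the remaining UMI bases are taken from after the barcode.
    """
    if not 0 < umi_5p_len < UMI_LEN:
        raise ValueError("UMI must have bases on both sides of the barcode")
    if len(seq) < PREFIX_LEN:
        raise ValueError("read is shorter than the 15 nt prefix")
    barcode_end = umi_5p_len + BARCODE_LEN
    umi = seq[:umi_5p_len] + seq[barcode_end:PREFIX_LEN]
    barcode = seq[umi_5p_len:barcode_end]
    return ParsedRead(umi=umi, barcode=barcode, insert=seq[PREFIX_LEN:])


# Hypothetical read: 5 nt UMI + 6 nt barcode + 4 nt UMI + insert
parsed = parse_read("AAAAA" + "CGTGAT" + "TTTT" + "GGGCCCGGG")
print(parsed)
```

The barcode can then be used to assign the read to its sample, and the reassembled UMI can be used for deduplication.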

In addition to technical improvements during library preparation that reduce both experimental time and cost, TLC-CLIP libraries show superior performance compared to previous methodologies and retain a much larger fraction of sequencing reads that can be used for downstream analysis. This drastically reduces the associated cost for next-generation sequencing, lowering sequencing depth requirements by orders of magnitude (Figure 10).

A direct comparison between CLIP libraries prepared with TLC and public eCLIP datasets showed up to 68% overlap with eCLIP peaks, when restricting the comparison to genes with similar expression levels between 293T and HepG2 cells to account for underlying gene expression differences in the cell types that were profiled (Figure 13). CLIP libraries prepared with TLC also show improved sensitivity for de novo motif discovery and recapitulate previously reported motifs with high precision and stronger motif enrichment at the peak summit compared to eCLIP libraries (Figure 14 and Figure 15).

An additional benefit of the TLC-CLIP protocol compared to other technologies is an increased frequency of crosslink-induced mutations (Figure 16) that occur during reverse transcription and provide exact nucleotide resolution of the observed RNA-protein interaction. Crosslink-induced deletions (CIDs) are highly correlated at the single-nucleotide level (Figure 17), and increase the precision at RBP binding sites and identify the exact nucleotide residues bound by the probed RBP (Figure 18). They can also function as an additional quality filter, by examining the ratio of CIDs at individual nucleotide positions (n_del/n_total reads), which allows efficient filtering of non-crosslinked, co-purifying fragments to increase specificity (Figure 19). This is of particular interest when applying TLC-CLIP without PAGE purification, which enables a 2-day fully automatable workflow and further lowers the input requirements down to 500 cells. Libraries generated without PAGE purification show lower motif enrichment, which is indicative of a higher level of contaminating background sequences, as expected when removing an additional purification step (Figure 20). Lower motif enrichment is accompanied by lower CID ratios, demonstrating the importance of CIDs as an additional quality filter to discern true binding sites from co-purifying, non-crosslinked fragments in samples with higher background signal. Nevertheless, libraries generated without PAGE purification recapitulate the binding behavior of a given RBP, capturing the position-dependent enrichment in relation to intronic Alu elements for hnRNPc from as little as 500 cells (Figure 21).
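The CID-ratio filter described above may be sketched as follows. This is a minimal illustration that assumes per-position counts of deletions and of total covering reads are already available; the threshold values, the function names and the example counts are hypothetical and are not taken from the disclosure.

```python
from typing import Dict, Tuple

Position = Tuple[str, int]   # (chromosome, coordinate)
Counts = Tuple[int, int]     # (reads with a deletion, total reads covering)


def cid_ratio(n_del: int, n_total: int) -> float:
    """Ratio of crosslink-induced deletions to total reads at one position."""
    if n_total <= 0:
        return 0.0
    return n_del / n_total


def filter_positions(
    counts: Dict[Position, Counts],
    min_ratio: float,
    min_reads: int = 1,
) -> Dict[Position, float]:
    """Keep positions with enough coverage and a CID ratio >= min_ratio."""
    kept = {}
    for position, (n_del, n_total) in counts.items():
        if n_total < min_reads:
            continue  # too little coverage to estimate a ratio
        ratio = cid_ratio(n_del, n_total)
        if ratio >= min_ratio:
            kept[position] = ratio
    return kept


# Hypothetical counts for three positions
example_counts = {
    ("chr1", 1000): (12, 100),  # ratio 0.12: retained
    ("chr1", 1001): (1, 100),   # ratio 0.01: below threshold
    ("chr2", 50): (0, 0),       # no coverage: skipped
}
kept = filter_positions(example_counts, min_ratio=0.05)
print(kept)
```

Suitable thresholds would need to be chosen empirically for a given RBP and library, for example by comparison with a non-crosslinked control.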

Taken together, the streamlined TLC library preparation protocol drastically reduces both time and cost of CLIP experiments, while generating high quality RBP binding profiles from low input material. The larger number of crosslink-induced deletions further improves both the precision and specificity of CLIP libraries generated with TLC, by providing nucleotide resolution of crosslinking sites and distinguishing true binding sites from co-purifying, non-crosslinked fragments. Furthermore, by eliminating the need for PAGE purification, input requirements can be further reduced with high quality data obtained from as few as 500 cells, presenting a fully bead-based, single-tube library preparation strategy amenable to automation for high-throughput settings.

These improvements will open new opportunities in the field of post-transcriptional gene regulation to study RNA-protein interactions in larger settings, for example in combination with siRNA or drug screens, as well as its application to samples of limited quantity.

Used in combination with CLIP, TLC design innovations compared to other protocols include:

1. TLC-L3 oligo: an infrared-dye-conjugated oligo used during the first adapter ligation (first introduced by Zarnegar et al.) that allows visualisation of cross-linked RNA without the need for radioactive isotope labelling. The adapter sequence contains the partial sequence of the Illumina Index Sequencing primer followed by a poly(A) stretch, which enables purification of ligated RNA molecules on oligo(dT) beads.

2. RNA purification via poly(A) capture: introduction of the poly(A) tail during adapter ligation allows capture and purification of RNA molecules within minutes using oligo(dT)-coupled magnetic beads instead of overnight precipitation. Furthermore, this strategy makes purification of RNA-protein complexes via SDS-PAGE optional, thus opening the potential for automation of the entire protocol on a liquid handling system.

3. Solid-phase cDNA libraries: Oligo(dT)-bead-based capture is not only used for purification of RNA, but also for priming reverse transcription, resulting in cDNA covalently linked to magnetic beads. This allows efficient separation of adapter-ligated RNA from first-strand cDNA via heat denaturation and facilitates purification and downstream reactions that can be performed on-bead in the same reaction tube.

4. Ribo-tailing of cDNA for increased ligation efficiency: single-stranded (ss)DNA ligations are inherently inefficient due to the low affinity of RNA ligases for DNA as an acceptor molecule. This causes the permanent loss of molecules that fail to ligate, resulting in low complexity libraries. To improve the ligation efficiency, a Terminal Transferase is included in the reaction which incorporates ATP (essential for ligation reaction) in the form of a short ribo-tail at the 3’ end of the cDNA, greatly increasing the affinity, and thus efficiency, of T4 RNA Ligase for the substrates.

TLC oligonucleotides in combination with CLIP

All adapters and oligos used throughout the protocol were ordered from Integrated DNA Technologies (IDT) and information regarding sequences, scale and purification is provided in Table 1.

The TLC-L3 oligo for the first adapter ligation was synthesized at 250 nmole scale, carrying a 5’ phosphorylation and 3’ IRDye® 800CW (NHS Ester) (v3) modification and was purified using RNase-free HPLC with a total yield of 21.1 nmoles.

Pre-adenylation was performed on 5 nmoles using the 5’ DNA Adenylation Kit (NEB, E2610L) as follows: 50 µl of 100 µM L3 adapter were set up with 25 µl 10X 5’ DNA Adenylation Reaction Buffer, 25 µl 1 mM ATP and 50 µl Mth RNA Ligase (1 nmol) in a total volume of 200 µl. Reaction was incubated at 65°C for 2 hours followed by inactivation at 85°C for 10 minutes, during which it turns cloudy. Reaction was then cleaned up using the Nucleotide Removal Kit (Qiagen, Cat #28304) as follows: 200 µl were mixed with 4.8 ml of PNI buffer, distributed over 10 columns and spun down at 6000 rpm for 30 seconds. Columns were washed once in 750 µl PE buffer, spun for 1 minute at 6000 rpm, followed by an empty spin at full speed before transferring columns to a new collection tube. 50 µl H2O were added per column and incubated at RT for 2 minutes before centrifugation at 6000 rpm for 1 minute. Eluates were combined with an approximate final concentration of 10 µM, and 1 µM working stocks were prepared and frozen at -20°C. Aliquots can be freeze-thawed at least 20 times without any detectable loss in activity.

Cell culture and generation of CLIP lysates

Adherent 293T cells (ATCC® CRL-1573™) were grown to ~80% confluency in Dulbecco’s Modified Eagle Medium (Gibco, #41966-029) supplemented with 10% FCS (Sigma-Aldrich, #F9665-500ML, Lot #19A124) and 1% Penicillin-Streptomycin-L-Glutamine (MED30-009-CI). Cells were rinsed in ice-cold PBS and crosslinked on ice with 254 nm UV-C light at 0.3 J/cm² in a CL-3000 Ultraviolet Crosslinker (UVPA849-95-0615-02). Cells were collected into PBS by scraping, counted, and the desired cell number was aliquoted and spun down. Cell pellets were resuspended in iCLIP Lysis buffer (50 mM Tris-HCl pH 7.4, 100 mM NaCl, 1% Igepal CA-630, 0.1% SDS, 0.5% sodium deoxycholate) using 50 µl per 50,000 cells. Lysates were incubated on ice for 5 minutes followed by sonication for 5-10 seconds at 0.5 seconds ON and 0.5 seconds OFF at 10% amplitude using a tip sonicator (Branson LPe 40:0.50:4T). Protein concentration was measured using the Pierce™ Rapid Gold BCA Protein Assay Kit (Thermo Scientific, A53225) and lysates were either processed directly or stored at -80°C.

RNase treatment, immunoprecipitation and first adapter ligation

Protein-G beads (100 µl for 20-30 µg of antibody) were washed twice in 1 ml iCLIP Lysis buffer and resuspended in 100 µl per condition. Per IP, 1 µg of antibody against hnRNPc (Santa Cruz Biotechnology, sc-32308), RBM9 (Bethyl Laboratories, A300-864A), hnRNP A1 (4B10) (Santa Cruz Biotechnology, sc-32301), or hnRNPI (Santa Cruz Biotechnology, sc-56701) was added and the antibody-bead mixture was incubated at room temperature (RT) for 30-60 minutes on a rotating wheel.

Meanwhile, cell lysates were treated with different RNase concentrations using 0.25U, 0.025U and 0.005U of RNase I (Thermo Fisher, EN0602) for high, medium and low conditions. RNase dilution was added to cell lysates together with 2 µl Turbo DNase (Thermo Fisher, #AM2238) and lysates were incubated at 37°C for exactly 3 minutes at 1100 rpm, followed by 3 minutes on ice. Cell lysates were spun down for 10 minutes at 4°C at full speed and supernatant was transferred to a new tube.

Antibody-bead mixture was washed twice in iCLIP lysis buffer to remove unbound antibody, and RNase-treated cell lysates were added alongside cOmplete EDTA-free Protease Inhibitor Cocktail (Merck, #11836170001) and incubated for 2 hours at 4°C on a rotating wheel. After IP, beads were washed twice in 200 µl High Salt Buffer (50 mM Tris-HCl pH 7.4, 1 M NaCl, 1 mM EDTA, 1% Igepal CA-630, 0.1% SDS, 0.5% sodium deoxycholate), with the second wash at 4°C for 3 minutes on a rotating wheel, followed by two washes in 200 µl PNK Wash Buffer (20 mM Tris-HCl, pH 7.4, 10 mM MgCl2, 0.2% Tween-20).

Dephosphorylation of 3’ ends was performed in 20 µl of PNK reaction for 30 minutes at 37°C (70 mM Tris-HCl, pH 6.5, 10 mM MgCl2, 1 mM DTT, 10U SUPERase In RNase Inhibitor (ThermoFisher, #AM2696), 5U T4 Polynucleotide Kinase (NEB, #M0201L)). Beads were washed twice in PNK Wash Buffer and resuspended in 20 µl of ligation mix for overnight incubation at 16°C and 1200 rpm (50 mM Tris-HCl, pH 7.8, 10 mM MgCl2, 1 mM DTT, 10U SUPERase In RNase Inhibitor, 10U T4 RNA Ligase (NEB, #M0204), 1 µl of 1 µM L3 adapter and 20% PEG400 (Sigma-Aldrich, #91893)).

TLC-CLIP library preparation with PAGE purification

Following the first adapter ligation, beads were washed twice in 200 µl High Salt Buffer, twice in 200 µl PNK Wash buffer and then resuspended in 20 µl 1X LDS sample buffer (Thermo Fisher, #NP0008) containing 5% beta-mercaptoethanol (Sigma-Aldrich, #M6250). Samples were denatured for 1 minute at 70°C and RNA-protein complexes were resolved on NuPAGE 4-12% Bis-Tris Gels (Thermo Fisher, #WG1402A) at 180V for 1 hour. Transfer was performed onto nitrocellulose (BioRad, #1620115) in 1X NuPAGE transfer buffer (Thermo Fisher, #NP00061) with 10% methanol at 30V for 2 hours at RT.

Nitrocellulose membranes were scanned on an Odyssey® CLx Infrared Imager (LI-COR, 9141) with 169 µm resolution to visualise RNA localisation and then placed on filter paper soaked in PBS. Regions of interest were cut out from the nitrocellulose membrane corresponding to ~20-100 kDa above the molecular weight of the RBP of interest due to the ligation of L3 adapter (~15.9 kDa) and associated RNA (with 70 nt of RNA averaging ~20 kDa). Nitrocellulose pieces were placed in LoBind Eppendorf tubes and 200 µl Proteinase K buffer (100 mM Tris-HCl, pH 7.4, 50 mM LiCl, 1 mM EDTA, 0.2% LiDS) containing 200 µg Proteinase K (Thermo Fisher, #AM2546) were added and incubated at 50°C for 45 minutes at 800 rpm.

Meanwhile, 10 µl of Oligo(dT)25 Dynabeads™ (Thermo Fisher, #61005) per sample were washed in 1 ml of oligo(dT) Binding Buffer (20 mM Tris-HCl, pH 7.4, 1 M LiCl, 2 mM EDTA) and resuspended in 50 µl of oligo(dT) Binding Buffer per sample. Following Proteinase K treatment, supernatant was transferred to fresh tubes containing 50 µl of oligo(dT) beads and incubated for 10 minutes at RT on a rotating wheel. Following RNA capture, beads were washed twice in 125 µl oligo(dT) Wash Buffer (10 mM Tris-HCl, pH 7.4, 150 mM LiCl, 0.1 mM EDTA) and once in 20 µl 1X First-Strand Buffer (50 mM Tris-HCl, pH 8.3, 75 mM KCl, 3 mM MgCl2). Beads were resuspended in 10 µl of Reverse Transcription Mix (1X First-Strand Buffer, 0.5 mM dNTPs, 1 mM DTT, 6U SUPERase In RNase Inhibitor, 20U SuperScript™ IV Reverse Transcriptase (Thermo Fisher, #18090050)) and incubated for 15 minutes at 50°C followed by 10 minutes at 80°C heating up to 96°C. Samples were vortexed for 30 seconds at 96°C and then immediately placed on a magnet on ice. Supernatant containing adapter-ligated RNA was removed and efficiency of elution can be confirmed by dot-blotting on nitrocellulose membrane.

Solid-phase cDNA on beads was washed once in 60 µl oligo(dT) Wash Buffer and once in 20 µl 1X T4 RNA Ligase Buffer (50 mM Tris-HCl, 10 mM MgCl2, 1 mM DTT, pH 7.5). Beads were resuspended in 5 µl of 5’ Adapter mix (2 µl 10X T4 RNA Ligase Buffer, 2 µl of 10 µM L## oligo (see Table 1), 1 µl 100% DMSO), incubated at 75°C for 2 minutes then immediately placed on ice. 4 µl of Ligation Mix (5 mM ATP, 7U Terminal Deoxynucleotidyl Transferase (TdT) (2230B, Takara), 15 U T4 RNA Ligase High Concentration (M0437, NEB)) were added as well as 10 µl 50% PEG8000 and the reaction was mixed by pipetting up and down until beads were resuspended. Reaction was incubated at 37°C for 30 minutes, then cooled down to room temperature. 30 U of T4 RNA Ligase were added, the reaction mixed by pipetting and incubated at RT overnight with occasional vortexing for 15 seconds at 2000 rpm every two minutes.

Following overnight incubation, ligation reaction was removed, and beads washed in 100 µl oligo(dT) Wash Buffer and 20 µl 1X Phusion HF Buffer (Thermo Fisher, #F518L). Beads were resuspended in 25 µl cDNA amplification mix (1X Phusion HF PCR Master Mix (NEB, #M0531L) and 0.5 µM P5 and P7 short primer mix (see Table 1)) and amplification was performed with the following programme: 30 seconds at 98°C, 7 cycles of 10 seconds at 98°C, 30 seconds at 65°C and 30 seconds at 72°C followed by final extension at 72°C for 3 minutes. Meanwhile, 2 µl of oligo(dT) beads per sample were washed once in 1 ml oligo(dT) Binding buffer and resuspended in 5 µl per sample. After cDNA amplification, 5 µl of oligo(dT) beads were added and incubated at RT for 5 minutes on a rotating wheel to capture unwanted amplification by-products. Samples were placed on magnet and supernatant containing amplified cDNA was transferred to a fresh tube.

Size-selection of cDNA was performed using ProNEX® Size-Selective Purification System (Promega, #NG2002) with a ratio of 2.8X to enrich for cDNA inserts of at least 20 nucleotides in length (>80 bp). Library yield was then estimated by amplifying 1 µl of purified cDNA via qPCR using the full-length P5 and P7 index primers, and 2-3 cycles were subtracted from the obtained Ct value to set the cycle number for final library amplification. Following PCR amplification, libraries were size-selected again using the ProNEX® Size-Selective Purification System, with a ratio of 1.8X to select fragments larger than 165 bp. Quality control was performed using the Agilent High Sensitivity DNA Kit (Agilent, #5067-4626) and libraries were quantified using the KAPA Library Quantification Kit (Roche, #KK4824).

TLC-CLIP library preparation without PAGE purification

When omitting PAGE purification, the first adapter ligation was performed for 75 minutes at 25°C. Beads were washed as described above and either directly resuspended in Proteinase K reaction or in 20 µl of RecJ adapter removal reaction (1X NEB Buffer 2 (NEB, #B7002S), 25U 5’ Deadenylase (NEB, #M0331S), 30U RecJ endonuclease (NEB, #M0264S), 10U SUPERase In and 20% PEG-400) and incubated at 37°C for 30 minutes prior to Proteinase K treatment. Samples were then placed on magnet, and supernatant was transferred to fresh tubes containing oligo(dT) beads, with the remaining library preparation performed as described above.

Sequencing

TLC-CLIP libraries were sequenced on an Illumina NextSeq500 using the High Output Kit v.2.5 for 75 cycles, using Illumina protocol #15048776. 5% PhiX were added to final library pools for increased complexity and sequencing run was performed with custom configuration, running 86 cycles for Read 1 and 6 index cycles.

Mock ligations and denaturing polyacrylamide gel electrophoresis

Efficiency of second adapter ligation was tested in mock ligations using TLC-CLIP L01 as donor molecule and i7-3 as acceptor. 2 µl 10X T4 RNA Ligase Buffer, 1 µl 10 µM TLC-CLIP L01 oligo, 1 µl 10 µM i7-3 oligo and 1 µl DMSO were mixed and incubated at 75°C for 2 minutes. Reaction was placed on ice and 4 µl of Ligation mix containing 0.2 µl 0.1 M ATP, 0.5 µl TdT and 0.5 µl T4 RNA Ligase High Concentration were added followed by addition of PEG8000 to the indicated percentage. Ligation was incubated for 30 minutes at 37°C then cooled down to 16°C. Half the reaction was removed after 30 minutes at 16°C, the remaining reaction was incubated overnight. 1 µl of Ligation reaction was mixed with 1 µl Gel Loading Buffer II (Thermo Fisher, #AM8546G) and denatured at 72°C for 3 minutes. Samples were separated on 10% TBE-Urea gels (Thermo Fisher, #EC68752BOC) and stained with 1X SYBR® Gold (Thermo Fisher, #S11494) for 10 minutes in TBE buffer.

Data analysis

Demultiplexing and Trimming with Flexbar

Sequencing data was demultiplexed by i7 index reads using bcl2fastq without any read trimming. Further demultiplexing by in-read 5’ barcodes and trimming of adapter sequences was performed using Flexbar v.3.5.0 (https://github.com/seqan/flexbar) 19 in a two-step approach. In the first step, reads are demultiplexed by in-read barcodes allowing no mismatches, and UMIs are moved into the read header. Barcode sequences (see Table 1) including the UMI designated by the wildcard character ‘N’ are provided in fasta format, with the arguments “-b barcodes.fasta --barcode-trim-end LTAIL --barcode-error-rate 0 --umi-tags”. In the second step, any adapter contamination at the 3’ end of the reads is removed allowing an error rate of 0.1 with the following arguments: “--adapter-seq 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG' (SEQ ID NO: 5) --adapter-trim-end RIGHT --adapter-error-rate 0.1 --adapter-min-overlap 1”. In addition, potential T-stretches at the 5’ end that are the result of ribotailing during ligation are removed by trimming ‘T’ homopolymers of 1-2 nucleotides in length (Figure 12) using “--htrim-left T --htrim-max-length 2 --htrim-min-length 1”, and reads shorter than 18 nucleotides post trimming are discarded by “--min-read-length 18”.
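The ribotail trimming and length filter described above can be illustrated with a minimal Python sketch. This is not Flexbar's implementation; the function name trim_ribotail and the handling of reads starting with more than two T residues (only the first two are removed) are assumptions made for illustration.

```python
def trim_ribotail(seq, qual, base="T", max_len=2, min_read_len=18):
    """Remove up to max_len leading `base` residues from a read.

    Returns the trimmed (sequence, quality) pair, or None when the trimmed
    read is shorter than min_read_len and should be discarded.
    Assumption: longer T stretches lose only their first max_len bases.
    """
    n = 0
    while n < max_len and n < len(seq) and seq[n] == base:
        n += 1
    seq, qual = seq[n:], qual[n:]
    if len(seq) < min_read_len:
        return None
    return seq, qual


# Example: two leading T residues are removed, leaving a 20-nt read.
read = "TT" + "ACGT" * 5
trimmed = trim_ribotail(read, "I" * len(read))
```

Reads returning None correspond to those removed by the minimum read length setting.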

STAR Alignment

Flexbar-trimmed reads were aligned against hg19 using STAR v.2.7.3a (https://github.com/alexdobin/STAR)20 with the following parameters to keep only uniquely mapping reads, remove the penalty for opening deletions and insertions, and fully extend the 5’ end of reads to preserve the end of cDNA molecules: “--outFilterMultimapNmax 1 --scoreDelOpen 0 --scoreInsOpen 0 --alignEndsType Extend5pOfRead1”. To retain the UMI in the read header during STAR alignment, any space in the header needs to be removed prior to mapping.
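The header clean-up mentioned above can be sketched in Python as follows. The function name strip_header_spaces is hypothetical, and the sketch assumes a standard four-line FASTQ record layout; only header lines are modified.

```python
def strip_header_spaces(fastq_lines):
    """Yield FASTQ lines with spaces removed from each record's header line.

    Assumption: records are exactly four lines long (header, sequence,
    separator, quality), so every fourth line starting at index 0 is a header.
    """
    for i, line in enumerate(fastq_lines):
        if i % 4 == 0:
            yield line.replace(" ", "")
        else:
            yield line


# Example with a made-up read name carrying a UMI suffix.
record = ["@read1 1:N:0_ACGTAC", "ACGTACGT", "+", "IIIIIIII"]
cleaned = list(strip_header_spaces(record))
```

Because the aligner keeps only the read name up to the first space, removing the space keeps the UMI suffix attached to the name that is written to the alignment file.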

Deduplication of Reads

Aligned reads were deduplicated based on unique molecular identifiers using UMI-tools v.1.0.1 (https://github.com/CGATOxford/UMI-tools)21. The dedup command was used with the parameters “--extract-umi-method read_id --method unique --spliced-is-unique” to group reads with the same mapping position and identical UMI, while treating reads starting at the same position as unique if one is spliced and the other is not.

Peak Calling

Enriched regions were identified using the peak calling algorithm CLIPper v.2.0.0 (https://github.com/YeoLab/clipper)7,16 with default settings and a p-value cutoff of 0.001 “--poisson-cutoff 0.01”.

Multiqc and usable reads

General quality metrics of libraries were assessed using FastQC v0.11.7 (https://github.com/s-andrews/FastQC) and QC data were collated using multiqc v.1.9 (https://github.com/ewels/MultiQC)22 to extract information from combined log files to plot usable read fractions.

Deletions

Individual nucleotide positions of crosslink-induced deletions within TLC-CLIP reads were extracted using the htseq-clip tool (https://github.com/EMBL-Hentze-group/htseq-clip) 23 with the following parameters: “htseq-clip extract --mate 1 --site d”.

Filtering of peaks

CLIPper peaks were filtered by removing ENCODE blacklisted regions from eCLIP libraries as well as peaks obtained from TLC-CLIP libraries skipping the ligation step as well as IgG controls for either Rabbit or Mouse IgG depending on RBP. An additional score filter was applied by requiring -10 log(pval) to be larger than 50 for any downstream analysis. Consensus peaks between replicates were obtained using bedtools intersect requiring a minimum overlap of 25% between peaks.
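The score filter and the 25% overlap requirement can be illustrated with a short Python sketch. This is not the bedtools implementation; the dictionary layout of a peak and the choice to measure overlap as a fraction of the first replicate's peak are assumptions made for illustration.

```python
def overlap_fraction(a, b):
    """Fraction of half-open interval a = (start, end) that is covered by b."""
    overlap = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    length = a[1] - a[0]
    return overlap / length if length > 0 else 0.0


def consensus_peaks(rep1, rep2, min_score=50, min_frac=0.25):
    """Keep replicate-1 peaks that pass the score filter and overlap a
    passing replicate-2 peak on the same chromosome and strand."""
    rep2_pass = [q for q in rep2 if q["score"] > min_score]
    kept = []
    for p in rep1:
        if p["score"] <= min_score:
            continue
        for q in rep2_pass:
            same_locus = p["chrom"] == q["chrom"] and p["strand"] == q["strand"]
            frac = overlap_fraction((p["start"], p["end"]), (q["start"], q["end"]))
            if same_locus and frac >= min_frac:
                kept.append(p)
                break
    return kept
```

Here the score field stands for the -10 log(pval) value reported per peak, and peaks are compared strand-specifically, which is a design choice of this sketch rather than something stated in the protocol.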

Correlation plots

For correlation plots deletion positions of individual replicates were concatenated and coverage was calculated using bedtools multicov24. Count data was normalised using the cpm function from edgeR25 against total library size and log2 transformed. Point density plots were generated using the geom_pointdensity package available on Bioconductor and correlation coefficient was calculated using Pearson correlation.
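The normalisation and correlation steps can be sketched in plain Python. This is not the edgeR code; the pseudocount of 1 added before the log2 transformation and the function names are assumptions made for illustration.

```python
import math


def log2_cpm(counts, library_size, pseudocount=1.0):
    """Counts per million against total library size, then log2.

    Assumption: a pseudocount is added so that zero counts stay finite.
    """
    return [math.log2(c / library_size * 1e6 + pseudocount) for c in counts]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Example: deletion counts per position for two replicates (made-up numbers).
rep1 = log2_cpm([0, 5, 12, 40], library_size=1_000_000)
rep2 = log2_cpm([1, 4, 15, 38], library_size=1_200_000)
r = pearson(rep1, rep2)
```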

Pairwise comparison at peak level

Fraction of overlap between filtered peaks for individual replicates of either TLC-CLIP, eCLIP or easyCLIP was calculated using the intervene pairwise intersection module (https://intervene.readthedocs.io/en/latest/index.html) requiring a minimum of 25% overlap between peaks. For comparison between TLC-CLIP and eCLIP in HepG2 cells shown in Fig. 12, peaks were restricted to genes with stable gene expression between the two cell lines, as defined by differential gene expression analysis on total RNA-seq data for 293T and HepG2 cells.

De novo motif discovery

De novo motif discovery was performed using Homer27 v4.10 on peaks centred on either the apex region obtained from CLIPper or after centring peaks on the position with the highest deletion count. findMotifsGenome.pl was used with the parameters “-oligo -basic -rna -len 5 -S 10 -size given” where peak size is a 50-nucleotide window around the apex or with parameter “-size 50” for peaks centred on deletions.

Density plot for deletion and motif enrichment

Deletion density in Fig. 14 was calculated using the annotatePeaks.pl function from Homer for TLC-CLIP peaks centred onto the consensus motif, with motif files being generated using seq2profile.pl. Tag directories for deletions were generated using the Homer makeTagDirectory function on the bed file obtained from htseq-clip. peakSizeEstimate needs to be changed to 1 in the tagInfo.txt file to avoid extension of deletion tags and preserve nucleotide resolution. Deletion enrichment was obtained using annotatePeaks.pl with “-hist 1 -size 100” across deletion-centred peaks as well as peaks shuffled across the set of target genes bound by a given RBP. For RBPs recognising palindromic sequences such as ‘AGGGA’ or ‘CUUUC’ for hnRNPA1 or hnRNPI respectively, the exact position of the crosslinking site cannot be determined during alignment if the deletion falls within the homopolymer stretch. By default, STAR will position the deletion at the first base of the ambiguous sequence based on the DNA sequence, without awareness of the strand orientation of the gene, resulting in an artificial shift of the deletion position between genes on the forward or reverse strand. To remove this artifact, deletion positions for genes on the reverse strand were shifted by two nucleotides for hnRNPA1 and hnRNPI prior to visualisation.
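The strand correction described above can be sketched as follows. The direction of the shift (towards higher genomic coordinates) is an assumption of this sketch, based on the reasoning that the first base of a three-nucleotide homopolymer in genomic orientation is its last base in the orientation of a reverse-strand transcript; the function name is hypothetical.

```python
def shift_reverse_strand(deletions, offset=2):
    """Shift deletion positions of reverse-strand genes by `offset` nucleotides.

    deletions: iterable of (chrom, position, strand) tuples.
    Assumption: the shift is applied towards higher genomic coordinates.
    """
    shifted = []
    for chrom, pos, strand in deletions:
        if strand == "-":
            pos += offset
        shifted.append((chrom, pos, strand))
    return shifted


# Example: only the reverse-strand deletion is moved.
dels = [("chr1", 1000, "+"), ("chr1", 2000, "-")]
corrected = shift_reverse_strand(dels)
```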

Deletion-centred analysis

Peaks were centred on the maximum deletion position and coverage of this nucleotide position was calculated using bedtools multicov to calculate the CID ratio, indicating the proportion of reads at a given position that carry a deletion. Motif density across peaks with different CID ratios was calculated using annotatePeaks.pl with “-size 100 -hist 5 -norevopp”.
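A minimal Python sketch of the deletion-centred calculation is given below. It assumes that per-nucleotide deletion counts and read coverage across a peak are already available as lists; the function name and the tie-breaking rule (the first position with the maximum count is used) are assumptions made for illustration.

```python
def centre_on_max_deletion(peak_start, del_counts, read_counts):
    """Return (position, CID ratio) for the position with most deletions.

    del_counts and read_counts hold one value per nucleotide of the peak,
    starting at genomic coordinate peak_start.
    CID ratio = reads carrying a deletion / total reads at that position.
    """
    if not del_counts or len(del_counts) != len(read_counts):
        raise ValueError("count lists must be non-empty and of equal length")
    idx = max(range(len(del_counts)), key=lambda i: del_counts[i])
    n_del, n_total = del_counts[idx], read_counts[idx]
    ratio = n_del / n_total if n_total else 0.0
    return peak_start + idx, ratio


# Example: the third nucleotide of the peak carries the most deletions.
position, cid_ratio = centre_on_max_deletion(1000, [0, 2, 9, 1], [20, 25, 30, 28])
```

Peaks can then be binned or filtered by the returned ratio, with low values pointing to co-purifying, non-crosslinked fragments.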

To visualise the percentage of peaks carrying motifs according to CID ratio, findMotifsGenome.pl was used with “-find motif.motif -size 50 -norevopp”. Peak annotation across different transcriptomic and genomic features was performed using annotatePeaks.pl.

Deletion Visualisation

Intronic antisense Alu sequences were extracted from RepeatMasker and intersected with deletion-centred peaks with a CID ratio larger than 10 from PAGE or noPAGE libraries, yielding splice sites that were either shared between experimental conditions or specific to either PAGE or noPAGE libraries.

Deletion positions from htseq-clip were merged across all replicates and converted to bam files using bedtools bedtobam. Bigwig files were then generated using the deeptools function bamCoverage with a bin size of 1, normalising for total deletion count (CPM). Heatmaps and coverage profiles were generated using the computeMatrix and plotHeatmap functions from deeptools.

Table 1

"N" stands for any nucleotide

"n" designates a phosphorothioated DNA base (nucleotide). The phosphorothioate (PS) bond substitutes a sulfur atom for a non-bridging oxygen in the phosphate backbone of an oligo. This modification renders the internucleotide linkage resistant to nuclease degradation.

Adaptation of the above example to the detection of RNA modifications

The TLC library preparation described herein can also be applied to an adaptation of the CLIP protocol described above towards the profiling of RNA modifications, including but not limited to N6-methyladenosine (m6A). In this example, TLC follows the procedure outlined in FIG. 5 (Steps 2 - 9), with RNA precursors resulting from chemical fragmentation (Step 2) followed by purification of RNA fragments carrying a modification of interest through affinity purification (not pictured).

Claims

1. A method for preparing a sequencing library from a ribonucleic acid (RNA) sample, the method comprising:

2. The method of claim 1, wherein each template RNA of step (a) contains a known sequence to serve as hybridization domain.

3. The method of claim 1 or 2, wherein the precursor RNA is fragmented.

4. The method of any one of claims 1-3, wherein nucleotides are added to the 3’ end of the precursor RNA through polyadenylation or ligation.

5. The method of any one of claims 1-4, wherein each of the first adapters and/or each of the first strand cDNA synthesis primers further comprise a sample barcode and/or unique molecular identifier.

6. The method of any one of claims 1-5, wherein each of the second adapters further comprises a sample barcode, unique molecular identifier and/or a sequencing read primer domain.

7. The method of any one of claims 1-6, wherein each of the first strand cDNA synthesis primers is not covalently linked to magnetic beads.

8. The method of any one of claims 1-7, wherein the method further comprises tagmenting the plurality of the double stranded cDNA of step (h) with transposomes to generate a tagmented sample, wherein the transposomes comprise a transposase and a transposon nucleic acid; and wherein the transposon nucleic acid comprises a transposon end domain and a second post-tagmentation amplification primer binding domain.

9. The method of any one of claims 1-8, wherein the test sample is obtained by a method for purifying an RNA molecule carrying a modification of interest in a biological sample, comprising

(a) cleaving the RNA molecule by contacting the biological sample with an agent capable of cleaving the phosphodiester bond, thereby generating a fragment of the RNA molecule, wherein the majority of fragments is around 100 nucleotides in length;

(b) contacting the RNA fragment in said biological sample with a molecule that specifically interacts with a particular modification of interest, wherein said molecule can be a protein, such as an antibody;

(c) contacting the biological sample with an agent that creates a covalent bond between the RNA molecule and the molecule that specifically interacts with the modification of interest, thereby generating a covalently bound complex containing the RNA with the modification of interest;

(d) purifying the complex obtained in step c) to provide RNA fragments containing the modification of interest, wherein said RNA fragments are used as precursor RNA of claim 1.

10. The method of claim 9, wherein the agent capable of cleaving the phosphodiester bond in step a) is a chemical agent, such as divalent cations (e.g., zinc, magnesium).

11. The method of any one of claims 1-8, wherein the test sample is obtained by a method for purifying an RNA molecule interacting with an RNA binding protein (RBP) of interest in a biological sample, comprising

(a) contacting the biological sample with an agent that creates a covalent bond between the RNA molecule and the RBP of interest, thereby generating a covalently bound RBP-RNA complex containing the RNA molecule;

(b) cleaving the RNA molecule by contacting the RBP-RNA complex with an agent capable of cleaving a bond thereof, thereby generating a fragment of the RNA molecule, wherein the fragment is at least 22 nucleotide bases in length;

(c) selecting the RBP-RNA fragment complex in said biological sample with a molecule that specifically interacts with a component of the RBP-RNA fragment complex; and

(d) purifying the RBP-RNA fragment complex obtained in step c) to provide RNA fragments interacting with the RBP of interest, wherein said RNA fragments are used as precursor RNA of claim 1.

12. The method of claim 11, wherein the agent capable of cleaving a bond is a nuclease, such as RNase I, RNase A, RNase T1 or MNase.

13. The method of any one of claims 9-12, wherein purifying the RNA-protein complex of step (d) is performed under stringent conditions comprising:

(i) washing the complexes with buffer at least 5 times;

(ii) boiling the complexes in a denaturing ionic detergent;

(iii) separating the complexes by SDS-PAGE;

(v) digesting said protein with a protease to liberate said fragments of RNA from said RNA- protein complexes.

14. The method of any one of claims 1-13, wherein the test sample is a biological sample.

15. The method of any one of claims 1-14, wherein the RNA hybridization domain comprises a heteronucleotide stretch.

16. The method of any one of claims 1-15, wherein any of the provided oligonucleotide adapters comprise one or more nucleotide analogs.

17. The method of any one of claims 1-16, wherein the template RNA or the RNA precursor is messenger RNA.

18. The method of any one of claims 1-17, wherein the RNA hybridization domain of each of the first strand cDNA synthesis primers consists of a random er.

19. The method of any one of claims 1-18, wherein the method further comprises pooling the plurality of first adapters ligated to the plurality of RNA precursors and/or the plurality of solid-phase first strand cDNA.

20. The method of any one of claims 1-19, wherein the test sample comprising a plurality of template RNA or precursor RNA is obtained from a single cell.

21. The method of any one of claims 1-20, wherein the method further comprises subjecting the sequencing library to a sequencing protocol.

22. The method of any one of claims 1-21, wherein the method further comprises quantitating one or more RNA species of the test sample.