WO2022101162A1

WO2022101162A1 - Paired end sequential sequencing based on rolling circle amplification

Info

Publication number: WO2022101162A1
Application number: PCT/EP2021/081027
Authority: WO
Inventors: Robert Pinard; Seiyu Hosono; Reto Muller
Original assignee: Miltenyi Biotec B.V. & Co. KG
Priority date: 2020-11-13
Filing date: 2021-11-09
Publication date: 2022-05-19

Abstract

The invention relates to a method for obtaining the sequence of both strands of a DNA nucleic acid library wherein the sense and anti-sense DNA single strands are fragmented and provided with sequencing regions, a barcode region and a universal identifier region which are then sequenced and wherein the sequence information of the fragments is merged into the final sequence by matching the sequence information of the barcode region BR and universal identifier region UMI.

Description

PAIRED END SEQUENTIAL SEQUENCING BASED ON ROLLING CIRCLE AMPLIFICATION

BACKGROUND

[0001] The present invention is directed to a process for DNA/RNA sequencing aided by hash-mapping to identify target DNA moieties.

[0002] Current DNA sequencing technology identifies genetic information obtained from polynucleotide DNA conjugated to adapters comprising a barcode and/or unique molecular identifier, and routinely produces hundreds of millions of short reads spanning tens to hundreds of base pairs.

[0003] Paired-end sequencing is defined as a process to sequence both ends of a DNA fragment and to generate more accurate sequencing data. Since paired-end reads are more likely to align to a reference, the quality of the entire data set improves.

[0004] The technique of paired-end sequencing is well known, for example from Edwards et al., Genomics, 6 593-608 and Roach et al., Genomics 1995; 26 345-353, and allows for the determination of two or more reads from two or more locations on a ribonucleic or deoxyribonucleic acid complex.

[0005] One major problem in all sequencing methods is the amount of genetic information to be analysed since DNA or RNA may contain millions of base pairs. Identification and indexing of genetic reads based on contiguous sequences or near-matches is therefore a common challenge within the fields of bioinformatics and next-generation sequencing (NGS).

[0006] The amount of information to be analysed is further increased by attempts to improve the quality of the genetic information collected. Since sequencing errors increase with the length of the DNA or RNA strands to be analysed, the quality of the genetic information can be improved by focusing on rather short-read sequencing methods. The error rate of next generation sequencing (NGS) is often a limiting factor for applications where detecting low-level base mutations is critical. Pairing of sequencing reads (paired-end) is a way to improve accuracy and sensitivity of assays. However, this approach further increases the amount of genetic information to be analysed and, inter alia, the processing time.

SUMMARY

[0007] It was found that paired-end sequencing methods for short-read sequences can be improved by gaining information from two templates that originate from the same DNA duplex. The additional information captured makes it possible to reduce sequencing errors in short reads, since one strand can be used to confirm the base determination of the other and to help the proper alignment of the reads onto a reference sequence.

[0008] Finally, by linking two reads that are separated by a known distance or that partly overlap, the overall average length of the generated reads is increased, which permits easier identification of important mutation types such as insertions, deletions, inversions, genomic rearrangements, repetitive sequence elements, gene fusions and novel transcripts.

[0009] It was therefore an object of the invention to provide a method for obtaining the sequence of both strands of a DNA nucleic acid library characterized by the steps

a. denaturing the target double stranded DNA nucleic acid library into a mixture of sense and anti-sense DNA single strands

b. providing the sense and anti-sense DNA single strands at the 3’ and 5’ ends with sequencing regions A1, A2 and A3, a barcode region BR and a universal identifier region UMI to obtain sense and anti-sense oligonucleotides having the general formula

(5’) A1-UMI-BR-A3 - sense DNA single strand-A2 (3’)

(3’) A1-UMI-BR-A3 - anti sense DNA single strand-A2 (5’)

wherein A1, A2 and A3 each comprise 5 to 50 nucleotides;

BR comprises 3 to 20 nucleotides;

UMI comprises 9 to 15 nucleotides

c. dividing the mixture of the sense and anti-sense DNA oligonucleotides into two fractions

d. providing each fraction with oligonucleotide guides comprising 5 to 50 nucleotides capable of binding to A1 and A2 of the same oligonucleotide

e. circularizing the sense and anti-sense DNA oligonucleotides by ligation with a DNA ligase into circular templates

f. multiplying the circular templates of each fraction into DNA concatemers, combining the fractions and localizing the DNA concatemers on a surface

g. determining the following sequences of nucleotides of the DNA concatemers

from A3 in direction to A2 as sequence A

from A2 in direction to A3 as sequence C

from A1 in direction to A3 as sequence B

from A3 in direction to A1 as sequence D

h. merging the sequences A and B to generate sequence AB and sequences C and D to generate sequence CD by colocalization using solid surface rolony coordinates

i. pairing the sequences AB and CD by matching the sequence information of the barcode region BR and universal identifier region UMI.

[0010] The present approach integrates the use of two pairs of sequencing primers for each of the strands of a given portion of a polynucleotide duplex of interest and allows the concomitant sequencing of the positive and negative strands.

[0011] The method of the invention allows the sequencing of a plurality of polynucleotide molecules where specific adapters are ligated to double-stranded DNA molecules. The double stranded polynucleotide molecules are denatured after adapter sequence ligation and circularized.

BRIEF DESCRIPTION OF THE DRAWINGS

[0012] Fig. 1 shows the process of the invention, in which targeted DNA libraries are used to generate sense and anti-sense circular templates for rolling circle amplification, producing DNA concatemers that form DNA nanoballs called rolonies. The generated rolonies are sequenced in segments capturing both the added unique identifier information for each strand and the target DNA of interest.

[0013] Fig. 2 shows the sequential sequencing events for both the sense and antisense DNA. Primers 1 and 2 are used in a first round of sequencing, generating sequencing reads A and C corresponding to the target DNA insert of the DNA library. The second round of sequencing uses primers 3 and 4 on the same immobilized rolonies to generate sequencing reads B and D corresponding to the identifier region containing a unique molecular identifier (UMI) and barcode.

[0014] Fig. 3 shows the results obtained when using a library of human DNA to generate paired-end sequencing reads using the invention described. The DNA reads were generated using a sequence-by-synthesis platform capable of sequencing rolonies immobilized on a solid surface and following the invention described in Fig. 1. The amount and percentage of unique paired reads and paired-groups (repeats of unique pairs due to PCR amplification of the DNA library) observed based on the number of sequencing reads analyzed: 15,612,769 reads (partial sequencing run analysis) for a set of pre-defined tiles on the flowcell is indicated.

DETAILED DESCRIPTION

[0015] The method of the invention may be used for a target double stranded DNA nucleic acid library with a length of 50 to 2000 nucleotides. The target double stranded DNA nucleic acid library may be used as is, i.e. as target double stranded DNA, or may be obtained by segmentation/fragmentation of a double stranded DNA.

[0016] In the invention, adapters are used which contain regions that allow for the circularization of the template DNA using a guide oligonucleotide ligation approach. In the following, such adapters are referred to as sequencing regions A1, A2 and A3.

[0017] The adaptors also include a barcode region BR and a universal identifier region UMI, so that the sense and the anti-sense strands can be uniquely identified as pairs, and a portion that allows the hybridization of primers, allowing the sequencing of the DNA nanoballs/rolonies in multiple sections and in more than one round of sequencing if required.

[0018] The circularized DNA templates generated from both sense and anti-sense strands are used in rolling circle amplification (RCA) to generate multiple copies of DNA that are used for sequencing. The thus obtained copies of DNA concatemers are hereinafter referred to as “rolonies” or “DNA nanoballs”.

[0019] The circularized single-stranded DNA template fragments from each strand are used to generate individual rolonies and therefore the positive and negative strands are located on different rolonies.

[0020] These rolonies are preferably attached randomly to a solid surface, for example via electrostatic charges (on surfaces like polyamines, silicon dioxide, titanium, hexamethyldisilazane or others) or via NHS ester-activated crosslinkers. Preferably, the first portion of each polynucleotide molecule that generated a rolony (sense strand) is attached to a first location of the surface and the second portion of each polynucleotide molecule that generated a rolony (anti-sense strand) is attached to a second location of the surface. Each of the rolonies which comprises either the first or the second portion of the target polynucleotide molecule (sense and anti-sense) is sequenced in two segments sequentially. The first segment reads the actual targeted DNA and the second segment reads the information contained in the adaptor portion, which contains the unique molecular identifier (UMI) and sample barcodes. These two sequences coming from the same rolony are linked together by co-localization and merged into one unique DNA read. The segment sequences coming from rolonies originating from the same polynucleotide sequence (positive and negative strand), but located randomly on the surface, are linked/paired by using the unique identifier contained in one of the adaptors.

Step a)

[0021] In step a), the target double stranded DNA nucleic acid library containing adaptor regions is denatured into a mixture of sense and anti-sense DNA single strands. In general, any double-stranded adapted DNA library containing a fragmented targeted DNA region to be sequenced can be used as starting material for the method of the invention.

[0022] In a first embodiment, the target double stranded DNA nucleic acid library is obtained by fragmentation of a target double stranded DNA.

Step b

[0023] In step b), the sense and anti-sense DNA single strands are provided at the 3’ and 5’ ends with sequencing regions A1, A2 and A3, a barcode region BR and a universal identifier region UMI to obtain sense and anti-sense oligonucleotides having the general formula

(3’) A1-UMI-BR-A3 - sense DNA single strand A2 (5’)

(5’) A1-UMI-BR-A3 - anti sense DNA single strand A2 (3’)

wherein A1, A2 and A3 each comprise 5 to 50 nucleotides;

BR comprises 3 to 20 nucleotides;

UMI comprises 9 to 15 nucleotides

[0024] The first of the two adaptors flanking the target insert DNA consists of a spacer region serving as the hybridization site for sequencing primers (A1), followed by a UMI region of n > 8 nucleotides and a barcode region of n > 3 nucleotides, followed by another spacer region serving as the hybridization site for a second set of sequencing primers (A3). The second adaptor contains a spacer region serving as the hybridization site for the third set of sequencing primers (A2) and completes the library construct.
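The layout of the construct can be illustrated with a minimal sketch that concatenates the regions in the order A1-UMI-BR-A3-insert-A2 and checks each region against the length ranges given in step b). The function name, the dictionary of ranges and all example sequences are hypothetical and serve only as illustration; they are not taken from the application.

```python
# Illustrative sketch only: names and example sequences are hypothetical.
# Length ranges (in nucleotides) follow the ranges stated for step b).
RANGES = {
    "A1": (5, 50),
    "A2": (5, 50),
    "A3": (5, 50),
    "BR": (3, 20),
    "UMI": (9, 15),
}


def build_construct(a1, umi, br, a3, insert, a2):
    """Concatenate the regions in the order A1-UMI-BR-A3-insert-A2."""
    parts = {"A1": a1, "UMI": umi, "BR": br, "A3": a3, "A2": a2}
    for name, seq in parts.items():
        low, high = RANGES[name]
        if not low <= len(seq) <= high:
            raise ValueError(f"{name} has {len(seq)} nt, expected {low} to {high}")
    return a1 + umi + br + a3 + insert + a2


construct = build_construct(
    a1="ACGTACGTAC",
    umi="GATTACAGA",
    br="TCGATCGAT",
    a3="TTGACCTGAA",
    insert="ATGCATGCATGC",
    a2="CCATGGTTAA",
)
print(len(construct))
```

A region outside its range, for example a UMI of only four nucleotides, raises a ValueError in this sketch instead of producing a construct.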

Step c

[0025] In step c), the mixture of the sense and anti-sense DNA oligonucleotides is divided into two fractions, i.e. the double-stranded adapted DNA library is distributed into 2 tubes in equal amounts, labeled sense and antisense.

Step d and e

[0026] First, the two mixtures of sense and anti-sense DNA oligonucleotides in the two fractions are provided with oligonucleotide guides comprising 5 to 50 nucleotides capable of binding to A1 and A2 of the same oligonucleotide. One fraction receives a guide oligonucleotide complementary to the sense strand of A1 and A2 and one fraction receives a guide oligonucleotide complementary to the anti-sense strand of A1 and A2.

Step f

[0027] In this step, the circular templates of each fraction (sense and anti-sense) are multiplied into a series of DNA concatemers using Rolling Circle Amplification (RCA).

[0028] To this end, the DNA is heat denatured at 95 °C and cold shocked at 4 °C to anneal the bridge oligonucleotide onto the denatured single stranded DNA library. The bridge oligos are complementary to each extremity of the adapter region (A1 and A2), bringing the 5’ and 3’ ends of the DNA library fragment in close proximity to one another.
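How a bridge oligonucleotide spans both extremities can be sketched as follows, assuming the single strand is written 5’ to 3’ as A1 ... A2, so that ligation joins the 3’ end of A2 to the 5’ end of A1. The function names, the arm length and the example sequences are hypothetical; the sketch shows only the complementarity logic, not an actual oligo design.

```python
# Illustrative sketch only: names, arm length and sequences are hypothetical.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def design_bridge(a1, a2, arm=10):
    """Return an oligo pairing with the last `arm` bases of A2 and the first `arm` bases of A1.

    Assumes the library strand is written 5'->3' as A1 ... A2, so the
    ligation junction reads (end of A2)(start of A1).
    """
    junction = a2[-arm:] + a1[:arm]  # 5'->3' sequence across the future ligation site
    return reverse_complement(junction)


bridge = design_bridge("ACGTACGTAC", "CCATGGTTAA", arm=4)
print(bridge)  # ACGTTTAA
```

Because the bridge is the reverse complement of the junction, annealing it holds the two ends of the same strand next to each other, which is what the ligase needs to close the circle.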

[0029] Then, the DNA library is circularized by ligation with a DNA ligase, such as T4 DNA ligase, into a circular template DNA library.

[0030] The circularization reaction is purified by treating the mixture with exonuclease I and III to eliminate the un-ligated non-circular DNA and excess bridge oligonucleotides.

[0031] Preferably, the purified single strand circular template is replicated by a polymerase capable of rolling circle amplification into a plurality of DNA concatemers forming a DNA nanoball or rolony. For this purpose, an oligonucleotide is used to prime the binding of the replicating enzyme and is hybridized to the same regions used for the hybridization of the sequencing oligonucleotides.

[0032] Equal amounts (1:1 ratio) of the sense and antisense RCA products (rolonies) are mixed and placed onto a modified, positively charged solid surface such as glass or a plastic equivalent (cyclo olefin polymer or others) containing polyamines, silicon dioxide, titanium, hexamethyldisilazane or others. The rolonies can interact with the surface via electrostatic charges or via NHS ester-activated crosslinkers.

Step g

[0033] In step g), the sequence information is obtained from the following nucleotides of the DNA concatemers

from A3 in direction to A2 as sequence A using primer 1

from A2 in direction to A3 as sequence C using primer 2

from A1 in direction to A3 as sequence B using primer 3

from A3 in direction to A1 as sequence D using primer 4

[0034] The first segment sequencing of the targeted DNA region (sequences A and C) of both sense and antisense rolonies is performed using two sets of sequencing primers (primers 1 and 2) complementary to A3 for the sense strand and A2 for the anti-sense strand, respectively, and flanking the insert regions.

[0035] The sequences A and C may each have a length of 50-2000 nucleotides whereas the sequences B and D may each have a length of 20 to 50 nucleotides.
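The four reads of step g) and their length ranges can be summarized in a small lookup table. This is a minimal sketch; the dictionary layout and the function name are hypothetical, while the primer numbers, directions and length ranges are those stated in the paragraphs above.

```python
# Illustrative sketch only: the data layout and function name are hypothetical.
READ_SPECS = {
    "A": {"primer": 1, "start": "A3", "toward": "A2", "length": (50, 2000)},
    "C": {"primer": 2, "start": "A2", "toward": "A3", "length": (50, 2000)},
    "B": {"primer": 3, "start": "A1", "toward": "A3", "length": (20, 50)},
    "D": {"primer": 4, "start": "A3", "toward": "A1", "length": (20, 50)},
}


def check_read_length(read_name, seq):
    """Return True if the read length lies within the range given for that read."""
    low, high = READ_SPECS[read_name]["length"]
    return low <= len(seq) <= high


print(check_read_length("A", "A" * 150))
print(check_read_length("B", "A" * 150))
```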

[0036] After termination of the first reaction using ddNTP and/or a denaturing agent like betaine, the second segment sequencing of the barcode (BC) and UMI portion of both sense and antisense region is performed using two new sets of sequencing primers (primers 3 and 4) complementary to A1 for the sense strand and A3 for the anti-sense strand, flanking the UMI/barcode region. The sequencing is performed using a massively parallel sequencing-by-synthesis approach with fluorescently-labeled nucleotides.

Step h and i

[0037] Each sequencing round generates two sets of reads (sense and antisense) for each rolony and four sequencing reads in total for each pair of rolonies (originating from the same double-stranded adapted DNA library portion). The thus obtained four sequence reads are then combined into the sequence of the target double stranded DNA nucleic acid library.

[0038] For this purpose, first the sequences A and B, which originate from the same rolonies and are therefore co-localized on the surface using the rolony coordinates, are combined to generate the sequencing read AB for the sense strand. The same applies to sequences C and D to generate the sequencing read CD for the anti-sense strand. The sequencing reads AB and CD contain the insert sequence and the barcode BR and UMI for the sense strand and the anti-sense strand, respectively.
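The co-localization step can be sketched as a join of the two sequencing rounds on the rolony coordinates. The record layout, the function name and the requirement of an exact coordinate match are assumptions made for this illustration; a real pipeline would likely tolerate small positional offsets between rounds.

```python
# Illustrative sketch only: record layout and exact-match rule are assumptions.
from typing import NamedTuple


class Read(NamedTuple):
    tile: int
    x: int
    y: int
    seq: str


def merge_by_coordinates(round1, round2):
    """Join reads from two rounds that share the same (tile, x, y) position."""
    second = {(r.tile, r.x, r.y): r.seq for r in round2}
    merged = {}
    for r in round1:
        key = (r.tile, r.x, r.y)
        if key in second:
            # (insert read, identifier read) for one rolony, e.g. A and B
            merged[key] = (r.seq, second[key])
    return merged


round1 = [Read(1, 10, 20, "ATGC"), Read(1, 30, 40, "GGCC")]
round2 = [Read(1, 10, 20, "GATTACAGATCG")]
print(merge_by_coordinates(round1, round2))
```

In this example only the rolony at tile 1, position (10, 20) has a read in both rounds, so it is the only one that yields a merged read.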

[0039] The sequences AB and CD are then paired using the sequence information of the UMIs to generate a consensus sequence of the target double stranded DNA nucleic acid library using information from both sense and anti-sense portions.

[0040] The pairing or matching of the sequences AB and CD may be performed by using the sequence information of the barcode BR and UMI with their barcode genetic sequence of consecutive nucleotide bases A, T, G, or C. Identical barcode genetic sequences are assigned a partition ID using hash-map functions to indicate a unique “key” element. Sorting of such UMIs in single-cell RNA sequencing experiments is for example described in “UMI-count modeling and differential expression analysis for single-cell RNA sequencing” by Chen et al. Genome Biology (2018). Further, the identification of barcodes for single cell genomics is described by Tambe et al. BMC Bioinformatics (2019) and an implementation of Hamming distance to sort similar dictionary entries is disclosed in “Perfect Hamming code with a hash table for faster genome mapping” by Takenaka et al. BMC Bioinformatics (2011).
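A minimal sketch of such a hash-map pairing is given below. It assumes that each merged read has already been reduced to an (identifier, insert) tuple, where the identifier is the concatenated BR and UMI sequence oriented so that both strands report the same string. The function names and the data layout are hypothetical, and the Hamming distance helper is included only to show how near-matching identifiers could be compared; it is not the method of the cited publications.

```python
# Illustrative sketch only: names and data layout are hypothetical.
from collections import defaultdict


def hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def pair_by_identifier(ab_reads, cd_reads):
    """Group AB and CD reads that carry the same identifier (BR + UMI).

    ab_reads / cd_reads: iterables of (identifier, insert_sequence).
    Returns {partition_id: {"AB": [...], "CD": [...]}} for identifiers
    observed on both strands.
    """
    partitions = {}  # identifier -> partition ID (the hash-map "key")
    groups = defaultdict(lambda: {"AB": [], "CD": []})
    for strand, reads in (("AB", ab_reads), ("CD", cd_reads)):
        for identifier, insert in reads:
            pid = partitions.setdefault(identifier, len(partitions))
            groups[pid][strand].append(insert)
    return {pid: g for pid, g in groups.items() if g["AB"] and g["CD"]}


ab = [("GATTACAGATCGATCGAT", "ATGC"), ("CCCCCCCCCTTTTTTTTT", "GGGG")]
cd = [("GATTACAGATCGATCGAT", "GCAT")]
print(pair_by_identifier(ab, cd))
```

Exact matching through a dictionary keeps the lookup cost per read roughly constant; identifiers that differ by a sequencing error would need an additional near-match step, for which a distance such as `hamming` is one option.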

EXAMPLES

[0041] A library of human DNA has been used for generating paired-end sequencing reads using the invention described. The DNA reads were generated using a sequence-by-synthesis platform capable of sequencing rolonies immobilized on a solid surface and following the invention described in Fig. 1.

[0042] An exemplary process according to the invention is shown in Fig. 1.

[0043] Library DNA from the targeted region consists of a targeted insert region depicted as a double strand region with a solid and a dotted line. The insert is flanked by a spacer region (A3), which is the position where sequencing primers 1 and 4 bind (Steps 8 and 9). Next come the 9 nucleotide UMI and the 9 nucleotide Barcode region. The A1 and A2 adapters are located at each extremity and complete the library construct.

[0044] Step 1: The double stranded library DNA is split into 2 tubes (sense and antisense) in equal amounts.

[0045] Step 2: The double stranded library DNA is mixed with the appropriate bridge oligonucleotide (anti-sense oligo for the sense library strand and sense oligo for the anti-sense library strand), heat denatured at 95 °C and cold shocked at 4 °C to anneal the bridge onto the denatured single stranded library DNA.

[0046] Step 3: The denatured single stranded library DNA is circularized by ligation with T4 DNA Ligase.

[0047] Step 4: The ligation reaction mix is treated with Exonuclease I and III to eliminate the un-ligated non circular DNA and bridge oligonucleotide.

[0048] Step 5: Circularized single stranded DNA is purified with magnetic beads.

[0049] Step 6: Rolling Circle Amplification (RCA) is performed with oligonucleotide primers designed from either the A1 or A2 adaptor region. An RCA primer complementary to the sense strand is used for the sense-strand circle and an RCA primer complementary to the anti-sense strand is used for the anti-sense-strand circle.

[0050] Step 7: The resulting RCA nanoball products are quantified by Qubit. The sense and the anti-sense RCA products are mixed in equal amounts and placed onto the flow cell for sequencing.

[0051] Step 8: 150-200 cycle 1st segment sequencing of the target insert region is performed with sequencing primers 1 and 2.

[0052] Step 9: 20 cycle of the 2nd segment sequencing of the UMI and Barcode region is performed with sequencing primers 3 and 4.

[0053] Step 10: Primary sequencing data analysis is performed to generate the DNA sequencing reads.

[0054] Step 11: Secondary sequencing data analysis is performed:

a. Combining the first and second sequencing reads originating from the same rolonies using the rolony coordinates on the flowcell (co-localization).

b. Pairing the reads from two different rolonies originating from the same double stranded DNA (plus and minus strand) using the sequence information of the identifier region (Barcode and UMI).

[0055] Step 12: Determining the amount of unique paired reads and paired-groups (repeats of unique pairs due to PCR amplification of the DNA library) observed for a set of pre-defined tiles on the flowcell.

[0056] Step 13: Establishing a consensus sequence of the double-strand DNA library using information from both sense and anti-sense strand DNA (paired reads).
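One simple way to picture the consensus step is a position-by-position comparison of the two strands. The sketch below assumes that the sense and anti-sense insert reads cover exactly the same span, that the anti-sense read is the reverse complement of the sense read, and that disagreeing positions are reported as N. These rules and the function names are assumptions made for illustration and are not specified in the description.

```python
# Illustrative sketch only: the agreement rule and names are assumptions.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def duplex_consensus(sense_read, antisense_read):
    """Keep a base where both strands agree, otherwise report N."""
    flipped = reverse_complement(antisense_read)
    if len(flipped) != len(sense_read):
        raise ValueError("sketch assumes both reads cover the same span")
    return "".join(s if s == a else "N" for s, a in zip(sense_read, flipped))


print(duplex_consensus("ATGCCA", "TGGCAT"))  # ATGCCA
print(duplex_consensus("ATGCCA", "TGGAAT"))  # ATNCCA
```

In the second call the two strands disagree at the third position, so that base is masked instead of being called from a single strand.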

EXAMPLES

[0057] The construct and the primers used in the experiment according to the invention are depicted in Fig 4.

[0058] A1-A2 sense bridge is used as a splint-bridge to circularize the positive (sense) construct as well as a primer to perform the rolling circle amplification reaction.

[0059] A1-A2 antisense bridge is used as a splint-bridge to circularize the negative (antisense) construct as well as a primer to perform the rolling circle amplification reaction.

[0060] Rolonies from (+) sense and (-) anti-sense circles are loaded onto the flowcell in a 1:1 equal ratio for a sequential paired-end sequencing.

[0061] 150 cycle 1st segment sequencing of the target insert region is performed with target insert sense-minus-2 primer and target insert antisense-minus-2 primer.

[0062] 20 cycle of the 2nd segment sequencing of the UMI and Barcode region is performed with UMI/BC-sense-0 primer and UMI/BC-antisense-0 primer.

[0063] Primary sequencing data analysis is performed to generate the DNA sequencing reads.

[0064] Secondary sequencing data analysis is performed

[0065] Combining the first and second sequencing reads originating from the same rolonies using the rolony coordinates on the flowcell (co-localization).

[0066] Pairing the reads from two different rolonies originating from the same double stranded DNA ((+) sense and (-) antisense strand) using the sequence information of the identifier region (Barcode and UMI).

[0067] Determining the amount of unique paired reads and paired-groups (repeats of unique pairs due to PCR amplification of the DNA library) observed for a set of pre-defined tiles on the flowcell.

[0068] Establishing a consensus sequence of the double-strand DNA library using information from both sense and anti-sense strand DNA (paired reads).

[0069] The sequencing results are shown in Fig 3 and Fig 5. Fig 3 shows the results obtained when using an E. coli shotgun DNA library to generate paired-end sequencing reads using the invention described. For demonstration, 22 tiles out of 759 total tiles were analyzed. The amount and percentage of unique paired reads and paired-groups (repeats of unique pairs due to PCR amplification of the DNA library) observed based on the number of sequencing reads analyzed:

[0070] 15,612,769 reads (partial sequencing run analysis) for a set of pre-defined tiles on the flowcell is indicated.

[0071] Number of unique pairs: 2,708,170

[0072] Number of copies: 9,915,542

[0073] Unique (+) strands identified: 1,256,524

[0074] Unique (-) strands identified: 2,304,544

[0075] Percent of paired reads: 35.69%

[0076] Percent of paired-groups: 63.51%

Claims

1. A method for obtaining the sequence of both strands of a DNA nucleic acid library characterized by the steps

a. denaturing the target double stranded DNA nucleic acid library into a mixture of sense and anti-sense DNA single strands

b. providing the sense and anti-sense DNA single strands at the 3’ and 5’ ends with sequencing regions A1, A2 and A3, a barcode region BR and a universal identifier region UMI to obtain sense and anti-sense oligonucleotides having the general formula

(5’) A1-UMI-BR-A3 - sense DNA single strand- A2 (3’)

(3’) A1-UMI-BR-A3 - anti sense DNA single strand- A2 (5’)

wherein A1, A2 and A3 each comprise 5 to 50 nucleotides;

BR comprises 3 to 20 nucleotides;

2. Method according to claim 1 characterized in that the target double stranded DNA nucleic acid library has a length of 50 to 2000 nucleotides.

3. Method according to claim 1 characterized in that the target double stranded DNA nucleic acid library is obtained by segmentation/fragmentation of a target double stranded DNA.

4. Method according to any of the claims 1 to 3 characterized in that the DNA concatemers are localized on a positively charged surface.

5. Method according to claim 4 characterized in that the DNA concatemers interact with the surface via electrostatic charges or via NHS ester-activated crosslinkers.

6. Method according to any of the claims 1 to 5 characterized in that the nucleotide sequences of the DNA concatemers are determined by sequencing by synthesis using fluorescently labeled oligonucleotides.

7. Method according to any of the claims 1 to 6 characterized in that the sequences A and C each have a length of 50-2000 nucleotides.

8. Method according to any of the claims 1 to 7 characterized in that the sequences B and D each have a length of 20 to 50 nucleotides.