WO2018174821A1

WO2018174821A1 - A sequencing method for detecting dna mutation

Info

Publication number: WO2018174821A1
Application number: PCT/SG2018/050124
Authority: WO
Inventors: Li-feng ZHANG; Ru HONG; Udita CHANDOLA
Original assignee: Nanyang Technological University
Priority date: 2017-03-20
Filing date: 2018-03-20
Publication date: 2018-09-27
Also published as: CN110392739B; CN110392739A

Abstract

A method for detecting a gene deletion in a host species, comprising: (a) amplifying a first DNA region surrounding the gene deletion with at least a pair of pre-PCR primers to form a pre-PCR product, wherein one of the pair of pre-PCR primers carries an adaptor sequence at 5'-end, wherein the adaptor sequence is not found in the host species' genome; (b) hybridizing the pre-PCR product to at least one circularizing probe, wherein the circularizing probe is designed to have a ligation arm and an extension arm targeting a strand complementary to the adapter sequence. In a preferred embodiment, said circularizing probe is a padlock probe. In another embodiment, the method comprises using a plurality of circularizing probes or kebab probes. Preferably, the method is used for detecting gene deletions due to mutations found in alpha-thalassemia or beta-thalassemia.

Description

A SEQUENCING METHOD FOR DETECTING DNA MUTATION

CROSS-REFERENCE TO RELATED APPLICATION

This application claims benefit of, and priority from, Singapore patent application No. 10201702238W, filed on 20 March 2017, the contents of which are hereby incorporated herein by reference.

TECHNICAL FIELD OF THE INVENTION

The present invention relates to sequencing methods for detecting DNA mutations and kits for carrying out the same, particularly relevant for large DNA deletions with unknown/variable boundaries.

BACKGROUND OF THE INVENTION

The following discussion of the background to the invention is intended to facilitate understanding of the present invention. However, it should be appreciated that the discussion is not an acknowledgment or admission that any of the material referred to was published, known or a part of the common general knowledge in any jurisdiction as at the priority date of the application.

Although deep sequencing technologies have made personal genome sequencing possible, the wide use of the technology on population-based carrier screens for genetic disorders is limited by the lack of a robust and cost-effective targeted sequencing method capable of detecting large DNA deletions.

First, it is important to have a suitable approach to concentrate the sequencing power onto a short list of target DNA regions (targeted sequencing). Without target enrichment, the vast majority of the sequencing power will be wasted on aimlessly sequencing the entire genome (3 billion base pairs). Padlock capture (Zhang, K. et al. Nat Methods 6, 613-618 (2009)) is an available targeted sequencing method. A padlock probe is a DNA oligo designed for a DNA target (Fig. 1A). Each padlock probe carries an extension arm and a ligation arm, which are designed specifically for the DNA target.

Similar to a pair of PCR (Polymerase Chain Reaction) primers, the two arms bind to the template DNA through complementary base pairing, but unlike a PCR primer pair, they bind to the same strand of a DNA molecule. After the probe binds to its DNA template, the 3'-end of the extension arm primes DNA chain elongation by a DNA polymerase. When the elongation reaction reaches the 5'-end of the ligation arm, the padlock can be "locked up" by ligases to form a single-stranded circular DNA molecule. The rest of the linear DNA molecules in the reaction can then be efficiently removed by exonucleases. The common linker sequence of each padlock probe allows a common PCR primer pair to amplify all the padlock capture products for deep sequencing. It has been shown that a padlock probe library containing tens of thousands of padlock probes worked efficiently (Zhang, K. et al. Nat Methods 6, 613-618 (2009)). Compared with other available methods for targeted sequencing, padlock capture is more suitable for population-based carrier screens because, once synthesized, the padlock library can be easily regenerated by PCR, whereas the microarrays or RNA baits used for target enrichment in other methods are expensive and non-reusable (Teer, J. K. et al. Genome research 20, 1420-1431 (2010)).

Second, the targeted sequencing method should be able to detect large DNA deletions with variable or unknown deletion boundaries, as these types of mutations are frequently seen in human genetic disorders. A well-known example is thalassemia, an inherited blood disorder caused by mutated genes encoding the hemoglobin α-chain (α-thalassemia / alpha-thalassemia) and β-chain (β-thalassemia / beta-thalassemia) (Weatherall, D. J. Nat Rev Genet 2, 245-255 (2001)). Hemoglobin defects cause red blood cell malfunctions and result in mild or severe anemia. However, the same defect also provides a degree of protection against malaria. The selective survival advantage of heterozygous carriers is believed to be responsible for perpetuating the mutations in human populations (Flint, J. et al. Nature 321, 744-750 (1986)). Thalassemia is one of the most common genetic disorders worldwide, posing an important public health problem in Southeast Asia, the Mediterranean region, the Middle East and sub-Saharan Africa (Weatherall, D. J. Nat Rev Genet 2, 245-255 (2001)). Approximately 18% of the population in Guangxi province (China) (Li, C. G. et al. Hemoglobin 33, 296-303 (2009)) and 3% of the Singaporean population (https://www.kkh.com.sg/HealthPedia/Pages/PregnancyPlanningForBabyThalassaemia.aspx) are carriers of thalassemia mutations. In contrast to the point mutations commonly seen in β-thalassemia (Harteveld, C. L. et al. J Med Genet 42, 922-931 (2005)), the common mutations found in α-thalassemia are a series of large DNA deletions (~3-40 kb) (Galanello, R. & Cao, A. Alpha-thalassemia. Genet Med 13, 83-88 (2011)). Although the carrier rate for thalassemia mutations is extraordinarily high, a population-based carrier screen is difficult to perform. The experimental techniques being used in clinical labs for detecting large DNA deletions in thalassemia (Galanello, R. & Cao, A. Alpha-thalassemia. Genet Med 13, 83-88 (2011)), such as gap-PCR, are low throughput (one test for one patient sample) and not comprehensive (one test for one specific mutation). These techniques are only used for patient DNA diagnosis and are unsuitable for population-based carrier screens. It is worth noting that alternative sequencing approaches, such as Nanopore sequencing (Branton, D. et al. Nature biotechnology 26, 1146-1153 (2008)) and paired-end long-insert Illumina sequencing (Liang, W. S. et al. Nucleic Acids Res 42, e8 (2014)), are capable of detecting large genomic DNA deletions. However, neither method is a targeted sequencing method. Both methods require a suitable target enrichment step if they are to be used for population-based mutation carrier screens. Moreover, neither method is suitable for the clinical detection of small DNA mutations. Illumina paired-end sequencing is not cost-efficient, as paired-end sequencing is not necessary for the detection of small DNA mutations. For Nanopore sequencing, its high sequencing error rate (Branton, D. et al. Nature biotechnology 26, 1146-1153 (2008)) makes it very difficult to apply the method for DNA mutation detection, especially for small DNA mutations.

The strength of padlock capture lies in detecting small DNA mutations such as SNPs (single-nucleotide polymorphisms). It is straightforward to design a padlock probe library targeting a panel of small DNA mutations. However, the panel cannot include thalassemia DNA deletions, which are among the most common mutations in human genetic disorders. The length of the DNA region captured by a padlock probe is restricted by the length limits of the synthesized padlock probes (Krishnakumar, S. et al. Proc Natl Acad Sci USA 105, 9296-9301 (2008)). For a large DNA deletion with variable or unknown deletion boundaries, it is difficult and unreliable to design a padlock probe to directly capture the junction region of the deletion. Any probes directed to the deleted region cannot help to distinguish heterozygous mutants from the wild type, which is the most important genotyping information for a population-based carrier screen. Taken together, the large genomic deletions observed in thalassemia represent a special type of mutation that is frequently observed in human genetic disorders but is difficult to detect using conventional sequencing approaches.

Thus, there exists a need to develop a method that at least alleviates some of the technical problems identified above.

SUMMARY OF THE INVENTION

According to an aspect of the invention there is provided a method for detecting a gene deletion in a host species, comprising: (a) amplifying a first DNA region surrounding the gene deletion with at least a pair of pre-PCR primers to form a pre-PCR product, wherein one of the pair of pre-PCR primers carries an adaptor sequence at 5'-end, wherein the adaptor sequence is not found in the host species' genome; (b) hybridizing the pre-PCR product to at least one circularizing probe, wherein the circularizing probe is designed to have a ligation arm and an extension arm targeting a strand complementary to the adapter sequence.

In accordance with another aspect of the invention, there is provided a kit for detecting a gene deletion in a host species, comprising: at least a pair of pre-PCR primers adapted for amplifying a first DNA region surrounding the gene deletion to form a pre-PCR product, wherein one of the pair of pre-PCR primers comprises an adaptor sequence at 5'-end, wherein the adaptor sequence is not found in the host species' genome; at least one circularizing probe adapted for hybridizing to the pre-PCR product, wherein the circularizing probe is designed to have a ligation arm and an extension arm targeting a strand complementary to the adapter sequence.

Other aspects of the invention will become apparent to those of ordinary skill in the art upon review of the following description of specific embodiments of the invention in conjunction with the accompanying figures.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention will now be described, by way of example only, with reference to the accompanying drawings, in which:

Figure 1. The experimental design of Cat-D. (A) A general method for padlock capture. Note: Solid and dashed lines indicate the sense and antisense strands of the DNA templates, respectively. (B) The designs of the "Cat-D" and "Kebab" padlock probes. (C) Cat-D and Kebab padlock probes are used together in a padlock probe library to make the genotype calls for large genomic DNA deletions.

Figure 2. Optimization of pre-PCR cycles and test run setup. (A) Gap PCR results for two large DNA deletions in α-thalassemia (--SEA and --FIL). (B) A gap PCR result from a patient sample (Coriell Biorepository GM10796) shows that the deletion boundary of --FIL differs among individual patient samples. The PCR amplicon size was estimated according to a previous publication¹⁴. The predicted PCR amplicon sizes are included in the primer names. (C) Padlock capture using Cat-D padlock probes successfully detected --FIL. A PCR primer pair was designed to specifically amplify the Cat-D padlock capture products of --FIL. The orientation of the PCR primers ensures that the primers only amplify the circular DNA template of the successful padlock capture. The arrowheads point to the padlock capture products with the expected sizes. The ~120 bp and ~240 bp bands correspond to PCR extension around the circular DNA templates (a unique feature of successful padlock capture) 1 and 2 times, respectively. Cat-D was successful with a minimum of 16 pre-PCR cycles. The wild type sample returned negative results even after a full 35-cycle pre-PCR. (D) Genomic DNA samples used in this study. Note: Uncropped images of the full-length gels shown in this figure are presented in Figure 10.

Figure 3. Genotype scores and genotype calls for α-thalassemia mutations. (A) Sequencing data headcounts. For each sample, the total count of all the mapped reads from all the Cat-D probes targeting --FIL is taken as the headcount for --FIL (Cat-D). The same analysis was performed to generate the headcounts for --SEA (Cat-D) and Kebab. The sequencing depth was normalized to 200 K reads per sample. (B) The mathematical method to calculate the genotype scores and to make the genotype calls on large DNA deletions detected by Cat-D probes and Kebab probes. (C) --FIL. (D) --SEA. (E) Kebab. Note: For genotype scores, the samples are labelled in light grey (wild type), dark grey (mutants) and grey (genotypes to be tested). For genotype calls, samples are labelled in dark grey (positive genotype calls) and grey (negative genotype calls).

Figure 4. Genotype scores and calls for the β-thalassemia point mutation. (A) Sequencing data headcounts. (B) The mathematical method to calculate the genotype scores and to make the genotype calls on SNPs and other small DNA mutations. (C) Allele frequencies of the padlock capture products. To determine the minor allele frequency used in the data analysis, we calculated the allele frequencies of all the nucleotide positions captured by one padlock probe. The first 20 nucleotides of each sequencing read belong to the ligation arm. The padlock captured region is located between the 21st nucleotide and the 67th nucleotide. For each nucleotide position, we calculated the allele frequency of A, T, C and G. Five percent was selected as the threshold for the minor allele frequency in the data analysis. The position of the β-thalassemia point mutation, codon 17 (A > T), is marked by the dashed circle. (D) Genotype scores.

Figure 5 shows --FIL and --SEA, two α-thalassemia deletions mainly seen in Southeast Asia.

Figure 6 shows the correlation coefficient between padlock capture duplicates of 8 DNA samples. The sequencing depth was normalized to 200K reads per sample. The sequence read counts of each padlock probe in the experimental duplicate are plotted along the x and y axes.

Figure 7 shows gap PCR to detect --FIL and --SEA. (A) Each PCR reaction, containing 100 ng genomic DNA, was carried out in 35 cycles. The arrowheads indicate PCR products with the expected size for --FIL (~3 kb) and for --SEA (~900 bp). (B) Gap PCR was repeated on G304A.Lot1 and G304A.Lot2. Each PCR reaction, containing 200 ng genomic DNA, was carried out in 38 cycles. A clear PCR product of --SEA was detected in G304A.Lot2. This result confirms the genotyping result of Cat-D and shows that Cat-D is more sensitive than gap PCR. Full-length gels shown in this figure are presented in Figure 10.

Figure 8 shows genotype scores of β-thalassemia mutations.

Figure 9 shows genotype calls of β-thalassemia mutations. Samples are labelled in light grey (wild type) and grey (genotypes to be tested). Since all samples are negative for all the β-thalassemia mutations included in the figure, sample identities are not provided.

Figure 10 shows uncropped gel pictures of all the gels in the description.

Other designs/arrangements of the invention are possible and, consequently, the accompanying drawings are not to be understood as superseding the generality of the preceding description of the invention.

DETAILED DESCRIPTION

Particular embodiments of the present invention will now be described with reference to the accompanying drawings. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the scope of the present invention. Additionally, unless defined otherwise, all technical and scientific terms used herein have the same meanings as commonly understood by one of ordinary skill in the art to which the present invention belongs. Where possible, the same reference numerals are used throughout the figures for clarity and consistency. Throughout this document, unless otherwise indicated to the contrary, the terms "comprising", "consisting of", and the like, are to be construed as non-exhaustive, or in other words, as meaning "including, but not limited to".

Throughout the specification, unless the context requires otherwise, the word "comprise" or variations such as "comprises" or "comprising", will be understood to imply the inclusion of a stated integer or group of integers but not the exclusion of any other integer or group of integers.

Throughout the specification, unless the context requires otherwise, the word "include" or variations such as "includes" or "including", will be understood to imply the inclusion of a stated integer or group of integers but not the exclusion of any other integer or group of integers.

Method

According to an aspect of the invention there is provided a method for detecting a gene deletion in a host species, comprising: (a) amplifying a first DNA region surrounding the gene deletion with at least a pair of pre-PCR primers to form a pre-PCR product, wherein one of the pair of pre-PCR primers carries an adaptor sequence at 5'-end, wherein the adaptor sequence is not found in the host species' genome; (b) hybridizing the pre-PCR product to at least one circularizing probe, wherein the circularizing probe is designed to have a ligation arm and an extension arm targeting a strand complementary to the adapter sequence.

In some embodiments, the method is particularly relevant for detecting a large gene deletion. Using this method, the first DNA region, which surrounds the large gene deletion, will be amplified as long as the large DNA deletion is present in at least one of the homologous chromosomes. If there is no large DNA deletion (e.g., in wildtype), the two pre-PCR primers are located too far apart, due to the presence of the large gene sequence, for the region between them to be amplified using conventional PCR. Accordingly, no pre-PCR product of the first DNA region will be generated. In contrast, if a large DNA deletion is present in at least one of the homologous chromosomes (i.e., applicable to both homozygous and heterozygous mutations), the pre-PCR primers are now located close enough to each other to amplify the first DNA region, and pre-PCR products of the first DNA region will be generated. Thus, the present invention is able to distinguish between wildtype and mutation (e.g., both homozygous and heterozygous mutations) based on a "positive read-out" (i.e., a positive reading from the method, meaning that the first DNA region is amplified, indicates that a large DNA deletion is present). However, with just (a) and (b), the present invention will not be able to distinguish between homozygous and heterozygous mutations, because as long as one of the homologous chromosomes carries the gene deletion, pre-PCR product of the first DNA region will still be generated.

As used herein, the term "gene deletion" refers to a loss of DNA sequence in both complementary strands from a gene when compared to a wildtype gene indicative of a healthy condition. The loss of DNA sequence shall be construed to include both (i) the loss of the entire DNA sequence from the gene (i.e., the entire gene is deleted from the chromosome) and (ii) the loss of a portion of the DNA sequence from the gene. Examples of "gene deletion" include, but are not limited to, large DNA deletions with variable or unknown deletion boundaries. The term "large DNA deletion" refers to deletions of large chromosomal regions, leading to loss of optimal gene functions within those regions. For example, mutations commonly found in alpha-thalassemia are a series of large DNA deletions, having sizes ranging from 3 to 40 kb (kilo base pairs). The term "small DNA mutations" refers to DNA mutations such as substitution mutations and point mutations (e.g., silent mutations, missense mutations, nonsense mutations, insertions and deletions).

The term "host species" refers to an organism that carries the gene deletion. Examples of the "host species" include, but not limited to, animals, plants, bacteria, fungi, or viruses. In various embodiments the animals are vertebrates, preferably mammals such as humans, horses, cows, mice, rats or rabbits. In various embodiments the host species is human.

The phrase "a first DNA region surrounding the gene deletion" shall be construed to include, but not be limited to, DNA sequences located in proximity to the gene in the wildtypes (but the gene is deleted in DNA mutations).

As used herein, the term "pre-PCR" refers to a PCR reaction specifically adapted for amplifying the first DNA region surrounding the DNA deletion (i.e., the mutant allele carrying the DNA deletion is amplified). The purpose of the pre-PCR is to form a pre-PCR product (i.e., DNA sequences surrounding the gene deletion) that is subjected to padlock capture as downstream assay. Thus, the pre-PCR does not have to be completed with full PCR cycles (e.g., 30 cycles) - instead, fewer than 30 cycles or fewer than 25 cycles, or fewer than 20 cycles, or fewer than 18 cycles or fewer than 17 cycles or fewer than 16 cycles might be sufficient for the pre-PCR step. Furthermore, a "pair of pre-PCR primers" (e.g., a reverse primer and a forward primer) flanking the first DNA region is required for conducting the pre-PCR.
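By way of illustration only, the positive read-out of the pre-PCR may be sketched as follows. The coordinates, the deletion size and the `MAX_AMPLICON_BP` limit are hypothetical values chosen for the example and are not taken from the specification; the sketch merely shows why an allele carrying the deletion yields a pre-PCR product whereas a wildtype allele does not.

```python
# Hypothetical illustration: coordinates and the size limit are assumptions,
# not values taken from the specification.
MAX_AMPLICON_BP = 5_000  # assumed upper limit for a conventional PCR amplicon


def amplicon_length(fwd_pos, rev_pos, deletion=None):
    """Distance between the two pre-PCR primers on one allele.

    deletion is a (start, end) pair lying between the primers,
    or None for a wildtype allele.
    """
    length = rev_pos - fwd_pos
    if deletion is not None:
        start, end = deletion
        length -= end - start  # the deleted stretch brings the primers closer
    return length


def pre_pcr_positive(fwd_pos, rev_pos, alleles):
    """True if at least one allele gives an amplifiable pre-PCR product."""
    return any(
        amplicon_length(fwd_pos, rev_pos, d) <= MAX_AMPLICON_BP for d in alleles
    )


fwd, rev = 1_000, 22_000      # primers flanking a 20 kb deletion
deletion = (1_500, 21_500)

wild_type = [None, None]
heterozygous = [deletion, None]
homozygous = [deletion, deletion]

print(pre_pcr_positive(fwd, rev, wild_type))     # False
print(pre_pcr_positive(fwd, rev, heterozygous))  # True
print(pre_pcr_positive(fwd, rev, homozygous))    # True
```

In this sketch the heterozygous and homozygous cases give the same answer, which mirrors the limitation that the pre-PCR alone does not separate the two.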

As used herein, the term "adapter sequence" is a DNA sequence located at the 5'-end of a pre-PCR primer. The adapter sequence should not be found in the genome of the host species - in other words, if generation of the complementary sequence of the adapter sequence is observed, it is confirmed that the method (e.g., the PCR reactions) is successful (i.e., not due to the "noise" amplification of the host species' own DNA sequence). In various embodiments the adapter sequence is a specifically designed artificial sequence. In some embodiments of the method, the adapter sequence is at least 20 nucleotides in length. In some embodiments of the method, the adapter sequence comprises a nucleotide sequence represented by any one of SEQ ID NOs: 1 to 7.
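As a minimal sketch of the requirement that the adapter sequence is not found in the host genome, a candidate adapter may be screened against both strands of a reference sequence. The function names, the toy genome and the candidate adapter below are hypothetical, and exact string matching is used only for simplicity; a practical screen would more likely rely on a sequence alignment tool that tolerates mismatches.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    comp = {"A": "T", "T": "A", "C": "G", "G": "C", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def adapter_is_foreign(adapter, genome, min_length=20):
    """True if the adapter is long enough and absent from both genome strands.

    Exact matching only; this is an assumption made for the sketch.
    """
    adapter = adapter.upper()
    genome = genome.upper()
    if len(adapter) < min_length:
        return False
    return adapter not in genome and reverse_complement(adapter) not in genome


toy_genome = "ACGT" * 10                    # hypothetical stand-in for a genome
candidate = "GATTACAGATTACAGATTAC"          # hypothetical 20-nt adapter

print(adapter_is_foreign(candidate, toy_genome))
# A genome carrying the adapter on the opposite strand fails the check.
contaminated = toy_genome + reverse_complement(candidate) + toy_genome
print(adapter_is_foreign(candidate, contaminated))
```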

As used herein, the term "hybridizing" refers to formation of a hybrid nucleic acid through base-pairing between complementary or at least partially complementary nucleotide sequences under defined conditions (e.g., PCR).

As used herein, the term "circularizing probe" refers to or includes a probe sequence complementary to a target sequence (comprising a ligation arm and an extension arm), and the probe sequence is adapted to hybridize to and capture the target sequence. Once the probe sequence hybridizes to the target sequence, the probe sequence circularizes. In other words, the circularizing probe is capable of being transformed into a circular shape, when it binds to the target sequence. Prior to hybridization, the circularizing probe might exist in a linear configuration. Examples of circularizing probes suitable for the present method include, but are not limited to, padlock probe, molecular inversion probe, and connector inversion probe. In some embodiments of the method, the at least one circularizing probe comprises a nucleotide sequence represented by any one of SEQ ID Nos: 8 to 17.

As used herein, the term "ligation arm" refers to a first group of nucleic acid base pairs located at 5'-end of the circularizing probe, and the term "extension arm" refers to a second group of nucleic acid base pairs located at 3'-end of the circularizing probe. Both the ligation arm and the extension arm bind to the same strand of the target sequence. In some embodiments of the method, the ligation arm and/or the extension arm is at least 20 nucleotides in length. In some embodiments of the method, the Tm (primer melting temperature) of the ligation arm and/or the extension arm is close to 55°C. As used herein, the term "primer melting temperature" has the same meaning as that known in the art, wherein the melting temperature (Tm) is defined as the temperature at which half of the DNA strands are in the random coil or single stranded state. The phrase "close to 55°C" shall be construed to cover a temperature range from 50°C to 60°C (i.e., 55°C ± 5°C); and the temperature range from 50°C to 60°C includes 50°C and 60°C, and may include 51°C, 52°C, 53°C, 54°C, 55°C, 56°C, 57°C, 58°C and 59°C. In some embodiments of the method, the ligation arm may be designed to specifically hybridize to a second DNA region adjacent to the pre-PCR primer, in order to avoid non-specific primer binding in the PCR reaction. The term "a second DNA region" refers to a DNA sequence in the pre-PCR product, wherein the DNA sequence is located, for example, immediately downstream of the pre-PCR primer sequence.
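The length and Tm criteria for an arm may be illustrated with the following sketch. The specification does not state how Tm is calculated; the commonly used GC-content approximation Tm = 64.9 + 41 × (G + C − 16.4) / N is assumed here purely for illustration, and the function names and the example arm sequence are hypothetical.

```python
def basic_tm(seq):
    """Approximate Tm in degrees C from GC content.

    Uses Tm = 64.9 + 41 * (G + C - 16.4) / N, an assumed rule of thumb
    for oligos longer than about 13 nt, not a method from the specification.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    return 64.9 + 41.0 * (gc - 16.4) / len(seq)


def arm_meets_criteria(seq, min_length=20, tm_low=50.0, tm_high=60.0):
    """True if the arm is at least min_length nt and its Tm is within
    the inclusive range 50 to 60 degrees C ("close to 55 degrees C")."""
    return len(seq) >= min_length and tm_low <= basic_tm(seq) <= tm_high


arm = "ATGCATGCATGCATGCATGC"  # hypothetical 20-nt arm with 50% GC
print(round(basic_tm(arm), 2), arm_meets_criteria(arm))
```

Under this approximation the example arm has a Tm of about 51.8°C and therefore falls within the stated range; a nearest-neighbour calculation could give a different value.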

In some embodiments the method further comprises (c) hybridizing a first plurality of additional circularizing probes to the first DNA region. In the context of the method, the term "first plurality of additional circularizing probes" refers to a series of circularizing probes adapted for covering the first DNA region wherein a gene deletion may be present. As long as one of the homologous chromosomes still carries the gene in the first DNA region, the first plurality of additional circularizing probes is able to detect and amplify the gene. In other words, if the first plurality of additional circularizing probes is not able to detect the gene, one may then conclude that a homozygous mutation has occurred in the first DNA region (i.e., the gene is missing on both homologous chromosomes). However, it is not possible to use the first plurality of additional circularizing probes to distinguish between heterozygous mutation and wildtype.

In some embodiments of the method, the first plurality of additional circularizing probes are a series of padlock probes designed to cover the first DNA region surrounding the gene (Fig. 1B, the "Kebab" design). One can imagine that these padlock probes bind to the template DNA and form a "Kebab" shape. Therefore, these padlock probes are named "Kebab probes" in the context of the present method. Kebab probes return negative results from homozygous mutants - i.e., no gene amplification detected from the PCR reactions of (c) since the gene is missing on both homologous chromosomes of the first DNA region. In some embodiments of the method, the first plurality of additional circularizing probes comprise a nucleotide sequence represented by any one of SEQ ID NOs: 27 to 43.

In some embodiments the method further comprises comparing a first result of gene deletion detection obtained from (a) and (b), and a second result of gene deletion detection obtained from (c) to determine a genotype of the gene deletion of the host species. As discussed above, with the result obtained from (a) and (b) ("the first result"), one would be able to distinguish between mutations (both homozygous and heterozygous) and wildtype, but not between homozygous and heterozygous mutations. Comparing the first result with the additional result obtained from (c) ("the second result"), one would be able to make a genotype call on the gene deletion:

1. If both the first result (i.e., there is mutation) and the second result are positive (i.e., there is no homozygous deletion), then one may make a genotype call that there is heterozygous deletion;

2. If the first result is positive (i.e., there is mutation), but the second result is negative (i.e., there is homozygous deletion), then one may make a genotype call that there is homozygous deletion;

3. If the first result is negative (i.e., there is no mutation), but the second result is positive (i.e., there is no homozygous deletion), then one may make a genotype call that there is no DNA deletion.
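The three genotype calls above may be expressed as a simple decision function. The following is a sketch only; the function and argument names are hypothetical, and the "inconclusive" return value for the case where both results are negative is an assumption, since that combination is not addressed in the description.

```python
def genotype_call(first_result_positive, second_result_positive):
    """Combine the first result, from (a) and (b), with the second result,
    from (c), into a genotype call for the gene deletion."""
    if first_result_positive and second_result_positive:
        return "heterozygous deletion"
    if first_result_positive and not second_result_positive:
        return "homozygous deletion"
    if not first_result_positive and second_result_positive:
        return "no deletion"
    # Both negative: not covered by the description; treated here as a
    # failed or uninformative assay (assumption).
    return "inconclusive"


for first, second in [(True, True), (True, False), (False, True), (False, False)]:
    print(first, second, "->", genotype_call(first, second))
```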

In various embodiments the method is particularly relevant for detecting large DNA deletions found in alpha-thalassemia, as large DNA deletions (about 3 to 40 kb) with variable or unknown deletion boundaries are frequently seen in alpha-thalassemia.

In order to comprehensively assess both large and small DNA mutations, in some embodiments of the method, the method may further comprise (d) hybridizing a second plurality of additional circularizing probes to target one or more small-scale DNA mutations (e.g., single-nucleotide polymorphism (SNP)) commonly seen in beta-thalassemia. In some embodiments of the method, the second plurality of additional circularizing probes are padlock probes. In some embodiments of the method, the second plurality of additional circularizing probes comprises a nucleotide sequence represented by any one of SEQ ID NOs: 18 to 26.

Kit

In accordance with another aspect of the invention, there is provided a kit for detecting a gene deletion in a host species, comprising: at least a pair of pre-PCR primers adapted for amplifying a first DNA region surrounding the gene deletion to form a pre-PCR product, wherein one of the pair of pre-PCR primers comprises an adaptor sequence at 5'-end, wherein the adaptor sequence is not found in the host species' genome; at least one circularizing probe adapted for hybridizing to the pre-PCR product, wherein the circularizing probe is designed to have a ligation arm and an extension arm targeting a strand complementary to the adapter sequence.

In some embodiments the kit is particularly relevant for detecting large DNA deletions, as the first DNA region, which surrounds the large DNA deletions, will be amplified as long as the large DNA deletion is present in at least one of the homologous chromosomes. If there is no large DNA deletion (e.g., in wildtype), the two pre-PCR primers are located too far apart, due to the presence of the large gene sequence, for the region between them to be amplified using conventional PCR. Accordingly, no pre-PCR product of the first DNA region will be generated. In contrast, if a large DNA deletion is present in at least one of the homologous chromosomes (i.e., applicable to both homozygous and heterozygous mutations), the two pre-PCR primers are now located close enough to each other to amplify the first DNA region, and pre-PCR products of the first DNA region will be generated. Thus, the kit is able to distinguish between wildtype and mutation (e.g., homozygous and heterozygous mutations) based on a "positive read-out" (i.e., a positive reading from the kit, meaning that the first DNA region is amplified, indicates that a large DNA deletion is present). However, with just the pre-PCR primers and the at least one circularizing probe, the kit will not be able to distinguish between homozygous and heterozygous mutations, because as long as one of the homologous chromosomes carries the gene deletion, pre-PCR product of the first DNA region will still be generated by the kit.

As used in the context of the kit, the term "gene deletion" refers to a loss of DNA sequence in both complementary strands from a gene when compared to a wildtype gene indicative of a healthy condition. The loss of DNA sequence shall be construed to include both (i) the loss of the entire DNA sequence from the gene (i.e., the entire gene is deleted from the chromosome) and (ii) the loss of a portion of the DNA sequence from the gene. Examples of "gene deletion" include, but are not limited to, large DNA deletions with variable or unknown deletion boundaries. The term "large DNA deletion" refers to deletions of large chromosomal regions, leading to loss of optimal gene functions within those regions. For example, mutations commonly found in alpha-thalassemia are a series of large DNA deletions, having sizes ranging from 3 to 40 kb (kilo base pairs). The term "small DNA mutations" refers to DNA mutations such as substitution mutations and point mutations (e.g., silent mutations, missense mutations, nonsense mutations, insertions and deletions).

In the context of the kit, the term "host species" refers to an organism that carries the gene deletion. Examples of the "host species" include, but not limited to, animals, plants, bacteria, fungi, or viruses. In various embodiments the animals are vertebrates, preferably mammals such as humans, horses, cows, mice, rats or rabbits. In various embodiments the host species is human.

The phrase "a first DNA region surrounding the gene deletion" shall be construed to include, but not limited to, DNA sequences located in proximity to the gene in the wildtypes (but the gene is deleted in DNA mutations).

In the context of the kit, the term "pre-PCR" refers to a PCR reaction specifically adapted for amplifying the first DNA region surrounding the DNA deletion (i.e., the mutant allele carrying the DNA deletion is amplified). The purpose of the pre-PCR is to form a pre-PCR product (i.e., DNA sequences surrounding the gene deletion) that is subjected to padlock capture as a downstream assay. Thus, when utilizing the kit, the pre-PCR reaction does not have to be completed with full PCR cycles (e.g., 30 cycles). Instead, fewer than 30 cycles, or fewer than 25 cycles, or fewer than 20 cycles, or fewer than 18 cycles, or fewer than 17 cycles, or fewer than 16 cycles might be sufficient for the pre-PCR step. Furthermore, a "pair of pre-PCR primers" (e.g., a reverse primer and a forward primer) flanking the first DNA region is required for conducting the pre-PCR reaction.

In the context of the kit, the term "adapter sequence" is a DNA sequence located at the 5'-end of a pre-PCR primer. The adapter sequence should not be found in the genome of the host species. In other words, if amplification of the adapter sequence is observed, it is confirmed that the kit has worked (i.e., the signal is not due to the "noise" amplification of the host species' own DNA sequence). In the context of the kit, the term "hybridizing" refers to formation of a hybrid nucleic acid through base-pairing between complementary or at least partially complementary nucleotide sequences under defined conditions (e.g., PCR).

In the context of the kit, the term "circularizing probe" refers to or includes a probe sequence complementary to a target sequence (comprising a ligation arm and an extension arm), and the probe sequence is adapted to hybridize to and capture the target sequence. Once the probe sequence hybridizes to the target sequence, the probe sequence circularizes. Examples of circularizing probes suitable for the present invention include, but are not limited to, padlock probe, molecular inversion probe, and connector inversion probe. In the context of the kit, the term "ligation arm" refers to a first DNA sequence located at 5'-end of the circularizing probe, and the term "extension arm" refers to a second DNA sequence located at 3'-end of the circularizing probe. Both the ligation arm and the extension arm bind to the same strand of the target sequence. In various embodiments the at least one circularizing probe is a padlock probe. In various embodiments the at least one circularizing probe comprises a nucleotide sequence represented by any one of SEQ ID NOs: 8 to 17. In various embodiments the adapter sequence is designed to be at least 20 nucleotides in length. In various embodiments the adapter sequence comprises a nucleotide (nt) sequence represented by any one of SEQ ID NOs: 1 to 7. In some embodiments of the kit, the ligation arm may be designed to specifically hybridize to a second DNA region adjacent to the pre-PCR primer, in order to avoid non-specific primer binding in the PCR reaction. The term "a second DNA region" refers to a DNA sequence in the pre-PCR product, wherein the DNA sequence is located, for example, immediately downstream of the pre-PCR primer sequence. In various embodiments the ligation arm and/or the extension arm is at least 20 nucleotides in length. In various embodiments the primer melting temperature (Tm) of the ligation arm and/or the primer melting temperature (Tm) of the extension arm is close to 55 °C. The phrase "close to 55 °C" shall be construed to cover a temperature range from 50 °C to 60 °C (i.e., 55 °C ± 5 °C); the temperature range from 50 °C to 60 °C includes 50 °C and 60 °C, and may include 51 °C, 52 °C, 53 °C, 54 °C, 55 °C, 56 °C, 57 °C, 58 °C and 59 °C.
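The arm length and Tm criteria above can be checked with a short sketch. The specification does not state how Tm is calculated, so the composition-based formula in `basic_tm` is an assumption used purely for illustration; a nearest-neighbour model would give different values.

```python
def basic_tm(seq: str) -> float:
    """Approximate Tm in deg C from base composition (assumed, simplified formula)."""
    s = seq.upper()
    gc = s.count("G") + s.count("C")
    return 64.9 + 41.0 * (gc - 16.4) / len(s)


def close_to_55(tm: float) -> bool:
    """'Close to 55 deg C' as construed in the text: 50 to 60 deg C inclusive."""
    return 50.0 <= tm <= 60.0


def arm_meets_criteria(arm: str, min_len: int = 20) -> bool:
    """Arm is at least min_len nucleotides long and its estimated Tm is close to 55 deg C."""
    return len(arm) >= min_len and close_to_55(basic_tm(arm))


arm = "GGAAGGGAGTGCCTTGGCCT"  # first segment of the probe listed as SEQ ID NO.27
print(len(arm), round(basic_tm(arm), 1), arm_meets_criteria(arm))
```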

In some embodiments of the kit, the kit may further comprise a first plurality of additional circularizing probes adapted for hybridizing to the first DNA region. In the context of the kit, the term "first plurality of additional circularizing probes" refers to a series of circularizing probes adapted for covering the first DNA region where there may be a gene deletion. As long as one of the homologous chromosomes still carries the gene (i.e., no gene deletion) (e.g., wildtypes and heterozygous mutants), the first plurality of additional circularizing probes is able to detect and amplify the gene. In other words, if the first plurality of additional circularizing probes is not able to detect the gene, one may then conclude that a homozygous mutation has occurred in the first DNA region (i.e., the gene has been deleted in both homologous chromosomes). However, it is not possible to use the first plurality of additional circularizing probes to distinguish between heterozygous mutation and wildtype.

In various embodiments the kit further comprises a first plurality of additional circularizing probes adapted for hybridizing to the first DNA region. In various embodiments the first plurality of additional circularizing probes are kebab probes. In various embodiments the first plurality of additional circularizing probes comprise a nucleotide sequence represented by any one of SEQ ID NOs: 27 to 43. In some embodiments of the kit, the first plurality of additional circularizing probes are a series of padlock probes designed to cover the deleted region (Fig. 1B, the "Kebab" design). One can imagine that these padlock probes bind to the template DNA and form a "Kebab" shape. Therefore, these padlock probes are named "Kebab probes" in the context of the present kit. As discussed above, Kebab probes return negative results from homozygous mutants, i.e., no gene amplification is detected by the kit since the gene is missing on both strands of the first DNA region.

As discussed above, with just the pre-PCR primers and the at least one circularizing probe, the kit is only able to distinguish between mutations (both homozygous and heterozygous) and wildtype, but not between homozygous and heterozygous mutations (i.e., "the first result"). However, by incorporating the first plurality of additional circularizing probes (e.g., kebab probes), the kit is then able to distinguish between homozygous mutation and other genotypes (i.e., "the second result"). Taken together, one may rely on the kit to make a genotype call on the gene deletion:

1. If both the first result (i.e., there is mutation) and the second result (i.e., there is no homozygous deletion) are positive, then one may make a genotype call that there is heterozygous deletion;

2. If the first result is positive (i.e., there is mutation), but the second result is negative (i.e., there is homozygous deletion), then one may make a genotype call that there is homozygous deletion.
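The two rules above can be written as a small decision function. This is a sketch only: `kebab_positive` here means that Kebab capture products were detected (the second result is positive, so there is no homozygous deletion). The "wildtype" branch follows from the first result being negative, while the "no call" branch for a sample with no signal from either probe set is an added assumption not stated in the text.

```python
def call_deletion_genotype(cat_d_positive: bool, kebab_positive: bool) -> str:
    """Combine the first result (deletion product captured) with the second
    result (Kebab capture products detected) into a genotype call."""
    if cat_d_positive and kebab_positive:
        return "heterozygous deletion"  # rule 1
    if cat_d_positive and not kebab_positive:
        return "homozygous deletion"    # rule 2
    if kebab_positive:
        return "wildtype"               # first result negative: no deletion detected
    return "no call"                    # assumption: neither probe set gave a signal


for first, second in [(True, True), (True, False), (False, True), (False, False)]:
    print(first, second, call_deletion_genotype(first, second))
```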

In order to comprehensively assess both large and small DNA mutations, in some embodiments of the kit, the kit may further comprise a second plurality of additional circularizing probes (e.g., padlock probes) targeting one or more small-scale DNA mutations (e.g., single-nucleotide polymorphism (SNP)) commonly seen in beta-thalassemia. In various embodiments the kit further comprises a second plurality of additional circularizing probes targeting one or more small-scale DNA mutations. In various embodiments the second plurality of additional circularizing probes are padlock probes. In various embodiments the second plurality of additional circularizing probes comprises a nucleotide sequence represented by any one of SEQ ID NOs: 18 to 26.

With respect to some embodiments of the invention, the invention method and kit are described below with further details:

Results

Experimental design of Cat-D. In some embodiments of the invention, we developed a method of using padlock probes to positively "catch a large deletion" (Fig. 1B, the "Cat-D" design). The method does not rely on a negative readout to "detect" the deletion. It also does not rely on using sequencing data to reveal the "gene copy number variation". In Cat-D, the first step is a PCR reaction (Fig. 1B, pre-PCR). A pair of PCR primers is designed to amplify the DNA region surrounding the deletion. Because of the flexible PCR amplicon length, designing the PCR primers does not depend on knowing the exact deletion boundaries. Only the mutant allele carrying the large DNA deletion can be amplified. The wild type allele is not PCR-amplified because the deleted region is too large to allow the primer pair to work along the wild type allele. The basic concept of the pre-PCR in Cat-D is the same as a commonly used technique called gap PCR. In contrast to gap PCR, one of the two pre-PCR primers in Cat-D carries an adaptor sequence on its 5'-end (Fig. 1B, labelled in light grey). The adapter sequence is artificially designed to ensure the sequence does not exist in the human genome. The adaptor complementary strand is produced only if the PCR works. Because padlock capture is strand-specific, a special padlock probe, the "Cat-D probe" (Fig. 1B), can be designed to capture the pre-PCR product with its extension arm targeting the adaptor complementary strand. The Cat-D probe only works if the PCR works. To avoid detecting the noise associated with non-specific primer binding, which may occur during a PCR reaction, the ligation arm of the Cat-D probe is designed to capture the DNA region immediately downstream of the pre-PCR primer. In summary, genotype calls for large deletions can be made by the padlock capture results from Cat-D probes together with Kebab probes (Fig. 1C).

To catch multiple large deletions, multiple primer pairs targeting different deletions can be included in one pre-PCR reaction. Each primer pair targets one deletion and provides one unique adaptor sequence for designing the corresponding Cat-D probe. There is no restriction to the amplicon size of each primer pair. The amplicon sizes of different primer pairs can be similar or different. The pre-PCR product is subjected to the padlock capture of a probe library, which includes Cat-D probes and other padlock probes targeting a comprehensive panel of DNA mutations.

Pre-PCR cycle optimization and test run setup. The pre-PCR product is subjected to padlock capture as a downstream assay. Therefore, the pre-PCR does not have to be completed with full PCR cycles. We first performed gap PCR and successfully detected two thalassemia deletions from patient genomic DNA samples (Fig. 2A). Interestingly, the PCR amplicon sizes from the patient sample (Coriell Biorepository GM10796) were ~1 kb longer than the PCR amplicon sizes estimated based on a previous publication 14 (Fig. 2B). This result further confirmed that the deletion boundaries vary among patient samples. The number of pre-PCR cycles required for Cat-D was then tested. --FIL was successfully detected by Cat-D with a minimum of 16 pre-PCR cycles (Fig. 2C).

We generated a padlock probe library containing 5 padlock probes targeting the Cat-D product of --FIL, 5 padlock probes targeting the Cat-D product of --SEA, 17 Kebab probes targeting the commonly deleted regions in --FIL and --SEA, and 9 padlock probes targeting 10 different small β-thalassemia DNA mutations (see "Methods - Padlock probe library design" for details of these probes).

We performed a test run on a collection of 10 human genomic DNA samples (Fig. 2D). This study was approved by the Ethics Committee of Nanyang Technological University. Padlock capture was performed on each sample in duplicate. Two genomic DNA samples from two commonly used human cancer cell lines (293T and HeLa) are regarded as "wild type" samples, as the samples were tested as "wild type" for all the thalassemia mutations included in this study (data not shown). Six α-thalassemia genomic DNA samples and one β-thalassemia genomic DNA sample were included. A special human DNA sample was purchased from Promega (Cat# G304A).

The sample was originally included in this study as a wild type control. However, we later realized that Promega (Cat# G304A) is prepared from human whole blood from multiple anonymous donors. The blood samples are only tested as negative for HIV and Hepatitis B. There is no information available regarding the samples' genotypes for thalassemia mutations. Therefore, G304A should be regarded as a special DNA sample without a clear genotype. We included G304A in this study just for the test run. Moreover, our padlock capture duplicates on the sample (G304A.1 and G304A.2) were performed on G304A from two different lots (G304A.1 LOT0000189195; G304A.2 LOT0000219766). Therefore, G304A.1 and G304A.2 should be considered two different DNA samples.

On average, ~184 K reads were obtained from each sample. To confirm the experimental consistency of the method, we calculated the correlation coefficients between the duplicates in each sample. The correlation coefficient of the eight experimental duplicates was 0.98 ± 0.01 (Figure 6). This result confirmed the high experimental consistency of the method.

Large α-thalassemia DNA deletions detected by Cat-D. The raw data (Fig. 3A) clearly showed that the headcounts of the padlock capture products from the Cat-D probes are significantly higher in the samples carrying the corresponding deletions. The headcounts of the Kebab probe capture products are also significantly lower in the samples containing the compound heterozygous deletion (--FIL/--SEA). To provide a mathematical justification and generate a computational method to make genotype calls, we established a mathematical method to calculate the genotype scores and make the genotype calls for each sample (Fig. 3B; Methods). The results are nearly picture-perfect for --FIL and Kebab (Fig. 3C,E). Negative genotype calls were accurately made on all the wild type samples and the samples expected to be wild type; for example, the β-thalassemia samples (Beta.1 and Beta.2) are expected to be wild type for the α-thalassemia mutations. Positive genotype calls were also accurately made on all the mutant samples. Clear genotype calls were also made for --SEA (Fig. 3D). All the mutant and wild type samples were accurately genotyped. For the samples "expected" to be wild type, G304A.Lot2 and Beta.1 were genotyped as positive for --SEA (Fig. 3D). G304A is a mixture of genomic DNA isolated from multiple donors, and no information is available regarding the sample's genotype for thalassemia mutations. Based on our genotyping results, it is highly likely that one or more G304A.Lot2 donors are carriers of --SEA. We further confirmed this conclusion by gap PCR (Figure 7). Interestingly, all the genomic DNA samples were subjected to gap PCR before Cat-D to confirm the samples' genotype for the α-thalassemia mutations (Figure 7). Each PCR, which contained 100 ng of genomic DNA, was performed for 35 cycles. --SEA was not detected in G304A.Lot2. When gap PCR was repeated with 38 cycles and 200 ng genomic DNA, a clear PCR product for --SEA was detected in G304A.Lot2. This result confirmed the Cat-D genotyping results and showed that Cat-D is more sensitive than gap PCR. For Beta.1, the genotype call is a false positive result. This false positive result can be dealt with by comparing it with the genotype call made on the duplicate sample (Beta.2).

β-thalassemia point mutations detected by padlock probes. The Cat-D and Kebab probes only occupy a small fraction of the padlock probe library, which also includes other padlock probes targeting small DNA mutations, such as SNPs. In this study, we included padlock probes targeting small β-thalassemia DNA mutations. One of the 10 DNA samples included in this study is a heterozygous mutant in β-thalassemia codon 17 (A > T). The raw data (Fig. 4A) clearly showed that the mutant headcounts are significantly higher in the samples carrying the corresponding mutation. To provide a mathematical justification and to generate a computational method to make the genotype calls, we established a mathematical method to calculate the genotype call (Fig. 4B). In this case, we simply chose 5% as the threshold to make the genotype call for a "minor allele" (Fig. 4B; Methods). The 5% minor allele frequency was determined by analyzing the padlock capture data (Fig. 4C). We calculated the genotype scores and made the genotype calls on all the samples (Fig. 4D). The results show that the method is sensitive and precise for β-thalassemia point mutations. We also included padlock probes targeting other β-thalassemia small mutations in the padlock probe library. Because we do not have mutant genomic DNA samples for these mutations, we expected that all the samples included in this study are wild type for these mutations. Our genotyping results clearly confirmed our expectations (Figures 8 and 9).
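The genotype scoring summarised here and detailed under "Data analysis" in the Methods can be sketched as follows. The headcounts and the threshold of 50 are hypothetical; the actual threshold is derived as shown in Fig. 3B, which is not reproduced in the text, so it is passed in as a parameter. The `max(count, 1)` guard and the definition of the minor allele frequency as the mutant fraction of all reads at the site are assumptions.

```python
from statistics import mean


def genotype_scores(headcounts, wildtype_samples, reverse=False):
    """Scale per-sample headcounts to genotype scores with a maximum of 100."""
    standard_weight = mean(headcounts[s] for s in wildtype_samples)
    raw = {}
    for sample, count in headcounts.items():
        if reverse:
            # Kebab probes report "negatively": a low headcount gives a high score.
            # max(count, 1) is an assumption that avoids division by zero.
            raw[sample] = standard_weight / max(count, 1)
        else:
            raw[sample] = count / standard_weight
    top = max(raw.values())
    return {sample: 100 * score / top for sample, score in raw.items()}


def positive_samples(scores, threshold):
    """Samples whose genotype score is higher than the threshold."""
    return sorted(s for s, v in scores.items() if v > threshold)


def minor_allele_call(mutant_reads, wildtype_reads, threshold=0.05):
    """Call a point mutation when the mutant read fraction exceeds the threshold."""
    total = mutant_reads + wildtype_reads
    return total > 0 and mutant_reads / total > threshold


WILDTYPE = ["293T.1", "293T.2", "HeLa.1", "HeLa.2"]
# Hypothetical Cat-D headcounts for one deletion.
cat_d = {"293T.1": 10, "293T.2": 12, "HeLa.1": 8, "HeLa.2": 10, "Mutant.1": 500}
scores = genotype_scores(cat_d, WILDTYPE)
print(positive_samples(scores, threshold=50))
print(minor_allele_call(mutant_reads=40, wildtype_reads=60))
```

Passing `reverse=True` applies the inverted calculation used for Kebab probes, where a high score indicates a homozygous deletion.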

Discussion

In summary, the test run yielded highly satisfying results and a strong proof of concept for Cat-D. These results demonstrate that the method is sensitive (0% false negative rate) and precise (very low false positive rate, ~5% for --SEA mutation). From a clinical point of view, a low false positive rate is more "acceptable" than a low false negative rate. When genetic testing is performed on a large population, the majority of the samples are wild type. With a 0% false negative rate, all the wild type samples can be accurately genotyped and patients can be informed of their testing results with confidence. Regardless of the false positive rate of the experimental method, for the minority of the samples that tested positive for a certain mutation, this is a feasible approach for clinical labs to experimentally validate testing results before "bad news" is released to patients. Taken together, Cat-D is a comprehensive (a single test covers a comprehensive panel of genetic disorders) and high-throughput (one sequencing run contains multiple samples) method suitable for population-based carrier screens.

Commercial Applications

The commercial application of this invention is obvious. Cat-D and the established padlock probe designs might be applied to replace the current DNA diagnostics for thalassemia mutations. Cat-D is cost-effective and time-saving compared to the current methods.

Moreover, the Cat-D is a high-throughput and comprehensive method. All the known mutations of thalassemia and many known mutations of other inherited disorders can be included in one test. One sequencing run is able to include up to a hundred patients' samples. Therefore, the method is suitable for population-based carrier screen. Currently, nearly all the DNA diagnostics of thalassemia mutations are carried out only to provide final diagnosis on clinical patients, who have already suffered from thalassemia-related syndromes. Since the thalassemia mutation carrier percentage is high in Southeast Asia, the Mediterranean region, the Middle East and sub-Saharan Africa, premarital screen for thalassemia mutation carriers would be greatly beneficial for these regions.

Methods

Primer design. The primer portion of the pre-PCR primers was designed according to the criteria for designing a regular PCR primer. The primers do not bind to repetitive DNA regions in the genome. The primer pairs were confirmed to be able to amplify the target DNA region using a mutant genomic DNA sample carrying the corresponding deletion. For each pre-PCR primer pair, one of the two primers carries the Cat-D adapter on its 5'-portion. The adaptor sequence does not exist in the human genome. The adapter sequence was designed to be at least 20 nt (nucleotides) in length to achieve sequence specificity and to allow for the design of multiple Cat-D padlock probes.
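The primer layout described above (an adapter of at least 20 nt on the 5'-portion of one primer, absent from the host genome) can be sketched as follows. The adapter/primer split is inferred by comparing the listed sequences FIL2KR-ADAPTOR and FILR2K, and `TOY_REFERENCE` is a short hypothetical stand-in; checking a real genome would require an alignment tool rather than a substring search.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def adapter_is_absent(adapter: str, reference: str) -> bool:
    """True if the adapter occurs on neither strand of the reference sequence."""
    ref = reference.upper()
    return adapter.upper() not in ref and reverse_complement(adapter) not in ref


def tag_primer(adapter: str, primer: str, min_adapter_len: int = 20) -> str:
    """Place the adapter on the 5'-portion of the primer."""
    if len(adapter) < min_adapter_len:
        raise ValueError("adapter shorter than the minimum length")
    return adapter + primer


ADAPTER = "TATGCGTCGCGTGTCGCGCGTA"  # FIL2KR-ADAPTOR minus FILR2K (inferred split)
FILR2K = "GATCTGCACCTCTGGGTAGGTTC"  # primer portion as listed for FILR2K
TOY_REFERENCE = "GGGCCAAGCACGTTTGATCTGCACCTCTGGGTAGGTTCAAAT"  # hypothetical stand-in

tagged = tag_primer(ADAPTER, FILR2K)
print(tagged)
print(adapter_is_absent(ADAPTER, TOY_REFERENCE))
```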

The primers used in the study are listed below (the adapter sequences of the primers are indicated with underlines):

SEQ ID NO.1, SEA850F-ADAPTOR (5'-CGATCGTGCGACGCGTATCGGTCCCTTCACCCTCCCACAGTTCCTGC-3');

SEQ ID NO.2, SEAR1K (5'-TTTCACCCAGTACAGCGAGTCCTTCC-3');

SEQ ID NO.1 and SEQ ID NO.2 form the primer pair for SEA.

SEQ ID NO.3, FIL2KR-ADAPTOR (5'-TATGCGTCGCGTGTCGCGCGTAGATCTGCACCTCTGGGTAGGTTC-3');

SEQ ID NO.4, FILF2K (5'-TCTCAGGCATGGAAGAATGAGGGC-3');

SEQ ID NO.3 and SEQ ID NO.4 form the primer pair for FIL.

SEQ ID NO.5, FILF1K (5'-GAGTTGTAAGATATTTTGGGCCAAGCACG-3');

SEQ ID NO.6, FILR1K (5'-CTAGAACGTGGATCCAAGAGGGG-3');

SEQ ID NO.7, FILR2K (5'-GATCTGCACCTCTGGGTAGGTTC-3').

Padlock probe library design. The two arms of each padlock probe were 20 nt (nucleotides) or longer. The Tm (primer melting temperature) of each arm was optimized to be close to 55 °C. The possibility of each padlock capture target forming complicated secondary structures was minimized using UNAFold (http://homepages.rpi.edu/~zukerm/download/UNAFold_download.html).

For each Cat-D padlock probe, the extension arm binds to the complementary sequence of the Cat-D adapter. The ligation arm carries the same DNA sequence as the primer extension product of the pre-PCR primer carrying the Cat-D adapter and is located closely downstream to the 3'-end of the pre-PCR primer carrying the Cat-D adapter.
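The strand relationship described above can be checked on the listed sequences with a short sketch. It assumes that each probe is listed as ligation arm, common linker and extension arm in 5' to 3' order; the helper names are illustrative only.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")
LINKER = "CTTCAGCTTCCCGATATCCGACGGTAGTGT"  # middle segment shared by the listed probes


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def anneals_to(arm: str, strand: str) -> bool:
    """True if the arm is perfectly complementary to a stretch of the strand."""
    return reverse_complement(arm) in strand.upper()


def build_probe(ligation_arm: str, extension_arm: str, linker: str = LINKER) -> str:
    """Assemble 5'-ligation arm, linker, extension arm-3' (assumed layout)."""
    return ligation_arm.upper() + linker + extension_arm.upper()


TAGGED_PRIMER = "TATGCGTCGCGTGTCGCGCGTAGATCTGCACCTCTGGGTAGGTTC"  # FIL2KR-ADAPTOR
LIGATION_ARM = "GCCAGCTCCCTCCAACCTCC"   # first segment listed for SEQ ID NO.10
EXTENSION_ARM = "cgtcgcgtgtcgcgcgtaga"  # last segment listed for SEQ ID NO.10

# This strand exists only if PCR extends back through the adapter-tagged primer.
adaptor_complementary_strand = reverse_complement(TAGGED_PRIMER)

print(anneals_to(EXTENSION_ARM, adaptor_complementary_strand))  # binds the PCR-made strand
print(anneals_to(EXTENSION_ARM, TAGGED_PRIMER))                 # does not bind the primer itself
print(len(build_probe(LIGATION_ARM, EXTENSION_ARM)))
```

Because the extension arm matches the adapter-bearing primer rather than its complement, the probe can only be captured once the PCR has generated the adaptor complementary strand.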

5 padlock probes targeting the Cat-D product of

SEQ ID NO.10, FIL, alpha-thalassemia: GCCAGCTCCCTCCAACCTCC / CTTCAGCTTCCCGATATCCGACGGTAGTGT / cgtcgcgtgtcgcgcgtaga

SEQ ID NO.11, FIL, alpha-thalassemia: CAGCTCCCTCCAACCTCCAC / CTTCAGCTTCCCGATATCCGACGGTAGTGT / tcgcgtgtcgcgcgtagatc

SEQ ID NO.12, FIL, alpha-thalassemia: AGCTCCCTCCAACCTCCACA / CTTCAGCTTCCCGATATCCGACGGTAGTGT / cgcgtgtcgcgcgtagatct

The ligation arms of padlock probes are indicated with underlines. The extension arms of padlock probes are indicated in italic.

padlock probes targeting the Cat-D product of

The ligation arms of padlock probes are indicated with underlines. The extension arms of padlock probes are indicated in italic.

padlock probes targeting 10 different small β-thalassemia DNA mutations:

Target mutations for SEQ ID NO.19 to NO.21: Codon 15 (T>C), codon 17 (A>T), IVS-I-1 (G->T), IVS 1-5 (G->T), Codons 27/28 (+C); GCC CTG(Ala Ser)->GCC C CTG beta0

SEQ ID NO.19, beta-thalassemia: GGCAGTAACGGCAGACTTCT / CTTCAGCTTCCCGATATCCGACGGTAGTGT / TAAACCTGTCTTGTAACCTT

SEQ ID NO.20, beta-thalassemia: CAGTAACGGCAGACTTCTCC / CTTCAGCTTCCCGATATCCGACGGTAGTGT / AACCTGTCTTGTAACCTTGA

SEQ ID NO.21, beta-thalassemia: GGGCAGTAACGGCAGACTTC / CTTCAGCTTCCCGATATCCGACGGTAGTGT / TTAAACCTGTCTTGTAACCT

Target mutations for SEQ ID NO.22 to NO.26: 37/38/39 (-7 nts) beta0 delGACCCAG, Codons 38/39 (-C); ACC CAG(Thr Gln)->ACC -AG beta0, Codons 38/39 (-CC); ACC CAG(Thr-Glu)->A- - CAG beta0, Codons 40/41 (+T); AGG TTC(Arg-Phe)->AGG T TTC beta0, Codons 41/42 (-TTCT); TTCTTT(Phe-Phe)->- - - -TT beta0

SEQ ID NO.22, beta-thalassemia: GGTAGACCACCAGCAGCCTA / CTTCAGCTTCCCGATATCCGACGGTAGTGT / CCTTAGGGTTGCCCATAACA

SEQ ID NO.23, beta-thalassemia: GACCACCAGCAGCCTAAGGG / CTTCAGCTTCCCGATATCCGACGGTAGTGT / AGGGTTGCCCATAACAGCAT

SEQ ID NO.24, beta-thalassemia: AGGGTAGACCACCAGCAGCC / CTTCAGCTTCCCGATATCCGACGGTAGTGT / CACCTTAGGGTTGCCCATAA

SEQ ID NO.25, beta-thalassemia: ACCAGCAGCCTAAGGGTGGG / CTTCAGCTTCCCGATATCCGACGGTAGTGT / TTGCCCATAACAGCATCAGG

SEQ ID NO.26, beta-thalassemia: CAGCAGCCTAAGGGTGGGAA / CTTCAGCTTCCCGATATCCGACGGTAGTGT / GCCCATAACAGCATCAGGAG

The extension arms of padlock probes are indicated in italic.

The ligation arms of padlock probes are indicated with underlines.

17 Kebab probes targeting the commonly deleted regions in --FIL and --SEA

SEQ ID NO.27, wildtype kebab probe: GGAAGGGAGTGCCTTGGCCT / CTTCAGCTTCCCGATATCCGACGGTAGTGT / TTGTCTGAAAAGCCTGGGGT

SEQ ID NO.28, wildtype kebab probe: GTGCCAGGCCTGGTCCAGTG / CTTCAGCTTCCCGATATCCGACGGTAGTGT / CGACTCACAGTCAGGGCTCC

SEQ ID NO.29, wildtype kebab probe: GTCACTGGCACTGACTGCTG / CTTCAGCTTCCCGATATCCGACGGTAGTGT / GGGGATGTAGATAACGTGGG

SEQ ID NO.30, wildtype kebab probe: CCTCAGCATGGGATGGGGCC / CTTCAGCTTCCCGATATCCGACGGTAGTGT / GTATCTACAGTATGATGGTA

SEQ ID NO.31, wildtype kebab probe: CTGACTCTGCCCACAGCCTG / CTTCAGCTTCCCGATATCCGACGGTAGTGT / TAGCTCCGACCAGCTTAGCA

SEQ ID NO.32, wildtype kebab probe: GGTCAGCACCCTTCAGCCTG / CTTCAGCTTCCCGATATCCGACGGTAGTGT / ACAGCCTGAGAAATCACTGA

SEQ ID NO.33, wildtype kebab probe: ACCCACAGGCTGCGGGAAGG / CTTCAGCTTCCCGATATCCGACGGTAGTGT / TACL I ! lAGGTCAGACC! CC

SEQ ID NO.34, wildtype kebab probe: ACCCACCCTGTGTTATGATT / CTTCAGCTTCCCGATATCCGACGGTAGTGT / GGGCACCTGCAGAGATTGAG

SEQ ID NO.35, wildtype kebab probe: I I 1 ! CC ! CAGCCC I A ! I / CTTCAGCTTCCCGATATCCGACGGTAGTGT / TCCCCACACAGACCCAGGAT

SEQ ID NO.36, wildtype kebab probe: I C I CC I AC I 1 I AAG I AACAC / CTTCAGCTTCCCGATATCCGACGGTAGTGT / TGGGCTGAGTTCCAAACCCT

SEQ ID NO.37, wildtype kebab probe: GAATAGGAAGTTGTACACAG / CTTCAGCTTCCCGATATCCGACGGTAGTGT / TCAGTGAGACTGTGGAATGG

SEQ ID NO.38, wildtype kebab probe: GCCTTGGGCAGAGAAGGAAG / CTTCAGCTTCCCGATATCCGACGGTAGTGT / CTCCCTGCCCTGTCTCCCCA

SEQ ID NO.39, wildtype kebab probe: GGGATGGTACTGAGGAGAAA / CTTCAGCTTCCCGATATCCGACGGTAGTGT / TCTGGGGAAGGGTGGGAGGT

SEQ ID NO.40, wildtype kebab probe: TGAGGAAGGAAGGGGTGGAC / CTTCAGCTTCCCGATATCCGACGGTAGTGT / ACAAGGGCCCTGTGGTTGGA

SEQ ID NO.41, wildtype kebab probe: CTCAGGGGAGCTGAGTGGGT / CTTCAGCTTCCCGATATCCGACGGTAGTGT / AGAAGGGACCTTCTAGCCAG

SEQ ID NO.42, wildtype kebab probe: AGAGAAAACACACACCAGGG / CTTCAGCTTCCCGATATCCGACGGTAGTGT / GCCAGGGCTTTATGGCTACC

SEQ ID NO.43, wildtype kebab probe: GATATTCCTATCAGTTGAGG / CTTCAGCTTCCCGATATCCGACGGTAGTGT / ACATCACAAACGCAGGCAGA

The extension arms of padlock probes are indicated in italic.

The ligation arms of padlock probes are indicated with underlines.

Pre-PCR. The Herculase II Fusion DNA Polymerases kit (Cat#600675, Agilent) and 100 ng genomic DNA were used in a 25 μl PCR reaction containing 0.8 μM of each PCR primer, and amplification was performed according to the following PCR program:

(1) 95 °C for 3 min;

(2) 18 to 20 cycles of 95 °C for 30 sec, 63 °C for 30 sec, and 68 °C for 90 sec;

(3) 68 °C for 5 min; and

(4) a 4 °C hold.

The pre-PCR products were purified with the QIAquick PCR Purification Kit (Cat#28104, QIAGEN) and eluted into a 25 μl volume.

Padlock capture. Padlock capture was performed as previously described (Zhang, K. et al. Nat Methods 6, 613-618 (2009)). Briefly, each reaction was performed in a 20 μl volume containing 1 unit Ampligase (A3210K, Epicentre), 1 unit Phusion High-Fidelity DNA Polymerase (M0530, New England BioLabs), 1× Phusion High-Fidelity DNA Polymerase buffer, 10 nM dNTP and 1 ng padlock probe. Two microliters of the purified pre-PCR product and 800 ng genomic DNA were used in each reaction. Nicotinamide adenine dinucleotide (NAD+) was provided in each reaction at a final concentration of 0.5 mM.

Illumina sequencing. The sequencing libraries were PCR-amplified in a real-time PCR system (CFX Connect, Bio-Rad) using the following primers:

(1) CA2-RA.MiSecret (5'-AATGATACGGCGACCACCGAGATCTACACGCTACACGCCTATCGGGAAGCTGAAG-3');

(2) CA-2-FA.Indx3Sol (5'-CAAGCAGAAGACGGCATACGAGATGCCTAACGGTCTGCCATCCGACGGTAGTGT-3');

(3) CA-2-FA.Indx4Sol (5'-CAAGCAGAAGACGGCATACGAGATTGGTCACGGTCTGCCATCCGACGGTAGTGT-3');

(4) CA-2-FA.Indx5Sol (5'-CAAGCAGAAGACGGCATACGAGATCACTGTCGGTCTGCCATCCGACGGTAGTGT-3');

(5) CA-2-FA.Indx7Sol (5'-CAAGCAGAAGACGGCATACGAGATGATCTGCGGTCTGCCATCCGACGGTAGTGT-3');

(6) CA-2-FA.Indx10Sol (5'-CAAGCAGAAGACGGCATACGAGATAAGCTACGGTCTGCCATCCGACGGTAGTGT-3');

(7) CA-2-FA.Indx12Sol (5'-CAAGCAGAAGACGGCATACGAGATTACAAGCGGTCTGCCATCCGACGGTAGTGT-3');

(8) CA-2-FA.Indx13Sol (5'-CAAGCAGAAGACGGCATACGAGATTTGACTCGGTCTGCCATCCGACGGTAGTGT-3');

(9) CA-2-FA.Indx14Sol (5'-CAAGCAGAAGACGGCATACGAGATGGAACTCGGTCTGCCATCCGACGGTAGTGT-3');

(10) CA-2-FA.Indx15Sol (5'-CAAGCAGAAGACGGCATACGAGATTGACATCGGTCTGCCATCCGACGGTAGTGT-3');

(11) CA-2-FA.Indx16Sol (5'-CAAGCAGAAGACGGCATACGAGATGGACGGCGGTCTGCCATCCGACGGTAGTGT-3');

(12) CA-2-FA.Indx18Sol (5'-CAAGCAGAAGACGGCATACGAGATGCGGACCGGTCTGCCATCCGACGGTAGTGT-3');

(13) CA-2-FA.Indx19Sol (5'-CAAGCAGAAGACGGCATACGAGATTTTCACCGGTCTGCCATCCGACGGTAGTGT-3');

(14) CA-2-FA.Indx25Sol (5'-CAAGCAGAAGACGGCATACGAGATATCAGTCGGTCTGCCATCCGACGGTAGTGT-3');

(15) CA-2-FA.Indx45Sol (5'-CAAGCAGAAGACGGCATACGAGATCGTAGTCGGTCTGCCATCCGACGGTAGTGT-3');

(16) CA-2-FA.Indx76Sol (5'-CAAGCAGAAGACGGCATACGAGATAATAGGCGGTCTGCCATCCGACGGTAGTGT-3');

(17) CA-2-FA.Indx91Sol (5'-CAAGCAGAAGACGGCATACGAGATACATCGCGGTCTGCCATCCGACGGTAGTGT-3');

(18) CA-2-FA.Indx92Sol (5'-CAAGCAGAAGACGGCATACGAGATTCAAGTCGGTCTGCCATCCGACGGTAGTGT-3'); and

(19) CA-2-FA.Indx93Sol (5'-CAAGCAGAAGACGGCATACGAGATATTGGCCGGTCTGCCATCCGACGGTAGTGT-3').

Each padlock capture product was assigned a unique barcode. The sequencing libraries for each sample were combined. The following sequencing primers were used:

(1) Read1.Misecret (5'-ACACGCTACACGCCTATCGGGAAGCTGAAG-3') and

(2) IndexRead (5'-ACACTACCGTCGGATGGCAGACCG-3').

Sequencing was performed on an Illumina MiSeq system using the MiSeq Micro flow cell (2 × 150 cycles). FASTQ files were generated from the sequencer's output using the Illumina bcl2fastq2 software (v.2.17.1.14) with the default chastity filter set to select the sequence reads for the subsequent analysis.
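The index primers listed above appear to share a fixed 5' portion and a fixed 3' portion, with a 6-nt barcode in between. The sketch below extracts that barcode under this assumed layout; `PREFIX` and `SUFFIX` are inferred by comparing the listed primers and are not stated in the text. Whether the sequencer reports the barcode itself or its reverse complement is not addressed here.

```python
PREFIX = "CAAGCAGAAGACGGCATACGAGAT"  # assumed common 5' portion of the index primers
SUFFIX = "CGGTCTGCCATCCGACGGTAGTGT"  # assumed common 3' portion of the index primers
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def extract_index(primer: str) -> str:
    """Return the barcode located between the assumed fixed portions."""
    seq = primer.replace(" ", "").upper()
    if not (seq.startswith(PREFIX) and seq.endswith(SUFFIX)):
        raise ValueError("primer does not match the assumed layout")
    return seq[len(PREFIX):len(seq) - len(SUFFIX)]


INDEX_PRIMERS = {
    "CA-2-FA.Indx3Sol": "CAAGCAGAAGACGGCATACGAGATGCCTAACGGTCTGCCATCCGACGGTAGTGT",
    "CA-2-FA.Indx4Sol": "CAAGCAGAAGACGGCATACGAGATTGGTCACGGTCTGCCATCCGACGGTAGTGT",
}
INDEX_READ_PRIMER = "ACACTACCGTCGGATGGCAGACCG"

barcodes = {name: extract_index(seq) for name, seq in INDEX_PRIMERS.items()}
print(barcodes)
# The index read primer is the reverse complement of the fixed 3' portion.
print(reverse_complement(SUFFIX) == INDEX_READ_PRIMER)
```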

Data analysis. We wrote a Perl script to find the exact match between the first 88 nt of a sequencing read and an expected padlock probe capture product. To make the genotype calls on large DNA deletions using the padlock capture data from the Cat-D and Kebab probes, a "standard weight" was calculated for each mutation by taking the average headcounts from the four wild type samples (293T.1, 293T.2, HeLa.1 and HeLa.2). The raw genotype score of each sample was then calculated as the headcount of the sample divided by the standard weight. Because Kebab probes "negatively" report the corresponding mutation (homozygous deletion), low headcounts indicate the detection of a mutation. Therefore, the raw genotype scores of the Kebab probes were calculated in reverse (standard weight divided by the headcount of each sample). To make the genotype scores more sensible for interpretation, the sample with the highest raw genotype score in the panel was scored as 100. The rest of the samples were scored proportionally to the raw genotype scores. The threshold was then calculated (Fig. 3B). A sample with a genotype score higher than the threshold was positive for the corresponding mutation. The corresponding mutation with the Cat-D probes is a corresponding large DNA deletion. The corresponding mutation with the Kebab probes is a "homozygous" large DNA deletion. To make the genotype calls on the point mutations, we used 5% as the threshold to make the genotype call on a "minor allele" (Fig. 4B).

The above is a description of embodiment(s) of a method and a kit for detecting a gene deletion in a host species. It is to be further appreciated that technical features from one or more embodiments as described may be permutated and/or combined to form further embodiments without departing from the scope of the present invention.

Claims

CLAIMS:

1. A method for detecting a gene deletion in a host species, comprising:

(a) amplifying a first DNA region surrounding the gene deletion with at least a pair of pre-PCR primers to form a pre-PCR product, wherein one of the pair of pre-PCR primers carries an adaptor sequence at 5'-end, wherein the adaptor sequence is not found in the host species' genome;

(b) hybridizing the pre-PCR product to at least one circularizing probe, wherein the at least one circularizing probe is designed to have a ligation arm and an extension arm targeting a strand complementary to the adapter sequence.

2. The method of claim 1, wherein the host species is human.

3. The method of claim 1 or claim 2, wherein the adapter sequence is at least 20 nucleotides in length.

4. The method of any one of the preceding claims, wherein the adapter sequence comprises a nucleotide sequence represented by any one of SEQ ID NOs: 1 to 7.

5. The method of any one of the preceding claims, wherein the at least one circularizing probe is a padlock probe.

6. The method of any one of the preceding claims, wherein the at least one circularizing probe comprises a nucleotide sequence represented by any one of SEQ ID NOs: 8 to 17.

7. The method of any one of the preceding claims, wherein the ligation arm hybridizes to a second DNA region adjacent to the pre-PCR primer.

8. The method of any one of the preceding claims, wherein the ligation arm and/or the extension arm is designed to be at least 20 nucleotides in length.

9. The method of any one of the preceding claims, wherein Tm of the ligation arm and/or the extension arm ranges from 50 °C to 60 °C.

10. The method of any one of the preceding claims, wherein the first DNA region is amplified in (a) for fewer than 30 cycles.

11. The method of claim 10, wherein the first DNA region is amplified in (a) for 16 cycles.

12. The method of any one of the preceding claims, further comprises (c) hybridizing a first plurality of additional circularizing probes to the first DNA region.

13. The method of claim 12, further comprises comparing a first result of gene deletion detection obtained from (a) and (b), and a second result of gene deletion detection obtained from (c) to determine a genotype of the gene deletion of the host species.

14. The method of claim 12 or 13, where the first plurality of additional circularizing probes are kebab probes.

15. The method of claim 14, wherein the first plurality of additional circularizing probes comprise a nucleotide sequence represented by any one of SEQ NOs: 27 to 43.

16. The method of any one of the preceding claims, wherein the gene deletion is a large-scale DNA mutation.

17. The method of claim 16, wherein the large-scale DNA mutation is a deletion of 3 to 40 kb (kilo-base pairs).

18. The method of any one of the preceding claims, wherein the gene deletion is a mutation found in alpha-thalassemia.

19. The method of any one of the preceding claims, further comprises (d) hybridizing a second plurality of additional circularizing probes to target one or more small-scale DNA mutations in the host species.

20. The method of claim 19, wherein the one or more small-scale DNA mutations comprise single-nucleotide polymorphism (SNP).

21. The method of claim 19, wherein the one or more small-scale DNA mutations is a small beta-thalassemia DNA mutation.

22. The method of claim 19, wherein the second plurality of additional circularizing probes are padlock probes.

23. The method of claim 19, wherein the second plurality of additional circularizing probes comprises a nucleotide sequence represented by any one of SEQ NOs: 18 to 26.

24. A kit for detecting a gene deletion in host species, comprising:

at least a pair of pre-PCR primers adapted for amplifying a first DNA region surrounding the gene deletion to form a pre-PCR product, wherein one of the pair of pre-PCR primers comprise an adaptor sequence at 5'-end, wherein the adaptor sequence is not found in the host species' genome;

at least one circularizing probe adapted for hybridizing to the pre-PCR product, wherein the circularizing probe is designed to have a ligation arm and an extension arm targeting a strand complementary to the adapter sequence.

25. The kit of claim 24, wherein the host species is human.

26. The kit of claim 24 or claim 25, wherein the adapter sequence is designed to be at least 20 nucleotides in length.

27. The kit of any one of claim 24 to claim 26, wherein the adapter sequence comprises a nucleotide sequence represented by any one of SEQ ID NOs: 1 to 7.

28. The kit of any one of claim 24 to claim 27, wherein the at least one circularizing probe is a padlock probe.

29. The kit of any one of claim 24 to claim 28, wherein the at least one circularizing probe comprises nucleotide sequence represented by any one of SEQ ID NOs: 8 to 17.

30. The kit of any one of claim 24 to claim 29, wherein the ligation arm is adapted to hybridize a second DNA region adjacent to the pre-PCR primer.

31. The kit of any one of claim 24 to claim 30, wherein the ligation arm and/or the extension arm is at least 20 nucleotides in length.

32. The kit of any one of claim 24 to claim 31, wherein Tm of the ligation arm and/or the extension arm ranges from 50 °C to 60 °C.

33. The kit of any one of claim 24 to claim 32, further comprising a first plurality of additional circularizing probes adapted for hybridizing to the first DNA region.

34. The kit of claim 33, wherein the first plurality of additional circularizing probes are kebab probes.

35. The kit of claim 33, wherein the first plurality of additional circularizing probes comprise nucleotide sequence represented by any of SEQ NO: 27 to 43.

36. The kit of any one of claim 24 to claim 35, further comprises a second plurality of additional circularizing probes targeting one or more small-scale DNA mutations.

37. The kit of claim 36, wherein the second plurality of additional circularizing probes are padlock probes.

38. The kit of claim 36, wherein the second plurality of additional circularizing probes comprises nucleotide sequence represented by any one of SEQ NOs: 18 to 26.