AU2016403554A1 - Method for enriching target nucleic acid sequence from nucleic acid sample - Google Patents

Method for enriching target nucleic acid sequence from nucleic acid sample

Info

Publication number
AU2016403554A1
Authority
AU
Australia
Prior art keywords
nucleic acid
bait
sequence
sequences
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
AU2016403554A
Inventor
Wanshi CAI
Xingyi HANG
Wubin QU
Ruichao WANG
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Igenetech Biotech (beijing) Co Ltd
Original Assignee
Igenetech Biotech Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Igenetech Biotech Beijing Co Ltd
Publication of AU2016403554A1

Classifications

    • C - CHEMISTRY; METALLURGY
    • C12 - BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q - MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 - Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 - Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids

Abstract

Provided is a method for enriching a target nucleic acid sequence from a nucleic acid sample, the method comprising: providing a nucleic acid sample containing a target nucleic acid sequence, and a bait sequence that is identical to or characteristic of the target nucleic acid sequence; preparing a nucleic acid analogue by in vitro transcription using the bait sequence as a template, with the nucleic acid analogue carrying a binding portion; fragmenting the nucleic acid sample; hybridising the nucleic acid analogue with the nucleic acid sample, such that the nucleic acid analogue and the target nucleic acid sequence form a nucleic acid analogue/DNA hybrid complex; and isolating the nucleic acid analogue/DNA hybrid complex from non-specifically hybridised nucleic acids via the binding portion, thereby removing the non-target nucleic acid sequences. In a preferred embodiment, the method also comprises amplifying the nucleic acid analogue/DNA hybrid complex so as to enrich the target nucleic acid sequence.

Description

METHOD FOR ENRICHING TARGET NUCLEIC ACID SEQUENCES FROM A NUCLEIC ACID SAMPLE
TECHNICAL FIELD
The present disclosure relates to nucleic acid sequence capturing, enriching and analyzing. More specifically, the present disclosure relates to a method for enriching target sequences based on liquid phase capture.
BACKGROUND
Whole genome sequencing allows for obtaining the mutations, insertions, deletions, and structural variations at the genome-wide level. However, because the genome is large, sequencing at 30x coverage yields a data volume approaching 100G. Sequencing of low-frequency mutations, such as those associated with tumors, requires at least 1000x coverage; if genome-wide sequencing is performed at that depth, it generates a data volume of up to 3000G.
Such data volumes make data analysis enormously difficult and sequencing very costly. Technologies for capturing target regions emerged in response.
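The data volumes quoted above follow from multiplying genome size by coverage. The sketch below reproduces that arithmetic; the genome size of 3.1 gigabases and the reading of "G" as gigabases are assumptions, not figures stated in the text.

```python
# Rough data-volume estimate: bases sequenced = genome size x coverage.
# ASSUMPTION: a human genome of about 3.1 gigabases; "G" is read as gigabases.
HUMAN_GENOME_GB = 3.1


def data_volume_gb(coverage: float, genome_gb: float = HUMAN_GENOME_GB) -> float:
    """Return the approximate sequencing output in gigabases."""
    return genome_gb * coverage


if __name__ == "__main__":
    for cov in (30, 1000):
        print(f"{cov}x coverage -> about {data_volume_gb(cov):.0f} Gb")
```

Under these assumptions, 30x coverage gives roughly 93 Gb and 1000x gives roughly 3100 Gb, in line with the "approaching 100G" and "up to 3000G" figures.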
Target region capture refers to a technique of directionally capturing the nucleic acid sequence of a target region by a specific technical means, followed by library construction and sequencing, so as to sequence the target region deeply while greatly reducing sequencing costs. PCR is a common technique for enriching a target region, and multiplex PCR is a more common technique for capturing multiple target regions at one time. Multiplex PCR is suitable for capturing hotspot regions or short target regions. For long target regions, such as target regions longer than 100 kb, multiplex PCR is no longer suitable in terms of cost and technical complexity.
Therefore, there is a need in the art for a new method suitable for capturing target regions of long length.
SUMMARY OF THE INVENTION
In order to solve the above problems, the disclosure provides a method for enriching target sequences based on liquid phase capture.
In a first aspect, the disclosure provides a method for enriching target nucleic acid sequences from a nucleic acid sample, the method comprising steps of:
a) providing a nucleic acid sample comprising target nucleic acid sequences, and bait sequences that are identical to or that are characteristic of the target nucleic acid sequences;
2016403554 22 Nov 2018
b) performing in vitro transcription with the bait sequences as templates to prepare nucleic acid analogs, wherein the nucleic acid analogs each have a binding moiety;
c) fragmenting the nucleic acid sample, and preferably preparing a library;
d) hybridizing said nucleic acid analogs to said nucleic acid sample, such that said nucleic acid analogs form nucleic acid analog/DNA hybrid complexes with said target nucleic acid sequences; and
e) isolating the nucleic acid analog/DNA hybrid complexes from non-specifically hybridized nucleic acids by the binding moiety, to remove non-target nucleic acid sequences.
In an embodiment, in preparing a library in step c), the fragments of the nucleic acid sample are each linked to a linker sequence at both ends, and following step e), the method further comprises step f) of amplifying the nucleic acid analog/DNA hybrid complexes based on the linker sequence, so as to enrich the target nucleic acid sequences.
In an embodiment, the bait sequences each have characteristics selected from the group consisting of the following: i) no hairpin structure is formed by itself and no dimer is formed with each other; ii) copy numbers are compensated according to the GC content and/or spatial structure of the target nucleic acid sequence; iii) where the target region has a very high or very low GC content or where the target region is a region of low complexity, the bait sequences are designed for the region flanking the target region as the replacement region with the same design method as for the target region; and iv) the bait sequences do not specifically bind to sequences other than the target nucleic acid sequences in the nucleic acid sample.
In an embodiment, the copy number of the bait sequence is compensated according to the degree of interest in the target nucleic acid sequence.
In an embodiment, the nucleic acid sample is genomic DNA, RNA, cDNA, or mRNA, and where the nucleic acid sample is RNA or mRNA, the method further comprises the step of subjecting the RNA or mRNA to reverse transcription into DNA before step c).
In an embodiment, the bait sequences are on a solid support, such as on a microarray slide.
In an embodiment, the solid support is also a plurality of beads or a microarray.
In an embodiment, some or all of the nucleic acid analogs each comprise a binding moiety.
In an embodiment, in step b), the in vitro transcription is carried out with nucleic acid analogs GNA, LNA, PNA, TNA or morpholino nucleic acid, to prepare the nucleic acid analogs, and preferably the nucleic acid analogs each comprise a binding moiety.
In an embodiment, the binding moiety is a biotin binding moiety.
In an embodiment, the copy number of the bait sequence is compensated according to the GC content of the target sequence, and the lower or the higher the GC content is, the more the copy number of the bait sequence corresponding to the target sequence is increased.
In an embodiment, the copy number is compensated according to the GC content of the target nucleic acid sequence, which means that the copy number coefficient for the bait sequence with a GC content of 50% is set to the baseline of 1, and the copy number coefficient for the bait sequence with a GC content of 10%-90% is increased by 0.08-0.12 per 1% of the GC content deviating from 50%.
In a specific embodiment, the method for the copy number compensation of the bait sequence is as follows: the GC content of the target sequence is classified into 6 levels, namely level 1: 10%-30%, level 2: 30%-40%, level 3: 40%-60%, level 4: 60%-70%, level 5: 70%-90%, and level 6: less than 10% or greater than 90%, wherein the copy number of the bait sequence at level 3 is set to the baseline copy number, the copy number of the bait sequence at levels 2 and 4 is higher than the copy number at level 3, such as 2.2-2.8 times the copy number at level 3, and the copy number of the bait sequence at levels 1 and 5 is higher still, for example 3-4 times the copy number at level 3. For level 6, at which the GC content is less than 10% or greater than 90%, and for a target region having a sequence of low complexity, the method for designing the bait sequence comprises designing the probe for the region flanking the target region as the replacement region; generally, the region within 300 bp, preferably within 150 bp, flanking the target region is selected as the replacement region.
In an embodiment, the bait sequence is 60-150 bp in length, preferably 80-120 bp in length.
In an embodiment, the expression “identical to or characteristic of the target nucleic acid sequence” means that the bait sequence has a thermodynamic stability that is significantly weaker when binding to the non-target region as compared with when binding to the target region. Preferably, Tm between the bait sequence and the target region - Tm between the bait sequence and the non-specific region > 5°C, and more preferably Tm between the bait sequence and the target region - Tm between the bait sequence and the non-specific region > 10°C. Preferably, the value of Tm is calculated by the Nearest-Neighbor method based on the table of SantaLucia 2007 thermodynamic parameters.
In an embodiment, no dimer is formed, which means that the Tm of the dimer formed from any two bait sequences is < 47°C, preferably < 37°C. Preferably, the value of Tm is calculated by the Nearest-Neighbor method based on the table of SantaLucia 2007 thermodynamic parameters.
In an embodiment, no hairpin structure is formed, which means that the Tm of the hairpin structure formed from any bait sequence by itself is < 47°C, preferably < 37°C. Preferably, the value of Tm is calculated by the Nearest-Neighbor method based on the table of SantaLucia 2007 thermodynamic parameters.
In an embodiment, for each target region, the bait sequence is one or more bait sequences with an optimal comprehensive score in terms of specificity, dimer, hairpin structure, and relative position to the target region. The comprehensive score is obtained by a scoring function of $S = a \times S_{\text{specificity}} + b \times S_{\text{dimer}} + c \times S_{\text{hairpin structure}} + d \times S_{\text{relative distance}}$, where a = 0.26-0.34, b = 0.08-0.12, c = 0.17-0.23, d = 0.35-0.45, and the particular scoring methods are as follows:
$S_{\text{specificity}}$ scoring: any newly designed bait sequence is aligned to the genome and the Tm between the bait sequence and each hit sequence is calculated; where Tm between the bait sequence and the target region - Tm between the bait sequence and any hit sequence > 5°C, preferably > 10°C, the mean of the Tms between the bait sequence and all hit sequences is calculated, and $S_{\text{specificity}} = 1 - Tm_{\text{mean}} / (Tm_{\text{target}} - 5)$, preferably $S_{\text{specificity}} = 1 - Tm_{\text{mean}} / (Tm_{\text{target}} - 10)$, where $Tm_{\text{mean}}$ is the mean of the Tms between the bait sequence and all hit sequences in non-specific regions, and $Tm_{\text{target}}$ is the Tm between the bait sequence and the target region;
$S_{\text{dimer}}$ scoring: any newly designed bait sequence is aligned to each of the already designed bait sequences for dimer analysis, and the Tm between the newly designed bait sequence and each of the hit bait sequences is calculated; where Tm < 47°C, the mean of the Tms between the newly designed bait sequence and each of the hit bait sequences is calculated, and $S_{\text{dimer}} = (47 - Tm_{\text{mean}}) / 47$; and preferably, where Tm < 37°C, the mean of the Tms between the newly designed bait sequence and each of the hit bait sequences is calculated, and $S_{\text{dimer}} = (37 - Tm_{\text{mean}}) / 37$;
$S_{\text{hairpin structure}}$ scoring: the optimal structure of any bait sequence upon self-alignment is determined, and the Tm of the structure is calculated; where Tm < 47°C, $S_{\text{hairpin structure}} = (47 - Tm) / 47$; and preferably, where Tm < 37°C, $S_{\text{hairpin structure}} = (37 - Tm) / 37$; and
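$S_{\text{relative distance}}$ scoring: the coordinate difference $\delta_{\text{Distance}}$ between any newly designed bait sequence and the target region is calculated, and where $\delta_{\text{Distance}}$ is no greater than 150, $S_{\text{relative distance}} = (150 - \delta_{\text{Distance}}) / 150$; where no suitable bait sequence can be designed within a coordinate difference of 150, the acceptable coordinate difference may be set to 300, and $S_{\text{relative distance}} = (300 - \delta_{\text{Distance}}) / 300$.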
In a second aspect, the disclosure also provides specific bait sequences for carrying out the method of the first aspect of the disclosure, wherein the specific bait sequences are the bait sequences used in the first aspect of the disclosure.
In an embodiment, the specific bait sequences are identical to or characteristic of the target nucleic acid sequences, and i) do not form any hairpin structure by themselves and do not form any dimer between each other, ii) have a copy number compensated according to the GC content and/or spatial structure of the target nucleic acid sequences, iii) are designed for the region flanking the target region as the replacement region with the same design method as for the target region, where the target region has a very high or very low GC content or where the target region is a region of low complexity, and iv) do not specifically bind to sequences other than the target nucleic acid sequences in the nucleic acid sample.
In an embodiment, the copy number of the bait sequence is further compensated according to the degree of interest in the target nucleic acid sequence.
In a third aspect, the present disclosure also provides a kit comprising the bait sequences of the second aspect of the disclosure, and the kit further comprises, but is not limited to, a double-stranded linker molecule and a plurality of different oligonucleotide probes.
In an embodiment, the kit comprises a composition and reagents for carrying out the method of the first aspect of the disclosure. The kit comprises, but is not limited to, a double-stranded linker molecule, a plurality of different oligonucleotide probes, and bait sequences identical to or characteristic of the target nucleic acid sequences, and the bait sequences i) do not form any hairpin structure by themselves and do not form any dimer between each other, ii) have a copy number compensated according to the GC content of, the spatial structure of and/or the degree of interest in the target nucleic acid sequences, iii) are designed for the region flanking the target region as the replacement region with the same design method as for the target region, where the target region has a very high or very low GC content or where the target region is a region of low complexity, and iv) do not specifically bind to sequences other than the target nucleic acid sequences in the nucleic acid sample. In certain embodiments, the kit comprises two different double-stranded linker molecules. The kit may further comprise one or more additional components selected from DNA polymerase, T4 polynucleotide kinase, T4 DNA ligase, a hybridization buffer, a washing buffer and/or an elution buffer. In certain embodiments, the kit comprises a magnet. In certain embodiments, the kit comprises one or more enzymes, as well as corresponding reagents, buffers, and the like, such as a restriction enzyme, for example MlyI, and a buffer/reagent for restriction enzyme digestion with MlyI.
DETAILED DESCRIPTION OF THE INVENTION
The disclosure provides a method for enriching target sequences based on liquid phase capture, which comprises steps of: designing a bait sequence; synthesizing the nucleic acid of the bait sequence (using a conventional primer synthesis method or a solid phase synthesis method); preparing nucleic acid analogs comprising a binding moiety by in vitro transcription; pretreating the nucleic acid sample (by a library preparation method), which may be genomic DNA, RNA, cDNA, mRNA, etc.; forming a nucleic acid analog/DNA hybrid complex via complementary pairing of the nucleic acid analog with the target nucleic acid sequence; removing via elution the nucleic acid analog/DNA hybrid complexes of low complementary pairing, so as to remove the non-target nucleic acid sequences; and specifically amplifying the complementarily paired nucleic acid analog/DNA based on the linker sequence added in the pretreatment of the nucleic acid sample, so as to enrich the target nucleic acid sequences.
In the disclosure, the term “sample” is used in its broadest meaning and is intended to include a sample or culture obtained from any source, preferably from a biological source. The biological sample can be obtained from an animal, including a human, and includes liquid, solid, tissue, and gas. The biological sample includes a blood product such as plasma, serum, and the like. Thus, the term “nucleic acid sample” comprises nucleic acids of any source (e.g., DNA, RNA, cDNA, mRNA, tRNA, miRNA, etc.). Where the nucleic acid sample is RNA or mRNA, the method further comprises a step of subjecting the RNA or mRNA to reverse transcription into DNA prior to step c). In the present disclosure, the nucleic acid sample is preferably derived from a biological source, such as a human or non-human cell, tissue, and the like. The term “non-human” refers to all non-human animals and entities, and includes, but is not limited to, a vertebrate such as a rodent, a non-human primate, sheep, cattle, ruminant, rabbit, pig, goat, horse, dog, cat, bird, etc. Non-human also includes invertebrates, prokaryotes and other entities, such as bacteria, plants, yeasts, viruses, and the like. Thus, the nucleic acid sample for use in the method and system of the disclosure is derived from any organism, either eukaryotic or prokaryotic.
In the disclosure, the inventors have found that the GC content of a target sequence has a significant effect on the capturing efficiency of the target sequence based on liquid phase capture. In order to effectively capture multiple target sequences, it is preferred to compensate the copy number of the bait sequences according to the GC content of the target sequence. The lower or the higher the GC content is, the more the copy number of the bait sequence corresponding to the target sequence is increased.
The inventors have found that, for a target sequence with a GC content of about 50%, for example ± 10%, a good target sequence capturing efficiency can be obtained; and for a target sequence with any other GC content, the copy number of the bait sequence needs to be compensated to obtain a good target sequence capturing efficiency. After thorough testing on the human genome sequence, the inventors have found that, in order to achieve better target sequence capturing efficiency, the copy number coefficient of the bait sequence with a GC content of 50% may be set to 1, and the copy number coefficient of the bait sequence with a GC content within 10%-90% may be increased by 0.08-0.12 per 1% of the GC content deviating from 50%. For example, when the GC content is 68%, i.e. the deviation is 18%, the copy number coefficient of the bait sequence is 2.44-3.16.
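The linear compensation rule can be sketched as a small function. This is a minimal illustration only; the function name and the default increment of 0.10 per 1% (the midpoint of the stated 0.08-0.12 range) are assumptions.

```python
def copy_number_coefficient(gc_percent: float, per_percent: float = 0.10) -> float:
    """Copy number coefficient for a bait, relative to a baseline of 1 at 50% GC.

    per_percent is the increase per 1% of GC deviation from 50%; the text gives
    a range of 0.08-0.12, and the default of 0.10 is an illustrative assumption.
    """
    if not 10.0 <= gc_percent <= 90.0:
        raise ValueError("the linear rule is stated only for 10%-90% GC content")
    if not 0.08 <= per_percent <= 0.12:
        raise ValueError("per_percent is expected to lie within 0.08-0.12")
    return 1.0 + per_percent * abs(gc_percent - 50.0)


if __name__ == "__main__":
    low = copy_number_coefficient(68, 0.08)
    high = copy_number_coefficient(68, 0.12)
    print(f"68% GC -> coefficient between {low:.2f} and {high:.2f}")
```

For a GC content of 68% the two ends of the range give 1 + 18 × 0.08 = 2.44 and 1 + 18 × 0.12 = 3.16, matching the worked example in the text.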
For a target region that has a GC content of less than 10% or greater than 90% or that belongs to a region of low complexity, the corresponding method for designing a bait sequence is as follows: when a target region has a very high or very low GC content or when a target region is a region of low complexity, the probe is designed with the region flanking the target region as the replacement region. Generally, the region within 300 bp, preferably within 150 bp, that flanks the target region is selected as the replacement region.
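Selecting the replacement region amounts to computing the flanking intervals of the target. The sketch below assumes 0-based, half-open coordinates and an optional chromosome length for clipping; both conventions and the function name are assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def flanking_regions(start: int, end: int, flank: int = 150,
                     chrom_length: Optional[int] = None) -> List[Tuple[int, int]]:
    """Return the upstream and downstream flanks of the target [start, end).

    flank is the window size in bp (150 preferred, 300 as the general limit).
    Coordinates are assumed 0-based and half-open; empty flanks are dropped.
    """
    if not 0 <= start < end:
        raise ValueError("expected 0 <= start < end")
    left = (max(0, start - flank), start)
    right_end = end + flank
    if chrom_length is not None:
        right_end = min(chrom_length, right_end)
    right = (end, right_end)
    return [region for region in (left, right) if region[1] > region[0]]


if __name__ == "__main__":
    print(flanking_regions(1000, 1200))           # preferred 150 bp window
    print(flanking_regions(1000, 1200, flank=300))  # wider 300 bp window
```

Baits would then be designed inside the returned intervals with the same procedure used for ordinary target regions.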
In the present disclosure, a region of low complexity refers to a region composed of a small variety of elements such as oligonucleotides, such as a simple repeat sequence like microsatellites.
In the present disclosure, it is preferred to construct a library of DNA fragments of the sample after fragmentation.
In an embodiment, the method for compensating the copy number of a bait sequence can be simply expressed as follows: the GC content of a target sequence is classified into 6 levels, namely level 1: 10%-30%, level 2: 30%-40%, level 3: 40%-60%, level 4: 60%-70%, level 5: 70%-90%, and level 6: less than 10% or greater than 90%, wherein the copy number of the bait sequence at level 3 is set to the baseline copy number, the copy number of the bait sequences at levels 2 and 4 needs to be increased, for example to 2.2-2.8 times the copy number at level 3, and the copy number of the bait sequences at levels 1 and 5 needs to be increased further, for example to 3-4 times the copy number at level 3. In an embodiment, for level 6, where the GC content is less than 10% or greater than 90%, or where the target involves a sequence of low complexity, the method for designing a bait sequence is as follows: the probe is designed for the region flanking the target region as the replacement region. Generally, the region within 300 bp, preferably within 150 bp, that flanks the target region is selected as the replacement region.
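The six-level scheme can be expressed as a lookup. In this sketch the handling of shared boundaries (30%, 40%, 60%, 70%) is an assumption, since the text lists overlapping endpoints; the names `gc_level` and `copy_multiplier_range` are likewise illustrative.

```python
from typing import Optional, Tuple

# Copy number multiplier ranges relative to level 3, as given in the text.
MULTIPLIER_RANGE = {
    1: (3.0, 4.0),
    2: (2.2, 2.8),
    3: (1.0, 1.0),
    4: (2.2, 2.8),
    5: (3.0, 4.0),
}


def gc_level(gc_percent: float) -> int:
    """Map a GC percentage to levels 1-6.

    ASSUMPTION: boundary values 40% and 60% fall in level 3, 30% in level 2,
    70% in level 4; the source does not say how shared endpoints are assigned.
    """
    if gc_percent < 10 or gc_percent > 90:
        return 6
    if gc_percent < 30:
        return 1
    if gc_percent < 40:
        return 2
    if gc_percent <= 60:
        return 3
    if gc_percent <= 70:
        return 4
    return 5


def copy_multiplier_range(gc_percent: float) -> Optional[Tuple[float, float]]:
    """Return the (min, max) multiplier, or None for level 6 (use flanking baits)."""
    return MULTIPLIER_RANGE.get(gc_level(gc_percent))


if __name__ == "__main__":
    for gc in (5, 20, 35, 50, 65, 80, 95):
        print(gc, gc_level(gc), copy_multiplier_range(gc))
```

A return value of None signals that no multiplier applies and that the bait should instead be designed in the flanking replacement region.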
In an embodiment, for each target region, the bait sequence is one or more bait sequences with an optimal comprehensive score in terms of specificity, dimer, hairpin structure, and relative position to the target region. The comprehensive score is obtained by a scoring function of $S = a \times S_{\text{specificity}} + b \times S_{\text{dimer}} + c \times S_{\text{hairpin structure}} + d \times S_{\text{relative distance}}$, where a = 0.26-0.34, b = 0.08-0.12, c = 0.17-0.23, d = 0.35-0.45. Each component score, such as $S_{\text{specificity}}$, takes a value of 0-1, and the particular scoring methods are as follows:
The scoring rule for $S_{\text{specificity}}$: any newly designed bait sequence is aligned to the genome using the software BLAT with default parameters, and the thermodynamic parameter Tm for each hit sequence is calculated. If Tm between the bait sequence and the target region - Tm between the bait sequence and the non-specific region < 5°C, preferably < 10°C, the bait sequence is discarded and another bait sequence is redesigned. Otherwise, the mean of the Tms between the bait sequence and all of the hit sequences is calculated, and $S_{\text{specificity}} = 1 - Tm_{\text{mean}} / (Tm_{\text{target}} - 5)$, preferably $S_{\text{specificity}} = 1 - Tm_{\text{mean}} / (Tm_{\text{target}} - 10)$, wherein $Tm_{\text{mean}}$ is the mean of the Tms between the bait sequence and the hits in all non-specific regions, and $Tm_{\text{target}}$ is the Tm between the bait sequence and the target region.
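The specificity rule can be sketched as follows, taking precomputed Tm values as input (the alignment and Tm calculation are outside the sketch). Returning 1.0 when there are no off-target hits and clamping the result to 0-1 are assumptions; the text does not address either case.

```python
from statistics import mean
from typing import Optional, Sequence


def specificity_score(tm_target: float, off_target_tms: Sequence[float],
                      min_delta: float = 5.0) -> Optional[float]:
    """Return S_specificity, or None if the bait should be discarded.

    min_delta is 5 (or preferably 10) degrees C. A bait is discarded when any
    off-target hit has a Tm within min_delta of the on-target Tm.
    """
    if tm_target <= min_delta:
        raise ValueError("tm_target must exceed min_delta")
    if not off_target_tms:
        return 1.0  # ASSUMPTION: no off-target hits is treated as ideal
    if any(tm_target - tm < min_delta for tm in off_target_tms):
        return None
    score = 1.0 - mean(off_target_tms) / (tm_target - min_delta)
    return max(0.0, min(1.0, score))  # ASSUMPTION: clamp to the stated 0-1 range


if __name__ == "__main__":
    print(specificity_score(75.0, [40.0, 50.0]))
    print(specificity_score(75.0, [72.0]))
```

In the first call the mean off-target Tm is 45°C, so the score is 1 - 45/70; in the second the only hit is within 5°C of the target Tm and the bait is rejected.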
The scoring rule for $S_{\text{dimer}}$: any newly designed bait sequence is aligned to each of the already designed bait sequences for dimer analysis using the software BLAT with default parameters, and the thermodynamic parameter Tm for each hit is calculated. If Tm > 47°C, the bait sequence is discarded and another bait sequence is redesigned. Otherwise, the mean of the Tms for all hits is calculated, and $S_{\text{dimer}} = (47 - Tm_{\text{mean}}) / 47$. Preferably, if Tm > 37°C, the bait sequence is discarded and another bait sequence is redesigned. Otherwise, the mean of the Tms for all hits is calculated, and $S_{\text{dimer}} = (37 - Tm_{\text{mean}}) / 37$.
The scoring rule for $S_{\text{hairpin structure}}$: the optimal structure of any of the bait sequences is determined using the Smith-Waterman algorithm, and the thermodynamic parameter Tm of the structure is calculated. If Tm > 47°C, the bait sequence is discarded and another bait sequence is redesigned; otherwise $S_{\text{hairpin structure}} = (47 - Tm_{\text{mean}}) / 47$. Preferably, if Tm > 37°C, the bait sequence is discarded and another bait sequence is redesigned; otherwise $S_{\text{hairpin structure}} = (37 - Tm_{\text{mean}}) / 37$.
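The dimer and hairpin rules share one shape: reject above a Tm limit, otherwise score by the remaining margin. The sketch below works on precomputed Tm values; the helper name `threshold_score`, the score of 1.0 for an empty hit list, and the cap at 1.0 are assumptions.

```python
from statistics import mean
from typing import Optional, Sequence


def threshold_score(tms: Sequence[float], limit: float = 47.0) -> Optional[float]:
    """Return (limit - mean Tm) / limit, or None if any Tm exceeds the limit."""
    if not tms:
        return 1.0  # ASSUMPTION: no hits means no dimer or hairpin risk
    if any(tm > limit for tm in tms):
        return None  # bait is discarded and redesigned
    return min(1.0, (limit - mean(tms)) / limit)  # ASSUMPTION: cap at 1


def dimer_score(hit_tms: Sequence[float], limit: float = 47.0) -> Optional[float]:
    """S_dimer from the Tms of a new bait against already designed baits."""
    return threshold_score(hit_tms, limit)


def hairpin_score(tm: float, limit: float = 47.0) -> Optional[float]:
    """S_hairpin structure from the Tm of the bait's optimal self-structure."""
    return threshold_score([tm], limit)


if __name__ == "__main__":
    print(dimer_score([20.0, 30.0]))         # limit of 47 degrees C
    print(hairpin_score(10.0, limit=37.0))   # preferred limit of 37 degrees C
    print(hairpin_score(50.0))               # rejected
```

Passing `limit=37.0` selects the preferred, stricter threshold for either score.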
The scoring rule for $S_{\text{relative distance}}$: where the coordinate for a target region to be designed is known, the coordinate difference $\delta_{\text{Distance}}$ between any newly designed bait sequence and the target region is calculated. The acceptable coordinate difference is set to 150, which is an empirical value. If the difference is greater than 150, the bait sequence is discarded and another bait sequence is redesigned; otherwise, $S_{\text{relative distance}} = (150 - \delta_{\text{Distance}}) / 150$. If no suitable bait sequence can be designed within the coordinate difference of 150 from the target region, the coordinate difference may be set to 300, and $S_{\text{relative distance}} = (300 - \delta_{\text{Distance}}) / 300$.
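The distance score and the weighted sum can be sketched together. The default weights below (0.30, 0.10, 0.20, 0.40) are the midpoints of the stated ranges and are chosen here only for illustration; the function names are also assumptions.

```python
from typing import Optional


def relative_distance_score(distance: float, limit: float = 150.0) -> Optional[float]:
    """Return (limit - |distance|) / limit, or None if the bait is too far away.

    limit is 150 by default and may be relaxed to 300 when no bait fits.
    """
    d = abs(distance)
    if d > limit:
        return None
    return (limit - d) / limit


def comprehensive_score(s_specificity: float, s_dimer: float,
                        s_hairpin: float, s_distance: float,
                        a: float = 0.30, b: float = 0.10,
                        c: float = 0.20, d: float = 0.40) -> float:
    """Weighted sum S of the four component scores.

    ASSUMPTION: default weights are midpoints of the ranges a = 0.26-0.34,
    b = 0.08-0.12, c = 0.17-0.23, d = 0.35-0.45.
    """
    return a * s_specificity + b * s_dimer + c * s_hairpin + d * s_distance


if __name__ == "__main__":
    s_dist = relative_distance_score(75)
    print(s_dist, comprehensive_score(0.36, 0.47, 0.73, s_dist))
```

Candidate baits whose component scores are all non-None can then be ranked by `comprehensive_score`, with the highest-scoring bait or baits kept for each target region.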
In the present disclosure, the calculation of the Tm of a sequence is not limited to a specific method, and various methods for calculating a Tm value can be used. Tm values obtained by different methods do not substantially reverse the effect of the present disclosure, but only affect its degree. Although a Tm may be calculated by the Nearest-Neighbor method based on the table of SantaLucia 2007 thermodynamic parameters, Tm values calculated by other methods can be mapped to corresponding values. Those skilled in the art can compare the Tms calculated by various methods through simple tests, and thereby appropriately screen the Tms calculated by various methods.
Empirically, for coding regions of the human genome, a bait sequence suitable for the present disclosure can be designed for more than 99% of the target regions, indicating that the GC content grading and Tm value filtering described above are reasonable.
In certain embodiments, the hybridization between a nucleic acid analog and the target nucleic acid is conducted preferably under a stringent condition sufficient to enable the hybridization between the nucleic acid analog and DNA, wherein the nucleic acid analog comprises a linker compound and a region complementary to the target nucleic acid sample, to provide the nucleic acid analog/DNA hybrid complex. The complex is then captured via the linker compound and washed under a condition sufficient to remove non-specifically bound nucleic acids, and the hybridized target nucleic acid sequence is then eluted from the captured nucleic acid analog/DNA complexes.
In certain embodiments, the nucleic acid analog comprises a chemical group or a linker compound, such as a binding moiety, like biotin, digoxin, or the like, which is capable of binding to a solid support. The solid support may comprise a corresponding capture compound, such as streptavidin for biotin or a digoxin antibody for digoxin. The disclosure is not limited to the linker compound used, and alternative linker compounds are equally suitable for use in the method, bait sequences and kits of the disclosure.
In the present disclosure, the chemical group or the linker compound, such as a binding moiety like biotin, digoxin or the like, may be linked to any base in a nucleic acid analog (glycerol nucleic acid GNA, locked nucleic acid LNA, peptide nucleic acid PNA, threose nucleic acid TNA or morpholino nucleic acid). Preferably, the nucleic acid analog chain may comprise a ribose and/or deoxyribose, and the chemical group or the linker compound, such as a binding moiety like biotin, digoxin or the like, may be linked to a base in the ribose and/or deoxyribose. For example, the synthesis of the nucleic acid analogs comprises using labeled ATP, CTP, GTP and/or UTP. Methods for labeling nucleotides with Cydye, DIG, biotin, rhodamine, fluorescein, etc. are known in the art. For example, biotin can be used as a label for a nucleic acid probe; it can bind to the C atom at position 5' of UTP or dUTP in a nucleic acid molecule, and can be detected upon binding to avidin. The present invention is not limited to the known labels and labeling methods; labels and labeling methods developed in the future are also contemplated within the scope of the present invention.
In an embodiment of the disclosure, the plurality of target nucleic acid molecules preferably comprise the whole genome or at least one chromosome or a nucleic acid of any molecular size of an organism. Preferably, the nucleic acid molecule has a size of at least about 200 kb, at least about 500 kb, at least about 1 Mb, at least about 2 Mb, or at least about 5 Mb, more preferably has a size of from about 100 kb to about 5 Mb, from about 200 kb to about 5 Mb, from about 500 kb to about 5 Mb, from about 1 Mb to about 2 Mb or from about 2 Mb to about 5 Mb.
In certain embodiments, the target nucleic acid is derived from an animal, plant or microorganism, and in a preferred embodiment, the target nucleic acid molecule is derived from a human. If the amount of a nucleic acid sample is relatively low (e.g., a sample of human nucleic acid such as the genome of a developing fetus, as obtained in some cases), the nucleic acid can be amplified, such as by whole genome amplification, before the method of the disclosure is carried out. Pre-amplification may be necessary for performing the method of the disclosure, such as in forensic applications (e.g., for the purpose of genetic characterization).
In certain embodiments, the plurality of target nucleic acid molecules are a set of genomic DNA molecules. The bait sequences may be selected, for example, from a plurality of bait sequences defining a plurality of exons, introns or regulatory sequences from a plurality of genetic loci; a plurality of bait sequences defining the complete sequence of at least one single genetic locus of any size, preferably at least 1 Mb, or of at least one of the above specified sizes; a plurality of bait sequences defining a single nucleotide polymorphism (SNP); or a plurality of bait sequences defining an array, for example a chimeric array designed for capturing the complete sequence of at least one complete chromosome.
As used herein, the term “hybridize”, “hybridizing”, “hybridization” or any other grammatical form refers to the pairing between complementary nucleic acids. Hybridization and hybridization intensity (e.g., the intensity of binding between nucleic acids) are affected by a number of factors, such as the degree of complementarity between nucleic acids, the stringency of the hybridization condition used, the melting temperature (Tm) of the formed hybrid, and the GC content of the nucleic acids. Although the disclosure is not limited to a specific hybridization condition, it is preferred to use a stringent hybridization condition. The stringent hybridization condition depends on the sequence and varies as a function of hybridization parameters (e.g., salt concentration, presence of organics, etc.). Generally, the stringent condition is selected to be a temperature of about 5°C to about 20°C lower than the Tm of a particular nucleic acid sequence at a specified ionic strength and pH. Preferably, the stringent condition is a temperature of about 5°C to 10°C lower than the melting temperature of a particular nucleic acid for binding the complementary nucleic acid. The Tm is the temperature at which 50% of the nucleic acids (e.g., the target nucleic acids) hybridize to the perfectly matched probes (at a defined ionic strength and pH).
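The temperature window for a stringent condition follows directly from the Tm. A minimal sketch, assuming the Tm has already been computed for the relevant ionic strength and pH; the function name and the `preferred` flag are illustrative only.

```python
from typing import Tuple


def stringent_temperature_window(tm: float, preferred: bool = False) -> Tuple[float, float]:
    """Return the (lowest, highest) hybridization temperature in degrees C.

    General window: 5 to 20 degrees C below Tm.
    Preferred window: 5 to 10 degrees C below Tm.
    """
    lower_offset = 10.0 if preferred else 20.0
    return (tm - lower_offset, tm - 5.0)


if __name__ == "__main__":
    print(stringent_temperature_window(75.0))                  # general window
    print(stringent_temperature_window(75.0, preferred=True))  # preferred window
```

For a hybrid with a Tm of 75°C, the general window is 55-70°C and the preferred window is 65-70°C.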
As used herein, the stringent conditions may be, for example, hybridization in 50% formamide, 5 x SSC (0.75 M NaCl, 0.075 M sodium citrate), 50 mM sodium phosphate (pH 6.8), 0.1% sodium pyrophosphate, 5 x Denhardt solution, sonicated sperm DNA (50 mg/ml), 0.1% SDS, and 10% dextran sulfate at 42°C, washing in 0.2 x SSC (sodium chloride / sodium citrate) at 42°C and in 50% formamide at 55°C, and subsequent washing in 0.1 x SSC containing EDTA at 55°C. As a further example, a buffer containing 35% formamide, 5 x SSC, and 0.1% (w/v) sodium dodecyl sulfate (SDS) is expected to be suitable for hybridization under a moderately non-stringent condition at 45°C for 16-72 hours.
As used herein, the term primer refers to an oligonucleotide, obtained either by purification and enzyme digestion of a naturally occurring source or by a synthetic method. The primer can serve as an initiation site for synthesis when placed under conditions that induce synthesis of a primer extension product complementary to a nucleic acid strand (for example, in the presence of nucleotides and an inducing agent such as DNA polymerase at a suitable temperature and pH). The primer is preferably single-stranded for the highest amplification efficiency. Preferably, the primer is an oligodeoxynucleotide. The primer must be sufficiently long to initiate synthesis of the extension product in the presence of the inducing agent(s). The exact length of the primer depends on many factors, including temperature, source of primer and the method used.
As used herein, the term bait or bait sequence refers to an oligonucleotide (e.g., a nucleotide sequence), obtained either by purification and enzyme digestion of naturally occurring sources or alternatively by synthesis, recombination or PCR amplification. It can hybridize to at least a portion of another target oligonucleotide, such as a target nucleic acid sequence. The probe can be single-stranded or double-stranded. The probe can be used for detection, identification and isolation of a specific gene sequence.
As used herein, the term target nucleic acid molecule refers to a molecule or sequence from the target genomic region. The preselected probe defines the target nucleic acid molecule. Thus, the word target is intended to distinguish it from other nucleic acid sequences. One segment is defined as one nucleic acid region in the target sequence, such as one segment or one portion of the nucleic acid sequence.
As used herein, the term “isolate”, “isolating”, “isolated”, “isolation” or any other grammatical form, when used in reference to a nucleic acid, such as in the expression isolate a nucleic acid, refers to the identification and separation of the nucleic acid sequence from at least one other component or contaminant with which it is normally associated in its natural source. An isolated nucleic acid exists in a form different from its naturally occurring form.
In contrast, un-isolated nucleic acids such as DNA and RNA are present in their naturally occurring form. The isolated nucleic acid, oligonucleotide or polynucleotide may be present in a single-stranded form or in a double-stranded form.
As used herein, the expression a bait sequence that is identical to the target nucleic acid sequence refers to a sequence whose complementary sequence can hybridize to the target nucleic acid sequence. Preferably, the hybridization is conducted under a stringent condition. When the target region has a very high or very low GC content, or when the target region has low complexity, a bait sequence cannot be designed for the region, that is, the bait sequence coverage is zero; in that case, the regions flanking the target region may be searched for a suitable region for which to design a bait sequence. Generally, a bait sequence is designed for a region within 300 bp, preferably within 150 bp, of either end of the target region.
In an embodiment of the disclosure, the transcription primer for a bait sequence used in the capture methods and the kits described herein comprises a linker compound, such as a binding moiety. The binding moiety comprises any moiety linked to or introduced at the 5’ end of the amplification primer for subsequently capturing the nucleic acid analog/target nucleic acid hybrid complex. The binding moiety may be any sequence introduced at the 5' end of the primer sequence for amplification, such as a 6×histidine (6HIS) sequence capable of being captured. For example, a primer comprising a 6HIS sequence can be captured by nickel, such as in tubes, micropores or purification columns which are nickel-coated or which contain nickel-coated beads, particles, etc., wherein the beads are packed into the column and the sample is loaded into and passed through the column, so as to capture complexes with reduced complexity (and subsequently elute the target, for example). Another example of the binding moiety for use in embodiments of the disclosure is a hapten, such as digoxin, ligated to, for example, the 5’ end of the amplification primer. Digoxin can be captured by an anti-digoxin antibody, such as by a substrate coated with or comprising an anti-digoxin antibody.
In certain embodiments, the binding moiety is biotin, and streptavidin is used to coat the capture matrix, such as beads, e.g. paramagnetic particles, in order to isolate the target nucleic acid/transcription product complex from the non-specifically hybridized target nucleic acids. For example, when biotin is the binding moiety, a streptavidin (SA) coated matrix, such as SA coated beads (e.g., magnetic beads/particles), is used to capture the biotin-labeled nucleic acid analog/target complexes. The SA-bound complex is washed, and the hybridized target nucleic acid is eluted from the complex and sequenced.
The bait sequences corresponding to at least one region of the genome can be synthesized in parallel on a solid support using the maskless array synthesis technique. Alternatively, the probes can be synthesized serially using a standard DNA synthesizer and applied to the solid support, or can be obtained from an organism and fixed to the solid support. After hybridization, the nucleic acids that do not hybridize or that hybridize non-specifically to the nucleic acid analogs are separated by washing from the nucleic acid analogs bound to the support. The remaining nucleic acids, which are specifically bound to the nucleic acid analogs, are eluted from the solid support, for example, in hot water or in a nucleic acid elution buffer containing, for example, TRIS buffer and/or EDTA, to produce an eluate in which the target nucleic acid molecules are enriched.
Alternatively, the bait sequences for a target molecule can be synthesized on a solid support as described above, released from the solid support as a collection of bait sequences, and amplified. The collection of the released, transcribed nucleic acid analogs can be covalently or non-covalently immobilized to a support, such as glass, metal, ceramic, or polymeric beads or other solid supports. The nucleic acid analog can be designed to be conveniently released from the solid support, for example by providing an acid- or base-labile nucleic acid sequence at or near the end of the nucleic acid analog closest to the support, which enables the release of the nucleic acid analog at a low or high pH, respectively. A variety of cleavable linker compounds are known in the art.
The support can be provided, for example, as a cylinder having a liquid inlet and outlet. Methods of immobilizing a nucleic acid to a support are well known in the art, for example, incorporating biotin-labeled nucleotides into the nucleic acid analog and coating the support with streptavidin, whereby the coated support non-covalently attracts and immobilizes the nucleic acid analogs in the collection. The sample passes through the support comprising the nucleic acid analogs under hybridization conditions, whereby the target nucleic acid molecules that hybridize to the immobilized nucleic acid analogs can be eluted for subsequent analysis or other use.
The term nucleic acid may include, for example, but is not limited to, deoxyribonucleic acid (DNA), ribonucleic acid (RNA), and artificial nucleic acids such as peptide nucleic acid (PNA), morpholino and locked nucleic acid (LNA), glycol nucleic acid (GNA) and threose nucleic acid (TNA). As used herein, the term nucleic acid, nucleic acid sequence or nucleic acid molecule shall be used in its broad meaning, and, for example, may refer to an oligomer or polymer of ribonucleic acid (RNA) or deoxyribonucleic acid (DNA) or mimics thereof. The term comprises molecules constructed from natural nucleobases, saccharides, and covalent internucleoside linkages (backbone), and molecules with similar functions constructed from non-natural nucleobases, saccharides, and covalent internucleoside linkages (backbone), or combinations thereof. For their desired properties, such as enhanced affinity for target nucleic acid molecules and increased stability in the presence of nucleases and other enzymes, such modified or substituted nucleic acids may be preferred over the native form, and are referred to herein by the term nucleic acid analog or nucleic acid mimic. Preferred examples of the nucleic acid mimic are molecules comprising peptide nucleic acid (PNA), locked nucleic acid (LNA), xylo-locked nucleic acid (xylo-LNA), thiophosphoric acid, 2'-methoxy, 2'-methoxyethoxy, morpholino nucleic acid and phosphoramidate, or nucleic acid derivatives with similar function.
Example
Example 1: Design of bait sequences
1000 loci on the exons and introns of the human genome were randomly selected (the distribution of these loci is shown in the table below) for testing the method of the present disclosure. Bait sequences were designed for these 1000 random target sequences for subsequent testing.
Table 1: Chromosome distribution of the randomly selected 1000 loci
Chromosome    Number of loci    Chromosome    Number of loci
chr1          92                chr12         73
chr2          67                chr13         23
chr3          53                chr14         15
chr4          43                chr15         29
chr5          45                chr16         41
chr6          124               chr17         36
chr7          42                chr18         14
chr8          46                chr19         31
chr9          34                chr20         21
chr10         61                chr21         9
chr11         80                chr22         21
The design of bait sequences comprised the following steps:
1. Firstly, the characteristic analysis of the target sequences was conducted, which comprised the following sub-steps of:
a) dividing the target sequences into 5 levels from low to high according to the GC content, namely level 1: 10%-30%, level 2: 30%-40%, level 3: 40%-60%, level 4: 60%-70%, and level 5: 70%-90%; and
b) analyzing the target sequences for spatial structure, and marking the target sequences that can form stable spatial structure.
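The GC-content classification of sub-step a) can be sketched as follows. This is a minimal illustration, not the inventors' software: the function names are hypothetical, and because the stated ranges share their endpoints (e.g., 30% closes level 1 and opens level 2), the sketch assumes that each shared boundary belongs to the higher level.

```python
def gc_content(sequence):
    """Return the GC content of a nucleotide sequence as a percentage."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    gc = sequence.count("G") + sequence.count("C")
    return 100.0 * gc / len(sequence)


def gc_level(gc_percent):
    """Map a GC percentage to level 1-5; return None outside 10%-90%.

    Assumption: shared boundaries (30, 40, 60, 70) fall in the higher level.
    """
    if gc_percent < 10 or gc_percent > 90:
        return None
    if gc_percent < 30:
        return 1
    if gc_percent < 40:
        return 2
    if gc_percent < 60:
        return 3
    if gc_percent < 70:
        return 4
    return 5


# Example with a made-up 10-base sequence (50% GC, hence level 3)
print(gc_level(gc_content("ATGCGCATTA" "")))
```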
2. Secondly, criteria were set for the bait sequences and the bait sequences were scored, as follows:
a) The length of a bait sequence is set within the range of 60-150 bp.
b) Specificity is maintained. The principle of such specificity is that the bait sequence has a thermodynamic stability that is significantly weaker when binding to a non-target region than when binding to the target region. The general analysis index is set as: (Tm between the bait sequence and the target region) - (Tm between the bait sequence and a non-specific region) > 5°C; for some data, a difference > 10°C is taken for comparison (strict specificity cutoff). Different methods for calculating thermodynamics have a great effect on the calculation results; the Nearest-Neighbor method based on the table of SantaLucia 2007 thermodynamic parameters was used herein.
c) No secondary structure forms, wherein the secondary structure includes dimers and hairpin structures. That is, the designed bait sequences do not form any stable dimer or hairpin structure. The dimers formed between any two bait sequences have Tms of ≤ 47°C, and for some data, Tm ≤ 37°C is used for comparison (strict dimer cutoff). The hairpin structures formed by any one bait sequence itself have Tms of ≤ 47°C, and for some data, Tm ≤ 37°C is used for comparison (strict hairpin cutoff). Different methods for calculating thermodynamics have a great effect on the calculation results; the Nearest-Neighbor method based on the table of SantaLucia 2007 thermodynamic parameters was used here.
d) The candidate bait sequences are analyzed for each target region, and an overall score is calculated for each candidate sequence according to its specificity, dimer, hairpin structure and relative position to the target region; one or more bait sequences with the highest overall scores (i.e., for which the scoring function has the highest value) are then selected according to the scoring results. The scoring function is $S = a \times S_{\text{specificity}} + b \times S_{\text{dimer}} + c \times S_{\text{hairpin structure}} + d \times S_{\text{relative distance}}$, wherein a = 0.26-0.34, b = 0.08-0.12, c = 0.17-0.23, and d = 0.35-0.45. The scoring is performed by in-house software, and the principles are as follows:
The scoring rule for $S_{\text{specificity}}$: Any newly designed bait sequence is aligned to the genome using the software BLAT with default parameters, and the thermodynamic parameter Tm for each hit is calculated. If (Tm between the bait sequence and the target region) - (Tm between the bait sequence and the non-specific region) < 5°C, the bait sequence is discarded and another bait sequence is redesigned (for some data, a difference < 10°C is taken for comparison); otherwise, the mean of the Tms for all hits is calculated, and finally $S_{\text{specificity}} = 1 - Tm_{\text{mean}} / (Tm_{\text{target}} - 5)$, or, for the data using the 10°C cutoff, $S_{\text{specificity}} = 1 - Tm_{\text{mean}} / (Tm_{\text{target}} - 10)$, wherein $Tm_{\text{mean}}$ is the mean of the Tms between the bait sequence and all hits in the non-specific regions, and $Tm_{\text{target}}$ is the Tm between the bait sequence and the target region.
The scoring rule for $S_{\text{dimer}}$: Any newly designed bait sequence is aligned to each of the already designed bait sequences for dimer analysis using the software BLAT with default parameters, and the thermodynamic parameter Tm for each hit is calculated. If Tm > 47°C, the bait sequence is discarded and another bait sequence is redesigned; otherwise, the mean of the Tms for all hits is calculated, and finally $S_{\text{dimer}} = (47 - Tm_{\text{mean}}) / 47$. For some data, Tm ≤ 37°C is taken for comparison: if Tm > 37°C, the bait sequence is discarded and another bait sequence is redesigned; otherwise, the mean of the Tms for all hits is calculated, and $S_{\text{dimer}} = (37 - Tm_{\text{mean}}) / 37$.
The scoring rule for $S_{\text{hairpin structure}}$: The optimal local alignment of any bait sequence with itself is calculated using the Smith-Waterman algorithm, and the thermodynamic parameter Tm of the resulting structure is calculated. If Tm > 47°C, the bait sequence is discarded and another bait sequence is redesigned; otherwise, $S_{\text{hairpin structure}} = (47 - Tm) / 47$. For some data, Tm ≤ 37°C is taken for comparison: if Tm > 37°C, the bait sequence is discarded and another bait sequence is redesigned; otherwise, $S_{\text{hairpin structure}} = (37 - Tm) / 37$.
The scoring rule for $S_{\text{relative distance}}$: Where the coordinate of a target region to be designed for is known, the coordinate difference $\delta_{\text{distance}}$ between any newly designed bait sequence and the target region is calculated. The acceptable coordinate difference is set to 150, which is an empirical value. If the difference is greater than 150, the bait sequence is discarded and another bait sequence is redesigned; otherwise, $S_{\text{relative distance}} = (150 - \delta_{\text{distance}}) / 150$. If no suitable bait sequence can be designed within the coordinate difference of 150 from the target region, the acceptable coordinate difference may be set to 300 for comparison, and $S_{\text{relative distance}} = (300 - \delta_{\text{distance}}) / 300$.
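The four scoring rules and the weighted scoring function can be illustrated with the following minimal Python sketch. It is not the in-house software referred to above. All function names are hypothetical, the Tm values are assumed to have been computed beforehand (e.g., by a nearest-neighbor method), the default weights are simply the midpoints of the stated ranges for a, b, c and d, and treating "no hits" as a score of 1.0 is an assumption of this sketch. A return value of None means that the bait sequence is discarded and must be redesigned.

```python
def _mean(values):
    return sum(values) / len(values)


def specificity_score(tm_target, off_target_tms, margin=5.0):
    """S_specificity = 1 - Tm_mean / (Tm_target - margin); margin is 5 or 10."""
    if not off_target_tms:
        return 1.0  # assumption: no non-specific hits gives the best score
    if any(tm_target - tm < margin for tm in off_target_tms):
        return None  # too close to the target Tm: discard and redesign
    return 1.0 - _mean(off_target_tms) / (tm_target - margin)


def dimer_score(dimer_tms, cutoff=47.0):
    """S_dimer = (cutoff - Tm_mean) / cutoff; cutoff is 47 or 37."""
    if not dimer_tms:
        return 1.0  # assumption: no dimer hits gives the best score
    if any(tm > cutoff for tm in dimer_tms):
        return None
    return (cutoff - _mean(dimer_tms)) / cutoff


def hairpin_score(hairpin_tm, cutoff=47.0):
    """S_hairpin structure = (cutoff - Tm) / cutoff; cutoff is 47 or 37."""
    if hairpin_tm > cutoff:
        return None
    return (cutoff - hairpin_tm) / cutoff


def distance_score(distance, max_distance=150.0):
    """S_relative distance = (max - distance) / max; max is 150 or 300."""
    if distance > max_distance:
        return None
    return (max_distance - distance) / max_distance


def overall_score(s_spec, s_dimer, s_hairpin, s_distance,
                  weights=(0.30, 0.10, 0.20, 0.40)):
    """S = a*S_specificity + b*S_dimer + c*S_hairpin + d*S_distance."""
    parts = (s_spec, s_dimer, s_hairpin, s_distance)
    if any(part is None for part in parts):
        return None  # at least one criterion failed: bait is discarded
    return sum(w * part for w, part in zip(weights, parts))


# Example with made-up Tm values (deg C) and a made-up coordinate difference
score = overall_score(
    specificity_score(80.0, [50.0, 60.0]),
    dimer_score([20.0, 27.0]),
    hairpin_score(30.0),
    distance_score(40),
)
print(score)
```

Candidate bait sequences for one target region would then be ranked by this value, and the one or more highest-scoring candidates retained.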
3. Thirdly, the copy number of the bait sequences was compensated according to the specific target region:
a) According to the stability classification of the target sequence, the copy number of the bait sequences at level 3 is set as the baseline copy number (i.e., a baseline of 1); the bait sequences at levels 1 and 5 require an increased copy number, namely 3.5 times the copy number at level 3, and the bait sequences at levels 2 and 4 also require an increased copy number, namely 2.5 times the copy number at level 3.
b) For a target sequence forming a stable spatial structure, the copy number of the bait sequences is doubled.
c) Where the target sequence might be a region of great interest, for example, a region where a fusion event occurs, the copy number of the bait sequences is doubled.
d) In addition, parallel experiments in which the copy numbers of the bait sequences were not compensated were conducted under the same conditions as a control.
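The compensation rules of sub-steps a) to c) can be summarized in a short Python sketch. The function name and arguments are hypothetical. The text does not state how the factors combine when several conditions apply to the same target, so multiplying them is an assumption of this sketch.

```python
# Copy-number factor relative to level 3 (baseline of 1), per sub-step a)
LEVEL_FACTOR = {1: 3.5, 2: 2.5, 3: 1.0, 4: 2.5, 5: 3.5}


def bait_copy_multiplier(gc_level, stable_structure=False,
                         region_of_interest=False):
    """Return the bait copy number relative to the level-3 baseline."""
    factor = LEVEL_FACTOR[gc_level]
    if stable_structure:      # sub-step b): doubled
        factor *= 2
    if region_of_interest:    # sub-step c): doubled
        factor *= 2
    # Assumption: factors are multiplied when more than one rule applies.
    return factor


# Example: a level-1 target that also forms a stable spatial structure
print(bait_copy_multiplier(1, stable_structure=True))
```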
4. Finally, where no probe can be designed for the target sequence, for example, where the target region has a very high or very low GC content, or where the target region contains a low complexity region (a region composed of only a few types of elements, e.g., oligonucleotide repeats such as simple sequence repeats or microsatellites), the bait sequence coverage for the region is zero, and a suitable region for bait design is sought within the regions flanking the target region. Generally, the bait sequence is designed for a region within 300 bp of the left and right ends of the target region. If a suitable bait sequence can be designed for a region within 150 bp, the bait sequence is recorded as a control. In this example, 138 of the randomly selected target sequences fell into this situation: for 68 of them, bait sequences were successfully designed within 150 bp of the flanking region; for 22 of them, bait sequences were successfully designed within 150-300 bp of the flanking region; and for the remaining 48, no probes could be designed.
5. The finally designed bait sequences are shown in Table 2.
Table 2: An overview of the design of the bait sequences
Comparative conditions    Number of bait sequences
Strict specificity cutoff, strict dimer cutoff, strict hairpin structure cutoff and strict scoring function cutoff    9800
Regions within 150 bp    68
Regions within 150-300 bp    22
Regions for which no probe could be designed    48
The criteria for the strict scoring function cutoff are as follows: (Tm between the bait sequence and the target region) - (Tm between the bait sequence and the non-specific region) > 10°C, $S_{\text{specificity}} = Tm_{\text{mean}} / 37$; Tm < 37°C, $S_{\text{dimer}} = (37 - Tm_{\text{mean}}) / 37$; Tm < 37°C, $S_{\text{hairpin structure}} = (37 - Tm_{\text{mean}}) / 37$.
Example 2: Preparation of the bait sequences
The sequences were prepared according to the bait sequences designed in Example 1. The method for preparing a bait sequence was as follows:
1. A specific sequence of 20 bases was added to each of the 5' and 3' ends of the bait sequence. The design principles for the specific sequences are as follows: 1) no non-specific amplification products are generated from the target genome (to be captured); 2) the GC content is between 30% and 70%, preferably between 40% and 60%; and 3) no dimer forms from any two of the specific sequences, or the formed dimer has a melting temperature (Tm) < 47°C, preferably < 37°C. Thus, the sequence to be synthesized was formed. All bait sequences used the same pair of specific sequences, and an example of the specific sequences was as follows: 5'-end specific sequence - bait sequence (from 60 bp to 150 bp) - 3'-end specific sequence, shown as (SEQ ID NO. 1):
ATATAGATGCCGTCCTAGCG-NNNNNNNNNN...NNNNNNNNNN-TGGGCACAGGAAAGATACTT, wherein NNNNNNNNNN...NNNNNNNNNN represents the bait sequence.
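The assembly of the sequence to be synthesized from the example specific sequences of SEQ ID NO. 1 can be sketched in Python as follows. The function name and the input checks are assumptions made for illustration; the 60-150 bp bound is the bait length range stated above.

```python
FIVE_PRIME_SPECIFIC = "ATATAGATGCCGTCCTAGCG"   # 20 bases, from SEQ ID NO. 1
THREE_PRIME_SPECIFIC = "TGGGCACAGGAAAGATACTT"  # 20 bases, from SEQ ID NO. 1


def build_synthesis_sequence(bait):
    """Flank a 60-150 nt bait sequence with the two specific sequences."""
    bait = bait.upper()
    if not 60 <= len(bait) <= 150:
        raise ValueError("bait length must be within 60-150 bases")
    if set(bait) - set("ACGT"):
        raise ValueError("bait must contain only A, C, G and T")
    return FIVE_PRIME_SPECIFIC + bait + THREE_PRIME_SPECIFIC


# Example with a made-up 60-base bait: the result is 100 bases long
oligo = build_synthesis_sequence("ACGT" * 15)
print(len(oligo))
```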
2. The specific sequence was generated by the liquid phase hybridization capture sequencing probe design software developed by the inventors themselves.
3. Oligonucleotides were synthesized on a large scale from the sequence to be synthesized using a chip-based method well known in the art, and then the oligonucleotides were eluted from the chip with ammonia water and, upon purification, were dissolved in double distilled water, so as to form an oligonucleotide pool.
4. The polymerase chain reaction was conducted for amplification with the oligonucleotide pool as template, 5' end primer and 3' end primer respectively complementary to the 5' end specific sequence and the 3' end specific sequence as primers, and Taq polymerase (JumpStart Taq DNA Polymerase, purchased from Sigma, Catalog No. D6558), so as to obtain a double-stranded DNA pool in high yield. The specific steps were as follows:
1) The reaction system was as follows:
Reagents                          Volume
Water                             37 μl
10 x PCR Buffer                   5 μl
10 mM dATP                        1 μl
10 mM dCTP                        1 μl
10 mM dGTP                        1 μl
10 mM TTP                         1 μl
5’ end primer (10 μM)             1 μl
3’ end primer (10 μM)             1 μl
JumpStart Taq DNA Polymerase      1 μl
Oligonucleotide pool              1 μl
2) The reaction conditions were as follows:
Temperature    Time       Cycles
94°C           1 min      1
94°C           30 s       15 (94°C / 68°C / 72°C steps)
68°C           30 s
72°C           1 min
72°C           1 min      1
4°C            Holding    1
3) The PCR product was purified using a QIAGEN PCR Purification Kit (QIAGEN, Cat No./ID 28104) according to the operating instructions.
4) The polymerase chain reaction was carried out for amplification with the 5' end primer having a T7 sequence (TAATACGACTCACTATAGGG) at its 5' end as the forward primer, the 3' end primer as the reverse primer, and Taq polymerase (JumpStart Taq DNA Polymerase, purchased from Sigma, Catalog No. D6558), so as to obtain a double-stranded DNA pool with the T7 sequence at the 5' end. The operation was as follows:
5) The reaction system was as follows:
Reagents                          Volume
Water                             37 μl
10 x PCR Buffer                   5 μl
10 mM dATP                        1 μl
10 mM dCTP                        1 μl
10 mM dGTP                        1 μl
10 mM TTP                         1 μl
BAITS 5 PRIMER N-T7 (10 μM)       1 μl
BAITS 3 PRIMER N (10 μM)          1 μl
JumpStart Taq DNA Polymerase      1 μl
Oligonucleotide pool              1 μl
6) The reaction conditions were as follows:
Temperature    Time       Cycles
94°C           1 min      1
94°C           30 s       4 (94°C / 68°C / 72°C steps)
68°C           30 s
72°C           1 min
72°C           1 min      1
4°C            Holding    1
The PCR reaction products obtained from the previous step were separated by gel electrophoresis to remove non-specific bands, and the fragments within 120-210 bp were recovered and purified by Qiagen Gel Extraction Kit (Cat No./ID 28704).
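The 120-210 bp recovery window is consistent with the lengths given above, as the following arithmetic sketch in Python shows: a 60-150 bp bait sequence, two 20-base specific sequences, and the 20-base T7 sequence. The function name and parameters are illustrative assumptions.

```python
T7_SEQUENCE = "TAATACGACTCACTATAGGG"  # T7 sequence added to the forward primer


def expected_product_range(bait_min=60, bait_max=150, specific_len=20,
                           promoter=T7_SEQUENCE):
    """Return the (shortest, longest) expected T7-tagged product in bp."""
    added = 2 * specific_len + len(promoter)  # two flanks plus T7 sequence
    return (bait_min + added, bait_max + added)


print(expected_product_range())
```

With the default values this gives a range of 120 to 210 bp, matching the fragment sizes recovered from the gel.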
7) The gel-recovered and purified products from the previous step were transcribed in vitro with NTPs of nucleic acid analogs (glycol nucleic acid GNA, locked nucleic acid LNA, peptide nucleic acid PNA, threose nucleic acid TNA or morpholine nucleic acid) and biotin-labeled UTP as the substrates by using a T7 High Yield RNA Transcription Kit (Vazyme, TR101-01/02), so as to prepare a biotin-labeled nucleic acid analog pool:
Reagents    Volume (μl)
ATP analogs (GNA, LNA, PNA, TNA or morpholine nucleic acid, 10 mM)    2
CTP analogs (GNA, LNA, PNA, TNA or morpholine nucleic acid, 10 mM)    2
GTP analogs (GNA, LNA, PNA, TNA or morpholine nucleic acid, 10 mM)    2
UTP analogs (GNA, LNA, PNA, TNA or morpholine nucleic acid, 10 mM)    1.6
Biotin-UTP (1 mM)    3
10x Buffer    2
Reaction buffer (10x)    2
Gel-recovered and purified product comprising T7 sequence from the previous step    5.4
The reaction was incubated at 37°C for 8-12 hours to obtain the nucleic acid analog pool in the highest yield; after purification, the pool was diluted to 500 ng/μl and stored in a -80°C freezer.
Also, the parallel experiments under the same conditions were conducted with standard nucleic acids ATP, CTP, GTP and UTP and Biotin-UTP as the control.
Example 3: Target Region Library Capture
1. Preparation of DNA library for high throughput capture sequencing:
1) 1 μg of genomic DNA from the tested species was randomly fragmented into small fragments of 150-250 bp using a Bioruptor Pico sonicator; and
2) A pre-capture small-fragment library was prepared using the Illumina TruSeq DNA library preparation kit.
2. The target region library was captured through hybridization with the prepared nucleic acid analog pool and a small-fragment library from the target species:
1) Preparation for blocking primers:
Primer sequence (5’ to 3’)    Primer name
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT (SEQ ID NO.2)    MP 1.0
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT (SEQ ID NO.3)    Anti MP 1.0
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT (SEQ ID NO.4)    MP2.0-UN
AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (SEQ ID NO.5)    Anti MP2.0-UN
caagcagaagacggcatacgagat (SEQ ID NO.6)    MP2.0-I
atctcgtatgccgtcttctgcttg (SEQ ID NO.7)    Anti MP2.0-I
The primers were synthesized according to the above primer sequences at 100 OD for each primer, and each was diluted to 1000 μM and mixed in equal volume. The resultant mixture was marked as Block 1.
2) cot-1 DNA and salmon sperm DNA were each diluted to 100 ng/μl, and mixed in equal volume. The resultant mixture was marked as Block 2.
3) 6 μl of Block 1 and 5 μl of Block 2 were mixed, and the resultant mixture was marked as Block Mix.
4) 1 μg of the small-fragment genomic library was mixed with 11 μl of Block Mix and concentrated to 9 μl using a cryo-freeze-drying centrifuge. The resultant concentrate was marked as Reagent S1, and kept on ice for further use.
6) 20 μl of the hybridization solution (20x SSPE, 2x Denhardt’s, 1 mM EDTA, and 1% SDS) was preheated on a 65°C metal bath, and was marked as S2.
7) 2 μl of the nucleic acid analog pool at 500 ng/μl was added into 5 μl of pure water, and mixed by gentle pipetting several times. The resultant mixture was marked as S3, and kept on ice for further use.
8) The parameters of the PCR instrument were set to 95°C for 5 min, 65°C for 16 h and 65°C for holding, with the hot lid at 105°C.
9) S1 was put in the PCR instrument module, and the PCR program was started; after the program had run at 65°C for 5 min, S2 was put into the PCR instrument module and incubated for another 5 min; and then S3 was put into the PCR instrument module and incubated for another 2 min.
10) 13 μl of S2 was transferred to S3 with a pipette set to 13 μl, 9 μl of S1 was transferred to S3, and the resultant mixture was mixed thoroughly by gentle pipetting several times. The tube cap was sealed, the PCR hot lid was closed, and the incubation was performed for 16 hours for probe-library hybridization.
11) 50 μl of Dynabeads MyOne Streptavidin T1 (Invitrogen, Catalog No. 65601) was put in a 1.5 ml low adsorption centrifuge tube, and 200 μl of the binding solution [0.5 M NaCl (Ambion, Catalog No. AM9760G), 2 mM Tris-HCl, pH 8.0 (Ambion, Catalog No. AM9855G) and 0.2 mM EDTA (Ambion, Catalog No. AM9260G)] was added. The resultant mixture was mixed by pipetting and put in a magnetic separator for 1 min, and then the supernatant was removed.
12) The centrifuge tube was removed from the magnetic separator, and then 200 μl of the binding solution was added. The mixture was mixed by pipetting and put in a magnetic separator for 1 min, and then the supernatant was removed.
13) Step 11) was repeated twice, i.e. the magnetic beads were washed three times in total, and finally the magnetic beads were re-suspended with 200 μl of the binding solution.
14) The hybridization mixture of probes and library (the product of step 9)) was transferred into the magnetic bead re-suspension solution. The tube cap was sealed, and the resultant mixture was mixed in a rotary mixer for 30 min for binding.
15) The centrifuge tube was put in the magnetic separator for 2 min, and the supernatant was removed.
16) The centrifuge tube was removed from the magnetic separator, and 200 μl of washing solution 1 [10 x SSC (Ambion, Catalog No. AM9763), and 1% SDS (Invitrogen, Catalog No. 24730020)] was added to re-suspend the magnetic beads. The tube cap was sealed, and the centrifuge tube was placed in a rotary mixer for washing for 10 min.
17) The centrifuge tube was placed in the magnetic separator for 2 min, and then the supernatant was removed.
18) The centrifuge tube was removed from the magnetic separator, and 200 μl of washing solution 2 [1 x SSC (Ambion, Catalog No. AM9763), and 5% SDS (Invitrogen, Catalog No. 24730020)], pre-heated at 65°C, was added to re-suspend the magnetic beads. The centrifuge tube was placed in the PCR instrument module for incubation at 65°C for 10 min.
19) The centrifuge tube was placed in a magnetic separator for 2 min, and then the supernatant was removed.
20) Steps 17) and 18) were repeated twice, i.e., washing was conducted 3 times in total.
21) 200 μl of 80% ethanol solution was added to the centrifuge tube and left to stand for 30 s, and then the ethanol was completely removed. The magnetic beads were dried at room temperature for 2 min, and then re-suspended by adding 20 μl of pure water and gently pipetting several times.
3. PCR enrichment of the target region capture product using the NEB High Fidelity PCR Kit (Phusion® High-Fidelity PCR Kit, New England Biolabs, Catalog #E0553S):
1) The reaction system was as follows:
Reagent name                                Volume
5X Phusion HF                               10 μl
10 mM dNTPs                                 1 μl
Post Primer Mix (each 10 μM)                1 μl
Re-suspended magnetic beads (step 20)       20 μl
Phusion DNA polymerase                      0.5 μl
H2O                                         17.5 μl
2) The reaction conditions were as follows:
Temperature    Time       Cycles
98°C           1 min      1
98°C           15 s       12 (98°C / 65°C / 72°C steps)
65°C           30 s
72°C           30 s
72°C           5 min      1
4°C            Holding    1
3) The PCR product was purified using Beckman Agencourt AMPure XP Kit [Beckman (p/n A63880)].
4) The target region capture library was sequenced by high-throughput sequencing on the Illumina sequencing platform; the PE150 mode was recommended for the sequencing read length.
3. Results
1) The sequencing library was sequenced with the Illumina high-throughput sequencer HiSeq 4000, and the sequencing data of the 1000 loci were obtained.
2) The sequencing data were aligned to the human reference genome HG19 using the BWA MEM software with the following parameters: bwa mem -M -k 40 -t 8 -R @RG\tID:Hiseq\tPL:Illumina\tSM:sample, thereby obtaining the single nucleotide polymorphisms, insertions or deletions that differ from the reference genome, i.e., the detected gene mutations.
3) The data size, alignment rate, repetition rate and quality value were calculated with the samtools stats tool in the software samtools-1.2, and then the sequencing depth of each locus in the target regions was calculated with the samtools depth tool in the same software.
4) According to the sequencing depth of each locus in the target regions, the numbers of bases with sequencing depth ≥1, ≥4, ≥10 and ≥20 were counted, respectively, and then each number was divided by the total number of bases in the target regions, thereby obtaining the parameters of 1x coverage, 4x coverage, 10x coverage, and 20x coverage.
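The coverage calculation of step 4) can be sketched in Python as follows. The sketch assumes tab-separated input in the three-column form produced by samtools depth (reference name, position, depth), already restricted to the target regions. The function name is hypothetical, and comparing with `depth >= threshold` is an assumption about how the thresholds are applied. The total number of target bases is passed in explicitly because positions without coverage may be absent from the depth output.

```python
def coverage_fractions(depth_lines, total_target_bases,
                       thresholds=(1, 4, 10, 20)):
    """Return {threshold: fraction of target bases with depth >= threshold}."""
    counts = {threshold: 0 for threshold in thresholds}
    for line in depth_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        depth = int(fields[2])
        for threshold in thresholds:
            if depth >= threshold:
                counts[threshold] += 1
    return {t: counts[t] / total_target_bases for t in thresholds}


# Example with four made-up target positions
example = ["chr1\t100\t25", "chr1\t101\t5", "chr1\t102\t1", "chr1\t103\t0"]
print(coverage_fractions(example, total_target_bases=4))
```

In practice the lines would be read from a depth file, for example by passing an open file object as `depth_lines`.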
Table 3: The capture sequencing results for 1000 loci
Sequencing results for 1000 loci, by detection method
Parameter                                          LNA       GNA       PNA       TNA       Morpholine nucleic acid
Coverage of the target region                      95.09%    94.29%    93.07%    94.36%    92.17%
Average depth of the target region                 451.53    420.02    428.49    430.83    410.09
Ratio of the target region with 4x coverage        94.35%    93.12%    92.47%    92.45%    90.28%
Ratio of the target region with 10x coverage       94.25%    92.98%    92.03%    91.72%    89.35%
Ratio of the target region with 20x coverage       93.64%    92.02%    90.87%    90.34%    87.34%
As can be seen from Table 3 above, taking LNA as an example, it has an average depth of 451.53x, a 4x coverage rate of 94.35%, and a 20x coverage rate of 93.64%, and thus has good coverage and uniformity, while the total data volume is only 8.52 Mb of reads. Such results have the beneficial effects of: 1) a small sequencing amount and effectively reduced cost; 2) a high average sequencing depth, that is, each target site is sequenced multiple times, giving high data accuracy; 3) high coverage and few missing sites; and 4) good uniformity, that is, most sites have similar coverage depths.
According to the analysis of the data subsets for comparison and the control data, as compared with the LNA, when the copy number of the bait sequence was not compensated, the coverage and uniformity decreased by 4.5 and 5.1 percentage points, respectively. When a strict specificity cutoff, a strict hairpin structure cutoff and a strict scoring function cutoff were applied, the coverage and uniformity increased by 6.3 and 7.8 percentage points, respectively. For the region within 150bp and the region within 150-300bp, the coverage and uniformity increased by 2.3 and 3.8 percentage points, respectively. For the parallel experiments conducted with the same amounts of standard nucleic acids ATP, CTP, GTP and UTP, and Biotin-UTP, the coverage and uniformity decreased by 5.3 and 4.8 percentage points, respectively.
Although the present disclosure has been described in connection with the preferred embodiments, it is understood that the scope of the invention is not limited to the embodiments described herein. In view of the description and practice of the present disclosure as disclosed herein, other embodiments of the disclosure will be easily envisaged and understood by those skilled in the art. The description and the examples are to be considered as illustrative only, and the true scope and spirit of the invention are defined by the claims.

Claims (10)

  1. WHAT IS CLAIMED IS:
    1. A method for enriching target nucleic acid sequences from a nucleic acid sample, the method comprising steps of:
    a) providing a nucleic acid sample comprising target nucleic acid sequences, and bait sequences that are identical to or that are characteristic of the target nucleic acid sequences;
    b) performing in vitro transcription with the bait sequences as templates to prepare nucleic acid analogs, wherein the nucleic acid analogs each have a binding moiety, such as a biotin binding moiety;
    c) fragmenting the nucleic acid sample, preferably preparing a library;
    d) hybridizing said nucleic acid analogs to said nucleic acid sample, such that said nucleic acid analogs form nucleic acid analog/DNA hybrid complexes with said target nucleic acid sequences; and
    e) isolating the nucleic acid analog/DNA hybrid complexes from nonspecifically hybridized nucleic acids by the binding moiety, to remove nontarget nucleic acid sequences.
  2. The method according to claim 1, wherein the method further comprises step f) of amplifying the nucleic acid analog/DNA hybrid complexes, so as to enrich the target nucleic acid sequences.
  3. The method according to claim 1, wherein the in vitro transcription in step b) is performed with nucleic acid analogs GNA, LNA, PNA, TNA or morpholino nucleic acid, so as to prepare the nucleic acid analogs.
  4. The method according to claim 1, wherein the nucleic acid sample is genomic DNA, RNA, cDNA, or mRNA, and where the nucleic acid sample is RNA or mRNA, the method further comprises the step of subjecting the RNA or mRNA to reverse transcription into DNA before step c).
  5. The method according to claim 1, wherein the bait sequences have characteristics selected from the group consisting of: i) neither forming any hairpin structure by itself nor forming any dimer with each other, ii) each having a copy number compensated according to the GC content and/or spatial structure of the target nucleic acid sequences, iii) being designed for the region flanking a target region as the replacement region with the same design method for the target region, where the target region has a very high or very low GC content or where the target region is a region of low complexity, and iv) not specifically binding to other sequences than the target nucleic acid sequences in the nucleic acid sample.
  6. The method according to claim 4, wherein the copy number is compensated according to the GC content of the target nucleic acid sequences in ii), which means that, the copy number coefficient for the bait sequence with a GC content of 50% is set to the baseline of 1, and the copy number coefficient for the bait sequences with a GC content between 10% and 90% increases by 0.08-0.12 per 1% of the GC content deviating from 50%.
  7. The method according to claim 1, wherein the bait sequences are on a solid support, such as on a microarray slide.
  8. The method according to claim 1, wherein for each target region, the bait sequence is one or more bait sequences with an optimal comprehensive score in terms of specificity, dimer, hairpin structure, and relative position to the target region, the comprehensive score being obtained by a scoring function of $S = a \times S_{\text{specificity}} + b \times S_{\text{dimer}} + c \times S_{\text{hairpin structure}} + d \times S_{\text{relative distance}}$, where a = 0.26-0.34, b = 0.08-0.12, c = 0.17-0.23 and d = 0.35-0.45, and the particular scoring methods are as follows:
    $S_{\text{specificity}}$ scoring: any newly designed bait sequence is aligned to the genome and the Tm between the bait sequence and each hit sequence is calculated; where the difference between the Tm of the bait sequence with the target region and the Tm of the bait sequence with any hit sequence is >5°C, preferably >10°C, the mean of the Tms between the bait sequence and all hit sequences is calculated, and $S_{\text{specificity}} = 1 - Tm_{\text{mean}} / (Tm_{\text{target}} - 5)$, preferably $S_{\text{specificity}} = 1 - Tm_{\text{mean}} / (Tm_{\text{target}} - 10)$, where $Tm_{\text{mean}}$ is the average of the Tms between the bait sequence and all hits in non-specific regions, and $Tm_{\text{target}}$ is the Tm between the bait sequence and the target region;
    $S_{\text{dimer}}$ scoring: any newly designed bait sequence is aligned to each of the already designed bait sequences for dimer analysis, and the Tm between the newly designed bait sequence and each of the hit bait sequences is calculated; where the Tm < 47°C, the mean of the Tms between the newly designed bait sequence and each of the hit bait sequences is calculated, and $S_{\text{dimer}} = (47 - Tm_{\text{mean}}) / 47$, and preferably, where the Tm < 37°C, the mean of the Tms between the newly designed bait sequence and each of the hit bait sequences is calculated, and $S_{\text{dimer}} = (37 - Tm_{\text{mean}}) / 37$;
    $S_{\text{hairpin structure}}$ scoring: the optimal structure of any bait sequence upon self-alignment is determined, and the Tm of the structure is calculated; where the Tm < 47°C, then $S_{\text{hairpin structure}} = (47 - Tm) / 47$, and where the Tm < 37°C, then $S_{\text{hairpin structure}} = (37 - Tm) / 37$; and
    $S_{\text{relative distance}}$ scoring: the coordinate difference $\delta_{\text{distance}}$ between any newly designed bait sequence and the target region is calculated, and where $\delta_{\text{distance}}$ is less than 150, then $S_{\text{relative distance}} = (150 - \delta_{\text{distance}}) / 150$.
  9. The bait sequences as defined in any of claims 1-8.
  10. A kit comprising the bait sequences according to claim 9, the kit further comprising, but not limited to, a double-stranded linker molecule and a plurality of different oligonucleotide probes.
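The GC-based copy number coefficient of claim 6 and the comprehensive score of claim 8 can be illustrated with a short sketch. It is only one possible reading of the claim language, under stated assumptions: all function names are hypothetical, the default weights are the midpoints of the claimed ranges, the Tm values are taken as already computed inputs, a candidate that fails a cutoff is reported as `None` (the claims do not say how such candidates are handled), and empty hit lists are not handled.

```python
from statistics import fmean
from typing import Optional, Sequence


def gc_copy_coefficient(gc_percent: float, step: float = 0.10) -> float:
    """Claim 6 reading: 1 at 50% GC, plus `step` (0.08-0.12) per 1% deviation."""
    if not 10.0 <= gc_percent <= 90.0:
        raise ValueError("the coefficient is only described for 10-90% GC")
    return 1.0 + step * abs(gc_percent - 50.0)


def specificity_score(tm_target: float, hit_tms: Sequence[float],
                      margin: float = 5.0) -> Optional[float]:
    """1 - Tm_mean / (Tm_target - margin); margin is 5, or 10 in the preferred form."""
    if tm_target <= margin:
        raise ValueError("tm_target must exceed the margin")
    # Assumption: a hit within `margin` of the target Tm fails the cutoff.
    if any(tm_target - tm <= margin for tm in hit_tms):
        return None
    return 1.0 - fmean(hit_tms) / (tm_target - margin)


def dimer_score(hit_tms: Sequence[float], limit: float = 47.0) -> Optional[float]:
    """(limit - Tm_mean) / limit; limit is 47, or 37 in the preferred form."""
    if any(tm >= limit for tm in hit_tms):
        return None
    return (limit - fmean(hit_tms)) / limit


def hairpin_score(tm: float, limit: float = 47.0) -> Optional[float]:
    """(limit - Tm) / limit for the optimal self-aligned structure."""
    return None if tm >= limit else (limit - tm) / limit


def relative_distance_score(distance: float, window: float = 150.0) -> Optional[float]:
    """(window - distance) / window for a non-negative coordinate difference."""
    return None if distance >= window else (window - distance) / window


def comprehensive_score(parts: Sequence[Optional[float]],
                        weights: Sequence[float] = (0.30, 0.10, 0.20, 0.40)) -> Optional[float]:
    """Weighted sum a*S_specificity + b*S_dimer + c*S_hairpin + d*S_distance."""
    if any(p is None for p in parts):
        return None  # assumption: a failed cutoff rejects the candidate bait
    return sum(w * p for w, p in zip(weights, parts))


# Hypothetical candidate bait with made-up Tm values and distance.
parts = (
    specificity_score(75.0, [35.0, 35.0]),
    dimer_score([23.5]),
    hairpin_score(23.5),
    relative_distance_score(75.0),
)
print(comprehensive_score(parts), gc_copy_coefficient(60.0))
```

Keeping the cutoffs and weights as parameters makes it easy to switch between the broader and the preferred values recited in the claims without changing the scoring logic.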
AU2016403554A 2016-04-22 2016-11-21 Method for enriching target nucleic acid sequence from nucleic acid sample Pending AU2016403554A1 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN201610250133.3 2016-04-22
CN201610250133.3A CN105925671B (en) 2016-04-22 2016-04-22 A method of target sequence nucleotides are enriched with from nucleic acid samples
PCT/CN2016/106595 WO2017181670A1 (en) 2016-04-22 2016-11-21 Method for enriching target nucleic acid sequence from nucleic acid sample

Publications (1)

Publication Number Publication Date
AU2016403554A1 AU2016403554A1 (en) 2018-12-13

Family

ID=56839769

Family Applications (2)

Application Number Title Priority Date Filing Date
AU2016403554A Pending AU2016403554A1 (en) 2016-04-22 2016-11-21 Method for enriching target nucleic acid sequence from nucleic acid sample
AU2016102398A Active AU2016102398A4 (en) 2016-04-22 2016-11-21 Method for enriching target nucleic acid sequence from nucleic acid sample

Family Applications After (1)

Application Number Title Priority Date Filing Date
AU2016102398A Active AU2016102398A4 (en) 2016-04-22 2016-11-21 Method for enriching target nucleic acid sequence from nucleic acid sample

Country Status (3)

Country Link
CN (1) CN105925671B (en)
AU (2) AU2016403554A1 (en)
WO (1) WO2017181670A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021195324A1 (en) * 2020-03-26 2021-09-30 Integrated Dna Technologies, Inc. Hybridization capture methods and compositions

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105925671B (en) * 2016-04-22 2019-07-23 艾吉泰康(嘉兴)生物科技有限公司 A method of target sequence nucleotides are enriched with from nucleic acid samples
CN106676169B (en) * 2016-11-15 2021-01-12 上海派森诺医学检验所有限公司 Hybridization capture kit for detecting breast cancer susceptibility genes BRCA1 and BRCA2 mutation and method thereof
CN108546739A (en) * 2018-04-20 2018-09-18 曹顺 A method of the nucleic acid target sequence enrichment for NGS sequencings
CN111723261B (en) * 2019-03-22 2021-08-13 昆明逆火科技股份有限公司 Search engine-based DNA comparison algorithm
CN110343756B (en) * 2019-06-25 2023-02-24 广西识远医学检验实验室有限公司 Group of probes for detecting thalassemia, related kit and application

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003093509A1 (en) * 2002-05-01 2003-11-13 Seegene, Inc. Methods and compositions for improving specificity of pcr amplication
US8192937B2 (en) * 2004-04-07 2012-06-05 Exiqon A/S Methods for quantification of microRNAs and small interfering RNAs
CN103602658A (en) * 2013-10-15 2014-02-26 东南大学 Novel capture and enrichment technology for targeting nucleic acid molecules
CN105925671B (en) * 2016-04-22 2019-07-23 艾吉泰康(嘉兴)生物科技有限公司 A method of target sequence nucleotides are enriched with from nucleic acid samples

Also Published As

Publication number Publication date
CN105925671B (en) 2019-07-23
CN105925671A (en) 2016-09-07
WO2017181670A1 (en) 2017-10-26
AU2016102398A4 (en) 2019-05-02

Legal Events

Date Code Title Description
DA3 Amendments made section 104

Free format text: THE NATURE OF THE AMENDMENT IS: APPLICATION IS TO PROCEED UNDER THE NUMBER 2016102398