WO2023066255A1 - Sequencing method, sequencing data processing method, device and computer device - Google Patents

Sequencing method, sequencing data processing method, device and computer device

Info

Publication number
WO2023066255A1
WO2023066255A1 PCT/CN2022/125967 CN2022125967W WO2023066255A1 WO 2023066255 A1 WO2023066255 A1 WO 2023066255A1 CN 2022125967 W CN2022125967 W CN 2022125967W WO 2023066255 A1 WO2023066255 A1 WO 2023066255A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequencing
nucleic acid
read
template
reads
Prior art date
Application number
PCT/CN2022/125967
Other languages
English (en)
French (fr)
Inventor
樊济才
金欢
陈美容
陈方
孙雷
Original Assignee
深圳市真迈生物科技有限公司
Priority date
Filing date
Publication date
Application filed by 深圳市真迈生物科技有限公司
Priority to CN202280070809.4A (CN118139990A)
Publication of WO2023066255A1

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6813 Hybridisation assays
    • C12Q1/6834 Enzymatic or biochemical coupling of nucleic acids to a solid phase
    • C12Q1/6837 Enzymatic or biochemical coupling of nucleic acids to a solid phase using probe arrays or probe chips
    • C12Q1/6869 Methods for sequencing
    • C12Q1/6874 Methods for sequencing involving nucleic acid arrays, e.g. sequencing by hybridisation
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs
    • G16B20/50 Mutagenesis

Definitions

  • the present disclosure relates to the field of biotechnology, specifically to sequencing technology, and more specifically to a sequencing method, a sequencing data processing method, a device, a computing device, and a computer-readable medium.
  • High-throughput sequencers use imaging systems, such as a CCD (charge-coupled device, also known as a CCD image sensor) with TIRF (total internal reflection fluorescence), to detect the incorporated nucleotides and thereby achieve sequencing.
  • the present disclosure aims to solve one of the technical problems in the related art at least to a certain extent.
  • the present disclosure provides a sequencing method in one aspect.
  • the sequencing method comprises:
  • nucleic acid template is directly or indirectly linked to the surface of the solid phase carrier
  • the synthetic fragment corresponds to a continuous portion that overlaps or does not overlap with the nucleic acid template.
  • the present disclosure is based on the observation that the limited sequencing read length of a sequencing platform, especially a short read length (such as a 15-50 bp sequencing length), is not conducive to the assembly and analysis of the sequence, and on the consideration that, for a fixed amount of template, sequencing accuracy can be improved by increasing the amount of sequencing.
  • the length of the reads is not shorter than the length of the synthetic fragments.
  • the length of the synthetic fragment is greater than or equal to 1 bp.
  • the length of the synthetic fragment is greater than or equal to 10 bp.
  • the length of the synthetic fragment is greater than or equal to 10 bp and less than or equal to 20 bp.
  • the length of the nucleic acid template is less than or equal to 600bp.
  • the nucleic acid template is greater than or equal to 75 bp and less than or equal to 400 bp.
  • the 3'-OH of the sugar of the first nucleotide and/or the second nucleotide is reversibly blocked.
  • the 3'-OH of the sugar of the first nucleotide and/or the second nucleotide is in a natural state, and the first nucleotide and/or the second nucleotide has a cleavable blocking group attached to its base.
  • the detectable label is a fluorescent molecule.
  • the sequencing-by-synthesis reaction and/or the polymerization reaction are carried out under the action of a DNA polymerase selected from at least one of Klenow fragment, Bst, 9°N, Pfu, KOD and Vent.
  • the sequencing-by-synthesis reaction and the polymerization reaction are performed under the action of the same DNA polymerase, which is a Klenow fragment mutant.
  • the sequencing-by-synthesis reaction and the polymerization reaction are performed under the action of the same DNA polymerase, which is a 9°N mutant.
  • the read is a first read, the method comprising:
  • the first read, the synthetic fragment and the second read correspond to three non-overlapping contiguous portions of the nucleic acid template.
  • the read is a first read, the method comprising:
  • the first read, the synthetic fragment and the second read correspond to three non-overlapping contiguous portions of the nucleic acid template.
  • the synthetic fragment is the first synthetic fragment
  • the method further includes:
  • the second synthetic segment and the third read correspond to two contiguous portions of the nucleic acid template.
  • the method further comprises: repeating iii) and iv) at least once.
  • the method further comprises: repeating vi) and vii) at least once.
  • the length relationship between the first read, first synthetic fragment, second read, second synthetic fragment and third read is such that the nucleotide at any non-terminal position of the nucleic acid template is determined at least once.
  • the method further includes blocking at least a part of the nucleic acid molecules on the surface of the solid support after iv) and before v).
  • the method further comprises, after v) and before vi), blocking at least a part of the nucleic acid molecules on the surface of the solid support.
  • the extension reaction blocking agent is bound to the first template to realize the blocking, and the extension reaction blocking agent is selected from at least one of ddNTP and its derivatives.
  • the read is a first read
  • the synthetic fragment is a first synthetic fragment
  • the method includes:
  • the method further includes: repeating iii)-v) at least once, such that the length of the first synthetic fragment in each repetition is not shorter than the length of the first synthetic fragment in the previous repetition and not longer than the sum of the lengths of the first synthetic fragment and the second read in the previous repetition.
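The length rule for repeated rounds can be illustrated with a small check. This is only a sketch under stated assumptions: the function name `lengths_follow_rule` and the representation of each round as a `(synthetic_fragment_length, second_read_length)` pair are hypothetical and do not come from the disclosure.

```python
from typing import List, Tuple


def lengths_follow_rule(rounds: List[Tuple[int, int]]) -> bool:
    """Check the length rule across consecutive repetitions.

    Each entry is (synthetic_fragment_length, second_read_length) in bp
    for one repetition (hypothetical representation). For every
    repetition after the first, the synthetic fragment must be
      - not shorter than the previous synthetic fragment, and
      - not longer than previous synthetic fragment + previous second read.
    """
    for (prev_synth, prev_read), (cur_synth, _) in zip(rounds, rounds[1:]):
        if not (prev_synth <= cur_synth <= prev_synth + prev_read):
            return False
    return True


# Example: 10 bp jump then 25 bp read; the next jump may be 10..35 bp.
print(lengths_follow_rule([(10, 25), (20, 25), (45, 25)]))  # upper bounds are 35, then 45
print(lengths_follow_rule([(10, 25), (40, 25)]))            # 40 exceeds 10 + 25
```

Under this reading, the upper bound prevents a repetition from skipping past the end of the read obtained in the previous repetition.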
  • the read is a first read
  • the synthetic fragment is a first synthetic fragment
  • the method includes:
  • the read is a first read, the method comprising:
  • the nucleic acid template is obtained by hybridizing a single-stranded nucleic acid molecule with a probe, and extending the probe based on a polymerization reaction, the probe being covalently linked to the surface of the solid-phase support, and the 3' end of the single-stranded nucleic acid molecule being complementary to the probe.
  • the method further includes blocking at least a part of the nucleic acid molecules on the surface of the solid support after ii) and before iii).
  • the method further includes blocking at least a part of the nucleic acid molecules on the surface of the solid support after iii) and before iv).
  • the extension reaction blocking agent is bound to the first template to realize the blocking, and the extension reaction blocking agent is selected from at least one of ddNTP and its derivatives.
  • the method further includes blocking at least a part of the nucleic acid molecules on the surface of the solid support after iii) and before iv).
  • the method further includes blocking at least a part of the nucleic acid molecules on the surface of the solid support after iv) and before v).
  • the extension reaction blocking agent is bound to the first template to realize the blocking, and the extension reaction blocking agent is selected from at least one of ddNTP and its derivatives.
  • the nucleic acid template is dissociated from the first template by adding a denaturing reagent, so as to remove the nucleic acid template.
  • the first template is dissociated from the nucleic acid template by adding a denaturing reagent, so as to remove the first template.
  • the denaturing reagent comprises formamide
  • the sequencing data comprises a plurality of sets of reads
  • each set of reads comprises a plurality of reads obtained by performing multiple rounds of sequencing on the same insert, wherein the method includes performing the following processing on the plurality of reads of each set of reads:
  • the preset position requirement is determined by the rules of the multiple rounds of sequencing,
  • the actual relative position meeting the preset position requirement is an indication that the read is a spliceable read.
  • the fact that the actual relative position does not satisfy the preset position requirement is an indication that the read is the filtered read.
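The screening step described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: exact substring search on both strands stands in for a real global aligner, whole sets of reads (rather than individual reads) are classified for brevity, and the names `Match`, `locate`, `screen` and `same_strand` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Match:
    start: int   # 0-based start on the forward strand of the reference
    end: int     # exclusive end
    strand: str  # "+" (forward) or "-" (reverse)


def reverse_complement(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def locate(read: str, reference: str) -> Optional[Match]:
    """Toy stand-in for alignment: exact match on either strand."""
    pos = reference.find(read)
    if pos >= 0:
        return Match(pos, pos + len(read), "+")
    pos = reference.find(reverse_complement(read))
    if pos >= 0:
        return Match(pos, pos + len(read), "-")
    return None


def screen(
    read_sets: List[List[str]],
    reference: str,
    rule: Callable[[List[Match]], bool],
) -> Tuple[List[List[str]], List[List[str]]]:
    """Split read sets into spliceable and filtered ones.

    `rule` encodes the preset position requirement for the sequencing
    scheme in use and receives the matching regions of one read set.
    """
    spliceable, filtered = [], []
    for reads in read_sets:
        matches = [locate(r, reference) for r in reads]
        if all(m is not None for m in matches) and rule(matches):
            spliceable.append(reads)
        else:
            filtered.append(reads)
    return spliceable, filtered


def same_strand(matches: List[Match]) -> bool:
    # Example requirement: all matching regions lie on one strand.
    return len({m.strand for m in matches}) == 1


reference = "ATGCCGTAAGCTTGGACCTA"
ok, rejected = screen(
    [["ATGCC", "GTAAG"], ["ATGCC", "GGTCC"], ["ATGCC", "TTTTT"]],
    reference,
    same_strand,
)
print(ok)
print(rejected)
```

Passing the position requirement in as a function keeps the screening loop independent of the sequencing rule, which matches the statement that the requirement is determined by the rules of the multiple rounds of sequencing.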
  • the sequencing data processing method further includes:
  • a secondary screen is performed on the filtered reads, the secondary screen comprising:
  • each of said reads of said set of reads is used as a primary read for said secondary screening.
  • the sequencing data processing method further includes:
  • the assembleable reads are assembled according to the rules of the multiple rounds of sequencing.
  • the rules of the multiple rounds of sequencing include at least one selected from the following: paired-end sequencing, Jumping sequencing, Overlap sequencing, paired-end Jumping sequencing, and a combination of these sequencing rules.
  • the rule of the multiple rounds of sequencing is paired-end sequencing
  • the set of reads includes two reads
  • the preset position requirements include:
  • the matching regions of two of said reads are on the forward and reverse strands of said reference genome, respectively;
  • the predetermined threshold is determined based on the length of the inserted segment.
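A possible form of the paired-end position check is sketched below. The text above states the opposite-strand condition and that a threshold is derived from the insert length; how the threshold is applied is not spelled out here, so comparing the outer span of the two matching regions against `max_span` is an assumption, and the names `Match`, `paired_end_ok` and `max_span` are hypothetical.

```python
from typing import NamedTuple


class Match(NamedTuple):
    start: int   # 0-based start on the forward strand of the reference
    end: int     # exclusive end
    strand: str  # "+" or "-"


def paired_end_ok(a: Match, b: Match, max_span: int) -> bool:
    """Paired-end requirement (sketch).

    - the two matching regions lie on opposite strands;
    - assumption: the outer span covered by both regions does not
      exceed a threshold chosen from the expected insert length.
    """
    opposite = {a.strand, b.strand} == {"+", "-"}
    span = max(a.end, b.end) - min(a.start, b.start)
    return opposite and span <= max_span


print(paired_end_ok(Match(100, 150, "+"), Match(350, 400, "-"), max_span=400))
```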
  • the rule of the multiple rounds of sequencing is Jumping sequencing
  • the preset position requirements include:
  • Matching regions of a plurality of said reads are on the same strand of said reference genome
  • the distance on the reference genome between the matching regions of two adjacent reads among the plurality of reads does not exceed a predetermined distance threshold
  • the predetermined threshold is determined based on the length of the partial extension step.
  • the predetermined distance threshold is no more than 50 bp, preferably no more than 20 bp, more preferably between 5 and 20 bp.
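The Jumping requirement can be checked along these lines. This is a sketch under assumptions: the gap is measured from the end of one matching region to the start of the next, overlapping regions are rejected, and the names `Match`, `jumping_ok` and `max_gap` are hypothetical. The default of 20 bp reflects the preferred threshold stated above.

```python
from typing import List, NamedTuple


class Match(NamedTuple):
    start: int   # 0-based start on the forward strand of the reference
    end: int     # exclusive end
    strand: str  # "+" or "-"


def jumping_ok(matches: List[Match], max_gap: int = 20) -> bool:
    """Jumping requirement (sketch).

    - all matching regions lie on the same strand;
    - the gap between adjacent regions is between 0 and max_gap bp
      (rejecting overlapping regions is an assumption of this sketch).
    """
    if len({m.strand for m in matches}) != 1:
        return False
    ordered = sorted(matches, key=lambda m: m.start)
    return all(
        0 <= nxt.start - cur.end <= max_gap
        for cur, nxt in zip(ordered, ordered[1:])
    )


reads = [Match(0, 25, "+"), Match(35, 60, "+"), Match(72, 97, "+")]
print(jumping_ok(reads))  # gaps are 10 bp and 12 bp
```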
  • the rule of the multiple rounds of sequencing is Overlap sequencing
  • the preset position requirements include:
  • Matching regions of a plurality of said reads are on the same strand of said reference genome
  • the length of the overlapping region of two adjacent reads on the reference genome is within a predetermined distance range
  • the predetermined distance range is determined based on the length of the overlapping region in the sequencing process
  • the predetermined distance range is between 5 and 10 bp.
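The Overlap requirement differs from the Jumping one in that adjacent matching regions must share a bounded number of bases. A minimal sketch follows; the names `Match`, `overlap_ok`, `min_overlap` and `max_overlap` are hypothetical, and the 5-10 bp defaults mirror the range stated above.

```python
from typing import List, NamedTuple


class Match(NamedTuple):
    start: int   # 0-based start on the forward strand of the reference
    end: int     # exclusive end
    strand: str  # "+" or "-"


def overlap_ok(
    matches: List[Match], min_overlap: int = 5, max_overlap: int = 10
) -> bool:
    """Overlap requirement (sketch).

    - all matching regions lie on the same strand;
    - adjacent regions overlap by min_overlap..max_overlap bp.
    """
    if len({m.strand for m in matches}) != 1:
        return False
    ordered = sorted(matches, key=lambda m: m.start)
    return all(
        min_overlap <= cur.end - nxt.start <= max_overlap
        for cur, nxt in zip(ordered, ordered[1:])
    )


print(overlap_ok([Match(0, 30, "+"), Match(22, 52, "+")]))  # overlap is 8 bp
```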
  • the rule of the multiple rounds of sequencing is paired-end Jumping sequencing
  • the preset position requirements include:
  • a portion of the matching region of a plurality of said reads is on the forward strand of said reference genome and another portion is on the reverse strand of said reference genome;
  • the length of the overlapping region of two adjacent reads on the reference genome is within a predetermined distance range
  • the predetermined distance range is determined based on the length of the partial extension step in the sequencing process
  • the predetermined distance threshold is no more than 50 bp, preferably no more than 20 bp, more preferably between 5 and 20 bp.
  • the Jumping sequencing includes:
  • nucleic acid template is directly or indirectly linked to the surface of the solid phase carrier
  • said first nucleotide is a detectably labeled reversible terminator and is used to obtain a plurality of reads by said extension reaction;
  • the second nucleotide is a reversible terminator without a detectable label, and is used to obtain at least one synthetic fragment of a preset length through the extension reaction.
  • the Overlap sequencing includes:
  • the nucleic acid template is directly or indirectly connected to the surface of the solid phase carrier
  • the second sequencing adapter first performs an extension reaction with the second nucleotide, followed by a plurality of extension reactions with the first nucleotide, to obtain the second read.
  • the paired-end Jumping sequencing includes:
  • performing multiple rounds of extension reactions between the first primer and the nucleic acid template using the first nucleotide and the second nucleotide, and obtaining an extended strand of the first primer;
  • said first nucleotide is a detectably labeled reversible terminator and is used to obtain a plurality of reads by said extension reaction;
  • the second nucleotide is a reversible terminator without a detectable label, and is used to obtain at least one synthetic fragment of a preset length through the extension reaction.
  • the sequencing data comprises a plurality of sets of reads, each set of reads comprising a plurality of reads obtained by performing multiple rounds of sequencing on the same insert, and the sequencing data processing device includes a plurality of modules that perform the following processing on the plurality of reads of each set of reads:
  • a global alignment module for globally aligning the plurality of reads with a reference genome so as to determine a plurality of matching regions corresponding to the plurality of reads on the reference genome
  • a screening module configured to perform a screening on the plurality of reads based on the comparison between the actual relative positions between the plurality of matching regions and the preset position requirements, so as to obtain spliceable reads and filtered reads,
  • the preset position requirement is determined by the rules of the multiple rounds of sequencing,
  • the actual relative position meeting the preset position requirement is an indication that the read is a spliceable read.
  • the fact that the actual relative position does not satisfy the preset position requirement is an indication that the read is the filtered read.
  • the sequencing data processing device further includes a secondary screening module for performing secondary screening on the filtered reads, the secondary screening comprising:
  • the computing device includes: a processor and a memory;
  • the memory is used to store computer programs
  • the processor is configured to execute the computer program to implement the sequencing data processing method described above.
  • the computer-readable storage medium includes computer instructions, and when the instructions are executed by a computer, the computer implements the aforementioned method for processing sequencing data.
  • FIG. 1 is a schematic flowchart of a sequencing data processing method according to an embodiment of the present disclosure.
  • FIG. 2 is a schematic flowchart of a sequencing data processing method according to another embodiment of the present disclosure.
  • FIG. 3 is a schematic flowchart of secondary screening according to another embodiment of the present disclosure.
  • FIG. 4 is a schematic flowchart of a sequencing data processing method according to another embodiment of the present disclosure.
  • FIG. 5 is a schematic structural diagram of a sequencing data processing device according to an embodiment of the present disclosure.
  • FIG. 6 is a schematic structural diagram of a sequencing data processing device according to an embodiment of the present disclosure.
  • FIG. 7 is a schematic structural diagram of a sequencing data processing device according to an embodiment of the present disclosure.
  • FIG. 8 is a schematic flowchart of paired-end sequencing according to an embodiment of the present disclosure.
  • FIG. 9 is a schematic flowchart of Jumping sequencing according to an embodiment of the present disclosure.
  • FIG. 10 is a schematic flowchart of Overlap sequencing according to an embodiment of the present disclosure.
  • FIG. 11 is a schematic flowchart of paired-end Jumping sequencing according to an embodiment of the present disclosure.
  • the terms “first” and “second” are used for descriptive purposes only, and cannot be interpreted as indicating or implying relative importance or implicitly specifying the quantity of the indicated technical features.
  • the features defined as “first” and “second” may explicitly or implicitly include at least one of these features.
  • “plurality” means at least two, such as two, three, etc., unless otherwise specifically defined.
  • the terms “connection” and “fixation” should be interpreted in a broad sense; for example, a connection can be a fixed connection, a reversible connection, a direct connection, or an indirect connection through an intermediary, unless expressly qualified otherwise.
  • the term “nucleic acid template” refers to a nucleic acid molecule to be detected, that is, a polymer of nucleotides of a certain length; the nucleotides may include one or more of ribonucleotides, deoxyribonucleotides, and analogs or derivatives of ribonucleotides or deoxyribonucleotides. The term includes single-stranded and double-stranded nucleic acid molecules.
  • sequencing may also be referred to as “nucleic acid sequencing” or “gene sequencing”, which refers to the determination of the order of bases in a nucleic acid sequence; it includes single-end sequencing and/or paired-end sequencing, etc.
  • paired-end sequencing may refer to the reading of any two segments or parts of the same nucleic acid molecule that do not completely overlap; sequencing includes the process of incorporating nucleotides (including nucleotide analogs) onto the template and collecting the corresponding reaction signals.
  • reversible terminator refers to four kinds of natural nucleotides (dATP, dCTP, dGTP, dTTP) or their derivatives with reversible modification.
  • Derivatives of natural nucleotides refer to compounds formed by replacing atoms or atomic groups of nucleotides with other atoms or atomic groups. Derivatives of natural nucleotides can be incorporated at the 3' end of a nucleic acid chain under the action of a polymerase or terminal transferase.
  • once the reversible modification is removed, the 3' end of a nucleotide whose 3' end was reversibly modified can again form a phosphodiester bond with the next nucleotide; the modification group may be, for example, an alkyl group containing an azide group.
  • nucleotide refers to the four natural nucleotides (dATP, dCTP, dGTP, dTTP) or derivatives thereof, unless otherwise clearly defined.
  • the term "sugar of nucleotides” refers to ribose or deoxyribose.
  • the chemical formula of ribose is C₅H₁₀O₅.
  • Ribose has two configurations: L-ribose and D-ribose.
  • the chemical structure of L-ribose, with the 3' position marked: [structural formula of L-ribose]
  • the chemical structure of D-ribose, with the 3' position marked: [structural formula of D-ribose]
  • deoxyribose is also known as D-deoxyribose, 2-deoxy-D-ribose, or thyminose; its chemical formula is C₄H₉O₃CHO (C₅H₁₀O₄). Its chemical structure, with the 3' position marked: [structural formula of deoxyribose]
  • base also known as nucleobase, nitrogenous base
  • natural bases include adenine (A), guanine (G), cytosine (C), thymine (T) and uracil (U); unnatural bases include locked nucleic acid (LNA) and bridged nucleic acid (BNA); base analogs include, for example, hypoxanthine, deazaadenine, deazaguanine, deazahypoxanthine, 7-methylguanine, 5,6-dihydrouracil, 5-methylcytosine and 5-hydroxymethylcytosine.
  • since the nucleotide type is determined by the base type, the base type may be used to represent the nucleotide type in the present disclosure.
  • primer refers to: an oligonucleotide or nucleic acid molecule that can hybridize to a target sequence of interest; a primer is a single-stranded oligonucleotide or polynucleotide.
  • detectable label refers to a label or group capable of producing a detectable signal under suitable conditions.
  • the term “adapter” (or linker) refers to a nucleotide sequence containing a known sequence, which may be a single-stranded or a double-stranded nucleic acid.
  • Adapters can be used as primers and can also be used to ligate at one or both ends of nucleic acid fragments.
  • the term "Jumping sequencing” refers to a sequencing method.
  • the sequencing method includes: providing a nucleic acid template, the nucleic acid template being directly or indirectly linked to the surface of a solid phase carrier; and performing multiple rounds of extension reactions with the nucleic acid template using a first nucleotide and a second nucleotide, wherein the first nucleotide is a reversible terminator with a detectable label and is used to obtain multiple reads by the extension reaction, and the second nucleotide is a reversible terminator without a detectable label and is used to obtain at least one synthetic fragment of a preset length by the extension reaction.
  • the term "Overlap sequencing” refers to a sequencing method.
  • the sequencing method includes: directly or indirectly connecting a nucleic acid template to the surface of a solid phase carrier; and performing multiple rounds of extension reactions with the nucleic acid template using a first sequencing adapter and a second sequencing adapter to obtain multiple reads, wherein the first read generated by the first sequencing adapter and the second read generated by the second sequencing adapter share an overlapping region of at least one base. Optionally, the first sequencing adapter performs extension reactions with the first nucleotide to obtain the first read, and the second sequencing adapter first performs an extension reaction with the second nucleotide, followed by multiple extension reactions with the first nucleotide, to obtain the second read.
  • the present disclosure proposes a sequencing method, comprising:
  • (11) providing a solid phase carrier surface to which a nucleic acid complex formed by a nucleic acid template and a first primer is connected, at least a part of the first primer being configured to hybridize with at least a part of the 3' end of the nucleic acid template, wherein either the nucleic acid template or the first primer is connected to the surface of the solid phase carrier.
  • step (11) the first primer and the nucleic acid template are complementary to form a nucleic acid complex, and the nucleic acid complex is connected to the surface of the solid-phase carrier, so as to realize the immobilization of the nucleic acid template on the surface of the solid-phase carrier.
  • the nucleic acid template in the nucleic acid complex is attached to the surface of the solid phase carrier.
  • the connection of the nucleic acid template to the surface of the solid-phase carrier does not mean that the nucleic acid template is connected to the surface of the solid-phase carrier through the first primer, but the nucleic acid template is covalently bonded to molecules/groups on the surface of the solid-phase carrier, thereby realizing nucleic acid Attachment of the template to the surface of the solid support.
  • step (11) can be achieved by the following method: the nucleic acid template is covalently linked to the surface of the solid phase carrier, a first primer is added, and the nucleic acid template is hybridized with the first primer, at least a part of the first primer being complementary to the 3' end of the nucleic acid template.
  • the first primer in the nucleic acid complex is attached to the surface of the solid phase carrier. That is, the first primer is connected to the surface of the solid phase carrier through a covalent bond, and the nucleic acid template is connected to the surface of the solid phase carrier through the first primer. In this case, the nucleic acid template is not directly connected to the surface of the solid-phase carrier, but indirectly connected to it through complementary pairing with the first primer.
  • the first primer is linked to molecules or groups on the surface of the solid support through a covalent bond, so as to realize the connection of the first primer on the surface of the solid support.
  • step (11) can be achieved by the following method: the first primer is covalently linked to the surface of the solid phase carrier, and the nucleic acid template is hybridized with the first primer, at least a part of the first primer being complementary to the 3' end of the nucleic acid template.
  • the nucleic acid template is less than or equal to 600 bp in length. In one embodiment, the nucleic acid template is greater than or equal to 75 bp and less than or equal to 400 bp. Exemplarily, the nucleic acid template is 75-80 bp, 80-90 bp, 90-100 bp, 100-120 bp, 120-150 bp, 150-180 bp, 180-200 bp, 200-220 bp, 220-250 bp, 250-280 bp, 280-300 bp, 300-320 bp, 320-350 bp, 350-380 bp, 380-400 bp, etc.
  • the nucleic acid template is used as a template
  • the first primer is used as a primer to perform an extension reaction to obtain a first extension fragment
  • the length of the first extension fragment is less than the length of the nucleic acid template.
  • the first nucleotide is a reversible terminator without a detectable label.
  • the first nucleotides added in step (21) are 4 kinds of reversible terminators without detectable labels. Utilizing such nucleotides, on the one hand, the length of the first extension fragment can be effectively controlled by the blocking group in the reversible terminator, and no fluorescent dye group is introduced into the first nucleotide, so that the fluorescent dye can be effectively avoided. The effect of the group remaining on the base after excision on the extension reaction.
  • the conditions suitable for carrying out the polymerization reaction include DNA polymerase, that is, the synthetic polymerization reaction is carried out under the action of the DNA polymerase.
  • the DNA polymerase can be any enzyme that can perform DNA amplification, such as at least one of Taq enzyme, Klenow fragment, Bst, 9°N, Pfu, KOD and Vent.
  • the length of the first extension fragment is not shorter than the length of the synthetic fragment. In some embodiments, the length of the first extension is greater than or equal to 1 bp. In some embodiments, the length of the first extension is greater than or equal to 10 bp. In some embodiments, the length of the first extension is greater than or equal to 10 bp and less than or equal to 20 bp. Exemplarily, the length of the first extension fragment is 10-12 bp, 12-14 bp, 14-16 bp, 16-18 bp, 18-20 bp and so on.
  • the second nucleotide is a reversible terminator with a detectable label.
  • the reversible terminator contains a blocking group that blocks reaction at the 3' position of the sugar of the nucleotide, so that each sequencing-by-synthesis or sequencing-by-ligation reaction incorporates only one second nucleotide.
  • a blocking group is introduced into the nucleotide to eliminate the reactivity of the 3' position of the sugar of the nucleotide.
  • the detectable label is a fluorescent label.
  • each first nucleotide participating in the extension reaction may carry a different fluorescent label, or at least two of the four first nucleotides participating in the extension reaction may carry different fluorescent labels.
  • each of the four first nucleotides carries a different fluorescent label (four labels in total); or the four first nucleotides carry three fluorescent labels, wherein the first three first nucleotides carry different fluorescent groups, and the fourth first nucleotide either carries the same fluorescent group as one of the first three or carries no fluorescent group. It should be understood that the type of the fourth first nucleotide is not limited.
  • the four first nucleotides carry two kinds of fluorescent labels, for example, two kinds of first nucleotides carry one kind of the same fluorescent label, and the other two kinds of first nucleotides carry another kind of the same fluorescent label.
  • four nucleotides carry one fluorescent label.
  • a detectable label need not be a fluorescent label. Any detectable label that allows detection of the type of nucleotide incorporated in the DNA sequence will do.
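  • The labeling options listed above (four, three, two, or one label shared among four nucleotide types) can be compared with a minimal sketch. The dye names, the base-to-dye assignments, and the function names below are hypothetical and are not taken from the disclosure; the sketch only reports which nucleotide types share a label and therefore could not be told apart by the label alone.

```python
from collections import defaultdict

# Hypothetical label assignments; dye names are placeholders and
# None stands for a nucleotide that carries no fluorophore.
SCHEMES = {
    "four_labels": {"A": "dye1", "C": "dye2", "G": "dye3", "T": "dye4"},
    "three_labels_shared": {"A": "dye1", "C": "dye2", "G": "dye3", "T": "dye1"},
    "three_labels_dark": {"A": "dye1", "C": "dye2", "G": "dye3", "T": None},
    "two_labels": {"A": "dye1", "C": "dye1", "G": "dye2", "T": "dye2"},
    "one_label": {"A": "dye1", "C": "dye1", "G": "dye1", "T": "dye1"},
}


def group_by_label(scheme):
    """Group nucleotide types by the label they carry."""
    groups = defaultdict(list)
    for base, label in scheme.items():
        groups[label].append(base)
    return {label: sorted(bases) for label, bases in groups.items()}


def ambiguous_groups(scheme):
    """Return the groups of nucleotide types that share one label."""
    return [bases for bases in group_by_label(scheme).values() if len(bases) > 1]


if __name__ == "__main__":
    for name, scheme in SCHEMES.items():
        print(name, ambiguous_groups(scheme))
```

  • An empty result means every nucleotide type has its own signal state. A non-empty result means some further means of discrimination, which this sketch does not model, would be needed for the grouped types.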
  • the conditions suitable for performing a sequencing-by-synthesis reaction or a sequencing-by-ligation reaction include a DNA polymerase, that is, the sequencing-by-synthesis reaction or sequencing-by-ligation reaction is performed under the action of a DNA polymerase.
  • the DNA polymerase can be any enzyme that can perform DNA amplification, such as at least one of Taq enzyme, Klenow fragment, Bst, 9°N, Pfu, KOD and Vent.
  • the polymerization reaction of step (21) and the sequencing-by-synthesis reaction or sequencing-by-ligation reaction of step (31) are performed under the action of the same DNA polymerase, wherein the DNA polymerase is a Klenow fragment mutant.
  • the polymerization reaction of step (21) and the sequencing-by-synthesis reaction or sequencing-by-ligation reaction of step (31) are carried out under the action of the same DNA polymerase, wherein the DNA polymerase is a 9°N mutant.
  • the first sequencing data can be obtained through step (31).
  • the order of step (21) and step (31) can be reversed. Step (21) uses the nucleic acid template as the template and the first primer as the primer to carry out the extension reaction and obtain the first extended fragment; step (31) uses the second nucleotide, under conditions suitable for a sequencing-by-synthesis reaction or a sequencing-by-ligation reaction, with the nucleic acid template as the template and the first extension fragment as the primer, to carry out extension cycles for the first sequencing and form the first nascent sequencing strand.
  • that is, the sequencing-by-synthesis reaction can be carried out first to determine a part of the nucleic acid template, and the second nucleotide can then be used in a polymerization reaction to synthesize a part of the nucleic acid template and obtain a synthetic fragment of a preset length; or the second nucleotide can first be used in a polymerization reaction to synthesize a part of the nucleic acid template and obtain a synthetic fragment of a preset length, and a sequencing-by-synthesis reaction can then be performed to determine a part of the nucleic acid template.
  • the present disclosure proposes a sequencing method, including a first sequencing method, and the first sequencing method further includes:
  • using the nucleic acid template as the template and the first primer as the primer, carrying out extension cycles to perform the second sequencing, forming a second nascent sequencing strand and obtaining second sequencing data.
  • in step (51), the conditions suitable for the sequencing-by-synthesis reaction or the sequencing-by-ligation reaction are as described above and are not repeated here.
  • the length of the second nascent sequencing strand is not less than the length of the first extended fragment.
  • the first sequencing data and the second sequencing data have partially overlapping data.
  • the use of partially overlapping data for sequencing data analysis is more conducive to the assembly analysis of template sequences and the mutual proofreading of sequencing data, and improves the accuracy of sequencing data analysis.
  • the length of the second nascent sequencing strand is less than the combined length of the first nascent sequencing strand and the first extension.
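  • The length constraints above determine where the first and second sequencing data overlap. The following minimal sketch assumes a coordinate system that is not stated in the disclosure: position 0 is the first template base after the first primer, the first sequencing starts after an unlabeled first extended fragment of length `ext_len`, and the second sequencing starts directly at the first primer. Function names are hypothetical. The sketch computes the overlapping interval and lists positions where the two reads disagree, as one simple form of mutual proofreading.

```python
def overlap_interval(ext_len, read1_len, read2_len):
    """Return the (start, end) template interval covered by both reads, or None."""
    read1 = (ext_len, ext_len + read1_len)  # first sequencing follows the extended fragment
    read2 = (0, read2_len)                  # second sequencing starts right after the primer
    start = max(read1[0], read2[0])
    end = min(read1[1], read2[1])
    return (start, end) if start < end else None


def overlap_mismatches(read1_seq, read2_seq, ext_len):
    """List (position, base_in_read1, base_in_read2) where the reads disagree."""
    interval = overlap_interval(ext_len, len(read1_seq), len(read2_seq))
    if interval is None:
        return []
    start, end = interval
    mismatches = []
    for pos in range(start, end):
        base1 = read1_seq[pos - ext_len]  # read 1 is offset by the extended fragment
        base2 = read2_seq[pos]
        if base1 != base2:
            mismatches.append((pos, base1, base2))
    return mismatches


if __name__ == "__main__":
    strand = "ACGTACGTACGTACGTACGTACGTACGTAC"  # made-up 30-base nascent strand
    read1 = strand[10:30]                      # 10-base extension, then 20 sequenced bases
    read2 = strand[:15]                        # 15 bases sequenced from the primer
    print(overlap_interval(10, len(read1), len(read2)))
    print(overlap_mismatches(read1, read2, 10))
```

  • Under these assumptions the overlap is non-empty only when the second read is longer than the first extended fragment, and the second read stays shorter than the first read plus the extended fragment, in line with the length conditions stated above.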
  • the above method further includes: performing a first blocking treatment on the 3' end of the first nascent sequencing strand remaining on the surface of the chip. Blocking the 3' end of the remaining first nascent sequencing strand can effectively prevent interference signals generated by the continued extension of the first nascent sequencing strand during the second sequencing process. By reducing the interference of invalid data generated by interference signals on information analysis, the amount of effective data can be effectively increased, thereby improving the accuracy of sequencing data analysis.
  • the above-mentioned first blocking treatment can be performed by different methods, such as by removing the 3' terminal hydroxyl group and/or by linking the 3' terminal hydroxyl group with an extension reaction blocking agent.
  • the extension reaction blocking agent is used to block the reaction between the 3' terminal hydroxyl group and the phosphate group, and the extension reaction blocking agent can be an alkyl group, ddNTP or its derivatives, etc.
  • the extension reaction blocker is a ddNTP or a derivative thereof.
  • the above-mentioned first blocking treatment is performed using at least one of DNA polymerase and terminal transferase.
  • DNA polymerase uses the DNA chain as a template to add ddNTP to the 3' end of the nucleic acid chain to be blocked, so as to achieve the effect of blocking the 3' end.
  • Terminal transferase can directly add ddNTP to the 3' end of single-stranded nucleic acid to achieve the effect of 3' end blocking.
  • the above-mentioned first blocking treatment uses a polymerase to link ddNTPs or derivatives thereof.
  • the sequencing method proposed in the present disclosure includes a second sequencing method.
  • the second sequencing method is based on the sequencing method proposed in the second implementation of the present disclosure.
  • it further includes the following technical features:
  • after step (11) and before step (21), the following steps are included:
  • for the manner of connecting the nucleic acid template to the solid phase carrier, refer to the description above.
  • the nucleic acid template is covalently attached to the surface of the solid support.
  • the length of the third nascent sequencing strand is not less than the length of the first extended fragment.
  • the first sequencing data and the third sequencing data have partially overlapping data. Using partially overlapping data for data analysis is more conducive to the assembly analysis of template sequences and the mutual proofreading of sequencing data, and improves the accuracy of sequencing data analysis.
  • the third sequencing method further includes step (c): performing a second blocking treatment on the 3' end of the third nascent sequencing strand remaining on the surface of the chip.
  • Blocking the 3' end of the residual third nascent sequencing strand can effectively prevent the residual third nascent sequencing strand from continuing to extend during the first sequencing process to generate interference signals.
  • the amount of effective sequencing data can be effectively increased. Therefore, the accuracy of sequencing data analysis can be further improved by increasing the effective amount of sequencing data through the second blocking process.
  • the above-mentioned second blocking treatment can be performed by different methods, such as by removing the 3' terminal hydroxyl group and/or by linking the 3' terminal hydroxyl group with an extension reaction blocking agent.
  • the extension reaction blocking agent is used to block the reaction between the 3' terminal hydroxyl group and the phosphate group, and the extension reaction blocking agent can be an alkyl group, ddNTP or its derivatives, etc.
  • the extension reaction blocker is a ddNTP or a derivative thereof.
  • the above-mentioned second blocking treatment is performed using at least one of DNA polymerase and terminal transferase.
  • DNA polymerase uses the DNA chain as a template to add ddNTP to the 3' end of the nucleic acid chain to be blocked, so as to achieve the effect of blocking the 3' end.
  • Terminal transferase can directly add ddNTP to the 3' end of single-stranded nucleic acid to achieve the effect of 3' end blocking.
  • the above-mentioned second blocking treatment uses a polymerase to link ddNTPs or derivatives thereof.
  • Removal of the third nascent sequencing strand can be carried out by physical or chemical methods: physical methods include high-temperature denaturation (for example, 80°C-98°C), and chemical methods use denaturing reagents such as NaOH or formamide.
  • the third nascent sequencing strand is removed by dissociation of the third nascent sequencing strand from the nucleic acid template by a denaturing reagent such as formamide.
  • the nucleic acid templates in the above-mentioned first sequencing method and its examples, the third sequencing method and its examples are respectively obtained by the following steps:
  • the third blocking is used to block the nucleic acid molecules on the surface of the chip, which include adapters, nucleic acid templates, residual initial templates, and the like. The third blocking effectively prevents the 3' ends of nucleic acid molecules on the chip surface from being joined to nucleotides carrying a detectable signal and generating interference signals during sequencing; reducing the invalid data produced by such interference signals increases the amount of effective sequencing data. The third blocking treatment can thus further improve the accuracy of sequencing data analysis.
  • the sequencing library is a DNA library
  • the library molecules in the DNA library contain multiple single-stranded DNA fragments.
  • the above-mentioned first sequencing method or the third sequencing method further includes:
  • the fourth block is used to block the 3' end of the complementary strand of the template strand, which can effectively prevent the complementary strand from continuing to extend during the sequencing process or the amplification process to generate interference signals.
  • the fourth block can effectively increase the amount of effective sequencing data. Therefore, the accuracy of sequencing data analysis can be further improved by increasing the effective amount of sequencing data through the fourth blocking process.
  • the above-mentioned third blocking treatment and fourth blocking treatment can be carried out by different methods, for example each independently by removing the 3' terminal hydroxyl group and/or by linking the 3' terminal hydroxyl group with an extension reaction blocking agent.
  • the extension reaction blocking agent is used to block the reaction between the 3' terminal hydroxyl group and the phosphate group, and the extension reaction blocking agent can be an alkyl group, ddNTP or its derivatives, etc.
  • the elongation reaction blockers in the first sequencing method and its examples, the third sequencing method and its examples are ddNTPs or derivatives thereof, respectively.
  • the third blocking treatment and the fourth blocking treatment are independently performed using at least one of DNA polymerase and terminal transferase.
  • DNA polymerase uses the DNA chain as a template to add ddNTP to the 3' end of the nucleic acid chain to be blocked, so as to achieve the effect of blocking the 3' end.
  • Terminal transferase can directly add ddNTP to the 3' end of single-stranded nucleic acid to achieve the effect of 3' end blocking.
  • the fourth blocking treatment links ddNTPs or derivatives thereof using a polymerase, and the third blocking treatment links ddNTPs or derivatives thereof using terminal transferase.
  • the sequencing method proposed in the present disclosure further includes:
  • the surface of the solid phase carrier is connected with a nucleic acid complex formed by a nucleic acid template and a first primer, at least a part of the first primer is configured to hybridize with at least a part of the 3' end of the nucleic acid template, and either the nucleic acid template is connected to the surface of the solid phase carrier or the first sequencing primer is connected to the surface of the solid phase carrier.
  • in step (12), the first primer and the nucleic acid template are complementary and form a nucleic acid complex, and the nucleic acid complex is connected to the surface of the solid-phase carrier, so that the nucleic acid template is immobilized on the surface of the solid-phase carrier.
  • the nucleic acid template in the nucleic acid complex is attached to the surface of the solid phase carrier.
  • here, the nucleic acid template being connected to the surface of the solid-phase carrier means that it is connected directly, rather than through the first primer.
  • the nucleic acid template is covalently linked to molecules/groups on the surface of the solid support, thereby realizing the linking of the nucleic acid template to the surface of the solid support.
  • step (12) can be achieved by the following method: the nucleic acid template is covalently linked to the surface of the solid phase carrier, a first primer is added, and the nucleic acid template is hybridized with the first primer, wherein at least a part of the first primer is complementary to the 3' end of the nucleic acid template.
  • the first primer in the nucleic acid complex is attached to the surface of the solid phase carrier. That is, the first primer is connected to the surface of the solid phase carrier through a covalent bond, and the nucleic acid template is connected to the surface of the solid phase carrier through the first sequencing primer. At this time, the nucleic acid template is not directly connected to the surface of the solid-phase carrier, but indirectly connected to the surface of the solid-phase carrier through complementary connection with the first primer.
  • the first primer is linked to molecules or groups on the surface of the solid support through a covalent bond, so as to realize the connection of the first primer on the surface of the solid support.
  • step (12) can be achieved by the following method: the first primer is covalently linked to the surface of the solid phase carrier, and the nucleic acid template is hybridized with the first primer, wherein at least a part of the first primer is complementary to the 3' end of the nucleic acid template.
  • the nucleic acid template is less than or equal to 600 bp in length. In one embodiment, the nucleic acid template is greater than or equal to 75 bp and less than or equal to 400 bp. Exemplarily, the nucleic acid template is 75-80 bp, 80-90 bp, 90-100 bp, 100-120 bp, 120-150 bp, 150-180 bp, 180-200 bp, 200-220 bp, 220-250 bp, 250-280 bp, 280-300 bp, 300-320 bp, 320-350 bp, 350-380 bp, 380-400 bp, etc.
  • the third nucleotide is a reversible terminator with a detectable label.
  • the third nucleotide is used as the substrate of the sequencing-by-synthesis reaction, and the third nucleotide is a reversible terminator with a detectable label.
  • the reversible terminator contains a blocking group that blocks reaction at the 3' position of the sugar of the nucleotide, so that each round of the extension reaction forming the first nascent sequencing strand introduces only one third nucleotide onto that strand.
  • the third nucleotide is detectably labeled.
  • the detectable label is a fluorescent label.
  • each third nucleotide participating in the extension reaction may carry a different fluorescent label, or at least two of the four third nucleotides participating in the extension reaction may carry different fluorescent labels.
  • the four third nucleotides carry four different fluorescent labels, one per nucleotide; or the four third nucleotides carry three fluorescent labels, wherein the first three third nucleotides each carry a different fluorophore, and the fourth third nucleotide either carries the same fluorophore as one of the first three third nucleotides or carries no fluorophore. It should be understood that the type of the fourth third nucleotide is not limited.
  • the four third nucleotides carry two kinds of fluorescent labels, for example, two kinds of third nucleotides carry one kind of the same fluorescent label, and the other two kinds of third nucleotides carry another kind of the same fluorescent label.
  • four third nucleotides are labeled with one fluorescent label.
  • a detectable label need not be a fluorescent label. Any detectable label that allows detection of the type of nucleotide incorporated in the DNA sequence will do.
  • the third nucleotide is a reversible terminator with a detectable label. Under the action of the polymerase, the third nucleotide is incorporated at the 3' end of the complementary strand of the nucleic acid template; because the reactivity of the 3' hydroxyl of the sugar of the third nucleotide is blocked, no further extension can take place, so each round of the extension reaction introduces only one third nucleotide onto the complementary strand of the nucleic acid template. Detecting the label determines the type of nucleotide incorporated, and removing the blocking group at the 3' end regenerates a free 3' hydroxyl group and restores reactivity.
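  • The incorporate, detect, and unblock cycle described above can be illustrated with a minimal simulation. It is a sketch under stated assumptions, not the disclosed chemistry: the label names are placeholders, each nucleotide type is assumed to have its own label, the template string is assumed to be written in the order the polymerase reads it, and `sequencing_cycles` is a hypothetical name.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
# Placeholder labels, one per nucleotide type (an assumption for this sketch).
LABELS = {"A": "label_1", "C": "label_2", "G": "label_3", "T": "label_4"}
DECODE = {label: base for base, label in LABELS.items()}


def sequencing_cycles(template, n_cycles):
    """Simulate reversible-terminator cycles and return the base calls."""
    read = []
    three_prime_blocked = False
    for position in range(min(n_cycles, len(template))):
        if three_prime_blocked:
            raise RuntimeError("3' end still blocked; extension cannot proceed")
        # 1. Incorporate exactly one labeled terminator; the 3' end is now blocked.
        incorporated = COMPLEMENT[template[position]]
        three_prime_blocked = True
        # 2. Detect the label to identify the incorporated nucleotide.
        signal = LABELS[incorporated]
        read.append(DECODE[signal])
        # 3. Remove the blocking group (and label) to restore the free 3' hydroxyl.
        three_prime_blocked = False
    return "".join(read)


if __name__ == "__main__":
    print(sequencing_cycles("TACG", 3))
```

  • Each loop iteration corresponds to one round of the extension reaction, so the read length equals the number of completed cycles.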
  • the conditions suitable for performing the sequencing reaction include DNA polymerase, that is, the sequencing-by-synthesis reaction is performed under the action of the DNA polymerase.
  • the DNA polymerase can be any enzyme that can perform DNA amplification, such as at least one of Taq enzyme, Klenow fragment, Bst, 9°N, Pfu, KOD and Vent.
  • the nucleotide types and their order in the first nascent sequencing strand can be read to obtain the sequence information of the first nascent sequencing strand.
  • a nascent sequencing strand whose sequence has been determined is also called a read; for example, the first nascent sequencing strand can also be called the first read, and the second nascent sequencing strand the second read.
  • the sequence of a part of the nucleic acid template can be determined from the sequence of the first nascent sequencing strand.
  • the length of the first nascent sequencing strand is less than the length of the nucleic acid template.
  • using the fourth nucleotide, under conditions suitable for performing a polymerization reaction, with the first nascent sequencing strand as the primer and the nucleic acid template as the template, a first extension is performed to obtain the first extended fragment, wherein the fourth nucleotide is a nucleotide without a detectable label.
  • the fourth nucleotide is a nucleotide without a detectable label; it can be selected from natural nucleotides (dATP, dCTP, dGTP, dTTP) or derivatives thereof, or from terminators without a detectable label, for example a nucleotide whose 3' end is reversibly modified and that carries no detectable label.
  • the fourth nucleotide added in step (32) is a nucleotide with a 3' end reversibly modified without a detectable label.
  • the conditions suitable for carrying out the polymerization reaction include DNA polymerase, that is, the synthetic polymerization reaction is carried out under the action of the DNA polymerase.
  • the DNA polymerase can be any enzyme that can perform DNA amplification, such as at least one of Taq enzyme, Klenow fragment, Bst, 9°N, Pfu, KOD and Vent.
  • the sequencing-by-synthesis reaction of step (22) and the polymerization reaction of step (32) are performed under the action of the same DNA polymerase, wherein the DNA polymerase is a Klenow fragment mutant.
  • the sequencing-by-synthesis reaction of step (22) and the polymerization reaction of step (32) are performed under the action of the same DNA polymerase, wherein the DNA polymerase is a 9°N mutant.
  • the sequencing method proposed in the present disclosure includes a third sequencing method, wherein the third sequencing method is based on the sequencing method proposed in the second aspect of the present disclosure, and further includes: the first sequencing primer is covalently connected to the surface of the solid phase carrier, and the nucleic acid template is connected to the surface of the solid phase carrier through the first sequencing primer.
  • the above-mentioned fourth nucleotide is a natural nucleotide and/or a derivative thereof.
  • the above-mentioned third sequencing method further includes the steps of: (42) removing the nucleic acid template; (52) using the third nucleotide, under conditions suitable for a sequencing-by-synthesis reaction or a sequencing-by-ligation reaction, with the complementary strand of the nucleic acid template as the template and the second sequencing primer as the primer, carrying out extension cycles to perform a second sequencing, forming a second nascent sequencing strand and obtaining second sequencing data; wherein the complementary strand of the nucleic acid template is formed jointly by the first nascent sequencing strand and the first extended fragment.
  • the above-mentioned third sequencing method further includes: performing a fifth blocking treatment on the 3' end of the nucleic acid chain on the surface of the chip.
  • the fifth block is used to block the nucleic acid chains on the surface of the chip, and the nucleic acid molecules on the surface of the chip include adapters, complementary strands, residual initial templates, and the like.
  • the fifth blocking effectively prevents the 3' ends of nucleic acid molecules on the chip surface from being joined to nucleotides carrying a detectable signal and generating interference signals during sequencing; reducing the invalid data produced by such interference signals increases the amount of effective sequencing data.
  • the fifth blocking process can further improve the accuracy of sequencing data analysis by increasing the effective amount of sequencing data.
  • the ends of the nucleic acid strands can be blocked in different ways, such as by removing the 3' terminal hydroxyl group and/or by attaching the 3' terminal hydroxyl group to an extension reaction blocking agent.
  • the above-mentioned fifth blocking is performed by linking the 3' terminal hydroxyl group with an extension reaction blocking agent.
  • the extension reaction blocking agent is used to block the reaction between the 3' terminal hydroxyl group and the phosphate group, and the extension reaction blocking agent can be an alkyl group, ddNTP or its derivatives, etc.
  • the above extension reaction blocking agent is ddNTP or its derivatives.
  • the fifth blocking is performed with terminal transferase.
  • Terminal transferase can directly connect ddNTP or its derivatives to the end of the nucleic acid chain to achieve the effect of blocking the 3' end.
  • Removal of the nucleic acid template can be carried out by physical or chemical methods: physical methods include high-temperature denaturation (for example, 80°C-98°C), and chemical methods use denaturing reagents such as NaOH or formamide.
  • the above-mentioned removal of the nucleic acid template is performed by dissociating the nucleic acid template strand from its complementary strand with the denaturing reagent formamide.
  • the sequencing method proposed in the present disclosure includes a fourth sequencing method, wherein the fourth sequencing method is based on the sequencing method proposed in the second aspect above, and further includes: the fourth nucleotide is a reversible terminator without a detectable label.
  • the above-mentioned fourth sequencing method further includes step (43): using the third nucleotide, under conditions suitable for the sequencing-by-synthesis reaction or the sequencing-by-ligation reaction, with the nucleic acid template as the template and the first extension fragment as the primer, carrying out extension cycles to perform a second sequencing, forming a second nascent sequencing strand and obtaining second sequencing data.
  • the above-mentioned fourth sequencing method further includes step (53): repeating steps (32) and (43) N-1 times to obtain the 1st to (N+1)th nascent sequencing strands, the 1st to (N+1)th sequencing data, and the 1st to Nth extended fragments, wherein the 1st to (N+1)th nascent sequencing strands and the 1st to Nth extended fragments together form the first nascent strand;
  • the Nth extended fragment is obtained by using the fourth nucleotide, under conditions suitable for the polymerization reaction, with the nucleic acid template as the template and the Nth nascent sequencing strand as the primer, to perform extension;
  • the (N+1)th nascent sequencing strand and the (N+1)th sequencing data are obtained by using the third nucleotide, under conditions suitable for the sequencing-by-synthesis reaction or the sequencing-by-ligation reaction, with the nucleic acid template as the template and the Nth extension fragment as the primer, to carry out extension cycles.
  • the maximum value of N is related to the length of the nucleic acid template.
  • the size of N is determined according to the length of the nucleic acid template, the length of the new sequencing strand, and the length of the extension fragment.
  • the maximum value of N is the integer part of [length of the nucleic acid template / (length of the nascent sequencing strand + length of the extension fragment)], minus 1. For example, when the length of the nucleic acid template is 300 bp, the length of the nascent sequencing strand is 25 bp, and the length of the extended fragment is 15 bp, the quotient is 300/40 = 7.5, its integer part is 7, and the maximum value of N is 6.
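  • The rule for the maximum value of N can be written as a short calculation. This is a minimal sketch: the function name `max_rounds` is hypothetical, "taking the integer" is interpreted as rounding down (the reading that reproduces the worked example), and clamping the result at zero for very short templates is an added assumption.

```python
def max_rounds(template_len, read_len, ext_len):
    """Maximum N: integer part of template / (read + extension), minus 1."""
    if template_len <= 0 or read_len <= 0 or ext_len <= 0:
        raise ValueError("all lengths must be positive")
    n = template_len // (read_len + ext_len) - 1
    return max(n, 0)  # assumption: never report a negative number of rounds


if __name__ == "__main__":
    # Worked example from the text: 300 bp template, 25 bp reads, 15 bp extensions.
    print(max_rounds(300, 25, 15))
```

  • With N = 6 the strand holds seven reads of 25 bp and six extension fragments of 15 bp, 265 bp in total, which fits within the 300 bp template.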
  • the lengths of the 1st to N extension fragments are respectively 10-20 bp.
  • the results of multiple experiments show that when the length of the extended fragment is 10-20bp, two new sequencing strands can be effectively separated, reducing the impact of the new sequencing strand on the molecular conformation during re-sequencing, thereby ensuring the sequencing length and sequencing efficiency of re-sequencing.
  • when the length of the extended fragment is less than 10 bp, the molecular conformation during re-sequencing is affected by the previous sequencing strand, so the re-sequencing read length becomes shorter and the sequencing efficiency decreases.
  • when the length of the extended fragment is greater than 20 bp, the sequencing cost increases.
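  • The alternation of sequenced strands and unlabeled extension fragments can be pictured as intervals along the template. The sketch below assumes, for illustration only, that every nascent sequencing strand has the same length and every extension fragment has the same length; the function name and the tuple layout are hypothetical.

```python
def layout_segments(n_rounds, read_len, ext_len):
    """Return (kind, index, start, end) tuples for N+1 reads and N extensions."""
    segments = []
    pos = 0
    for i in range(1, n_rounds + 2):
        # The i-th nascent sequencing strand (sequenced with labeled nucleotides).
        segments.append(("read", i, pos, pos + read_len))
        pos += read_len
        if i <= n_rounds:
            # The i-th extension fragment (unlabeled), separating read i from read i+1.
            segments.append(("extension", i, pos, pos + ext_len))
            pos += ext_len
    return segments


if __name__ == "__main__":
    for kind, index, start, end in layout_segments(6, 25, 15):
        print(f"{kind:9s} {index}: {start:3d}-{end:3d}")
```

  • In this layout the extension length is exactly the gap between consecutive reads, which is the spacing the 10-20 bp range above is meant to control.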
  • the nucleic acid template can be directly immobilized on the surface of the solid phase carrier through a covalent bond, or can be fixed on the surface of the solid phase carrier by hybridizing with the first sequencing primer, wherein the first sequencing primer is covalently bonded to the surface of the solid phase carrier.
  • the nucleic acid template is directly immobilized on the surface of the solid phase carrier through a covalent bond, and the nucleic acid template is obtained by the following steps:
  • the sixth block is used to block the nucleic acid chains on the chip surface, and the nucleic acid molecules on the chip surface include linkers, nucleic acid templates, residual initial templates, and the like.
  • the sixth sealing can effectively prevent the nucleic acid molecules on the surface of the chip from generating interference signals during sequencing, and can further improve the accuracy of sequencing results.
  • Removal of the nucleic acid template can be carried out by physical or chemical methods: physical methods include high-temperature denaturation (for example, 80°C-98°C), and chemical methods use denaturing reagents such as NaOH or formamide.
  • the above-mentioned removal of the nucleic acid template is performed by dissociating the nucleic acid template strand from its complementary strand with the denaturing reagent formamide.
  • step (1-b) further comprises: (1-b-1) performing a seventh blocking treatment on the 3' end of the complementary strand in step (1-b).
  • the seventh block is used to block the 3' end of the complementary chain to avoid interference signals generated by the continued extension of the complementary chain during the sequencing process, thereby effectively increasing the amount of effective data and reducing the interference of invalid data on information analysis. Therefore, the accuracy of the sequencing result can be further improved through the seventh blocking treatment.
  • the sixth blocking treatment and the seventh blocking treatment are independently performed by linking the 3' terminal hydroxyl group with an extension reaction blocker.
  • the ends of the nucleic acid strands can be blocked in different ways, such as by removing the 3' terminal hydroxyl group and/or by attaching the 3' terminal hydroxyl group to an extension reaction blocking agent.
  • the above-mentioned fifth blocking is performed by linking the 3' terminal hydroxyl group with an extension reaction blocking agent.
  • the extension reaction blocking agent is used to block the reaction between the 3' terminal hydroxyl group and the phosphate group, and the extension reaction blocking agent can be an alkyl group, ddNTP or its derivatives, etc.
  • the above extension reaction blocking agent is ddNTP or its derivatives.
  • the sixth blocking treatment and the seventh blocking treatment are respectively independently performed using at least one of DNA polymerase and terminal transferase.
  • DNA polymerase uses the DNA strand as a template to add ddNTP to the 3' end of the nucleic acid strand to be blocked, so as to achieve the effect of blocking the 3' end.
  • Terminal transferase can directly add ddNTP to the 3' end of single-stranded nucleic acid to achieve the effect of 3' end blocking.
  • the above-mentioned fourth sequencing method further includes:
  • the nucleic acid template is used as the template and the (N+1)th nascent sequencing strand is used as the primer for extension, forming a complementary strand of the nucleic acid template, wherein the fifth nucleotide is a natural nucleotide and/or a derivative thereof;
  • the complementary strand of the nucleic acid template is used as the template and the third sequencing primer is used as the primer to carry out extension cycles for the (N+2)th sequencing, forming the (N+2)th nascent sequencing strand and obtaining the (N+2)th sequencing data;
  • wherein the first sequencing primer is connected to the surface of the solid phase carrier through a covalent bond, and the nucleic acid template is connected to the surface of the solid phase carrier through the first sequencing primer.
  • the above-mentioned fourth sequencing method further comprises step (7-a): performing an eighth blocking treatment on the 3' end of the nucleic acid molecule on the chip surface.
  • the eighth block is used to block nucleic acid molecules on the surface of the chip.
  • Nucleic acid molecules on the surface of the chip include complementary strands of nucleic acid templates, first sequencing primers, residual templates, and the like.
  • interference signals generated by the complementary strand and the extension of the first sequencing primer can be avoided during the sequencing process, thereby effectively increasing the amount of effective data and reducing the interference of invalid data on information analysis.
  • the eighth blocking process can further improve the accuracy of the sequencing results.
  • the above-mentioned fourth sequencing method further comprises step (10): using the third nucleotide, under conditions suitable for a sequencing-by-synthesis reaction or a sequencing-by-ligation reaction, with the complementary strand of the nucleic acid template as a template and the (N+2)th extended fragment as a primer, an extension cycle is carried out to perform the (N+3)th sequencing, forming the (N+3)th nascent sequencing strand and obtaining the (N+3)th sequencing data.
  • the above-mentioned fourth sequencing method further comprises step (11): repeating steps (9) and (10) N-1 times to obtain the (N+2)th to (2N+2)th nascent sequencing strands, the (N+2)th to (2N+2)th sequencing data, and the (N+2)th to (2N+1)th extension fragments; the (2N+1)th extension fragment is obtained by extension using the fourth nucleotide, under conditions suitable for a polymerization reaction, with the complementary strand of the nucleic acid template as a template and the (2N+1)th nascent sequencing strand as a primer; the (2N+2)th nascent sequencing strand and the (2N+2)th sequencing data are obtained by extension cycles using the third nucleotide, under conditions suitable for a sequencing-by-synthesis reaction or a sequencing-by-ligation reaction, with the complementary strand of the nucleic acid template as a template and the (2N+1)th extension fragment as a primer.
  • the ends of the nucleic acid strands can be blocked in different ways, such as by removing the 3' terminal hydroxyl group and/or by attaching the 3' terminal hydroxyl group to an extension reaction blocking agent.
  • the eighth blocking in the fourth sequencing method described above is performed by linking the 3' terminal hydroxyl to an extension reaction blocker.
  • the extension reaction blocking agent is used to block the reaction between the 3' terminal hydroxyl group and the phosphate group, and the extension reaction blocking agent can be an alkyl group, ddNTP or its derivatives, etc.
  • the above extension reaction blocking agent is ddNTP or its derivatives.
  • the eighth blocking treatment is performed using terminal transferase.
  • Terminal transferase can directly add ddNTP to the 3' end of single-stranded nucleic acid to achieve the effect of 3' end blocking.
  • the sequencing data of different positions of the same template and/or its complementary chain are obtained through two or more times of sequencing.
  • Using this sequencing method can, on the one hand, increase the amount of sequencing data and, on the other hand, allow sequencing data from different positions of the same template and/or its complementary chain, especially sequencing data with overlapping regions, to be used to assemble or proofread the template sequence, which can improve the efficiency and accuracy of sequencing data assembly.
  • the sequencing method provided in one embodiment by blocking the ends of the complementary strands, and/or blocking the primers on the surface of the chip, and/or blocking the residual nascent sequencing strands, etc., it is possible to avoid the complementary strands, chips, etc.
  • the length of the extended fragment is controlled by using an unlabeled terminator, on the one hand to reduce the impact of the sequencing chain on the molecular conformation of the re-sequencing event, and on the other hand to control the cost of sequencing.
  • When the length of the extended fragment is controlled at 10-20 bp, it can effectively space two nascent sequencing strands apart, reducing the impact of the earlier sequencing strand on the molecular conformation during re-sequencing, thereby ensuring the read length and sequencing efficiency of re-sequencing.
  • When the length of the extended fragment is less than 10 bp, the molecular conformation during re-sequencing is affected by the previous sequencing strand, so the re-sequencing read becomes shorter and the sequencing efficiency decreases.
  • the sequencing cost will be increased.
  • the read length of single-molecule sequencing equipment such as HeliScope is relatively short.
  • the base side chain will leave residues (scars) after the fluorescent dye is excised.
  • the accumulation of these scars will affect subsequent extension reactions. Therefore, at present it is difficult to achieve long-read sequencing with single-molecule sequencing equipment such as HeliScope, and the average read length is usually about 40 bp.
  • the inventors proposed a scheme to perform multiple rounds of sequencing on the same insert at different positions and, if necessary, to carry out an extension reaction using a reversible terminator without a detectable label.
  • the reversible terminator without a detectable label can synthesize a nucleic acid sequence that serves as a spacer, which can weaken the interference of scar accumulation on the fluorescent signal in subsequent extension reactions. In this way, the effective sequencing range for the same insert can be extended, achieving the effect of extending the read length.
  • the current read analysis strategies do not fully suit this new type of sequencing technology. Therefore, after proposing this type of sequencing technology, the inventors further researched and improved the corresponding read analysis strategy, thus completing the present disclosure, in which a novel sequencing data analysis method is proposed.
  • the present disclosure proposes a sequencing data processing method.
  • the sequencing data is generated by performing multiple rounds of sequencing on the same insert fragment. Therefore, the obtained sequencing data includes multiple read groups, and each read group corresponds to one insert fragment.
  • Each read group includes multiple reads. The reads in the same read group are obtained by multiple rounds of sequencing on the same insert, so each read corresponds to one round of sequencing. For example, in paired-end sequencing each read group includes two reads, Read1 and Read2, which correspond to the sequencing results from the two ends.
  • those skilled in the art can group the reads in the sequencing data by conventional means, for example by the site corresponding to each read, so as to obtain multiple read groups, each of which corresponds to the same insert. The reads in each read group are then analyzed and processed separately, and reads that can be used for subsequent assembly are selected from the large number of reads.
  • each read group corresponds to an insert, which should be understood in a broad sense, and can be obtained based on extension reactions at different positions of the nucleic acid template strand of the same insert. It can also be obtained based on the sequencing reaction of other nucleic acid strands associated with the insert. Examples of such other nucleic acid strands include but are not limited to complementary strands or multiple identical copies (such as multiple copies obtained by rolling circle replication) .
  • each insert corresponds to a specific position on the sequencing reaction chip
  • the grouping of reads can be achieved by distinguishing the chip positions corresponding to each read.
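  • The grouping by chip position described above can be sketched as follows. This is a minimal illustration only: the record layout `(round_index, chip_position, sequence)` and the function name `group_reads_by_position` are hypothetical and do not come from the disclosure.

```python
from collections import defaultdict

def group_reads_by_position(reads):
    """Group reads into read groups keyed by chip position.

    `reads` is an iterable of (round_index, chip_position, sequence) tuples;
    this record layout is an assumption made for the sketch.
    """
    by_position = defaultdict(dict)
    for round_index, chip_position, sequence in reads:
        # One read per sequencing round at a given chip position.
        by_position[chip_position][round_index] = sequence
    # Order the reads of each group by sequencing round.
    return {
        position: [rounds[r] for r in sorted(rounds)]
        for position, rounds in by_position.items()
    }
```

  • Each value of the returned dictionary is one read group, that is, the reads obtained from the same insert in successive rounds.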
  • the reads in each read group are analyzed to obtain reads that can be assembled.
  • the following describes in detail the processing of multiple reads in each read group with reference to FIGS. 1-3 .
  • S110 Globally align the multiple reads with the reference genome, so as to determine multiple matching regions corresponding to the multiple reads on the reference genome.
  • each read segment is compared with the reference genome by using global alignment, and the matching position of each read segment on the reference genome sequence can be determined.
  • global alignment refers to the alignment of all characters in the two sequences participating in the alignment. In this context, it refers to aligning reads to a reference genome or a portion thereof. Global alignment scores the two sequences over their full length to find the best alignment and is mainly used to find closely related sequences.
  • a representative algorithm for global alignment is the Needleman-Wunsch algorithm.
  • the algorithm provided by the sequencing platform can also be used to perform global comparison, for example, referring to the content recorded in CN107403075A, the above-mentioned global comparison operation can be realized.
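  • For orientation, the scoring recurrence of the Needleman-Wunsch algorithm can be sketched as below. The scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration; the disclosure does not specify them, and a production aligner for a whole genome would rely on indexing rather than on this quadratic table.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences a and b."""
    # Row 0: aligning an empty prefix of `a` against prefixes of `b`.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # Column 0: prefix of `a` against empty prefix of `b`.
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Best of: align both bases, gap in `b`, gap in `a`.
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]
```

  • Because every base of both sequences must be accounted for, end gaps are penalized, which is what distinguishes this global score from a local alignment score.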
  • the matching (mapping) region of the reads on the reference genome sequence can be determined.
  • If a read can be aligned with only one region of the reference genome sequence, that is, there is only one matching region, the read is called a uniquely aligned read.
  • the preset position requirement is determined by the rules of the multiple rounds of sequencing. An actual relative position that meets the preset position requirement indicates that the read is a splicable read; an actual relative position that does not meet the preset position requirement indicates that the read is a filtered read.
  • reads from multiple rounds of sequencing of the same insert can be effectively screened to obtain reads that can be spliced, thereby improving the efficiency of subsequent processing of the sequencing data and avoiding the adverse effects caused by reads that are too short.
  • the filtered reads can be further screened for a second time.
  • the filtered reads that were filtered out in the first screening still contain useful reads, and thus can be picked up by performing a second screening.
  • the secondary screening process includes:
  • S210 Use at least one of the read segment group as a preliminary read segment, and determine a secondary alignment region on the reference genome based on the matching region corresponding to the preliminary read segment and a preset position requirement.
  • a read is used as a preliminary read; this preliminary read is not limited to a filtered read and can also be a read that was selected as a splicable read in the first screening.
  • a secondary alignment region is set within a certain range around the preliminary read, for example by extending a certain length outward from both ends of the preliminary read, such as 100 bp, 200 bp, 300 bp, 500 bp, 1000 bp or even 2000 bp.
  • within this secondary alignment region, filtered reads that can be aligned are searched for. In this way, the accuracy of the sequencing results can be further improved, and the loss of read information arising from nucleic acid mutations in the sample can also be avoided.
  • the comparison results of the reads corresponding to these mutations and the reference genome usually cannot meet the previous preset position requirements.
  • S220 Locally align each read segment of the filtered read segment with the secondary alignment region, and classify the read segment meeting a predetermined threshold and the preliminary read segment as a read segment that can be spliced.
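  • Steps S210 and S220 can be sketched as follows. This is an illustration under stated assumptions: coordinates are 0-based and half-open, the names `secondary_region` and `rescue_filtered_reads` are hypothetical, and the acceptance test is passed in as a callback (`is_acceptable`) standing in for whatever local alignment and threshold are actually used.

```python
def secondary_region(match_start, match_end, flank, reference_length):
    """Extend the preliminary read's matching region by `flank` bases on both
    sides, clipped to the reference boundaries (S210)."""
    return max(0, match_start - flank), min(reference_length, match_end + flank)

def rescue_filtered_reads(filtered_reads, reference, region, is_acceptable):
    """Keep the filtered reads that align acceptably inside the secondary
    alignment region (S220). `is_acceptable(read, window)` is a caller-supplied
    stand-in for local alignment plus the predetermined threshold."""
    start, end = region
    window = reference[start:end]
    return [read for read in filtered_reads if is_acceptable(read, window)]
```

  • Restricting the local alignment to a window of a few hundred bases keeps the more expensive alignment step cheap, since each filtered read is compared against a short window rather than the whole reference.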
  • the predetermined thresholds mentioned here and the thresholds mentioned elsewhere in this paper can be obtained by statistical analysis of samples with known properties.
  • reads that can be used for splicing can be recovered from reads that failed the first alignment and would otherwise be removed, thereby saving sequencing resources and improving sequencing efficiency and accuracy.
  • each read of the read group is used as a preliminary read for secondary screening.
  • screening of all reads can be done as far as possible.
  • S140 Assemble the splicable reads according to the rules of multiple rounds of sequencing.
  • the splicing here can follow the rules of the multiple rounds of sequencing: the splicable reads can be spliced by adding N at unknown positions or by merging overlapping regions. Details are not repeated here.
  • the rules of multiple rounds of sequencing include at least one selected from the following: paired-end sequencing, Jumping sequencing, Overlap sequencing, paired-end Jumping sequencing, and combinations of these sequencing rules.
  • the rule of multiple rounds of sequencing is paired-end sequencing
  • the read segment group includes two read segments
  • the preset position requirements include: the matching regions of the two reads are located on the positive strand and the negative strand of the reference genome, respectively; and the distance between the matching regions of the two reads on the reference genome does not exceed a predetermined threshold, wherein the predetermined threshold is determined based on the length of the insert.
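  • The paired-end position requirement can be expressed as a small predicate. The sketch below is illustrative: the `Match` record, its field names, and the check that both reads lie on the same reference sequence are assumptions added for the example, not details taken from the disclosure.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Match:
    chrom: str   # reference sequence name (assumed field)
    start: int   # 0-based, inclusive
    end: int     # exclusive
    strand: str  # "+" or "-"

def meets_paired_end_requirement(m1, m2, max_distance):
    """True when the two matching regions lie on opposite strands and are no
    further apart than `max_distance`, a threshold derived from insert length."""
    if m1.chrom != m2.chrom:  # assumption: both ends must hit the same sequence
        return False
    if {m1.strand, m2.strand} != {"+", "-"}:
        return False
    # Distance between the regions; negative when they overlap.
    distance = max(m1.start, m2.start) - min(m1.end, m2.end)
    return distance <= max_distance
```

  • A read group that passes this predicate would be kept as splicable reads; one that fails would go to the filtered reads and remain eligible for secondary screening.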
  • the method for analyzing the sequencing data of the paired-end sequencing specifically includes:
  • the paired-end sequence files Fa1 and Fa2 can be obtained respectively through the comparison algorithm, and the sequences in the two files are corresponding in position.
  • the so-called correspondence in position means that the read segments with the same sequence number in the file come from the same physical position on the sequencing reaction chip. Therefore, the read segments with the same sequence number in Fa1 and Fa2 correspond to read segment 1 and read segment 2 respectively, and correspond to the read segments sequenced twice in the paired-end sequencing schematic diagram.
  • the global alignment can be performed with third-party mapping software or with the DirectAlignment algorithm software that accompanies GenoCare.
  • the sequences in Sam1 and Sam2 can be divided into three categories according to the alignment results of the paired-end sequences corresponding to each position. They are: 1. Both paired-end sequences are uniquely aligned to the genome; 2. There is only one paired-end sequence uniquely aligned to the genome; 3. No paired-end sequences are uniquely aligned to the genome.
  • For category 2, the read at the other end is locally aligned within 300 bp before and after the unique alignment position of the uniquely aligned end (local alignment is also referred to herein as "fine alignment"); if a corresponding position can be found for the read at the other end, that position is considered an accurate paired-end sequencing position. If no matching position can be found for the other end near the unique alignment position, the paired-end sequence is discarded.
  • For category 3, if both ends of the paired-end sequence can be aligned to the genome but neither is uniquely aligned, the pair is treated as category 1; if one and only one end can be aligned to the genome, but not uniquely, the pair is treated as category 2; if the paired-end sequence cannot be aligned to the genome, it is discarded.
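  • The three-way classification and the handling of category 3 can be sketched as below. The input is assumed to be the number of matching regions found for each end by the global alignment; the function names and the returned labels are hypothetical.

```python
def classify_pair(hits_read1, hits_read2):
    """Return category 1, 2 or 3 from the number of matching regions per end.
    An end is uniquely aligned when it has exactly one matching region."""
    unique_ends = (hits_read1 == 1) + (hits_read2 == 1)
    if unique_ends == 2:
        return 1
    if unique_ends == 1:
        return 2
    return 3

def category3_action(hits_read1, hits_read2):
    """Decide how a category 3 pair (no uniquely aligned end) is handled."""
    aligned_ends = (hits_read1 > 0) + (hits_read2 > 0)
    if aligned_ends == 2:
        return "treat_as_category_1"
    if aligned_ends == 1:
        return "treat_as_category_2"
    return "discard"
```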
  • the local alignment algorithms used in this paper include, but are not limited to, the Smith-Waterman algorithm.
  • “another read can find the corresponding position” means that the local optimal sequence length in the Smith-Waterman alignment result is greater than the preset threshold and the error rate is lower than the preset threshold, and the corresponding position is considered to be found.
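  • The acceptance criterion above (local optimal length above a threshold, error rate below a threshold) can be illustrated with a small Smith-Waterman implementation. The scoring values (match +2, mismatch -1, gap -2) and the function names are assumptions for this sketch; the disclosure does not state the parameters used.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (aligned_length, errors) for the best local alignment of a and b,
    where errors counts mismatched and gapped columns."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the local alignment starts (score 0).
    i, j = best_pos
    length = errors = 0
    while i > 0 and j > 0 and score[i][j] > 0:
        diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diagonal:
            errors += a[i - 1] != b[j - 1]
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            errors += 1
            i -= 1
        else:
            errors += 1
            j -= 1
        length += 1
    return length, errors

def finds_position(read, region, min_length, max_error_rate):
    """Apply the two preset thresholds to the local alignment result.
    `min_length` is assumed to be non-negative."""
    length, errors = smith_waterman(read, region)
    return length > min_length and errors / length < max_error_rate
```

  • The two thresholds would in practice come from statistical analysis of samples with known properties, as stated elsewhere in this disclosure.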
  • the way of merging is: if read 1 and read 2 have overlapping regions, merge the overlapping regions and splice them into a longer sequence.
  • the splicing strategy may adopt a consensus base-calling strategy. If there is no overlapping region between read 1 and read 2, N is used to mark the missing stretch in the middle, and the number of N equals the number of bases between the reads at the two ends. If no correct paired-end sequencing position is found for the reads in Sam1 and Sam2, the reads in Sam1 or Sam2 that can be aligned (including uniquely aligned) to the genome are output.
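  • The merging rule can be sketched using the reference coordinates of the two reads. The sketch makes simplifying assumptions that are not from the disclosure: both reads are already oriented to the same strand, their alignments contain no insertions or deletions, and in an overlap the bases of the upstream read are kept rather than applying a consensus base-calling strategy. The function name is hypothetical.

```python
def merge_by_position(seq1, start1, seq2, start2):
    """Merge two reads given the 0-based reference coordinate of their first base.
    Gaps are filled with N; overlapping regions are merged."""
    if start2 < start1:
        # Make (seq1, start1) the upstream read.
        seq1, start1, seq2, start2 = seq2, start2, seq1, start1
    end1 = start1 + len(seq1)
    gap = start2 - end1
    if gap >= 0:
        # No overlap: one N for every base between the two reads.
        return seq1 + "N" * gap + seq2
    # Overlap: skip the part of the downstream read already covered.
    overlap = min(-gap, len(seq2))
    return seq1 + seq2[overlap:]
```

  • For example, reads starting at reference positions 0 and 7 with lengths of 4 bases each are joined with three N characters between them.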
  • the rule of multiple rounds of sequencing is Jumping sequencing
  • the preset position requirements include: the matching regions of the multiple reads are located on the same strand of the reference genome; and the distance on the reference genome between two adjacent reads does not exceed a predetermined distance threshold, wherein the predetermined distance threshold is determined based on the length of the partial extension step, for example, the predetermined distance threshold does not exceed 50 bp, such as not exceeding 20 bp, such as between 5 and 20 bp.
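  • A possible check of the Jumping position requirement is sketched below. Matching regions are assumed to be `(start, end, strand)` tuples in 0-based half-open coordinates, the function name is hypothetical, and treating a group with fewer than two reads as failing is a choice made for this example.

```python
def meets_jumping_requirement(matches, max_distance):
    """True when all matching regions are on the same strand and every pair of
    adjacent regions is separated by at most `max_distance` bases."""
    if len(matches) < 2 or len({strand for _, _, strand in matches}) != 1:
        return False
    ordered = sorted(matches)  # order along the reference by start coordinate
    return all(
        nxt[0] - cur[1] <= max_distance
        for cur, nxt in zip(ordered, ordered[1:])
    )
```

  • With a threshold of 20 bp, three reads separated by 15 bp spacers pass, while a pair separated by 60 bp does not.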
  • Jumping sequencing includes: providing a nucleic acid template that is directly or indirectly connected to the surface of the solid phase carrier; and using the first nucleotide and the second nucleotide to carry out multiple rounds of extension reactions with the nucleic acid template, wherein the first nucleotide is a reversible terminator with a detectable label and is used to obtain multiple reads through the extension reaction, and the second nucleotide is a reversible terminator without a detectable label and is used to obtain at least one synthetic fragment of a preset length through the extension reaction.
  • the rule of multiple rounds of sequencing is Overlap sequencing
  • the preset position requirements include: the matching regions of the multiple reads are located on the same strand of the reference genome; and the length of the overlapping region on the reference genome between two adjacent reads is within a predetermined distance range, wherein the predetermined distance range is determined based on the length of the overlapping region during the sequencing process, for example, the predetermined distance range is between 5 and 10 bp.
  • Overlap sequencing includes: the nucleic acid template is directly or indirectly linked to the surface of a solid phase carrier; multiple rounds of extension reactions occur with the nucleic acid template using the first sequencing adapter and the second sequencing adapter, so as to obtain multiple reads, wherein the first read generated by the first sequencing adapter and the second read generated by the second sequencing adapter have an overlapping region of at least one base, and optionally, the first sequencing adapter uses the first nucleotide An extension reaction is performed to obtain the first reads; second sequencing adapter generation is first performed with the second nucleotides, followed by multiple extension reactions with the first nucleotides to obtain the second reads.
  • the corresponding sequencing sequence file Fa can be obtained through the BaseCalling algorithm provided by GenoCare as before.
  • splicing of N Overlap sequencing sequences can be realized.
  • the results of the two sequencing are processed, so the sequence files Fa1 and Fa2 of the two sequencing can be obtained.
  • the average length of the overlap can be controlled at 5-10 bp through the parameter settings of the experiment, although sometimes there will be no overlap.
  • the most locally similar region in the two sequences can be found using a local alignment algorithm (such as Smith-Waterman).
  • a preset threshold such as 5bp
  • the error rate of the similar region is greater than the preset threshold
  • if there are more than two rounds of Overlap sequencing, the read obtained by pairwise splicing is set as read 1 and the operation in the previous step is repeated. Through iteration, longer reads can be obtained and output to the final Fa file.
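  • The iterative pairwise splicing can be sketched as follows. For brevity the sketch detects the overlap by comparing the suffix of the left read with the prefix of the right read instead of running a full Smith-Waterman alignment, and it joins reads without a detectable overlap by a single placeholder `N`; both choices, as well as the function names and default thresholds, are assumptions for illustration.

```python
def splice_pair(left, right, min_overlap=5, max_error_rate=0.2, unknown_gap="N"):
    """Join `right` onto `left` using the longest acceptable suffix/prefix overlap."""
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        mismatches = sum(x != y for x, y in zip(left[-k:], right[:k]))
        if mismatches / k <= max_error_rate:
            return left + right[k:]  # keep the left read's bases in the overlap
    # No acceptable overlap: mark the junction (placeholder length is arbitrary).
    return left + unknown_gap + right

def splice_all(reads, **thresholds):
    """Iterate: the spliced result becomes read 1 for the next round."""
    merged = reads[0]
    for next_read in reads[1:]:
        merged = splice_pair(merged, next_read, **thresholds)
    return merged
```

  • Trying the longest candidate overlap first means a spurious short match is used only when no longer acceptable overlap exists.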
  • the rule of multiple rounds of sequencing is paired-end Jumping sequencing
  • the preset position requirements include: a part of the matching regions of multiple reads is located on the forward strand of the reference genome, and the other part is located on the reverse strand of the reference genome;
  • the length of the overlapping region of two adjacent reads on the reference genome in the matching region of the plurality of reads is within a predetermined distance range, wherein the predetermined distance range is determined based on the length of the partial extension step in the sequencing process, for example, the predetermined distance threshold is not more than 50 bp, for example not more than 20 bp, for example between 5 and 20 bp. Referring to FIG.
  • paired-end Jumping sequencing includes: hybridizing the nucleic acid template with a first primer, wherein at least a part of the first primer is complementary to the 3' end of the nucleic acid template and the first primer is covalently attached to the surface of a solid phase carrier; using the first nucleotide and the second nucleotide to carry out multiple rounds of extension reactions based on the first primer and the nucleic acid template, obtaining a first primer extended chain; removing the nucleic acid template and hybridizing a second primer with the first primer extended chain; and using the first nucleotide and the second nucleotide to carry out multiple rounds of extension reactions based on the second primer and the first primer extended chain; wherein the first nucleotide is a reversible terminator with a detectable label and is used to obtain multiple reads through the extension reaction, and the second nucleotide is a reversible terminator without a detectable label and is used to obtain at
  • paired-end jumping sequencing can be performed by combining the rules of paired-end sequencing and jumping sequencing, and the analysis of the paired-end jumping sequencing results can be completed by referring to the analysis process described above.
  • N sequencing fragments are obtained through paired-end Jumping sequencing.
  • Different sequencing fragments for paired-end sequencing at the same position are represented as Reads1,1, Reads1,2, ..., Reads1,N, Reads2,1, Reads2,2, ..., Reads2,N, respectively.
  • the present disclosure proposes a sequencing data processing device. The sequencing data includes multiple read groups, each read group includes multiple reads, and the multiple reads are obtained through multiple rounds of sequencing of the same insert fragment. The device includes multiple modules that perform the following processing on the multiple reads of each read group:
  • a global alignment module 110, for globally aligning the multiple reads with the reference genome so as to determine multiple matching regions corresponding to the multiple reads on the reference genome; and a screening module 120, for performing a first screening of the multiple reads, based on a comparison of the actual relative positions between the multiple matching regions with the preset position requirements, to obtain splicable reads and filtered reads, wherein the preset position requirements are determined by the rules of the multiple rounds of sequencing, an actual relative position that meets the preset position requirement indicates that the read is a splicable read, and an actual relative position that does not meet the preset position requirement indicates that the read is a filtered read.
  • the sequencing data processing method described in the aforementioned first aspect can be effectively implemented.
  • with the sequencing data processing method according to the embodiment of the present disclosure, reads from multiple rounds of sequencing of the same insert can be effectively screened to obtain reads that can be spliced, thereby improving the efficiency of subsequent processing of the sequencing data and avoiding the adverse effects caused by reads that are too short.
  • the secondary screening module 130 is configured to perform secondary screening on the filtered reads.
  • the secondary screening includes: taking at least one read of the read group as a preliminary read, and determining a secondary alignment region on the reference genome based on the matching region corresponding to the preliminary read and the preset position requirements; and locally aligning each of the filtered reads to the secondary alignment region, and classifying the reads that meet a predetermined threshold, together with the preliminary read, as splicable reads.
  • the splicing module 140 is configured to splice the reads that can be spliced according to the rules of multiple rounds of sequencing.
  • the rules of multiple rounds of sequencing include at least one selected from the following: paired-end sequencing, Jumping sequencing, Overlap sequencing, paired-end Jumping sequencing, and combinations of these sequencing rules.
  • the present disclosure proposes a computing device which, according to an embodiment of the present disclosure, includes a processor and a memory; the memory is used to store a computer program, and the processor is used to execute the computer program so as to implement the aforementioned sequencing data processing method.
  • the present disclosure proposes a computer-readable storage medium.
  • the storage medium includes computer instructions. When the instructions are executed by the computer, the computer can realize the aforementioned sequencing data processing method.
  • the Genocare single-molecule sequencing platform used in the examples is a platform for detecting incorporated nucleotide species using a TIRF imaging system.
  • the cleaning solution 1 component includes: 150mmol/L sodium chloride, 15mmol/L sodium citrate, 150mmol/L 4-hydroxyethylpiperazineethanesulfonic acid, and 0.1% sodium lauryl sulfate.
  • the components of cleaning solution 2 include: 150mmol/L sodium chloride, 150mmol/L 4-hydroxyethylpiperazineethanesulfonic acid.
  • Hybridization solution: 3× SSC buffer, prepared by diluting 20× SSC buffer (Sigma, #S6639-1L) with nuclease-free water (RNase-free water).
  • Cold-dNTP: end-blocked nucleotides, including end-blocked adenine nucleotide (Cold-dATP), end-blocked thymine nucleotide (Cold-dTTP), end-blocked cytosine nucleotide (Cold-dCTP), and end-blocked guanine nucleotide (Cold-dGTP).
  • the end-blocked nucleotides were purchased from MyChem, which were natural dATP, dTTP, dCTP, and dGTP whose 3'OH was blocked by a reversible blocking group.
  • the Vazyme DNA library preparation kit (No. ND606-01, Universal DNA Library Prep Kit for Illumina V2) was used to ligate the D7-S1-T/D9-S2 adapter to the DNA fragments (100-300 bp); no PCR amplification is needed after ligation, and the product is purified directly with Vazyme DNA purification magnetic beads (N411-01, VAHTS DNA Clean Beads) to obtain the target library.
  • the steps of library construction in this embodiment include:
  • the reaction conditions are: react at 20° C. for 15 minutes, and then react at 65° C. for 10 minutes.
  • the reaction conditions are as follows: after mixing, place at room temperature for 15 minutes.
  • the VAHTS DNA Clean Beads (N411-01) kit was used for purification according to the steps in the kit manual, and 10 μL of product was recovered to complete the construction of the sequencing library. Specific steps are as follows:
  • the chip used is an epoxy-modified chip. The probe (sequence: TTTTTTTTTTTTCCTGATACCTGCGACCATCCAGTTCCACTCAGATGTGTATAAGAGACAG) (SEQ ID NO: 4) is immobilized by reacting the amino group on the probe with the epoxy group on the chip surface, for example with reference to the method disclosed in publication number CN109610006A.
  • the hybridization process between the library and the probe on the chip is as follows:
  • 1) Take 3 μL of the sequencing library constructed in step 1 at a concentration of 20 nmol/L, add 3 μL of deionized water, mix well, and heat-denature at 95°C for 5 minutes;
  • 4) Pass 30 μL of the diluted hybridization library obtained from step 3) into one channel of the secondary chip, perform a hybridization reaction at 42°C for 30 minutes, and then cool to room temperature;
  • the chip hybridized with the library in Example 1 was placed in a Genocare single-molecule sequencer for sequencing.
  • the sequencing steps are as follows, and the schematic diagram of the sequencing process is shown in FIG. 8 .
  • the Genocare single-molecule sequencing platform is used for 80 cycles of sequencing. During the sequencing process, four nucleotides with two different fluorescent signals are used, and two nucleotides labeled with different fluorescent signals are added in each round of reaction for signal detection and sequencing.
  • the extension reagent components are: 120 U/ml Bst DNA polymerase (NEB, #M0275M), 0.2 mmol/L dNTP (a mixture of dATP, dTTP, dCTP and dGTP, 0.2 μmol/L each), 1 M betaine, 20 mmol/L tris, 10 mmol/L sodium chloride, 10 mmol/L potassium chloride, 10 mmol/L ammonium sulfate, 3 mmol/L magnesium chloride, 0.1% Triton X-100, pH 8.3;
  • step 2) Repeat step 2) and step 3) once to complete the removal of the initial template.
  • Pass in 750 μL of blocking reagent 2 and react for 15 minutes.
  • the components of blocking reagent 2 are: 100 U/ml Terminal Transferase (NEB, M0315L), 1× Terminal Transferase Buffer, 0.25 mmol/L cobalt chloride, 100 μmol/L ddNTP mix (a mixture of ddATP, ddTTP, ddCTP and ddGTP, 100 μmol/L each);
  • the diluted sequencing primer hybridization solution is cleaning solution 3 containing 0.1 μmol/L primer D7S1T-R2P;
  • sequence files Fa1 and Fa2 of the paired-end sequencing can be respectively obtained through the comparison algorithm, and the sequences in the two files are corresponding in position.
  • positional correspondence refers to the Reads with the same sequence number in the file, which comes from the same physical position in the sequencing.
  • the mapping algorithm is used to align Fa1 and Fa2 to the corresponding genome, yielding the alignment result files Sam1 and Sam2, respectively.
  • the Mapping algorithm can choose a published method.
  • the sequences in Sam1 and Sam2 can be divided into three categories according to the alignment results of the paired-end sequences corresponding to each position: 1. both ends of the paired-end sequence are uniquely mapped (Unique Mapping) to the genome; 2. one and only one end of the paired-end sequence is uniquely mapped to the genome; 3. neither end of the paired-end sequence is uniquely mapped to the genome.
  • If the corresponding position can be found for the Reads at the other end, the position is considered an accurate paired-end sequencing position. If no matching position is found for the Reads at the other end near the Unique position of the paired-end sequence, the paired-end sequence is discarded.
  • If both ends of the paired-end sequence are mapped to the genome but not uniquely, the pair is treated as class 1; if one and only one end is mapped to the genome, but not uniquely, the pair is treated as class 2; if the paired-end sequence is not mapped to the genome, it is discarded.
  • the "meticulous alignment” mentioned above refers to the use of a finer local alignment algorithm, such as the Smith-Waterman algorithm. "Another Reads can find the corresponding position” means that the local optimal sequence length in the Smith-Waterman alignment result is greater than the preset threshold and the error rate is lower than the preset threshold, and the corresponding position is considered to be found.
  • the merging method is: if Reads1 and Reads2 have overlapping areas, then merge the overlapping areas and splice them into a longer sequence.
  • the splicing strategy is as follows. If there is no overlapping area between Reads1 and Reads2, Ns are used to mark the missing length in the middle, where the number of Ns is the number of bases between the Reads at the two ends. If no correct paired-end sequencing position is found for the Reads in Sam1 and Sam2, the Reads in Sam1 or Sam2 that can be mapped (including Unique Mapping) to the genome are output.
  • Splicing strategy align two corresponding Reads with each other to obtain a common consensus sequence.
  • the two sequences are aligned using the Smith-Waterman algorithm, and the consensus sequence refers to the local best-matching sequence obtained after alignment by adding, deleting or modifying some of the bases in the sequence.
  • the inconsistent Base positions in the consensus sequence are judged one by one according to the constructed correction model. Calculate the probability of deletion or insertion at this position according to the base types before and after the Base position. If the probability of Deletion is greater than 50%, it is considered that the measured Base at this position should not appear, so the Base at this position is deleted. Otherwise, keep the Base at that position.
  • step 4 Make statistics on the deletion and insertion in step 4), and at the same time make statistics on the types of Base before and after the inconsistency. Therefore, the probability of causing Insertion or Deletion before or after different Base types is obtained.
  • The naive Bayes model used in this example is as follows:
  • P(D|XY) represents the probability of a deletion at a given base when the bases before and after it are X and Y, with X, Y ∈ {A, C, G, T};
  • P(D) represents the probability of a deletion at a given base;
  • P(I) represents the probability of an insertion at a given base.
  • P(XY|D) and P(XY|I) can be obtained by counting how often each pair of flanking bases occurs around deletions and insertions, so that P(D|XY) can be calculated.
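A count-based estimate of this posterior might look like the sketch below. It assumes, for illustration only, that deletion and insertion are the two possible events at an inconsistent position, so that Bayes' rule gives P(D|XY) = P(XY|D)P(D) / (P(XY|D)P(D) + P(XY|I)P(I)); the patent's exact formula is not reproduced in the text above. The function name and the two-letter context strings are hypothetical.

```python
from collections import Counter


def deletion_posterior(del_contexts, ins_contexts, context):
    """Estimate P(D | XY) from observed flanking-base contexts.

    del_contexts / ins_contexts: lists of two-letter strings "XY" giving the
    bases before and after each observed deletion / insertion.
    Assumption: deletion and insertion are the only two event types.
    """
    n_del, n_ins = len(del_contexts), len(ins_contexts)
    total = n_del + n_ins
    if total == 0:
        raise ValueError("no training events")
    p_d, p_i = n_del / total, n_ins / total                    # priors P(D), P(I)
    p_xy_d = Counter(del_contexts)[context] / n_del if n_del else 0.0
    p_xy_i = Counter(ins_contexts)[context] / n_ins if n_ins else 0.0
    evidence = p_xy_d * p_d + p_xy_i * p_i
    # With no evidence for this context, fall back to an uninformative 0.5.
    return p_xy_d * p_d / evidence if evidence else 0.5


dels = ["AC", "AC", "AC", "GT"]
ins = ["AC", "GT", "GT", "GT"]
p = deletion_posterior(dels, ins, "AC")
print(p, "-> delete base" if p > 0.5 else "-> keep base")
```

The final comparison mirrors the 50% decision rule stated in the text: the measured base is removed only when the estimated deletion probability exceeds 0.5.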
  • The chip with the hybridized library obtained in Example 1 was placed in a sequencer for sequencing.
  • The sequencing steps are as follows; a schematic diagram of the sequencing process is shown in Figure 9:
  • The sequencing platform was used for 80 cycles of sequencing. Four nucleotides carrying two different fluorescent signals were used, and in each round of reaction two nucleotides labeled with different fluorescent signals were added for signal detection.
  • The steps for partial extension of the complementary strand of the initial template include:
  • Extension reagent 2 (440 μL) is passed into the channel in which read1 was sequenced at a speed of 1250 μL/min and reacted for 2 minutes.
  • The components of extension reagent 2 are: 50 mmol/L Tris, 50 mmol/L sodium chloride, 1 mmol/L ethylenediaminetetraacetic acid, 3 mmol/L magnesium sulfate, 60 mmol/L ammonium sulfate, 0.05% Tween 20, 5% dimethyl sulfoxide, 0.02 mg/ml 9°N DNA polymerase (NEB, product number M0260), 5 μmol/L Cold-dNTPs (end-blocked nucleotides; a mixture of Cold-dATP, Cold-dTTP, Cold-dCTP and Cold-dGTP at 5 μmol/L each), pH 9.0.
  • Pump 400 μL of excision reagent 1 into the sequencing channel.
  • Repeat step 1) to step 7) for 10 to 20 cycles to complete the partial extension of the complementary strand of the initial template.
  • The chip with the hybridized library obtained in Example 1 was placed in a GenoCare single-molecule sequencer for sequencing.
  • The sequencing steps are as follows; a schematic diagram of the sequencing process is shown in FIG. 10.
  • The components of the extension reagent are: 120 U/ml Bst DNA polymerase (NEB, #M0275M), 0.2 mmol/L dNTP (a mixture of dATP, dTTP, dCTP and dGTP at 0.2 μmol/L each), 1 M betaine, 20 mmol/L Tris, 10 mmol/L sodium chloride, 10 mmol/L potassium chloride, 10 mmol/L ammonium sulfate, 3 mmol/L magnesium chloride, 0.1% Triton X-100, pH 8.3;
  • Repeat step 2) and step 3) once to complete the removal of the initial template.
  • Pass 750 μL of blocking reagent 2 through the channel and react for 15 minutes.
  • The components of blocking reagent 2 are: 100 U/ml Terminal Transferase (NEB, M0315L), 1× Terminal Transferase Buffer, 0.25 mmol/L cobalt chloride, 100 μmol/L ddNTP mix (a mixture of ddATP, ddTTP, ddCTP and ddGTP at 100 μmol/L each);
  • The diluted sequencing primer hybridization solution is cleaning solution 3 containing 0.1 μmol/L primer D7S1T-R2P.
  • The components of cleaning solution 3 include: 450 mmol/L sodium chloride and 45 mmol/L sodium citrate;
  • The GenoCare single-molecule sequencing platform is used for 80 cycles of sequencing. Four nucleotides carrying two different fluorescent signals are used, and in each round of reaction two nucleotides labeled with different fluorescent signals are added for signal detection.
  • Repeat step 2) and step 3) once to complete the removal of the initial template.
  • Pump 750 μL of blocking reagent 1 into the sequencing channel and react for 10 minutes.
  • The components of blocking reagent 1 are: 100 U/ml Klenow DNA polymerase large fragment (3′→5′ exo-, NEB, #M0212M), 12.5 μmol/L ddNTP mix (a mixture of ddATP, ddTTP, ddCTP and ddGTP at 12.5 μmol/L each), 5 mmol/L manganese chloride, 20 mmol/L Tris, 10 mmol/L sodium chloride, 10 mmol/L potassium chloride, 10 mmol/L ammonium sulfate, 3 mmol/L magnesium chloride, 0.1% Triton X-100, pH 8.3;
  • The hybridization of the sequencing primers is the same as in step 4 of this embodiment.
  • The partial extension steps include:
  • Extension reagent 2 (440 μL) is passed into the channel in which read1 was sequenced at a speed of 1250 μL/min and reacted for 2 minutes.
  • The components of extension reagent 2 are: 50 mmol/L Tris, 50 mmol/L sodium chloride, 1 mmol/L ethylenediaminetetraacetic acid, 3 mmol/L magnesium sulfate, 60 mmol/L ammonium sulfate, 0.05% Tween 20, 5% dimethyl sulfoxide, 0.02 mg/ml 9°N DNA polymerase (NEB, product number M0260), 5 μmol/L Cold-dNTPs (a mixture of Cold-dATP, Cold-dTTP, Cold-dCTP and Cold-dGTP at 5 μmol/L each), pH 9.0.
  • Pump 400 μL of excision reagent 1 into the sequencing channel.
  • Repeat step 1) to step 7) for 10 to 20 cycles to complete the partial extension of the complementary strand of the initial template.
  • the corresponding sequencing sequence file Fa can be obtained through the BaseCalling algorithm supported by GenoCare.
  • splicing of N overlapping sequencing sequences can be realized.
  • The results of the two sequencing runs are processed to obtain the corresponding sequence files Fa1 and Fa2.
  • The average overlap length can be controlled at 5-10 bp through parameter settings during the experiment, but an overlap is not guaranteed.
  • The most similar local region of the two sequences can be found with a local alignment algorithm (such as Smith-Waterman). If, in the alignment result, the length of the similar region is less than a preset threshold (such as 5 bp) or its error rate is greater than a preset threshold, the splicing result is considered untrustworthy.
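The trust check described above can be illustrated with a small, textbook Smith-Waterman implementation. The scoring values (match 2, mismatch -1, gap -2), the 10% error-rate limit and the function names are assumptions chosen for the sketch; the text only fixes the idea of a minimum length (such as 5 bp) and a maximum error rate.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the two aligned strings of the best local alignment of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best-scoring cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def overlap_is_trusted(a, b, min_len=5, max_error_rate=0.1):
    """Apply the two thresholds to the best local alignment of a and b."""
    aligned_a, aligned_b = smith_waterman(a, b)
    if len(aligned_a) < min_len:
        return False
    errors = sum(x != y for x, y in zip(aligned_a, aligned_b))
    return errors / len(aligned_a) <= max_error_rate


print(smith_waterman("TTTTACGTACGT", "ACGTACGTGGGG"))
print(overlap_is_trusted("TTTTACGTACGT", "ACGTACGTGGGG"))
```

For production-sized reads a library aligner would normally replace this quadratic pure-Python loop; the sketch is only meant to make the two thresholds concrete.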
  • two sequences can be spliced through similar regions.
  • The specific procedure for choosing between inconsistent bases in the similar region is as follows: the two corresponding reads are aligned with each other to obtain a common consensus sequence.
  • The two sequences are aligned using the Smith-Waterman algorithm; the consensus sequence is the locally best-matching sequence obtained by adding, deleting or modifying some of the bases after alignment.
  • Once the consensus sequence is obtained, the inconsistent base positions in it are judged one by one according to the constructed correction model (see the correction model in 2.2.4 for details). The probability of a deletion or an insertion at a position is calculated from the base types before and after that position. If the probability of deletion is greater than 50%, the base measured at this position is considered spurious and is deleted; otherwise, the base at that position is kept.
  • The splicing results obtained in step 2 are combined and output to a single Fa file.
  • The longer of Reads1 and Reads2 is output to the final Fa file.
  • In step 1, if there are multiple overlapping sequences, the read obtained by pairwise splicing is set as Reads1, and the operation of step 2 is repeated to splice it with the next sequence. By iteration, reads with a longer read length are obtained and output to the final Fa file.
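The iterative splicing of several overlapping reads can be sketched as a simple fold over the read list. To stay short, the sketch replaces the Smith-Waterman step with exact suffix/prefix matching, which is a deliberate simplification and not the method described above; falling back to the longer sequence when no trusted overlap is found is likewise an assumption based on the preceding bullet. All names are hypothetical.

```python
def splice(left, right, min_overlap=5):
    """Join two reads on their longest exact suffix/prefix overlap, else None."""
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return left + right[k:]
    return None


def splice_all(reads, min_overlap=5):
    """Iteratively splice reads in sequencing order into one longer read."""
    merged = reads[0]
    for nxt in reads[1:]:
        joined = splice(merged, nxt, min_overlap)
        # Assumed fallback: without a trusted overlap, keep the longer sequence.
        merged = joined if joined is not None else max(merged, nxt, key=len)
    return merged


print(splice_all(["ACGTACGTAA", "CGTAAGGCTT", "GGCTTCATCA"]))
```

Each pass treats the sequence built so far as Reads1 and the next read as Reads2, which is the iteration the last bullet describes.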
  • The chip with the hybridized library in Example 1 was placed in a GenoCare single-molecule sequencer for sequencing.
  • The sequencing steps are as follows; a schematic diagram of the sequencing process is shown in FIG. 11.
  • A two-color single-molecule sequencing platform is used for 80 cycles of sequencing. Four nucleotides carrying two different fluorescent signals are used, and in each round of reaction two nucleotides labeled with different fluorescent signals are added for signal detection.
  • The partial extension steps include:
  • Extension reagent 2 is passed into the channel in which read1 was sequenced and reacted for 2 minutes.
  • The components of extension reagent 2 are: 50 mmol/L Tris, 50 mmol/L sodium chloride, 1 mmol/L EDTA, 3mmol
  • Pump 400 μL of excision reagent 1 into the sequencing channel.
  • Repeat step 1) to step 7) for 10 to 20 cycles to complete the partial extension of the complementary strand of the initial template.
  • Repeat step 2) and step 3) once to complete the removal of the initial template.
  • Pass 750 μL of blocking reagent 2 through the channel and react for 15 minutes.
  • The components of blocking reagent 2 are: 100 U/ml Terminal Transferase (NEB, M0315L), 1× Terminal Transferase Buffer, 0.25 mmol/L cobalt chloride, 100 μmol/L ddNTP mix (a mixture of ddATP, ddTTP, ddCTP and ddGTP at 100 μmol/L each);
  • The diluted sequencing primer hybridization solution is cleaning solution 3 containing 0.1 μmol/L primer D7S1T-R2P; the components of cleaning solution 3 include: 450 mmol/L sodium chloride and 45 mmol/L sodium citrate;
  • sequencing steps are the same as steps 1-3 of this embodiment.
  • step 4.2.1 get N sequencing fragments for paired-end sequencing.
  • Different sequencing fragments for paired-end sequencing at the same position are represented as Reads1,1, Reads1,2, ..., Reads1,N, Reads2,1, Reads2,2, ..., Reads2,N, respectively.
  • step 4.2.3 output the sequence spliced in step 5.2.2 to the final Fa file.


Abstract

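The present disclosure relates to a sequencing data processing method, a device, a computing device and a computer-readable medium. The sequencing data comprise a plurality of read groups, each read group comprising a plurality of reads obtained by performing multiple rounds of sequencing on the same insert. The sequencing data processing method comprises: globally aligning the plurality of reads with a reference genome, so as to determine on the reference genome a plurality of matching regions corresponding to the plurality of reads; and performing a primary screening of the plurality of reads based on a comparison between the actual relative positions of the matching regions and a preset position requirement, so as to obtain spliceable reads and filtered reads, wherein the preset position requirement is determined by the rule of the multiple rounds of sequencing, an actual relative position that satisfies the preset position requirement indicates that the read is a spliceable read, and an actual relative position that does not satisfy the preset position requirement indicates that the read is a filtered read. Reads obtained from multiple rounds of sequencing of the same insert can thus be screened effectively.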

Description

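Sequencing method, sequencing data processing method, device and computer device
Priority information
This application claims priority to and the benefit of patent application 202111209946.5, filed with the China National Intellectual Property Administration on October 18, 2021, which is incorporated herein by reference in its entirety.
Technical field
The present disclosure relates to the field of biotechnology, in particular to the field of sequencing technology, and more particularly to a sequencing method, a sequencing data processing method, a device, a computing device and a computer-readable medium.
Background
DNA sequencing and the gene manipulation that followed have fundamentally changed the life sciences, and the completion of the human genome sequence was a major milestone in this work. The concept of single-molecule sequencing was reportedly proposed in the 1980s. In 2008, HeliScope, the first sequencer from Helicos, came onto the market.
High-throughput sequencers detect the incorporated nucleotides with imaging systems such as total-internal-reflection fluorescence CCD (charge-coupled device, also known as a CCD image sensor) and TIRF (total internal reflection fluorescence), and thereby achieve sequencing. Longer reads are more favorable for sequence assembly and analysis; during sequencing, however, long reads are difficult to achieve because of factors such as the accumulation of the residue (scar) left on the base side chain after the fluorescent dye is cleaved.
Therefore, existing sequencing technologies and the corresponding means of sequencing data analysis still need to be improved.
Summary of the invention
The present disclosure aims to solve, at least to some extent, one of the technical problems in the related art.
To this end, one aspect of the present disclosure provides a sequencing method. According to an embodiment of the present disclosure, the sequencing method comprises:
providing a nucleic acid template, the nucleic acid template being directly or indirectly attached to the surface of a solid support;
performing a sequencing-by-synthesis reaction with a first nucleotide to determine a part of the nucleic acid template and obtain a read, the first nucleotide being a reversible terminator carrying a detectable label;
performing a polymerization reaction with a second nucleotide to synthesize a part of the nucleic acid template and obtain a synthesized fragment of preset length, the second nucleotide being a reversible terminator without a detectable label, wherein the read and the synthesized fragment correspond to contiguous parts of the nucleic acid template that may or may not overlap.
The present disclosure was made by the inventors in view of the fact that the limited read length of a sequencing platform, especially a short read length (such as a sequencing length of 15 to 50 bp), is unfavorable for sequence assembly and analysis, and that, for a given amount of template, increasing the amount of sequencing can improve the accuracy of sequencing analysis.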
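According to an embodiment of the present disclosure, the read is not shorter than the synthesized fragment.
According to an embodiment of the present disclosure, the length of the synthesized fragment is greater than or equal to 1 bp.
According to an embodiment of the present disclosure, the length of the synthesized fragment is greater than or equal to 10 bp.
According to an embodiment of the present disclosure, the length of the synthesized fragment is greater than or equal to 10 bp and less than or equal to 20 bp.
According to an embodiment of the present disclosure, the length of the nucleic acid template is less than or equal to 600 bp.
According to an embodiment of the present disclosure, the length of the nucleic acid template is greater than or equal to 75 bp and less than or equal to 400 bp.
According to an embodiment of the present disclosure, the 3'-OH of the sugar of the first nucleotide and/or the second nucleotide is reversibly blocked.
According to an embodiment of the present disclosure, the 3'-OH of the sugar of the first nucleotide and/or the second nucleotide is in its natural state, and a cleavable blocking group is attached to the base of the first nucleotide and/or the second nucleotide.
According to an embodiment of the present disclosure, the detectable label is a fluorescent molecule.
According to an embodiment of the present disclosure, the sequencing-by-synthesis reaction and/or the polymerization reaction is carried out under the action of a DNA polymerase selected from at least one of Klenow fragment, Bst, 9°N, Pfu, KOD and Vent.
According to an embodiment of the present disclosure, the sequencing-by-synthesis reaction and the polymerization reaction are carried out under the action of the same DNA polymerase, the DNA polymerase being a Klenow fragment mutant.
According to an embodiment of the present disclosure, the sequencing-by-synthesis reaction and the polymerization reaction are carried out under the action of the same DNA polymerase, the DNA polymerase being a 9°N mutant.
According to an embodiment of the present disclosure, the read is a first read, and the method comprises:
i) hybridizing the nucleic acid template to a first primer, at least a part of the first primer being complementary to the 3' end of the nucleic acid template, and the first primer being covalently attached to the surface of the solid support;
ii) performing the sequencing-by-synthesis reaction with the first nucleotide, including extending the first primer to synthesize the complementary strand of the nucleic acid template so as to determine a first part of the nucleic acid template and obtain the first read, the complementary strand of the nucleic acid template being defined as a first template;
iii) performing the polymerization reaction with the second nucleotide, including further extending the first template to obtain the synthesized fragment; and
iv) performing the sequencing-by-synthesis reaction with the first nucleotide, including further extending the first template so as to determine a second part of the nucleic acid template and obtain a second read,
wherein the first read, the synthesized fragment and the second read correspond to three non-overlapping contiguous parts of the nucleic acid template.
According to an embodiment of the present disclosure, the read is a first read, and the method comprises:
i) adding a first primer and hybridizing the nucleic acid template to the first primer, at least a part of the first primer being complementary to the 3' end of the nucleic acid template, and the nucleic acid template being covalently attached to the surface of the solid support;
ii) performing the sequencing-by-synthesis reaction with the first nucleotide, including extending the first primer to synthesize the complementary strand of the nucleic acid template so as to determine a first part of the nucleic acid template and obtain the first read, the complementary strand of the nucleic acid template being defined as a first template;
iii) performing the polymerization reaction with the second nucleotide, including further extending the first template to obtain the synthesized fragment; and
iv) performing the sequencing-by-synthesis reaction with the first nucleotide, including further extending the first template so as to determine a second part of the nucleic acid template and obtain a second read,
wherein the first read, the synthesized fragment and the second read correspond to three non-overlapping contiguous parts of the nucleic acid template.
According to an embodiment of the present disclosure, the synthesized fragment is a first synthesized fragment, and the method further comprises:
v) removing the nucleic acid template;
vi) adding a second primer and binding it to the first template, and performing the polymerization reaction with the second nucleotide, including extending the second primer to synthesize the complementary strand of the first template and obtain a second synthesized fragment of preset length, at least a part of the second primer being complementary to the 3' end of the first template, the complementary strand of the first template being defined as a second template; and
vii) performing the sequencing-by-synthesis reaction with the first nucleotide, including further extending the second template so as to determine a third part of the nucleic acid template and obtain a third read,
wherein the second synthesized fragment and the third read correspond to two contiguous parts of the nucleic acid template.
According to an embodiment of the present disclosure, the method further comprises repeating iii) and iv) at least once.
According to an embodiment of the present disclosure, the method further comprises repeating vi) and vii) at least once.
According to an embodiment of the present disclosure, the length relationship among the first read, the first synthesized fragment, the second read, the second synthesized fragment and the third read is such that the nucleotide at any position in the non-terminal part of the nucleic acid template is determined at least once.
According to an embodiment of the present disclosure, the method further comprises, after iv) and before v), blocking at least a part of the nucleic acid molecules on the surface of the solid support.
According to an embodiment of the present disclosure, the method further comprises, after v) and before vi), blocking at least a part of the nucleic acid molecules on the surface of the solid support.
According to an embodiment of the present disclosure, the blocking is achieved by attaching an extension-reaction blocker to the first template under the action of a DNA polymerase, the extension-reaction blocker being selected from at least one of ddNTPs and derivatives thereof.
According to an embodiment of the present disclosure, the blocking is achieved by attaching an extension-reaction blocker to the first template under the action of a terminal transferase, the extension-reaction blocker being selected from at least one of ddNTPs and derivatives thereof.
According to an embodiment of the present disclosure, the read is a first read, the synthesized fragment is a first synthesized fragment, and the method comprises:
i) adding a first primer and hybridizing the nucleic acid template to the first primer, at least a part of the first primer being complementary to the 3' end of the nucleic acid template, and the nucleic acid template being covalently attached to the surface of the solid support;
ii) performing the sequencing-by-synthesis reaction with the first nucleotide, including extending the first primer to synthesize the complementary strand of the nucleic acid template so as to determine a first part of the nucleic acid template and obtain the first read, the complementary strand of the nucleic acid template being defined as a first template;
iii) removing the first template;
iv) adding the first primer and binding it to the nucleic acid template, and performing the polymerization reaction with the second nucleotide, including extending the first primer to synthesize the complementary strand of the nucleic acid template and obtain the first synthesized fragment, the first synthesized fragment being no longer than the first read, the complementary strand of the nucleic acid template being defined as a first template; and
v) performing the sequencing-by-synthesis reaction with the first nucleotide, including further extending the first template so as to determine a second part of the nucleic acid template and obtain a second read.
According to an embodiment of the present disclosure, the method further comprises repeating iii) to v) at least once, such that the first synthesized fragment in each repetition is no shorter than the first synthesized fragment in the previous repetition and no longer than the sum of the lengths of the first synthesized fragment and the second read in the previous repetition.
According to an embodiment of the present disclosure, the read is a first read, the synthesized fragment is a first synthesized fragment, and the method comprises:
i) adding a first primer and hybridizing the nucleic acid template to the first primer, at least a part of the first primer being complementary to the 3' end of the nucleic acid template, and the nucleic acid template being covalently attached to the surface of the solid support;
ii) performing the polymerization reaction with the second nucleotide, including extending the first primer to synthesize the complementary strand of the nucleic acid template and obtain the first synthesized fragment, the complementary strand of the nucleic acid template being defined as a first template;
iii) performing the sequencing-by-synthesis reaction with the first nucleotide, including further extending the first template so as to determine a first part of the nucleic acid template and obtain the first read;
iv) removing the first template; and
v) adding the first primer and binding it to the nucleic acid template, and performing the sequencing-by-synthesis reaction with the first nucleotide, including extending the first primer to synthesize the complementary strand of the nucleic acid template so as to determine a second part of the nucleic acid template and obtain a second read, the second read being no shorter than the first synthesized fragment.
According to an embodiment of the present disclosure, the read is a first read, and the method comprises:
i) hybridizing the nucleic acid template to a first primer, at least a part of the first primer being complementary to the 3' end of the nucleic acid template, and the first primer being covalently attached to the surface of the solid support;
ii) performing the sequencing-by-synthesis reaction with the first nucleotide, including extending the first primer to synthesize the complementary strand of the nucleic acid template so as to determine a first part of the nucleic acid template and obtain the first read, the complementary strand of the nucleic acid template being defined as a first template;
iii) performing the polymerization reaction with the second nucleotide, including further extending the first template to obtain the synthesized fragment;
iv) removing the nucleic acid template;
v) adding a second primer and binding it to the first template, and performing the sequencing-by-synthesis reaction with the first nucleotide, including extending the second primer to synthesize the complementary strand of the first template so as to determine a second part of the nucleic acid template and obtain a second read, at least a part of the second primer being complementary to the 3' end of the first template.
According to an embodiment of the present disclosure, the nucleic acid template is obtained by hybridizing a single-stranded nucleic acid molecule to a probe and extending the probe by a polymerization reaction, the probe being covalently attached to the surface of the solid support and the 3' end of the single-stranded nucleic acid molecule being complementary to the probe.
According to an embodiment of the present disclosure, the method further comprises, after ii) and before iii), blocking at least a part of the nucleic acid molecules on the surface of the solid support.
According to an embodiment of the present disclosure, the method further comprises, after iii) and before iv), blocking at least a part of the nucleic acid molecules on the surface of the solid support.
According to an embodiment of the present disclosure, the blocking is achieved by attaching an extension-reaction blocker to the first template under the action of a DNA polymerase, the extension-reaction blocker being selected from at least one of ddNTPs and derivatives thereof.
According to an embodiment of the present disclosure, the method further comprises, after iii) and before iv), blocking at least a part of the nucleic acid molecules on the surface of the solid support.
According to an embodiment of the present disclosure, the method further comprises, after iv) and before v), blocking at least a part of the nucleic acid molecules on the surface of the solid support.
According to an embodiment of the present disclosure, the blocking is achieved by attaching an extension-reaction blocker to the first template under the action of a terminal transferase, the extension-reaction blocker being selected from at least one of ddNTPs and derivatives thereof.
According to an embodiment of the present disclosure, the nucleic acid template is removed by adding a denaturing reagent to dissociate the nucleic acid template from the first template.
According to an embodiment of the present disclosure, the first template is removed by adding a denaturing reagent to dissociate the first template from the nucleic acid template.
According to an embodiment of the present disclosure, the denaturing reagent comprises formamide.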
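Another aspect of the present disclosure provides a sequencing data processing method. According to an embodiment of the present disclosure, the sequencing data comprise a plurality of read groups, each read group comprises a plurality of reads, and the plurality of reads are obtained by performing multiple rounds of sequencing on the same insert; the method comprises performing the following processing on the plurality of reads of each read group:
globally aligning the plurality of reads with a reference genome, so as to determine on the reference genome a plurality of matching regions corresponding to the plurality of reads; and
performing a primary screening of the plurality of reads based on a comparison between the actual relative positions of the plurality of matching regions and a preset position requirement, so as to obtain spliceable reads and filtered reads,
wherein
the preset position requirement is determined by the rule of the multiple rounds of sequencing,
an actual relative position that satisfies the preset position requirement indicates that the read is a spliceable read; and
an actual relative position that does not satisfy the preset position requirement indicates that the read is a filtered read.
According to an embodiment of the present disclosure, the sequencing data processing method further comprises:
performing a secondary screening on the filtered reads, the secondary screening comprising:
taking at least one read of the read group as a preliminary read, and determining a secondary alignment region on the reference genome based on the matching region corresponding to the preliminary read and the preset position requirement; and
locally aligning each of the filtered reads with the secondary alignment region, and classifying the reads that satisfy a predetermined threshold, together with the preliminary read, as spliceable reads.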
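The secondary screening can be pictured as "predict where the next read should land, then look only there". The sketch below assumes a Jumping-style rule (same strand, next read starts within `max_gap` bases after the preliminary read) and, to stay short, replaces the local alignment with an ungapped mismatch scan; both choices, and all names, are illustrative assumptions rather than the patent's implementation.

```python
def secondary_region(prelim_end, read_len, max_gap, ref_len):
    """Window on the reference where the next read is expected (half-open)."""
    start = min(prelim_end, ref_len)
    end = min(ref_len, prelim_end + max_gap + read_len)
    return start, end


def rescue(read, reference, region, max_error_rate=0.1):
    """Return True if the read fits somewhere in the region with few mismatches.

    Ungapped scan used as a stand-in for a true local alignment.
    """
    if not read:
        return False
    lo, hi = region
    window = reference[lo:hi]
    best_errors = None
    for off in range(len(window) - len(read) + 1):
        errors = sum(a != b for a, b in zip(read, window[off:off + len(read)]))
        if best_errors is None or errors < best_errors:
            best_errors = errors
    return best_errors is not None and best_errors / len(read) <= max_error_rate


reference = "ACGTTGCAAGGCTTAACCGGTTAGCATCGA"
region = secondary_region(prelim_end=8, read_len=8, max_gap=5, ref_len=len(reference))
print(region, rescue("CTTAACCG", reference, region))
```

A read that was filtered after global alignment but passes `rescue` would be reclassified, together with the preliminary read, as spliceable.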
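According to an embodiment of the present disclosure, each read of the read group is used as a preliminary read for the secondary screening.
According to an embodiment of the present disclosure, the sequencing data processing method further comprises:
splicing the spliceable reads according to the rule of the multiple rounds of sequencing.
According to an embodiment of the present disclosure, the rule of the multiple rounds of sequencing includes at least one selected from: paired-end sequencing, Jumping sequencing, Overlap sequencing, paired-end Jumping sequencing, and combinations of these sequencing rules.
According to an embodiment of the present disclosure, the rule of the multiple rounds of sequencing is paired-end sequencing, the read group comprises two reads, and the preset position requirement comprises:
the matching regions of the two reads are located on the forward strand and the reverse strand of the reference genome, respectively; and
the distance on the reference genome between the matching regions of the two reads does not exceed a predetermined threshold,
wherein the predetermined threshold is determined based on the length of the insert.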
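As a rough sketch of the paired-end position requirement, the check below tests for opposite strands and for an outer span no larger than a threshold. Measuring the "distance" as the outer span of the two matching regions, the `Hit` record and the threshold value are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    strand: str  # "+" for the forward strand, "-" for the reverse strand
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive


def paired_end_ok(h1: Hit, h2: Hit, max_insert: int) -> bool:
    """Opposite strands, and outer span no larger than the insert-based threshold."""
    if h1.strand == h2.strand:
        return False
    span = max(h1.end, h2.end) - min(h1.start, h2.start)
    return span <= max_insert


print(paired_end_ok(Hit("+", 100, 150), Hit("-", 300, 350), max_insert=400))
```

Pairs that fail this check would go to the filtered set and become candidates for the secondary screening.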
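According to an embodiment of the present disclosure, the rule of the multiple rounds of sequencing is Jumping sequencing, and the preset position requirement comprises:
the matching regions of the plurality of reads are located on the same strand of the reference genome; and
among the matching regions of the plurality of reads, the distance on the reference genome between two adjacent reads does not exceed a predetermined distance threshold,
wherein the predetermined distance threshold is determined based on the length of the partial extension step; optionally, the predetermined distance threshold does not exceed 50 bp, preferably does not exceed 20 bp, and more preferably is between 5 and 20 bp.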
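The Jumping requirement can be checked in the same spirit. The tuple layout `(strand, start, end)` with half-open coordinates and the default `max_gap=20` are illustrative assumptions; the gap is measured from the end of one matching region to the start of the next.

```python
def jumping_ok(hits, max_gap=20):
    """hits: list of (strand, start, end) tuples, 0-based half-open coordinates.

    True if all hits lie on one strand and each gap between neighbouring
    matching regions is at most max_gap (overlapping neighbours also pass).
    """
    if len({strand for strand, _, _ in hits}) != 1:
        return False
    ordered = sorted(hits, key=lambda h: h[1])
    return all(nxt[1] - cur[2] <= max_gap for cur, nxt in zip(ordered, ordered[1:]))


print(jumping_ok([("+", 0, 50), ("+", 60, 110), ("+", 125, 175)]))
```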
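According to an embodiment of the present disclosure, the rule of the multiple rounds of sequencing is Overlap sequencing, and the preset position requirement comprises:
the matching regions of the plurality of reads are located on the same strand of the reference genome; and
among the matching regions of the plurality of reads, the length of the overlapping region on the reference genome between two adjacent reads lies within a predetermined distance range,
wherein the predetermined distance range is determined based on the length of the overlapping region in the sequencing process,
optionally, the predetermined distance range is between 5 and 10 bp.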
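For Overlap sequencing the test flips from "gap not too large" to "overlap within a range". The sketch uses the optional 5 to 10 bp range from the text as defaults; the tuple layout and function name are assumptions.

```python
def overlap_ok(hits, min_overlap=5, max_overlap=10):
    """hits: list of (strand, start, end) tuples, 0-based half-open coordinates.

    True if all hits share a strand and each neighbouring pair of matching
    regions overlaps by min_overlap..max_overlap bases on the reference.
    """
    if len({strand for strand, _, _ in hits}) != 1:
        return False
    ordered = sorted(hits, key=lambda h: h[1])
    return all(
        min_overlap <= cur[2] - nxt[1] <= max_overlap
        for cur, nxt in zip(ordered, ordered[1:])
    )


print(overlap_ok([("+", 0, 50), ("+", 43, 93)]))
```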
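According to an embodiment of the present disclosure, the rule of the multiple rounds of sequencing is paired-end Jumping sequencing, and the preset position requirement comprises:
one part of the matching regions of the plurality of reads is located on the forward strand of the reference genome and the other part is located on the reverse strand of the reference genome; and
among the matching regions of the plurality of reads, the length of the overlapping region on the reference genome between two adjacent reads lies within a predetermined distance range,
wherein the predetermined distance range is determined based on the length of the partial extension step in the sequencing process,
optionally, the predetermined distance threshold does not exceed 50 bp, preferably does not exceed 20 bp, and more preferably is between 5 and 20 bp.
According to an embodiment of the present disclosure, the Jumping sequencing comprises:
providing a nucleic acid template, the nucleic acid template being directly or indirectly attached to the surface of a solid support;
carrying out multiple rounds of extension reactions on the nucleic acid template with a first nucleotide and a second nucleotide,
wherein
the first nucleotide is a reversible terminator carrying a detectable label and is used to obtain a plurality of reads through the extension reactions;
the second nucleotide is a reversible terminator without a detectable label and is used to obtain at least one synthesized fragment of preset length through the extension reactions.
According to an embodiment of the present disclosure, the Overlap sequencing comprises:
the nucleic acid template being directly or indirectly attached to the surface of a solid support;
carrying out multiple rounds of extension reactions on the nucleic acid template with a first sequencing adapter and a second sequencing adapter, so as to obtain a plurality of reads,
wherein
the first read produced from the first sequencing adapter and the second read produced from the second sequencing adapter have an overlapping region of at least one base,
optionally,
the first sequencing adapter is extended with the first nucleotide to obtain the first read;
the second sequencing adapter is first extended with the second nucleotide and then subjected to multiple extension reactions with the first nucleotide to obtain the second read.
According to an embodiment of the present disclosure, the paired-end Jumping sequencing comprises:
hybridizing the nucleic acid template to a first primer, at least a part of the first primer being complementary to the 3' end of the nucleic acid template, and the first primer being covalently attached to the surface of the solid support;
carrying out multiple rounds of extension reactions on the nucleic acid template from the first primer with the first nucleotide and the second nucleotide, and obtaining a first-primer extension strand;
removing the nucleic acid template, and hybridizing a second primer to the first-primer extension strand;
carrying out multiple rounds of extension reactions on the first-primer extension strand from the second primer with the first nucleotide and the second nucleotide;
wherein
the first nucleotide is a reversible terminator carrying a detectable label and is used to obtain a plurality of reads through the extension reactions;
the second nucleotide is a reversible terminator without a detectable label and is used to obtain at least one synthesized fragment of preset length through the extension reactions.
Another aspect of the present disclosure provides a sequencing data processing device. According to an embodiment of the present disclosure, the sequencing data comprise a plurality of reads obtained by performing multiple rounds of sequencing on the same insert, and the device comprises the following modules for processing the plurality of reads of each read group:
a global alignment module, configured to globally align the plurality of reads with a reference genome, so as to determine on the reference genome a plurality of matching regions corresponding to the plurality of reads; and
a primary screening module, configured to perform a primary screening of the plurality of reads based on a comparison between the actual relative positions of the plurality of matching regions and a preset position requirement, so as to obtain spliceable reads and filtered reads,
wherein
the preset position requirement is determined by the rule of the multiple rounds of sequencing,
an actual relative position that satisfies the preset position requirement indicates that the read is a spliceable read; and
an actual relative position that does not satisfy the preset position requirement indicates that the read is a filtered read.
According to an embodiment of the present disclosure, the sequencing data processing device further comprises a secondary screening module, configured to perform a secondary screening on the filtered reads, the secondary screening comprising:
taking at least one read of the read group as a preliminary read, and determining a secondary alignment region on the reference genome based on the matching region corresponding to the preliminary read and the preset position requirement; and
locally aligning each of the filtered reads with the secondary alignment region, and classifying the reads that satisfy a predetermined threshold, together with the preliminary read, as spliceable reads.
Another aspect of the present disclosure provides a computing device. According to an embodiment of the present disclosure, the computing device comprises a processor and a memory;
the memory is configured to store a computer program;
the processor is configured to execute the computer program to implement the sequencing data processing method described above.
Yet another aspect of the present disclosure provides a computer-readable storage medium. According to an embodiment of the present disclosure, the computer-readable storage medium comprises computer instructions which, when executed by a computer, cause the computer to implement the sequencing data processing method described above.
Additional aspects and advantages of the present disclosure will be set forth in part in the following description, and in part will become apparent from the following description or be learned through practice of the present disclosure.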
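Brief description of the drawings
The above and/or additional aspects and advantages of the present disclosure will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a schematic flowchart of a sequencing data processing method according to an embodiment of the present disclosure;
FIG. 2 is a schematic flowchart of a sequencing data processing method according to another embodiment of the present disclosure;
FIG. 3 is a schematic flowchart of the secondary screening according to another embodiment of the present disclosure;
FIG. 4 is a schematic flowchart of a sequencing data processing method according to another embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of a sequencing data processing device according to an embodiment of the present disclosure;
FIG. 6 is a schematic structural diagram of a sequencing data processing device according to an embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of a sequencing data processing device according to an embodiment of the present disclosure;
FIG. 8 is a schematic flowchart of paired-end sequencing according to an embodiment of the present disclosure;
FIG. 9 is a schematic flowchart of Jumping sequencing according to an embodiment of the present disclosure;
FIG. 10 is a schematic flowchart of Overlap sequencing according to an embodiment of the present disclosure;
FIG. 11 is a schematic flowchart of paired-end Jumping sequencing according to an embodiment of the present disclosure.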
发明详细描述
下面详细描述本公开的实施例。下面描述的实施例是示例性的,仅用于解释本公开,而不能理解为对本公开的限制。
此外,术语“第一”、“第二”仅用于描述目的,而不能理解为指示或暗示相对重要性或者隐含指明所指示的技术特征的数量。由此,限定有“第一”、“第二”的特征可以明示或者隐含地包括至少一个该特征。在本公开的描述中,“多个”的含义是至少两个,例如两个,三个等,除非另有明确具体的限定。
在本公开中,除非另有明确的规定和限定,术语“连接”、“固定”等术语应做广义理解,例如,可以是固定连接,也可以是可逆连接,可以是直接相连,也可以通过中间媒介间接相连,等,除非另有明确的限定。对于本领域的普通技术人员而言,可以根据具体情况理解上述术语在本申请中的具体含义。
本公开中,术语“核酸模板”是指待测的核酸分子,表示一定长度的核苷酸的聚合物,核苷酸可以包括核糖核苷酸、脱氧核糖核苷酸、核糖核苷酸或脱氧核糖核苷酸的类似物或衍生物的一种或多种组成;包括单链或双链核酸分子。
在本公开中,术语“测序”又可称为“核酸测序”或“基因测序”,指核酸序列中碱基排列顺序的测定;包括双末端测序、单末端测序和/或配对末端测序等,所称的双末端测序或者配对末端测序可以指同一核酸分子的不完全重叠的任意两段或两个部分的读出;所称的测序包括使核苷酸(包括核苷酸类似物)结合到模板并采集相应的反应信号的过程。
在本公开中,“可逆终止子”指的是带有可逆修饰的4种天然核苷酸(dATP、dCTP、dGTP、dTTP)或其衍生物。天然核苷酸的衍生物指的是核苷酸的原子或原子团被其他原子或原子团取代所形成的化合物,天然核苷酸的衍生物可在聚合酶或者末端转移酶的作用下掺入到核酸链的3’端。3’端被可逆修饰的核苷酸的3’端去修饰后可继续与核苷酸进行磷酯反应,修饰基团可选择为含有叠氮基团的烷基,等。一旦将3’端被可逆修饰的核苷酸掺入到扩增链中,没有游离的3’羟基来进一步的序列延伸,因此聚合酶无法再添加另外的核苷酸。每进行一轮反应,扩增链只能添加一个核苷酸,当除去3’封闭才可以添加下一个核苷酸到扩增链中。
在本公开中,“核苷酸”指的4种天然核苷酸(dATP、dCTP、dGTP、dTTP)或其衍生物,除非另有明确的限定。
在本公开中,术语“核苷酸的糖”是指核糖或脱氧核糖。核糖的化学式为C 5H 10O 5,核糖有L-核糖和D-核糖两种构型,L-核糖的化学结构如下所示,L-核糖的3'位标示如下:
Figure PCTCN2022125967-appb-000001
D-核糖的化学结构如下所示,D-核糖的3'位标示如下:
Figure PCTCN2022125967-appb-000002
术语“脱氧核糖”又称为D-脱氧核糖、2-脱氧-D-核糖、胸腺糖,其化学式为C 4H 9O 3CHO(C 5H 10O 4),其化学结构如下所示,脱氧核糖的3'位标示如下:
Figure PCTCN2022125967-appb-000003
在本公开中,术语“碱基”,又称核碱基、含氮碱基,包括天然碱基、非天然碱基和碱基类似物。其中,天然碱基包括腺嘌呤(A)、鸟嘌呤(G)、胞嘧啶(C)、胸腺嘧啶(T)、尿嘧啶(U);非天然碱基包括诸如锁定核酸(LNA)和桥接核酸(BNA);碱基类似物包括诸如次黄嘌呤、脱氮腺嘌呤、脱氮鸟嘌呤、脱氮次黄嘌呤、7-甲基鸟嘌呤、5,6-二氢尿嘧啶、5-甲基胞嘧啶、5-羟甲基胞嘧啶。本公开中,由于核苷酸类型通过碱基类型来确定,因此,本公开中可以采用碱基类型来表示核苷酸类型。
在本公开中,术语“引物”是指:可以与感兴趣的靶序列杂交的寡聚核苷酸或核酸分子;引物是单链寡核苷酸或多核苷酸。
在本公开中,术语“可检测标记”是指能够在合适的条件下产生能够被检测到的信号的标记物或基团。
在本公开中,术语“接头”指的是含有已知序列的核苷酸序列,可为单链核酸或双链核酸。接头可用作引物,也可用于连接在核酸片段的一端或两端。
在本公开中,术语“Jumping测序”是指一种测序方法。该测序方法包括:提供核酸模板,核酸模板直接或者间接连接在固相载体的表面;采用第一核苷酸和第二核苷酸,与核酸模板发生多轮延伸反应,其中,第一核苷酸为带有可检测标记的可逆终止子,并且用于通过延伸反应获得多个读段;第二核苷酸为不带有可检测标记的可逆终止子,并且用于通过延伸反应获得至少一个预设长度的合成片段。
在本公开中,术语“Overlap测序”是指一种测序方法。该测序方法包括:核酸模板直接或者间接连接在固相载体的表面;采用第一测序接头和第二测序接头与核酸模板发生多轮延伸反应,以便获得多个读段,其中,第一测序接头产生的第一读段与第二测序接头产生的第二读段存在至少一个碱基的重叠区域,可选的,第一测序接头采用第一核苷酸进行延伸反应,以便获得第一读段;第二测序接头产生首先采用第二核苷酸进行延伸反应,之后采用第一核苷酸进行多个延伸反应,以便获得第二读段。
根据本公开的一些具体的实施方案,本公开提出一种测序方法,包括:
(11)提供固相载体表面,固相载体表面连接有核酸模板和第一引物形成的核酸复合体,第一引物的至少一部分被配置为与核酸模板的3'端的至少一部分杂交,核酸模板连接在固相载体表面或者第一测序引物连接在固相载体表面。
在步骤(11)中,第一引物和核酸模板互补,形成核酸复合体,核酸复合体连接在固相载体表面,以实现核酸模板在固相载体表面的固定。
在一种可能的实施方式中,核酸复合体中的核酸模板连接在固相载体表面。此时,核酸模板连接在固相载体表面并不是指核酸模板通过第一引物连接在固相载体表面,而是核酸模板通过与固相载体表面的分子/基团共价键连接,从而实现核酸模板在固相载体表面的连接。
在一些实施方案中,步骤(11)可以通过下述方法实现:核酸模板共价连接在固相载体的表面,加入第一引物并使核酸模板与第一引物杂交,第一引物的至少一部分与所述核酸模板的3'端互补。
在另一种可能的实施方式中,核酸复合体中的第一引物连接在固相载体表面。即第一引物通过共价键连接在固相载体表面,核酸模板通过第一测序引物连接于固相载体表面。此时,核酸模板不与固相载体表面直接连接,而是通过与第一引物互补连接,间接连接在固相载体表面。在一个实施方案中,第一引物与固相载体表面的分子或基团通过共价键连接,从而实现第一引物在固相载体表面的连接。
在一些实施方案中,步骤(11)可以通过下述方法实现:第一引物共价连接在所述固相载体的表面,使核酸模板与第一引物杂交,第一引物的至少一部分与所述核酸模板的3'端互补。
在一些实施方案中,核酸模板的长度小于或等于600bp。在一个实施方案中,核酸模板大于或等于75bp且小于或等于400bp。示例性的,核酸模板为75~80bp、80~90bp、90~100bp、100~120bp、120~150bp、150~180bp、180~200bp、200~220bp、220~250bp、250~280bp、280~300bp、300~320bp、320~350bp、350~380bp、380~400bp等情形。
(21)使用第一核苷酸,在适于进行聚合反应的条件下,以核酸模板为模板,以第一引物为引物进行延伸反应,获得第一延伸片段,第一延伸片段的长度小于核酸模板的长度。
在步骤(21)中,第一核苷酸是不带有可检测标记的可逆终止子。在一个实施方案中,步骤(21)中加入的第一核苷酸为4种不带有可检测标记的可逆终止子。利用此种核苷酸,一方面可通过可逆终止子中的阻断基团有效控制第一延伸片段的长度,又可以第一核苷酸中没有引入荧光染料基团,从而可以有效避免荧光染料切除后残留在碱基上的基团对延伸反应的影响。
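In the reaction of step (21), the conditions suitable for the polymerization reaction include a DNA polymerase; that is, the polymerization reaction is carried out under the action of a DNA polymerase. Any enzyme capable of DNA amplification may be used as the DNA polymerase, such as at least one of Taq, Klenow fragment, Bst, 9°N, Pfu, KOD and Vent.
In some embodiments, the length of the first extension fragment is not shorter than the length of the synthesized fragment. In some embodiments, the length of the first extension fragment is greater than or equal to 1 bp. In some embodiments, the length of the first extension fragment is greater than or equal to 10 bp. In some embodiments, the length of the first extension fragment is greater than or equal to 10 bp and less than or equal to 20 bp. For example, the first extension fragment is 10~12 bp, 12~14 bp, 14~16 bp, 16~18 bp or 18~20 bp long.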
(31)使用第二核苷酸,在适于边合成边测序反应或者边连接边测序反应的条件下,以核酸模板为模板,以第一延伸片段为引物进行延伸循环来进行第一测序,形成第一新生测序链。
在步骤(31)中,第二核苷酸为带有可检测标记的可逆终止子。可逆终止子含有能够阻挡核苷酸的糖的3'位点发生反应的阻断基团,由此可以使得边合成边测序反应或者边连接边测序反应,只在核酸模板的互补链上引入一个第二核苷酸。
本公开实施例提供的可逆终止子,在核苷酸中引入阻断基团,以消除核苷酸的糖的3'位点的反应活性。上述第一封闭处理可采用不同的方法进行。
在一些实施方案中,可检测标记为荧光标记。根据本公开的实施方案,参与延伸反应的每种第一核苷酸可以携带不同的荧光标记,或者参与延伸反应的四种第一核苷酸中至少两种第一核苷酸携带不同的荧光标记。示例性的,四种第一核苷酸各自携带四种不同的荧光标记;四种第一核苷酸带三种荧光标记,其中,第一种和第三种核苷酸带不同的荧光基团,第四种核苷酸携带的荧光基团与前三种第一核苷酸中的一种携带的荧光基团相同,或第四种核苷酸不携带荧光基团,应当理解的是,第四种第一核苷酸的类型没有限制。示例性的,四种第一核苷酸携带两种荧光标记,如两种第一核苷酸携带一种相同的荧光标记,另两种第一核苷酸携带另一种相同的荧光标记。示例性的,四种核苷酸带一种荧光标记。
然而,可检测标记不一定为荧光标记。允许检测DNA序列中所掺入的核苷酸的种类的任何可检测标记都可以。
在步骤(31)反应中,适于进行边合成边测序反应或者边连接边测序反应的条件中包含DNA聚合酶,即:在DNA聚合酶的作用下进行边合成边测序反应或者边连接边测序反应。DNA聚合酶可选用任何可以进行DNA扩增的酶,如Taq酶、Klenow片段、Bst、9°N、Pfu、KOD和Vent中的至少一种。
在一个实施方案中,在相同DNA聚合酶的作用下进行步骤(21)的聚合反应和步骤(31)的合成边测序反应或者边连接边测序反应,其中,DNA聚合酶为Klenow片段突变体。
在一个实施方案中,在相同DNA聚合酶的作用下进行步骤(21)的聚合反应和步骤(31)的合成边测序反应或者边连接边测序反应,其中,DNA聚合酶为9°N突变体。
通过步骤(31)可获得第一测序数据。
应当理解的是,根据本公开的实施方案,使用第一核苷酸,在适于进行聚合反应的条件下,以核酸模板为模板,以第一引物为引物进行延伸反应,获得第一延伸片段的步骤(21)和使用第二核苷酸,在适于边合成边测序反应或者边连接边测序反应的条件下,以核酸模板为模板,以第一延伸片段为引物进行延伸循环来进行第一测序,形成第一新生测序链的步骤(31)的顺序可以调换。即可以先进行合成测序反应,以测定所述核酸模板的一部分,再利用第二核苷酸进行聚合反应,以合成所述核酸模板的一部分,获得预设长度的合成片段;也可以先利用第二核苷酸进行聚合反应,以合成所述核酸模板的一部分,获得预设长度的合成片段,再进行合成测序反应,以测定所述核酸模板的一部分。
根据本公开另一种具体的实施方案,本公开提出一种测序方法,包括第一测序方法,第一测序方法在上述测序方法的基础上还包括:
(41)去除固相载体表面的第一新生测序链。
(51)使用第二核苷酸,在适于边合成边测序反应或者边连接边测序反应的条件下,以核酸模板为模板,以第一引物为引物进行延伸循环来进行第二测序,形成第二新生测序链,获得第二测序数据。
步骤(51)中,适于边合成边测序反应或者边连接边测序反应的条件参考前文所述,为了节约篇幅,此处不再赘述。
根据本公开的实施方案,第二新生测序链的长度不小于第一延伸片段的长度。此时,第一测序数据和第二测序数据 具有部分重叠数据。利用部分重叠数据进行测序数据分析,更有利于对模板序列的组装分析及测序数据之间的相互校对,提高测序数据分析的准确性。在一些实施方案中,第二新生测序链的长度小于第一新生测序链和第一延伸片段的总长度。
在一个实施方案中,在步骤(51)之前,上述方法还包括:对残余在芯片表面的第一新生测序链的3’末端进行第一封闭处理。对残余的第一新生测序链的3’末端进行封闭能够有效地避免在进行第二测序过程中第一新生测序链继续延伸产生干扰信号。通过降低干扰信号产生的无效数据对信息分析的干扰,可以有效增加有效数据量,从而提高测序数据分析的准确性。
在一个实施方案中,上述第一封闭处理可采用不同的方法进行,如通过去除3’末端羟基和/或通过使3’末端羟基与延伸反应阻断剂相连而进行。其中,延伸反应阻断剂用以阻断3’末端羟基与磷酸基团的反应,延伸反应阻断剂可为烷基、ddNTP或其衍生物,等。在一个实施方案中,延伸反应阻断剂为ddNTP或其衍生物。
在一个实施方案中,上述第一封闭处理采用DNA聚合酶和末端转移酶的至少之一进行。DNA聚合酶以DNA链为模板,在待封闭的核酸链的3’末端添加ddNTP,从而达到使3’末端封闭的效果。末端转移酶可以直接在单链核酸的3’末端添加ddNTP达到3’末端封闭的效果。
在一个实施方案中,上述第一封闭处理通过聚合酶连接ddNTP或其衍生物。
根据本公开另一种具体的实施方案,本公开提出的测序方法包括第二测序方法,以第二种实现方式为例,第二种测序方法在本公开第二种实现方式提出的测序方法的基础上,进一步包括如下技术特征:
在步骤(11)之后步骤(21)之前,包括步骤:
(a)使用第二核苷酸,在适于边合成边测序反应或者边连接边测序反应的条件下,以核酸模板为模板,以第一测序引物为引物进行延伸循环来进行第三测序,形成第三新生测序链,获得第三测序数据;
步骤(a)中,核酸模板与固相载体的连接方式参考前文所述。在一些实施方案中,核酸模板通过共价键连接在固相载体表面。
在一些实施方案中,第三新生测序链的长度不小于第一延伸片段的长度。此时,第一测序数据和第三测序数据具有部分重叠数据。利用部分重叠数据进行数据分析,更有利于对模板序列的组装分析及测序数据之间的相互校对,提高测序数据分析的准确性。
(b)去除第三新生测序链。
在一个实施方案中,在步骤(b)之后,在步骤(21)之前,第三测序方法还包括步骤(c)对残余在芯片表面的第三新生测序链的3’末端进行第二封闭处理。对残余的第三新生测序链的3’末端进行封闭能够有效地避免残余的第三新生测序链在第一测序过程中继续延伸产生干扰信号。通过降低干扰信号产生的无效数据对的干扰,可以有效增加有效的测序数据量。由此,通过第二封闭处理能够通过增加有效的测序数据量而进一步提高测序数据分析的准确性。
在一个实施方案中,上述第二封闭处理可采用不同的方法进行,如通过去除3’末端羟基和/或通过使3’末端羟基与延伸反应阻断剂相连而进行。其中,延伸反应阻断剂用以阻断3’末端羟基与磷酸基团的反应,延伸反应阻断剂可为烷基、ddNTP或其衍生物,等。在一个实施方案中,延伸反应阻断剂为ddNTP或其衍生物。
在一个实施方案中,上述第二封闭处理采用DNA聚合酶和末端转移酶的至少之一进行。DNA聚合酶以DNA链为模板,在待封闭的核酸链的3’末端添加ddNTP,从而达到使3’末端封闭的效果。末端转移酶可以直接在单链核酸的3’末端添加ddNTP达到3’末端封闭的效果。
在一个实施方案中,上述第二封闭处理通过聚合酶连接ddNTP或其衍生物。
去除第三新生测序链,可通过物理方法或化学方法(如采用变性试剂)进行,物理方法如高温变性(如80℃-98℃),变性试剂如NaOH、甲酰胺等。在一个实施方案中,通过变性试剂如甲酰胺使第三新生测序链与核酸模板解离从而去除第三新生测序链。
在一个实施方案中,上述第一测序方法及其实施例,第三测序方法及其实施例中的核酸模板分别通过如下步骤获得:
(1-a)使测序文库中的文库分子与固相载体表面的接头进行杂交;
(1-b)利用文库分子作为初始模板,以固相载体表面的接头为引物合成初始模板的互补链以形成核酸模板;
(1-c)除去初始模板,并对芯片表面的核酸分子的3’末端进行第三封闭处理。
第三封闭用于封闭芯片表面的核酸分子,芯片表面的核酸分子包括接头、核酸模板、残余初始模板等。通过第三封闭,可有效避免在测序过程中芯片表面的核酸分子的3’末端连接含有检测信号的核苷酸产生干扰信号,通过降低干扰信号产生的无效数据对的干扰,可以有效增加有效的测序数据量。由此,通过第三封闭处理能够通过增加有效的测序数据量而进一步提高测序数据分析的准确性。
在一个实施方案中,测序文库为DNA文库,DNA文库中的文库分子含有多种单链DNA片段。
在一个实施方案中,上述第一种测序方法或第三种测序方法在(1-c)之前,进一步包括:
(1-b-1)对步骤(1-b)中延伸不完全的互补链的3’末端进行第四封闭处理。
第四封闭用于封闭模板链的互补链的3’末端,可有效避免互补链在测序过程中或扩增过程中继续延伸产生干扰信号,通过降低干扰信号产生的无效数据对的干扰,可以有效增加有效的测序数据量。由此,通过第四封闭处理能够通过增加有效的测序数据量而进一步提高测序数据分析的准确性。
在一个实施方案中,上述第三封闭处理和第四封闭处理可分别采用不同的方法进行,如分别独立地通过去除3’末端羟基和/或通过使3’末端羟基与延伸反应阻断剂相连而进行。其中,延伸反应阻断剂用以阻断3’末端羟基与磷酸基团的反应,延伸反应阻断剂可为烷基、ddNTP或其衍生物,等。在一个实施方案中,上述第一种测序方法及其实施例、第三种测序方法及其实施例中的延伸反应阻断剂分别为ddNTP或其衍生物。
在一个实施方案中,上述第三封闭处理和所述第四封闭处理分别独立地采用DNA聚合酶和末端转移酶的至少之一进行。DNA聚合酶以DNA链为模板,在待封闭的核酸链的3’末端添加ddNTP,从而达到使3’末端封闭的效果。末端转移酶可以直接在单链核酸的3’末端添加ddNTP达到3’末端封闭的效果。
在一个实施方案中,上述第四封闭处理分别独立地通过聚合酶连接ddNTP或其衍生物,上述第三封闭处理通过末端转移酶连接ddNTP或其衍生物。
根据本公开另一种具体的实施方案,本公开提出的测序方法还包括:
(12)提供固相载体表面,固相载体表面连接有核酸模板和第一引物形成的核酸复合体,第一引物的至少一部分被配置为与核酸模板的3'端的至少一部分杂交,核酸模板连接在固相载体表面或者第一测序引物连接在固相载体表面。
在步骤(12)中,第一引物和核酸模板互补,形成核酸复合体,核酸复合体连接在固相载体表面,以实现核酸模板在固相载体表面的固定。
在一种可能的实施方式中,核酸复合体中的核酸模板连接在固相载体表面。此时,核酸模板连接在固相载体表面不是指核酸模板通过第一引物连接在固相载体表面。在一个实施方案中,核酸模板通过与固相载体表面的分子/基团共价键连接,从而实现核酸模板在固相载体表面的连接。
在一些实施方案中,步骤(12)可以通过下述方法实现:核酸模板共价连接在固相载体的表面,加入第一引物并使核酸模板与第一引物杂交,第一引物的至少一部分与所述核酸模板的3'端互补。
在另一种可能的实施方式中,核酸复合体中的第一引物连接在固相载体表面。即第一引物通过共价键连接在固相载体表面,核酸模板通过第一测序引物连接于固相载体表面。此时,核酸模板不与固相载体表面直接连接,而是通过与第一引物互补连接,间接连接在固相载体表面。在一个实施方案中,第一引物与固相载体表面的分子或基团通过共价键连接,从而实现第一引物在固相载体表面的连接。
在一些实施方案中,步骤(12)可以通过下述方法实现:第一引物共价连接在所述固相载体的表面,使核酸模板与第一引物杂交,第一引物的至少一部分与所述核酸模板的3'端互补。
在一些实施方案中,核酸模板的长度小于或等于600bp。在一个实施方案中,核酸模板大于或等于75bp且小于或等于400bp。示例性的,核酸模板为75~80bp、80~90bp、90~100bp、100~120bp、120~150bp、150~180bp、180~200bp、200~220bp、220~250bp、250~280bp、280~300bp、300~320bp、320~350bp、350~380bp、380~400bp等情形。
(22)使用第三核苷酸,在适于边合成边测序反应或者边连接边测序反应的条件下,以核酸模板为模板,以第一引物为引物进行延伸循环来进行第一测序,形成第一新生测序链,第三核苷酸为带有可检测标记的可逆终止子。
在步骤(22)中,利用第三核苷酸作为合成测序反应的底物,第三核苷酸为带有可检测标记的可逆终止子。可逆终止子含有能够阻挡核苷酸的糖的3'位点发生反应的阻断基团,由此可以使得形成第一新生测序链的每一轮延伸反应,只在第一新生测序链上引入一个第三核苷酸。
根据本公开的实施方案,第三核苷酸为带有可检测标记。在一些实施方案中,可检测标记为荧光标记。根据本公开的实施方案,参与延伸反应的每种三核苷酸可以携带不同的荧光标记,或者参与延伸反应的四种第三核苷酸中至少两种第三核苷酸携带不同的荧光标记。示例性的,四种第三核苷酸各自携带四种不同的荧光标记;四种第三核苷酸带三种荧光标记,其中,第一种和第三种第三核苷酸带不同的荧光基团,第四种第三核苷酸携带的荧光基团与前三种第三核苷酸中的一种携带的荧光基团相同,或第四种第三核苷酸不携带荧光基团,应当理解的是,第四种第三核苷酸的类型没有限制。示例性的,四种第三核苷酸携带两种荧光标记,如两种第三核苷酸携带一种相同的荧光标记,另两种第三核苷酸携带另一种相同的荧光标记。示例性的,四种第三核苷酸带一种荧光标记。
然而,可检测标记不一定为荧光标记。允许检测DNA序列中所掺入的核苷酸的种类的任何可检测标记都可以。
由于第三核苷酸为带有可检测标记的可逆终止子,因此,在测序过程中,第三核苷酸在聚合酶的作用下掺入到核酸模板互补链的3’端,同时,由于第三核苷酸的糖的3’羟基的反应活性被阻断,无法进行进一步的序列延伸,使得聚每一轮延伸反应仅能在核酸模板互补链上引入一个第三核苷酸;通过检测到的检测标记以确定掺入的核苷酸种类;通过去除3’端封闭基团,可使核苷酸3’产生游离的羟基而恢复反应活性。
步骤(22)中,适于进行测序反应的条件中包含DNA聚合酶,即:在DNA聚合酶的作用下进行合成测序反应。DNA聚合酶可选用任何可以进行DNA扩增的酶,如Taq酶、Klenow片段、Bst、9°N、Pfu、KOD和Vent中的至少一种。
通过步骤(22),可以读取第一新生测序列的核苷酸类型和排序,获得第一新生测序列的序列信息。本公开中,序列确定的新生测序链,又称为读段,第一新生测序链又可称为第一读段,第二新生测序链又可称为第二读段。进一步的,该实施例中,由第一新生测序列的序列可以确定核酸模板的一部分的序列。
根据本公开的实施方案,第一新生测序链的长度小于核酸模板的长度。
(32)使用第四核苷酸,在适于进行聚合反应的条件下,以第一新生测序链为引物,以核酸模板为模板进行第一延伸,获得第一延伸片段,第四核苷酸为不带有可检测标记的核苷酸。
在步骤(32)中,第四核苷酸为不带有可检测标记的核苷酸,即核苷酸可选择天然核苷酸(dATP、dCTP、dGTP、dTTP)或其衍生物,也可选择不带有可检测标记的终止子,例如第四核苷酸选用不带有可检测标记的3’端被可逆修饰的核苷酸。在一个实施方案中,步骤(32)中加入的第四核苷酸为不带有可检测标记的3’端被可逆修饰的核苷酸。
在步骤(32)反应中,适于进行聚合反应的条件中包含DNA聚合酶,即:在DNA聚合酶的作用下进行合成聚合反应。DNA聚合酶可选用任何可以进行DNA扩增的酶,如Taq酶、Klenow片段、Bst、9°N、Pfu、KOD和Vent中的至少一种。
在一个实施方案中,在相同DNA聚合酶的作用下进行步骤(22)的合成测序反应和步骤(32)的聚合反应,其中,DNA聚合酶为Klenow片段突变体。
在一个实施方案中,在相同DNA聚合酶的作用下进行步骤(22)的合成测序反应和步骤(32)的聚合反应,其中,DNA聚合酶为9°N突变体。
根据本公开另一种具体的实施方案,本公开提出的测序方法包括第三种测序方法,其中第三测序方法在本公开的上述第二方面提出的测序方法的基础上,进一步包括:第一测序引物共价连接在固相载体表面,核酸模板通过第一测序引物连接于固相载体表面。
在一个实施方案中,上述第四核苷酸为天然核苷酸和/或其衍生物。
在一个实施方案中,在步骤(32)之后,上述第三测序方法还包括步骤:(42)去除核酸模板;(52)使用第三核苷酸,在适于边合成边测序反应或者边连接边测序反应的条件下,以核酸模板的互补链为模板,以第二测序引物为引物进行延伸循环来进行第二测序,形成第二新生测序链,获得第二测序数据;其中,核酸模板的互补链是由第一新生测序链和第一延伸片段共同形成。
在一个实施方案中,在步骤(42)之后且在步骤(52)之前,上述第三测序方法还包括:对芯片表面的核酸链3’末端进行第五封闭处理。
第五封闭用于封闭芯片表面的核酸链,芯片表面的核酸分子包括接头、互补链、残余初始模板等。通过第五封闭,可有效避免在测序过程中芯片表面的核酸分子的3’末端连接含有检测信号的核苷酸产生干扰信号,通过降低干扰信号产生的无效数据对的干扰,可以有效增加有效的测序数据量。由此,通过第五封闭处理能够通过增加有效的测序数据量而进一步提高测序数据分析的准确性。
对核酸链末端的封闭,可采用不同的方法,如通过去除3’末端羟基和/或通过使3’末端羟基与延伸反应阻断剂相连。在一个实施方案中,上述第五封闭通过使3’末端羟基与延伸反应阻断剂相连而进行的。其中,延伸反应阻断剂用以阻断3’末端羟基与磷酸基团的反应,延伸反应阻断剂可为烷基、ddNTP或其衍生物,等。在一个实施方案中,上述延伸反应阻断剂为ddNTP或其衍生物。
在一个实施方案中,第五封闭采用末端转移酶进行。末端转移酶可直接将ddNTP或其衍生物连接到核酸链的末端达到封闭3’末端封闭的效果。
去除核酸模板可通过物理方法或化学方法(如采用变性试剂)进行,物理方法如高温变性(如80℃-98℃),变性试剂如NaOH、甲酰胺等,在一个实施方案中,上述去除核酸模板是通过变性试剂甲酰胺使核酸模板链与其互补链解离进行的。
去除核酸模板,可通过物理方法或化学方法(如采用变性试剂)进行,物理方法如高温变性(如80℃-98℃),变性试剂如NaOH、甲酰胺等。在一个实施方案中,通过变性试剂如甲酰胺使核酸模板链与其互补链解解离从而去除核酸模板链。
根据本公开另一种具体的实施方案,本公开提出测序方法包括第四种测序方法,其中第四种测序方法是在上述第二方面提出的测序方法的基础上,进一步包括:第四核苷酸为不带有可检测标记的可逆终止子。利用此种核苷酸,一方面可通过可逆终止子中的阻断基团有效控制合成片段的长度,又可以避免引入荧光染料,从而避免荧光染料切除后残留在碱基上的基团对延伸反应的影响。
在一个实施方案中,上述第四测序方法还包括步骤(43):使用第三核苷酸,在适于边合成边测序反应或者边连接边测序反应的条件下,以核酸模板为模板,以第一延伸片段为引物进行延伸循环来进行第二测序,形成第二新生测序链,获得第二测序数据。
在一个实施方案中,上述第四测序方法还包括步骤(53):重复(32)和(43)步骤N-1次,获得第1~(N+1)新生测序链和第1~(N+1)测序数据,以及第1~N延伸片段,第1~(N+1)新生测序链和第1~N延伸片段共同形成第一新生链;第N延伸片段是通过使用第四核苷酸,在适于进行聚合反应的条件下,以核酸模板为模板,以第N新生测序链为引物进行延伸获得;第N+1新生测序链和第N+1测序数据,是通过使用第一核苷酸,在适于边合成边测序反应或者边连接边测序反应的条件下,以核酸模板为模板,以第N延伸片段为引物进行延伸循环来进行第N+1测序获得;N为大于等于1的正整数;第一新生链的长度不长于核酸模板链的长度。
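The maximum value of N is related to the length of the nucleic acid template and is determined from the length of the nucleic acid template, the length of the nascent sequencing strand and the length of the extension fragment: the maximum N is the integer part of template length / (nascent sequencing strand length + extension fragment length), minus 1. For example, with a 300 bp nucleic acid template, a 25 bp nascent sequencing strand and a 15 bp extension fragment, the maximum N is 6. When N = 1, the first and second sequencing data are obtained.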
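The bound on N is a one-line calculation: 300 / (25 + 15) = 7.5, whose integer part is 7, and 7 - 1 = 6. A tiny helper (the name `max_repeats` is hypothetical) makes the rule easy to apply to other lengths.

```python
def max_repeats(template_len: int, read_len: int, extension_len: int) -> int:
    """Maximum N: integer part of template / (read + extension), minus 1."""
    return template_len // (read_len + extension_len) - 1


print(max_repeats(300, 25, 15))
```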
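In one embodiment, in the fourth sequencing method above, the 1st to Nth extension fragments are each 10-20 bp long. Repeated experimental tests show that an extension fragment of 10-20 bp effectively separates two nascent sequencing strands and reduces the influence of the earlier nascent sequencing strand on the molecular conformation during re-sequencing, thereby preserving the read length and efficiency of the re-sequencing. When the extension fragment is shorter than 10 bp, the molecular conformation during re-sequencing is affected by the previously sequenced strand, so the read length of the re-sequencing becomes shorter and its efficiency decreases. Compared with an extension fragment of 10-20 bp, an extension fragment longer than 20 bp increases the sequencing cost.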
上述第四测序方法及实施例中,核酸模板可以是通过共价键直接固定在固相载体的表面,也可以通过与第一测序引物杂交固定在固相载体的表面,其中第一测序引物通过共价键连接在固相载体表面。在一个实施方案中,在上述第四种测序方法及实施例中,核酸模板通过共价键直接固定在固相载体的表面,核酸模板是通过如下步骤获得:
(1-a)使测序文库中的文库分子与固相载体表面的接头进行杂交;
(1-b)利用文库分子作为初始模板,以固相载体表面的接头为引物合成初始模板的互补链以形成核酸模板;
(1-c)除去初始模板,并对芯片表面的核酸分子的3’末端进行第六封闭处理。
第六封闭用于封闭芯片表面的核酸链,芯片表面的核酸分子包括接头、核酸模板、残余初始模板等。通过第六封闭,可有效避免芯片表面的核酸分子在测序中产生干扰信号,能够进一步提高测序结果的准确性。
去除核酸模板可通过物理方法或化学方法(如采用变性试剂)进行,物理方法如高温变性(如80℃-98℃),变性试剂如NaOH、甲酰胺等,在一个实施方案中,上述去除核酸模板是通过变性试剂甲酰胺使核酸模板链与其互补链解离进行的。
在一个实施方案中,在上述第四测序方法中,在(1-c)之前,进一步包括:(1-b-1)对步骤(1-b)中互补链的3’末端进行第七封闭处理。
第七封闭用于封闭互补链的3’末端,避免在测序过程中互补链的继续延伸产生干扰信号,从而可以有效增加有效数据量,降低无效数据对信息分析的干扰。由此,通过第七封闭处理能够进一步提高测序结果的准确性。
在一个实施方案中,在上述第四测序方法中,第六封闭处理和第七封闭处理分别独立地通过使3’末端羟基与延伸反应阻断剂相连而进行的。
对核酸链末端的封闭,可采用不同的方法,如通过去除3’末端羟基和/或通过使3’末端羟基与延伸反应阻断剂相连。在一个实施方案中,上述第五封闭通过使3’末端羟基与延伸反应阻断剂相连而进行的。其中,延伸反应阻断剂用以阻断3’末端羟基与磷酸基团的反应,延伸反应阻断剂可为烷基、ddNTP或其衍生物,等。在一个实施方案中,上述延伸反应阻断剂为ddNTP或其衍生物。
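In one embodiment of the fourth sequencing method, the sixth blocking treatment and the seventh blocking treatment are each independently performed with at least one of a DNA polymerase and a terminal transferase. A DNA polymerase uses a DNA strand as a template to add a ddNTP to the 3' end of the nucleic acid strand to be blocked, thereby blocking that 3' end. A terminal transferase can add a ddNTP directly to the 3' end of a single-stranded nucleic acid to the same effect.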
在一个实施方案中,当第一测序引物通过共价键连接在固相载体表面,核酸模板通过第一测序引物连接于固相载体表面时,上述第四测序方法还包括:
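(6) using a fifth nucleotide, under conditions suitable for a polymerization reaction, extending with the nucleic acid template as the template and the (N+1)th nascent sequencing strand as the primer to form the complementary strand of the nucleic acid template, the fifth nucleotide being a natural nucleotide and/or a derivative thereof;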
(7)去除核酸模板;
(8)使用第三核苷酸,在适于边合成边测序反应或者边连接边测序反应的条件下,以核酸模板的互补链为模板,以第三测序引物为引物进行延伸循环来进行第N+2测序,形成第N+2新生测序链,获得第N+2测序数据;
(9)使用第四核苷酸,在适于进行聚合反应的条件下,以核酸模板的互补链为模板,以第N+2新生测序链为引物进行延伸,形成第N+2延伸片段;
其中,第一测序引物通过共价键连接在固相载体表面,核酸模板通过第一测序引物连接于固相载体表面。
在一个实施方案中,在步骤(7)之后且在步骤(8)之前,上述第四测序方法还包含步骤(7-a):对芯片表面的核酸分子的3’末端进行第八封闭处理。
第八封闭用以封闭芯片表面的核酸分子。芯片表面的核酸分子包括核酸模板的互补链、第一测序引物、残留的模板等。通过封闭芯片表面的核酸分子,可避免在测序过程中互补链、第一测序引物的继续延伸产生干扰信号,从而可以有效增加有效数据量,降低无效数据对信息分析的干扰。由此,通过第八封闭处理能够进一步提高测序结果的准确性。
在一个实施方案中,上述第四测序方法还包含步骤(10):(10)使用第三核苷酸,在适于边合成边测序反应或者边连接边测序反应的条件下,以核酸模板的互补链为模板,以第N+2延伸片段为引物进行延伸循环来进行第N+3测序,形成第N+3新生测序链,获得第N+3测序数据。
在一个实施方案中,上述第四测序方法还包含步骤(11):(11)重复(9)和(10)步骤N-1次,获得第(N+2)~(2N+2)新生测序链和第(N+2)~(2N+2)测序数据,及第(N+2)~2N+1的延伸片段;第2N+1延伸片段,是通过使用第四核苷酸,在适于进行聚合反应的条件下,以核酸模板的互补链为模板,以第2N+1新生测序链为引物进行延伸获得;第2N+2新生测序链和第2N+2测序数据,是通过使用第三核苷酸,在适于边合成边测序反应或者边连接边测序反应的条件下,以核酸模板的互补链为模板,以第2N+1延伸片段为引物进行延伸循环获得。
对核酸链末端的封闭,可采用不同的方法,如通过去除3’末端羟基和/或通过使3’末端羟基与延伸反应阻断剂相连。在一个实施方案中,上述第四测序方法中的第八封闭通过使3’末端羟基与延伸反应阻断剂相连而进行的。其中,延伸反应阻断剂用以阻断3’末端羟基与磷酸基团的反应,延伸反应阻断剂可为烷基、ddNTP或其衍生物,等。在一个实施方案中,上述延伸反应阻断剂为ddNTP或其衍生物。
在一个实施方案中,在上述第四测序方法中,第八封闭处理采用末端转移酶进行。末端转移酶可以直接在单链核酸的3’末端添加ddNTP达到3’末端封闭的效果。
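The sequencing method provided in the above embodiment obtains sequencing data from different positions of the same template and/or its complementary strand through two or more rounds of sequencing. This increases the amount of sequencing data, and it also allows sequencing data from different positions of the same template or complementary strand, especially data that overlap, to be used to assemble or proofread the template sequence, which improves the efficiency and accuracy of assembly. In the sequencing method provided in an embodiment, blocking the end of the complementary strand, and/or the primers on the chip surface, and/or residual nascent sequencing strands prevents the complementary strand, the sequencing primers immobilized on the chip surface, and/or the nascent sequencing strands from continuing to extend and producing interfering signals in later sequencing. Reducing the invalid data caused by interfering signals increases the amount of valid data and thereby improves the accuracy of the sequencing result. In the sequencing method provided in an embodiment, unlabeled terminators are used to control the length of the extension fragment, both to reduce the effect of the sequenced strand on molecular conformation during re-sequencing and to control sequencing cost. An extension fragment of 10-20 bp effectively separates two nascent sequencing strands and reduces the effect of the earlier nascent sequencing strand on molecular conformation during re-sequencing, which preserves the read length and efficiency of re-sequencing. When the extension fragment is shorter than 10 bp, the molecular conformation during re-sequencing is affected by the previously sequenced strand, so the read length shortens and the sequencing efficiency drops. Compared with 10-20 bp, an extension fragment longer than 20 bp increases sequencing cost.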
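The present disclosure is based on the following findings of the inventors:
As noted above, single-molecule sequencing devices such as HeliScope have relatively short read lengths. The reason is that, over the cycles of the extension reaction, a residue (scar) remains on the base side chain after each fluorescent dye is cleaved off, and the accumulation of these scars significantly affects the detection of fluorescent signals in subsequent extension reactions. As a result, long-read sequencing is currently difficult to achieve with single-molecule sequencing devices such as HeliScope, and the average read length is usually about 40 bp. To sequence longer inserts, the inventors proposed sequencing the same insert at different positions in multiple rounds and, where necessary, carrying out extension with reversible terminators that carry no detectable label. Such unlabeled reversible terminators synthesize a stretch of nucleic acid sequence that serves as a spacer, which weakens the interference of accumulated scars with fluorescent signals in subsequent extension reactions. This improves the effective sequencing efficiency for a single insert and achieves a longer read length. Existing read-analysis strategies clearly do not fully meet the needs of this new type of sequencing technology. After proposing it, the inventors therefore further studied and refined the corresponding read-analysis strategy, thereby completing the present disclosure and providing a new means of sequencing data analysis.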
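According to other specific embodiments of the present disclosure, a sequencing data processing method is provided. The sequencing data are generated by a sequencing strategy in which the same insert is sequenced in multiple rounds. The sequencing data therefore comprise multiple read groups, each corresponding to one insert and each comprising multiple reads. The reads in one read group are obtained from multiple rounds of sequencing of the same insert, so each read corresponds to one round of sequencing. For paired-end sequencing, for example, each read group comprises two reads, Read1 and Read2, which are the sequencing results from the two ends respectively.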
根据本公开的实施方案,在获得测序数据后,本领域技术人员可以通过常规手段,例如每个读段所对应的位点等,对测序数据中的读段进行分组,从而得到多个读段组,每个读段组对应相同的插入片段。进一步,分别针对每个读段组内的读段进行分析和处理,从大量读段中选择可以用于后续拼接的读段。
首先,需要说明的是,本领域技术人员能够理解的是,每个读段组对应一个插入片段,应做广义理解,可以是基于同一条插入片段的核酸模板链不同位置的延伸反应获得的,也可以是基于与该插入片段存在关联关系的其他核酸链的测序反应获得的,这类其他核酸链的例子包括但不限于互补链或者多个相同拷贝(例如通过滚环复制得到的多拷贝)。
如前,按照测序平台的指导,按照预定的测序策略,本领域技术人员容易完成对测序数据中的大量读段(read)进行分组,通常而言每个插入片段对应测序反应芯片上的特定位置,通过区分各读段所对应的芯片位置即可以实现读段的分组。
继续下来,针对每个读段组中的读段进行分析,从而得到可以进行拼接的读段。下面参考图1~3,针对每个读段组中的多个读段处理进行详细描述。
S110:将多个读段与参考基因组进行全局比对,以便在参考基因组上确定与多个读段对应的多个匹配区域。
在该步骤中,通过采用全局比对,将各读段与参考基因组进行比对,可以确定各读段在参考基因组序列上的匹配位置。
在本文中所使用的术语“全局比对”是指将参与比对的两条序列里面的所有字符进行比对。当然,在本文中是指将读段与参考基因组或其一部分进行比对,全局比对在全局范围内对两条序列进行比对打分,找出最佳比对,通常主要被用来寻找关系密切的序列。全局比对的代表性算法是Needleman-Wunsch算法。当然,也可以使用测序平台所提供的算法进行全局比对,例如参看CN107403075A记载的内容可以实现上述全局比对操作。
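To make the notion of global alignment concrete, the following minimal sketch computes a Needleman-Wunsch score for two short sequences. The scoring values (match, mismatch, gap) are assumptions chosen for illustration; the source does not specify them, and a production pipeline would use a dedicated mapper instead of this quadratic routine.

```python
def needleman_wunsch_score(a: str, b: str,
                           match: int = 1,
                           mismatch: int = -1,
                           gap: int = -2) -> int:
    """Best global alignment score of a against b (linear gap penalty)."""
    # prev[j] holds the best score of aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("ACGT", "ACGT"))  # 4
print(needleman_wunsch_score("ACGT", "AGT"))   # 1
```

Every character of both sequences is accounted for in the score, which is what distinguishes global from local alignment.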
S120:基于多个匹配区域之间的实际相对位置与预设位置要求的比较,对多个读段进行一次筛选,以便获得可拼接读段和过滤读段,
在完成全局比对后,可以确定读段在参考基因组序列上的匹配(mapping)区域。其中,如果读段只能与参考基因组序列的一个区域比对上,即只有一个匹配区域,则该读段被称为唯一比对序列(唯一比对read)。
根据本公开的实施方案,在实施多轮测序反应时,采用了不同的测序策略,如参见图8~图11所显示的多种测序策略。显然,这些测序策略对应了多个读段之间的相对位置关系。因此,可以通过将多个读段的多个匹配区域的实际相对位置与预先设定的位置要求进行比较,满足该要求的读段可以作为可拼接组合,后续进行拼接使用。由此,根据本公开的实施方案,预设位置要求是由多轮测序的规则确定的,实际相对位置满足预设位置要求是读段作为可拼接读段的指示;实际相对位置不满足预设位置要求是读段作为过滤读段的指示。
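The sequencing data processing method according to the embodiments of the present disclosure effectively screens reads from multiple rounds of sequencing of the same insert to obtain reads that can be stitched, which improves the efficiency of downstream processing of the sequencing data and avoids the adverse effects of overly short reads.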
另外,根据本公开的实施方案,在前面通过一次筛选得到可拼接的读段和不满足预设位置要求而被过滤的过滤读段后,可以进一步对过滤读段进行二次筛选。由此,根据本公开的实施方案,进一步包括:
S130:对于过滤读段进行二次筛选。
由于全局比对有其自身的局限性,因此,在一次筛选中被过滤掉的过滤读段有可能仍然包含有用的读段,因此,通过进行二次筛选,可以将这些读段找出来。
具体的,根据本公开的实施方案,二次筛选的过程包括:
S210:将读段组的至少一个作为初步读段,并基于初步读段对应的匹配区域和预设位置要求确定参考基因组上的二次比对区域。
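In this step, one read is taken as the preliminary read. The preliminary read need not be a filtered read; it may also be a read that was already selected as a stitchable read in the first-pass screening.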
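After the preliminary read is determined, a secondary alignment region is delimited within a certain range around it, for example by extending outward from both ends of the preliminary read by a certain length such as 100 bp, 200 bp, 300 bp, 500 bp, 1000 bp, or even 2000 bp. The secondary alignment region is then searched for filtered reads that can be aligned to it. This further improves the accuracy of the sequencing result and also avoids losing read information that arises from mutations in the sample nucleic acid. Because the sample nucleic acid may carry mutations, the reads corresponding to those mutations usually fail the preset position requirement when aligned to the reference genome.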
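Delimiting the secondary alignment region amounts to widening the preliminary read's matching interval by a flank on each side. The sketch below assumes 0-based, end-exclusive coordinates and an optional chromosome length for clamping; these conventions and the function name are illustrative, not taken from the source.

```python
from typing import Optional, Tuple


def secondary_region(start: int, end: int, flank: int = 300,
                     chrom_len: Optional[int] = None) -> Tuple[int, int]:
    """Extend [start, end) by `flank` bp on both sides, clamped to the chromosome."""
    if start < 0 or end < start or flank < 0:
        raise ValueError("invalid interval or flank")
    lo = max(0, start - flank)
    hi = end + flank
    if chrom_len is not None:
        hi = min(hi, chrom_len)
    return lo, hi


# A read matched at 1000-1040 with a 300 bp flank.
print(secondary_region(1000, 1040, flank=300))  # (700, 1340)
```

The flank is one of the values listed in the text (100 bp to 2000 bp); a larger flank tolerates larger structural differences at the cost of more local-alignment work.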
S220:将过滤读段的每一个读段分别与二次比对区域进行局部比对,并将满足预定阈值的读段和初步读段归类为可拼接读段。
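Unlike global alignment, local alignment does not align two complete sequences; it aligns selected local regions of each sequence. The need for it arose from the observation that some protein sequences differ considerably overall yet contain local regions that independently perform the same function and are highly conserved. Global alignment clearly cannot recover such locally similar sequences. Likewise, in eukaryotic genes, introns are highly variable while exons are comparatively conserved, and here too global alignment shows its limits and cannot find the locally similar sequences. The representative algorithm is the Smith-Waterman local alignment algorithm.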
通过局部比对,可以在二次比对区域中,完成对过滤读段的二次筛选。这里所提到的预定阈值以及在本文中其他位置所提到的阈值,均可以通过对已知属性的样本进行统计分析获得。
由此,可以通过结合全局比对和局部比对,在经过一次比对不满足条件需要被去除的读段中获取可以用于拼接的读段,从而节省了测序资源,同时也提高了测序的准确性。
根据本公开的实施方案,将读段组的每一个读段均作为初步读段,进行二次筛选。由此,可以尽可能完成对所有读段的筛选。
根据本公开的实施方案,进一步包括:
S140:对可拼接读段按照多轮测序的规则进行拼接。
这里的拼接,可以按照多轮测序的规则,将可以拼接的读段,通过在未知位置添加N或者将重叠区域合并,必要时候还需要进行正链和反链之间的转换后进行拼接,这里不再赘述。
根据本公开的实施方案,多轮测序的规则包括选自下列的至少之一:双端测序、Jumping测序、Overlap测序、双端Jumping测序以及这些测序规则的组合。
根据本公开的实施方案,参考图8,多轮测序的规则为双端测序,读段组包括两个读段,预设位置要求包括:两个读段的匹配区域分别位于参考基因组的正链和反链上;和两个读段的匹配区域在参考基因组上的距离不超过预定阈值,其中,预定阈值是基于插入片段的长度确定的。本领域技术人员可以通过各种已知的方案进行双端测序,这里不再进行赘述。
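The paired-end position requirement can be expressed as a small predicate. This is a sketch under stated assumptions: the `Hit` record, the 0-based end-exclusive coordinates, the same-chromosome check, and the definition of distance as the gap between the two matching regions are all illustrative choices; the 300 bp default echoes the example value used later in the text and should in practice be derived from the insert length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One matching region on the reference (hypothetical record type)."""
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    strand: str  # "+" or "-"


def gap_between(a: Hit, b: Hit) -> int:
    """Number of reference bases separating two regions (0 if they overlap)."""
    return max(0, max(a.start, b.start) - min(a.end, b.end))


def passes_paired_end(r1: Hit, r2: Hit, max_distance: int = 300) -> bool:
    """True when the reads lie on opposite strands and close enough together."""
    return (r1.chrom == r2.chrom
            and r1.strand != r2.strand
            and gap_between(r1, r2) <= max_distance)


read1 = Hit("chr1", 100, 140, "+")
read2 = Hit("chr1", 300, 340, "-")
print(passes_paired_end(read1, read2))  # True
```

Read pairs that pass would be kept as stitchable reads; pairs that fail become filtered reads and candidates for the second-pass screening.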
根据本公开的实施方案,对双端测序的测序数据进行分析的方法具体包括:
首先,通过比对算法可以分别得到双端测序的序列文件Fa1、Fa2,并且两个文件中的序列是位置上对应的。所谓位置上对应指文件中相同序号的读段,来自测序反应芯片上的物理位置一致。由此Fa1、Fa2中相同序号的读段分别对应读段1和读段2,且对应双端测序示意图中的两次测序的读段。
对Fa1和Fa2分别使用全局比对算法将其比对到对应基因组上,分别得到比对后的结果文件Sam1和Sam2。全局比对算法可以选用第三方mapping软件或者使用GenoCare配套的DirectAlignment算法软件。
对Sam1和Sam2中的序列,根据其每个位置上对应的双端序列的比对结果可以分为三类。分别为:1.双端序列均唯一比对到基因组上;2.双端序列有且仅有一端序列唯一比对到基因组上;3.双端序列均没有唯一比对到基因组上。
对于类1,若双端序列唯一比对结果分别在正反链上,且比对位置在一定距离范围内(如300bp内),则判断该位置为正确的双端测序位置,且两端序列可以拼接为一段较长且更置信的序列。若双端序列唯一比对结果不在正反链上,或唯一比对位置较远(如大于1000bp),则不认为该位置是准确的双端测序位置。这时,分别在双端序列唯一比对位置的前后300bp范围内局部比对(在本文中也将局部比对称为“细致比对”)另一端读段,若另一端读段可以找到相应位置,则认为该位置为准确的双端测序位置。若双端序列唯一比对位置上均找不到另一端读段可以匹配的位置,则舍弃该双端序列。
对于类2,在唯一比对的位置前后300bp位置范围内细致比对另一端读段,若另一端读段可以找到相应位置,则认为该唯一比对位置为正确的双端测序位置。反之舍弃该双端序列。
对于类3,若双端序列均能够比对上基因组但不唯一比对到基因组上,则按照类1处理;若双端序列有且仅有一端比对上基因组但不唯一比对到基因组上,则按照类2处理;若双端均不能比对到基因组上,则舍弃该双端序列。
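The three-way classification above depends only on how many genomic positions each end aligns to. A minimal sketch, assuming the hit counts per read are already available from the mapping result; the function name and the returned labels are hypothetical.

```python
def classify_pair(n_hits_1: int, n_hits_2: int) -> str:
    """Route a read pair by the number of alignment positions of each end.

    1 hit  -> uniquely aligned
    >1 hit -> aligned but not uniquely
    0 hits -> not aligned
    """
    unique = (n_hits_1 == 1, n_hits_2 == 1)
    if all(unique):
        return "class1"            # both ends uniquely aligned
    if any(unique):
        return "class2"            # exactly one end uniquely aligned
    # Class 3: neither end is uniquely aligned.
    mapped = (n_hits_1 > 0) + (n_hits_2 > 0)
    if mapped == 2:
        return "class3-as-class1"  # both multi-mapped: handle like class 1
    if mapped == 1:
        return "class3-as-class2"  # one multi-mapped: handle like class 2
    return "discard"               # neither end aligns


print(classify_pair(1, 1))  # class1
print(classify_pair(0, 4))  # class3-as-class2
```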
在本文中采用的局部比对算法包括但不限于Smith-Waterman算法。另外,“另一条读段可以找到相应位置”指Smith-Waterman比对结果中局部最优序列长度大于预设阈值且错误率低于预设阈值则认为找到相应位置。
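The acceptance rule for local alignment (locally optimal length above a threshold and error rate below a threshold) can be sketched with a compact Smith-Waterman implementation. The scoring values and both thresholds below are assumptions for illustration; the source only states that thresholds are preset.

```python
def smith_waterman(query: str, ref: str, match: int = 2,
                   mismatch: int = -1, gap: int = -2):
    """Return (aligned_length, errors, ref_start) of the best local alignment."""
    rows, cols = len(query) + 1, len(ref) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if query[i - 1] == ref[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the local alignment starts (score 0).
    i, j = best_pos
    length = errors = 0
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if query[i - 1] == ref[j - 1] else mismatch)
        if score[i][j] == diag:
            errors += query[i - 1] != ref[j - 1]  # substitution counts as an error
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            errors += 1                           # base present only in the query
            i -= 1
        else:
            errors += 1                           # base present only in the reference
            j -= 1
        length += 1
    return length, errors, j


def found_position(query: str, region: str, min_len: int = 20,
                   max_error_rate: float = 0.1) -> bool:
    """Apply the 'length above threshold, error rate below threshold' rule."""
    length, errors, _ = smith_waterman(query, region)
    return length > min_len and errors / length < max_error_rate


print(smith_waterman("ACGTACGTAC", "TTTTACGTACGTACGGGG"))  # (10, 0, 4)
```

In the pipeline described here, `region` would be the reference sequence of the secondary alignment region, so the quadratic cost stays small.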
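Next, the sequences in Sam1 and Sam2 that are confirmed as paired-end positions are merged and written to a single Sam file. They are merged as follows. If read 1 and read 2 overlap, the overlapping region is merged and the reads are stitched into one longer sequence; a consensus base-calling strategy can be used for stitching. If read 1 and read 2 do not overlap, N is used to mark the missing middle, with the number of N equal to the number of bases separating the two reads. If no correct paired-end position is found for the reads in Sam1 and Sam2, the reads in Sam1 or Sam2 that align to the genome (including uniquely aligned reads) are output.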
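The merge rule can be sketched on two reads whose reference positions are known. The sketch assumes both sequences are already expressed on the forward strand, uses 0-based start coordinates, and ignores insertions and deletions. Writing N at disagreeing overlap positions is a simplification introduced here; it stands in for the consensus base-calling strategy, which the source does not spell out at this point.

```python
def merge_by_position(seq1: str, start1: int, seq2: str, start2: int) -> str:
    """Merge two forward-strand reads using their reference start positions."""
    if start2 < start1:  # make seq1 the leftmost read
        seq1, start1, seq2, start2 = seq2, start2, seq1, start1
    end1 = start1 + len(seq1)
    if start2 >= end1:
        # No overlap: fill the unknown middle with one N per missing base.
        return seq1 + "N" * (start2 - end1) + seq2
    offset = start2 - start1
    overlap = min(end1 - start2, len(seq2))
    head = seq1[:offset]
    # Keep agreeing bases; mark disagreements as N (simplified consensus).
    consensus = "".join(a if a == b else "N"
                        for a, b in zip(seq1[offset:offset + overlap], seq2[:overlap]))
    rest1 = seq1[offset + overlap:]  # non-empty only if seq2 lies inside seq1
    return head + consensus + (rest1 if rest1 else seq2[overlap:])


print(merge_by_position("ACGT", 0, "TTAA", 6))    # ACGTNNTTAA
print(merge_by_position("ACGTAC", 0, "ACGG", 4))  # ACGTACGG
```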
根据本公开的实施方案,多轮测序的规则为Jumping测序,预设位置要求包括:多个读段的匹配区域位于参考基因组的相同链上;和多个读段的匹配区域中相邻两个读段在参考基因组上的距离不超过预定距离阈值,其中,预定阈值是基于部分延伸步骤的长度确定的,例如,预定距离阈值不超过50bp,例如不超过20bp,例如在5~20bp之间。参考图9,根据本公开的实施方案,Jumping测序包括:提供核酸模板,核酸模板直接或者间接连接在固相载体的表面;采用第一核苷酸和第二核苷酸,与核酸模板发生多轮延伸反应,其中,第一核苷酸为带有可检测标记的可逆终止子,并且用于通过延伸反应获得多个读段;第二核苷酸为不带有可检测标记的可逆终止子,并且用于通过延伸反应获得至少一个预设长度的合成片段。
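The Jumping position requirement differs from the paired-end one in two ways: all reads must lie on the same strand, and the check applies to each adjacent pair. A minimal sketch, assuming each matching region is a `(chrom, start, end, strand)` tuple with 0-based, end-exclusive coordinates; the tuple layout and the 20 bp default are illustrative.

```python
from typing import List, Tuple

Region = Tuple[str, int, int, str]  # (chrom, start, end, strand)


def passes_jumping(hits: List[Region], max_gap: int = 20) -> bool:
    """True when all reads share one strand and adjacent gaps are small enough."""
    if len(hits) < 2:
        return False
    if len({h[0] for h in hits}) != 1 or len({h[3] for h in hits}) != 1:
        return False  # different chromosomes or mixed strands
    ordered = sorted(hits, key=lambda h: h[1])
    for left, right in zip(ordered, ordered[1:]):
        gap = max(0, right[1] - left[2])  # bases skipped by the unlabeled extension
        if gap > max_gap:
            return False
    return True


reads = [("chr1", 0, 40, "+"), ("chr1", 55, 95, "+")]
print(passes_jumping(reads))  # True
```

The threshold corresponds to the length of the partial-extension step, so it should track the number of unlabeled extension cycles actually run.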
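According to an embodiment of the present disclosure, the multi-round sequencing rule is Overlap sequencing, and the preset position requirement comprises: the matching regions of the multiple reads lie on the same strand of the reference genome; and the length of the overlap on the reference genome between two adjacent reads among the matching regions falls within a predetermined distance range, where the predetermined distance range is determined from the overlap length in the sequencing process, for example 5-10 bp. Referring to Figure 10, according to an embodiment of the present disclosure, Overlap sequencing comprises: a nucleic acid template attached directly or indirectly to the surface of a solid support; and multiple rounds of extension with the nucleic acid template using a first sequencing adapter and a second sequencing adapter to obtain multiple reads, where the first read produced with the first sequencing adapter and the second read produced with the second sequencing adapter overlap by at least one base. Optionally, the first sequencing adapter is extended with the first nucleotide to obtain the first read, and the second sequencing adapter is first extended with the second nucleotide and then extended multiple times with the first nucleotide to obtain the second read.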
根据本公开的实施方案,对于Overlap测序,其读段的分析过程如下:
参考前面针对双端测序的实施例,如前通过GenoCare配套的BaseCalling算法可以得到相应的测序序列文件Fa。本实例中可以实现N个Overlap测序序列的拼接。但为了表述方便,本实例中按照2次测序的结果处理,因此可以得到两次测序的序列文件Fa1和Fa2。
尽管通过实验过程中的参数设置可以将重叠的平均长度控制在5-10bp,但有时也会发生不出现重叠的情况。在拼接过程中,使用局部比对算法(如Smith-Waterman)可以找到两段序列中局部最相似的区域。在比对的结果中若相似区域长度小于预设阈值(如5bp)或相似区域的错误率大于预设阈值,则认为该拼接结果不置信。排除上述两种情况,可以通过相似区域将两段序列进行拼接。
接下来,将拼接结果整合输出到同一个Fa文件中。对于判断为“不置信”的拼接,则输出读段1和读段2中长度较长的读段到最终Fa文件中。
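As noted above, when there are more than two Overlap sequencing rounds, the read obtained by pairwise stitching is set as read 1 and the preceding operations are repeated; iterating in this way yields longer reads, which are written to the final Fa file.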
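The iterative stitching can be sketched as a fold over the reads. For brevity, this sketch replaces the Smith-Waterman search with a simpler suffix-prefix comparison that tolerates substitutions only; that substitution is an assumption made here, not the method of the source. The "keep the longer read when the overlap is not trusted" rule follows the text, and the thresholds are illustrative.

```python
from typing import List


def stitch(left: str, right: str, min_overlap: int = 5,
           max_error_rate: float = 0.2) -> str:
    """Join two reads on their longest acceptable suffix-prefix overlap."""
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        mismatches = sum(a != b for a, b in zip(left[-k:], right[:k]))
        if mismatches / k <= max_error_rate:
            return left + right[k:]  # trusted overlap: extend the left read
    # No trusted overlap: output the longer of the two reads.
    return max(left, right, key=len)


def stitch_all(reads: List[str], **kwargs) -> str:
    """Fold pairwise stitching over reads given in sequencing order."""
    merged = reads[0]
    for nxt in reads[1:]:
        merged = stitch(merged, nxt, **kwargs)
    return merged


reads = ["AAACCCGGGTT", "GGGTTACGTA", "ACGTATTTT"]
print(stitch_all(reads, min_overlap=5, max_error_rate=0.0))  # AAACCCGGGTTACGTATTTT
```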
根据本公开的实施方案,多轮测序的规则为双端Jumping测序,预设位置要求包括:多个读段的匹配区域的一部分位于参考基因组的正链,另一部分位于参考基因组的反链上;和多个读段的匹配区域中相邻两个读段在参考基因组上的重叠区域长度在预定距离范围,其中,预定距离范围是基于测序过程中部分延伸步骤的长度确定的,例如,预定距离阈值不超过50bp,例如不超过20bp,例如在5~20bp之间。参考图11,根据本公开的实施方案,双端Jumping测序包括:使核酸模板与第一引物杂交,第一引物的至少一部分与核酸模板的3'端互补,第一引物共价连接在固相载体的表面上;采用第一核苷酸和第二核苷酸,基于第一引物与核酸模板发生多轮延伸反应,并获得第一引物延伸链;去除核酸模板,并使第二引物与第一引物延伸链杂交;采用第一核苷酸和第二核苷酸,基于第二引物与第一引物延伸链发生多轮延伸反应;其中,第一核苷酸为带有可检测标记的可逆终止子,并且用于通过延伸反应获得多个读段;第二核苷酸为不带有可检测标记的可逆终止子,并且用于通过延伸反应获得至少一个预设长度的合成片段。
根据本公开的实施方案,可以通过结合双端测序和Jumping测序的规则进行双端Jumping测序,并参考前面所描述的分析过程完成对双端Jumping测序结果的分析。其中,具体的,通过双端Jumping测序得到N个测序片段。对于同一位置上双端测序的不同测序片段分别表示为Reads1,1、Reads1,2、…、Reads1,N,Reads2,1、Reads2,2、…、Reads2,N。
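For stitching reads from paired-end Jumping sequencing, the experimental design can, as needed, ensure that the interleaved fragments from the two ends overlap. Stitching uses the interleaved sequences from the two ends, for example Reads1,N-1 with Reads2,1 and Reads2,2. Before stitching begins, the Reads2 sequences must be converted to their reverse complements. The remaining steps are not repeated here. Finally, the stitched sequences are written to the final Fa file.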
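Converting the Reads2 sequences to their reverse complements before stitching is a one-line operation in Python. The helper name is illustrative; N is mapped to itself so that uncalled bases survive the conversion.

```python
# Translation table for DNA bases, including N for uncalled positions.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("AACGN"))  # NCGTT
```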
根据本公开的另一些具体的实施方案,本公开提出一种测序数据处理设备,测序数据包括多个读段组,读段组包括多个读段,多个读段是通过对同一插入片段进行多轮测序而获得的,设备包括针对每个读段组的多个读段进行下列处理的多个模块:
全局比对模块110,用于将多个读段与参考基因组进行全局比对,以便在参考基因组上确定与多个读段对应的多个匹配区域;和一次筛选模块120,用于基于多个匹配区域之间的实际相对位置与预设位置要求的比较,对多个读段进行一次筛选,以便获得可拼接读段和过滤读段,其中,预设位置要求是由多轮测序的规则确定的,实际相对位置满足预设位置要求是读段作为可拼接读段的指示;和实际相对位置不满足预设位置要求是读段作为过滤读段的指示。
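The sequencing data processing device effectively implements the sequencing data processing method described in the foregoing aspect. That method effectively screens reads from multiple rounds of sequencing of the same insert to obtain reads that can be stitched, which improves the efficiency of downstream processing of the sequencing data and avoids the adverse effects of overly short reads.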
根据本公开的实施方案,进一步包括
二次筛选模块130,用于对于过滤读段进行二次筛选,二次筛选包括:将读段组的至少一个作为初步读段,并基于初步读段对应的匹配区域和预设位置要求确定参考基因组上的二次比对区域;和将过滤读段的每一个读段分别与二次比对区域进行局部比对,并将满足预定阈值的读段和初步读段归类为可拼接读段。
根据本公开的实施方案,进一步包括:
拼接模块140,用于对可拼接读段按照多轮测序的规则进行拼接。
根据本公开的实施方案,多轮测序的规则包括选自下列的至少之一:双端测序、Jumping测序、Overlap测序、双端Jumping测序以及这些测序规则的组合。
根据本公开的另一些具体的实施方案,本公开提出一种计算设备,根据本公开的实施方案,其包括:处理器和存储器;存储器,用于存储计算机程序;处理器,用于执行计算机程序以实现前面所述的测序数据处理方法。
根据本公开的另一些具体的实施方案,本公开提出一种计算机可读存储介质,根据本公开的实施方案,存储介质包括计算机指令,当指令被计算机执行时,使得计算机实现前面所述的测序数据处理方法。
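It should be noted that the features and advantages described above for the sequencing method and the sequencing data processing method apply equally to the other aspects and are not repeated here.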
另外,为了方便理解,下面对可以与本公开的测序方法以及分析方法匹配的测序策略进行详细描述。
下面将结合实施例对本公开的方案进行解释。本领域技术人员将会理解,下面的实施例仅用于说明本公开,而不应视为限定本公开的范围。实施例中未注明具体技术或条件的,按照本领域内的文献所描述的技术或条件或者按照产品说明书进行。所用试剂或仪器未注明生产厂商者,均为可以通过市购获得的常规产品。
实施例
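The Genocare single-molecule sequencing platform used in the examples detects the type of incorporated nucleotide with a TIRF imaging system. Genocare sequencing can be run in several modes. In the first mode, the four nucleotides carry the same fluorescent signal, and one nucleotide is added per reaction round for signal detection. In the second mode, the four nucleotides carry two different fluorescent signals, and two nucleotides are added per reaction round. In the third mode, the four nucleotides carry four different fluorescent signals, and all four nucleotides are added per reaction round. For the detailed sequencing procedure, see the article "Single molecule targeted sequencing for cancer gene mutation detection", Scientific Reports 6:26110, DOI: 10.1038/srep26110, and the descriptions of the sequencing procedure in patent applications CN201680047468.3, CN201910907555.7, CN201880077576.4, and/or CN201911331502.1.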
实施例中采用的试剂:
清洗液1组分包括:150mmol/L的氯化钠,15mmol/L的柠檬酸钠,150mmol/L的4-羟乙基哌嗪乙磺酸,0.1%的十二烷基硫酸钠。
清洗液2的组分包括:150mmol/L的氯化钠,150mmol/L的4-羟乙基哌嗪乙磺酸。
杂交液:3×SSC缓冲液,由20×SSC缓冲液(西格玛,#S6639-1L)用无核酸酶水(Rnase-free水)稀释而成。
Cold-dNTP:末端封闭的核苷酸,包含末端封闭的腺嘌呤核苷酸(Cold-dATP)、末端封闭的胸腺嘧啶核苷酸(Cold-dTTP)、末端封闭的胞嘧啶核苷酸(Cold-dCTP)、末端封闭的鸟嘌呤核苷酸Cold-dGTP。末端封闭的核苷酸购MyChem公司的核苷酸,其为3’OH被可逆封闭基团封闭的天然的dATP、dTTP、dCTP、dGTP。
表1:接头和测序引物序列
Figure PCTCN2022125967-appb-000004
实施例1
1.文库构建
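A DNA library preparation kit from Vazyme (catalog no. ND606-01; Universal DNA Library Prep Kit for Illumina V2) was used to ligate the D7-S1-T/D9-S2 adapter to DNA fragments (100-300 bp). No PCR amplification was needed after ligation; the product was purified directly with Vazyme DNA purification beads (catalog no. N411-01; VAHTS DNA Clean Beads) to obtain the target library.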
具体地,本实施例中文库构建的步骤包括:
1)DNA片段进行末端修复和加A尾,反应体系与条件如表2所示:
Table 2: Reaction system
H2O    (16.2-X) μL
End-repair mix (EndPrepMix)    3.8 μL
DNA fragments (50 ng total)    X μL
Total volume    20 μL
反应条件为:20℃反应15分钟,接着在65℃条件下反应10分钟。
2)末端修复加A产物与接头进行连接,反应体系与条件如表3所示:
表3:反应体系
末端修复加A产物 20μL
D7-S1-T/D9-S2接头(20μmol/L) 5μL
连接混合体系(LigationMix) 25μL
总体积 50μL
反应条件为,混匀后室温放置15min。
4)连接产物纯化
纯化使用VAHTS DNA Clean Beads(N411-01)试剂盒并按试剂盒说明书所示步骤进行纯化,回收产物10μL,完成测序文库的构建。具体步骤如下:
a)将连接后的PCR体系转移至1.5mLEP管中,加入0.8×(40μL)磁珠,吹打混匀10次,室温放置3分钟;
b)将1.5mL EP管放置在磁力架上,静置2-3分钟,移去上清;
c)用200μL体积80%乙醇洗涤,漂洗磁珠,室温孵育30sec,小心移除上清;
d)开盖干燥磁珠约5-10分钟至残余乙醇完全挥发;
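e) Remove the tube from the magnetic rack and add 22 μL of deionized water for elution; mix thoroughly, let stand at room temperature for 3 minutes, and place on the magnetic rack for 3 minutes. Once the liquid is clear, recover 20 μL of product, then add 1.2× (24 μL) beads, pipette up and down 10 times to mix, and let stand at room temperature for 3 minutes;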
f)将1.5mLEP管放置在磁力架上,静置2-3分钟,移去上清;
g)重复步骤c)~d)一次;
h)加入11μL体积的去离子水从磁力架上取下进行洗脱,充分混匀后室温静置3分钟,置于磁力架上3分钟,待液体澄清后,回收产物10μL,完成测序文库构建。
5)定量及检测
使用Qubit 3.0仪器和Qubit dsDNA HS检测试剂盒对构建的文库进行浓度检测。
使用Labchip DNA HS检测试剂盒和LabChip仪器对构建的文库进行片段分布检测。
2.文库与芯片表面探针进行杂交
芯片选择:
所用的芯片为环氧基修饰的芯片,通过探针上的氨基和芯片表面的环氧基团反应的方法,例如参看公开号CN109610006A公开的内容来固定探针(序列:TTTTTTTTTTTCCTTGATACCTGCGACCATCCAGTTCCACTCAGATGTGTATAAGAGACAG)(SEQ ID NO:4)。
文库与芯片上探针杂交过程如下:
1)取3μL体积20nmol/L浓度的步骤一构建的测序文库,加入3μL的去离子水,混合均匀,于95℃热变性5分钟;
2)将变性文库迅速置于冰水混合物冷却2分钟以上;
3)加入24μL的杂交液,将文库稀释至2nmol/L的工作浓度。
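4) Pass the 30 μL of diluted hybridization library from step 3) into one channel of the chip, hybridize at 42°C for 30 minutes, and then cool to room temperature;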
5)向测序通道中通入200μL的清洗液1,去除未杂交至芯片表面的文库;
6)向芯片测序通道通入200μL的清洗液2,去除清洗液1,完成文库与测序芯片表面接头的杂交。
实施例2双端测序
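The chip hybridized with the library in Example 1 is placed in the Genocare single-molecule sequencer for sequencing. The sequencing steps are as follows, and the sequencing workflow is shown schematically in Figure 8.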
2.1测序方法
2.1.1 Read1测序
利用Genocare单分子测序平台进行80个循环的测序,测序过程中采用四种核苷酸带有两种不同的荧光信号,每轮反应加入两种标记不同荧光信号的核苷酸进行信号检测的方式进行测序。
2.1.2合成初始模板完整的互补链
Read1测序结束后新生的测序链继续延伸合成初始模板完整的互补链,具体过程如下:
1)向芯片测序通道泵入750μL的延伸试剂,其中,延伸试剂组分为:120U/ml Bst DNA聚合酶(NEB,#M0275M),0.2mmol/L dNTP(dATP、dTTP、dCTP、dGTP各0.2μmol/L的混合物),1M甜菜碱,20mmol/L的三羟甲基氨基甲烷,10mmol/L的氯化钠,10mmol/L的氯化钾,10mmol/L的硫酸铵,3mmol/L的氯化镁,0.1%的Triton X-100,pH值为8.3;
2)将芯片升温至60±0.5℃,反应10分钟;
3)向芯片测序通道泵入220μL的清洗液1,去除延伸试剂;
4)向芯片测序通道泵入440μL的清洗液2,去除清洗液1,完成初始模板互补链的合成。
2.1.3去除初始模板
通过加入变性试剂去除初始模板,具体步骤如下:
1)将芯片降温至55±0.5℃
2) Pass 800 μL of formamide into the chip sequencing channel and denature for 2 minutes;
3)通入220μL体积的清洗液1,去除变性后的初始模板;
4)重复步骤2)和步骤3)一次,完成对初始模板的去除。
2.1.4 3’OH封闭
利用封闭试剂封闭芯片表面核酸链的3’OH,具体过程如下:
1)将芯片降温至37±0.5℃;
2)向芯片测序通道中通入440μL体积的清洗液2,去除清洗液1;
3)通入750μL体积的封闭试剂2,反应15分钟。其中,封闭试剂2的组分为:100U/ml Terminal Transferase(NEB,M0315L),1×Terminal Transferase Buffer,0.25mmol/L氯化钴,100μmol/L ddNTP mix(ddATP、ddTTP、ddCTP、ddGTP各100μmol/L的混合物);
4)通入220μL体积的清洗液1,完成对芯片表面核酸链3’OH的封闭。
2.1.5 Read 2测序
The sequencing primer is added to the sequencing channel and Read2 sequencing is performed, as follows:
1)向测序通道中通入800μL体积的稀释的测序引物杂交液,杂交反应30分钟。稀释的测序引物杂交液为含有0.1μmol/L引物D7S1T-R2P的清洗液3;
2)将芯片在37±0.5℃条件下,保持90秒;
3)向测序通道中通入220μL体积的清洗液1,去除通道中未被杂交的测序引物;
4)向测序通道中通入440μL体积的清洗液2,去除清洗液1,完成测序引物的杂交。
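Sequencing is carried out in the same way as Read1 sequencing in section 2.1.1 of this example to obtain the Read2 sequencing result.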
测序结果:利用该测序方法,获得有效的测序数据Read 1、Read 2用于测序分析。
2.2测序结果分析
2.2.1:获取双端测序序列
通过比对算法可以分别得到双端测序的序列文件Fa1、Fa2,并且两个文件中的序列是位置上对应的。所谓位置上对应指文件中相同序号的Reads,来自测序中的物理位置一致。
2.2.2:序列mapping
对Fa1和Fa2分别使用mapping算法将其比对到对应基因组上,分别得到比对后的结果文件Sam1和Sam2。Mapping算法可以选用已公开的方法。
2.2.3:分类处理双端序列
对Sam1和Sam2中的序列,根据其每个位置上对应的双端序列的比对结果可以分为三类。分别为:1.双端序列均Unique Mapping到基因组上;2.双端序列有且仅有一端序列Unique Mapping到基因组上;3.双端序列均没有Unique Mapping到基因组上。
对于类1,若双端序列Unique Mapping结果分别在正反链上,且mapping位置在一定距离范围内(如300bp内),则判断该位置为正确的双端测序位置,且两端序列可以拼接为一段较长且更置信的序列。若双端序列Unique Mapping结果不在正反链上,或Unique Mapping位置较远(如大于1000bp),则不认为该位置是准确的双端测序位置。这时,分别在双端序列Unique Mapping位置的前后300bp范围内细致比对另一端Reads,若另一端Reads可以找到相应位置,则认为该位置为准确的双端测序位置。若双端序列Unique位置上均找不到另一端Reads可以匹配的位置,则舍弃该双端序列。
对于类2,在Unique Mapping的位置前后300bp位置范围内细致比对另一端Reads,若另一端Reads可以找到相应位置,则认为该Unique Mapping位置为正确的双端测序位置。反之舍弃该双端序列。
对于类3,若双端序列均mapping但不Unique到基因组上,则按照类1处理;若双端序列有且仅有一端mapping但不Unique到基因组上,则按照类2处理;若双端均不mapping到基因组上,则舍弃该双端序列。
以上的“细致比对”指使用更加精细的局部比对算法,如Smith-Waterman算法。“另一条Reads可以找到相应位置”指Smith-Waterman比对结果中局部最优序列长度大于预设阈值且错误率低于预设阈值则认为找到相应位置。
2.2.4:输出最终Mapping结果
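For the results obtained in 2.2.3, the sequences in Sam1 and Sam2 confirmed as paired-end positions are merged and written to a single Sam file. They are merged as follows. If Reads1 and Reads2 overlap, the overlapping region is merged and the reads are stitched into one longer sequence, using the stitching strategy described below. If Reads1 and Reads2 do not overlap, N characters mark the missing middle, where the number of N equals the number of bases separating the two reads. If no correct paired-end position is found for the reads in Sam1 and Sam2, the reads in Sam1 or Sam2 that map to the genome (including uniquely mapped reads) are output.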
拼接策略:将两条对应Reads相互配准,得到共同的一致性序列部分。其中两条序列配准使用Smith-Waterman算法,一致性序列指配准后通过在序列中增加、删除或修改部分Base,得到的局部最佳匹配序列。得到一致性序列后,根据构建的矫正模型逐个判断一致性序列中不一致的Base位置。根据该Base位置前后的碱基类型计算该位置出现Deletion或Insertion的概率。若Deletion的概率大于50%,则认为该位置所测Base不应该出现,从而删除该位置Base。反之,保留该位置上的Base。
本实施例中校正模型的过程包括:
1)使用python语言,提取获得的Reads1和Reads2序列中同一坐标两次测序读长均≥列中同一坐标的Reads,分别输出为T1(Read1)和T2(Read2)两个文件。其中同一坐标的对应方法是在生成Reads文件时将同一坐标Reads在不同文件中的Reads ID设置为一致;
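2) Align the position-matched Reads in T1 and T2 against each other, and mark in the alignment result the bases on which the two Reads agree and disagree, to obtain the Common Reads. Position matching is established by checking whether the Reads IDs of the two Reads are identical;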
3)分别将文件T1和T2和Reference做Mapping,得到Sam1和Sam2文件。将Sam1和Sam2中位置对应且mapping到同一位置的Reads,找到Reference中最长公共子串RefReads。公共子串指两条对应的Reads mapping后均覆盖的区域;
4)比较步骤2)中的Common Reads和步骤3)中的RefReads。对于Common Reads中不一致的Base,标记其是否真实存在于Reference中。若存在,对于没有测到的Reads则为Deletion。若不存在,对于测到的Reads则为Insertion;
5)统计步骤4)中的Deletion和Insertion情况,同时统计该不一致位置上前后Base的种类。因此得到在不同Base类型前或后引起Insertion或Deletion的概率。
具体地,本实例中运用的朴素贝叶斯模型如下:
Figure PCTCN2022125967-appb-000006
Figure PCTCN2022125967-appb-000007
其中:P(D|XY)表示对于某碱基在前后分别为X和Y碱基时发生Deletion的概率,X,Y∈[A,C,G,T]。P(D)表示对于某碱基发生Deletion的概率;P(I)表示对于某碱基发生Insertion的概率。
通过统计不同碱基下发生Deletion或Insertion时,前后碱基出现频率即可得到P(XY|D)和P(XY|I),从而可以计算得到P(D|XY)和P(I|XY)。
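The posterior can be computed directly from the counted statistics. The two formula images are not reproduced in this text, so the sketch below assumes the standard Bayes form, P(D|XY) = P(XY|D)P(D) / (P(XY|D)P(D) + P(XY|I)P(I)), and further assumes that every disagreeing position is either a Deletion or an Insertion. Both assumptions, and all names in the code, are illustrative and do not reproduce the original correction model.

```python
from collections import Counter
from typing import Tuple

Context = Tuple[str, str]  # (X, Y): the bases before and after the position


def deletion_posterior(del_ctx: Counter, ins_ctx: Counter, context: Context) -> float:
    """P(D | XY) from counts of flanking contexts seen with each event type."""
    n_del, n_ins = sum(del_ctx.values()), sum(ins_ctx.values())
    if n_del + n_ins == 0:
        raise ValueError("no training events")
    p_d = n_del / (n_del + n_ins)  # prior P(D)
    p_i = 1.0 - p_d                # prior P(I)
    like_d = del_ctx[context] / n_del if n_del else 0.0  # P(XY | D)
    like_i = ins_ctx[context] / n_ins if n_ins else 0.0  # P(XY | I)
    evidence = like_d * p_d + like_i * p_i
    if evidence == 0:
        return p_d  # context never observed: fall back to the prior
    return like_d * p_d / evidence


# Hypothetical counts, for illustration only.
deletions = Counter({("A", "C"): 30, ("G", "T"): 10})
insertions = Counter({("A", "C"): 10, ("G", "T"): 50})
p = deletion_posterior(deletions, insertions, ("A", "C"))
print(round(p, 2))  # 0.75
```

Under the decision rule stated in the stitching strategy, a posterior above 0.5 means the measured base at that position is removed; otherwise it is kept.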
实施例3 Jumping测序
将实施例1获得的带有杂交文库的芯片置于测序仪中进行测序。测序步骤如下,测序流程示意图如图9所示:
3.1测序方法
3.1.1 Read1测序
利用测序平台进行80个循环的测序,测序过程中采用四种核苷酸带有两种不同的荧光信号,每轮反应加入两种标记不同荧光信号的核苷酸进行信号检测的方式进行测序。
3.1.2部分延伸
对初始模板互补链进行部分延伸的步骤包括:
1)将芯片升温至55℃±0.5℃
2)以1250μL/min的速度向Read1测序后的通道中通入440μL体积的延伸试剂2,反应2分钟。延伸试剂2的组分为:50mmol/L的三羟甲基氨基甲烷,50mmol/L的氯化钠,1mmol/L的乙二胺四乙酸,3mmol/L的硫酸镁,60mmol/L的硫酸铵,0.05%的吐温20,5%的二甲基亚砜,0.02mg/ml 9°N DNA聚合酶(NEB公司,货号M0260),5μmol/L的Cold-dNTPs(末端封闭核苷酸)(Cold-dATP、Cold-dTTP、Cold-dCTP、Cold-dGTP各5μmol/L的混合物),pH值9.0。
3)向测序通道泵入220μL体积的清洗液1,去除延伸试剂2。
4)向测序通道泵入400μL体积的切除试剂1,切除试剂1的组分为:75mmol/L的三羟甲基氨基甲烷,1M的氯化钠,0.05%的吐温20,10mmol/L的三(3-羟基丙基),pH=9.0。
5)将芯片升温至60℃±0.5℃,反应2分钟。
6)向测序通道泵入220μL体积的清洗液1,去除切除试剂1。
7)向测序通道泵入440μL体积的清洗液2,去除清洗液1。
8)重复步骤1)至步骤7)10至20个循环,完成对初始模板互补链的部分延伸。
3.1.3 Read2测序
Sequencing is carried out in the same way as Read1 sequencing in section 3.1.1 of this example to obtain the Read2 sequencing result.
测序结果:利用该测序方法,获得有效的测序数据Read1、Read2用于测序分析。
3.2测序结果分析
3.2.1:获取两段序列
同实施例2中2.2.1的步骤。
3.2.2:序列mapping
同实施例2中2.2.2的步骤。
3.2.3:分类处理两端序列
同实施例2中2.2.3的步骤。
判断是否是双端位置的标准由“双端序列mapping结果分别在正反链上”改为“两段序列均在同一方向链上”。
3.2.4:输出最终Mapping结果
同实施例2中2.2.4的步骤。
实施例4overlap测序
将实施例1获得的带有杂交文库的芯片置于Genocare单分子测序仪中进行测序。测序步骤如下,测序流程示意图如图10所示。
4.1测序方法
4.1.1初始模板的互补链合成
初始模板互补链合成的具体步骤如下:
1)向芯片测序通道泵入750μL体积的延伸试剂,其中,延伸试剂组分为:120U/ml Bst DNA聚合酶(NEB,#M0275M),0.2mmol/L dNTP(dATP、dTTP、dCTP、dGTP各0.2μmol/L的混合物),1M甜菜碱,20mmol/L的三羟甲基氨基甲烷,10mmol/L的氯化钠,10mmol/L的氯化钾,10mmol/L的硫酸铵,3mmol/L的氯化镁,0.1%的Triton X-100,pH值为8.3;
2)将芯片升温至60±0.5℃,反应10分钟;
3)向芯片测序通道泵入220μL体积的清洗液1,去除延伸试剂;
4)向芯片测序通道泵入440μL体积的清洗液2,去除清洗液1,完成初始模板互补链的合成。
4.1.2去除初始模板
通过加入变性试剂去除初始模板,具体步骤如下:
1)将芯片降温至55±0.5℃
2) Pass 800 μL of formamide into the chip sequencing channel and denature for 2 minutes;
3)通入220μL体积的清洗液1,去除变性后的初始模板;
4)重复步骤2)和步骤3)一次,完成对初始模板的去除。
4.1.3 3’OH封闭
利用封闭试剂封闭芯片表面核酸链的3’OH,具体过程如下:
1)将芯片降温至37±0.5℃;
2)向芯片测序通道中通入440μL体积的清洗液2,去除清洗液1;
3)通入750μL体积的封闭试剂2,反应15分钟。其中,封闭试剂2的组分为:100U/ml Terminal Transferase(NEB,M0315L),1×Terminal Transferase Buffer,0.25mmol/L氯化钴,100μmol/L ddNTP mix(ddATP、ddTTP、ddCTP、ddGTP各100μmol/L的混合物);
4)通入220μL体积的清洗液1,完成对芯片表面核酸链3’OH的封闭。
4.1.4杂交测序引物D7S1T-R2P
1)将芯片升温至55±0.5℃,保持1分钟;
2)向测序通道中通入800μL体积的稀释的测序引物杂交液,杂交反应30分钟。稀释的测序引物杂交液为含有0.1μmol/L引物D7S1T-R2P的清洗液3,清洗液3组分包括:450mmol/L的氯化钠,45mmol/L的柠檬酸钠;
3)将芯片降温至37±0.5℃,保持90秒;
4)向测序通道中通入220μL体积的清洗液1,去除通道中未被杂交的测序引物;
5)向测序通道中通入440μL体积的清洗液2,去除清洗液1,完成测序引物的杂交。
4.1.5 Read1测序
利用Genocare单分子测序平台进行80个循环的测序,测序过程中采用四种核苷酸带有两种不同的荧光信号,每轮反应加入两种标记不同荧光信号的核苷酸进行信号检测的方式进行测序。
4.1.6变性去除新生测序链
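The nascent sequencing strand is removed by adding a denaturing reagent, as follows:
1) Cool the chip to 55±0.5°C;
2) Pass 800 μL of formamide into the chip sequencing channel and denature for 2 minutes;
3) Pass in 220 μL of wash solution 1 to remove the denatured nascent sequencing strand;
4) Repeat step 2) and step 3) once to complete removal of the nascent sequencing strand.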
4.1.7封闭残余新生链的3’OH
残余新生链的3’OH封闭过程如下:
1)将芯片降温至37±0.5℃,维持90秒;
2)向测序通道中泵入750μL体积的封闭试剂1,反应10分钟。封闭试剂1的组分为:100U/ml Klenow DNA聚合酶大片段(3′→5′exo-,NEB,#M0212M)12.5μmol/L ddNTP mix(ddATP、ddTTP、ddCTP、ddGTP各12.5μmol/L的混合物),5mmol/L的氯化锰,20mmol/L的三羟甲基氨基甲烷,10mmol/L的氯化钠,10mmol/L的氯化钾,10mmol/L的硫酸铵,3mmol/L的氯化镁,0.1%的Triton X-100,pH值为8.3;
3)向测序通道中通入220μL体积的清洗液1,去除封闭反应后剩余的封闭液,完成对延伸不完全的新生链的3’OH的封闭。
4.1.8杂交测序引物D7S1T-R2P
The sequencing primer is hybridized in the same way as in section 4.1.4 of this example.
4.1.9部分延伸
部分延伸的步骤包括:
1)将芯片升温至55℃±0.5℃
2)以1250μL/min的速度向Read1测序后的通道中通入440μL体积的延伸试剂2,反应2分钟。延伸试剂2的组分为:50mmol/L的三羟甲基氨基甲烷,50mmol/L的氯化钠,1mmol/L的乙二胺四乙酸,3mmol/L的硫酸镁,60mmol/L的硫酸铵,0.05%的吐温20,5%的二甲基亚砜,0.02mg/ml 9°N DNA聚合酶(NEB公司,货号M0260),5μmol/L的Cold-dNTPs(Cold-dATP、Cold-dTTP、Cold-dCTP、Cold-dGTP各5μmol/L的混合物),pH值9.0。
3)向测序通道泵入220μL体积的清洗液1,去除延伸试剂2。
4)向测序通道泵入400μL体积的切除试剂1,切除试剂1的组分为:75mmol/L的三羟甲基氨基甲烷,1M的氯化钠,0.05%的吐温20,10mmol/L的三(3-羟基丙基),pH=9.0。
5)将芯片升温至60℃±0.5℃,反应2分钟。
6)向测序通道泵入220μL体积的清洗液1,去除切除试剂1。
7)向测序通道泵入440μL体积的清洗液2,去除清洗液1。
8)重复步骤1)至步骤7)10至20个循环,完成对初始模板互补链的部分延伸。
4.1.10 Read2测序
采用与本实施例4.1.5中Read 1测序相同的方式进行测序,获得Read 2测序结果。
测序结果:利用该测序方法,获得有效的测序数据Read 1、Read 2用于测序分析。
4.2测序结果分析
4.2.1:获取测序序列
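As in the first analysis step of the preceding examples, the corresponding sequencing sequence file Fa can be obtained with the BaseCalling algorithm supplied with GenoCare. This example can stitch the sequences from N rounds of overlap sequencing. For ease of description, however, the results are treated here as coming from 2 sequencing rounds, giving two sequence files, Fa1 and Fa2.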
4.2.2:两段序列拼接
通过实验过程中的参数设置可以将overlap的平均长度控制在5-10bp,但是也不保证肯定有overlap的情况。在拼接过程中,使用局部比对算法(如Smith-Waterman)可以找到两段序列中局部最相似的区域。在比对的结果中若相似区域长度小于预设阈值(如5bp)或相似区域的错误率大于预设阈值,则认为该拼接结果不置信。
排除上述两种情况,可以通过相似区域将两段序列进行拼接。拼接过程中对于相似区域中不一致Base的取舍,具体操作如下:将两条对应Reads相互配准,得到共同的一致性序列部分。其中两条序列配准使用Smith-Waterman算法,一致性序列指配准后通过在序列中增加、删除或修改部分Base,得到的局部最佳匹配序列。得到一致性序列后,根据构建的矫正模型(详见2.2.4中的校正模型),逐个判断一致性序列中不一致的Base位置。根据该Base位置前后的碱基类型计算该位置出现Deletion或Insertion的概率。若Deletion的概率大于50%,则认为该位置所测Base不应该出现,从而删除该位置Base。反之,保留该位置上的Base。
4.2.3:输出拼接后序列
通过步骤二得到的拼接结果,将其整合输出到同一个Fa文件中。对于步骤二中判断“不置信”的拼接,则输出Reads1和Reads2中长度较长的Reads到最终Fa文件中。
如步骤一中提到,若有多次overlap测序,则将两两拼接得到的Reads设为Reads1,再重复步骤二操作和下一段序列拼接,通过迭代则可得到更长读长Reads,输出到最终的Fa文件中。
实施例5双端Jumping测序
将实施例1中带有杂交文库的芯片置于Genocare单分子测序仪中进行测序。测序步骤如下,测序流程示意图如图11所示。
5.1测序方法
5.1.1 Read1.1测序
利用双色单分子测序平台进行80个循环的测序,测序过程中采用四种核苷酸带有两种不同的荧光信号,每轮反应加入两种标记不同荧光信号的核苷酸进行信号检测的方式进行测序。
5.1.2部分延伸1.1
部分延伸的步骤包括:
1)将芯片升温至55℃±0.5℃
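2) Pass 440 μL of extension reagent 2 into the channel after Read1 sequencing at 1250 μL/min and react for 2 minutes. Extension reagent 2 consists of: 50 mmol/L tris(hydroxymethyl)aminomethane, 50 mmol/L sodium chloride, 1 mmol/L ethylenediaminetetraacetic acid, 3 mmol/L magnesium sulfate, 60 mmol/L ammonium sulfate, 0.05% Tween 20, 5% dimethyl sulfoxide, 0.02 mg/ml 9°N DNA polymerase (NEB, catalog no. M0260), and 5 μmol/L Cold-dNTPs (a mixture of Cold-dATP, Cold-dTTP, Cold-dCTP, and Cold-dGTP at 5 μmol/L each), pH 9.0.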
3)向测序通道泵入220μL体积的清洗液1,去除延伸试剂2。
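4) Pump 400 μL of cleavage reagent 1 into the sequencing channel. Cleavage reagent 1 consists of: 75 mmol/L tris(hydroxymethyl)aminomethane, 1 M sodium chloride, 0.05% Tween 20, and 10 mmol/L tris(3-hydroxypropyl), pH 9.0.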
5)将芯片升温至60℃±0.5℃,反应2分钟。
6)向测序通道泵入220μL体积的清洗液1,去除切除试剂1。
7)向测序通道泵入440μL体积的清洗液2,去除清洗液1。
8)重复步骤1)至步骤7)10至20个循环,完成对初始模板互补链的部分延伸。
5.1.3重复步骤5.1.1和步骤5.1.2若干次
根据初始模板长度设定重复次数。
5.1.4去除初始模板
通过加入变性试剂去除初始模板,具体步骤如下:
1)将芯片降温至55±0.5℃
2) Pass 800 μL of formamide into the chip sequencing channel and denature for 2 minutes;
3)通入220μL体积的清洗液1,去除变性后的初始模板;
4)重复步骤2)和步骤3)一次,完成对初始模板的去除。
5.1.5 3’OH封闭
利用封闭试剂封闭芯片表面核酸链的3’OH,具体过程如下:
1)将芯片降温至37±0.5℃;
2)向芯片测序通道中通入440μL体积的清洗液2,去除清洗液1;
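3) Pass in 750 μL of blocking reagent 2 and react for 15 minutes. Blocking reagent 2 consists of: 100 U/ml Terminal Transferase (NEB, M0315L), 1× Terminal Transferase Buffer, 0.25 mmol/L cobalt chloride, and 100 μmol/L ddNTP mix (a mixture of ddATP, ddTTP, ddCTP, and ddGTP at 100 μmol/L each);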
4)通入220μL体积的清洗液1,完成对芯片表面核酸链3’OH的封闭。
5.1.6杂交测序引物D7S1T-R2P
1)将芯片升温至55±0.5℃,保持1分钟;
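2) Pass 800 μL of diluted sequencing primer hybridization solution into the sequencing channel and hybridize for 30 minutes. The diluted sequencing primer hybridization solution is wash solution 3 containing 0.1 μmol/L primer D7S1T-R2P; wash solution 3 comprises 450 mmol/L sodium chloride and 45 mmol/L sodium citrate;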
3)将芯片降温至37±0.5℃,保持90秒;
4)向测序通道中通入220μL体积的清洗液1,去除通道中未被杂交的测序引物;
5)向测序通道中通入440μL体积的清洗液2,去除清洗液1,完成测序引物的杂交。
5.1.7 Read2的若干读段的测序
The sequencing steps are the same as sections 5.1.1 to 5.1.3 of this example.
测序结果:利用该测序方法,获得有效的测序数据Read1、Read2用于测序分析。
5.2测序结果分析
5.2.1获取测序序列
同4.2.1步骤,得到双端测序N个测序片段。对于同一位置上双端测序的不同测序片段分别表示为Reads1,1、Reads1,2、…、Reads1,N,Reads2,1、Reads2,2、…、Reads2,N。
5.2.2:序列拼接
对于双端Jumping测序得到的Reads拼接需要在实验设计中保证双端交错的序列片段有overlap区域。在Reads拼接中使用双端交错的序列,如Reads1,N-1和Reads2,1、Reads2,2进行拼接。在拼接开始前需要将Reads2的序列换成反向互补序列。具体的拼接方法同4.2.2步骤。
5.2.3:输出拼接后序列
同4.2.3步骤,将5.2.2步骤中拼接完成的序列输出到最终的Fa文件中。
在本说明书的描述中,参考术语“一个实施例”、“一些实施例”、“示例”、“具体示例”、或“一些示例”等的描述意指结合该实施例或示例描述的具体特征、结构、材料或者特点包含于本公开的至少一个实施例或示例中。在本说明书中,对上述术语的示意性表述不必须针对的是相同的实施例或示例。而且,描述的具体特征、结构、材料或者特点可以在任一个或多个实施例或示例中以合适的方式结合。此外,在不相互矛盾的情况下,本领域的技术人员可以将本说明书中描述的不同实施例或示例以及不同实施例或示例的特征进行结合和组合。
尽管上面已经示出和描述了本公开的实施例,可以理解的是,上述实施例是示例性的,不能理解为对本公开的限制,本领域的普通技术人员在本公开的范围内可以对上述实施例进行变化、修改、替换和变型。

Claims (45)

  1. A sequencing method, comprising:
    提供核酸模板,所述核酸模板直接或者间接连接在固相载体的表面;
    利用第一核苷酸进行合成测序反应,以测定所述核酸模板的一部分,获得读段,所述第一核苷酸为带有可检测标记的可逆终止子;
    利用第二核苷酸进行聚合反应,以合成所述核酸模板的一部分,获得预设长度的合成片段,所述第二核苷酸为不带有可检测标记的可逆终止子,所述读段和所述合成片段对应所述核酸模板上有重叠或者没有重叠的连续的部分。
  2. 根据权利要求1所述的测序方法,其中,所述读段的长度不短于所述合成片段的长度;
    任选地,所述合成片段的长度大于或等于1bp;
    任选地,所述合成片段的长度大于或等于10bp;
    任选地,所述合成片段的长度大于或等于10bp并且小于或等于20bp。
  3. 根据权利要求1或2所述的测序方法,其中,所述核酸模板的长度小于或等于600bp;
    任选地,所述核酸模板大于或等于75bp且小于或等于400bp。
  4. 根据权利要求1-3中任一项所述的测序方法,其中,所述第一核苷酸和/或所述第二核苷酸的糖的3'-OH被可逆阻断;
    任选地,所述第一核苷酸和/或所述第二核苷酸的糖的3'-OH为天然状态,并且所述第一核苷酸和/或所述第二核苷酸的碱基连接有可切割的阻断基团;
    任选地,所述可检测标记为荧光分子。
  5. 根据权利要求1-4中任一项所述的测序方法,在DNA聚合酶的作用下进行所述合成测序反应和/或所述聚合反应,所述DNA聚合酶选自Klenow片段、Bst、9°N、Pfu、KOD和Vent中的至少一种;
    任选地,在相同DNA聚合酶的作用下进行所述合成测序反应和所述聚合反应,所述DNA聚合酶为Klenow片段突变体;
    任选地,在相同DNA聚合酶的作用下进行所述合成测序反应和所述聚合反应,所述DNA聚合酶为9°N突变体。
  6. 根据权利要求1-5中任一项所述的测序方法,其中,所述读段为第一读段,所述方法包括:
    i)使所述核酸模板与第一引物杂交,所述第一引物的至少一部分与所述核酸模板的3'端互补,所述第一引物共价连接在所述固相载体的表面上;
    ii)利用所述第一核苷酸进行所述合成测序反应,包括延伸所述第一引物合成所述核酸模板的互补链以测定所述核酸模板的第一部分,获得所述第一读段,定义所述核酸模板的互补链为第一模板;
    iii)利用所述第二核苷酸进行所述聚合反应,包括继续延伸所述第一模板,获得所述合成片段;以及
    iv)利用所述第一核苷酸进行所述合成测序反应,包括继续延伸所述第一模板以测定所述核酸模板的第二部分,获得第二读段,
    所述第一读段、所述合成片段和所述第二读段对应所述核酸模板上三个没有重叠的连续的部分。
  7. 根据权利要求1-5中任一项所述的测序方法,其中,所述读段为第一读段,所述方法包括:
    i)加入第一引物并使所述核酸模板与所述第一引物杂交,所述第一引物的至少一部分与所述核酸模板的3'端互补,所述核酸模板共价连接在所述固相载体的表面上;
    ii)利用所述第一核苷酸进行所述合成测序反应,包括延伸所述第一引物合成所述核酸模板的互补链以测定所述核酸模板的第一部分,获得所述第一读段,定义所述核酸模板的互补链为第一模板;
    iii)利用所述第二核苷酸进行所述聚合反应,包括继续延伸所述第一模板,获得所述合成片段;以及
    iv)利用所述第一核苷酸进行所述合成测序反应,包括继续延伸所述第一模板以测定所述核酸模板的第二部分,获得第二读段,
    所述第一读段、所述合成片段和所述第二读段对应所述核酸模板上三个没有重叠的连续的部分。
  8. 根据权利要求6所述的测序方法,其中,所述合成片段为第一合成片段,所述方法还包括:
    v)去除所述核酸模板;
    vi)加入第二引物并使该第二引物结合到所述第一模板,利用所述第二核苷酸进行所述聚合反应,包括延伸所述第二引物合成所述第一模板的互补链,获得预设长度的第二合成片段,所述第二引物的至少一部分与所述第一模板的3'端互补,定义所述第一模板的互补链为第二模板;以及
    vii)利用所述第一核苷酸进行所述合成测序反应,包括继续延伸所述第二模板以测定所述核酸模板的第三部分,获得第三读段,
    所述第二合成片段和所述第三读段对应所述核酸模板上两个连续的部分。
  9. 根据权利要求6-8中任一项所述的测序方法,其中,还包括:重复iii)和iv)至少一次。
  10. 根据权利要求9所述的测序方法,其中,还包括:重复vi)和vii)至少一次。
  11. 根据权利要求10所述的测序方法,其中,所述第一读段、第一合成片段、第二读段、第二合成片段和第三读段之间的长度关系能使所述核酸模板的非末端部分的任一个位置的核苷酸被至少测定一次。
  12. 根据权利要求6、8-11中任一项所述的测序方法,其中,还包括在iv)之后且v)之前,对所述固相载体表面上的至少一部分核酸分子进行封闭。
  13. 根据权利要求6、8-12中任一项所述的测序方法,其中,还包括在v)之后且vi)之前,对所述固相载体表面上的至少一部分核酸分子进行封闭。
  14. 根据权利要求12所述的测序方法,其中,在DNA聚合酶的作用下使延伸反应阻断剂结合到所述第一模板实现所述封闭,所述延伸反应阻断剂选择ddNTP及其衍生物中的至少一种。
  15. 根据权利要求13所述的测序方法,其中,在末端转移酶的作用下使延伸反应阻断剂结合到所述第一模板实现所述封闭,所述延伸反应阻断剂选择ddNTP及其衍生物中的至少一种。
  16. 根据权利要求1-5中任一项所述的测序方法,其中,所述读段为第一读段,所述合成片段为第一合成片段,所述方法包括:
    i)加入第一引物并使所述核酸模板与所述第一引物杂交,所述第一引物的至少一部分与所述核酸模板的3'端互补,所述核酸模板共价连接在所述固相载体的表面上;
    ii)利用所述第一核苷酸进行所述合成测序反应,包括延伸所述第一引物合成所述核酸模板的互补链以测定所述核酸模板的第一部分,获得所述第一读段,定义所述核酸模板的互补链为第一模板;
    iii)去除所述第一模板;
    iv)加入所述第一引物并使该第一引物结合到所述核酸模板,利用所述第二核苷酸进行所述聚合反应,包括延伸所述第一引物合成所述核酸模板的互补链,获得所述第一合成片段,所述第一合成片段的长度不长于所述第一读段的长度,定义所述核酸模板的互补链为第一模板;以及
    v)利用所述第一核苷酸进行所述合成测序反应,包括继续延伸所述第一模板以测定所述核酸模板的第二部分,获得第二读段。
  17. 根据权利要求16所述的测序方法,其中,还包括:重复iii)-v)至少一次,并且使每个重复中的第一合成片段的长度不短于上一个重复中的第一合成片段的长度且不长于上一个重复中的第一合成片段和第二读段的长度之和。
  18. 根据权利要求1-5中任一项所述的测序方法,其中,所述读段为第一读段,所述合成片段为第一合成片段,所述方法包括:
    i)加入第一引物并使所述核酸模板与所述第一引物杂交,所述第一引物的至少一部分与所述核酸模板的3'端互补,所述核酸模板共价连接在所述固相载体的表面上;
    ii)利用所述第二核苷酸进行所述聚合反应,包括延伸所述第一引物合成所述核酸模板的互补链,获得所述第一合成片段,定义所述核酸模板的互补链为第一模板;
    iii)利用所述第一核苷酸进行所述合成测序反应,包括继续延伸所述第一模板以测定所述核酸模板的第一部分,获得所述第一读段;
    iv)去除所述第一模板;以及
    v)加入所述第一引物并使该第一引物结合到所述核酸模板,利用所述第一核苷酸进行所述合成测序反应,包括延伸所述第一引物合成所述核酸模板的互补链以测定所述核酸模板的第二部分,获得第二读段,所述第二读段的长度不短于所述第一合成片段的长度。
  19. 根据权利要求1-5中任一项所述的测序方法,其中,所述读段为第一读段,所述方法包括:
    i)使所述核酸模板与第一引物杂交,所述第一引物的至少一部分与所述核酸模板的3'端互补,所述第一引物共价连接在所述固相载体的表面上;
    ii)利用所述第一核苷酸进行所述合成测序反应,包括延伸所述第一引物合成所述核酸模板的互补链以测定所述核酸模板的第一部分,获得所述第一读段,定义所述核酸模板的互补链为第一模板;
    iii)利用所述第二核苷酸进行所述聚合反应,包括继续延伸所述第一模板,获得所述合成片段;
    iv)去除所述核酸模板;
    v)加入第二引物并使该第二引物结合到所述第一模板,利用所述第一核苷酸进行所述合成测序反应,包括延伸所述第二引物合成所述第一模板的互补链以测定所述核酸模板的第二部分,获得第二读段,所述第二引物的至少一部分与所述第一模板的3'端互补。
  20. 根据权利要求7、16-18中任一项所述的测序方法,其中,通过使单链核酸分子与探针杂交,并基于聚合反应延伸所述探针获得所述核酸模板,所述探针共价连接在所述固相载体的表面上,所述单链核酸分子的3'端与所述探针互补。
  21. 根据权利要求16或17所述的测序方法,其中,还包括在ii)之后且iii)之前,对所述固相载体表面上的至少一部分核酸分子进行封闭。
  22. 根据权利要求18或19所述的测序方法,其中,还包括在iii)之后且iv)之前,对所述固相载体表面上的至少一部分核酸分子进行封闭。
  23. 根据权利要求12所述的测序方法,其中,在DNA聚合酶的作用下使延伸反应阻断剂结合到所述第一模板实现所述封闭,所述延伸反应阻断剂选择ddNTP及其衍生物中的至少一种。
  24. 根据权利要求16、17、21或23所述的测序方法,其中,还包括在iii)之后且iv)之前,对所述固相载体表面上的至少一部分核酸分子进行封闭;
    任选地,还包括在iv)之后且v)之前,对所述固相载体表面上的至少一部分核酸分子进行封闭。
  25. 根据权利要求24所述的测序方法,其中,在末端转移酶的作用下使延伸反应阻断剂结合到所述第一模板实现所述封闭,所述延伸反应阻断剂选择ddNTP及其衍生物中的至少一种。
  26. 根据权利要求8-15、16、22-25任一所述的测序方法,其中,通过加入变性试剂解离所述核酸模板与所述第一模板,以去除所述核酸模板;
    任选地,通过加入变性试剂解离所述第一模板与所述核酸模板,以去除所述第一模板;
    任选地,所述变性试剂包含甲酰胺。
  27. 一种测序数据处理方法,其中,所述测序数据包括多个读段组,所述读段组包括多个读段,所述多个读段是通过对同一插入片段进行多轮测序而获得的,所述方法包括针对每个所述读段组的所述多个读段进行下列处理:
    将所述多个读段与参考基因组进行全局比对,以便在所述参考基因组上确定与所述多个读段对应的多个匹配区域;和
    基于所述多个匹配区域之间的实际相对位置与预设位置要求的比较,对所述多个读段进行一次筛选,以便获得可拼接读段和过滤读段,
    其中,
    所述预设位置要求是由所述多轮测序的规则确定的,
    所述实际相对位置满足所述预设位置要求是所述读段作为所述可拼接读段的指示;和
    所述实际相对位置不满足所述预设位置要求是所述读段作为所述过滤读段的指示。
  28. 根据权利要求27所述的测序数据处理方法,其中,进一步包括:
    对于所述过滤读段进行二次筛选,所述二次筛选包括:
    将所述读段组的至少一个作为初步读段,并基于所述初步读段对应的所述匹配区域和所述预设位置要求确定所述参考基因组上的二次比对区域;和
    将所述过滤读段的每一个所述读段分别与所述二次比对区域进行局部比对,并将满足预定阈值的所述读段和所述初步读段归类为可拼接读段。
  29. 根据权利要求28所述的测序数据处理方法,其中,将所述读段组的每一个所述读段均作为初步读段,进行所述二次筛选。
  30. 根据权利要求27-29中任一项所述的测序数据处理方法,其中,进一步包括:
    对所述可拼接读段按照所述多轮测序的规则进行拼接。
  31. 根据权利要求27-30中任一项所述的测序数据处理方法,其中,所述多轮测序的规则包括选自下列的至少之一:双端测序、Jumping测序、Overlap测序、双端Jumping测序以及这些测序规则的组合。
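  32. The sequencing data processing method according to claim 31, wherein the rule of the multiple rounds of sequencing is paired-end sequencing, the read group comprises two reads, and the preset position requirement comprises: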
    两个所述读段的匹配区域分别位于所述参考基因组的正链和反链上;和
    两个所述读段的匹配区域在所述参考基因组上的距离不超过预定阈值,
    其中,所述预定阈值是基于插入片段的长度确定的。
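  33. The sequencing data processing method according to claim 31, wherein the rule of the multiple rounds of sequencing is Jumping sequencing, and the preset position requirement comprises: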
    多个所述读段的匹配区域位于所述参考基因组的相同链上;和
    多个所述读段的匹配区域中相邻两个所述读段在所述参考基因组上的距离不超过预定距离阈值,
    其中,所述预定阈值是基于部分延伸步骤的长度确定的,任选地,所述预定距离阈值不超过50bp,优选不超过20bp,进一步优选在5~20bp之间。
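  34. The sequencing data processing method according to claim 31, wherein the rule of the multiple rounds of sequencing is Overlap sequencing, and the preset position requirement comprises: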
    多个所述读段的匹配区域位于所述参考基因组的相同链上;和
    多个所述读段的匹配区域中相邻两个所述读段在所述参考基因组上的重叠区域长度在预定距离范围,
    其中,所述预定距离范围是基于测序过程中的重叠区域长度确定的,
    任选地,所述预定距离范围为5~10bp之间。
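  35. The sequencing data processing method according to claim 31, wherein the rule of the multiple rounds of sequencing is paired-end Jumping sequencing, and the preset position requirement comprises: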
    多个所述读段的匹配区域的一部分位于所述参考基因组的正链,另一部分位于所述参考基因组的反链上;和
    多个所述读段的匹配区域中相邻两个所述读段在所述参考基因组上的重叠区域长度在预定距离范围,
    其中,所述预定距离范围是基于测序过程中部分延伸步骤的长度确定的,
    任选地,所述预定距离阈值不超过50bp,优选不超过20bp,进一步优选在5~20bp之间。
  36. 根据权利要求33所述的测序数据处理方法,其中,所述Jumping测序包括:
    提供核酸模板,所述核酸模板直接或者间接连接在固相载体的表面;
    采用第一核苷酸和第二核苷酸,与所述核酸模板发生多轮延伸反应,
    其中,
    所述第一核苷酸为带有可检测标记的可逆终止子,并且用于通过所述延伸反应获得多个读段;
    所述第二核苷酸为不带有可检测标记的可逆终止子,并且用于通过所述延伸反应获得至少一个预设长度的合成片段。
  37. 根据权利要求36所述的测序数据处理方法,其中,所述Overlap测序包括:
    所述核酸模板直接或者间接连接在固相载体的表面;
    采用第一测序接头和第二测序接头与所述核酸模板发生多轮延伸反应,以便获得多个读段,
    其中,
    所述第一测序接头产生的第一读段与所述第二测序接头产生的第二读段存在至少一个碱基的重叠区域,
    可选的,
    所述第一测序接头采用所述第一核苷酸进行所述延伸反应,以便获得所述第一读段;
    所述第二测序接头产生首先采用第二核苷酸进行延伸反应,之后采用所述第一核苷酸进行多个所述延伸反应,以便获得所述第二读段。
  38. 根据权利要求36所述的测序数据处理方法,其中,所述双端Jumping测序包括:
    使所述核酸模板与第一引物杂交,所述第一引物的至少一部分与所述核酸模板的3'端互补,所述第一引物共价连接在所述固相载体的表面上;
    采用所述第一核苷酸和所述第二核苷酸,基于所述第一引物与所述核酸模板发生多轮延伸反应,并获得第一引物延伸链;
    去除所述核酸模板,并使第二引物与所述第一引物延伸链杂交;
    采用所述第一核苷酸和所述第二核苷酸,基于所述第二引物与所述第一引物延伸链发生多轮延伸反应;
    其中,
    所述第一核苷酸为带有可检测标记的可逆终止子,并且用于通过所述延伸反应获得多个读段;
    所述第二核苷酸为不带有可检测标记的可逆终止子,并且用于通过所述延伸反应获得至少一个预设长度的合成片段。
  39. 根据权利要求27-38任一项所述的测序数据处理方法,所述测序数据由权利要求1-26任一项所述测序方法测得。
  40. 一种测序数据处理设备,所述测序数据包括多个读段组,所述读段组包括多个读段,所述多个读段是通过对同一插入片段进行多轮测序而获得的,所述设备包括针对每个所述读段组的所述多个读段进行下列处理的多个模块:
    全局比对模块,用于将所述多个读段与参考基因组进行全局比对,以便在所述参考基因组上确定与所述多个读段对应的多个匹配区域;和
    一次筛选模块,用于基于所述多个匹配区域之间的实际相对位置与预设位置要求的比较,对所述多个读段进行一次筛选,以便获得可拼接读段和过滤读段,
    其中,
    所述预设位置要求是由所述多轮测序的规则确定的,
    所述实际相对位置满足所述预设位置要求是所述读段作为所述可拼接读段的指示;和
    所述实际相对位置不满足所述预设位置要求是所述读段作为所述过滤读段的指示。
  41. 根据权利要求40所述的测序数据处理设备,其中,进一步包括二次筛选模块,用于对于所述过滤读段进行二次筛选,所述二次筛选包括:
    将所述读段组的至少一个作为初步读段,并基于所述初步读段对应的所述匹配区域和所述预设位置要求确定所述参考基因组上的二次比对区域;和
    将所述过滤读段的每一个所述读段分别与所述二次比对区域进行局部比对,并将满足预定阈值的所述读段和所述初步读段归类为可拼接读段。
  42. 根据权利要求40或41所述的测序数据处理设备,其中,进一步包括:
    拼接模块,用于对所述可拼接读段按照所述多轮测序的规则进行拼接。
  43. 根据权利要求40-42任一项所述的测序数据处理设备,其中,所述多轮测序的规则包括选自下列的至少之一:双端测序、Jumping测序、Overlap测序、双端Jumping测序以及这些测序规则的组合。
  44. 一种计算设备,其中,包括:处理器和存储器;
    所述存储器,用于存储计算机程序;
    所述处理器,用于执行所述计算机程序以实现根据权利要求27-39中任一项所述的测序数据处理方法。
  45. 一种计算机可读存储介质,其中,所述存储介质包括计算机指令,当所述指令被计算机执行时,使得所述计算机实现根据权利要求27-39中任一项所述的测序数据处理方法。
PCT/CN2022/125967 2021-10-18 2022-10-18 测序方法、测序数据处理方法、设备和计算机设备 WO2023066255A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202280070809.4A CN118139990A (zh) 2021-10-18 2022-10-18 测序方法、测序数据处理方法、设备和计算机设备

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111209946.5 2021-10-18
CN202111209946 2021-10-18

Publications (1)

Publication Number Publication Date
WO2023066255A1 true WO2023066255A1 (zh) 2023-04-27

Family

ID=86057923

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/125967 WO2023066255A1 (zh) 2021-10-18 2022-10-18 测序方法、测序数据处理方法、设备和计算机设备

Country Status (2)

Country Link
CN (1) CN118139990A (zh)
WO (1) WO2023066255A1 (zh)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016022833A1 (en) * 2014-08-06 2016-02-11 Nugen Technologies, Inc. Digital measurements from targeted sequencing
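CN107075581A (zh) * 2014-08-06 2017-08-18 纽亘技术公司 Digital measurements from targeted sequencing
CN106156536A (zh) * 2015-04-15 2016-11-23 深圳华大基因科技有限公司 Method and system for processing sequencing data of a sample immune repertoire
CN112654714A (zh) * 2018-12-17 2021-04-13 伊卢米纳剑桥有限公司 Primer oligonucleotides for sequencing
CN113337576A (zh) * 2020-04-30 2021-09-03 深圳市真迈生物科技有限公司 Library preparation method, kit and sequencing method
CN113293205A (zh) * 2021-05-24 2021-08-24 深圳市真迈生物科技有限公司 Sequencing method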

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HU YU, FANG LI, CHEN XUELIAN, ZHONG JIANG F., LI MINGYAO, WANG KAI: "LIQA: long-read isoform quantification and analysis", GENOME BIOLOGY, vol. 22, no. 1, 1 December 2021 (2021-12-01), XP093057757, DOI: 10.1186/s13059-021-02399-8 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
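CN116403647A (zh) * 2023-06-08 2023-07-07 上海精翰生物科技有限公司 Bioinformatics detection method for detecting lentiviral integration sites and application thereof
CN116403647B (zh) * 2023-06-08 2023-08-15 上海精翰生物科技有限公司 Bioinformatics detection method for detecting lentiviral integration sites and application thereof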

Also Published As

Publication number Publication date
CN118139990A (zh) 2024-06-04

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22882860

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE