WO2021219114A1 - Sequencing method, analysis method and system thereof, computer-readable storage medium, and electronic device - Google Patents

Sequencing method, analysis method and system thereof, computer-readable storage medium, and electronic device

Info

Publication number
WO2021219114A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequencing
read
reads
read set
sequencing data
Prior art date
Application number
PCT/CN2021/091279
Other languages
English (en)
French (fr)
Inventor
樊济才
金欢
陈方
孙雷
Original Assignee
深圳市真迈生物科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202010865293.5A (CN113593636B)
Application filed by 深圳市真迈生物科技有限公司
Priority to US17/922,340 (US20230178183A1)
Priority to EP21797265.2A (EP4144745A4)
Publication of WO2021219114A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/20 Sequence assembly
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00 Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09 Recombinant DNA-technology
    • C12N15/10 Processes for the isolation, preparation or purification of DNA or RNA
    • C12N15/1034 Isolating an individual clone by screening libraries
    • C12N15/1068 Template (nucleic acid) mediated chemical library synthesis, e.g. chemical and enzymatical DNA-templated organic molecule synthesis, libraries prepared by non ribosomal polypeptide synthesis [NRPS], DNA/RNA-polymerase mediated polypeptide synthesis
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869 Methods for sequencing
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869 Methods for sequencing
    • C12Q1/6874 Methods for sequencing involving nucleic acid arrays, e.g. sequencing by hybridisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00 Computing arrangements based on specific mathematical models
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B35/00 ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
    • G16B35/10 Design of libraries
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis

Definitions

  • the present invention relates to the field of sequencing. Specifically, the present invention relates to a sequencing method, a sequencing result analysis method, a sequencing result analysis system, a computer-readable storage medium, and electronic equipment.
  • The single-pass sequencing error rate is very high, up to 30%.
  • The errors of the above-mentioned sequencing platform are mainly insertions and deletions (InDels) that occur randomly, so the sequencing error rate can be reduced by repeated reading.
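  • As a rough illustration of why repeated reading helps, assume (this is a simplifying assumption, not a statement from the source) that an error at a given site occurs independently in each pass with the same probability p, and take p = 0.30, the upper bound quoted above. The chance that both passes are wrong at the same site is then:

```latex
P(\text{both passes wrong at a site}) = p^{2} = 0.30^{2} = 0.09
```

  • Under this simplified model, at least one of the two passes is correct at a given site with probability 1 - 0.09 = 0.91, which is why comparing the two passes can expose most random errors.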
  • the present invention aims to solve one of the technical problems in the related art at least to a certain extent. For this reason, an object of the present invention is to provide an effective sequencing method.
  • the present invention provides a sequencing method.
  • The method includes: (1) performing first sequencing on a sequencing template on the chip surface, so as to obtain first sequencing data by forming a first nascent sequencing strand, where the sequencing template is connected to the surface of the chip through a sequencing adapter; (2) performing a first blocking treatment on at least a part of the 3' ends of the first nascent sequencing strand; and (3) performing second sequencing on the sequencing template, so as to obtain second sequencing data by forming a second nascent sequencing strand.
  • According to the embodiment of the present invention, performing two rounds of sequencing allows subsequent correction, which improves the accuracy of the sequencing results. In addition, after the first round of sequencing (the first sequencing), the 3' end of the nascent sequencing strand is blocked, which effectively avoids interference signals during the second round of sequencing (the second sequencing), so the accuracy of the sequencing results can be further improved.
  • the present invention provides a sequencing result analysis method.
  • The sequencing result includes first sequencing data and second sequencing data, both of which are composed of multiple reads. At least a part of the reads in the first sequencing data have corresponding reads in the second sequencing data, and the first sequencing data and the second sequencing data are obtained by the aforementioned method. The sequencing result analysis method includes: (a) performing mutual correction based on at least a part of each of the first sequencing data and the second sequencing data, so as to obtain final sequence information.
  • In another aspect, a sequencing result analysis method is provided in which the sequencing result includes first sequencing data and second sequencing data, both composed of multiple reads, and at least a part of the reads in the first sequencing data have corresponding reads in the second sequencing data. The sequencing result analysis method includes: (a) performing mutual correction based on at least a part of each of the first sequencing data and the second sequencing data, so as to obtain final sequence information.
  • The accuracy of the sequencing results can be improved by performing mutual correction on the results of the two rounds of sequencing. In addition, after the first round of sequencing (the first sequencing), blocking the 3' ends of the nascent sequencing strands remaining on the surface of the chip effectively avoids interference signals during the second round of sequencing, so the accuracy of the sequencing results can be further improved.
  • the present invention also provides a sequencing result analysis system.
  • The system includes: a sequencing device adapted to obtain a sequencing result by the aforementioned method, the sequencing result including first sequencing data and second sequencing data, both of which are composed of multiple reads, with at least a part of the reads in the first sequencing data having corresponding reads in the second sequencing data; and an analysis device adapted to perform mutual correction based on at least a part of each of the first sequencing data and the second sequencing data, so as to obtain final sequence information.
  • In another aspect, a sequencing result analysis system includes: a sequencing device configured to obtain a sequencing result, the sequencing result including first sequencing data and second sequencing data, both of which are composed of multiple reads, with at least a part of the reads in the first sequencing data having corresponding reads in the second sequencing data; and an analysis device adapted to perform mutual correction based on at least a part of each of the first sequencing data and the second sequencing data, so as to obtain final sequence information.
  • Any of the above systems can effectively implement the aforementioned sequencing result analysis method, so that the accuracy of the sequencing result can be improved by mutual correction of the results of multiple rounds of sequencing. In addition, blocking after the first round of sequencing (the first sequencing) avoids interference signals during the second round of sequencing, so the accuracy of the sequencing results can be further improved.
  • the present invention also provides a computer-readable storage medium on which a computer program is stored. According to an embodiment of the present invention, when the program is executed by a processor, the steps of the foregoing method are implemented.
  • the present invention also provides an electronic device, which includes: the aforementioned computer-readable storage medium; and one or more processors for executing programs in the computer-readable storage medium.
  • The present invention also provides a computer program product including instructions which, when executed by a computer, cause the computer to execute the sequencing method and/or the sequencing result analysis method in any of the above embodiments.
  • Fig. 1 is a schematic flowchart of a sequencing method according to an embodiment of the present invention.
  • Fig. 2 is a schematic flowchart of a sequencing method according to another embodiment of the present invention.
  • Fig. 3 is a schematic flow chart of a sequencing method according to another embodiment of the present invention.
  • Fig. 4 is a schematic flow chart of a sequencing result analysis method according to an embodiment of the present invention.
  • Fig. 5 is a schematic flow chart of a sequencing result analysis method according to another embodiment of the present invention.
  • Fig. 6 is a schematic flowchart of a sequencing result analysis method according to another embodiment of the present invention.
  • Fig. 7 is a schematic structural diagram of a sequencing result analysis system according to an embodiment of the present invention.
  • Fig. 8 is a schematic structural diagram of a sequencing result analysis system according to another embodiment of the present invention.
  • Fig. 9 is a schematic structural diagram of a sequencing result analysis system according to another embodiment of the present invention.
  • Fig. 10 is a schematic diagram of a sequencing method for obtaining Reads1 and Reads2 according to an embodiment of the present invention.
  • Fig. 11 is a schematic diagram of the construction of a sequencing library according to an embodiment of the present invention.
  • Fig. 12 is a schematic flowchart of an analysis method for obtaining Consensus Reads (consensus sequence/common sequence) according to an embodiment of the present invention.
  • The terms "first" and "second" in the present invention are used only for descriptive purposes and shall not be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features. Therefore, features defined with "first" and "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "plurality" means at least two, such as two or three, unless otherwise specifically defined.
  • Unless otherwise clearly specified and limited, terms such as "installed", "connected", and "fixed" should be understood in a broad sense. For example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct or indirect through an intermediary; and it may be an internal connection between two components or an interaction between two components.
  • the specific meanings of the above-mentioned terms in the present invention can be understood according to specific situations.
  • The present invention proposes a sequencing and analysis method that can reduce the noise and error rate of the sequences output by a sequencing platform (such as the GenoCare™ single-molecule sequencing platform).
  • the method includes:
  • the first sequencing is performed on the sequencing template on the surface of the chip to obtain first sequencing data by forming a first nascent sequencing chain, and the sequencing template is connected to the surface of the chip through a sequencing adapter.
  • The term "chip" refers to a solid substrate having a surface, such as a flat surface, to which the biomolecules to be measured are attached; it is also called a sequencing chip or a flow cell. This method is suitable for any sequencing platform that determines nucleic acid sequences based on chip detection. For example, the mainstream sequencing platforms on the market that use the sequencing-by-synthesis principle can apply this method. It is particularly suitable for single-molecule sequencing platforms, such as the GenoCare™ single-molecule sequencing platform.
  • a chip that can be used in a sequencing platform can also be obtained through the following steps:
  • S10a: Hybridize the library molecules in the sequencing library with the sequencing adapters on the surface of the chip;
  • S10c: Remove the initial template, and perform a second blocking treatment on the 3' ends of the nucleic acid molecules on the surface of the chip.
  • The so-called "library" is a pool (collection) of nucleic acid molecules containing multiple target fragments to be tested, where the target fragments are derived from the nucleic acid of the sample to be tested.
  • The target fragments are processed, for example by adding a known sequence such as an adapter (sequencing adapter) to one or both ends of each fragment, so that the library can be connected or fixed to the chip and is suitable for loading onto the sequencing platform for sequencing.
  • Before step S10c is performed, the method may further include S11b: performing a third blocking treatment on the 3' ends of the complementary strands that were not fully extended in step S10b.
  • S20: Perform a first blocking treatment on at least a part of the 3' ends of the first nascent sequencing strand. The first blocking treatment can effectively increase the amount of valid data and reduce the interference of invalid data with information analysis.
  • In one embodiment, step S20 includes removing the first nascent sequencing strand from the surface of the chip, and performing the first blocking treatment on the 3' ends of the first nascent sequencing strand remaining on the surface of the chip.
  • In another embodiment, step S20 includes performing the first blocking treatment on the 3' ends of the first nascent sequencing strand, and removing the blocked first nascent sequencing strand.
  • second sequencing is performed on the sequencing template, so as to obtain second sequencing data by forming a second nascent sequencing chain.
  • According to the embodiment of the present invention, performing two rounds of sequencing allows subsequent correction, which improves the accuracy of the sequencing results. In addition, after the first round of sequencing (the first sequencing), the 3' end of the nascent sequencing strand is blocked, which effectively avoids interference signals during the second round of sequencing (the second sequencing), so the accuracy of the sequencing results can be further improved.
  • the first blocking treatment, the second blocking treatment, and the third blocking treatment may be independently performed by connecting the 3'terminal hydroxyl group with an extension reaction blocker.
  • the blocking effect can be further improved, thereby further improving the accuracy of sequencing, and reducing undesired sequencing noise.
  • the extension reaction blocker is ddNTP or a derivative thereof.
  • the blocking effect can be further improved, thereby further improving the accuracy of sequencing, and reducing undesired sequencing noise.
  • the first blocking process, the second blocking process, and the third blocking process are independently performed using at least one of a DNA polymerase and a terminal transferase, respectively.
  • the blocking effect can be further improved, thereby further improving the accuracy of sequencing, and reducing undesired sequencing noise.
  • The first blocking treatment and the third blocking treatment each independently attach the ddNTP or its derivative using a polymerase, and the second blocking treatment attaches the ddNTP or its derivative using a terminal transferase.
  • the blocking effect can be further improved, thereby further improving the accuracy of sequencing, and reducing undesired sequencing noise.
  • The present invention proposes a sequencing result analysis method that can effectively analyze the two rounds of sequencing data generated by any of the foregoing sequencing methods, thereby further improving the accuracy of sequencing and avoiding sequencing errors.
  • The sequencing result includes first sequencing data and second sequencing data, both of which are composed of multiple reads. At least a part of the reads in the first sequencing data have corresponding reads in the second sequencing data, and the first sequencing data and the second sequencing data are obtained by the aforementioned method.
  • The sequencing result analysis method includes performing mutual correction, and the mutual correction includes the following steps:
  • High-quality reads are determined, where the length of each high-quality read is not less than a predetermined length and its sequencing quality is not lower than a predetermined quality threshold;
  • Each high-quality read is compared with its corresponding read, and the sequence information is corrected based on the comparison result.
  • The accuracy of the sequencing results can be improved by performing mutual correction on the results of the two rounds of sequencing. In addition, after the first round of sequencing (the first sequencing), blocking the 3' ends of the nascent sequencing strands remaining on the surface of the chip effectively avoids interference signals during the second round of sequencing, so the accuracy of the sequencing results can be further improved.
  • the mutual correction includes the following steps:
  • a first read set is constructed based on the first sequencing data, and the length of each read in the first read set is not less than a first predetermined length .
  • S200: Construct the second read set and the third read set.
  • A second read set and a third read set are constructed based on the first read set. The length of the corresponding read of each read in the second read set is not less than a second predetermined length, and the length of the corresponding read of each read in the third read set is within a predetermined length range.
  • S300: Construct the fourth read set and the fifth read set.
  • A fourth read set and a fifth read set are constructed based on the second read set and the corresponding reads.
  • The fourth read set and the fifth read set are determined according to the following principles:
  • The reads from the second read set are selected as elements of the fourth read set, and the corresponding reads are selected as elements of the fifth read set.
  • The fourth read set is filtered by sequencing quality to construct a sixth read set, where the sequencing quality of the reads in the sixth read set is not lower than a first predetermined quality threshold.
  • The sixth read set is used to select, from the fifth read set, the reads corresponding to the reads in the sixth read set, so as to construct a seventh read set.
  • S600: Compare the sixth read set with the seventh read set to determine the first difference site.
  • The reads of the sixth read set are compared with the reads of the seventh read set, and a first difference site is determined on the reads of the sixth read set.
  • A predetermined sequencing error prediction model is used to correct the first difference site, so as to determine the first sequence information.
  • The sequencing error prediction model is used to determine the probability that an insertion or deletion occurred at the difference site during sequencing.
  • the first sequencing information may further include:
  • The third read set is filtered by sequencing quality to construct an eighth read set, where the sequencing quality of the reads in the eighth read set is not lower than a second predetermined quality threshold.
  • The eighth read set is used to select, from the second sequencing data, the reads corresponding to the reads in the eighth read set, so as to construct a ninth read set.
  • S600a: Compare the eighth read set with the ninth read set to determine the second difference site.
  • S700a: Correct the second difference site to determine the second sequence information.
  • The second difference site is corrected by using the sequencing error prediction model, so as to determine the second sequence information.
  • the sequencing error prediction model is obtained by training a naive Bayes model based on the comparison result of the first sequencing data and the second sequencing data with a reference genome.
  • If the read from the sixth read set has a base at the difference site, the corresponding read from the seventh read set does not have a base at the difference site, and the probability of a deletion at the difference site is more than 50%, the base of the read from the sixth read set at the difference site is retained as the final sequencing result.
  • If the corresponding read from the seventh read set has a base at the difference site, and the probability of an insertion at the difference site is more than 50%, the sequence of the read from the sixth read set at the difference site is retained as the final sequencing result.
  • The first predetermined length and the second predetermined length are each independently not less than 20 bp, preferably not less than 25 bp, and the predetermined length range is 10 to 25 bp; the first predetermined quality threshold and the second predetermined quality threshold are each independently not lower than 50, preferably not lower than 60.
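  • The decision rule for a difference site where only one of the two reads carries a base can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name, the use of None for "no base", and the fallback of taking the partner's base when an insertion is unlikely are assumptions made here for the example.

```python
from typing import Optional


def resolve_indel_site(base_in_primary: Optional[str],
                       base_in_partner: Optional[str],
                       p_deletion: float,
                       p_insertion: float) -> Optional[str]:
    """Return the base to keep at an indel-type difference site (None = no base).

    base_in_primary: base in the higher-quality read (sixth read set), or None.
    base_in_partner: base in the corresponding read (seventh read set), or None.
    p_deletion / p_insertion: probabilities from an error prediction model
    (how they are computed is outside this sketch).
    """
    if (base_in_primary is None) == (base_in_partner is None):
        raise ValueError("exactly one of the two reads must carry a base here")

    if base_in_primary is not None:
        # The partner lacks the base. If a deletion is likely (> 50%), the
        # primary base is kept; otherwise it is discarded.
        return base_in_primary if p_deletion > 0.5 else None

    # The partner carries a base that the primary read lacks. If an insertion
    # is likely (> 50%), the primary state (no base) is kept. Taking the
    # partner's base otherwise is an assumption of this sketch.
    return None if p_insertion > 0.5 else base_in_partner
```

  • For example, `resolve_indel_site("A", None, 0.8, 0.1)` keeps the base "A", while `resolve_indel_site(None, "G", 0.1, 0.9)` keeps the gap.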
  • One object of the present invention is to provide a sequencing method that uses the adapter D7-S1-T/D9-S2 and the sequencing primer D7S1T-R2P to perform Two-Pass sequencing on a sequencing platform, such as the GenoCare™ single-molecule sequencing platform, so as to obtain Reads1 and Reads2.
  • a further object of the present invention is to provide an analysis method for obtaining Consensus Reads by analyzing Reads1 and Reads2 obtained by the above-mentioned Two-Pass sequencing method.
  • This analysis method can significantly reduce the noise sequence and base error rate in the output Consensus Reads (consensus sequence/common sequence).
  • An adapter and a sequencing primer for constructing a Two-Pass sequencing library are provided. The adapter is obtained by annealing the oligonucleotide strands D7-S1-T and D9-S2 modified with a 5'-end phosphate group, and the sequencing primer is D7S1T-R2P.
  • the sequence of the D7-S1-T is SEQ ID NO: 1
  • the sequence of the D9-S2 is SEQ ID NO: 2
  • the sequence of the D7S1T-R2P is SEQ ID NO: 3.
  • the present invention provides a method for obtaining Reads1 and Reads2 by Two-Pass sequencing using the above-mentioned adaptor and sequencing primer, which includes:
  • Step 1: Construct the Two-Pass sequencing library. Use a library preparation kit (Universal DNA Library Prep Kit for Illumina V2 (ND606-01)) to ligate the annealed D7-S1-T/D9-S2 adapter to the prepared fragmented human gDNA. After ligation, PCR amplification is not required; the product is purified directly with a purification kit (VAHTS DNA Clean Beads (N411-01)) to obtain the target library;
  • Step 2: Hybridize the library obtained in Step 1 with the adapters on the surface of the sequencing chip;
  • Step 3: Synthesize the complementary strand using the initial template hybridized to the chip surface in Step 2;
  • Step 4: Block the 3' ends of the incompletely extended nascent strands from Step 3 to reduce their interference with the sequencing process;
  • Step 5: Denature and remove the initial template hybridized to the chip surface in Step 2;
  • Step 6: Block the 3' ends of the residual adapters on the chip surface to reduce their interference with the sequencing process;
  • Step 7: Hybridize the sequencing primer D7S1T-R2P to the complementary strand synthesized in Step 3, which serves as the template;
  • Step 8: Perform Read1 sequencing, using the complementary strand synthesized in Step 3 as the template and the sequencing primer D7S1T-R2P hybridized in Step 7 as the primer;
  • Step 9: Denature and remove the nascent sequencing strand formed in Step 8;
  • Step 10: Block the 3' ends of any nascent sequencing strand from Step 8 that remains after Step 9, to prevent it from continuing to extend during Read2 sequencing;
  • Step 11: Hybridize the sequencing primer D7S1T-R2P to the complementary strand synthesized in Step 3, which serves as the template;
  • Step 12: Perform Read2 sequencing, using the complementary strand synthesized in Step 3 as the template and the sequencing primer D7S1T-R2P hybridized in Step 11 as the primer;
  • Step 13: Split the sequencing data obtained in Step 8 and Step 12 to obtain two sets of sequences, Reads1 and Reads2, whose coordinates correspond one-to-one.
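  • The one-to-one pairing of Reads1 and Reads2 by coordinate in Step 13 can be sketched as a dictionary join. The coordinate key format (field of view, x, y) and the function name are hypothetical, chosen only for this example; the source does not describe the data layout.

```python
from typing import Dict, List, Tuple

# Hypothetical coordinate key: (field_of_view, x, y) of a signal spot on the chip.
Coordinate = Tuple[int, int, int]


def pair_by_coordinate(reads1: Dict[Coordinate, str],
                       reads2: Dict[Coordinate, str]
                       ) -> List[Tuple[Coordinate, str, str]]:
    """Pair each Read1 with the Read2 observed at the same chip coordinate.

    Coordinates present in only one pass have no partner and are skipped.
    """
    shared = sorted(reads1.keys() & reads2.keys())
    return [(coord, reads1[coord], reads2[coord]) for coord in shared]


if __name__ == "__main__":
    r1 = {(1, 10, 20): "ACGT", (1, 11, 25): "GGCA"}
    r2 = {(1, 10, 20): "ACGA", (2, 5, 5): "TTTT"}
    for coord, read1, read2 in pair_by_coordinate(r1, r2):
        print(coord, read1, read2)
```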
  • the present invention provides an analysis method for obtaining Consensus Reads by analyzing Reads1 and Reads2 obtained in any of the foregoing embodiments, including:
  • Step 14: Construct the correction model. From the Reads1 and Reads2 sequences obtained in Step 13, extract the Reads at the same coordinates, selected according to the read lengths of the two sequencing passes, and output them as two files, T1 (Read1) and T2 (Read2). Align the Reads in T1 and T2 to the reference genome separately, and use the Naive Bayes method to calculate the probability that the intermediate base is deleted or inserted under different combinations of the preceding and following bases.
  • In the prediction process, for an intermediate base under a given combination of preceding and following bases, whether to retain the intermediate base is determined according to the probability of deletion or insertion in the model. If the probability of deletion is greater than 50%, the intermediate base is retained; otherwise, the intermediate base is discarded.
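  • A naive Bayes estimate of indel probability given the flanking bases might look like the sketch below. It is illustrative only: the three event labels (including a "match" class for positions without an indel), the add-one smoothing, and the input format of (preceding base, following base, event) tuples are assumptions, since the source does not specify the features or training details.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

BASES = "ACGT"
EVENTS = ("deletion", "insertion", "match")  # "match" = no indel (assumed class)

Model = Tuple[Counter, Counter, Counter]


def train(observations: Iterable[Tuple[str, str, str]]) -> Model:
    """Count events and flanking bases.

    Each observation is (prev_base, next_base, event), as would be derived from
    aligning reads to a reference genome (the alignment itself is not shown).
    """
    event_counts: Counter = Counter()
    prev_counts: Counter = Counter()
    next_counts: Counter = Counter()
    for prev_base, next_base, event in observations:
        event_counts[event] += 1
        prev_counts[(event, prev_base)] += 1
        next_counts[(event, next_base)] += 1
    return event_counts, prev_counts, next_counts


def posterior(model: Model, prev_base: str, next_base: str) -> Dict[str, float]:
    """P(event | prev_base, next_base) under the naive independence assumption."""
    event_counts, prev_counts, next_counts = model
    total = sum(event_counts.values())
    scores = {}
    for event in EVENTS:
        n = event_counts[event]
        prior = (n + 1) / (total + len(EVENTS))  # add-one smoothing
        p_prev = (prev_counts[(event, prev_base)] + 1) / (n + len(BASES))
        p_next = (next_counts[(event, next_base)] + 1) / (n + len(BASES))
        scores[event] = prior * p_prev * p_next
    norm = sum(scores.values())
    return {event: score / norm for event, score in scores.items()}
```

  • The returned deletion and insertion probabilities can then be compared against the 50% cut-off described above.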
  • Step 15: Filter the Reads1 data obtained in Step 13 by read length, and name the set of Reads1 sequences with a read length ≥ 25 bp Fa1. Filtering out reads shorter than 25 bp removes some noise sequences and improves the accuracy of sequencing data mapping.
  • Step 16: Split Fa1 obtained in Step 15 into two sets according to the read length of the corresponding Reads2 obtained in Step 13. The set of Reads in Fa1 whose corresponding Read2 is ≥ 25 bp is named Fa2, and the set of Reads in Fa1 whose corresponding Read2 satisfies 10 bp ≤ Read2 < 25 bp is named Fa3.
  • The purpose of splitting Fa1 into two parts for analysis is to improve the accuracy of the Consensus Reads while reducing the loss of data throughput caused by length filtering.
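  • Steps 15 and 16 can be sketched as a length filter followed by a split on the partner read's length. The dictionary layout keyed by coordinate, the function name, and the inclusive lower bound of 10 bp are assumptions for this example; the 25 bp and 10 bp values follow the thresholds given in the text.

```python
from typing import Dict, Hashable, Tuple

ReadSet = Dict[Hashable, str]  # coordinate -> read sequence (assumed layout)


def split_by_length(reads1: ReadSet, reads2: ReadSet,
                    min_len: int = 25,
                    short_min: int = 10) -> Tuple[ReadSet, ReadSet, ReadSet]:
    """Build Fa1, Fa2 and Fa3 from coordinate-paired reads.

    Fa1: Read1 with length >= min_len.
    Fa2: reads of Fa1 whose Read2 partner has length >= min_len.
    Fa3: reads of Fa1 whose Read2 partner has short_min <= length < min_len
         (the inclusive lower bound is an assumption of this sketch).
    """
    fa1 = {c: r for c, r in reads1.items() if len(r) >= min_len}
    fa2: ReadSet = {}
    fa3: ReadSet = {}
    for coord, read in fa1.items():
        partner = reads2.get(coord)
        if partner is None:
            continue  # no Read2 at this coordinate
        if len(partner) >= min_len:
            fa2[coord] = read
        elif len(partner) >= short_min:
            fa3[coord] = read
    return fa1, fa2, fa3
```

  • Keeping Fa3 instead of discarding it is what preserves throughput: reads whose partner is short but still informative remain available for a separate correction pass.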
  • Step 17: Compare the Q value of each Read in Fa2 obtained in Step 16 with the Q value of the Read at the corresponding coordinate in Reads2 obtained in Step 13. The set of Reads with the higher Q value (the Read from Fa2 when the Q values are equal) is named Fa4, and the set of Reads with the lower Q value (the Read from Reads2 when the Q values are equal) is named Fa5.
  • the purpose of this step is to divide the sequences in Reads1 and Reads2 into a set with relatively high sequencing quality and a set with relatively low sequencing quality, so that the sequences finally output in Consensus Reads are the higher-quality, relatively more accurate Reads of the two sequencing passes.
  • Step 18: Further filter the Fa4 and Fa5 obtained in Step 17. The set of Reads in Fa4 with Q value ≥ a4 is named Fa6, and the set of Reads in Fa5 whose coordinates correspond one-to-one to the Reads in Fa6 is named Fa7.
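Steps 17 and 18 amount to a per-coordinate comparison followed by a threshold filter. A minimal sketch, assuming each Read is available as a (sequence, Q value) pair keyed by coordinate; the `Read` type and both function names are hypothetical, and the threshold a4 is passed in because its value is application-dependent.

```python
from typing import Dict, NamedTuple, Tuple


class Read(NamedTuple):
    seq: str
    q: float  # per-read Quality Score (Q value)


def partition_by_quality(
    fa2: Dict[str, Read], reads2: Dict[str, Read]
) -> Tuple[Dict[str, Read], Dict[str, Read]]:
    """Step 17: the higher-Q read of each pair goes to Fa4, the other to Fa5."""
    fa4: Dict[str, Read] = {}
    fa5: Dict[str, Read] = {}
    for coord, first in fa2.items():
        second = reads2[coord]
        if first.q >= second.q:  # on a tie, the read from Fa2 goes to Fa4
            fa4[coord], fa5[coord] = first, second
        else:
            fa4[coord], fa5[coord] = second, first
    return fa4, fa5


def filter_by_quality(
    fa4: Dict[str, Read], fa5: Dict[str, Read], a4: float
) -> Tuple[Dict[str, Read], Dict[str, Read]]:
    """Step 18: keep Fa4 reads with Q >= a4 (Fa6) and their Fa5 partners (Fa7)."""
    fa6 = {coord: read for coord, read in fa4.items() if read.q >= a4}
    fa7 = {coord: fa5[coord] for coord in fa6}
    return fa6, fa7
```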
  • Step 19: Align the Reads in Fa6 and Fa7 obtained in Step 18 one by one, grade them by sequence similarity, and correct Fa6 using Fa7 as the reference sequence: mark the positions in each Fa6 sequence that differ from Fa7 and, using the correction model constructed in Step 14, judge one by one whether the base at each differing position is a Deletion or an Insertion, thereby obtaining the corrected Consensus Reads Part 1 for output.
  • the "differing positions" in this step refer only to positions where a base is detected on just one of the two Reads, either the Fa6 Read or the Fa7 Read. In that case the correction model constructed in Step 14 determines whether the base should be retained.
  • where both Fa6 and Fa7 detect a base at a position but the base types disagree, the Fa6 base prevails; the model does not correct this case.
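The correction of one Fa6/Fa7 pair can be sketched as below. Several things here are assumptions rather than the method as filed: Python's `difflib.SequenceMatcher` stands in for the pairwise alignment (it is not a true global aligner and can misplace gaps in repetitive sequence), `p_deletion` is a hypothetical callable that returns the modeled Deletion probability for a pair of flanking bases, and a base found in only one Read is retained when that probability exceeds 50%, following the rule stated for Step 14.

```python
from difflib import SequenceMatcher
from typing import Callable


def correct_pair(fa6: str, fa7: str, p_deletion: Callable[[str, str], float]) -> str:
    """Build a consensus for one Fa6 read using its Fa7 partner as reference."""
    out = []
    matcher = SequenceMatcher(None, fa6, fa7, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("equal", "replace"):
            # Agreement, or a base-type mismatch: the Fa6 bases prevail.
            out.append(fa6[i1:i2])
            continue
        # "delete": bases present only in Fa6; "insert": bases present only in Fa7.
        src, lo, hi = (fa6, i1, i2) if tag == "delete" else (fa7, j1, j2)
        for k in range(lo, hi):
            before = src[k - 1] if k > 0 else "^"           # "^" marks the read start
            after = src[k + 1] if k + 1 < len(src) else "$"  # "$" marks the read end
            if p_deletion(before, after) > 0.5:
                out.append(src[k])  # retained under the Step 14 rule
    return "".join(out)
```

Because the retain/discard comparison is isolated in one line, it is easy to adapt if the trained model is read with the opposite convention.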
  • Step 20: Further filter the Reads in Fa3 obtained in Step 16, and name the set of Reads in Fa3 with Q value ≥ a3 Fa8;
  • Step 21 Extract the set of Reads in Reads2 obtained in Step 13 that corresponds to the coordinates in Fa8 on a one-to-one basis, and name it Fa9;
  • Step 22: Align the Reads in Fa8 and Fa9 one by one, grade them by sequence similarity, and correct Fa8 using Fa9 as the reference sequence: mark the positions in Fa8 that differ from Fa9 and, using the correction model constructed in Step 14, judge one by one whether the base at each differing position is a Deletion or an Insertion, thereby obtaining the corrected Consensus Reads Part 2 for output;
  • Step 23 Combine the Consensus Reads Part 1 and the Consensus Reads Part 2 of different similarity levels according to different application requirements to obtain the output Consensus Reads.
  • the process of hybridizing the library to the adapters on the sequencing chip surface described in step 2 includes (the reagents are conventional):
  • 1) Pre-denature the library to be hybridized at 90-100 °C for 2-5 minutes;
  • 2) Quickly cool the product obtained from step 1) on an ice-water mixture for more than 2 minutes to obtain a denatured hybridization library stock solution;
  • 3) Dilute the denatured hybridization library stock solution obtained from step 2) with a hybridization solution (such as 3×SSC solution) to a suitable concentration, preferably 0.1-2 nM, to obtain a diluted hybridization library;
  • 4) Pass 30-50 μL of the diluted hybridization library obtained from step 3) into a sequencing chip channel pre-treated with the reconstitution reagent, and hybridize at 40-60 °C for 10-30 minutes;
  • 5) Pass 200-1000 μL of cleaning solution 1 into the chip channel to remove the diluted hybridization library remaining after the hybridization in step 4);
  • 6) Pass 200-1000 μL of cleaning solution 2 into the chip channel to remove the cleaning solution 1 from step 5), completing the hybridization of the library to the adapters on the sequencing chip surface.
  • the components of the reconstitution reagent include: cleaning solution 1, the components include: 150 mM sodium chloride, 15 mM sodium citrate, 150 mM 4-hydroxyethylpiperazine ethanesulfonic acid, 0.1% sodium lauryl sulfate.
  • the cleaning solution 3 includes: 450 mM sodium chloride and 45 mM sodium citrate.
  • the components of the cleaning solution 2 include: 150 mM sodium chloride and 150 mM 4-hydroxyethylpiperazine ethanesulfonic acid.
  • the process of synthesizing the complementary strand of the initial template in step 3 includes:
  • the components of the extension reagent include: 10-100 U/mL DNA polymerase, preferably Bst DNA polymerase, Bsu DNA polymerase, Klenow DNA polymerase, etc.; 0.2-2 mM dNTP; 0.5-2 M betaine; 20 mM Tris; 10 mM sodium chloride; 10 mM potassium chloride; 10 mM ammonium sulfate; 3 mM magnesium chloride; 0.1% Triton X-100; pH 8.3.
  • the process of closing the 3'end of the incompletely extended chain in step 4 includes:
  • the components of the blocking reagent 1 include: 10-100 U/mL DNA polymerase, preferably Klenow DNA polymerase, Bsu DNA polymerase, N9 DNA polymerase, etc.; 10-100 μM ddNTP; 5 mM manganese chloride; 20 mM Tris; 10 mM sodium chloride; 10 mM potassium chloride; 10 mM ammonium sulfate; 3 mM magnesium chloride; 0.1% Triton X-100; pH 8.3.
  • the process of removing the initial template in step 5 includes:
  • the denaturation reagent can be formamide, 0.1 M NaOH, etc., reacted at 50-60 °C for 2-5 minutes;
  • repeat step 1) and step 2) once to complete the removal of the initial template.
  • the process of sealing the 3'end of the residual joint on the chip surface described in step 6 includes:
  • the components of the blocking reagent 2 include: 100 U/mL Terminal Transferase (NEB, M0315L), 1× Terminal Transferase Buffer, 0.25 mM cobalt chloride, 10-100 μM ddNTP.
  • the process for the hybridization sequencing primer D7S1T-R2P described in step 7 includes:
  • 2) Pass 200-1000 μL of the sequencing primer hybridization solution obtained from step 1) into the chip channel, and hybridize at 50-60 °C for 10-30 minutes;
  • the sequencing process of Read1 described in step 8 can be performed with reference to the operating instructions in the GenoCare TM Single Molecule Two-color Sequencing Universal Kit (record number: Yueshen Xie Bei No. 20190887).
  • the process of removing the nascent sequencing strand described in step 9 is performed with reference to step 5.
  • the process of blocking the 3' ends of the residual nascent strands described in step 10 can be performed with reference to step 4.
  • the process of hybridizing the sequencing primer D7S1T-R2P described in step 11 is performed with reference to step 7.
  • the sequencing process of Read2 described in step 12 is performed with reference to step 8.
  • the process of splitting the sequencing data in step 13 to obtain the two partial sequences of Reads1 and Reads2 with one-to-one coordinates includes:
  • each Read in the ".fa_" file output by BaseCalling is divided equally into two parts at the middle and output as two ".fa_" files with one-to-one sequence coordinates, "Reads1.fa_" and "Reads2.fa_";
  • the construction process of the correction model described in step fourteen includes:
  • the Reads at the same coordinates whose read lengths in both sequencing passes are ≥25 bp are output as two files, T1 (Read1) and T2 (Read2);
  • the method of any of the above embodiments, taking into account the characteristics of sequencing data, especially single-molecule sequencing platform data, provides a sequencing method that optionally uses the adapter D7-S1-T/D9-S2 together with the sequencing primer D7S1T-R2P to perform Two-Pass sequencing and obtain Reads1 and Reads2.
  • the above-mentioned related embodiments of the present invention also provide an analysis method suitable for analyzing the data (Reads1 and Reads2) obtained by the Two-Pass sequencing method to obtain Consensus Reads. This analysis method can significantly reduce the noise sequence and base error rate in the output Consensus Reads (consensus sequence/common sequence).
  • the present invention also provides a sequencing result analysis system capable of implementing the above-mentioned sequencing result analysis method.
  • the system includes: a sequencing device adapted to obtain the sequencing result by the aforementioned method, the sequencing result including first sequencing data and second sequencing data, each composed of multiple reads, with at least a part of the reads in the first sequencing data having corresponding reads in the second sequencing data; and an analysis device adapted to perform mutual correction based on at least a part of each of the first sequencing data and the second sequencing data, so as to obtain final sequence information.
  • the system can effectively implement the aforementioned sequencing result analysis method, so that the accuracy of the sequencing result can be improved by mutual correction of the results of two rounds of sequencing.
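  • in addition, as described above, after the first round of sequencing (the first sequencing), blocking the 3' ends of the nascent sequencing strands remaining on the chip surface effectively avoids interference signals during the second round of sequencing (the second sequencing). The accuracy of the sequencing results can thereby be further improved.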
  • the present invention also provides a computer-readable storage medium on which a computer program is stored, characterized in that, when the program is executed by a processor, the steps of the aforementioned method are realized.
  • the present invention also provides an electronic device, which includes: the aforementioned computer-readable storage medium; and one or more processors for executing programs in the computer-readable storage medium.
  • the embodiment of the present invention also provides a computer program product, including instructions, which when the computer executes the program, cause the computer to execute the sequencing method and/or the sequencing result analysis method in any of the above embodiments.
  • This embodiment provides a set of sequencing and analysis methods that reduce the output sequence noise and error rate of sequencing platforms, especially single-molecule sequencing platforms.
  • the adapter D7-S1-T/D9-S2 is obtained by annealing the oligonucleotide strand D7-S1-T with D9-S2, which carries a 5' phosphate group modification.
  • the sequence of the D7-S1-T is SEQ ID NO: 1
  • the sequence of the D9-S2 is SEQ ID NO: 2
  • the sequence of the sequencing primer D7S1T-R2P is SEQ ID NO: 3.
  • the primer sequences and names involved in this example are summarized in Table 1.
  • the adapter D7-S1-T/D9-S2 and the sequencing primer D7S1T-R2P are used in combination, on a sequencing platform such as the GenoCare TM single-molecule sequencing platform, to perform Two-Pass sequencing and obtain Reads1 and Reads2.
  • Sequencing methods include:
  • Step 1 Construct Two-Pass sequencing library.
  • Use the Universal DNA Library Prep Kit for Illumina V2 (ND606-01) to ligate the annealed D7-S1-T/D9-S2 adapter to the prepared fragmented human gDNA. After ligation, PCR amplification is not required; purify directly with VAHTS DNA Clean Beads (N411-01) to obtain the target library.
  • the steps of constructing a Two-Pass sequencing library in this embodiment include:
  • Human gDNA fragmentation: using a Covaris instrument with the parameters Peak Power, 75; Duty Factor, 25; Cycles/Burst, 50; Time (s), 250, shear 0.1-1 μg of human gDNA by sonication to obtain 100-300 bp DNA fragments.
  • this step can also be achieved using restriction enzyme digestion.
  • reaction conditions are: 20 °C for 15 minutes, followed by 65 °C for 10 minutes.
  • the reaction conditions were as follows: after mixing, it was placed at room temperature for 15 minutes.
  • a Qubit 3.0 instrument and Qubit dsDNA HS detection kit were used to detect the concentration of the constructed library.
  • the Labchip DNA HS detection kit and LabChip instrument were used to detect the fragment distribution of the constructed library.
  • Step 2 Hybridize the library obtained in Step 1 with the probes on the surface of the sequencing chip.
  • Chip selection: the chip used is an epoxy-modified chip, and the probe is fixed by reacting the amino group on the probe with the epoxy group on the surface of the chip. The probe sequence is given as SEQ ID NO: 4.
  • This example does not limit the surface modification and probe fixation methods. For example, it can be carried out with reference to the published patent application CN109610006A, the full text of which is incorporated herein.
  • the hybridization process between the library and the probe on the chip is as follows:
  • 1) Take 3 μL of the 20 nM sequencing library constructed in Step 1, add 3 μL of deionized water, mix well, and denature at 95°C for 5 minutes;
  • 2) Quickly place the denatured library obtained from step 1) in an ice-water mixture to cool for more than 2 minutes;
  • the hybridization solution is 3×SSC buffer, made by diluting 20×SSC buffer (Sigma, #S6639-1L) with nuclease-free water (RNase-free water).
  • 4) Pass 30 μL of the diluted hybridization library obtained from step 3) into a channel of the chip, hybridize at 42°C for 30 minutes, and then cool to room temperature;
  • a 200 ⁇ L volume of cleaning solution 2 is passed into the hybridization channel of the chip, and the cleaning solution 1 in the channel is replaced to complete the hybridization of the library and the surface adapter of the sequencing chip.
  • the components of cleaning solution 1 include: 150 mM sodium chloride, 15 mM sodium citrate, 150 mM 4-hydroxyethylpiperazine ethanesulfonic acid, and 0.1% sodium lauryl sulfate.
  • the components of the cleaning solution 2 include: 150 mM sodium chloride and 150 mM 4-hydroxyethylpiperazine ethanesulfonic acid.
  • Step 3: Synthesize the complementary strand of the initial template.
  • the initial template is the library that is hybridized with the probe in step 2.
  • the specific steps for the synthesis of the complementary strand of the initial template are as follows:
  • 1) Place the chip hybridized with the library in Step 2 on the sequencer;
  • extension reagent: 120 U/mL Bst DNA polymerase (NEB, #M0275M), 0.2 mM dNTP (a mixture of dATP, dTTP, dCTP, and dGTP, 0.2 μM each), 1 M betaine, 20 mM Tris, 10 mM sodium chloride, 10 mM potassium chloride, 10 mM ammonium sulfate, 3 mM magnesium chloride, 0.1% Triton X-100, pH 8.3;
  • Step 4 (optional): Close the 3'end of the incompletely extended nascent chain in Step 3.
  • the specific steps of closing are as follows:
  • Pump 750 μL of blocking reagent 1 into the channel after the extension described in Step 3, and react for 10 minutes.
  • the components of the blocking reagent 1 are: 100 U/mL Klenow DNA polymerase large fragment (#M0212M), 12.5 μM ddNTP mix (a mixture of ddATP, ddTTP, ddCTP, and ddGTP, 12.5 μM each), 5 mM manganese chloride, 20 mM Tris, 10 mM sodium chloride, 10 mM potassium chloride, 10 mM ammonium sulfate, 3 mM magnesium chloride, 0.1% Triton X-100, pH 8.3;
  • 3) Pass 220 μL of cleaning solution 1 into the channel blocked in step 2) to remove the blocking solution remaining after the blocking reaction, completing the blocking of the 3' ends of the incompletely extended nascent strands.
  • Step 5 Denaturation to remove the original template, the process of removing the original template is as follows:
  • Repeat step 2) and step 3) once to complete the removal of the initial template.
  • Step 6 Seal the 3'end of the residual connector on the chip surface.
  • the process of sealing the 3'end of the residual connector on the chip surface includes:
  • Pass 750 μL of blocking reagent 2 into the channel treated in step 2), and react for 15 minutes.
  • the components of blocking reagent 2 are: 100 U/mL Terminal Transferase (NEB, M0315L), 1× Terminal Transferase Buffer, 0.25 mM cobalt chloride, 100 μM ddNTP mix (a mixture of ddATP, ddTTP, ddCTP, and ddGTP, 100 μM each);
  • Step 7: Hybridize the sequencing primer D7S1T-R2P. The process of hybridizing the sequencing primer D7S1T-R2P is as follows:
  • the diluted sequencing primer hybridization solution is cleaning solution 3 containing 0.1 μM primer D7S1T-R2P, and the components of cleaning solution 3 include: 450 mM sodium chloride and 45 mM sodium citrate;
  • Step 8 Perform Read1 sequencing.
  • the process of Read1 sequencing is as follows:
  • Step 9 Remove the nascent sequencing chain.
  • the process of removing the nascent sequencing chain is carried out in accordance with the steps in step five.
  • Step 10: Block the 3' ends of the remaining nascent strands.
  • Step 11 Hybridize the sequencing primer D7S1T-R2P.
  • hybridization sequencing primer D7S1T-R2P is carried out in accordance with the steps in step seven.
  • Step 12 Perform Read2 sequencing.
  • the Read2 sequencing process is carried out in accordance with the steps in step eight.
  • Step 13 Split the sequencing data to obtain two partial sequences of Reads1 and Reads2 with one-to-one coordinates.
  • the process of splitting the sequencing data in this embodiment to obtain two partial sequences of Reads1 and Reads2 with one-to-one coordinates includes:
  • the set of analysis methods for obtaining Consensus Reads by analyzing Reads1 and Reads2 obtained by the above-mentioned Two-Pass sequencing method includes:
  • Step 14: Build the correction model.
  • the process of constructing the correction model described in this embodiment includes:
  • Count the Deletion and Insertion events found in step 4), and at the same time count the types of Base before and after each inconsistent position. This gives the probability of an Insertion or a Deletion occurring before or after each Base type.
  • naive Bayes model used in this example is as follows:
  • P(D) represents the probability of Deletion for a certain base
  • P(I) represents the probability of Insertion for a certain base.
  • P(XY|D) and P(XY|I) can be obtained by counting how often Deletion or Insertion occurs under the different bases, from which P(D|XY) and P(I|XY) can be calculated.
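The model formula itself is not reproduced in this text. Assuming X and Y denote the bases before and after the position in question (an assumption, since the surviving text does not define them), the quantities named above are related by Bayes' theorem in its standard form:

```latex
P(D \mid XY) = \frac{P(XY \mid D)\,P(D)}{P(XY)}, \qquad
P(I \mid XY) = \frac{P(XY \mid I)\,P(I)}{P(XY)}
```

Here P(XY) is the overall frequency of the flanking combination XY among the counted positions. A naive Bayes treatment would additionally factor P(XY|D) as P(X|D)P(Y|D); whether the example does so cannot be determined from the text.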
  • Step 15 Filter the read length to get Fa1.
  • the read length filtering process in this embodiment includes:
  • Step 16 According to the read length of Reads2, classify the Reads in Fa1.
  • the process of classifying Reads in Fa1 according to the read length of Reads2 in this example includes:
  • Step 17 Output the confidence Reads according to the Q value.
  • the process of re-outputting the confidence Reads according to the Q value in this example includes:
  • the Quality Score value (referred to as the Q value) of each Read is obtained by parsing the Read ID.
  • Step 18 Filter Reads in Fa4 and Fa5 according to the Q value.
  • the process of filtering Reads in Fa4 and Fa5 according to the Q value in this example includes:
  • Step 19 Use the Reads in Fa7 to correct the Reads in Fa6 to obtain Consensus Reads Part1 (CRP1 for short).
  • the process of using Reads in Fa7 to correct Reads in Fa6 in this example includes:
  • After obtaining the consensus sequence, judge the inconsistent Base positions in the consensus sequence one by one using the correction model constructed in step 14. Calculate the probability of Deletion or Insertion at each position from the base types before and after it. If the probability of Deletion is greater than 50%, the base measured at this position is considered one that should not appear, and the base at this position is deleted. Otherwise, the Base at that position is kept.
  • the inconsistent Base here refers specifically to the Base that is not measured at the same time in the two corresponding Reads. If the Base is measured twice, but the Base type is inconsistent, it is not within the range of candidates for correction in this example. In this case, the final Base type is based on the Base type of Reads in Fa6.
  • Step 20 Filter Reads in Fa3 according to the Q value.
  • the process of filtering Reads in Fa3 according to the Q value in this example includes:
  • Step 21 Output the Reads in the corresponding Reads2 in the Fa8 file.
  • the process of outputting Reads in Reads2 according to Reads in Fa8 in this example includes:
  • Step 22 Use the Reads in Fa9 to correct the Reads in Fa8 to obtain Consensus Reads Part 2 (CRP2 for short).
  • Step 23 According to the requirements of the accuracy of sequencing data corresponding to different applications, the Reads in CRP1 and CRP2 that meet the similarity threshold are combined and output to obtain Consensus Reads.
  • the different applications described in this example correspond to the requirements of the accuracy of sequencing data, and the process of filtering and outputting Reads in the Consensus Reads Part includes:
  • the similarity refers to the similarity of the corresponding Reads of a certain Read in Reads1 and Reads2.
  • the similarity calculation step is to first register two corresponding Reads with each other. Then calculate the ratio of the number of consistent Bases in the consensus sequence obtained by the registration to the total number of Bases. Among them, the registration method, consensus sequence and inconsistent Base definition refer to step 19.
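The similarity ratio and the threshold-based merge of Step 23 can be sketched as follows. This is a sketch under assumptions: `difflib.SequenceMatcher` stands in for the registration step, the consensus length is taken as the number of aligned columns, and the names `similarity` and `select_consensus` are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, Tuple


def similarity(read_a: str, read_b: str) -> float:
    """Consistent bases divided by total bases in the aligned consensus."""
    ops = SequenceMatcher(None, read_a, read_b, autojunk=False).get_opcodes()
    same = sum(i2 - i1 for tag, i1, i2, j1, j2 in ops if tag == "equal")
    total = sum(max(i2 - i1, j2 - j1) for tag, i1, i2, j1, j2 in ops)
    return same / total if total else 0.0


def select_consensus(
    crp1: Dict[str, Tuple[str, float]],
    crp2: Dict[str, Tuple[str, float]],
    threshold: float,
) -> Dict[str, str]:
    """Merge CRP1 and CRP2, keeping reads whose similarity meets the threshold.

    Each input maps read coordinate -> (corrected sequence, similarity).
    """
    merged = {**crp1, **crp2}
    return {coord: seq for coord, (seq, sim) in merged.items() if sim >= threshold}
```

Raising the threshold trades output volume for accuracy, which is the application-dependent choice described in Step 23.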
  • Table 4 Comparison between the output sequence filtered by different similarity thresholds and the reference genome mapping analysis

Abstract

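An effective sequencing method, comprising: (1) performing a first sequencing on a sequencing template on a chip surface, so as to obtain first sequencing data by forming a first nascent sequencing strand, the sequencing template being connected to the chip surface through a sequencing adapter; (2) performing a first blocking treatment on the 3' ends of at least a portion of the first nascent sequencing strands; and (3) performing a second sequencing on the sequencing template, so as to obtain second sequencing data by forming a second nascent sequencing strand.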

Description

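Sequencing method, analysis method and system thereof, computer-readable storage medium and electronic device
Technical Field
The present invention relates to the field of sequencing, and in particular to a sequencing method, a sequencing result analysis method, a sequencing result analysis system, a computer-readable storage medium, and an electronic device.
Background
It is reported that single-molecule sequencing was proposed as early as the 1980s. In 2003, Dr. Stephen Quake, a professor in the Department of Bioengineering at Stanford University, successfully demonstrated the first single-molecule DNA sequencing experiment. In 2008, Helicos launched its first single-molecule sequencer (HeliScope). In 2009, Korlach and Turner published a paper in Science describing the principle of PacBio single-molecule sequencing. In 2010, PacBio introduced the PacBio RS sequencing system, which became commercially available in 2011. In 2014, Oxford Nanopore presented its MinION sequencing system at AGBT (the Advances in Genome Biology and Technology meeting). It is reported that for the Helicos, PacBio, and MinION platforms alike, the error rate of single-pass sequencing is high, reaching up to 30%. Many studies show that the errors on these platforms are mainly InDels and occur randomly, so the error rate can be reduced by reading the same molecule repeatedly.
It has been reported that PacBio can use CCS (circular consensus sequence) to overcome the high error rate of its SMRT sequencing technology. In addition, MinION can substantially improve sequencing accuracy through its 2D and 1D2 sequencing methods, reaching an accuracy of up to 97%.
It has been reported that Helicos, by sequencing with a Two-Pass method, can reduce the rate of deletion-type errors to within 1%, but the procedure is cumbersome and complex.
Existing sequencing methods therefore need further improvement.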
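Summary of the Application
The present invention aims to solve, at least to some extent, one of the technical problems in the related art. To this end, one object of the present invention is to provide an effective sequencing method.
In a first aspect, the present invention provides a sequencing method. According to an embodiment of the present invention, the method includes: (1) performing a first sequencing on a sequencing template on a chip surface, so as to obtain first sequencing data by forming a first nascent sequencing strand, the sequencing template being connected to the chip surface through a sequencing adapter; (2) performing a first blocking treatment on the 3' ends of at least a portion of the first nascent sequencing strands; and (3) performing a second sequencing on the sequencing template, so as to obtain second sequencing data by forming a second nascent sequencing strand.
According to embodiments of the present invention, two rounds of sequencing are performed so that the results can subsequently be corrected against each other, improving the accuracy of the sequencing result. In addition, after the first round of sequencing (the first sequencing), blocking the 3' ends of the nascent sequencing strands remaining on the chip surface effectively avoids interference signals during the second round of sequencing (the second sequencing). The accuracy of the sequencing result can thereby be further improved.
In a second aspect, the present invention provides a sequencing result analysis method. According to an embodiment of the present invention, the sequencing result includes first sequencing data and second sequencing data, each composed of a plurality of reads; at least a portion of the reads in the first sequencing data have corresponding reads in the second sequencing data; the first sequencing data and the second sequencing data are obtained by the method described above; and the sequencing result analysis method includes: (a) performing mutual correction based on at least a portion of each of the first sequencing data and the second sequencing data, so as to obtain final sequence information.
According to an embodiment of the present invention, a sequencing result analysis method is provided in which the sequencing result includes first sequencing data and second sequencing data, each composed of a plurality of reads; at least a portion of the reads in the first sequencing data have corresponding reads in the second sequencing data; and the sequencing result analysis method includes: (a) performing mutual correction based on at least a portion of each of the first sequencing data and the second sequencing data, so as to obtain final sequence information.
According to embodiments of the present invention, mutually correcting the results of the two rounds of sequencing improves the accuracy of the sequencing result. In addition, as described above, after the first round of sequencing (the first sequencing), blocking the 3' ends of the nascent sequencing strands remaining on the chip surface effectively avoids interference signals during the second round of sequencing (the second sequencing). The accuracy of the sequencing result can thereby be further improved.
In a third aspect, the present invention further provides a sequencing result analysis system. According to an embodiment of the present invention, the system includes: a sequencing device adapted to obtain a sequencing result by the method described above, the sequencing result including first sequencing data and second sequencing data, each composed of a plurality of reads, with at least a portion of the reads in the first sequencing data having corresponding reads in the second sequencing data; and an analysis device adapted to perform mutual correction based on at least a portion of each of the first sequencing data and the second sequencing data, so as to obtain final sequence information.
According to an embodiment of the present invention, a sequencing result analysis system is provided, the system including: a sequencing device for obtaining a sequencing result, the sequencing result including first sequencing data and second sequencing data, each composed of a plurality of reads, with at least a portion of the reads in the first sequencing data having corresponding reads in the second sequencing data; and an analysis device adapted to perform mutual correction based on at least a portion of each of the first sequencing data and the second sequencing data, so as to obtain final sequence information.
Any of the above systems can effectively implement the sequencing result analysis method described above, so that the accuracy of the sequencing result can be improved by mutually correcting results that include multiple rounds of sequencing. In addition, as described above, after the first round of sequencing (the first sequencing), blocking the 3' ends of the nascent sequencing strands remaining on the chip surface effectively avoids interference signals during the second round of sequencing (the second sequencing). The accuracy of the sequencing result can thereby be further improved.
In addition, the present invention provides a computer-readable storage medium storing a computer program; according to an embodiment of the present invention, the program, when executed by a processor, implements the steps of the method described above.
The present invention also provides an electronic device, including: the computer-readable storage medium described above; and one or more processors for executing the program in the computer-readable storage medium.
Finally, the present invention provides a computer program product including instructions which, when the program is executed by a computer, cause the computer to perform the sequencing method and/or the sequencing result analysis method of any of the above embodiments.
Additional aspects and advantages of the present invention will be given in part in the following description, and in part will become apparent from the following description or be learned through practice of the present invention.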
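Brief Description of the Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the description of the embodiments in conjunction with the following drawings, in which:
Figure 1 is a schematic flowchart of a sequencing method according to one embodiment of the present invention;
Figure 2 is a schematic flowchart of a sequencing method according to another embodiment of the present invention;
Figure 3 is a schematic flowchart of a sequencing method according to another embodiment of the present invention;
Figure 4 is a schematic flowchart of a sequencing result analysis method according to one embodiment of the present invention;
Figure 5 is a schematic flowchart of a sequencing result analysis method according to another embodiment of the present invention;
Figure 6 is a schematic flowchart of a sequencing result analysis method according to another embodiment of the present invention;
Figure 7 is a schematic structural diagram of a sequencing result analysis system according to one embodiment of the present invention;
Figure 8 is a schematic structural diagram of a sequencing result analysis system according to another embodiment of the present invention;
Figure 9 is a schematic structural diagram of a sequencing result analysis system according to another embodiment of the present invention;
Figure 10 is a schematic diagram of a sequencing method for obtaining Reads1 and Reads2 according to one embodiment of the present invention;
Figure 11 is a schematic diagram of sequencing library construction according to one embodiment of the present invention;
Figure 12 is a schematic flowchart of an analysis method for obtaining Consensus Reads (consensus sequence/common sequence) according to one embodiment of the present invention.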
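Detailed Description
The terms "first" and "second" in the present invention are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly specifying the number of the technical features indicated. A feature qualified by "first" or "second" may therefore explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality of" means at least two, for example two or three, unless otherwise specifically defined.
In the present invention, unless otherwise expressly specified and defined, terms such as "mounted", "coupled", "connected", and "fixed" are to be understood broadly. For example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct or indirect through an intermediate medium; and it may be an internal communication between two elements or an interaction between two elements, unless otherwise expressly defined. A person of ordinary skill in the art can understand the specific meaning of these terms in the present invention according to the specific circumstances.
Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, are intended to explain the present invention, and are not to be understood as limiting it.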
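In a first aspect, the present invention provides a sequencing method, as part of a sequencing and analysis approach that can reduce the noise and error rate of the sequences output by a sequencing platform (for example, the GenoCare TM single-molecule sequencing platform). The sequencing method according to embodiments of the present invention is described with reference to Figures 1 to 3 and Figures 10 to 12.
According to an embodiment of the present invention, the method includes:
S10: First sequencing to obtain first sequencing data
In this step, a first sequencing is performed on a sequencing template on a chip surface, so as to obtain first sequencing data by forming a first nascent sequencing strand, the sequencing template being connected to the chip surface through a sequencing adapter.
It should be noted that the term "chip" as used herein refers to a solid-phase substrate having a surface, for example a planar surface, to which the biomolecules to be tested are attached; it is also called a sequencing chip or flow cell. It will be understood that the method suits any sequencing platform that determines nucleic acid sequences by chip-based detection; for example, the mainstream sequencing-by-synthesis platforms currently on the market can all use this method. It is especially suitable for single-molecule sequencing platforms based on chip detection, such as the GenoCare TM single-molecule sequencing platform.
Referring to Figure 2, before step S10, a chip usable on the sequencing platform can be obtained through the following steps:
S10a: hybridizing the library molecules in a sequencing library to the sequencing adapters on the chip surface;
S10b: using the library molecules as initial templates, forming the sequencing template by synthesizing complementary strands; and
S10c: removing the initial templates, and performing a second blocking treatment on the 3' ends of the nucleic acid molecules on the chip surface.
The second blocking treatment thus further removes the influence of residual active 3' ends on subsequent reactions. The "library" referred to is a pool or collection of nucleic acid molecules containing multiple target fragments (fragments to be tested), which are derived from the nucleic acid of the sample to be tested. Typically, the target fragments are processed, for example by adding a known sequence such as an adapter (sequencing adapter) to one or both ends, so that the library can be connected or immobilized on the chip and is suitable for loading onto a sequencing platform for sequencing.
Referring to Figure 3, before step S10c, the method may further include S11b: performing a third blocking treatment on the 3' ends of the complementary strands that were incompletely extended in step S10b. This further improves sequencing accuracy and reduces unwanted sequencing noise.
S20: Performing a first blocking treatment on the 3' ends of at least a portion of the first nascent sequencing strands
In this step, a first blocking treatment is performed on the 3' ends of at least a portion of the first nascent sequencing strands. The first blocking treatment effectively increases the amount of valid data and reduces the interference of invalid data with the information analysis.
According to one embodiment of the present invention, step S20 includes removing the first nascent sequencing strands from the chip surface, and performing the first blocking treatment on the 3' ends of the first nascent sequencing strands remaining on the chip surface.
According to one embodiment of the present invention, step S20 includes performing the first blocking treatment on the 3' ends of the first nascent sequencing strands, and removing the blocked first nascent sequencing strands.
S30: Second sequencing to obtain second sequencing data
In this step, a second sequencing is performed on the sequencing template, so as to obtain second sequencing data by forming a second nascent sequencing strand.
According to embodiments of the present invention, two rounds of sequencing are performed so that the results can subsequently be corrected against each other, improving the accuracy of the sequencing result. In addition, after the first round of sequencing (the first sequencing), blocking the 3' ends of the nascent sequencing strands remaining on the chip surface effectively avoids interference signals during the second round of sequencing (the second sequencing). The accuracy of the sequencing result can thereby be further improved.
According to an embodiment of the present invention, the first, second, and third blocking treatments may each independently be performed by linking the 3'-terminal hydroxyl group to an extension reaction blocker. This further improves the blocking effect, and thereby further improves sequencing accuracy and reduces unwanted sequencing noise.
According to an embodiment of the present invention, the extension reaction blocker is ddNTP or a derivative thereof. This further improves the blocking effect, and thereby further improves sequencing accuracy and reduces unwanted sequencing noise.
According to an embodiment of the present invention, the first, second, and third blocking treatments are each independently performed using at least one of a DNA polymerase and a terminal transferase. This further improves the blocking effect, and thereby further improves sequencing accuracy and reduces unwanted sequencing noise.
According to an embodiment of the present invention, the first and third blocking treatments each independently attach the ddNTP or derivative thereof by means of a polymerase, and the second blocking treatment attaches the ddNTP or derivative thereof by means of the terminal transferase. This further improves the blocking effect, and thereby further improves sequencing accuracy and reduces unwanted sequencing noise.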
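In a second aspect, the present invention provides a sequencing result analysis method, which can effectively analyze the two rounds of sequencing data produced by any of the sequencing methods described above, thereby further improving sequencing accuracy and avoiding sequencing errors.
Referring to Figures 4 to 6 and Figure 12, according to an embodiment of the present invention, the sequencing result includes first sequencing data and second sequencing data, each composed of a plurality of reads; at least a portion of the reads in the first sequencing data have corresponding reads in the second sequencing data; the first sequencing data and the second sequencing data are obtained by the method described above; and the sequencing result analysis method includes:
performing mutual correction based on at least a portion of each of the first sequencing data and the second sequencing data, so as to obtain final sequence information.
According to an embodiment of the present invention, the mutual correction includes the following steps:
selecting, from the first sequencing data and the second sequencing data, high-quality reads and the corresponding reads of the high-quality reads, the reads having a length not less than a predetermined length and a sequencing quality not less than a predetermined quality threshold; and
aligning the high-quality reads with their corresponding reads, and correcting the sequence information based on the alignment result.
According to embodiments of the present invention, mutually correcting the results of the two rounds of sequencing improves the accuracy of the sequencing result. In addition, as described above, after the first round of sequencing (the first sequencing), blocking the 3' ends of the nascent sequencing strands remaining on the chip surface effectively avoids interference signals during the second round of sequencing (the second sequencing). The accuracy of the sequencing result can thereby be further improved.
Referring to Figure 5, the mutual correction includes the following steps:
S100: Constructing a first read set
In this step, a first read set is constructed from the first sequencing data according to read length, each read in the first read set having a length not less than a first predetermined length.
S200: Constructing a second read set and a third read set
In this step, a second read set and a third read set are constructed from the first read set according to the length of the corresponding reads; the corresponding read of each read in the second read set has a length not less than a second predetermined length, and the corresponding read of each read in the third read set has a length within a predetermined length range.
S300: Constructing a fourth read set and a fifth read set
In this step, a fourth read set and a fifth read set are constructed from the second read set and its corresponding reads, according to the sequencing quality of the reads in the second read set and of their corresponding reads.
According to an embodiment of the present invention, the fourth read set and the fifth read set are determined according to the following principles:
comparing the sequencing quality of each read in the second read set with that of its corresponding read;
selecting the one with the higher sequencing quality as an element of the fourth read set, and the one with the lower sequencing quality as an element of the fifth read set;
where the sequencing qualities are equal, selecting the read from the second read set as an element of the fourth read set and the corresponding read as an element of the fifth read set.
S400: Constructing a sixth read set
In this step, the fourth read set is filtered by sequencing quality to construct a sixth read set, the reads in the sixth read set all having a sequencing quality not less than a first predetermined quality threshold.
S500: Constructing a seventh read set
In this step, using the sixth read set, the reads corresponding to the reads in the sixth read set are selected from the fifth read set to construct a seventh read set.
S600: Comparing the sixth read set with the seventh read set to determine first difference sites
In this step, the reads of the sixth read set are aligned with those of the seventh read set, and first difference sites are determined on the reads of the sixth read set.
S700: Correcting the first difference sites
In this step, the first difference sites are corrected using a predetermined sequencing error prediction model, so as to determine first sequence information; the sequencing error prediction model is used to determine the probability that an insertion or a deletion occurred at a difference site during sequencing.
Referring to Figure 6, after the first sequence information is obtained, the method may further include:
S400a: Constructing an eighth read set
In this step, the third read set is filtered by sequencing quality to construct an eighth read set, the reads in the eighth read set all having a sequencing quality not less than a second predetermined quality threshold.
S500a: Constructing a ninth read set
In this step, using the eighth read set, the reads corresponding to the reads in the eighth read set are selected from the second sequencing data to construct a ninth read set.
S600a: Aligning the eighth read set with the ninth read set to determine second difference sites
In this step, the reads of the eighth read set are aligned with those of the ninth read set, and second difference sites are determined on the reads of the eighth read set.
S700a: Correcting the second difference sites to determine second sequence information
In this step, the second difference sites are corrected using the sequencing error prediction model, so as to determine second sequence information.
According to an embodiment of the present invention, the sequencing error prediction model is obtained by training a naive Bayes model on the results of aligning the first sequencing data and the second sequencing data to a reference genome.
According to an embodiment of the present invention, for the first difference sites and the second difference sites:
if the read from the sixth read set has a base at the difference site, the corresponding read from the seventh read set has no base at the difference site, and the probability of a deletion at the difference site is 50% or more, the base of the read from the sixth read set at the difference site is retained as the final sequencing result.
if the read from the sixth read set has no base at the difference site, the corresponding read from the seventh read set has a base at the difference site, and the probability of an insertion at the difference site is 50% or more, the base of the read from the sixth read set at the difference site is retained as the final sequencing result.
if the read from the sixth read set has a base at the difference site and the corresponding read from the seventh read set also has a base at the difference site, the base of the read from the sixth read set at the difference site is selected as the final sequencing result.
According to an embodiment of the present invention, the first predetermined length and the second predetermined length are each independently not less than 20 bp, preferably not less than 25 bp, and the predetermined length range is 10 to 25 bp; the first predetermined quality threshold and the second predetermined quality threshold are each independently not less than 50, preferably not less than 60.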
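Specifically, according to an embodiment of the present invention, a sequencing method is provided that uses the adapter D7-S1-T/D9-S2 in combination with the sequencing primer D7S1T-R2P and performs Two-Pass sequencing on a sequencing platform such as the GenoCare TM single-molecule sequencing platform to obtain Reads1 and Reads2.
A further object of the present invention is to provide an analysis method that analyzes the Reads1 and Reads2 obtained by the above Two-Pass sequencing method to obtain Consensus Reads. This analysis method can significantly reduce the noise sequences and the base error rate in the output Consensus Reads (consensus sequence/common sequence).
According to an embodiment of the present invention, an adapter and a sequencing primer for Two-Pass sequencing library construction are provided. The adapter is obtained by annealing the oligonucleotide strand D7-S1-T with D9-S2, which carries a 5' phosphate group modification, and the sequencing primer is D7S1T-R2P. The sequence of D7-S1-T is SEQ ID NO: 1, the sequence of D9-S2 is SEQ ID NO: 2, and the sequence of D7S1T-R2P is SEQ ID NO: 3.
According to an embodiment of the present invention, the present invention provides a method of performing Two-Pass sequencing with the above adapter and sequencing primer to obtain Reads1 and Reads2, including:
Step 1: Construct the Two-Pass sequencing library. Use a library preparation kit (Universal DNA Library Prep Kit for Illumina V2 (ND606-01)) to ligate the annealed D7-S1-T/D9-S2 adapter to the prepared fragmented human gDNA; after ligation, no PCR amplification is needed, and the product is purified directly with a purification kit (VAHTS DNA Clean Beads (N411-01)) to obtain the target library;
Step 2: Hybridize the library obtained in Step 1 to the adapters on the sequencing chip surface;
Step 3: Synthesize the complementary strand of the initial template hybridized to the chip surface in Step 2;
Step 4 (optional): Block the 3' ends of the nascent strands that were incompletely extended in Step 3, to reduce their interference with the sequencing process;
Step 5: Denature to remove the initial template hybridized to the chip surface in Step 2;
Step 6: Block the 3' ends of the residual adapters on the chip surface, to reduce their interference with the sequencing process;
Step 7: Hybridize the sequencing primer D7S1T-R2P, using the complementary strand synthesized in Step 3 as the template;
Step 8: Perform Read1 sequencing, using the complementary strand synthesized in Step 3 as the template and the sequencing primer D7S1T-R2P hybridized in Step 7 as the primer;
Step 9: Denature to remove the nascent sequencing strand from Step 8;
Step 10: Block the 3' ends of any nascent sequencing strands from Step 8 that remain after Step 9, to prevent them from extending further during Read2 sequencing;
Step 11: Hybridize the sequencing primer D7S1T-R2P, using the complementary strand synthesized in Step 3 as the template;
Step 12: Perform Read2 sequencing, using the complementary strand synthesized in Step 3 as the template and the sequencing primer D7S1T-R2P hybridized in Step 11 as the primer;
Step 13: Split the sequencing data obtained in Step 8 and Step 12 to obtain two sets of sequences, Reads1 and Reads2, with one-to-one corresponding coordinates.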
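Further, the present invention provides an analysis method for obtaining Consensus Reads by analyzing the Reads1 and Reads2 obtained in any of the above embodiments, including:
Step 14: Construct the correction model. From the Reads1 and Reads2 sequences obtained in Step 13, extract the Reads at the same coordinates whose read lengths in both sequencing passes are ≥25 bp, and output them as two files, T1 (Read1) and T2 (Read2). Align the Reads in T1 and T2 to the reference genome separately, and use the Naive Bayes method to calculate the probability that an intermediate base is a Deletion or an Insertion under each combination of preceding and following bases. In the prediction process, for an intermediate base under a given combination of preceding and following bases, whether to retain the base is decided from the modeled probability of Deletion or Insertion. If the probability of Deletion is greater than 50%, the intermediate base is retained; otherwise the intermediate base is discarded.
Step 15: Filter the Reads1 data obtained in Step 13 by read length, and name the set of Reads1 sequences with read length ≥25 bp Fa1. Filtering out short reads with a 25 bp cutoff removes some noise sequences and improves the accuracy of mapping the sequencing data.
Step 16: According to the read length of the Reads2 obtained in Step 13, split the corresponding Fa1 obtained in Step 15 into two sets. The set of Reads in Fa1 whose corresponding Read2 is ≥25 bp is named Fa2, and the set of Reads in Fa1 whose corresponding Read2 satisfies 10 bp ≤ Read2 < 25 bp is named Fa3. The purpose of splitting Fa1 into two parts for analysis is to improve the accuracy of the Consensus Reads while reducing the loss of data throughput caused by length filtering.
Step 17: Compare the Q value of each Read in Fa2 obtained in Step 16 with the Q value of the Read at the corresponding coordinate in Reads2 obtained in Step 13. The set of Reads with the higher Q value (the Read from Fa2 when the Q values are equal) is named Fa4, and the set of Reads with the lower Q value (the Read from Reads2 when the Q values are equal) is named Fa5. The purpose of this step is to divide the sequences in Reads1 and Reads2 into a set with relatively high sequencing quality and a set with relatively low sequencing quality, so that the sequences finally output in Consensus Reads are the higher-quality, relatively more accurate Reads of the two sequencing passes.
Step 18: Further filter the Fa4 and Fa5 obtained in Step 17. The set of Reads in Fa4 with Q value ≥ a4 is named Fa6, and the set of Reads in Fa5 whose coordinates correspond one-to-one to the Reads in Fa6 is named Fa7.
Step 19: Align the Reads in Fa6 and Fa7 obtained in Step 18 one by one, grade them by sequence similarity, and correct Fa6 using Fa7 as the reference sequence: mark the positions in each Fa6 sequence that differ from Fa7 and, using the correction model constructed in Step 14, judge one by one whether the base at each differing position is a Deletion or an Insertion, thereby obtaining the corrected Consensus Reads Part1 for output.
The "differing positions" in this step refer only to positions where a base is detected on just one of the two Reads, either the Fa6 Read or the Fa7 Read. In that case the correction model constructed in Step 14 determines whether the base should be retained. Where both Fa6 and Fa7 detect a base at a position but the base types disagree, the Fa6 base prevails; the model does not correct this case.
Step 20: Further filter the Reads in Fa3 obtained in Step 16, and name the set of Reads in Fa3 with Q value ≥ a3 Fa8;
Step 21: Extract the set of Reads in the Reads2 obtained in Step 13 whose coordinates correspond one-to-one to those in Fa8, and name it Fa9;
Step 22: Align the Reads in Fa8 and Fa9 one by one, grade them by sequence similarity, and correct Fa8 using Fa9 as the reference sequence: mark the positions in Fa8 that differ from Fa9 and, using the correction model constructed in Step 14, judge one by one whether the base at each differing position is a Deletion or an Insertion, thereby obtaining the corrected Consensus Reads Part2 for output;
Step 23: According to different application requirements, combine the Consensus Reads Part1 and the Consensus Reads Part2 of different similarity levels to obtain the output Consensus Reads.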
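According to embodiments of the invention, hybridizing the library to the sequencing chip surface adapters in Step 2 comprises (conventional reagents):
1) pre-denature the library to be hybridized at 90-100°C for 2-5 minutes;
2) quickly cool the product of step 1) on an ice-water mixture for at least 2 minutes to obtain the denatured hybridization library stock;
3) dilute the denatured hybridization library stock from step 2) with hybridization solution (e.g. 3× SSC) to a suitable concentration, preferably 0.1~2 nM, to obtain the diluted hybridization library;
4) load 30~50 μL of the diluted hybridization library from step 3) into a sequencing chip channel pretreated with reconstitution reagent, and hybridize at 40~60°C for 10~30 minutes;
5) flow 200~1000 μL of wash solution 1 into the chip channel to remove the diluted hybridization library remaining after the hybridization in step 4);
6) flow 200~1000 μL of wash solution 2 into the chip channel to remove the wash solution 1 from step 5), completing hybridization of the library to the sequencing chip surface adapters.
According to embodiments of the invention, the components of the reconstitution reagent include: wash solution 1, whose components include: 150 mM sodium chloride, 15 mM sodium citrate, 150 mM HEPES (4-(2-hydroxyethyl)piperazine-1-ethanesulfonic acid), and 0.1% sodium dodecyl sulfate.
According to embodiments of the invention, the components of wash solution 3 include: 450 mM sodium chloride and 45 mM sodium citrate.
According to embodiments of the invention, the components of wash solution 2 include: 150 mM sodium chloride and 150 mM HEPES.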
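According to embodiments of the invention, synthesizing the complementary strand of the initial template in Step 3 comprises:
1) flow 200~1000 μL of extension reagent into the chip channel and react at 50~70°C for 5~10 minutes;
2) flow 200~1000 μL of wash solution 1 into the chip channel to remove the spent extension reagent from step 1);
3) flow 200~1000 μL of wash solution 2 into the chip channel to remove the wash solution 1 from step 2), completing synthesis of the complementary strand of the initial template.
According to embodiments of the invention, the components of the extension reagent include: 10~100 U/mL DNA polymerase (preferably Bst DNA polymerase, Bsu DNA polymerase, Klenow DNA polymerase or the like), 0.2~2 mM dNTP, 0.5~2 M betaine, 20 mM Tris, 10 mM sodium chloride, 10 mM potassium chloride, 10 mM ammonium sulfate, 3 mM magnesium chloride, and 0.1% Triton X-100, at pH 8.3.
According to embodiments of the invention, blocking the 3' ends of incompletely extended strands in Step 4 comprises:
1) flow 200~1000 μL of blocking reagent 1 into the chip channel and react at 30~60°C for 5~30 minutes;
2) flow 200~1000 μL of wash solution 1 into the chip channel to remove the spent blocking reagent 1 from step 1), completing the blocking of the 3' ends of incompletely extended strands.
According to embodiments of the invention, the components of blocking reagent 1 include: 10~100 U/mL DNA polymerase (preferably Klenow DNA polymerase, Bsu DNA polymerase, N9 DNA polymerase or the like), 10~100 μM ddNTP, 5 mM manganese chloride, 20 mM Tris, 10 mM sodium chloride, 10 mM potassium chloride, 10 mM ammonium sulfate, 3 mM magnesium chloride, and 0.1% Triton X-100, at pH 8.3.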
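According to embodiments of the invention, removing the initial template in Step 5 comprises:
1) flow 200~1000 μL of denaturing reagent (preferably formamide, 0.1 M NaOH or the like) into the chip channel and react at 50~60°C for 2~5 minutes;
2) flow 200~1000 μL of wash solution 1 into the chip channel to remove the spent denaturing reagent from step 1) and the initial template denatured off the chip;
3) repeat steps 1) and 2) once, completing removal of the initial template.
According to embodiments of the invention, blocking the 3' ends of residual adapters on the chip surface in Step 6 comprises:
1) flow 200~1000 μL of wash solution 2 into the chip channel;
2) flow 200~1000 μL of blocking reagent 2 into the chip channel and react at 30~60°C for 5~30 minutes;
3) flow 200~1000 μL of wash solution 1 into the chip channel to remove the spent blocking reagent 2 from step 2), completing the blocking of the 3' ends of residual adapters on the chip surface.
According to embodiments of the invention, the components of blocking reagent 2 include: 100 U/mL terminal transferase (Terminal Transferase (NEB, M0315L)), 1× Terminal Transferase Buffer, 0.25 mM cobalt chloride, and 10~100 μM ddNTP.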
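According to embodiments of the invention, hybridizing the sequencing primer D7S1T-R2P in Step 7 comprises:
1) dilute the stock of the sequencing primer D7S1T-R2P with wash solution 3 to a suitable concentration, preferably 0.1~1 μM, to obtain the diluted sequencing primer hybridization solution;
2) flow 200~1000 μL of the sequencing primer hybridization solution from step 1) into the chip channel and hybridize at 50~60°C for 10~30 minutes;
3) flow 200~1000 μL of wash solution 1 into the chip channel to remove sequencing primer remaining after the hybridization in step 2);
4) flow 200~1000 μL of wash solution 2 into the chip channel to remove the wash solution 1 from step 3), completing hybridization of the sequencing primer.
According to embodiments of the invention, the Read1 sequencing in Step 8 may be carried out according to the operating instructions of the GenoCare™ single-molecule two-color sequencing universal kit (filing number: 粤深械备20190887号).
According to embodiments of the invention, removal of the nascent sequencing strand in Step 9 is carried out as in Step 5.
According to embodiments of the invention, blocking of the 3' ends of residual nascent strands in Step 10 may be carried out as in Step 4.
According to embodiments of the invention, hybridization of the sequencing primer D7S1T-R2P in Step 11 is carried out as in Step 7.
According to embodiments of the invention, the Read2 sequencing in Step 12 is carried out as in Step 8.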
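According to embodiments of the invention, splitting the sequencing data in Step 13 into two sets of sequences, Reads1 and Reads2, with one-to-one corresponding coordinates comprises:
1) using Python, split each Read in the ".fa_" file output by BaseCalling into two equal halves at the midpoint, according to the number of sequencing cycles, and output them as two ".fa_" files with identical sequence coordinates, "Reads1.fa_" and "Reads2.fa_";
2) using Python, remove the character "_" from all Reads in the "Reads1.fa_" and "Reads2.fa_" files from step 1), and output the files "Reads1.fa" and "Reads2.fa", completing the split of the sequencing data into Reads1 and Reads2 with one-to-one corresponding coordinates.
According to embodiments of the invention, building the correction model in Step 14 comprises:
1) from the Reads1 and Reads2 sequences obtained in Step 13, extract the Reads for which both passes at the same coordinate have a read length ≥25 bp, and output them as two fasta files, T1 (Read1) and T2 (Read2);
2) perform a sliding alignment of each pair of corresponding reads in the T1 and T2 files from step 1), and mark the Bases that are the same and the Bases that differ between the two reads in the alignment, giving the Common Reads;
3) map the T1 and T2 files from step 1) to the reference sequence separately to obtain the Sam1 and Sam2 files;
4) for corresponding reads in Sam1 and Sam2 from step 3) that map to the same position, find the longest common substring in the reference sequence, the Ref Reads;
5) compare the bases that differ between the two passes in the Common Reads from step 2) with the Ref Reads from step 4), and use a naive Bayes method to calculate the probability that a middle base undergoes a Deletion or an Insertion under each combination of preceding and following bases, completing the correction model.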
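The method of any of the above embodiments takes into account the characteristics of sequencing data, particularly data from single-molecule sequencing platforms, and provides an optional sequencing method that uses the adapter D7-S1-T/D9-S2 together with the sequencing primer D7S1T-R2P to perform Two-Pass sequencing and obtain Reads1 and Reads2. In addition, the related embodiments above provide an analysis method suited to the data (Reads1 and Reads2) obtained by the Two-Pass sequencing method, which analyzes them to obtain Consensus Reads. This analysis method can significantly reduce noise sequences and the base error rate in the output Consensus Reads (consensus/common sequences).
In a third aspect, the invention further provides a sequencing result analysis system capable of carrying out the above sequencing result analysis method. Referring to Figures 7~9, according to embodiments of the invention, the system comprises: a sequencing device adapted to obtain sequencing results by the method described above, the sequencing results comprising first sequencing data and second sequencing data, each consisting of a plurality of reads, with at least some of the reads in the first sequencing data having corresponding reads in the second sequencing data; and an analysis device adapted to perform mutual correction based on at least part of each of the first sequencing data and the second sequencing data, so as to obtain the final sequence information.
This system can effectively carry out the sequencing result analysis method described above, so that mutual correction of the results of the two sequencing rounds improves the accuracy of the sequencing result. Moreover, as described above, after the first round of sequencing (the first sequencing), blocking the 3' ends of nascent sequencing strands remaining on the chip surface effectively avoids interfering signals during the second round of sequencing (the second sequencing). This further improves the accuracy of the sequencing result.
The invention also provides a computer-readable storage medium storing a computer program, characterized in that the program, when executed by a processor, implements the steps of the method described above.
The invention also provides an electronic device comprising: the computer-readable storage medium described above; and one or more processors for executing the program in the computer-readable storage medium.
Embodiments of the invention also provide a computer program product comprising instructions which, when the program is executed by a computer, cause the computer to perform the sequencing method and/or the sequencing result analysis method of any of the above embodiments.
The invention is described below with reference to a specific example. Note that this example is merely descriptive and does not limit the invention in any way.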
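Example
This example provides a sequencing and analysis method that reduces the noise and error rate of sequences output by a sequencing platform, particularly a single-molecule sequencing platform.
The sequencing and analysis method of this example comprises:
A sequencing method that uses the adapter D7-S1-T/D9-S2 together with the sequencing primer D7S1T-R2P to perform Two-Pass sequencing and obtain Reads1 and Reads2. The adapter D7-S1-T/D9-S2 consists of the oligonucleotide strand D7-S1-T and D9-S2 carrying a 5'-phosphate modification. The sequence of D7-S1-T is SEQ ID NO: 1, the sequence of D9-S2 is SEQ ID NO: 2, and the sequence of the sequencing primer D7S1T-R2P is SEQ ID NO: 3. The primer sequences and names used in this example are summarized in Table 1.
Table 1
Figure PCTCN2021091279-appb-000002
An analysis method that analyzes the Reads1 and Reads2 obtained by the above Two-Pass sequencing method to obtain Consensus Reads. This analysis method can significantly reduce noise sequences and the base error rate in the output Consensus Reads.
Further, the sequencing method of this example, which uses the adapter D7-S1-T/D9-S2 together with the sequencing primer D7S1T-R2P and performs Two-Pass sequencing on a sequencing platform such as the GenoCare™ single-molecule sequencing platform to obtain Reads1 and Reads2, comprises:
Step 1: Construct the Two-Pass sequencing library. Using the VAHTS Universal DNA Library Prep Kit for Illumina V2 (ND606-01), ligate the annealed D7-S1-T/D9-S2 adapter to prepared fragmented human gDNA. No PCR amplification is needed after ligation; purify directly with VAHTS DNA Clean Beads (N411-01) to obtain the target library.
Specifically, construction of the Two-Pass sequencing library in this example comprises:
1) Fragmentation of human gDNA: using a Covaris instrument with the parameters Peak Power, 75; Duty Factor, 25; Cycles/Burst, 50; Time (s), 250, sonicate 0.1~1 μg of human gDNA to obtain DNA fragments of 100~300 bp. Optionally, this step can also be carried out by enzymatic digestion.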
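2) End repair and A-tailing of the DNA fragments, with the reaction system shown in Table 2.
Table 2
H2O    (16.2-X) μL
End Prep Mix    3.8 μL
DNA fragments (50 ng total)    X μL
Total    20 μL
Reaction conditions: 20°C for 15 minutes, then 65°C for 10 minutes.
3) Ligation of the end-repaired, A-tailed product to the adapter, with the reaction system shown in Table 3.
Table 3
End-repaired, A-tailed product    20 μL
D7-S1-T/D9-S2 adapter (20 μM)    5 μL
Ligation Mix    25 μL
Total    50 μL
Reaction conditions: mix well and leave at room temperature for 15 min.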
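4) Purification of the ligation product
Purification uses the reagents of VAHTS DNA Clean Beads (N411-01) and follows the steps in its instructions, slightly modified so that 10 μL of product is recovered, completing construction of the sequencing library. The steps are as follows:
a) transfer the ligation reaction mixture to a 1.5 mL EP tube, add 0.8× (40 μL) magnetic beads, pipette up and down 10 times to mix, and leave at room temperature for 3 minutes;
b) place the 1.5 mL EP tube on a magnetic rack, let stand for 2-3 minutes, and remove the supernatant;
c) rinse the beads with 200 μL of 80% ethanol, incubate at room temperature for 30 sec, and carefully remove the supernatant;
d) air-dry the beads with the lid open for about 5-10 minutes until the residual ethanol has fully evaporated;
e) remove the tube from the magnetic rack and add 22 μL of deionized water to elute; mix thoroughly, let stand at room temperature for 3 minutes, place on the magnetic rack for 3 minutes, and once the liquid is clear recover 20 μL of product; then add 1.2× (24 μL) magnetic beads, pipette up and down 10 times to mix, and leave at room temperature for 3 minutes;
f) place the 1.5 mL EP tube on the magnetic rack, let stand for 2-3 minutes, and remove the supernatant;
g) repeat steps c)-d) once;
h) remove the tube from the magnetic rack and add 11 μL of deionized water to elute; mix thoroughly, let stand at room temperature for 3 minutes, place on the magnetic rack for 3 minutes, and once the liquid is clear recover 10 μL of product, completing construction of the sequencing library.
5) Quantification and quality control
Measure the concentration of the constructed library with a Qubit 3.0 instrument and the Qubit dsDNA HS assay kit.
Measure the fragment size distribution of the constructed library with the LabChip DNA HS assay kit and a LabChip instrument.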
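Step 2: Hybridize the library obtained in Step 1 to the probes on the sequencing chip surface.
Chip selection:
The chip used is an epoxy-modified chip; probes are immobilized by reacting the amino group on the probe with the epoxy groups on the chip surface, and the probe sequence is SEQ ID NO: 4. This example places no limitation on the surface modification or the probe immobilization method; for instance, reference may be made to published patent application CN109610006A, the entire content of which is incorporated herein.
The library is hybridized to the probes on the chip as follows:
1) take 3 μL of the 20 nM sequencing library constructed in Step 1, add 3 μL of deionized water, mix well, and denature at 95℃ for 5 minutes;
2) quickly place the denatured library from step 1) on an ice-water mixture and cool for at least 2 minutes;
3) add 24 μL of hybridization solution to the product of step 2), diluting the library to a working concentration of 2 nM. The hybridization solution is 3× SSC buffer, prepared by diluting 20× SSC buffer (Sigma, #S6639-1L) with nuclease-free water (RNase-free water);
4) load the 30 μL of diluted hybridization library from step 3) into one channel of the chip, hybridize at 42℃ for 30 minutes, then cool to room temperature;
5) flow 200 μL of wash solution 1 into the hybridization channel from step 4) to remove library not hybridized to the chip surface;
6) flow 200 μL of wash solution 2 into the hybridization channel to replace the wash solution 1 in the channel, completing hybridization of the library to the sequencing chip surface adapters.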
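As a quick check of the dilution in steps 1) to 3), the working concentration follows from the volumes given above. The symbol names are introduced here only for illustration.

```latex
\[
C_{\text{work}}
  = \frac{C_{\text{stock}}\,V_{\text{lib}}}{V_{\text{lib}} + V_{\text{water}} + V_{\text{hyb}}}
  = \frac{20\ \text{nM} \times 3\ \mu\text{L}}{(3 + 3 + 24)\ \mu\text{L}}
  = 2\ \text{nM}
\]
```

The total of 30 μL also matches the volume loaded into the chip channel in step 4).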
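The components of wash solution 1 include: 150 mM sodium chloride, 15 mM sodium citrate, 150 mM HEPES, and 0.1% sodium dodecyl sulfate.
The components of wash solution 2 include: 150 mM sodium chloride and 150 mM HEPES.
Step 3: Synthesize the complementary strand of the initial template.
The initial template is the library hybridized to the probes in Step 2. The complementary strand of the initial template is synthesized as follows:
1) place the chip carrying the hybridized library from Step 2 in the sequencer;
2) pump 750 μL of extension reagent into the hybridization channel of the chip. The extension reagent consists of: 120 U/mL Bst DNA polymerase (NEB, #M0275M), 0.2 mM dNTP (a mixture of dATP, dTTP, dCTP and dGTP at 0.2 mM each), 1 M betaine, 20 mM Tris, 10 mM sodium chloride, 10 mM potassium chloride, 10 mM ammonium sulfate, 3 mM magnesium chloride, and 0.1% Triton X-100, at pH 8.3;
3) heat the chip to 60±0.5℃ and react for 10 minutes;
4) pump 220 μL of wash solution 1 into the hybridization channel to remove the extension reagent;
5) pump 440 μL of wash solution 2 into the hybridization channel to remove the wash solution 1 from step 4), completing synthesis of the complementary strand of the initial template.
Step 4 (optional): Block the 3' ends of incompletely extended nascent strands from Step 3, as follows:
1) cool the chip to 37±0.5℃ and hold for 90 seconds;
2) pump 750 μL of blocking reagent 1 into the channel extended in Step 3 and react for 10 minutes. Blocking reagent 1 consists of: 100 U/mL Klenow DNA polymerase large fragment (3'→5' exo-) (NEB, #M0212M), 12.5 μM ddNTP mix (a mixture of ddATP, ddTTP, ddCTP and ddGTP at 12.5 μM each), 5 mM manganese chloride, 20 mM Tris, 10 mM sodium chloride, 10 mM potassium chloride, 10 mM ammonium sulfate, 3 mM magnesium chloride, and 0.1% Triton X-100, at pH 8.3;
3) flow 220 μL of wash solution 1 into the channel blocked in step 2) to remove the blocking solution remaining after the reaction, completing the blocking of the 3' ends of incompletely extended nascent strands.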
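Step 5: Denature and remove the initial template, as follows:
1) bring the chip to 55±0.5℃;
2) flow 800 μL of formamide into the channel blocked in Step 4 and denature for 2 minutes;
3) flow 220 μL of wash solution 1 into the channel denatured in step 2) to remove the denatured initial template;
4) repeat steps 2) and 3) once, completing removal of the initial template.
Step 6: Block the 3' ends of residual adapters on the chip surface, as follows:
1) cool the chip to 37±0.5℃;
2) flow 440 μL of wash solution 2 into the channel treated in Step 5 to replace the wash solution 1 remaining in the channel;
3) flow 750 μL of blocking reagent 2 into the channel treated in step 2) and react for 15 minutes. Blocking reagent 2 consists of: 100 U/mL Terminal Transferase (NEB, M0315L), 1× Terminal Transferase Buffer, 0.25 mM cobalt chloride, and 100 μM ddNTP mix (a mixture of ddATP, ddTTP, ddCTP and ddGTP at 100 μM each);
4) flow 220 μL of wash solution 1 into the channel blocked in step 3), completing the blocking of the 3' ends of residual adapters on the chip surface.
Step 7: Hybridize the sequencing primer D7S1T-R2P, as follows:
1) heat the chip to 55±0.5℃ and hold for 1 minute;
2) flow 800 μL of diluted sequencing primer hybridization solution into the channel blocked in Step 6 and hybridize for 30 minutes. The diluted sequencing primer hybridization solution is wash solution 3 containing 0.1 μM primer D7S1T-R2P; the components of wash solution 3 include 450 mM sodium chloride and 45 mM sodium citrate;
3) cool the chip to 37±0.5℃ and hold for 90 seconds;
4) flow 220 μL of wash solution 1 into the hybridization channel from step 2) to remove unhybridized sequencing primer from the channel;
5) flow 440 μL of wash solution 2 into the channel treated in step 4) to replace the wash solution 1 remaining in the channel, completing hybridization of the sequencing primer.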
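Step 8: Perform Read1 sequencing, as follows:
Perform 80 cycles of sequencing. The four nucleotides carry two different fluorescent signals, and in each reaction round two nucleotides labeled with different fluorescent signals are added for signal detection.
Step 9: Remove the nascent sequencing strand.
The nascent sequencing strand is removed following the procedure of Step 5.
Step 10: Block the 3' ends of residual nascent strands.
The 3' ends of residual nascent strands are blocked following the procedure of Step 4.
Step 11: Hybridize the sequencing primer D7S1T-R2P.
The sequencing primer D7S1T-R2P is hybridized following the procedure of Step 7.
Step 12: Perform Read2 sequencing.
Read2 sequencing follows the procedure of Step 8.
Step 13: Split the sequencing data into two sets of sequences, Reads1 and Reads2, with one-to-one corresponding coordinates.
Specifically, in this example the split is carried out as follows:
Using Python, split each Read in the ".fa_" file output by BaseCalling for the 160 sequencing cycles into two parts, the first 80 cycles and the last 80 cycles, remove the character "_" from all Reads, and output two ".fa" files with identical sequence coordinates, "Reads1.fa" and "Reads2.fa", completing the split of the sequencing data into Reads1 and Reads2 with one-to-one corresponding coordinates.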
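The split described above can be sketched roughly as follows. This is only an illustration: it assumes the ".fa_" file is FASTA-like (a ">" header line followed by one sequence line), that each character of the sequence line stands for one sequencing cycle, and that "_" marks a cycle without a called base. None of these layout details are spelled out in the text, and the function names are hypothetical.

```python
def split_two_pass(cycle_string, cycles_per_pass=80, pad="_"):
    """Split one per-cycle string into (read1, read2) and drop pad characters."""
    first = cycle_string[:cycles_per_pass]
    second = cycle_string[cycles_per_pass:2 * cycles_per_pass]
    return first.replace(pad, ""), second.replace(pad, "")


def split_records(lines, cycles_per_pass=80):
    """Yield (read_id, read1, read2) from FASTA-like lines.

    Both halves keep the same read ID, which is what ties the two
    passes to the same chip coordinate in later steps.
    """
    read_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            read_id = line[1:]
        elif read_id is not None:
            read1, read2 = split_two_pass(line, cycles_per_pass)
            yield read_id, read1, read2
            read_id = None
```

Writing the two halves to "Reads1.fa" and "Reads2.fa" under the shared ID is then a plain file-output loop.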
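Further, the analysis method of this example, which analyzes the Reads1 and Reads2 obtained by the above Two-Pass sequencing method to obtain Consensus Reads, comprises:
Step 14: Build the correction model.
Specifically, in this example the correction model is built as follows:
1) using Python, extract from the Reads1 and Reads2 sequences obtained in Step 13 the Reads for which both passes at the same coordinate have a read length ≥25 bp, and output them as two files, T1 (Read1) and T2 (Read2). Reads at the same coordinate are matched by giving them the same Reads ID in the different files when the Reads files are generated;
2) align the position-matched Reads in T1 and T2 to each other, and mark in the alignment the Bases on which the two Reads agree and disagree, giving the Common Reads. Position matching is done by checking whether the Reads IDs of the two Reads are identical;
3) map files T1 and T2 to the Reference separately to obtain the Sam1 and Sam2 files. For Reads in Sam1 and Sam2 that correspond in position and map to the same location, find the longest common substring in the Reference, the Ref Reads. The common substring is the region covered by both corresponding Reads after mapping;
4) compare the Common Reads from step 2) with the Ref Reads from step 3). For each discordant Base in the Common Reads, mark whether it is actually present in the Reference. If it is present, it is a Deletion for the Read in which it was not detected. If it is absent, it is an Insertion for the Read in which it was detected;
5) tally the Deletions and Insertions from step 4), together with the types of the Bases immediately before and after each discordant position. This gives the probability of an Insertion or Deletion arising before or after each Base type.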
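Specifically, the naive Bayes model used in this example is as follows:
Figure PCTCN2021091279-appb-000005
Figure PCTCN2021091279-appb-000006
where P(D│XY) denotes the probability that a given base undergoes a Deletion when the bases before and after it are X and Y respectively, with X, Y ∈ [A, C, G, T]; P(D) denotes the probability that a given base undergoes a Deletion; and P(I) denotes the probability that a given base undergoes an Insertion.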
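The two formulas survive only as image placeholders. Assuming the model is a direct application of Bayes' theorem to the symbols defined above, they would plausibly take the following form; P(XY), the overall probability of observing the flanking pair X and Y, is notation introduced here for the sketch and is not taken from the source.

```latex
\[
P(D \mid XY) = \frac{P(XY \mid D)\,P(D)}{P(XY)},
\qquad
P(I \mid XY) = \frac{P(XY \mid I)\,P(I)}{P(XY)}
\]
```

If every discordant base is classified as either a Deletion or an Insertion, P(XY) can be expanded as P(XY│D)P(D) + P(XY│I)P(I), so that the two posteriors sum to 1.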
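By counting how often each pair of preceding and following bases occurs when a Deletion or an Insertion happens, P(XY│D) and P(XY│I) are obtained, from which P(D│XY) and P(I│XY) can be calculated.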
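A minimal sketch of this counting step is shown below. It assumes each discordant base has already been labeled "D" or "I" together with its flanking bases, and that these two classes are the only outcomes, so the evidence term is the sum of the two joint probabilities. The function name and the event tuple layout are hypothetical.

```python
from collections import Counter


def fit_indel_model(events):
    """Estimate P(D|XY) and P(I|XY) from labeled events.

    events: iterable of (prev_base, next_base, kind), kind being "D" or "I".
    Returns {(prev_base, next_base): {"D": p_deletion, "I": p_insertion}}.
    """
    class_counts = Counter()
    context_counts = {"D": Counter(), "I": Counter()}
    for prev_base, next_base, kind in events:
        class_counts[kind] += 1
        context_counts[kind][(prev_base, next_base)] += 1

    total = sum(class_counts.values())
    contexts = set(context_counts["D"]) | set(context_counts["I"])
    model = {}
    for ctx in contexts:
        joint = {}
        for kind in ("D", "I"):
            if class_counts[kind] == 0:
                joint[kind] = 0.0
                continue
            prior = class_counts[kind] / total                         # P(D) or P(I)
            likelihood = context_counts[kind][ctx] / class_counts[kind]  # P(XY|kind)
            joint[kind] = prior * likelihood
        evidence = joint["D"] + joint["I"]                             # P(XY)
        model[ctx] = {kind: joint[kind] / evidence for kind in joint}
    return model
```

With two classes and raw counts, each posterior reduces to the fraction of events in that context carrying the given label; keeping the prior and likelihood separate simply mirrors the notation used in the text.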
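Step 15: Filter by read length to obtain Fa1.
Specifically, in this example the read-length filtering is carried out as follows:
Using Python, read all reads in the Reads1 file line by line; if a Read is at least 25 bp long, write it to the text file Fa1.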
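The filter can be sketched as below, assuming the Reads1 file has already been parsed into a mapping from read ID to sequence. The in-memory representation and the function name are illustrative choices, not the authors' script.

```python
def filter_by_length(reads1, min_len=25):
    """Keep only reads of at least min_len bases (the Fa1 set)."""
    return {read_id: seq for read_id, seq in reads1.items() if len(seq) >= min_len}
```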
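Step 16: Classify the Reads in Fa1 according to the Reads2 read length.
Specifically, in this example the Reads in Fa1 are classified by Reads2 read length as follows:
For every Read in Fa1, read out its corresponding Read in Reads2 and check the length of that Reads2 Read: if Read2 ≥25 bp, save the corresponding Fa1 Read to the Fa2 file; if 10 bp ≤ Read2 < 25 bp, save the corresponding Fa1 Read to the Fa3 file.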
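As a sketch under the same assumption (reads held as ID-to-sequence mappings, names hypothetical), the classification might look like this. Reads whose Read2 partner is shorter than 10 bp, or missing, fall into neither set.

```python
def split_by_read2_length(fa1, reads2, long_min=25, short_min=10):
    """Partition Fa1 reads by the length of their Read2 partner.

    Returns (fa2, fa3): partner >= long_min, and short_min <= partner < long_min.
    """
    fa2, fa3 = {}, {}
    for read_id, seq in fa1.items():
        partner_len = len(reads2.get(read_id, ""))
        if partner_len >= long_min:
            fa2[read_id] = seq
        elif partner_len >= short_min:
            fa3[read_id] = seq
    return fa2, fa3
```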
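Step 17: Output confident Reads according to Q value.
Specifically, in this example the confident Reads are re-output according to Q value as follows:
1) take all Reads in the Fa2 obtained in Step 16, together with their corresponding Reads in Reads2. Parse the Quality Score (Q value) of each Read out of its Reads ID.
2) compare the Q values of the two corresponding Reads; write the Read with the larger Q value to file Fa4 and the Read with the smaller Q value to file Fa5. If the two Q values are equal, by default the Read from Reads1 is written to Fa4 and the Read from Reads2 to Fa5.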
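The comparison can be sketched as follows. Because the layout of the Reads ID is not given, the sketch assumes the Q value has already been parsed and that each read is stored as a (sequence, q_value) pair; that representation and the function name are assumptions made for illustration.

```python
def partition_by_quality(fa2, reads2):
    """Split each Read1/Read2 pair into a higher-Q read (Fa4) and a lower-Q read (Fa5).

    fa2, reads2: dict mapping read_id -> (sequence, q_value).
    On a tie the Read1-side read goes to Fa4, as described in the text.
    """
    fa4, fa5 = {}, {}
    for read_id, first in fa2.items():
        second = reads2[read_id]
        if first[1] >= second[1]:
            fa4[read_id], fa5[read_id] = first, second
        else:
            fa4[read_id], fa5[read_id] = second, first
    return fa4, fa5
```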
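Step 18: Filter the Reads in Fa4 and Fa5 according to Q value.
Specifically, in this example the Reads in Fa4 and Fa5 are filtered by Q value as follows:
For each Read in Fa4, if its Q value is greater than or equal to 60, write it to file Fa6 and write its corresponding Read in Fa5 to file Fa7.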
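A sketch of this filter is given below, assuming reads are held as (sequence, q_value) pairs keyed by read ID; the representation and the function name are illustrative only.

```python
def filter_pairs_by_quality(fa4, fa5, min_q=60):
    """Keep Fa4 reads with Q >= min_q (Fa6) and their Fa5 partners (Fa7)."""
    fa6 = {read_id: rec for read_id, rec in fa4.items() if rec[1] >= min_q}
    fa7 = {read_id: fa5[read_id] for read_id in fa6}
    return fa6, fa7
```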
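Step 19: Correct the Reads in Fa6 using the Reads in Fa7 to obtain Consensus Reads Parts1 (CRP1).
Specifically, in this example the Reads in Fa6 are corrected using the Reads in Fa7 as follows:
1) take each Read in Fa6 and its corresponding Read in Fa7, and align the two Reads to each other to obtain their common consensus sequence. The two sequences are aligned with the Smith-Waterman algorithm; the consensus sequence is the locally best-matching sequence obtained after alignment by adding, deleting or modifying some Bases in the sequences.
2) once the consensus sequence is obtained, use the correction model built in Step 14 to examine each discordant Base position in it. From the base types before and after the position, calculate the probability of a Deletion or Insertion at that position. If the Deletion probability is greater than 50%, the Base detected at that position is considered not to belong there and is deleted. Otherwise, the Base at that position is retained.
3) after all discordant Bases have been corrected, output the corrected Reads; these are CRP1. Here, a discordant Base means specifically a Base that was not detected in both corresponding Reads. If the Base was detected in both passes but the Base types disagree, it is outside the scope of correction in this example; in that case the final Base type is that of the Read in Fa6.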
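The alignment and correction can be sketched as follows. This is a simplified illustration, not the authors' implementation: the scoring values are arbitrary, the flanking bases are taken from the aligned columns, and `p_deletion` is a hypothetical lookup of P(D│XY) keyed by (previous base, next base). The decision direction is exposed as a flag, with the default following the wording of this step (drop the base when the Deletion probability exceeds 50%).

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment of a and b as two gapped strings."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def correct_read(aln_main, aln_ref, p_deletion, drop_when_deletion_likely=True):
    """Resolve columns where only one of the two aligned reads has a base."""
    out = []
    for k, (m, r) in enumerate(zip(aln_main, aln_ref)):
        if m != "-" and r != "-":
            out.append(m)  # both passes called a base: the main (Fa6) base wins
            continue
        base = m if m != "-" else r
        prev_base = out[-1] if out else "N"
        next_base = "N"
        if k + 1 < len(aln_main):
            next_base = aln_main[k + 1] if aln_main[k + 1] != "-" else aln_ref[k + 1]
        likely = p_deletion.get((prev_base, next_base), 0.0) > 0.5
        if likely != drop_when_deletion_likely:
            out.append(base)
    return "".join(out)
```

A read pair would be processed as `correct_read(*smith_waterman(fa6_seq, fa7_seq), p_deletion)`. Contexts missing from the lookup default to a probability of 0, which under the default flag keeps the base.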
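Step 20: Filter the Reads in Fa3 according to Q value.
Specifically, in this example the Reads in Fa3 are filtered by Q value as follows:
Take all Reads in Fa3 and parse the Reads ID of each to obtain its Q value. Write the Reads with Q value ≥60 to file Fa8.
Step 21: Output the Reads in Reads2 that correspond to the Reads in the Fa8 file.
Specifically, in this example the Reads2 Reads are output according to the Reads in Fa8 as follows:
Take all Reads in the Fa8 file, fetch their corresponding Reads in Reads2, and write them to file Fa9.
Step 22: Correct the Reads in Fa8 using the Reads in Fa9 to obtain Consensus Reads Parts2 (CRP2).
Specifically, in this example the correction of the Reads in Fa8 using the Reads in Fa9 follows the procedure of Step 19.
Step 23: According to the sequencing-data accuracy required by different applications, merge and output the Reads in CRP1 and CRP2 that meet the similarity threshold to obtain the Consensus Reads.
Specifically, in this example the Reads in the Consensus Reads Parts are filtered and output according to the accuracy requirements of different applications as follows:
1) set the similarity threshold according to the sequencing-data accuracy required by the application. The similarity thresholds for Part1 and Part2 may differ;
2) calculate the similarity of the Reads in CRP1 and CRP2 separately. The similarity of a Read is the similarity between its corresponding Reads in Reads1 and Reads2. To calculate it, first align the two corresponding Reads to each other, then compute the ratio of the number of concordant Bases to the total number of Bases in the resulting consensus sequence. The alignment method, the consensus sequence and the definition of a discordant Base are as in Step 19;
3) according to the sequencing-data accuracy required by the application, write the Reads in CRP1 and CRP2 that meet the similarity threshold to the final file, obtaining the Consensus Reads; see Table 4.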
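The similarity calculation and the thresholded merge can be sketched as follows. The sketch assumes the two reads are supplied as equal-length gapped alignment strings (with "-" for a gap) and takes the total number of Bases to be the number of alignment columns; both the interpretation and the function names are assumptions.

```python
def similarity(aln_a, aln_b):
    """Fraction of alignment columns in which both reads carry the same base."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned strings must have equal length")
    if not aln_a:
        return 0.0
    same = sum(1 for x, y in zip(aln_a, aln_b) if x == y and x != "-")
    return same / len(aln_a)


def merge_consensus(crp1, crp2, sims, threshold1, threshold2):
    """Merge CRP1 and CRP2 reads whose similarity meets the per-part threshold.

    crp1, crp2: dict read_id -> corrected sequence; sims: dict read_id -> similarity.
    """
    merged = {rid: seq for rid, seq in crp1.items() if sims[rid] >= threshold1}
    merged.update({rid: seq for rid, seq in crp2.items() if sims[rid] >= threshold2})
    return merged
```

Separate thresholds for the two parts reflect the statement in step 1) that Part1 and Part2 may use different cutoffs.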
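Table 4: Comparison of mapping analysis against the reference genome for sequences output under different similarity thresholds
Figure PCTCN2021091279-appb-000007
Note: data loss occurs mainly at the read-length filtering step. Because Read1 and Read2 sequencing are independent events, some sequences inevitably have inconsistent read lengths.
In this specification, descriptions referring to the terms "one embodiment", "some embodiments", "example", "specific example" or "some examples" mean that the specific features, structures, materials or characteristics described in connection with that embodiment or example are included in at least one embodiment or example of the invention. Such expressions do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. In addition, where they do not contradict each other, those skilled in the art may combine the different embodiments or examples described in this specification and their features.
Although embodiments of the invention have been shown and described above, it is to be understood that they are exemplary and are not to be construed as limiting the invention; those of ordinary skill in the art may change, modify, substitute and vary the above embodiments within the scope of the invention.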

Claims (29)

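  1. A sequencing method, characterized by comprising:
    (1) performing a first sequencing on a sequencing template on a chip surface, so as to obtain first sequencing data by forming a first nascent sequencing strand, the sequencing template being attached to the chip surface through a sequencing adapter;
    (2) performing a first blocking treatment on the 3' ends of at least part of the first nascent sequencing strands; and
    (3) performing a second sequencing on the sequencing template, so as to obtain second sequencing data by forming a second nascent sequencing strand.
  2. The sequencing method according to claim 1, characterized in that (2) comprises: removing the first nascent sequencing strand from the chip surface, and performing the first blocking treatment on the 3' ends of the first nascent sequencing strands remaining on the chip surface.
  3. The sequencing method according to claim 1 or 2, characterized by comprising, before (1):
    (1-a) hybridizing library molecules of a sequencing library to the sequencing adapters on the chip surface;
    (1-b) using the library molecules as initial templates, forming the sequencing template by synthesizing complementary strands;
    (1-c) removing the initial templates, and performing a second blocking treatment on the 3' ends of nucleic acid molecules on the chip surface.
  4. The sequencing method according to claim 3, characterized by further comprising, before (1-c):
    (1-b-1) performing a third blocking treatment on the 3' ends of the complementary strands incompletely extended in step (1-b).
  5. The sequencing method according to claim 4, characterized in that the first blocking treatment, the second blocking treatment and the third blocking treatment are each independently carried out by linking the 3'-terminal hydroxyl group to an extension-reaction blocker.
  6. The sequencing method according to claim 5, characterized in that the extension-reaction blocker is ddNTP or a derivative thereof.
  7. The sequencing method according to claim 6, characterized in that the first blocking treatment, the second blocking treatment and the third blocking treatment are each independently carried out with at least one of a DNA polymerase and a terminal transferase.
  8. The sequencing method according to claim 7, characterized in that the first blocking treatment and the third blocking treatment each independently attach the ddNTP or derivative thereof by means of a polymerase, and the second blocking treatment attaches the ddNTP or derivative thereof by means of the terminal transferase.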
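  9. A sequencing result analysis method, characterized in that
    the sequencing result comprises first sequencing data and second sequencing data, each consisting of a plurality of sequencing reads, at least some of the sequencing reads in the first sequencing data having corresponding sequencing reads in the second sequencing data, the first sequencing data and the second sequencing data being obtained by the method according to any one of claims 1~8,
    the sequencing result analysis method comprising:
    (a) performing mutual correction based on at least part of each of the first sequencing data and the second sequencing data, so as to obtain final sequence information.
  10. A sequencing result analysis method, characterized in that
    the sequencing result comprises first sequencing data and second sequencing data, each consisting of a plurality of reads, at least some of the reads in the first sequencing data having corresponding reads in the second sequencing data, the sequencing result analysis method comprising:
    (a) performing mutual correction based on at least part of each of the first sequencing data and the second sequencing data, so as to obtain final sequence information.
  11. The method according to claim 10, characterized in that the first sequencing data and the second sequencing data are obtained by the method according to any one of claims 1~8.
  12. The method according to any one of claims 9-11, characterized in that the mutual correction comprises the following steps:
    selecting, from the first sequencing data and the second sequencing data, high-quality reads and the corresponding reads of the high-quality reads, the reads having a length not less than a predetermined length and a sequencing quality not less than a predetermined quality threshold; and
    aligning the high-quality reads with the corresponding reads of the high-quality reads, and correcting the sequence information based on the alignment result.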
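  13. The method according to any one of claims 9-11, characterized in that (a) comprises:
    (a-1) constructing a first read set from the first sequencing data according to read length, each read in the first read set having a length not less than a first predetermined length;
    (a-2) constructing a second read set and a third read set from the first read set according to the length of the corresponding reads, the corresponding read of each read in the second read set having a length not less than a second predetermined length, and the corresponding read of each read in the third read set having a length within a predetermined length range;
    (a-3) constructing a fourth read set and a fifth read set from the second read set and its corresponding reads, according to the sequencing quality of the reads in the second read set and of their corresponding reads, wherein the fourth read set and the fifth read set are determined according to the following rules:
    comparing the sequencing quality of each read in the second read set with that of its corresponding read,
    selecting the one with the higher sequencing quality as an element of the fourth read set, and the one with the lower sequencing quality as an element of the fifth read set,
    where the sequencing qualities are equal, selecting the read from the second read set as an element of the fourth read set and the corresponding read as an element of the fifth read set;
    (a-4) filtering the fourth read set by sequencing quality to construct a sixth read set, the reads in the sixth read set all having a sequencing quality not less than a first predetermined quality threshold;
    (a-5) using the sixth read set, selecting from the fifth read set the reads corresponding to the reads in the sixth read set, to construct a seventh read set;
    (a-6) aligning the reads of the sixth read set with those of the seventh read set, and determining first difference sites on the reads of the sixth read set; and
    (a-7) correcting the first difference sites using a predetermined sequencing error prediction model to determine first sequence information, the sequencing error prediction model being used to determine the probability that an insertion or a deletion occurred at a difference site during sequencing.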
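  14. The method according to claim 13, characterized in that the sequencing result analysis method further comprises:
    (a-4a) filtering the third read set by sequencing quality to construct an eighth read set, wherein the reads in the eighth read set all have a sequencing quality not less than a second predetermined quality threshold;
    (a-5a) using the eighth read set, selecting from the second sequencing data the reads corresponding to the reads in the seventh read set, to construct a ninth read set;
    (a-6a) aligning the reads of the eighth read set with those of the ninth read set, and determining second difference sites on the reads of the eighth read set; and
    (a-7a) correcting the second difference sites using the sequencing error prediction model to determine second sequence information.
  15. The method according to claim 13 or 14, characterized in that the sequencing error prediction model is obtained by training a naive Bayes model on the results of aligning the first sequencing data and the second sequencing data to a reference genome.
  16. The method according to claim 13 or 14, characterized in that, for the first difference sites and the second difference sites:
    if the read from the sixth read set has a base at the difference site, the corresponding read from the seventh read set has no base at the difference site, and the probability of a deletion at the difference site is 50% or more, the base of the read from the sixth read set at the difference site is retained as the final sequencing result;
    if the read from the sixth read set has no base at the difference site, the corresponding read from the seventh read set has a base at the difference site, and the probability of an insertion at the difference site is 50% or more, the base of the read from the sixth read set at the difference site is retained as the final sequencing result; and
    if the read from the sixth read set has a base at the difference site and the corresponding read from the seventh read set also has a base at the difference site, the base of the read from the sixth read set at the difference site is selected as the final sequencing result.
  17. The method according to claim 13 or 14, characterized in that the first predetermined length and the second predetermined length are each independently not less than 20 bp, preferably not less than 25 bp; and/or
    the predetermined length range is 10~25 bp; and/or
    the first predetermined quality threshold and the second predetermined quality threshold are each independently not less than 50, preferably not less than 60.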
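  18. A sequencing result analysis system, characterized by comprising
    a sequencing device adapted to obtain a sequencing result by the method according to any one of claims 1~8, the sequencing result comprising first sequencing data and second sequencing data, each consisting of a plurality of sequencing reads, at least some of the sequencing reads in the first sequencing data having corresponding sequencing reads in the second sequencing data;
    an analysis device adapted to perform mutual correction based on at least part of each of the first sequencing data and the second sequencing data, so as to obtain final sequence information.
  19. A sequencing result analysis system, characterized by comprising
    a sequencing device for obtaining a sequencing result, the sequencing result comprising first sequencing data and second sequencing data, each consisting of a plurality of reads, at least some of the reads in the first sequencing data having corresponding reads in the second sequencing data;
    an analysis device adapted to perform mutual correction based on at least part of each of the first sequencing data and the second sequencing data, so as to obtain final sequence information.
  20. The system according to claim 19, characterized in that the sequencing device is adapted to obtain the sequencing result by the method according to any one of claims 1~8.
  21. The system according to any one of claims 18-20, characterized in that the mutual correction comprises the following steps:
    selecting, from the first sequencing data and the second sequencing data, high-quality reads and the corresponding reads of the high-quality reads, the reads having a length not less than a predetermined length and a sequencing quality not less than a predetermined quality threshold; and
    aligning the high-quality reads with the corresponding reads of the high-quality reads, and correcting the sequence information based on the alignment result.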
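  22. The system according to claim 19 or 20, characterized in that the analysis device further comprises:
    a first read set determination module, which constructs a first read set from the first sequencing data according to read length, each read in the first read set having a length not less than a first predetermined length;
    a second and third read set determination module, which constructs a second read set and a third read set from the first read set according to the length of the corresponding reads, the corresponding read of each read in the second read set having a length not less than a second predetermined length, and the corresponding read of each read in the third read set having a length within a predetermined length range;
    a fourth and fifth read set determination module, which constructs a fourth read set and a fifth read set from the second read set and its corresponding reads, according to the sequencing quality of the reads in the second read set and of their corresponding reads, wherein the fourth read set and the fifth read set are determined according to the following rules:
    comparing the sequencing quality of each read in the second read set with that of its corresponding read,
    selecting the one with the higher sequencing quality as an element of the fourth read set, and the one with the lower sequencing quality as an element of the fifth read set,
    where the sequencing qualities are equal, selecting the read from the second read set as an element of the fourth read set and the corresponding read as an element of the fifth read set;
    a sixth read set determination module, which filters the fourth read set by sequencing quality to construct a sixth read set, the reads in the sixth read set all having a sequencing quality not less than a first predetermined quality threshold;
    a seventh read set determination module, which uses the sixth read set to select from the fifth read set the reads corresponding to the reads in the sixth read set, to construct a seventh read set;
    a first difference site determination module, which aligns the reads of the sixth read set with those of the seventh read set and determines first difference sites on the reads of the sixth read set; and
    a first sequence information determination module, which corrects the first difference sites using a predetermined sequencing error prediction model to determine first sequence information, the sequencing error prediction model being used to determine the probability that an insertion or a deletion occurred at a difference site during sequencing.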
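  23. The system according to claim 22, characterized in that the sequencing result analysis system further comprises:
    an eighth read set determination module, which filters the third read set by sequencing quality to construct an eighth read set, wherein the reads in the eighth read set all have a sequencing quality not less than a second predetermined quality threshold;
    a ninth read set determination module, which uses the eighth read set to select from the second sequencing data the reads corresponding to the reads in the seventh read set, to construct a ninth read set;
    a second difference site determination module, which aligns the reads of the eighth read set with those of the ninth read set and determines second difference sites on the reads of the eighth read set;
    a second sequence information determination module, which corrects the second difference sites using the sequencing error prediction model to determine second sequence information.
  24. The system according to claim 22 or 23, characterized in that the sequencing error prediction model is obtained by training a naive Bayes model on the results of aligning the first sequencing data and the second sequencing data to a reference genome.
  25. The system according to claim 22 or 23, characterized in that, for the first difference sites and the second difference sites:
    if the read from the sixth read set has a base at the difference site, the corresponding read from the seventh read set has no base at the difference site, and the probability of a deletion at the difference site is 50% or more, the base of the read from the sixth read set at the difference site is retained as the final sequencing result;
    if the read from the sixth read set has no base at the difference site, the corresponding read from the seventh read set has a base at the difference site, and the probability of an insertion at the difference site is 50% or more, the base of the read from the sixth read set at the difference site is retained as the final sequencing result; and
    if the read from the sixth read set has a base at the difference site and the corresponding read from the seventh read set also has a base at the difference site, the base of the read from the sixth read set at the difference site is selected as the final sequencing result.
  26. The system according to claim 22 or 23, characterized in that the first predetermined length and the second predetermined length are each independently not less than 20 bp, preferably not less than 25 bp; and/or
    the predetermined length range is 10~25 bp; and/or
    the first predetermined quality threshold and the second predetermined quality threshold are each independently not less than 50, preferably not less than 60.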
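  27. A computer-readable storage medium storing a computer program, characterized in that the program, when executed by a processor, implements the steps of the method according to any one of claims 1~17.
  28. An electronic device, characterized by comprising:
    the computer-readable storage medium according to claim 27; and
    one or more processors for executing the program in the computer-readable storage medium.
  29. A computer program product comprising instructions which, when the program is executed by a computer, cause the computer to perform the method according to any one of claims 1~17.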
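PCT/CN2021/091279 2020-04-30 2021-04-30 Sequencing method, analysis method therefor and analysis system thereof, computer-readable storage medium, and electronic device WO2021219114A1 (zh)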

Priority Applications (2)

Application Number Priority Date Filing Date Title
US17/922,340 US20230178183A1 (en) 2020-04-30 2021-04-30 Sequencing method, analysis method therefor and analysis system thereof, computer-readable storage medium, and electronic device
EP21797265.2A EP4144745A4 (en) 2020-04-30 2021-04-30 SEQUENCING METHOD, ANALYSIS METHOD AND ANALYSIS SYSTEM, COMPUTER READABLE STORAGE MEDIUM AND ELECTRONIC DEVICE

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
CN202010362587.6 2020-04-30
CN202010362587 2020-04-30
CN202010867569 2020-08-25
CN202010867569.3 2020-08-25
CN202010865293.5A CN113593636B (zh) 2020-04-30 2020-08-25 Sequencing result analysis method and system, computer-readable storage medium, and electronic device
CN202010865293.5 2020-08-25

Publications (1)

Publication Number Publication Date
WO2021219114A1 true WO2021219114A1 (zh) 2021-11-04

Family

ID=78243116

Family Applications (1)

Application Number Title Priority Date Filing Date
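PCT/CN2021/091279 WO2021219114A1 (zh) 2020-04-30 2021-04-30 Sequencing method, analysis method therefor and analysis system thereof, computer-readable storage medium, and electronic device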

Country Status (4)

Country Link
US (1) US20230178183A1 (zh)
EP (1) EP4144745A4 (zh)
CN (1) CN113593637B (zh)
WO (1) WO2021219114A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024093421A1 (zh) * 2022-10-31 2024-05-10 深圳市真迈生物科技有限公司 Surface treatment method, sequencing method and kit

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1131157A (zh) * 1994-12-22 1996-09-18 株式会社日立制作所 DNA preparation method
CN102296065A (zh) * 2011-08-04 2011-12-28 盛司潼 System and method for constructing a sequencing library
CN103543270A (zh) * 2012-07-09 2014-01-29 国家纳米科学中心 Protein in situ expression chip, and preparation method and application thereof
CN107227344A (zh) * 2017-05-16 2017-10-03 北京量化健康科技有限公司 High-density, high-stability nucleic acid chip and preparation method thereof
US20180163251A1 (en) * 2014-12-18 2018-06-14 Bgi Shenzhen Target region enrichment method based on multiplex pcr, and reagent
CN109610006A (zh) 2018-10-12 2019-04-12 深圳市瀚海基因生物科技有限公司 Chip preparation method, method for immobilizing DNA or protein, and chip


Also Published As

Publication number Publication date
CN113593637A (zh) 2021-11-02
CN113593637B (zh) 2024-05-03
EP4144745A1 (en) 2023-03-08
US20230178183A1 (en) 2023-06-08
EP4144745A4 (en) 2024-02-21


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21797265

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2021797265

Country of ref document: EP

Effective date: 20221130