WO2021203461A1 - Position anchoring bar code system for nanopore sequencing library construction

Info

Publication number
WO2021203461A1
Authority
WO
WIPO (PCT)
Prior art keywords
barcode
sequence
anchor
anchored
sequencing
Prior art date
Application number
PCT/CN2020/085645
Other languages
French (fr)
Chinese (zh)
Inventor
戴岩
胡龙
张烨
肖念清
任用
Original Assignee
江苏先声医学诊断有限公司
北京先声医学检验实验室有限公司
Application filed by 江苏先声医学诊断有限公司, 北京先声医学检验实验室有限公司
Publication of WO2021203461A1

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6806 Preparing nucleic acids for analysis, e.g. for polymerase chain reaction [PCR] assay
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00 Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09 Recombinant DNA-technology
    • C12N15/10 Processes for the isolation, preparation or purification of DNA or RNA
    • C12N15/1034 Isolating an individual clone by screening libraries
    • C12N15/1093 General methods of preparing gene libraries, not provided for in other subgroups
    • C CHEMISTRY; METALLURGY
    • C40 COMBINATORIAL TECHNOLOGY
    • C40B COMBINATORIAL CHEMISTRY; LIBRARIES, e.g. CHEMICAL LIBRARIES
    • C40B50/00 Methods of creating libraries, e.g. combinatorial synthesis
    • C40B50/06 Biochemical methods, e.g. using enzymes or whole viable microorganisms
    • C CHEMISTRY; METALLURGY
    • C40 COMBINATORIAL TECHNOLOGY
    • C40B COMBINATORIAL CHEMISTRY; LIBRARIES, e.g. CHEMICAL LIBRARIES
    • C40B70/00 Tags or labels specially adapted for combinatorial chemistry or libraries, e.g. fluorescent tags or bar codes
    • C CHEMISTRY; METALLURGY
    • C40 COMBINATORIAL TECHNOLOGY
    • C40B COMBINATORIAL CHEMISTRY; LIBRARIES, e.g. CHEMICAL LIBRARIES
    • C40B80/00 Linkers or spacers specially adapted for combinatorial chemistry or libraries, e.g. traceless linkers or safety-catch linkers

Definitions

  • The invention relates to the field of gene sequencing, in particular to a position-anchored barcode system for nanopore sequencing library construction.
  • Illumina second-generation sequencing is developing rapidly in China, but it has the following problems when applied to microbial detection. First, second-generation reads are short, all under a few hundred bp, while different microbial species and genera share highly homologous sequences; metagenomic species analysis is therefore inaccurate, and irrelevant microbial information appears in the data report, which adds diagnostic interference for doctors. Second, identifying pathogenic genes and drug-resistance genes at a deeper level requires assembling the sequencing reads, so the more complex analysis costs additional time and money to compensate for the short read length. In addition, second-generation sequencing instruments are expensive and cumbersome to operate, require a high initial investment, and have a long overall sequencing time, which makes them hard to apply to acute infections.
  • The third-generation sequencing technology PacBio greatly improves read length and can detect long fragments of 8-12 kb, or even 40-70 kb, but its library construction process is relatively complicated. Like second-generation sequencing, it also has a long sequencing cycle: after a round of sequencing ends, it takes tens of hours for the data to come off the instrument, and with the subsequent analysis time it is difficult to achieve rapid identification of pathogenic microorganisms.
  • Nanopore sequencing technology makes up for these disadvantages of the other sequencing platforms: it offers long reads as well as short library construction and sequencing times.
  • The equipment is small and portable, and data generation and bioinformatics analysis can run in real time, which removes the constraint on sequencing location and the delay in reporting. This technology is therefore well suited to the analysis and identification of microbial pathogens in clinical infections.
  • However, the chip (flow cell) used for nanopore sequencing is very expensive, which is a burden on users.
  • Using barcode information to distinguish multiple samples is a common cost-saving strategy in high-throughput DNA sequencing. For each DNA sample, a unique barcode sequence is introduced during library construction.
  • After sequencing, the reads are classified according to their barcode sequences to distinguish the different samples loaded on the instrument.
  • Because the chip is expensive, running multiple samples together has a clear economic advantage in nanopore sequencing, allowing users to share the fixed cost of a flow cell.
  • A series of kits from Oxford Nanopore provides 12 different barcodes, each 24 bp long. These barcodes are ligated to both ends of the sample DNA during library construction and then sequenced, so that one chip yields sequence information for 12 different samples at the same time. However, when samples are later distinguished by barcode sequence, the 24 bp barcodes supplied with the library construction kit turn out to be seriously confused with one another.
  • The technical problem to be solved by the present invention is to improve the accuracy of barcode alignment for samples in existing nanopore sequencing data.
  • By mining large amounts of data, the invention categorizes and statistically analyzes the errors that occur during sequence alignment on the nanopore sequencing platform, and quantifies the influence of the different error types on sequence identification.
  • Insertion/deletion (indel) sequencing errors greatly increase the error rate of sequence identification, whereas base mismatches contribute little to it. Blindly increasing the barcode length when designing sample barcodes therefore has a limited effect on accuracy, whereas adding a position anchor sequence to a barcode of the same length improves accuracy more.
  • The present invention constructs a "position-anchored barcode system" containing position anchor sequences and validates it with Nanopore's library construction kit SQK-PBK004: libraries of 10 pure bacterial strains are prepared and sequenced.
  • The original barcode system and the position-anchored barcode system are each used to classify the sequencing output, and the results are compared.
  • The results show that the position-anchored barcode system classifies samples more accurately, by more than 3 orders of magnitude relative to the original barcode system.
  • the first object of the present invention is to provide a position-anchored barcode system that improves the resolution accuracy of sample nanopore sequencing.
  • the second object of the present invention is to provide a preparation method and application of the above-mentioned position-anchored barcode system.
  • the present invention provides a position-anchored barcode system for nanopore sequencing library construction, which is characterized in that the system includes the following structure:
  • the BARCODE is a barcode sequence
  • the ANCHOR is an anchor sequence.
  • the system includes the following structure: FLANK1-[BARCODE-ANCHOR]ₙ-BARCODEₙ₊₁-FLANK2,
  • the FLANK is a flanking sequence
  • the BARCODE sequences are the same or different; preferably, the BARCODE sequences are different;
  • the ANCHOR sequences are the same or different; preferably, the ANCHOR sequences are different.
  • the length of the ANCHOR sequence is 5-50 bp; preferably, the length of the ANCHOR sequence is 10-35 bp;
  • the homology between the ANCHOR sequence and the BARCODE sequence is ≤70%; preferably ≤50%.
  • the length of the FLANK is 10-30 bp; preferably, the length of the FLANK sequence is 15-25 bp;
  • the position-anchored barcode system for nanopore sequencing library construction is characterized in that the system includes any of the following structures:
  • the ANCHOR sequences are different or the same, preferably the ANCHOR sequences are different;
  • the BARCODE sequences are different or the same, preferably the BARCODE sequences are different.
  • The present invention also provides a method for preparing the above-mentioned position-anchored barcode system for nanopore sequencing library construction, characterized in that the method includes directly synthesizing the nucleotide sequence of the position-anchored barcode system, or synthesizing it in segments and then joining the segments to prepare the position-anchored barcode system.
  • The preparation method is as follows: starting from an existing nanopore sequencing library barcode, bridging primers are used to join the existing barcode adapter in series with the designed barcode adapter; preferably, the bridging primer sequence is the ANCHOR in the structure of the position-anchored barcode system.
  • the existing barcode connector is derived from the original barcode of the SQK-PBK004 kit of ONT.
  • the present invention also provides an application of the above-mentioned position-anchored barcode system for nanopore sequencing library building in improving the accuracy of sequencing samples classification.
  • the present invention also provides an application of the above-mentioned position-anchored barcode system for nanopore sequencing library building in reducing false positives of sequencing samples.
  • the invention also provides an application of the above-mentioned position-anchored barcode system for nanopore sequencing library construction in sequencing library construction.
  • the present invention also provides an application of the above-mentioned position-anchored barcode system for nanopore sequencing library construction in sequencing.
  • the present invention also provides a method for constructing a sequencing library, which is characterized in that the position-anchored barcode system of the above-mentioned nanopore sequencing library is used to construct a sequencing library.
  • the present invention also provides a sequencing adapter, characterized in that the sequence of the sequencing adapter includes the position anchor barcode system described above.
  • the present invention also provides a composite, characterized in that the composite is connected to the above-mentioned position-anchored barcode system.
  • the present invention also provides a composition, characterized in that the composition contains the above-mentioned position-anchored barcode system.
  • the present invention also provides a kit for nanopore sequencing library construction, characterized in that the kit contains the above-mentioned position-anchored barcode system or the above-mentioned sequencing adapter.
  • The present invention shows for the first time that indel errors are the main cause of overall sequence alignment errors, whereas base mismatches have less impact.
  • By introducing an anchor sequence into the barcode system, the present invention prevents indel errors from propagating through the overall alignment, greatly reduces the drop in alignment score caused by indels, and screens out interference from distant barcodes, thereby achieving accurate barcode resolution. By comparison, merely increasing the barcode length can moderately reduce sample classification errors caused by base mismatches, but its effect on the accuracy of the overall alignment result is very limited.
  • The position-anchored barcode system improves the accuracy of the results very significantly.
  • The present invention is based on the library construction workflow of the nanopore kit SQK-PBK004. It uses the kit's built-in barcode, joins it to the independently developed barcode sequence, and uses the sequence of the joining region as the anchor sequence, giving a position-anchored barcode system of the type FLANK1-BARCODE1-ANCHOR2-BARCODE2-FLANK2. This system improves the classification accuracy from 0.999 to 0.999999 when distinguishing different samples.
  • The position-anchored barcode system of the present invention allows barcodes of different lengths and different numbers of anchor sequences to be designed according to need, so as to balance classification accuracy against microbial detection rate.
  • The position-anchored barcode system of the present invention offers better resolution and higher accuracy and reduces false-positive identifications; it can improve the accuracy of nanopore sequencing as a whole and reduce sequencing costs, and is suitable for widespread use.
  • Figure 1: Error rate statistics based on sequencing data from the kit barcode system.
  • Panel A shows the mean and median error rate at each site of the barcode adapter sequences of the kit's 10 barcodes in actual sequencing;
  • Panel B shows the mean and median error rates of the three error types (insertion, deletion, and mismatch) at each aligned site of those barcode adapter sequences in actual sequencing;
  • Panel C shows the relationship between the site in the kit barcode adapter and the alignment error;
  • Panel D summarizes the subcategories of alignment error types for the kit barcode adapters, showing the distribution of alignment errors across sites.
  • The abscissa is the base position in the barcode sequence, and the ordinate is the error type.
  • The color of each cell indicates the error rate of that error type at that site; the darker the color, the higher the error rate.
  • The color blocks in the Block annotation area indicate the different components of the adapter sequence, and the error types are clustered by Euclidean distance;
  • Figure 2: The influence of different alignment error types on overall sequence classification accuracy. Panel A shows a total error rate of 8%; Panel B shows a total error rate of 16%;
  • The terms “comprising”, “including”, “having”, “containing” or “involving” are inclusive or open-ended and do not exclude other unlisted elements or method steps.
  • The term “consisting of” is considered a preferred embodiment of the term “comprising”. If a group is defined below as comprising at least a certain number of embodiments, this should also be understood as disclosing a group that preferably consists only of these embodiments.
  • The "position-anchored barcode system" of the present invention refers to a multi-barcode sequencing tag system containing two or more BARCODE sequences in series.
  • The BARCODEs are anchored by a specific ANCHOR sequence.
  • This system can be applied to sequencing library construction for nanopore sequencing, where it improves the accuracy of sample classification and reduces false positives in sample classification.
  • Its specific structure can be the [BARCODE-ANCHOR]ₙ-BARCODEₙ₊₁ described in the present invention, where n ≥ 1, BARCODE is a barcode sequence, and ANCHOR is an anchor sequence. Any composition, compound, or system that includes this structure is within the scope of the present invention.
  • Although the present invention is explained using the barcode of the prior-art SQK-PBK004 kit as an example, this is only an exemplary description and does not limit the present invention.
  • Through bioinformatics analysis and wet-lab validation, the present invention demonstrates that any barcode system containing the [BARCODE-ANCHOR]ₙ-BARCODEₙ₊₁ structure can be used for sequencing library construction, improving the accuracy of sample classification and reducing false positives.
  • The structure of the position-anchored barcode system may be as follows: FLANK1-[BARCODE-ANCHOR]ₙ-BARCODEₙ₊₁-FLANK2, where FLANK is a linker sequence, and the linker sequence is a sequencing library
  • FLANK1 and FLANK2 can be the same or different in sequence according to actual needs.
  • The length of the position-anchored barcode system of the present invention can be chosen by a person skilled in the art according to actual needs.
  • n is 1, 2 or 3.
  • BARCODE is used as a marker sequence in sequencing.
  • The BARCODE sequences can be the same or different; in some preferred embodiments, the BARCODE sequences are different.
  • The ANCHOR sequence is used as an anchor component, and the ANCHOR sequences may be the same or different.
  • In some preferred embodiments, the ANCHOR sequences are different.
  • The length of the ANCHOR sequence can be a length known in the art, for example 5-50 bp. In some preferred embodiments, the length of the ANCHOR sequence is 10-35 bp.
  • The ANCHOR sequence serves as the anchor component for the BARCODE. Its sequence should be distinguishable from the BARCODE sequence and is not otherwise limited.
  • The homology between the ANCHOR sequence and the BARCODE sequence can be ≤80%, ≤70%, ≤60%, ≤50%, ≤40%, ≤30%, ≤20%, or ≤10%; in some preferred embodiments, the homology is ≤50%.
  • the structure can be specifically as follows:
  • The "barcode adapter" in the present invention refers to a complete segment containing a barcode sequence and the flanking sequences at both of its ends.
  • The self-designed barcode adapter is defined as the BBRCD adapter,
  • and the barcode adapter in the original kit SQK-PBK004 is defined as the ABRCD adapter.
  • The "barcode sequence" in the present invention refers to the specific sequence of a barcode; it is contained in the barcode adapter and forms part of the adapter sequence.
  • The independently designed barcode sequence is defined as BBRCD,
  • and the barcode sequence in the original kit SQK-PBK004 is defined as ABRCD.
  • The "anchor sequence" in the present invention refers to a nucleotide sequence used to anchor a BARCODE. Its length can be any validated length known in the art, for example 5-50 bp, and in some embodiments 10-35 bp. Its sequence should be distinguishable from the BARCODE sequence and is not otherwise limited.
  • The homology between the ANCHOR sequence and the BARCODE sequence can be ≤80%, ≤70%, ≤60%, ≤50%, ≤40%, ≤30%, ≤20%, or ≤10%; in some preferred embodiments, the homology is ≤50%.
  • The ANCHOR sequences mentioned in the embodiments of the present invention include SEQ ID NO. 50, SEQ ID NO. 51, SEQ ID NO. 13, and the like.
  • FLANK refers to the flanking sequences at both ends of the barcode system, which are conventional components of sequencing barcode adapters.
  • FLANK1 in the present invention is a Y-type sequencing adapter that connects to the motor protein and ensures that the DNA passes through the nanopore so that sequencing proceeds normally;
  • FLANK2 is used to connect to the sequence of the sample being sequenced. Its length can be any validated length known in the art, for example 10-30 bp, and in some preferred embodiments 15-25 bp.
  • The present invention considers that the main cause of barcode confusion is the difference between the sequenced barcode and the preset true barcode, so the differences between sequenced and true barcode sequences are first collected and organized.
  • The present invention uses 10 barcodes from ONT's SQK-PBK004 kit to build separate libraries of sample DNA, extracts the first 250 bp at the 5' end of each read to ensure that the barcode region is included, and then performs a global-overlap alignment against the corresponding preset barcode adapter.
  • The alignment differences at each position of the barcode adapter sequence are summarized, and the positional distribution and types of errors in the sequenced barcodes are counted.
  • The error types are divided into three categories: insertion (I), deletion (D), and base mismatch (X).
  • In FIG. 1B, there is a significant difference between the mean and median error rates at each site of the barcode adapter, so the present invention further summarizes the alignment errors of the different error types at the different barcode sequence positions.
  • Figure 1C shows that, apart from the high error rate of the first 6 bases at the 5' end caused by the initial instability of sequencing, there is no significant difference in error rate among the remaining positions.
  • XN represents all mismatch types except GA and AG mismatches;
  • I1 represents a random insertion of a single base;
  • I2 and II represent random insertions of 2 bases and of 3 to 5 bases, respectively; insertions of 3, 4, and 5 bases occur in the ratio 8:1.8:0.2;
  • X2 and XX indicate that a mismatched base is randomly introduced at the site, followed by the insertion of one or more bases.
  • Example 2 The influence of different error types and barcode length on the overall accuracy of barcode comparison
  • The present invention simulates 6 groups of barcode sequences with lengths of 20 bp, 40 bp, 60 bp, 80 bp, 100 bp, and 120 bp, with 12 different barcodes simulated in each group.
  • The 80 bp group is taken as an example to describe the simulation in detail.
  • The present invention presets 12 ideal barcode sequences; the sequence information is shown in Table 2:
  • When multiple samples are run on one flow cell, the output data contain DNA carrying several different barcodes, and the barcodes of different samples can be confused with one another.
  • The present invention simulates 100,000 adapter sequences for each preset barcode at each length, and mixes all simulated sequences together to mimic 12 samples being run at the same time.
  • Each simulated sequence is named with its preset barcode name plus a serial number. The data are classified through the same bioinformatics analysis pipeline, and the barcode assigned by classification is then compared with the simulated sequence name to determine whether crosstalk has occurred.
  • Example 3 The influence of inserting anchor sequence on the overall accuracy of barcode comparison
  • the present invention intends to modify the barcode fragment in the following two ways, taking 80bp as an example:
  • The position range of each short barcode is locked using the position information obtained when the anchor sequence is aligned, and the data are finally classified according to the combination of short-barcode results.
  • Taking this embodiment as an example, the existing single barcode adapter of ONT's nanopore library PCR barcoding kit SQK-PBK004 (sequence structure FLANK1-ABRCD-FLANK3') is joined, through a ligation reaction, to the barcode adapter independently designed in the present invention (sequence structure FLANK5'-BBRCD-FLANK2) to form the position-anchored barcode system of the present invention: FLANK1-ABRCD-ANCHOR-BBRCD-FLANK2 (where the ANCHOR sequence is the sequence formed when FLANK3' and FLANK5' are joined by the ligation reaction).
  • FLANK2 is then joined, through a ligation reaction, to sample DNA of known origin, so that the final DNA to be sequenced carries the position-anchored barcode system.
  • This design makes it possible to infer which barcode sequence was attached from the known sample DNA, and thereby to quantify the classification accuracy of the barcode system.
  • Through combinatorial comparison, the present invention obtains 10 barcode sequences (BBRCD) that are highly distinguishable from one another, and then adds a 13 bp conserved flanking sequence FLANK to the 5' end.
  • This sequence has good PCR primer properties (moderate GC content, no hairpin or dimer structures, and so on), and the Y-shaped structure of the original PCR adapter is used to prevent multiple barcodes from being joined in series during the ligation step. The ANCHOR sequence in this experiment is also the PCR bridging primer: its 3'-end sequence (SEQ ID NO. 13) is identical to FLANK5' of the self-designed barcode adapter, and its 5'-end sequence is identical to FLANK3' of the kit's original barcode adapter, so that a single PCR reaction yields DNA fragments for sequencing that carry the tandem double-barcode adapters at both the 5' and 3' ends.
  • The PCR bridging primer sequence is as follows:
  • The underline indicates the 13 bp sequence identical to FLANK5' of the self-designed barcode adapter.
  • This example is aimed at 10 cases of standard pure bacteria strains Brevibacillus borstelensis, Pseudomonas aeruginosa, Escherichia coli, Salmonella enterica, Klebsiella pneumoniae, Listeria monocytogenes, Staphylococcus aureus; Acinetobacter baumannii, Stemallophila subtilis, and the library optimization diagrams used in the construction process
  • Each pure bacteria sample introduces a different position-anchored barcode sequence for library preparation, and specifically prepares 10 sets of position-anchored barcode sequences, as shown in Table 5.
  • Example 4 Based on the experimental procedure of Example 4, single-sample sequencing was performed on 10 different strains, ensuring that each sample was connected to only one barcode. Because only a single barcode is used in each sequencing run, any read assigned to a barcode other than the one corresponding to its sample is counted as a misclassification.
  • For the bioinformatics analysis, the present invention uses guppy, the official software of Oxford Nanopore, to evaluate the sample classification accuracy of the original barcode system, and an in-house software pipeline to evaluate that of the position-anchored barcode system.
  • The present invention examines the resolving power of 10 groups of position-anchored barcodes; the final read classification accuracy is calculated within these 10 groups.
  • The classification accuracy of the position-anchored barcode system reaches 99.9999% on average, as shown in Figure 5.
  • The reads classified into barcode01, barcode02, barcode05, barcode09, and barcode10 reach 100% classification accuracy, consistent with the simulated data.
  • Compared with guppy's 99.9% classification accuracy, this is an improvement of 3 orders of magnitude: a single barcode misclassifies about one read in a thousand, whereas the position-anchored barcode misclassifies about one read in a million, so the false-positive rate caused by misclassified reads is reduced 1000-fold.
  • The classification accuracy of position-anchored barcodes is significantly better than that of single barcodes.
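
The "3 orders of magnitude" figure can be checked with a few lines of arithmetic. The sketch below uses only the two accuracies quoted above (99.9% for the original barcodes under guppy, 99.9999% for the position-anchored system); the helper name `error_reduction` is an illustrative assumption and does not come from the patent.

```python
from fractions import Fraction
import math


def error_reduction(acc_before: str, acc_after: str) -> Fraction:
    """Fold reduction in misclassification rate between two accuracies.

    Accuracies are passed as decimal strings so that Fraction keeps them
    exact (hypothetical helper, for illustration only).
    """
    err_before = 1 - Fraction(acc_before)  # 1/1000 for "0.999"
    err_after = 1 - Fraction(acc_after)    # 1/1000000 for "0.999999"
    return err_before / err_after


fold = error_reduction("0.999", "0.999999")
orders = math.log10(fold)

# One misclassified read per N reads, before and after.
reads_per_error_before = 1 / (1 - Fraction("0.999"))
reads_per_error_after = 1 / (1 - Fraction("0.999999"))

print(f"misclassified reads: 1 in {reads_per_error_before} -> 1 in {reads_per_error_after}")
print(f"fold reduction: {fold}, orders of magnitude: {orders:.0f}")
```

Under these two quoted accuracies, the misclassification rate falls from one read in a thousand to one read in a million, which is the 1000-fold reduction in false positives stated above.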

Abstract

Provided are a position-anchored barcode system for nanopore sequencing library construction, a preparation method for the system, and uses of the system. The position-anchored barcode system has higher resolution and higher classification accuracy and can markedly reduce the false-positive identification rate, improving overall nanopore sequencing precision and reducing sequencing cost.

Description

Position-anchored barcode system for nanopore sequencing library construction
This application claims priority to Chinese patent application No. 202010276679.2, filed with the Chinese Patent Office on April 9, 2020 and entitled "A position-anchored barcode system for nanopore sequencing library construction", the entire contents of which are incorporated herein by reference.
Technical field
The invention relates to the field of gene sequencing, and in particular to a position-anchored barcode system for nanopore sequencing library construction.
Background
Clinical infections currently affect large numbers of patients worldwide, and the sources of infection are highly diverse; in China, infectious diseases account for as much as 49% of the total incidence of all diseases. Routine clinical diagnosis currently relies on the physician's empirical judgment combined with microscopy, biochemical analysis, and similar methods to identify the source of infection. However, human factors, the detection cycle, and the limited detection range easily lead to false detections and missed diagnoses, which is especially detrimental to the diagnosis and treatment of acute infections. With the rapid development of high-throughput sequencing and genomics, metagenomic sequencing, which can identify the microbial composition of a sample quickly, comprehensively, and objectively, is gaining ground in infection diagnostics. It is increasingly used to detect infectious pathogenic microorganisms and provides a more precise diagnostic basis for clinical decision-making and subsequent medication.
Illumina second-generation sequencing is developing rapidly in China, but it has the following problems when applied to microbial detection. First, second-generation reads are short, all under a few hundred bp, while different microbial species and genera share highly homologous sequences; metagenomic species analysis is therefore inaccurate, and irrelevant microbial information appears in the data report, which adds diagnostic interference for doctors. Second, identifying pathogenic genes and drug-resistance genes at a deeper level requires assembling the sequencing reads, so the more complex analysis costs additional time and money to compensate for the short read length. In addition, second-generation sequencing instruments are expensive and cumbersome to operate, require a high initial investment, and have a long overall sequencing time, which makes them hard to apply to acute infections. The third-generation sequencing technology PacBio greatly improves read length and can detect long fragments of 8-12 kb, or even 40-70 kb, but its library construction process is relatively complicated. Like second-generation sequencing, it also has a long sequencing cycle: after a round of sequencing ends, it takes tens of hours for the data to come off the instrument, and with the subsequent analysis time it is difficult to achieve rapid identification of pathogenic microorganisms.
Nanopore sequencing compensates for these weaknesses of other platforms: reads are long, and library construction and sequencing are fast. The device is also small and portable, and data can be analyzed bioinformatically in real time as they are generated, which removes constraints on the sequencing site and avoids delays in reporting. The technology is therefore well suited to the analysis and identification of microbial pathogens in clinical infection. However, nanopore flow cells are expensive. Using barcode sequences to distinguish multiple samples is a common cost-saving strategy in high-throughput DNA sequencing. A unique barcode sequence is introduced into each DNA sample during library construction; after multiple barcoded samples are sequenced together in the same flow cell, the reads are classified by barcode to separate the samples. Because the flow cell is expensive, multiplexing gives nanopore sequencing a clear economic advantage by letting users share the fixed cost of one flow cell. A series of kits from Oxford Nanopore provides 12 different 24 bp barcodes, which are attached to both ends of the sample DNA during library construction so that one flow cell yields sequence information for 12 samples at once. When samples are later separated by barcode, however, the 24 bp kit barcodes are frequently confused with one another. The reason is that converting the current signal into bases (basecalling) on the Oxford Nanopore sequencer produces a per-base error rate in reads of up to 10-15%. When reads are classified by barcode in downstream analysis, barcode misidentification therefore causes cross-contamination of data between samples, which leads to false-positive microbial identifications and seriously complicates clinical decisions.
On this basis, the present invention is proposed.
Summary of the invention
The technical problem addressed by the present invention is to improve the accuracy of sample barcode alignment for existing nanopore sequencing data.
Sample barcode alignment on the nanopore sequencing platform frequently goes wrong, which strongly affects downstream data processing. By mining a large amount of data, the present invention classifies and statistically analyzes the errors that occur in sequence alignment on the nanopore platform and quantifies the effect of different error types on sequence identification. It was surprisingly found that insertion/deletion (indel) sequencing errors greatly increase the error rate of sequence identification, whereas base mismatches have a smaller effect. Consequently, simply lengthening the barcode gives only a limited gain in accuracy, while adding a position anchor sequence to a barcode of the same length gives a larger gain. Based on this finding, the present invention constructs a "position-anchored barcode system" that contains position anchor sequences and validates it with Oxford Nanopore's library construction kit SQK-PBK004. Libraries were built from 10 pure bacterial cultures and sequenced, and the resulting data were classified with both the original barcode system and the position-anchored barcode system. The position-anchored barcode system classified samples more accurately, improving on the original barcode system by more than three orders of magnitude.
Therefore, the first object of the present invention is to provide a position-anchored barcode system that improves the accuracy with which samples are resolved in nanopore sequencing.
The second object of the present invention is to provide a preparation method for the above position-anchored barcode system and its applications.
To achieve the above objects, the present invention provides the following technical solutions:
The present invention provides a position-anchored barcode system for nanopore sequencing library construction, characterized in that the system comprises the following structure:
[BARCODE-ANCHOR]n-BARCODE(n+1)
where n≥1,
BARCODE is a barcode sequence, and
ANCHOR is an anchor sequence.
In some embodiments, the system comprises the following structure: FLANK1-[BARCODE-ANCHOR]n-BARCODE(n+1)-FLANK2,
where FLANK is a flanking sequence.
In some embodiments, 1≤n≤10; preferably, n is 1, 2, or 3.
In some embodiments, the BARCODE sequences are the same or different; preferably, the BARCODE sequences are different.
In some embodiments, the ANCHOR sequences are the same or different; preferably, the ANCHOR sequences are different.
In some embodiments, the ANCHOR sequence is 5-50 bp long; preferably, it is 10-35 bp long.
In some embodiments, the homology between the ANCHOR sequence and the BARCODE sequence is <70%; preferably, <50%.
In some embodiments, the FLANK sequence is 10-30 bp long; preferably, it is 15-25 bp long.
In some embodiments, the position-anchored barcode system for nanopore sequencing library construction is characterized in that the system comprises any one of the following structures:
FLANK1-BARCODE1-ANCHOR1-BARCODE2-FLANK2;
FLANK1-BARCODE1-ANCHOR1-BARCODE2-ANCHOR2-BARCODE3-FLANK2;
FLANK1-BARCODE1-ANCHOR1-BARCODE2-ANCHOR2-BARCODE3-ANCHOR3-BARCODE4-FLANK2;
In some embodiments, the ANCHOR sequences are different or the same; preferably, the ANCHOR sequences are different.
In some embodiments, the BARCODE sequences are different or the same; preferably, the BARCODE sequences are different.
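The structure FLANK1-[BARCODE-ANCHOR]n-BARCODE(n+1)-FLANK2 can be illustrated with a minimal Python sketch that assembles the segments in order and checks the constraints stated above (n≥1, n+1 BARCODE segments, ANCHOR length 5-50 bp, FLANK length 10-30 bp). The class name `AnchoredBarcode`, its fields, and the placeholder sequences are hypothetical and chosen only for illustration; they are not sequences or tooling from the present invention.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AnchoredBarcode:
    """Hypothetical container for one position-anchored barcode construct."""
    flank1: str
    barcodes: List[str]  # n + 1 BARCODE segments
    anchors: List[str]   # n ANCHOR segments
    flank2: str

    def validate(self) -> None:
        n = len(self.anchors)
        if n < 1:
            raise ValueError("at least one ANCHOR is required (n >= 1)")
        if len(self.barcodes) != n + 1:
            raise ValueError("exactly n + 1 BARCODE segments are required")
        for anchor in self.anchors:
            if not 5 <= len(anchor) <= 50:
                raise ValueError("ANCHOR length must be 5-50 bp")
        for flank in (self.flank1, self.flank2):
            if not 10 <= len(flank) <= 30:
                raise ValueError("FLANK length must be 10-30 bp")

    def sequence(self) -> str:
        """Return FLANK1-[BARCODE-ANCHOR]n-BARCODE(n+1)-FLANK2 as one string."""
        self.validate()
        parts = [self.flank1]
        # zip stops after n pairs because there are only n anchors.
        for barcode, anchor in zip(self.barcodes, self.anchors):
            parts += [barcode, anchor]
        parts += [self.barcodes[-1], self.flank2]
        return "".join(parts)


# Placeholder sequences (made up for this sketch), n = 1.
example = AnchoredBarcode(
    flank1="A" * 10,
    barcodes=["CCCC", "GGGG"],
    anchors=["TTTTT"],
    flank2="A" * 10,
)
print(example.sequence())
```

Keeping the segments separate until the final join makes it straightforward to vary n or swap individual BARCODE and ANCHOR segments.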
The present invention also provides a method for preparing the above position-anchored barcode system for nanopore sequencing library construction, characterized in that the method comprises directly synthesizing the nucleotide sequence of the position-anchored barcode system, or synthesizing it in segments and then ligating the segments.
In some embodiments of the present invention, a position-anchored barcode system that incorporates an existing barcode adapter is prepared as follows: starting from the existing nanopore library barcode, a bridging primer is used to join the existing barcode adapter in series with the designed barcode adapter; preferably, the bridging primer sequence serves as the ANCHOR in the structure of the position-anchored barcode system.
In some embodiments of the present invention, the existing barcode adapter is the original barcode of the ONT SQK-PBK004 kit.
The present invention also provides use of the above position-anchored barcode system for nanopore sequencing library construction in improving the accuracy of sequencing sample classification.
The present invention also provides use of the above position-anchored barcode system for nanopore sequencing library construction in reducing false positives in sequencing sample classification.
The present invention also provides use of the above position-anchored barcode system for nanopore sequencing library construction in constructing sequencing libraries.
The present invention also provides use of the above position-anchored barcode system for nanopore sequencing library construction in sequencing.
The present invention also provides a method for constructing a sequencing library, characterized in that the sequencing library is constructed using the above position-anchored barcode system for nanopore sequencing library construction.
The present invention also provides a sequencing adapter, characterized in that the sequence of the sequencing adapter contains the above position-anchored barcode system.
The present invention also provides a complex, characterized in that the complex is linked to the above position-anchored barcode system.
The present invention also provides a composition, characterized in that the composition contains the above position-anchored barcode system.
The present invention also provides a kit for nanopore sequencing library construction, characterized in that the kit contains the above position-anchored barcode system or the above sequencing adapter.
Beneficial technical effects of the present invention:
1) The present invention demonstrates for the first time that indel errors are the main cause of errors in overall sequence alignment, whereas base-mismatch errors have a smaller effect. In practice, introducing an anchor sequence into the barcode system confines indel errors so that they do not propagate through the whole alignment, greatly reduces the loss of alignment score caused by indels, and filters out interference from distant barcodes, giving accurate barcode discrimination. By contrast, merely lengthening the barcode can modestly reduce sample misclassification caused by base mismatches but improves overall alignment accuracy only slightly, while the position-anchored barcode system of the present invention improves accuracy very markedly.
2) Building on the library construction workflow of the nanopore SQK-PBK004 kit, the present invention joins the kit's own barcode to an independently developed barcode sequence and uses the sequence at the junction as the anchor sequence, giving a position-anchored barcode system of the type FLANK1-BARCODE1-ANCHOR1-BARCODE2-FLANK2. When distinguishing different samples, this system raises classification accuracy from 0.999 to 0.999999.
3) In practical use, the barcode length and the number of anchor sequences in the position-anchored barcode system of the present invention can be designed to suit different requirements, balancing classification accuracy against microbial detection rate.
4) The position-anchored barcode system of the present invention offers better resolution, higher accuracy, and fewer false-positive identifications. It can improve the overall precision of nanopore sequencing and reduce sequencing cost, and is suitable for wide use.
Description of the drawings
To explain the specific embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below show some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
Figure 1. Error rate statistics from sequencing data for the kit barcode system. Panel A shows the mean and median error rate at each position in actual sequencing of the 10 kit barcode adapter sequences. Panel B shows the mean and median error rate at each position for the three error types: insertion, deletion, and mismatch. Panel C shows the relationship between position in the kit barcode adapter and alignment errors. Panel D summarizes the subclasses of alignment error types for the kit barcode adapter and shows how alignment errors are distributed across positions: the horizontal axis is the base position in the barcode sequence, the vertical axis is the error type, the shade of each cell indicates the error rate of that error type at that position (darker means a higher error rate), the colored blocks in the Block annotation track mark the different elements of the adapter sequence, and the error types are clustered by Euclidean distance.
Figure 2. Effect of different alignment error types on the accuracy of overall sequence classification. Panel A: total error rate of 8%; Panel B: total error rate of 16%.
Figure 3. Effect on overall barcode alignment accuracy of no anchor sequence (ANCHOR=0 group), one anchor sequence (ANCHOR=1 group), and two anchor sequences (ANCHOR=2 group).
Figure 4. Schematic diagram of the original and optimized library construction workflows.
Figure 5. Comparison of sample classification accuracy between the original barcode system and the position-anchored barcode system. Shaded cells indicate correctly classified results; all other results are misclassifications.
Detailed description of embodiments
The technical solutions of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
Definitions of selected terms
Unless otherwise defined below, all technical and scientific terms used in the specific embodiments of the present invention are intended to have the meanings commonly understood by those skilled in the art. Although the following terms are believed to be well understood by those skilled in the art, the definitions below are set out to better explain the present invention.
As used in the present invention, the terms "comprising", "including", "having", "containing", and "involving" are inclusive or open-ended and do not exclude other unlisted elements or method steps. The term "consisting of" is considered a preferred embodiment of the term "comprising". If a group is defined below as comprising at least a certain number of embodiments, this should also be understood as disclosing a group that preferably consists only of those embodiments.
The terms "about" and "substantially" in the present invention denote a range of accuracy within which, as those skilled in the art will understand, the technical effect of the feature in question is still ensured. The terms typically indicate a deviation of ±10%, preferably ±5%, from the stated value.
An indefinite or definite article used with a singular noun, such as "a", "an", or "the", includes the plural of that noun.
In addition, the terms first, second, third, (a), (b), (c), and the like in the specification and claims are used to distinguish similar elements and do not necessarily describe a sequence or chronological order. Terms so used are interchangeable under appropriate circumstances, and the embodiments described in the present invention can be carried out in orders other than those described or illustrated.
The following terms and definitions are provided only to aid understanding of the present invention. They should not be construed as having a scope narrower than that understood by those skilled in the art.
Some technical terms used in the present invention are explained as follows:
The "position-anchored barcode system" of the present invention is a multi-barcode sequencing tag system in which two or more BARCODE sequences are connected in series and anchored to one another by specific ANCHOR sequences. The system can be used to construct sequencing libraries for nanopore sequencing, where it improves the accuracy of sequencing sample classification and reduces false positives in sample classification. Its structure may be [BARCODE-ANCHOR]n-BARCODE(n+1) as described herein, where n≥1, BARCODE is a barcode sequence, and ANCHOR is an anchor sequence. Any composition, complex, or system that contains this structure falls within the scope of the present invention. Although the invention is illustrated using the barcodes of the prior-art SQK-PBK004 kit, this is only an example and does not limit the invention. Bioinformatic analysis and wet-lab validation in the present invention have demonstrated that any barcode system containing the [BARCODE-ANCHOR]n-BARCODE(n+1) structure can be used to construct sequencing libraries, improve the accuracy of sequencing sample classification, and reduce false positives in sample classification. In some preferred embodiments of the present invention, the structure of the position-anchored barcode system may be FLANK1-[BARCODE-ANCHOR]n-BARCODE(n+1)-FLANK2, where FLANK is an adapter sequence, a conventional component of sequencing library construction whose inclusion will be understood by those skilled in the art; FLANK1 and FLANK2 may have the same or different sequences as required.
In the embodiments of the present invention, a large amount of data was mined to classify and statistically analyze the errors that occur in sequence alignment on the nanopore sequencing platform and to quantify the effect of different error types on sequence identification. Indel sequencing errors were found to greatly increase the error rate of sequence identification, whereas base mismatches have a smaller effect, so simply lengthening the sample barcode gives only a limited gain in accuracy. Some embodiments also address sequence length. For example, Example 2 states: "To reach an overall alignment accuracy of 99.99%, the barcode needs to be only 40 bp long when a base-mismatch error rate of 0.16 is introduced, but 80 bp long when an indel error rate of 0.16 is introduced." Those skilled in the art can therefore choose the length of the position-anchored barcode system according to practical needs. For example, in some embodiments of the present invention, 1≤n≤10, such as n = 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10; preferably, n is 1, 2, or 3.
As the tag sequences used in sequencing, the BARCODE sequences in the position-anchored barcode system of the present invention may be the same or different; in some preferred embodiments, the BARCODE sequences are different. Likewise, the ANCHOR sequences, which act as anchoring components, may be the same or different; in some preferred embodiments, the ANCHOR sequences are different. The ANCHOR sequence may have any length known in the art, for example 5-50 bp; in some preferred embodiments, the ANCHOR sequence is 10-35 bp long.
As the anchoring component for BARCODE, the ANCHOR sequence should be distinguishable from the BARCODE sequence but is not otherwise particularly limited. The homology between the ANCHOR sequence and the BARCODE sequence may be <80%, <70%, <60%, <50%, <40%, <30%, <20%, or <10%; in some preferred embodiments, the homology is <50%.
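A candidate ANCHOR can be screened against the BARCODE sequences with a short Python sketch. The text above does not specify how homology is computed, so the sketch assumes, purely for illustration, that it is approximated by percent identity derived from edit distance; the function names and the default 50% threshold parameter are hypothetical.

```python
from typing import Iterable


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # deletion from a
                cur[j - 1] + 1,            # insertion into a
                prev[j - 1] + (ca != cb),  # match or substitution
            ))
        prev = cur
    return prev[-1]


def percent_identity(a: str, b: str) -> float:
    """Assumed stand-in for 'homology': 100 * (1 - distance / longer length)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - edit_distance(a, b) / longest)


def anchor_is_distinct(anchor: str, barcodes: Iterable[str],
                       threshold: float = 50.0) -> bool:
    """True if the anchor is below the threshold against every barcode."""
    return all(percent_identity(anchor, bc) < threshold for bc in barcodes)
```

A production design check would more likely use an established aligner and would also compare reverse complements; both are omitted here to keep the sketch short.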
As examples, the position-anchored barcode system of the present invention may have any of the following structures:
FLANK1-BARCODE1-ANCHOR1-BARCODE2-FLANK2;
FLANK1-BARCODE1-ANCHOR1-BARCODE2-ANCHOR2-BARCODE3-FLANK2;
FLANK1-BARCODE1-ANCHOR1-BARCODE2-ANCHOR2-BARCODE3-ANCHOR3-BARCODE4-FLANK2;
FLANK1-BARCODE1-ANCHOR1-BARCODE2-ANCHOR2-BARCODE3-ANCHOR3-BARCODE4-ANCHOR4-BARCODE5-FLANK2;
............
The "barcode adapter" of the present invention is the complete segment comprising the barcode sequence and the flanking sequences at both ends. For example, in the embodiments of the present invention, the independently designed barcode adapter is called the BBRCD adapter, and the barcode adapter in the original SQK-PBK004 kit is called the ABRCD adapter.
The "barcode sequence" of the present invention is the specific sequence of the barcode; it is contained in the barcode adapter and forms part of the adapter sequence. For example, in the embodiments of the present invention, the independently designed barcode sequence is called BBRCD, and the barcode sequence in the original SQK-PBK004 kit is called ABRCD.
The "anchor sequence (ANCHOR)" of the present invention is a nucleotide sequence used to anchor BARCODE. It may have any suitable length known in the art, for example 5-50 bp, and in some preferred embodiments 10-35 bp. Its sequence should be distinguishable from the BARCODE sequence but is not otherwise particularly limited; the homology between the ANCHOR sequence and the BARCODE sequence may be <80%, <70%, <60%, <50%, <40%, <30%, <20%, or <10%, and in some preferred embodiments is <50%. Examples of ANCHOR sequences mentioned in the embodiments of the present invention include SEQ ID NO.50, SEQ ID NO.51, and SEQ ID NO.13.
"FLANK" in the present invention refers to the flanking sequences at the two ends of the barcode system, which are conventional components of sequencing barcode adapters. On the nanopore sequencing platform, for example, FLANK1 is used to connect the Y-shaped sequencing adapter that carries the motor protein, ensuring that DNA passes through the nanopore and is sequenced normally, and FLANK2 is used to connect the sample sequence to be sequenced. A FLANK sequence may have any suitable length known in the art, for example 10-30 bp, and in some preferred embodiments 15-25 bp. Examples of FLANK sequences mentioned in the embodiments of the present invention include SEQ ID NO.16 and SEQ ID NO.26.
The present invention is further described by the accompanying drawings and the following examples, which only illustrate specific embodiments and should not be construed as limiting the scope of the invention in any way. Unless otherwise stated, the experimental methods disclosed in the present invention use techniques conventional in the art, and the reagents and raw materials used in the examples are commercially available.
Example 1: Alignment error statistics
Because barcode confusion arises mainly from sequence differences between the sequenced barcode and the expected true barcode, the differences between sequenced and true barcode sequences were first classified. To this end, sample DNA was built into a separate library with each of the 10 barcodes of the ONT SQK-PBK004 kit and sequenced. The first 250 bp at the 5' end of each read was extracted to ensure that the barcode region was included and was then globally aligned (overlap alignment) against the corresponding expected barcode adapter. Finally, the alignment differences at each position of the barcode adapter sequence were compiled from the output multiple-alignment file, and the positional distribution and types of errors in the sequenced barcodes were tallied. Errors fall into three classes: insertion (I), deletion (D), and base mismatch (X).
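The tallying of insertions, deletions, and mismatches can be sketched in Python as follows. The aligner and scoring scheme used in this example are not stated, so the sketch assumes a simple dynamic-programming overlap alignment in which the whole reference adapter must be aligned while unaligned read bases before and after it are free, with assumed scores of +1 (match), -1 (mismatch), and -1 (gap). The function name `count_errors` and the label `M` for matches are hypothetical.

```python
from collections import Counter

MATCH, MISMATCH, GAP = 1, -1, -1  # assumed scoring, for illustration only


def count_errors(reference: str, read: str) -> Counter:
    """Align the full reference inside the read and count M, X, I, D events."""
    m, n = len(reference), len(read)
    # score[i][j]: best score aligning reference[:i] to a segment ending at read[:j]
    score = [[0] * (n + 1) for _ in range(m + 1)]  # row 0 is 0: free read prefix
    for i in range(1, m + 1):
        score[i][0] = i * GAP
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = MATCH if reference[i - 1] == read[j - 1] else MISMATCH
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # match or mismatch
                score[i - 1][j] + GAP,      # reference base missing from read
                score[i][j - 1] + GAP,      # extra base in read
            )
    # Free read suffix: start the traceback at the best column of the last row.
    j = max(range(n + 1), key=lambda col: score[m][col])
    i = m
    counts = Counter()
    while i > 0:
        sub = MATCH if j > 0 and reference[i - 1] == read[j - 1] else MISMATCH
        if j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            counts["M" if sub == MATCH else "X"] += 1
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + GAP:
            counts["D"] += 1  # deletion: reference base absent from the read
            i -= 1
        else:
            counts["I"] += 1  # insertion: read base absent from the reference
            j -= 1
    return counts


# Toy sequences (made up): one base of the reference is deleted in the read.
print(dict(count_errors("ACGTACGT", "ACGACGT")))
```

Dividing each count by the reference length, and recording the reference position of each event during traceback, would give per-position rates of the kind plotted in Figure 1.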
Results: First, overall, the 10 sets of sequencing data comprised 28,543,061 reads after initial filtering of the fastq files, but only 24,075,634 reads entered the error statistics, so about 15.65% of reads had no alignment result and were filtered out. Across the 10 barcode sequences included in the statistics, the mean per-position error rate in actual sequencing was 8.01% and the median was 5.80% (Figure 1A); the data set with the highest mean error rate reached 10.31%, and the one with the lowest was still 6.53%.
其次,从比对的三大错误类型展开(图1B),10组条码序列实际测序中发生插入的错误率平均为2.28%,发生缺失的错误率平均为3.58%,发生碱基错配的平均错误率为2.16%;错误率中位值分别为1.70%,2.16%和1.57%。Secondly, starting from the three types of alignment errors (Figure 1B), the average insertion error rate in the actual sequencing of 10 sets of barcode sequences is 2.28%, the average deletion error rate is 3.58%, and the average base mismatch The error rate was 2.16%; the median error rates were 1.70%, 2.16% and 1.57%.
由图1B,条码接头各位点的错误率平均值与中位值存在显著差异,于是本发明进一步汇总了不同错误类型在不同的条码序列位置上的比对错误。图1C表示除了测序初始不稳定导致的5’端前6个碱基错误率偏高以外,其余位置没有明显的错误率差异。From Fig. 1B, there is a significant difference between the average and median error rates of each point of the barcode connector, so the present invention further summarizes the alignment errors of different error types at different barcode sequence positions. Figure 1C shows that, except for the high error rate of the first 6 bases at the 5'end caused by the initial instability of sequencing, there is no significant error rate difference in the remaining positions.
Furthermore, because position has little effect on error type, the present invention summarized the error subtypes in detail without regard to position (Figure 1D). A mismatch is written as the preset base of the reference sequence followed by the mutated base of the query sequence; for example, GA means that G was read as A. A deletion at an aligned site is written as the deleted base followed by the letter D, for example GD. An insertion at an aligned site is written as I followed by the inserted base; for example, IC means that a C was inserted at the site. An insertion of 2 bases is I2, and an insertion of 3 or more bases is II. A mismatch at a site followed by the insertion of 1 base is X2, and a mismatch followed by the insertion of 2 or more bases is XX. Judging from the distances between error types in Figure 1D, mismatch probabilities between bases are similar and show no obvious bias, except for the GA and AG mismatch types. The 4 rows with the darkest average color are exactly the deletion probabilities of the 4 bases; their mean error rates are all about 0.09, again with no obvious bias toward deleting a particular base. Finally, for insertions, the 5 rows with the most uniform color distribution show that the probabilities of inserting different single bases are close to one another and fall at the same cluster level as the 2-base insertion type I2. In summary, the data for the error subtypes are as follows:
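The subtype notation described above can be expressed as a small labeling function. This is a sketch under assumptions: the function name and its arguments are hypothetical, and the case of a deletion that is also followed by inserted bases is not defined in the source, so the sketch simply labels it as a deletion.

```python
from typing import Optional


def subtype_label(ref_base: str, read_base: Optional[str], inserted: str = "") -> Optional[str]:
    """Return the error-subtype label for one reference site.

    ref_base  : preset base at this site of the reference barcode
    read_base : base observed in the read, or None if the site is deleted
    inserted  : extra read bases inserted immediately after this site
    Returns None when the site is read correctly with no insertion.
    """
    n = len(inserted)
    if read_base is None:
        # Deletion: deleted base followed by D, e.g. "GD".
        # Assumption: any insertion after a deleted site is ignored here.
        return ref_base + "D"
    if read_base != ref_base:
        if n == 0:
            return ref_base + read_base      # plain mismatch, e.g. "GA"
        return "X2" if n == 1 else "XX"      # mismatch followed by insertion
    if n == 0:
        return None                          # correct site
    if n == 1:
        return "I" + inserted                # single-base insertion, e.g. "IC"
    return "I2" if n == 2 else "II"          # 2 bases, or 3 and more
```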
Table 1.
Error type    Error rate
D             0.03580
I1            0.01575
XN            0.01340
GA            0.00548
AG            0.00267
I2            0.00368
II            0.00130
X2            0.00172
XX            0.00032
In the table above, XN denotes all mismatch types other than the GA and AG mismatches. I1 denotes the random insertion of a single base. I2 and II denote the random insertion of 2 bases and of 3 to 5 bases, respectively, with insertions of 3, 4, and 5 bases occurring in the ratio 8:1.8:0.2. X2 and XX denote the random introduction of a mismatched base at a site followed by the insertion of 1 base or of more than one base, respectively.
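As a consistency check, the subtype rates of Table 1 can be summed back into the three error classes reported in Example 1. The grouping used below, with X2 and XX counted under insertion, is an inference from the arithmetic (it reproduces the reported 2.28%, 3.58%, and 2.16% class means to within rounding); the source does not state it explicitly. The names `TABLE1`, `CLASS_OF`, and `class_totals` are illustrative.

```python
# Per-site error rate of each subtype, copied from Table 1.
TABLE1 = {
    "D": 0.03580, "I1": 0.01575, "XN": 0.01340,
    "GA": 0.00548, "AG": 0.00267, "I2": 0.00368,
    "II": 0.00130, "X2": 0.00172, "XX": 0.00032,
}

# Assumed mapping of subtypes to the three classes of Example 1.
CLASS_OF = {
    "D": "deletion",
    "I1": "insertion", "I2": "insertion", "II": "insertion",
    "X2": "insertion", "XX": "insertion",
    "XN": "mismatch", "GA": "mismatch", "AG": "mismatch",
}


def class_totals(table):
    """Sum subtype rates into insertion / deletion / mismatch totals."""
    totals = {}
    for subtype, rate in table.items():
        cls = CLASS_OF[subtype]
        totals[cls] = totals.get(cls, 0.0) + rate
    return totals


totals = class_totals(TABLE1)      # insertion ~0.0228, deletion 0.0358, mismatch ~0.0216
overall = sum(TABLE1.values())     # ~0.0801, the mean per-site error rate
```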
Example 2. Effect of different error types and barcode length on overall barcode alignment accuracy
Using the error types and corresponding error rates from Example 1, the present invention simulated barcode sequences at 6 lengths: 20 bp, 40 bp, 60 bp, 80 bp, 100 bp, and 120 bp, with 12 different barcode elements simulated at each length. The 80 bp set is used here to illustrate the simulation. First, 12 ideal barcode sequences were preset; their sequences are given in Table 2:
Table 2.
Figure PCTCN2020085645-appb-000001
Figure PCTCN2020085645-appb-000002
Figure PCTCN2020085645-appb-000003
Errors were then introduced at a total per-site error rate of 0.08 in three different modes: indels only, base mismatches only, and both indels and base mismatches. Within each mode, the detailed error-rate proportions follow Table 1 and are scaled up or down in proportion to the total error rate. When several samples are loaded on one flow cell, the output data contain several barcode DNAs, and barcodes can be confused between samples. For each preset barcode at each length, the present invention therefore simulated 100,000 adapter sequences and pooled all simulated sequences to mimic 12 samples sequenced together. Each simulated sequence was named with its preset barcode name plus a numeric suffix. The data were classified with the same bioinformatics pipeline, and the barcode assigned by classification was compared with the simulated sequence name to determine whether crosstalk had occurred.
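The proportional scaling described above can be sketched as follows. This is an illustration under assumptions, not the patent's simulator: the mode names and the function `scaled_rates` are hypothetical, and the choice to leave the combined subtypes X2 and XX out of the indel-only and mismatch-only modes is an assumption, since the source does not say how they were handled.

```python
# Per-site error rate of each subtype, copied from Table 1.
TABLE1 = {
    "D": 0.03580, "I1": 0.01575, "XN": 0.01340,
    "GA": 0.00548, "AG": 0.00267, "I2": 0.00368,
    "II": 0.00130, "X2": 0.00172, "XX": 0.00032,
}

# Subtypes active in each simulation mode (assumed assignment).
MODES = {
    "indel": ("D", "I1", "I2", "II"),
    "mismatch": ("XN", "GA", "AG"),
    "both": tuple(TABLE1),
}


def scaled_rates(mode: str, total: float) -> dict:
    """Rescale the active subtypes so they sum to `total`,
    keeping their relative proportions from Table 1."""
    subset = {name: TABLE1[name] for name in MODES[mode]}
    factor = total / sum(subset.values())
    return {name: rate * factor for name, rate in subset.items()}
```

For example, `scaled_rates("indel", 0.08)` yields per-site rates for D, I1, I2, and II that sum to 0.08 while D stays about 2.3 times as frequent as I1, as in Table 1.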
Results: As shown in Figure 2A, at a total error rate of 0.08 the overall alignment accuracy of the barcode sequences increases gradually with barcode length, whatever type of error is introduced. At equal length, barcodes carrying indel errors are aligned with markedly lower overall accuracy than barcodes carrying base-mismatch errors. Both conclusions still hold when the total error rate is raised to 0.16 (Figure 2B). For example, to reach 99.99% overall alignment accuracy, a barcode needs to be only 40 bp long when mismatch errors are introduced at a rate of 0.16, but 80 bp long when indel errors are introduced at the same rate. Indel errors therefore affect alignment accuracy more than base mismatches do, while barcode length has only a limited effect on alignment accuracy.
Example 3. Effect of inserting anchor sequences on overall barcode alignment accuracy
Based on the conclusions of Example 2, the present invention modifies the barcode fragment in the following two ways, taking 80 bp as an example:
1) Insertion of 1 anchor sequence: the middle 12 bases of the barcode sequence (underlined in Table 2) are replaced with a 12 bp anchor sequence that is the same for every barcode, so that the original preset barcode fragment becomes a fragment of the same length in the form "short barcode - anchor sequence - short barcode";
2) Insertion of 2 anchor sequences: the positions 20 bp from each end of the barcode sequence in Table 2 (shown in bold in Table 2) are replaced with the 12 bp anchor sequence 1 and anchor sequence 2, respectively, giving a fragment of the same length in the form "short barcode - anchor sequence 1 - short barcode - anchor sequence 2 - short barcode".
In both cases, the position information obtained for the anchor sequence during alignment is used to lock the positional range of the short barcodes, and the data are finally classified according to the combination of short barcodes.
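The single-anchor design and the position-locking idea can be sketched as follows. This is a minimal illustration, not the in-house pipeline: the anchor is located with a plain semi-global edit-distance search (an assumed stand-in for whatever aligner the inventors used), and the function names are hypothetical. The anchor sequence is SEQ ID NO.50 from the source.

```python
ANCHOR = "GGTGCTGTTAAC"  # SEQ ID NO.50


def insert_single_anchor(barcode: str, anchor: str = ANCHOR) -> str:
    """Replace the middle len(anchor) bases of a barcode with the anchor."""
    left = (len(barcode) - len(anchor)) // 2
    return barcode[:left] + anchor + barcode[left + len(anchor):]


def find_anchor(read: str, anchor: str = ANCHOR):
    """Best approximate occurrence of the anchor in the read.

    Semi-global edit distance: the whole anchor is aligned to any
    substring of the read. Returns (start, end, edits), where
    read[start:end] is the matched substring.
    """
    n = len(read)
    prev_cost = [0] * (n + 1)           # empty anchor prefix matches anywhere
    prev_start = list(range(n + 1))
    for i in range(1, len(anchor) + 1):
        cur_cost = [i] + [0] * n
        cur_start = [0] * (n + 1)
        for j in range(1, n + 1):
            mismatch = int(anchor[i - 1] != read[j - 1])
            cur_cost[j], cur_start[j] = min(
                (prev_cost[j - 1] + mismatch, prev_start[j - 1]),  # match or mismatch
                (prev_cost[j] + 1, prev_start[j]),                 # anchor base deleted
                (cur_cost[j - 1] + 1, cur_start[j - 1]),           # extra read base
            )
        prev_cost, prev_start = cur_cost, cur_start
    end = min(range(n + 1), key=lambda j: prev_cost[j])
    return prev_start[end], end, prev_cost[end]


def short_barcode_windows(read: str, start: int, end: int, arm_len: int, pad: int = 0):
    """Read segments on either side of the located anchor.

    `pad` widens each window to tolerate indels inside the short barcodes.
    """
    left = read[max(0, start - arm_len - pad):start]
    right = read[end:end + arm_len + pad]
    return left, right
```

Each window would then be matched only against the corresponding short barcodes, so an indel on one side of the anchor does not shift the alignment of the other side.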
Results: The underlined 12 bp in Table 2 were replaced with the specific anchor sequence (ANCHOR) GGTGCTGTTAAC (SEQ ID NO.50). The error types and error distribution from Example 1 were introduced at each site, with the total error rate distributed randomly within the range E ± 0.5E (E = 0.08, 0.12, or 0.16), and 60 million barcode sequences were simulated for each value of E;
In the same way, the 12 bp segments shown in bold in Table 2 were replaced with the anchor sequence GGTGCTGTTAAC (SEQ ID NO.50) and the anchor sequence GTACGGAAGTCG (SEQ ID NO.51), respectively, for sequence simulation.
The data were then classified by using the anchor sequences to lock the barcode position range during alignment, and classification accuracy was compared between no anchor sequence (ANCHOR=0 group), 1 anchor sequence (ANCHOR=1 group), and 2 anchor sequences (ANCHOR=2 group). As Figure 3 shows, classification accuracy improves markedly once sequence anchoring is used:
At E = 0.08, the classification accuracy of the 80 bp ANCHOR=0 group reaches only 99.9999%, whereas the ANCHOR=1 and ANCHOR=2 groups, which contain anchor sequences, both reach 100%;
At E = 0.12, resolution ranks as follows: ANCHOR=2 group > ANCHOR=1 group > ANCHOR=0 group;
At E = 0.16, the differences in classification accuracy between the three groups widen, and for the simulated data generated by the present invention the classification accuracy of the ANCHOR=2 group remains 100%.
Examples 2 and 3 show that increasing barcode length improves sample-resolution accuracy to some extent, but the gain diminishes as length grows, and an order-of-magnitude improvement is hard to achieve. Anchoring the barcode region by inserting anchor sequences markedly reduces the score penalty caused by indels in the alignment; accuracy can reach 100% on the simulated data, and classification accuracy rises markedly as the number of anchor sequences increases. Given the 10-15% single-base error rate of basecalling on Oxford Nanopore sequencers and the total number of reads in one complete sequencing run, it is estimated that introducing an anchor sequence into the barcode sequence would improve accuracy on real sequencing data by at least 3 orders of magnitude.
Example 4. Preparation of the position-anchored barcode system and library construction
To further verify the above theory by experiment, this example builds on ONT's original nanopore library-preparation PCR barcoding kit SQK-PBK004. The kit's existing single-barcode adapter (sequence structure FLANK1-ABRCD-FLANK3') is joined by a ligation reaction to a barcode adapter designed in the present invention (sequence structure FLANK5'-BBRCD-FLANK2) to form the position-anchored barcode system of the present invention: FLANK1-ABRCD-ANCHOR-BBRCD-FLANK2, where the ANCHOR sequence is the sequence formed when FLANK3' and FLANK5' are joined by the ligation reaction. FLANK2 is then joined by a further ligation reaction to sample DNA of known origin, so that the DNA finally sequenced carries the position-anchored barcode system. With this design, the attached barcode sequence can be inferred from the known sample DNA result, which allows the classification accuracy of the barcode system to be quantified.
The construction procedure is as follows:
Based on the barcode sequence information of other ONT kits, the present invention obtained, by combinatorial alignment, 10 barcode sequences (BBRCD) that are very well distinguished from one another. A 13 bp conserved flanking sequence, FLANK, was then added to the 5' end. This sequence has good PCR primer properties: moderate GC content and no hairpin or dimer structures. The Y-shaped structure of the original PCR adapter was retained to prevent multiple barcodes from concatenating during the ligation step. In this experiment, the ANCHOR sequence is also the 3'-end sequence of the PCR bridging primer (SEQ ID NO.13), which matches FLANK5' of the self-designed barcode adapter, while the 5'-end sequence of the primer matches FLANK3' of the kit's original barcode adapter. A single PCR reaction therefore yields a DNA fragment for sequencing that carries the tandem double-barcode adapter at both its 5' and 3' ends.
The PCR bridging primer sequences are as follows:
F1: 5'-TTCTGTTGGTGCTGATATTGC CCGACTTCCGTAC-3' (SEQ ID NO.13)
F2: 5'-ACTTGCCTGTCGCTCTATCTTC CCGACTTCCGTAC-3' (SEQ ID NO.14)
Note: the underlined sequence (the final 13 bp, CCGACTTCCGTAC) matches FLANK5' of the self-designed barcode adapter.
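The relationship between the two bridging primers can be checked directly from the printed sequences. The sketch below confirms that F1 and F2 share the same 13 bp 3' tail. It also shows that the last 12 bases of that tail are the reverse complement of the anchor GTACGGAAGTCG (SEQ ID NO.51) used in Example 3; this second point is an observation computed from the sequences as printed, not a statement made in the source.

```python
# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


# Bridging primers as printed (5' to 3'), written as kit-specific part + shared tail.
F1 = "TTCTGTTGGTGCTGATATTGC" + "CCGACTTCCGTAC"    # SEQ ID NO.13
F2 = "ACTTGCCTGTCGCTCTATCTTC" + "CCGACTTCCGTAC"   # SEQ ID NO.14
ANCHOR2 = "GTACGGAAGTCG"                           # SEQ ID NO.51

shared_tail = F1[-13:]                              # 13 bp matching FLANK5'
tails_match = shared_tail == F2[-13:]
tail_is_rc_of_anchor2 = reverse_complement(ANCHOR2) == F1[-12:]
```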
The sequences of the self-designed barcode adapter FLANK5'-BBRCD-FLANK2 are shown in Table 3:
Table 3
Figure PCTCN2020085645-appb-000004
Figure PCTCN2020085645-appb-000005
The sequences of the original barcode adapter FLANK1-ABRCD-FLANK3' of the SQK-PBK004 kit are shown in Table 4:
Table 4
Figure PCTCN2020085645-appb-000006
This example uses 10 standard pure bacterial strains: Brevibacillus borstelensis, Pseudomonas aeruginosa, Escherichia coli, Salmonella enterica, Klebsiella pneumoniae, Listeria monocytogenes, Staphylococcus aureus, Acinetobacter baumannii, Bacillus subtilis, and Stenotrophomonas maltophilia. With the optimized library-construction workflow shown in Figure 4, a different position-anchored barcode sequence was introduced into each pure-culture sample for library preparation. Ten position-anchored barcode sequences were prepared, as shown in Table 5.
Table 5
Position-anchored barcode sequence    ABRCD      BBRCD
barcode01                             ABRCD01    BBRCD01
barcode02                             ABRCD02    BBRCD02
barcode03                             ABRCD03    BBRCD03
barcode04                             ABRCD04    BBRCD04
barcode05                             ABRCD05    BBRCD05
barcode06                             ABRCD06    BBRCD06
barcode07                             ABRCD07    BBRCD07
barcode08                             ABRCD08    BBRCD08
barcode09                             ABRCD09    BBRCD09
barcode10                             ABRCD10    BBRCD10
The specific nucleotide sequences of the 10 anchored barcodes are shown in Table 6, in which the underlined part is the ANCHOR sequence.
Table 6
Figure PCTCN2020085645-appb-000007
Figure PCTCN2020085645-appb-000008
The library preparation steps are as follows:
(1) Annealing
1. Dilute the synthesized lyophilized adapters to 100 μM with annealing buffer (1 mM EDTA; 50 mM NaCl; 5 mM Tris-HCl, pH 7.5);
2. Mix the complementary strands in equimolar amounts (4 μl each), incubate at 95°C for 5 min, and cool slowly to room temperature (about 25°C) on a PCR instrument;
3. Dilute the annealed adapters to 640 mM with annealing buffer;
4. Store at 4°C until use.
(2) End repair
1. In a 0.2 ml PCR tube, bring 100 ng of nucleic acid sample to 50 μl with water;
2. Add 7 μl Ultra II End-prep reaction buffer and 3 μl Ultra II End-prep enzyme mix, mix by rotation at room temperature, incubate at 20°C for 5 min and then at 65°C for 5 min, and return to room temperature;
3. Add 1× beads (AMPure XP beads), mix by rotation and incubate at room temperature for 5 min, spin down briefly, place on a magnetic rack, and discard the supernatant;
4. Wash the beads twice with 200 μl of freshly prepared 80% ethanol;
5. Add 17 μl nuclease-free water, mix by rotation and incubate at room temperature for 2 min, spin down briefly, and place on the magnetic rack until the solution clears;
6. Carefully transfer 15 μl of supernatant to a 0.2 ml PCR tube.
(3) Adapter ligation
1. To the 0.2 ml PCR tube (15 μl end-prepped DNA), add 10 μl of diluted adapter and 25 μl Blunt/TA Ligase Master Mix, mix by pipetting, and incubate at 21°C for 25-30 min;
2. Add 0.4× beads, mix by rotation and incubate at room temperature for 5 min, spin down briefly, place on the magnetic rack, and discard the supernatant;
3. Wash the beads twice with 200 μl of freshly prepared 80% ethanol;
4. Add 15 μl nuclease-free water, mix by rotation and incubate at room temperature for 2 min, spin down briefly, and place on the magnetic rack until the solution clears;
5. Carefully transfer 13.5 μl of supernatant to a 0.2 ml PCR tube.
(4) Template amplification
1. Prepare the following reaction mix:
Component                          Volume (μl)
Platinum SuperFi PCR Master Mix    25
SuperFi GC Enhancer                10
DNA                                13.5
RYP-F                              0.25
RYP-R                              0.25
ABRCD primer                       1
2. After preparing the mix, vortex to mix, spin down briefly, place on the PCR instrument, and set up the program;
Figure PCTCN2020085645-appb-000009
3. When the reaction ends, add 0.6× beads, mix by rotation and incubate at room temperature for 5 min, spin down briefly, place on the magnetic rack, and discard the supernatant;
4. Wash the beads twice with 200 μl of freshly prepared 80% ethanol;
5. Add 12 μl of elution buffer (10 mM Tris-HCl, 50 mM NaCl, pH 8.0), mix by rotation and incubate at room temperature for 2 min, spin down briefly, and place on the magnetic rack until the solution clears; carefully transfer 11 μl of supernatant to a 1.5 ml low-binding centrifuge tube;
6. QC: measure 1 μl with Qubit;
7. Add 1 μl RAP to the 10 μl of eluted supernatant, react at room temperature for 5 min, and prepare for loading;
8. Sequence the 10 strains as single-sample runs following the standard nanopore sequencing loading workflow.
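The template amplification mix in step (4) adds up to a 50 μl reaction. The sketch below tallies the listed volumes and scales the shared components for several reactions. The volumes come from the table in the source; the helper names, the decision to treat DNA and the ABRCD primer as per-sample additions, and the 10% pipetting overage are assumptions made for illustration only.

```python
# Volumes per reaction in microlitres, from the template amplification table.
PCR_MIX_UL = {
    "Platinum SuperFi PCR Master Mix": 25.0,
    "SuperFi GC Enhancer": 10.0,
    "DNA": 13.5,
    "RYP-F": 0.25,
    "RYP-R": 0.25,
    "ABRCD primer": 1.0,
}


def total_volume(mix: dict) -> float:
    """Total reaction volume in microlitres."""
    return sum(mix.values())


def shared_master_mix(mix: dict, n_reactions: int,
                      per_sample=("DNA", "ABRCD primer"),
                      overage: float = 0.10) -> dict:
    """Volumes of the components that can be pooled for n reactions.

    Components in `per_sample` differ between samples and are left out.
    `overage` is an assumed allowance for pipetting loss.
    """
    factor = n_reactions * (1 + overage)
    return {name: vol * factor for name, vol in mix.items()
            if name not in per_sample}
```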
Example 5. Detection and analysis of actual samples
Following the experimental steps of Example 4, the 10 strains were sequenced as single samples, with each sample ligated to only one barcode. Because each run used a single barcode, any read in a sample's data that aligned to a barcode other than the one assigned to that sample was counted as misclassified. For the bioinformatics analysis, Oxford Nanopore's official software guppy was used to evaluate the sample classification accuracy of the original barcode system, and an in-house software pipeline was used to evaluate that of the position-anchored barcode system. The present invention examined the resolving power of the 10 position-anchored barcodes, and the final read classification accuracy was calculated within this set of 10 barcodes.
Results: Sample classification with the original barcode system showed clear confusion. In Figure 5, the accuracy of guppy for reads classified to barcode06 is only 99.954%: 0.036% were confused with barcode07 and about 0.01% with other barcodes. The average accuracy was 99.984%, corresponding to a confusion rate of 0.016%. This shows that, when multiple samples are sequenced together, the low accuracy of single-barcode sample resolution can easily cause high-abundance microorganisms in one sample to "leak" into others, producing false-positive detections in other samples and misleading subsequent clinical diagnosis and treatment decisions.
With the position-anchored barcode system, classification accuracy averaged 99.9999%, as shown in Figure 5. Reads classified to barcode01, barcode02, barcode05, barcode09, and barcode10 reached 100%, the same classification accuracy as with the simulated data. Compared with guppy's 99.9% resolution accuracy, this is an improvement of 3 orders of magnitude: the scale at which samples can be accurately distinguished rises from thousands of reads for a single barcode to millions of reads, and the false-positive rate caused by read misclassification falls 1000-fold. In summary, position-anchored barcodes classify significantly more accurately than single barcodes.
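The fold reduction in misclassification can be worked out from the stated accuracies. The sketch below uses exact decimal arithmetic on the percentages; the function names are hypothetical. With the rounded 99.9% baseline quoted above, the reduction is 1000-fold (3 orders of magnitude). With the 99.984% average reported for guppy in the results, the same calculation gives 160-fold, or about 2.2 orders of magnitude.

```python
import math
from decimal import Decimal


def error_percent(accuracy_percent: str) -> Decimal:
    """Misclassification rate in percent, from an accuracy in percent."""
    return Decimal("100") - Decimal(accuracy_percent)


def fold_reduction(before: str, after: str) -> Decimal:
    """How many times smaller the error rate is after the change."""
    return error_percent(before) / error_percent(after)


fold_vs_rounded = fold_reduction("99.9", "99.9999")     # 0.1 / 0.0001
fold_vs_average = fold_reduction("99.984", "99.9999")   # 0.016 / 0.0001
orders_vs_rounded = math.log10(fold_vs_rounded)
orders_vs_average = math.log10(fold_vs_average)
```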
The above description of specific embodiments does not limit the present application. Those skilled in the art may make various changes or modifications based on the present application, and all such changes or modifications fall within the scope of the appended claims provided they do not depart from the spirit of the present application.

Claims (15)

  1. A position-anchored barcode system for nanopore sequencing library construction, characterized in that the system comprises the following structure:
    [BARCODE-ANCHOR]_n-BARCODE_(n+1)
    wherein n≥1,
    the BARCODE is a barcode sequence, and
    the ANCHOR is an anchor sequence.
  2. The position-anchored barcode system of claim 1, characterized in that 1≤n≤10; preferably, n is 1, 2, or 3.
  3. The position-anchored barcode system of claim 2, characterized in that the structure is
    FLANK1-[BARCODE-ANCHOR]_n-BARCODE_(n+1)-FLANK2,
    wherein the FLANK is a flanking sequence.
  4. The position-anchored barcode system of claim 2 or 3, characterized in that the BARCODE sequences are the same or different; preferably, the BARCODE sequences are different.
  5. The position-anchored barcode system of any one of claims 2-4, characterized in that the ANCHOR sequences are the same or different; preferably, the ANCHOR sequences are different.
  6. The position-anchored barcode system of any one of claims 2-5, characterized in that the ANCHOR sequence is 5-50 bp long; preferably, the ANCHOR sequence is 10-35 bp long.
  7. The position-anchored barcode system of any one of claims 2-6, characterized in that the homology between the ANCHOR sequence and the BARCODE sequence is <70%; preferably, the homology between the ANCHOR sequence and the BARCODE sequence is <50%.
  8. The position-anchored barcode system of any one of claims 2-6, characterized in that the structure is any one of the following:
    FLANK1-BARCODE_1-ANCHOR_1-BARCODE_2-FLANK2;
    FLANK1-BARCODE_1-ANCHOR_1-BARCODE_2-ANCHOR_2-BARCODE_3-FLANK2;
    or FLANK1-BARCODE_1-ANCHOR_1-BARCODE_2-ANCHOR_2-BARCODE_3-ANCHOR_3-BARCODE_5-FLANK2.
  9. A method for preparing the position-anchored barcode system of any one of claims 1-8, characterized in that the method comprises directly synthesizing the nucleotide sequence of the position-anchored barcode system, or preparing the position-anchored barcode system by synthesizing it in segments and then ligating the segments.
  10. A method for constructing a sequencing library, characterized in that the sequencing library is constructed using the position-anchored barcode system of any one of claims 1-8.
  11. A sequencing adapter, characterized in that the sequencing adapter comprises the position-anchored barcode system of any one of claims 1-8.
  12. A complex, characterized in that the complex is linked to the position-anchored barcode system of any one of claims 1-8.
  13. A composition, characterized in that the composition comprises the position-anchored barcode system of any one of claims 1-8.
  14. A kit for nanopore sequencing library construction, characterized in that the kit comprises the position-anchored barcode system of any one of claims 1-8, or the sequencing adapter of claim 11.
  15. Use of the position-anchored barcode system of any one of claims 1-8, characterized in that the use is any one of the following:
    1) use in improving the classification accuracy of sequencing samples;
    2) use in reducing false positives in the classification of sequencing samples;
    3) use in sequencing library construction;
    4) use in sequencing.
PCT/CN2020/085645 2020-04-09 2020-04-20 Position anchoring bar code system for nanopore sequencing library construction WO2021203461A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010276679.2 2020-04-09
CN202010276679.2A CN111440846B (en) 2020-04-09 2020-04-09 Position anchoring bar code system for nanopore sequencing library building

Publications (1)

Publication Number Publication Date
WO2021203461A1 true WO2021203461A1 (en) 2021-10-14

Family

ID=71651430

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/085645 WO2021203461A1 (en) 2020-04-09 2020-04-20 Position anchoring bar code system for nanopore sequencing library construction

Country Status (2)

Country Link
CN (1) CN111440846B (en)
WO (1) WO2021203461A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112029823B (en) * 2020-09-03 2021-07-23 江苏先声医疗器械有限公司 Metagenome library building method of nanopore sequencing platform and kit thereof
CN112176032B (en) * 2020-10-16 2021-10-26 广州市达瑞生物技术股份有限公司 Primer combination for nanopore sequencing and library building of respiratory pathogens and application thereof
CN114480740B (en) * 2022-02-18 2023-10-24 杭州柏熠科技有限公司 Targeting sequencing library construction and detection method suitable for 15 plant quarantine viruses

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105986324A (en) * 2015-02-11 2016-10-05 深圳华大基因研究院 Construction method and application of cyclic small RNA library
CN106282161A (en) * 2016-08-12 2017-01-04 成都诺恩生物科技有限公司 Special capture and repeat replication low frequency DNA base variation method and application
CN110475864A (en) * 2017-02-02 2019-11-19 纽约基因组研究中心公司 For identification or the method and composition of quantization target in the biological sample
WO2020036926A1 (en) * 2018-08-17 2020-02-20 Cellecta, Inc. Multiplex preparation of barcoded gene specific dna fragments

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2513793B (en) * 2012-01-26 2016-11-02 Nugen Tech Inc Compositions and methods for targeted nucleic acid sequence enrichment and high efficiency library generation
EP3143159B1 (en) * 2014-05-13 2019-01-02 Life Technologies Corporation Systems and methods for validation of sequencing results
CN105989249B (en) * 2014-09-26 2019-03-15 南京无尽生物科技有限公司 Method, system and device for assembling genome sequences
GB201616590D0 (en) * 2016-09-29 2016-11-16 Oxford Nanopore Technologies Limited Method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ANNA L. MCNAUGHTON, HANNAH E. ROBERTS, DAVID BONSALL, MARIATERESA DE CESARE, JOLYNNE MOKAYA, SHEILA F. LUMLEY, TANYA GOLUBCHIK, PA: "Illumina and Nanopore methods for whole genome sequencing of hepatitis B virus (HBV)", SCIENTIFIC REPORTS, vol. 9, 7081, 1 December 2019 (2019-12-01), pages 1 - 14, XP055856580 *
BRANDON D. WILSON, MICHAEL EISENSTEIN, H. TOM SOH: "High-Fidelity Nanopore Sequencing of Ultra-Short DNA Targets", ANALYTICAL CHEMISTRY, vol. 91, no. 10, 21 May 2019 (2019-05-21), US, pages 6783 - 6789, XP055856583, ISSN: 0003-2700, DOI: 10.1021/acs.analchem.9b00856 *
ZHANG DE-FANG, MA QIU-YUE, YIN TONG-MING, XIA TAO: "The Third Generation Sequencing Technology and Its Application", CHINA BIOTECHNOLOGY, vol. 33, no. 5, 31 December 2013 (2013-12-31), pages 125 - 131, XP055856586, DOI: 10.13523/j.cb.20130520 *

Also Published As

Publication number Publication date
CN111440846A (en) 2020-07-24
CN111440846B (en) 2020-12-18

Similar Documents

Publication Publication Date Title
WO2021203461A1 (en) Position anchoring bar code system for nanopore sequencing library construction
CN108893466B (en) Sequencing joint, sequencing joint group and detection method of ultralow frequency mutation
CN106367485B (en) Multi-position double-tag adapter set for detecting gene mutations, and preparation method and application thereof
CN107002292B (en) Construction method and reagent for a double-adapter single-stranded circular nucleic acid library
CN109971827B (en) Method and kit for constructing blood plasma DNA library
CN106048009B (en) Label joint for ultralow frequency gene mutation detection and application thereof
CN105442054B (en) Method for multiplex target-site amplification library construction from plasma DNA
CN112967753B (en) Pathogenic microorganism detection system and method based on nanopore sequencing
CN105899680A (en) Nucleic acid probe and method of detecting genomic fragments
EP2828218A1 (en) Methods of lowering the error rate of massively parallel dna sequencing using duplex consensus sequencing
CN111748551B (en) Blocking sequence, capture kit, library hybridization capture method and library construction method
CN111073961A (en) High-throughput detection method for gene rare mutation
CN108517567B (en) Adaptor, primer group, kit and library construction method for cfDNA library construction
WO2021227129A1 (en) Universal high-throughput sequencing adapter and application thereof
CN111154916B (en) Primer group, detection reagent and kit for respiratory tract pathogen multiple RPA detection
EP3555305A1 (en) Method for increasing throughput of single molecule sequencing by concatenating short dna fragments
WO2023221308A1 (en) Liquid-phase hybrid capture method and test kit thereof
WO2021253372A1 (en) High-compatibility pcr-free library building and sequencing method
CN108359723B (en) Method for reducing deep sequencing errors
JP2024504198A (en) Biomarker group and its application for detecting human microsatellite instability
CN108728515A (en) Library construction and sequencing data analysis method for detecting low-frequency ctDNA mutations using the duplex method
WO2016119448A2 (en) Artificial exogenous reference molecule for comparing types and natural abundance between microorganisms of different species and genera
WO2016109981A1 (en) High-throughput detection method for dna synthesis product
CN111926394B (en) Database building method and detection kit based on metagenomics
CN112301432B (en) Method and kit for constructing whole genome high-throughput sequencing library

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 20930290

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 20930290

Country of ref document: EP

Kind code of ref document: A1