CN108203847B - Library, reagent and application for second-generation sequencing quality evaluation - Google Patents

Publication number
CN108203847B
Authority
CN
China
Prior art keywords
sequencing
library
stranded dna
sequence
seq
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711330411.7A
Other languages
Chinese (zh)
Other versions
CN108203847A (en
Inventor
廖莎
闫东东
章文蔚
徐崇钧
陈奥
陈莹
赵杰
许军强
傅德丰
何琳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
MGI Tech Co Ltd
Original Assignee
MGI Tech Co Ltd
Application filed by MGI Tech Co Ltd
Publication of CN108203847A
Application granted
Publication of CN108203847B

Classifications

    • C: CHEMISTRY; METALLURGY
    • C40: COMBINATORIAL TECHNOLOGY
    • C40B: COMBINATORIAL CHEMISTRY; LIBRARIES, e.g. CHEMICAL LIBRARIES
    • C40B40/00: Libraries per se, e.g. arrays, mixtures
    • C40B40/04: Libraries containing only organic compounds
    • C40B40/06: Libraries containing nucleotides or polynucleotides, or derivatives thereof
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N: MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00: Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09: Recombinant DNA-technology
    • C12N15/63: Introduction of foreign genetic material using vectors; Vectors; Use of hosts therefor; Regulation of expression
    • C12N15/70: Vectors or expression systems specially adapted for E. coli
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869: Methods for sequencing

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Organic Chemistry (AREA)
  • Engineering & Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Biotechnology (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Biochemistry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Microbiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Physics & Mathematics (AREA)
  • Plant Pathology (AREA)
  • Chemical Kinetics & Catalysis (AREA)
  • General Chemical & Material Sciences (AREA)
  • Medicinal Chemistry (AREA)
  • Analytical Chemistry (AREA)
  • Immunology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The application discloses a library, a reagent and applications for second-generation sequencing quality evaluation. The library is a single-stranded DNA library of known sequences with different base characteristics, to which a linker sequence and an index sequence are ligated; the single-stranded DNA library includes at least one of a high-AT-content single-stranded DNA, a high-GC-content single-stranded DNA, a poly-structure single-stranded DNA, and a hairpin-structure single-stranded DNA. Because the library is sequenced from known sequences with different base characteristics, the influence of those characteristics on sequencing, and the resulting deviations, can be evaluated; this enables sequencing quality evaluation, correction of the deviations, and sequencing optimization. In the disclosed method for improving nucleic acid sequencing accuracy, the sequencing deviations associated with different base characteristics are obtained by comparing the sequencing result with the known sequence, and these deviations guide improvement of the sequencing software algorithm, thereby improving sequencing accuracy; the method effectively reduces sequencing deviation and provides a simple and effective way to improve sequencing accuracy.

Description

Library, reagent and application for second-generation sequencing quality evaluation
Technical Field
The application relates to the field of nucleic acid sequencing quality evaluation, in particular to a library, a reagent and application for second-generation sequencing quality evaluation.
Background
High-throughput sequencing is a very important technology that plays a crucial role in biological research and clinical applications, especially in precision medicine. As second-generation sequencing becomes increasingly important in precision medicine, the requirements on sequencing accuracy rise accordingly. Currently, mainstream second-generation sequencing platforms such as Illumina and Proton can achieve 99.9% accuracy, but accuracy may fall as read length increases, as base composition becomes more complex, and so on. For second-generation sequencing to better fulfil its role in precision medicine, the sequencing technology must be continuously improved.
The basic process of the current second-generation sequencing comprises the steps of constructing a sequencing library, amplifying library signals, converting base signals into optical signals which can be identified by a sequencer by virtue of sequencing enzyme, and finally reducing the optical signals into base information by virtue of computer software.
In the above basic sequencing process, sequencing errors are easily introduced at several points, which reduces sequencing accuracy: (1) during library construction, fragmentation can cause base mutations or deletions, and PCR amplification can introduce incorrect base pairing; (2) signal amplification of the library is also typically performed by PCR, so enzyme fidelity problems can likewise introduce sequencing errors; (3) in converting the base signal into an optical signal, the dNTPs are generally modified dNTPs, so the sequencing enzyme must be modified correspondingly to match them, which affects the fidelity of the sequencing enzyme to some extent and reduces sequencing accuracy; (4) finally, when the data analysis software converts the optical signal into base information, factors such as fluorescence background, impurities and weak signals can also cause signal processing errors.
Generally, Sanger sequencing, a first-generation method, is chosen to verify the accuracy of second-generation sequencing. However, this approach is cumbersome and is not suited to use during sequencing technology development for reducing the error rate introduced at each step of the sequencing process.
Disclosure of Invention
The application aims to provide a novel library, a reagent and application for evaluating the quality of next-generation sequencing.
One aspect of the present application discloses a library for quality evaluation of next-generation sequencing, which is a single-stranded DNA library of known sequences with different base characteristics, to which a linker sequence and an index sequence are ligated; the single-stranded DNA library of known sequences with different base characteristics comprises at least one of a high-AT-content single-stranded DNA, a high-GC-content single-stranded DNA, a poly-structure single-stranded DNA, and a hairpin-structure single-stranded DNA.
It should be noted that, when unknown sequences are actually sequenced, various base characteristics may affect sequencing accuracy, such as high AT content, high GC content, poly structures and hairpin structures. The present application creatively uses artificial synthesis to produce a single-stranded DNA library of known sequences with these different base characteristics; by comparing the sequencing result with the known sequence, the sequencing deviation of the sequencing platform used can therefore be determined and the quality of second-generation sequencing evaluated. Based on this deviation, targeted corrections can then be made, thereby improving sequencing accuracy.
It is understood that the library for second-generation sequencing quality assessment of the present application can perform, in addition to the second-generation sequencing quality assessment, as mentioned above, further correction and optimization of the second-generation sequencing to improve the sequencing accuracy or the sequencing quality.
It should be noted that, for convenience of use, and further to reduce the library construction process and reduce the base errors or errors introduced by the library construction process, it is preferable that the linker sequence and the index sequence are ligated in advance in the library, that is, the linker sequence and the index sequence are artificially synthesized directly together when synthesizing the sequence of the library; this avoids the reaction step of adding additional linker and index sequences to the library. The specific sequences of the linker sequence and the index sequence can be referred to an existing sequencing platform, and are not limited herein.
Preferably, the library also has universal primer binding sequences at both ends.
It should be noted that the purpose of the universal primer binding sequences is to allow all libraries of different sequences to be amplified with the same pair of primers. For example, because the six libraries of the present application share the same universal primer binding sequences, a single pair of primers suffices to amplify all six, rather than one pair per library.
Preferably, the library of the present application consists of at least one of the sequence shown by SEQ ID NO.7, the sequence shown by SEQ ID NO.8, the sequence shown by SEQ ID NO.9, the sequence shown by SEQ ID NO.10, the sequence shown by SEQ ID NO.11 and the sequence shown by SEQ ID NO. 12.
It should be noted that the libraries of the sequences shown in SEQ ID NO.7 to SEQ ID NO.12 are only six libraries that, in one implementation of the present application, have been verified to be effective for evaluating and optimizing second-generation sequencing quality; following the guidance of the present application, one skilled in the art can also artificially synthesize further libraries for quality evaluation or optimization of second-generation sequencing.
In yet another aspect of the present application, a cloning vector is disclosed, the cloning vector comprising a plasmid and an insert, wherein the insert comprises a library of the present application.
Preferably, the plasmid is pMD18-T or pMD19-T.
In a preferred embodiment of the present application, the synthesized library sequence is inserted into a plasmid, so that the library needs to be synthesized only once and can then be replicated indefinitely.
In another aspect, the present application discloses an engineered bacterium comprising a recipient bacterium and the cloning vector of the present application introduced into and stored in the recipient bacterium.
Preferably, the recipient bacterium employed herein is E. coli.
It should be noted that, once the library is cloned into a plasmid, the single-stranded DNA library needs to be synthesized only once and can then be used indefinitely; no re-synthesis is needed, which reduces sequence synthesis cost. In subsequent use, the required library can be obtained simply by culturing the engineered bacteria and extracting the plasmid. Moreover, the library sequence already contains the sequencing linker of the corresponding sequencing platform, so sequencing can proceed after simple library construction. The whole process is simple and convenient, with high stability.
In yet another aspect, the present application discloses a reagent for quality evaluation of second generation sequencing, the reagent comprising the library of the present application, the cloning vector of the present application, or the engineered bacterium of the present application.
The library, the cloning vector and the engineering bacteria can be used for evaluating the quality of second-generation sequencing, or can be used for correcting and optimizing second-generation sequencing so as to improve the sequencing accuracy or the sequencing quality; therefore, any one of them can be prepared into a kit for convenient use.
Preferably, the reagent of the present application further comprises a universal primer, wherein the upstream primer of the universal primer is a sequence shown in SEQ ID No.13, and the downstream primer is a sequence shown in SEQ ID No. 14.
It should be noted that the universal primers are designed for the universal primer binding sequences at both ends of the library, and the library or cloning vector can be amplified to obtain the library sequences. For ease of use, the universal primers are included in the kits of the present application as a separate package.
It should also be noted that for cloning vectors, such as pMD18-T or pMD19-T, which have plasmid amplification primers themselves, or can be designed for plasmids to amplify different inserts simultaneously, there is no need to design universal primer binding sequences at both ends of the library, and the plasmid amplification primers can be used directly for library amplification or sequencing, and there is no need for separate universal primers for the sequence shown in SEQ ID NO.13 and the sequence shown in SEQ ID NO. 14. The specific manner in which this is done is not limited herein.
More preferably, the reagent of the present application further comprises a splint oligo having a sequence shown in SEQ ID NO. 15.
It should be noted that the splint oligo serves to circularize the library DNA; in one implementation of the present application, sequencing is performed using DNA nanoball technology, so the library must be circularized. It is understood that the splint oligo may be omitted if DNA nanoball technology is not used; this is not particularly limited herein.
The application also discloses applications of the library, the cloning vector, the engineered bacterium or the reagent in evaluating the relationship between bases and sequencing quality, evaluating the base preference and accuracy of an amplification enzyme, evaluating the accuracy of a sequencing enzyme, evaluating or improving base signal extraction, detecting the accuracy of second-generation sequencing, detecting the error rate of each step from library construction to sequencing, or optimizing each step from library construction to sequencing.
The library of the application, based on the cloning vector, the engineering bacteria and the reagent of the library of the application, can be used for performing quality evaluation on second-generation sequencing; the principle is to compare and analyze the deviation between the sequencing result and the known library sequence, and the deviation can be used for evaluating the sequencing quality, evaluating the accuracy of the amplification enzyme and the sequencing enzyme or carrying out optimization based on the deviation. It is understood that, based on the above principle, the library, cloning vector, engineering bacteria, reagent, etc. of the present application can evaluate, detect, and optimize each step of the second generation sequencing process, which is not limited herein.
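The comparison principle described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the method of the application: reads are assumed to be ungapped and already aligned to the start of their known reference, and the feature labels, function names and toy sequences are hypothetical.

```python
from collections import defaultdict

def mismatch_positions(read, reference):
    """Return 0-based positions where an ungapped read differs from its known reference."""
    n = min(len(read), len(reference))
    return [i for i in range(n) if read[i] != reference[i]]

def error_rate_by_feature(observations):
    """observations: iterable of (feature_label, read, reference) tuples.

    Returns the per-base mismatch rate for each feature label.
    """
    errors = defaultdict(int)
    bases = defaultdict(int)
    for feature, read, reference in observations:
        errors[feature] += len(mismatch_positions(read, reference))
        bases[feature] += min(len(read), len(reference))
    return {f: errors[f] / bases[f] for f in bases if bases[f]}

# Toy data (hypothetical, for illustration only).
toy = [
    ("high_GC", "GCGCGGCC", "GCGCGCCC"),  # one substitution at position 5
    ("high_GC", "GCGCGCCC", "GCGCGCCC"),
    ("high_AT", "ATTATAAT", "ATTATAAT"),
]
rates = error_rate_by_feature(toy)
print(rates)
```

A real analysis would also need to handle insertions and deletions, which this ungapped comparison ignores.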
The application also discloses a method for improving the accuracy of nucleic acid sequencing, which comprises sequencing a single-stranded DNA library of known sequences with different base characteristics, comparing the sequencing result with the known sequence, statistically analyzing the sequencing deviations associated with the different base characteristics, and correcting the sequencing software algorithm according to those deviations, thereby improving the accuracy of nucleic acid sequencing; the single-stranded DNA library of known sequences with different base characteristics includes at least one of a high-AT-content single-stranded DNA, a high-GC-content single-stranded DNA, a poly-structure single-stranded DNA, and a hairpin-structure single-stranded DNA.
Preferably, the poly-structure single-stranded DNA includes at least one of poly A-structure single-stranded DNA, poly T-structure single-stranded DNA, poly G-structure single-stranded DNA, and poly C-structure single-stranded DNA.
Preferably, a single-stranded DNA library is the library of the present application.
It should be noted that the method for improving the accuracy of nucleic acid sequencing is actually based on the library of the present application, and according to the principle of the present application, the quality evaluation is performed on the second-generation sequencing, so as to optimize and improve the accuracy of sequencing. Based on the same principle, on the basis of the method for improving the nucleic acid sequencing accuracy, the method for evaluating the nucleic acid sequencing quality, the method for evaluating the relation between the base and the sequencing quality, the method for evaluating the base preference and the accuracy of the amplification enzyme, the method for evaluating the accuracy of the sequencing enzyme, the method for extracting, evaluating or improving the base signal, the method for detecting the accuracy of the second-generation sequencing, the method for detecting the error rate of each link from library construction to sequencing, the method for optimizing the scheme of each link from library construction to sequencing and the like can be provided, and the method is not particularly limited.
It should be noted that the method of the present application can improve the accuracy of nucleic acid sequencing, and likewise, the method of the present application can also be used to evaluate the base bias and accuracy of the amplification enzyme, for example, by comparing the sequencing results of the single-stranded DNA library before and after amplification with the amplification enzyme, the influence of the amplification enzyme on the sequencing bias can be analyzed, so as to achieve the purpose of evaluating the accuracy of the amplification enzyme, and by analyzing the specific type of the sequencing bias, the base bias of the amplification enzyme can be known. The principle of the accuracy evaluation of the sequencing enzyme is similar. In addition, the method can improve the accuracy of nucleic acid sequencing, and the key point is that after the sequencing result is compared and analyzed with the known sequence, a sequencing software algorithm is corrected, wherein the sequencing software algorithm comprises the processing of base signal extraction, so that the method can be applied to improving or evaluating the base signal extraction.
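The before/after comparison for an amplification enzyme might be sketched as below. This is an illustrative Python sketch only: it assumes ungapped reads, equal sequencing depth before and after amplification, and hypothetical function names and toy data, none of which come from the application.

```python
from collections import Counter

def substitution_spectrum(pairs):
    """Count (reference_base, called_base) substitutions over ungapped (read, reference) pairs."""
    spectrum = Counter()
    for read, reference in pairs:
        for called, expected in zip(read, reference):
            if called != expected:
                spectrum[(expected, called)] += 1
    return spectrum

def excess_after_amplification(before, after):
    """Substitutions seen more often after amplification than before.

    Counter subtraction keeps only positive differences, so substitution types
    that are equally frequent in both runs drop out.
    """
    return after - before

# Toy data (hypothetical): the same known sequence sequenced before and after amplification.
ref = "ACGTACGT"
before = substitution_spectrum([("ACGTACGT", ref), ("ACGTACGA", ref)])
after = substitution_spectrum([("ACATACGT", ref), ("ACATACGA", ref)])
excess = excess_after_amplification(before, after)
print(excess)
```

Here the T-to-A substitution appears in both runs and cancels out, so only the G-to-A substitutions remain attributable to the amplification step; with unequal depth the counts would first need to be normalized.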
Due to the adoption of the technical scheme, the beneficial effects of the application are as follows:
The library is designed with a variety of base characteristics of controllable structure. By sequencing sequences with known base characteristics, the influence of different base characteristics on second-generation sequencing, and the resulting deviations, can be evaluated; this enables quality evaluation of second-generation sequencing, targeted correction of the deviations, and thus optimization of second-generation sequencing. The method for improving nucleic acid sequencing accuracy creatively exploits the base characteristics of the library: the sequencing deviations associated with different base characteristics are obtained by comparing the sequencing result with the known library sequence, which guides improvement of the sequencing software algorithm and thereby improves sequencing accuracy. The method effectively reduces sequencing deviation and provides a simple and effective way to improve sequencing accuracy.
Drawings
FIG. 1 is a diagram showing the sequencing results of the first 50bp of a sequence library shown in SEQ ID NO.7 in the example of the present application;
FIG. 2 is a diagram showing the sequencing results of the first 50bp of a sequence library shown in SEQ ID NO.8 in the example of the present application;
FIG. 3 is a diagram showing the sequencing results of the first 50bp of a sequence library shown in SEQ ID NO.9 in the example of the present application;
FIG. 4 is a diagram showing the sequencing results of the first 50bp of a sequence library shown in SEQ ID NO.10 in the example of the present application;
FIG. 5 is a diagram showing the sequencing results of the first 50bp of a sequence library shown in SEQ ID NO.11 in the example of the present application;
FIG. 6 is a diagram showing the sequencing results of the first 50bp of the sequence library shown in SEQ ID NO.12 in the example of the present application;
FIG. 7 is a Q30 profile of a high GC library in an example of the present application.
Detailed Description
Extensive experiments and research showed that, in actual sequencing, the complexity of the base content of the various sequencing targets is an important factor affecting the quality of next-generation sequencing. For example, for a sequence with uniform AT and GC distribution and few poly structures and hairpin structures, both Illumina and Proton can reach 99.9% accuracy; however, for sequences with high AT content, high GC content, or many poly structures and hairpin structures, sequencing accuracy falls considerably and may not even meet the accuracy requirements of precision medicine.
For this reason, the present application creatively proposes and develops a single-stranded DNA library of known sequences with different base characteristics, in which the sequences carry specially designed base characteristics, including high AT content, high GC content, poly structures and hairpin structures; in one implementation of the present application, the library comprises six single-stranded DNAs with the sequences shown in SEQ ID NO.7 to SEQ ID NO.12. With the library designed in the present application, known sequences with specific base characteristics undergo second-generation sequencing, and the deviation between the sequencing result and the known library sequence is analyzed. This reveals the accuracy or quality of second-generation sequencing under each base characteristic, allows the observed deviations to be corrected, and thereby optimizes second-generation sequencing.
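For orientation, a per-base accuracy of 99.9% corresponds to a Phred quality score of 30 under the standard definition Q = -10·log10(p), which is the threshold behind a Q30 statistic such as the one plotted in FIG. 7. A minimal Python sketch of the conversion follows; the function names and the toy quality list are illustrative assumptions, not part of the application.

```python
import math

def phred_to_error_probability(q):
    """Standard Phred relation: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def accuracy_to_phred(accuracy):
    """Inverse relation: Q = -10 * log10(1 - accuracy)."""
    return -10 * math.log10(1 - accuracy)

def q30_fraction(qualities):
    """Fraction of base calls with quality score of at least 30."""
    return sum(q >= 30 for q in qualities) / len(qualities)

# Toy quality scores (hypothetical).
toy_qualities = [35, 32, 30, 28, 12, 36, 31, 30]
print(phred_to_error_probability(30))
print(accuracy_to_phred(0.999))
print(q30_fraction(toy_qualities))
```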
Before constructing the library of the present application, a set of standard nucleic acids is designed in advance, and these nucleic acids contain various base characteristics required by the library of the present application, and then a part or all of the sequences of the set of standard nucleic acids are selected for library construction. In one implementation of the present application, the standard nucleic acid consists of at least one of six single-stranded DNAs; the sequences of the six single-stranded DNAs are sequentially a sequence shown by SEQ ID NO.1, a sequence shown by SEQ ID NO.2, a sequence shown by SEQ ID NO.3, a sequence shown by SEQ ID NO.4, a sequence shown by SEQ ID NO.5 and a sequence shown by SEQ ID NO. 6. The libraries of sequences shown in SEQ ID NO.7 to 12 in the present application correspond in sequence to the standard nucleic acids of the sequences shown in SEQ ID NO.1 to 6 in the present application.
The present application is described in further detail below with reference to specific examples. The following examples are intended to be illustrative of the present application only and should not be construed as limiting the present application.
Examples
In this example, a set of standard nucleic acid sequences respectively containing base features such as high AT content, high GC content, poly structure and hairpin structure is first designed, then a library is designed for the standard nucleic acid sequences, and BGISEQ linker sequence, index sequence and universal primer binding sequence are added to the library sequences. Artificially synthesizing a designed library sequence, inserting the artificially synthesized library sequence into a pMD19-T plasmid, and introducing the plasmid into Escherichia coli to prepare the engineering bacteria. And extracting plasmids in the engineering bacteria to obtain a library sequence for next generation sequencing and evaluating the sequencing quality. The details are as follows:
First, design of standard nucleic acids
In this example, six standard nucleic acid sequences were designed based on base features commonly encountered in actual sequencing, such as high AT content, high GC content, poly structures and hairpin structures, and a different index sequence was used for each standard nucleic acid sequence. Details are shown in Table 1.
TABLE 1 sequences of standard nucleic acids
(Table 1 appears in the original publication as images BDA0001506520660000061 and BDA0001506520660000071; its contents are not reproduced in this text.)
The six standard nucleic acid sequences of this example comprise two high-GC sequences, two high-AT sequences, and two random sequences; the random sequences are ordinary sequences with similar A, C, G and T content and serve for comparative analysis. Each standard nucleic acid sequence corresponds to one index sequence, i.e., a barcode sequence, used to distinguish the different sequences. The two high-GC sequences and the two high-AT sequences contain hairpin structures and poly structures.
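The base features named above (AT/GC content, poly structures, hairpin structures) can be quantified with simple sequence statistics. The Python sketch below is an illustrative assumption about how one might screen candidate sequences; it is not taken from the application, and the stem and loop thresholds in the hairpin check are arbitrary. The homopolymer example uses a short fragment that occurs in SEQ ID NO.7.

```python
import re

def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def longest_homopolymer(seq):
    """Return (base, length) of the longest run of a single base (a 'poly' structure)."""
    best = max(re.finditer(r"A+|C+|G+|T+", seq.upper()), key=lambda m: len(m.group()))
    return best.group()[0], len(best.group())

def reverse_complement(seq):
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]

def has_inverted_repeat(seq, stem=6, min_loop=3):
    """Crude hairpin screen: is any stem-length arm followed, at least min_loop
    bases later, by its reverse complement?"""
    seq = seq.upper()
    for i in range(len(seq) - 2 * stem - min_loop + 1):
        arm = seq[i:i + stem]
        if reverse_complement(arm) in seq[i + stem + min_loop:]:
            return True
    return False

print(gc_content("GGGCCCAT"))
print(longest_homopolymer("AAGGGGGTTAATAC"))  # fragment found in SEQ ID NO.7
print(has_inverted_repeat("ACGTGCTTTTGCACGT"))
```

The hairpin check ignores thermodynamic stability, so it only flags candidates for closer inspection.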
Second, library sequence design and construction
Most of the six standard nucleic acid sequences designed in this example were selected to construct a library, and a linker sequence suitable for BGISEQ was inserted into the library, and the same universal primer binding sequence was ligated to both ends of each of the six standard nucleic acid sequences. The library sequences designed for the six single-stranded DNA standard nucleic acid sequences of the sequences shown in SEQ ID NO.1 to SEQ ID NO.6 are the sequences shown in SEQ ID NO.7 to SEQ ID NO.12 in sequence.
SEQ ID NO.7:
5’-GATATCTGCAGGCATAGAATGAATATTATTGAATCAATAATTAAAGTCGGAGGCCAAGCGGTCTTAGGAAGACAAACTAGTACGTCAACTCCTTGGCTCACAGAACGACATGGCTACGATCCGACTTTACAACTACAGATAATGGGCTGGATACATGGAATGATTATAGATATATTAAGGAATAATGTTAATTAATGCCTAAATTAATTAATCTAAGGGGGTTAATACTTCAGCCTGTGATATC-3’;
SEQ ID NO.8:
5’-GATATCTGCAGGCATGAATAATAATGGAATAGCAATAATTAAAGTCGGAGGCCAAGCGGTCTTAGGAAGACAACGATCAGTACCAACTCCTTGGCTCACAGAACGACATGGCTACGATCCGACTTATATAATGTAATACATAATATTAATATATTAATTATTGTATGATTGTTATCTATTACAGTCTAGTACTGACCCGTAGACATATATGCCCCCGATTAATTACTTATCAGCCTGTGATATC-3’;
SEQ ID NO.9:
5’-GATATCTGCAGGCATCGGCCGCGGCGTCCAGTGCGCGGCGCTAGAGCCGGCAAGTCGGAGGCCAAGCGGTCTTAGGAAGACAACGCTATGTACCAACTCCTTGGCTCACAGAACGACATGGCTACGATCCGACTTCCGCCGCGGTCGCTTGTCCGGCCGCCGGTCCGGCGCCGGCGGCGCAAAGTGCCAGGCCGAGCCGGCGAACCAGCGGTCCGAAAAACACGGACACTCAGCCTGTGATATC-3’;
SEQ ID NO.10:
5’-GATATCTGCAGGCATCACCGCCGAGGCCGCGGCGGAGACCGCCGGCGCAGGAAGTCGGAGGCCAAGCGGTCTTAGGAAGACAACAGAGTGTACCAACTCCTTGGCTCACAGAACGACATGGCTACGATCCGACTTCAAACTACCGGCGCGGCGCTCCTCCGGCCGTCCGCCGCCGACCGGCGGCGGCGTTCCGGTGTGGCACTCCAGGTGGCCGGTTCTCTGCCAAGCGTCAGCCTGTGATATC-3’;
SEQ ID NO.11:
5’-GATATCTGCAGGCATGAAGAACAACCCCGCACGACGCCTACCAACCAAGTCGGAGGCCAAGCGGTCTTAGGAAGACAACTGTATCGTACAACTCCTTGGCTCACAGAACGACATGGCTACGATCCGACTTGCTGTTCGCGGCCGATGTTCGTATAAGATATAAGTTTGGGTATATTCCAGTTTATCGATCGTATCGAAATGTATGAGTTTATACAGGTCCTACTTCAACTCAGCCTGTGATATC-3’;
SEQ ID NO.12:
5’-GATATCTGCAGGCATACTAGACCAGTTCATTATTATAGTGCTAGCCAAAGTCGGAGGCCAAGCGGTCTTAGGAAGACAAACATCAACGTCAACTCCTTGGCTCACAGAACGACATGGCTACGATCCGACTTGACGGATTCCCTCGCTTTCTATTGGCTGACAGTACAAGTAACATAGGTTGGGTCGGTTAACCCTGCCGTCACAAGTGGAACGATGTTAATAGTTGCGGTCAGCCTGTGATATC-3’;
In the above six library sequences, "GATATCTGCAGGCAT" is the universal primer binding sequence at the 5' end and "TCAGCCTGTGATATC" is the universal primer binding sequence at the 3' end; the universal primers are designed against these two sequences. "AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAANNNNNNNNNNCAACTCCTTGGCTCACAGAACGACATGGCTACGATCCGACTT" is the linker sequence containing the index sequence, where "NNNNNNNNNN" is a 10 bp index sequence. The index sequence of the library shown in SEQ ID NO.7 is "ACTAGTACGT", that of SEQ ID NO.8 is "CGATCAGTAC", that of SEQ ID NO.9 is "CGCTATGTAC", that of SEQ ID NO.10 is "CAGAGTGTAC", that of SEQ ID NO.11 is "CTGTATCGTA", and that of SEQ ID NO.12 is "ACATCAACGT".
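The library layout described above (5' primer binding site, insert, linker with a 10 bp index, insert, 3' primer binding site) can be checked programmatically. The following Python sketch uses only the sequences quoted in this document; the regular expression, group names and `parse_library` helper are illustrative assumptions.

```python
import re

FIVE_PRIME_SITE = "GATATCTGCAGGCAT"
THREE_PRIME_SITE = "TCAGCCTGTGATATC"
LINKER_LEFT = "AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA"
LINKER_RIGHT = "CAACTCCTTGGCTCACAGAACGACATGGCTACGATCCGACTT"

# Hypothetical pattern: named groups capture the two insert segments and the index.
LIBRARY_PATTERN = re.compile(
    f"^{FIVE_PRIME_SITE}(?P<left_insert>[ACGT]+?){LINKER_LEFT}"
    f"(?P<index>[ACGT]{{10}}){LINKER_RIGHT}"
    f"(?P<right_insert>[ACGT]+){THREE_PRIME_SITE}$"
)

def parse_library(seq):
    """Split a library sequence into its parts, or return None if the layout does not match."""
    match = LIBRARY_PATTERN.match(seq.upper())
    return match.groupdict() if match else None

# SEQ ID NO.7 as listed above.
SEQ_ID_NO_7 = (
    "GATATCTGCAGGCATAGAATGAATATTATTGAATCAATAATTAAAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA"
    "ACTAGTACGTCAACTCCTTGGCTCACAGAACGACATGGCTACGATCCGACTTTACAACTACAGATAATGGGCTGG"
    "ATACATGGAATGATTATAGATATATTAAGGAATAATGTTAATTAATGCCTAAATTAATTAATCTAAGGGGGTTAA"
    "TACTTCAGCCTGTGATATC"
)

parts = parse_library(SEQ_ID_NO_7)
print(parts["index"])
```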
In the universal primers, the upstream primer has the sequence shown in SEQ ID NO.13 and the downstream primer has the sequence shown in SEQ ID NO.14:
SEQ ID NO.13: 5’-GATATCTGCAGGCAT-3’;
SEQ ID NO.14: 5’-GATATCACAGGCTGA-3’.
In this example, DNA nanoball technology is used for sequencing, so the library DNA must be circularized; a splint oligo was therefore designed, with the sequence shown in SEQ ID NO.15:
SEQ ID NO.15: 5’-ATGCCTGCAGATATCGATATCACAGGCTGA-3’.
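The relationship between the primer binding sequences, the universal primers and the splint oligo can be verified with a few lines of Python. The sketch below uses only sequences quoted in this document; the reading of the splint as the reverse complement of the junction formed when the 3' end of the strand meets its 5' end is an observation from the sequences, offered as an assumption rather than a statement from the application.

```python
def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

FIVE_PRIME_SITE = "GATATCTGCAGGCAT"    # universal primer binding sequence, 5' end
THREE_PRIME_SITE = "TCAGCCTGTGATATC"   # universal primer binding sequence, 3' end
UPSTREAM_PRIMER = "GATATCTGCAGGCAT"    # SEQ ID NO.13
DOWNSTREAM_PRIMER = "GATATCACAGGCTGA"  # SEQ ID NO.14
SPLINT_OLIGO = "ATGCCTGCAGATATCGATATCACAGGCTGA"  # SEQ ID NO.15

# The upstream primer is identical to the 5' binding sequence.
upstream_ok = UPSTREAM_PRIMER == FIVE_PRIME_SITE
# The downstream primer is the reverse complement of the 3' binding sequence.
downstream_ok = DOWNSTREAM_PRIMER == reverse_complement(THREE_PRIME_SITE)
# Junction created when the 3' end of the single strand is joined to its 5' end.
junction = THREE_PRIME_SITE + FIVE_PRIME_SITE
splint_ok = SPLINT_OLIGO == reverse_complement(junction)

print(upstream_ok, downstream_ok, splint_ok)
```

All three checks hold for the listed sequences, which is consistent with a splint that hybridizes across both ends of the strand to hold them together for ligation.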
The libraries of the sequences shown in SEQ ID NO.7 to SEQ ID NO.12 of this example, as well as the universal primers and the splint oligo, were all synthesized by Shanghai.
Third, cloning vector and engineering bacterium construction
The synthesized library sequences were cloned, and the cloning vectors were introduced into E. coli. Construction of the cloning vectors and the engineered bacteria was carried out by Nanjing Kinsley.
Fourth, library acquisition
The preserved engineered bacteria were cultured overnight in LB medium at 37 °C, and the plasmid was extracted with a Thermo Fisher plasmid extraction kit (the product name appears as an image, BDA0001506520660000091, in the original publication) according to the kit instructions. The extracted plasmid was PCR-amplified with the universal primers, and the PCR amplification product can be used directly for sequencing after circularization.
1. Plasmid extraction
Plasmids in this example were extracted with the [Figure BDA0001506520660000092] plasmid extraction kit. For the extraction procedure, refer to the [Figure BDA0001506520660000093] instructions, which are not repeated here.
2. PCR amplification
The 100 µL PCR amplification system comprised: 20 µL of 5× high-fidelity enzyme reaction solution, 5 µL of dNTP mix (10 mM each), 1 µL of high-fidelity enzyme (1 U/µL), 6 µL of upstream primer (20 µM), 6 µL of downstream primer (20 µM), 1 µL of extracted plasmid template, and 61 µL of ddH2O, for a total of 100 µL.
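As a quick arithmetic check of this reaction mix, the following Python sketch sums the component volumes and derives the final concentrations implied by simple dilution. The dictionary keys and the helper `final_concentration` are illustrative names, not part of any kit protocol.

```python
# Component volumes in microlitres, as listed for the 100 uL reaction.
PCR_MIX_UL = {
    "5x high-fidelity reaction solution": 20.0,
    "dNTP mix (10 mM each)": 5.0,
    "high-fidelity enzyme (1 U/uL)": 1.0,
    "upstream primer (20 uM)": 6.0,
    "downstream primer (20 uM)": 6.0,
    "plasmid template": 1.0,
    "ddH2O": 61.0,
}


def final_concentration(stock: float, volume_ul: float, total_ul: float) -> float:
    """Final concentration after diluting `volume_ul` of stock into `total_ul`."""
    return stock * volume_ul / total_ul


total_ul = sum(PCR_MIX_UL.values())
primer_final_um = final_concentration(20.0, 6.0, total_ul)  # each primer, uM
dntp_final_mm = final_concentration(10.0, 5.0, total_ul)    # each dNTP, mM
buffer_final_x = final_concentration(5.0, 20.0, total_ul)   # buffer strength

print(f"total volume: {total_ul} uL")
print(f"each primer: {primer_final_um} uM")
print(f"each dNTP: {dntp_final_mm} mM")
print(f"buffer: {buffer_final_x}x")
```

The volumes add up to the stated 100 µL, and the dilution gives a 1× reaction solution with each primer at 1.2 µM and each dNTP at 0.5 mM.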
The PCR amplification conditions were: 98 °C for 3 min; then 33 cycles of 98 °C for 20 s, 60 °C for 15 s and 72 °C for 30 s; after cycling, the reaction was held at 72 °C for 5 min and then at 4 °C.
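The programmed incubation time of this thermal profile can be totalled with a small Python sketch. The step labels (denaturation, annealing, extension) are the conventional reading of the three temperatures and are an assumption; ramp times between temperatures and the open-ended 4 °C hold are ignored.

```python
INITIAL_STEP_S = 3 * 60  # 98 C for 3 min
CYCLES = 33
CYCLE_STEPS_S = {
    "98 C (denaturation, assumed label)": 20,
    "60 C (annealing, assumed label)": 15,
    "72 C (extension, assumed label)": 30,
}
FINAL_STEP_S = 5 * 60    # 72 C for 5 min


def programmed_seconds() -> int:
    """Total programmed incubation time, excluding ramping and the 4 C hold."""
    return INITIAL_STEP_S + CYCLES * sum(CYCLE_STEPS_S.values()) + FINAL_STEP_S


total_s = programmed_seconds()
print(f"{total_s} s = {total_s / 60:.2f} min")
```

This gives 2625 s, or 43.75 min, of programmed incubation; actual run time on an instrument would be longer because of temperature ramping.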
3. Circularization of PCR amplification product
In this example, the PCR amplification products were purified with magnetic beads, and the purified products were then circularized using the BGISEQ500 SE50 circularized-library construction kit according to its procedure. The specific circularization steps are described in the kit instructions and are not repeated here.
Fifth, library sequencing detection and sequencing accuracy detection
To verify that the synthesized libraries of known sequence are suitable for sequencing on the BGISEQ platform, the six libraries of the sequences shown in SEQ ID NO.7 to SEQ ID NO.12 obtained in this example were sequenced in SE50+10 mode using the BGISEQ500 SE50 kit.
The circularization products of the six libraries were used to prepare DNBs according to the BGISEQ500 operating procedure. Then 15 µL of each prepared DNB was taken and pooled into a 90 µL DNB mixture, the chip was prepared according to the standard procedure, and the SE50+10 sequencing mode was selected for sequencing.
The sequencing results showed that reads from the six libraries of the sequences shown in SEQ ID NO.7 to SEQ ID NO.12 could be distinguished by their index sequences, and that the first 50 bp of the reads from each library were identical to the corresponding known standard nucleic acid sequence. The first 50 bp of the sequencing results are shown in Figures 1 to 6, which correspond in order to the libraries of SEQ ID NO.7 to SEQ ID NO.12. This indicates that library construction was successful and that base calling by the algorithm was accurate.
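The comparison of the first 50 bp of a read with the known sequence can be expressed as a minimal Python sketch. The function name `mismatch_positions` and the toy sequences are hypothetical; real reads would come from the sequencer output, and no alignment or indel handling is attempted here.

```python
from typing import List


def mismatch_positions(read: str, reference: str, length: int = 50) -> List[int]:
    """Return 1-based positions where the first `length` bases differ."""
    read, reference = read.upper(), reference.upper()
    if len(read) < length or len(reference) < length:
        raise ValueError("read and reference must cover the compared length")
    return [
        i + 1
        for i in range(length)
        if read[i] != reference[i]
    ]


# Toy example: a 10-base reference and a read with one substitution.
toy_reference = "ACGTACGTAC"
toy_read = "ACGTTCGTAC"
print(mismatch_positions(toy_read, toy_reference, length=10))
```

An empty list for every library would correspond to the result reported above, where the first 50 bp matched the standard sequences.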
Sixth, evaluation of sequencing quality
To examine the relationship between sequencing quality and base composition, SE100 sequencing was performed on the library of the sequence shown in SEQ ID NO.7, which has a high AT content (the high AT library), and on the library of the sequence shown in SEQ ID NO.9, which has a high GC content (the high GC library), using the BGISEQ500 SE100+10 sequencing kit.
DNB preparation and chip preparation were the same as in "Fifth, library sequencing detection and sequencing accuracy detection", except that in this experiment only the library of the sequence shown in SEQ ID NO.7 and the library of the sequence shown in SEQ ID NO.9 were prepared and sequenced on the instrument in SE100 mode.
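How a sequence might be labelled as high AT or high GC can be sketched in Python using the thresholds stated in the claims of this document (AT content of at least 72.95%, GC content of at least 75.47%). The function names and the toy sequences are illustrative assumptions.

```python
HIGH_AT_MIN_PERCENT = 72.95  # AT threshold stated in the claims
HIGH_GC_MIN_PERCENT = 75.47  # GC threshold stated in the claims


def gc_percent(seq: str) -> float:
    """Percentage of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    gc_count = seq.count("G") + seq.count("C")
    return 100.0 * gc_count / len(seq)


def classify(seq: str) -> str:
    """Label a sequence as 'high GC', 'high AT' or 'neither'."""
    gc = gc_percent(seq)
    if gc >= HIGH_GC_MIN_PERCENT:
        return "high GC"
    if 100.0 - gc >= HIGH_AT_MIN_PERCENT:
        return "high AT"
    return "neither"


print(classify("GGGGCCCCAT"))       # toy sequence, 80% GC
print(classify("AATTAATTGC"))       # toy sequence, 80% AT
print(classify("GATATCTGCAGGCAT"))  # upstream universal primer
```

The sketch counts only A, C, G and T; ambiguous bases such as N would simply lower both percentages.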
The sequencing quality of the two libraries was analyzed and compared. As shown in Table 2, the high GC library (SEQ ID NO.9) had a lower Q30 and a higher error rate than the high AT library (SEQ ID NO.7). Targeted optimization for GC-rich libraries can therefore be carried out in subsequent improvements of the sequencing technology.
TABLE 2 Comparison of sequencing quality of the two libraries
Name | PredQual | GC content (%) | Q10 (%) | Q20 (%) | Q30 (%) | EsErr (%)
High AT library | 33 | 27.05 | 99.16 | 98.02 | 91.44 | 0.23
High GC library | 33 | 75.47 | 98.16 | 94.18 | 85.05 | 0.68
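For readers unfamiliar with the Q30 metric discussed in this section, the following Python sketch uses the standard Phred definition, Q = -10·log10(P), where P is the estimated probability that a base call is wrong. The function names and the toy quality values are illustrative; this is not the quality model of the BGISEQ base caller.

```python
from typing import Sequence


def phred_to_error_probability(q: float) -> float:
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def percent_at_or_above(qualities: Sequence[int], threshold: int) -> float:
    """Percentage of bases whose quality is at least `threshold`."""
    if not qualities:
        raise ValueError("no quality values supplied")
    passing = sum(1 for q in qualities if q >= threshold)
    return 100.0 * passing / len(qualities)


print(phred_to_error_probability(30))  # about 1 error in 1000 base calls

# Toy per-base qualities for a single short read.
toy_qualities = [33, 33, 20, 10]
print(percent_at_or_above(toy_qualities, 30))
```

A Q30 percentage is therefore the share of bases with an estimated error probability of at most 0.1%.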
In addition, the relationship between bases and quality values was analyzed further. Figure 7 shows the Q30 distribution of the high GC library: Q30 drops markedly at 60 bp, 68 bp, 81 bp, 91 bp and 97 bp. The corresponding positions in the sequence share a common feature: when base G is followed by A, the sequencing quality of the A deteriorates. This provides a direction for subsequent optimization of the sequencing technology.
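The G-followed-by-A pattern can be located in the high GC library with a short Python sketch. The sequence is SEQ ID NO.9 as given in the sequence listing; the assumption that the read begins at the first base downstream of the linker, and the names `ADAPTER_3P` and `ga_positions`, are illustrative and not stated in the text.

```python
# SEQ ID NO.9, copied from the sequence listing (spaces are removed below).
SEQ_ID_NO_9 = (
    "gatatctgca ggcatcggcc gcggcgtcca gtgcgcggcg ctagagccgg caagtcggag"
    "gccaagcggt cttaggaaga caacgctatg taccaactcc ttggctcaca gaacgacatg"
    "gctacgatcc gacttccgcc gcggtcgctt gtccggccgc cggtccggcg ccggcggcgc"
    "aaagtgccag gccgagccgg cgaaccagcg gtccgaaaaa cacggacact cagcctgtga"
    "tatc"
).replace(" ", "").upper()

# Part of the linker downstream of the index, as given in the text.
ADAPTER_3P = "CAACTCCTTGGCTCACAGAACGACATGGCTACGATCCGACTT"


def ga_positions(read: str) -> list:
    """1-based positions of every A that directly follows a G."""
    return [i + 2 for i in range(len(read) - 1) if read[i:i + 2] == "GA"]


# Assumption: the read starts immediately after the linker.
read_start = SEQ_ID_NO_9.index(ADAPTER_3P) + len(ADAPTER_3P)
read = SEQ_ID_NO_9[read_start:read_start + 100]

print(ga_positions(read))
```

Under that assumption the sketch reports positions 60, 68, 81 and 91, which coincide with four of the five positions named above. It does not reproduce the 97 bp position, so the assumed read start should be treated with caution.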
Therefore, the standard nucleic acids and the libraries based on them can be used in second-generation sequencing to evaluate the base preference and accuracy of sequencing, to detect the accuracy of second-generation sequencing and to evaluate its quality. Targeted optimization based on the sequencing results and the analysis of base characteristics in turn improves the accuracy of nucleic acid sequencing.
The foregoing is a more detailed description of the present application in connection with specific embodiments thereof, and it is not intended that the present application be limited to the specific embodiments thereof. It will be apparent to those skilled in the art from this disclosure that many more simple derivations or substitutions can be made without departing from the spirit of the disclosure.
SEQUENCE LISTING
<110> Shenzhen Huashengshengsciences institute
<120> library, reagent and application for second-generation sequencing quality evaluation
<130> 17I25566-A23542
<160> 15
<170> PatentIn version 3.3
<210> 1
<211> 150
<212> DNA
<213> Artificial sequence
<400> 1
tacaactaca gataatgggc tggatacatg gaatgattat agatatatta aggaataatg 60
ttaattaatg cctaaattaa ttaatctaag ggggttaata ctatgtgtta attaatctta 120
ttagaatgaa tattattgaa tcaataatta 150
<210> 2
<211> 150
<212> DNA
<213> Artificial sequence
<400> 2
atataatgta atacataata ttaatatatt aattattgta tgattgatat ctattacagt 60
ctagtactga cccgtagaca tatatgcccc cgattaatta cttaggctta ttaataatat 120
ataggaataa taatggaata gcaataatta 150
<210> 3
<211> 150
<212> DNA
<213> Artificial sequence
<400> 3
ccgccgcggt cgcttgtccg gccgccggtc cggcgccggc ggcgcaaagt gccaggccga 60
gccggcgaac cagcggtccg aaaaacacgg acacggtaac ctcaccacga tggccggccg 120
cggcgtccag tgcgcggcgc tagagccggc 150
<210> 4
<211> 150
<212> DNA
<213> Artificial sequence
<400> 4
caaactaccg gcgcggcgct cctccggccg tccgccgccg accggcggcg gcgttccggt 60
gtggcactcc aggtggccgg ttctctgcca agcggcaggc gaaaaatcga cggccaccgc 120
cgaggccgcg gcggagaccg ccggcgcagg 150
<210> 5
<211> 150
<212> DNA
<213> Artificial sequence
<400> 5
gctgttcgcg gccgatgttc gtataagata taagtttggg tatattccag tttatcgatc 60
gtatcgaaat gtatgagttt atacaggtcc tacttcaaca agcggcactt tactaccgtg 120
aagaacaacc ccgcacgacg cctaccaacc 150
<210> 6
<211> 150
<212> DNA
<213> Artificial sequence
<400> 6
gacggattcc ctcgctttct attggctgac agtacaagta acataggttg ggtcggttaa 60
ccctgccgtc acaagtggaa cgatgttaat agttgcggaa ccctatgttc ggcggaatac 120
tagaccagtt cattattata gtgctagcca 150
<210> 7
<211> 244
<212> DNA
<213> Artificial sequence
<400> 7
gatatctgca ggcatagaat gaatattatt gaatcaataa ttaaagtcgg aggccaagcg 60
gtcttaggaa gacaaactag tacgtcaact ccttggctca cagaacgaca tggctacgat 120
ccgactttac aactacagat aatgggctgg atacatggaa tgattataga tatattaagg 180
aataatgtta attaatgcct aaattaatta atctaagggg gttaatactt cagcctgtga 240
tatc 244
<210> 8
<211> 244
<212> DNA
<213> Artificial sequence
<400> 8
gatatctgca ggcatgaata ataatggaat agcaataatt aaagtcggag gccaagcggt 60
cttaggaaga caacgatcag taccaactcc ttggctcaca gaacgacatg gctacgatcc 120
gacttatata atgtaataca taatattaat atattaatta ttgtatgatt gttatctatt 180
acagtctagt actgacccgt agacatatat gcccccgatt aattacttat cagcctgtga 240
tatc 244
<210> 9
<211> 244
<212> DNA
<213> Artificial sequence
<400> 9
gatatctgca ggcatcggcc gcggcgtcca gtgcgcggcg ctagagccgg caagtcggag 60
gccaagcggt cttaggaaga caacgctatg taccaactcc ttggctcaca gaacgacatg 120
gctacgatcc gacttccgcc gcggtcgctt gtccggccgc cggtccggcg ccggcggcgc 180
aaagtgccag gccgagccgg cgaaccagcg gtccgaaaaa cacggacact cagcctgtga 240
tatc 244
<210> 10
<211> 244
<212> DNA
<213> Artificial sequence
<400> 10
gatatctgca ggcatcaccg ccgaggccgc ggcggagacc gccggcgcag gaagtcggag 60
gccaagcggt cttaggaaga caacagagtg taccaactcc ttggctcaca gaacgacatg 120
gctacgatcc gacttcaaac taccggcgcg gcgctcctcc ggccgtccgc cgccgaccgg 180
cggcggcgtt ccggtgtggc actccaggtg gccggttctc tgccaagcgt cagcctgtga 240
tatc 244
<210> 11
<211> 244
<212> DNA
<213> Artificial sequence
<400> 11
gatatctgca ggcatgaaga acaaccccgc acgacgccta ccaaccaagt cggaggccaa 60
gcggtcttag gaagacaact gtatcgtaca actccttggc tcacagaacg acatggctac 120
gatccgactt gctgttcgcg gccgatgttc gtataagata taagtttggg tatattccag 180
tttatcgatc gtatcgaaat gtatgagttt atacaggtcc tacttcaact cagcctgtga 240
tatc 244
<210> 12
<211> 244
<212> DNA
<213> Artificial sequence
<400> 12
gatatctgca ggcatactag accagttcat tattatagtg ctagccaaag tcggaggcca 60
agcggtctta ggaagacaaa catcaacgtc aactccttgg ctcacagaac gacatggcta 120
cgatccgact tgacggattc cctcgctttc tattggctga cagtacaagt aacataggtt 180
gggtcggtta accctgccgt cacaagtgga acgatgttaa tagttgcggt cagcctgtga 240
tatc 244
<210> 13
<211> 15
<212> DNA
<213> Artificial sequence
<400> 13
gatatctgca ggcat 15
<210> 14
<211> 15
<212> DNA
<213> Artificial sequence
<400> 14
gatatcacag gctga 15
<210> 15
<211> 30
<212> DNA
<213> Artificial sequence
<400> 15
atgcctgcag atatcgatat cacaggctga 30

Claims (12)

1. A library for quality assessment of next generation sequencing, characterized by: the library is a single-stranded DNA library with known sequences with different base characteristics, and an adapter sequence and an index sequence are connected in the library; the single-stranded DNA library with known sequences with different base characteristics comprises high AT content single-stranded DNA, high GC content single-stranded DNA, poly structure single-stranded DNA and hairpin structure single-stranded DNA; the two ends of the library are provided with universal primer binding sequences;
the high AT content single-stranded DNA refers to single-stranded DNA with the AT content of more than or equal to 72.95 percent;
the high GC content single-stranded DNA refers to a single-stranded DNA having a GC content of 75.47% or more.
2. The library of claim 1, wherein: the single-stranded DNA library consists of a sequence shown by SEQ ID NO.7, a sequence shown by SEQ ID NO.8, a sequence shown by SEQ ID NO.9, a sequence shown by SEQ ID NO.10, a sequence shown by SEQ ID NO.11 and a sequence shown by SEQ ID NO. 12.
3. A cloning vector comprising a plasmid and an insert, characterized in that: the insert comprises the library of claim 1 or 2.
4. The cloning vector of claim 3, wherein: the plasmid is pMD18-T or pMD 19-T.
5. An engineered bacterium comprising a recipient bacterium and the cloning vector of claim 3 or 4 introduced and stored in the recipient bacterium.
6. The engineered bacterium of claim 5, wherein: the recipient bacterium is escherichia coli.
7. A reagent for quality assessment of next generation sequencing, characterized by: the reagent comprises the library of claim 1 or 2, the cloning vector of claim 3 or 4, or the engineered bacterium of claim 5 or 6.
8. The reagent according to claim 7, characterized in that: the reagent further comprises universal primers, wherein the upstream primer has the sequence shown in SEQ ID NO.13 and the downstream primer has the sequence shown in SEQ ID NO.14.
9. The reagent according to claim 7 or 8, characterized in that: also comprises a splint oligo which is shown as SEQ ID NO. 15.
10. Use of the library of claim 1 or 2, the cloning vector of claim 3 or 4, the engineered bacterium of claim 5 or 6, or the reagent of any one of claims 7 to 9 in evaluating the relationship between bases and sequencing quality, evaluating the base preference and accuracy of an amplification enzyme, evaluating the accuracy of a sequencing enzyme, evaluating or improving base signal extraction, detecting the accuracy of second-generation sequencing, or detecting the error rate of each step from library construction to sequencing.
11. A method of increasing the accuracy of nucleic acid sequencing, comprising: the method comprises the steps of sequencing a single-stranded DNA library of a known sequence with different base characteristics, comparing a sequencing result with the known sequence, statistically analyzing sequencing deviation existing in the different base characteristics, and correcting a sequencing software algorithm according to the sequencing deviation, so that the nucleic acid sequencing accuracy is improved; the single-stranded DNA library with known sequences with different base characteristics comprises high AT content single-stranded DNA, high GC content single-stranded DNA, poly structure single-stranded DNA and hairpin structure single-stranded DNA;
the poly-structure single-stranded DNA comprises at least one of poly A-structure single-stranded DNA, poly T-structure single-stranded DNA, poly G-structure single-stranded DNA and poly C-structure single-stranded DNA;
the high AT content single-stranded DNA refers to single-stranded DNA with the AT content of more than or equal to 72.95 percent;
the high GC content single-stranded DNA refers to a single-stranded DNA having a GC content of 75.47% or more.
12. The method of claim 11, wherein: the single-stranded DNA library is the library of claim 1 or 2.
CN201711330411.7A 2016-12-16 2017-12-13 Library, reagent and application for second-generation sequencing quality evaluation Active CN108203847B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201611170287 2016-12-16
CN2016111702878 2016-12-16

Publications (2)

Publication Number Publication Date
CN108203847A CN108203847A (en) 2018-06-26
CN108203847B true CN108203847B (en) 2022-01-04

Family

ID=62604671

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711330411.7A Active CN108203847B (en) 2016-12-16 2017-12-13 Library, reagent and application for second-generation sequencing quality evaluation

Country Status (2)

Country Link
CN (1) CN108203847B (en)
HK (1) HK1250759A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109629008B (en) * 2018-12-29 2021-12-03 艾吉泰康生物科技(北京)有限公司 Quality control method for second-generation sequencing library-building reagent components and template combination used in quality control method
US11210554B2 (en) 2019-03-21 2021-12-28 Illumina, Inc. Artificial intelligence-based generation of sequencing metadata
US11200446B1 (en) 2020-08-31 2021-12-14 Element Biosciences, Inc. Single-pass primary analysis
US20230279382A1 (en) * 2022-03-04 2023-09-07 Element Biosciences, Inc. Single-stranded splint strands and methods of use
CN116064758A (en) * 2022-11-17 2023-05-05 纳昂达(南京)生物科技有限公司 Selective inhibitory cyclization helper sequences and uses
CN116103383B (en) * 2023-04-03 2023-06-20 北京百力格生物科技有限公司 Method for identifying false base of NGS linker oligo and library thereof
CN117867086B (en) * 2024-03-12 2024-06-25 北京雅康博生物科技有限公司 Standard substance for quantitative high-throughput sequencing library and preparation method and application thereof
CN117887812B (en) * 2024-03-14 2024-07-09 北京雅康博生物科技有限公司 Library for high-throughput sequencing quality control, and preparation method and application thereof

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014134166A1 (en) * 2013-02-26 2014-09-04 Axiomx, Inc. Methods for the production of libraries for directed evolution
CN104293938A (en) * 2014-09-30 2015-01-21 天津华大基因科技有限公司 Method for constructing sequencing library and application of sequencing library
CN105463585A (en) * 2014-09-12 2016-04-06 清华大学 Method for constructing sequencing library based on single-stranded DNA molecule, and applications thereof
CN105986324A (en) * 2015-02-11 2016-10-05 深圳华大基因研究院 Construction method and application of cyclic small RNA library

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AR080686A1 (en) * 2011-03-16 2012-05-02 Univ Nac De Tucuman Unt POLYPEPTIDE THAT HAS INDUCTIVE ACTIVITY OF DEFENSE AGAINST BIOTIC STRESS IN PLANTS, SEQUENCE OF NUCLEOTID CODING, MICROORGANISM COMPOSITIONS AND METHODS
CA2936564C (en) * 2014-01-07 2022-10-18 Fundacio Privada Institut De Medicina Predictiva I Personalitzada Del Cancer Methods for generating double stranded dna libraries and sequencing methods for the identification of methylated cytosines

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"DNA测序技术的发展历史与最新进展";解增言等;《生物技术通报》;20101231(第8期);第64-70页 *

Also Published As

Publication number Publication date
CN108203847A (en) 2018-06-26
HK1250759A1 (en) 2019-01-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 1250759; Country of ref document: HK)
CB02 Change of applicant information (Address after: 518083 the comprehensive building of Beishan industrial zone and 11 2 buildings in Yantian District, Shenzhen, Guangdong.; Applicant after: Shenzhen Huada Zhizao Technology Co., Ltd; Address before: 518083 the comprehensive building of Beishan industrial zone and 11 2 buildings in Yantian District, Shenzhen, Guangdong.; Applicant before: MGI TECH Co.,Ltd.)
GR01 Patent grant