CN108203847B - Library, reagent and application for second-generation sequencing quality evaluation

Info

Publication number: CN108203847B
Application number: CN201711330411.7A
Authority: CN
Inventors: 廖莎; 闫东东; 章文蔚; 徐崇钧; 陈奥; 陈莹; 赵杰; 许军强; 傅德丰; 何琳
Original assignee: MGI Tech Co Ltd
Current assignee: MGI Tech Co Ltd
Priority date: 2016-12-16
Filing date: 2017-12-13
Publication date: 2022-01-04
Anticipated expiration: 2037-12-13
Also published as: CN108203847A; HK1250759A1

Abstract

The application discloses a library, a reagent and applications for second-generation sequencing quality evaluation. The library is a single-stranded DNA library of known sequences with different base characteristics, to which a linker sequence and an index sequence are ligated; the single-stranded DNA library includes at least one of a high-AT-content single-stranded DNA, a high-GC-content single-stranded DNA, a poly-structure single-stranded DNA, and a hairpin-structure single-stranded DNA. Because the library is sequenced as known sequences with different base characteristics, the influence of those characteristics on sequencing, and the resulting deviation, can be evaluated, so that sequencing quality is assessed, the deviation is corrected and sequencing is optimized. In the method for improving nucleic acid sequencing accuracy, the sequencing deviation associated with each base characteristic is obtained by comparing the sequencing result with the known sequence, and this deviation guides improvement of the sequencing software algorithm, thereby improving sequencing accuracy; the method can effectively reduce sequencing deviation and provides a simple and effective way to improve sequencing accuracy.

Description

Library, reagent and application for second-generation sequencing quality evaluation

Technical Field

The application relates to the field of nucleic acid sequencing quality evaluation, in particular to a library, a reagent and application for second-generation sequencing quality evaluation.

Background

High-throughput sequencing is a very important technology and plays a crucial role in biological research and clinical application, especially in precision medicine. As second-generation sequencing becomes more important in precision medicine, the requirements for sequencing accuracy rise accordingly. Currently, mainstream second-generation sequencing platforms such as Illumina and Proton can achieve 99.9% accuracy, but the accuracy may fall as the read length increases or as the base composition becomes more complex. For second-generation sequencing to better fulfil its role in precision medicine, the sequencing technology must be continuously improved.

The basic process of the current second-generation sequencing comprises the steps of constructing a sequencing library, amplifying library signals, converting base signals into optical signals which can be identified by a sequencer by virtue of sequencing enzyme, and finally reducing the optical signals into base information by virtue of computer software.

In the above basic sequencing process, sequencing errors are easily introduced at several points, which reduces sequencing accuracy: (1) during library construction, fragmentation can cause base mutations or deletions, and PCR amplification can introduce incorrect base pairing; (2) signal amplification of libraries is also typically performed by PCR, so enzyme fidelity problems can likewise introduce sequencing errors; (3) in the process of converting a base signal into an electric signal, the dNTPs are generally modified dNTPs, so the sequencing enzyme must be modified correspondingly to match them, which affects the fidelity of the sequencing enzyme to some extent and reduces sequencing accuracy; (4) finally, when the data analysis software converts the optical signal into base information, factors such as fluorescence background, impurities and weak signals can also cause signal processing errors.

Generally, the accuracy of second-generation sequencing is verified by Sanger sequencing, a first-generation sequencing method. However, this approach is cumbersome and is not suitable for use during the development of sequencing technology to reduce the errors introduced at each step of the sequencing process.

Disclosure of Invention

The application aims to provide a novel library, a reagent and application for evaluating the quality of next-generation sequencing.

One aspect of the present application discloses a library for quality evaluation of next-generation sequencing, which is a single-stranded DNA library of known sequences with different base characteristics, to which a linker sequence and an index sequence are ligated; wherein the single-stranded DNA library of known sequences with different base characteristics comprises at least one of a high-AT-content single-stranded DNA, a high-GC-content single-stranded DNA, a poly-structure single-stranded DNA, and a hairpin-structure single-stranded DNA.

It should be noted that, when unknown sequences are actually sequenced, various base characteristics can affect sequencing accuracy, such as high AT content, high GC content, poly structures and hairpin structures. The present application therefore uses artificial synthesis to produce a single-stranded DNA library of known sequences carrying these different base characteristics; by comparing the sequencing result with the known sequence, the sequencing deviation of the sequencing platform used can be determined and the quality of second-generation sequencing can be evaluated. Once the deviation is known, it can be corrected in a targeted manner, thereby improving sequencing accuracy.

It is understood that, in addition to second-generation sequencing quality assessment, the library of the present application can, as mentioned above, be used to further correct and optimize second-generation sequencing so as to improve sequencing accuracy or sequencing quality.

It should be noted that, for convenience of use, and to shorten the library construction process and reduce the base errors it introduces, the linker sequence and the index sequence are preferably ligated into the library in advance; that is, the linker sequence and the index sequence are synthesized directly together with the library sequence. This avoids a separate reaction step for adding linker and index sequences to the library. The specific linker and index sequences can follow those of an existing sequencing platform and are not limited herein.

Preferably, the library also has universal primer binding sequences at both ends.

It should be noted that the purpose of the universal primer binding sequences is to allow all libraries of different sequences to be amplified with the same pair of primers. For example, because the six libraries of the present application share the same universal primer binding sequences, a single pair of primers amplifies all six libraries, and a separate primer pair for each library is not required.

Preferably, the library of the present application consists of at least one of the sequence shown by SEQ ID NO.7, the sequence shown by SEQ ID NO.8, the sequence shown by SEQ ID NO.9, the sequence shown by SEQ ID NO.10, the sequence shown by SEQ ID NO.11 and the sequence shown by SEQ ID NO. 12.

It should be noted that the libraries of the sequences shown in SEQ ID NO.7 to SEQ ID NO.12 are only six libraries that, in one implementation of the present application, have been verified to be effective for evaluating and optimizing second-generation sequencing quality; following the guidance of the present application, one skilled in the art can also synthesize further libraries for quality evaluation or optimization of second-generation sequencing.

In yet another aspect of the present application, a cloning vector is disclosed, the cloning vector comprising a plasmid and an insert, wherein the insert comprises a library of the present application.

Preferably, the plasmid is pMD18-T or pMD19-T.

In a preferred embodiment of the present invention, the synthesized library sequence is inserted into a plasmid, so that after a single synthesis the library sequence can be replicated indefinitely.

In another aspect, the present invention discloses an engineered bacterium comprising a recipient bacterium and the cloning vector of the present invention introduced and stored in the recipient bacterium.

Preferably, the recipient bacterium employed herein is E. coli.

It should be noted that, once the library is cloned into a plasmid, the single-stranded DNA library needs to be synthesized only once and can then be used indefinitely, which reduces the cost of sequence synthesis. In subsequent use, the required library is obtained simply by culturing the engineered bacteria and extracting the plasmid. Moreover, the library sequence already contains the sequencing linker used by the corresponding sequencing platform, so sequencing can be carried out after simple library construction. The whole process is simple and convenient, and highly stable.

In yet another aspect, the present application discloses a reagent for quality evaluation of second generation sequencing, the reagent comprising the library of the present application, the cloning vector of the present application, or the engineered bacterium of the present application.

The library, the cloning vector and the engineering bacteria can be used for evaluating the quality of second-generation sequencing, or can be used for correcting and optimizing second-generation sequencing so as to improve the sequencing accuracy or the sequencing quality; therefore, any one of them can be prepared into a kit for convenient use.

Preferably, the reagent of the present application further comprises a universal primer, wherein the upstream primer of the universal primer is a sequence shown in SEQ ID No.13, and the downstream primer is a sequence shown in SEQ ID No. 14.

It should be noted that the universal primers are designed for the universal primer binding sequences at both ends of the library, and the library or cloning vector can be amplified to obtain the library sequences. For ease of use, the universal primers are included in the kits of the present application as a separate package.

It should also be noted that for cloning vectors, such as pMD18-T or pMD19-T, which have plasmid amplification primers themselves, or can be designed for plasmids to amplify different inserts simultaneously, there is no need to design universal primer binding sequences at both ends of the library, and the plasmid amplification primers can be used directly for library amplification or sequencing, and there is no need for separate universal primers for the sequence shown in SEQ ID NO.13 and the sequence shown in SEQ ID NO. 14. The specific manner in which this is done is not limited herein.

More preferably, the reagent of the present application further comprises a splint oligo having a sequence shown in SEQ ID NO. 15.

It should be noted that the splint oligo serves to circularize the library DNA. In one implementation of the present application, sequencing is performed using DNA nanoball (DNB) technology, so the library must be circularized. It is understood that the splint oligo may be omitted if DNA nanoball technology is not used; this is not particularly limited herein.

The application also discloses applications of the library, the cloning vector, the engineering bacteria or the reagent in the evaluation of the relation between the basic group and the sequencing quality, the evaluation of the preference and the accuracy of the basic group of the amplification enzyme, the evaluation of the accuracy of the sequencing enzyme, the extraction evaluation or improvement of the basic group signal, the detection of the accuracy of the second generation sequencing, the detection of the error rate of each link from the library construction to the sequencing or the optimization of each link from the library construction to the sequencing.

The library of the application, based on the cloning vector, the engineering bacteria and the reagent of the library of the application, can be used for performing quality evaluation on second-generation sequencing; the principle is to compare and analyze the deviation between the sequencing result and the known library sequence, and the deviation can be used for evaluating the sequencing quality, evaluating the accuracy of the amplification enzyme and the sequencing enzyme or carrying out optimization based on the deviation. It is understood that, based on the above principle, the library, cloning vector, engineering bacteria, reagent, etc. of the present application can evaluate, detect, and optimize each step of the second generation sequencing process, which is not limited herein.

The application also discloses a method for improving the accuracy of nucleic acid sequencing, which comprises sequencing a single-stranded DNA library of known sequences with different base characteristics, comparing the sequencing result with the known sequences, statistically analyzing the sequencing deviation associated with the different base characteristics, and correcting the sequencing software algorithm according to that deviation, thereby improving the accuracy of nucleic acid sequencing; the single-stranded DNA library of known sequences with different base characteristics includes at least one of a high-AT-content single-stranded DNA, a high-GC-content single-stranded DNA, a poly-structure single-stranded DNA, and a hairpin-structure single-stranded DNA.

Preferably, the poly-structure single-stranded DNA includes at least one of poly A-structure single-stranded DNA, poly T-structure single-stranded DNA, poly G-structure single-stranded DNA, and poly C-structure single-stranded DNA.

Preferably, the single-stranded DNA library is the library of the present application.
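The comparison step of this method can be illustrated with a minimal sketch. It assumes that every read starts at the first base of the known sequence and contains no insertions or deletions, which a real pipeline would handle by alignment; the function names and the toy reads are hypothetical and are not taken from the application.

```python
from collections import Counter


def per_position_error(reads, reference):
    """Return the mismatch fraction at each position of the known sequence.

    Assumption: each read is already aligned to position 0 of the reference
    and has no indels.
    """
    mismatches = [0] * len(reference)
    coverage = [0] * len(reference)
    for read in reads:
        for i, (called, expected) in enumerate(zip(read, reference)):
            coverage[i] += 1
            if called != expected:
                mismatches[i] += 1
    return [m / c if c else 0.0 for m, c in zip(mismatches, coverage)]


def substitution_counts(reads, reference):
    """Count (expected base, called base) pairs for every mismatch."""
    counts = Counter()
    for read in reads:
        for called, expected in zip(read, reference):
            if called != expected:
                counts[(expected, called)] += 1
    return counts


# Toy data for illustration only.
reference = "GATCGGA"
reads = ["GATCGGA", "GATCGGC", "GATTGGA", "GATCGGC"]
rates = per_position_error(reads, reference)
substitutions = substitution_counts(reads, reference)
```

Positions with a high mismatch fraction, or substitution pairs that occur unusually often, are the kind of deviation that could then be examined against the base characteristics of the library.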

It should be noted that the method for improving the accuracy of nucleic acid sequencing is actually based on the library of the present application, and according to the principle of the present application, the quality evaluation is performed on the second-generation sequencing, so as to optimize and improve the accuracy of sequencing. Based on the same principle, on the basis of the method for improving the nucleic acid sequencing accuracy, the method for evaluating the nucleic acid sequencing quality, the method for evaluating the relation between the base and the sequencing quality, the method for evaluating the base preference and the accuracy of the amplification enzyme, the method for evaluating the accuracy of the sequencing enzyme, the method for extracting, evaluating or improving the base signal, the method for detecting the accuracy of the second-generation sequencing, the method for detecting the error rate of each link from library construction to sequencing, the method for optimizing the scheme of each link from library construction to sequencing and the like can be provided, and the method is not particularly limited.

It should be noted that the method of the present application can improve the accuracy of nucleic acid sequencing, and likewise, the method of the present application can also be used to evaluate the base bias and accuracy of the amplification enzyme, for example, by comparing the sequencing results of the single-stranded DNA library before and after amplification with the amplification enzyme, the influence of the amplification enzyme on the sequencing bias can be analyzed, so as to achieve the purpose of evaluating the accuracy of the amplification enzyme, and by analyzing the specific type of the sequencing bias, the base bias of the amplification enzyme can be known. The principle of the accuracy evaluation of the sequencing enzyme is similar. In addition, the method can improve the accuracy of nucleic acid sequencing, and the key point is that after the sequencing result is compared and analyzed with the known sequence, a sequencing software algorithm is corrected, wherein the sequencing software algorithm comprises the processing of base signal extraction, so that the method can be applied to improving or evaluating the base signal extraction.

Due to the adoption of the technical scheme, the beneficial effects of the application are as follows:

The library of the present application is designed with a variety of base characteristics of controllable structure. Because sequencing is performed on sequences with known base characteristics, the influence of different base characteristics on second-generation sequencing, and the resulting deviation, can be evaluated, so that the quality of second-generation sequencing is assessed, the deviations are corrected in a targeted manner, and second-generation sequencing is thereby optimized. The method for improving nucleic acid sequencing accuracy makes use of the base characteristics of the library: by comparing the sequencing result with the known library sequence, it obtains the sequencing deviation associated with different base characteristics, which guides improvement of the sequencing software algorithm and thus improves sequencing accuracy. The method can effectively reduce sequencing deviation and provides a simple and effective way to improve sequencing accuracy.

Drawings

FIG. 1 is a diagram showing the sequencing results of the first 50bp of a sequence library shown in SEQ ID NO.7 in the example of the present application;

FIG. 2 is a diagram showing the sequencing results of the first 50bp of a sequence library shown in SEQ ID NO.8 in the example of the present application;

FIG. 3 is a diagram showing the sequencing results of the first 50bp of a sequence library shown in SEQ ID NO.9 in the example of the present application;

FIG. 4 is a diagram showing the sequencing results of the first 50bp of a sequence library shown in SEQ ID NO.10 in the example of the present application;

FIG. 5 is a diagram showing the sequencing results of the first 50bp of a sequence library shown in SEQ ID NO.11 in the example of the present application;

FIG. 6 is a diagram showing the sequencing results of the first 50bp of the sequence library shown in SEQ ID NO.12 in the example of the present application;

FIG. 7 is a Q30 profile of a high GC library in an example of the present application.

Detailed Description

Extensive experiments and research show that, in actual sequencing, the complexity of the base content of the sequencing target is an important factor affecting the quality of next-generation sequencing. For example, for a sequence with an even distribution of AT and GC and few poly structures or hairpin structures, both Illumina and Proton can reach 99.9% accuracy; however, for sequences with high AT content, high GC content, or many poly structures and hairpin structures, sequencing accuracy drops considerably and may not even meet the requirements of accurate sequencing in precision medicine.

For this reason, the present application creatively proposes and develops a single-stranded DNA library having known sequences with different base characteristics, wherein the sequences include various base characteristics specially designed, including high AT content, high GC content, poly structure, hairpin structure, etc.; in one implementation of the present application, there are six single-stranded DNAs of the sequences shown in SEQ ID No.7 to SEQ ID No. 12; by adopting the library designed by the application, the known sequence with specific base characteristics is subjected to second-generation sequencing, and the deviation between the sequencing result and the known library sequence is analyzed and compared, so that the accuracy or the sequencing quality of the second-generation sequencing under various base characteristics is analyzed, the deviation obtained by analysis is corrected, and then the second-generation sequencing is optimized.

Before constructing the library of the present application, a set of standard nucleic acids is designed in advance, and these nucleic acids contain various base characteristics required by the library of the present application, and then a part or all of the sequences of the set of standard nucleic acids are selected for library construction. In one implementation of the present application, the standard nucleic acid consists of at least one of six single-stranded DNAs; the sequences of the six single-stranded DNAs are sequentially a sequence shown by SEQ ID NO.1, a sequence shown by SEQ ID NO.2, a sequence shown by SEQ ID NO.3, a sequence shown by SEQ ID NO.4, a sequence shown by SEQ ID NO.5 and a sequence shown by SEQ ID NO. 6. The libraries of sequences shown in SEQ ID NO.7 to 12 in the present application correspond in sequence to the standard nucleic acids of the sequences shown in SEQ ID NO.1 to 6 in the present application.

The present application is described in further detail below with reference to specific examples. The following examples are intended to be illustrative of the present application only and should not be construed as limiting the present application.

Examples

In this example, a set of standard nucleic acid sequences containing base features such as high AT content, high GC content, poly structures and hairpin structures was first designed; a library was then designed from the standard nucleic acid sequences, and a BGISEQ linker sequence, an index sequence and universal primer binding sequences were added to the library sequences. The designed library sequences were artificially synthesized, inserted into the pMD19-T plasmid, and introduced into Escherichia coli to prepare the engineered bacteria. Plasmids were extracted from the engineered bacteria to obtain the library sequences, which were used for next-generation sequencing and evaluation of sequencing quality. The details are as follows:

First, design of standard nucleic acids

In this example, six standard nucleic acid sequences were designed on the basis of base features commonly encountered in actual sequencing, such as high AT content, high GC content, poly structures and hairpin structures, and a different index sequence was used for each standard nucleic acid sequence. Details are shown in Table 1.

TABLE 1 sequences of standard nucleic acids

The six standard nucleic acid sequences of this example include two high GC sequences, two high AT sequences, and two random sequences; the random sequences are ordinary sequences with similar proportions of A, C, G and T and are used for comparative analysis. Each standard nucleic acid sequence corresponds to one index sequence, i.e., a barcode sequence, which distinguishes the different sequences. The two high GC sequences and the two high AT sequences contain hairpin structures and poly structures.
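Two of the base features named above, GC content and poly (homopolymer) runs, are easy to quantify for any candidate sequence. The sketch below is an illustration only: the function names are hypothetical, the input is assumed to be a non-empty string of A, C, G and T, and hairpin detection is not covered.

```python
from itertools import groupby


def gc_content(seq):
    """Return the fraction of G and C bases in a non-empty sequence."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)


def longest_homopolymer(seq):
    """Return (base, length) of the longest run of a single repeated base."""
    runs = ((base, sum(1 for _ in group)) for base, group in groupby(seq.upper()))
    return max(runs, key=lambda run: run[1])


# First 60 bases of SEQ ID NO.1 as printed in the sequence listing.
seq1_head = "tacaactacagataatgggctggatacatggaatgattatagatatattaaggaataatg"
print(f"GC fraction: {gc_content(seq1_head):.2f}")
print("longest run:", longest_homopolymer(seq1_head))
```

A design workflow could use such measures to confirm that a synthesized sequence really falls into the intended high AT, high GC or random class before it is added to the library.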

Second, library sequence design and construction

The greater part of each of the six standard nucleic acid sequences designed in this example was selected to construct a library; a linker sequence suitable for BGISEQ was inserted into each library sequence, and the same universal primer binding sequences were ligated to both ends of each. The library sequences designed from the six single-stranded DNA standard nucleic acid sequences shown in SEQ ID NO.1 to SEQ ID NO.6 are, in order, the sequences shown in SEQ ID NO.7 to SEQ ID NO.12.

SEQ ID NO.7：

5’-GATATCTGCAGGCATAGAATGAATATTATTGAATCAATAATTAAAGTCGGAGGCCAAGCGGTCTTAGGAAGACAAACTAGTACGTCAACTCCTTGGCTCACAGAACGACATGGCTACGATCCGACTTTACAACTACAGATAATGGGCTGGATACATGGAATGATTATAGATATATTAAGGAATAATGTTAATTAATGCCTAAATTAATTAATCTAAGGGGGTTAATACTTCAGCCTGTGATATC-3’；

SEQ ID NO.8：

5’-GATATCTGCAGGCATGAATAATAATGGAATAGCAATAATTAAAGTCGGAGGCCAAGCGGTCTTAGGAAGACAACGATCAGTACCAACTCCTTGGCTCACAGAACGACATGGCTACGATCCGACTTATATAATGTAATACATAATATTAATATATTAATTATTGTATGATTGTTATCTATTACAGTCTAGTACTGACCCGTAGACATATATGCCCCCGATTAATTACTTATCAGCCTGTGATATC-3’；

SEQ ID NO.9：

5’-GATATCTGCAGGCATCGGCCGCGGCGTCCAGTGCGCGGCGCTAGAGCCGGCAAGTCGGAGGCCAAGCGGTCTTAGGAAGACAACGCTATGTACCAACTCCTTGGCTCACAGAACGACATGGCTACGATCCGACTTCCGCCGCGGTCGCTTGTCCGGCCGCCGGTCCGGCGCCGGCGGCGCAAAGTGCCAGGCCGAGCCGGCGAACCAGCGGTCCGAAAAACACGGACACTCAGCCTGTGATATC-3’；

SEQ ID NO.10：

5’-GATATCTGCAGGCATCACCGCCGAGGCCGCGGCGGAGACCGCCGGCGCAGGAAGTCGGAGGCCAAGCGGTCTTAGGAAGACAACAGAGTGTACCAACTCCTTGGCTCACAGAACGACATGGCTACGATCCGACTTCAAACTACCGGCGCGGCGCTCCTCCGGCCGTCCGCCGCCGACCGGCGGCGGCGTTCCGGTGTGGCACTCCAGGTGGCCGGTTCTCTGCCAAGCGTCAGCCTGTGATATC-3’；

SEQ ID NO.11：

5’-GATATCTGCAGGCATGAAGAACAACCCCGCACGACGCCTACCAACCAAGTCGGAGGCCAAGCGGTCTTAGGAAGACAACTGTATCGTACAACTCCTTGGCTCACAGAACGACATGGCTACGATCCGACTTGCTGTTCGCGGCCGATGTTCGTATAAGATATAAGTTTGGGTATATTCCAGTTTATCGATCGTATCGAAATGTATGAGTTTATACAGGTCCTACTTCAACTCAGCCTGTGATATC-3’；

SEQ ID NO.12：

5’-GATATCTGCAGGCATACTAGACCAGTTCATTATTATAGTGCTAGCCAAAGTCGGAGGCCAAGCGGTCTTAGGAAGACAAACATCAACGTCAACTCCTTGGCTCACAGAACGACATGGCTACGATCCGACTTGACGGATTCCCTCGCTTTCTATTGGCTGACAGTACAAGTAACATAGGTTGGGTCGGTTAACCCTGCCGTCACAAGTGGAACGATGTTAATAGTTGCGGTCAGCCTGTGATATC-3’；

In the above six library sequences, "GATATCTGCAGGCAT" is the universal primer binding sequence at the 5' end and "TCAGCCTGTGATATC" is the universal primer binding sequence at the 3' end; the universal primers are designed against these two sequences. "AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAANNNNNNNNNNCAACTCCTTGGCTCACAGAACGACATGGCTACGATCCGACTT" is the linker sequence containing the index sequence, where "NNNNNNNNNN" is the 10 bp index sequence. The index sequence of the library shown in SEQ ID NO.7 is "ACTAGTACGT", that of the library shown in SEQ ID NO.8 is "CGATCAGTAC", that of the library shown in SEQ ID NO.9 is "CGCTATGTAC", that of the library shown in SEQ ID NO.10 is "CAGAGTGTAC", that of the library shown in SEQ ID NO.11 is "CTGTATCGTA", and that of the library shown in SEQ ID NO.12 is "ACATCAACGT".
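Because the index sits between two fixed linker segments, it can be recovered from a library sequence by locating those segments. The following sketch uses the linker and index sequences given above; the function name and the short test fragment (the region of SEQ ID NO.8 around its linker) are chosen here for illustration.

```python
# Fixed linker segments on either side of the 10 bp index.
LINKER_LEFT = "AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA"
LINKER_RIGHT = "CAACTCCTTGGCTCACAGAACGACATGGCTACGATCCGACTT"
INDEX_LENGTH = 10

INDEX_TO_LIBRARY = {
    "ACTAGTACGT": "SEQ ID NO.7",
    "CGATCAGTAC": "SEQ ID NO.8",
    "CGCTATGTAC": "SEQ ID NO.9",
    "CAGAGTGTAC": "SEQ ID NO.10",
    "CTGTATCGTA": "SEQ ID NO.11",
    "ACATCAACGT": "SEQ ID NO.12",
}


def extract_index(library_seq):
    """Return the 10 bp index flanked by the linker segments, or None."""
    start = library_seq.find(LINKER_LEFT)
    if start == -1:
        return None
    index_start = start + len(LINKER_LEFT)
    index_end = index_start + INDEX_LENGTH
    # The right-hand linker segment must follow the index immediately.
    if not library_seq.startswith(LINKER_RIGHT, index_end):
        return None
    return library_seq[index_start:index_end]


# Fragment of SEQ ID NO.8 spanning the linker, with a few flanking bases.
fragment = "CAATAATTA" + LINKER_LEFT + "CGATCAGTAC" + LINKER_RIGHT + "ATATAATG"
index = extract_index(fragment)
library = INDEX_TO_LIBRARY.get(index)
```

The same lookup table is what allows pooled reads to be assigned back to their library of origin.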

In the universal primer, the upstream primer is a sequence shown by SEQ ID NO.13, and the downstream primer is a sequence shown by SEQ ID NO. 14;

SEQ ID NO.13：5’-GATATCTGCAGGCAT-3’；

SEQ ID NO.14：5’-GATATCACAGGCTGA-3’。

In this example, sequencing is performed using DNA nanoball (DNB) technology, which requires the library DNA to be circularized; a splint oligo having the sequence shown in SEQ ID NO.15 was therefore designed;

SEQ ID NO.15：5’-ATGCCTGCAGATATCGATATCACAGGCTGA-3’。
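The relationship between the primer binding sequences, the universal primers and the splint oligo can be checked directly from the sequences listed above. The sketch below shows that the downstream primer is the reverse complement of the 3' binding sequence and that the splint oligo is the reverse complement of the junction formed when the 3' end of a linear library strand is brought next to its 5' end; the helper name is hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


BINDING_5P = "GATATCTGCAGGCAT"  # 5' universal primer binding sequence
BINDING_3P = "TCAGCCTGTGATATC"  # 3' universal primer binding sequence
UPSTREAM_PRIMER = "GATATCTGCAGGCAT"  # SEQ ID NO.13
DOWNSTREAM_PRIMER = "GATATCACAGGCTGA"  # SEQ ID NO.14
SPLINT_OLIGO = "ATGCCTGCAGATATCGATATCACAGGCTGA"  # SEQ ID NO.15

# In the circle, the 3' end of the strand is joined to its own 5' end.
junction = BINDING_3P + BINDING_5P

print(UPSTREAM_PRIMER == BINDING_5P)
print(DOWNSTREAM_PRIMER == reverse_complement(BINDING_3P))
print(SPLINT_OLIGO == reverse_complement(junction))
```

All three comparisons hold for the sequences as printed, which is consistent with the splint oligo bridging the two ends of the single strand so that they can be ligated into a circle.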

The libraries of the sequences shown in SEQ ID NO.7 to SEQ ID NO.12 of this example, as well as the universal primers and the splint oligo, were all synthesized by Shanghai.

Third, cloning vector and engineering bacterium construction

The synthesized library sequences were cloned, and the cloning vector was introduced into E. coli. Construction of the cloning vector and the engineered bacteria was carried out by Nanjing Kinsley.

Fourth, library acquisition

The preserved engineered bacteria were cultured overnight in LB medium at 37 °C, and the plasmid was extracted with a Thermo Fisher plasmid extraction kit according to the kit instructions. The extracted plasmid was amplified by PCR with the universal primers, and the PCR amplification product can be used directly for sequencing after circularization.

1. Plasmid extraction

Plasmid extraction in this example used a plasmid extraction kit; the extraction procedure follows the kit instructions and is not repeated herein.

2. PCR amplification

The PCR amplification system was 100 μL, comprising: 20 μL of 5× Hi-Fi enzyme reaction solution, 5 μL of dNTP mix (10 mM each), 1 μL of Hi-Fi enzyme (1 U/μL), 6 μL of 20 μM upstream primer, 6 μL of 20 μM downstream primer, 1 μL of extracted plasmid template, and 61 μL of ddH₂O, for a total of 100 μL.
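As a quick consistency check, the component volumes can be summed and the final concentrations derived from the stock concentrations stated above. The dictionary keys and the helper name are illustrative only.

```python
# Component volumes in microlitres, as listed for the 100 uL reaction.
REACTION_UL = {
    "5x Hi-Fi enzyme reaction solution": 20,
    "dNTP mix (10 mM each)": 5,
    "Hi-Fi enzyme (1 U/uL)": 1,
    "upstream primer (20 uM)": 6,
    "downstream primer (20 uM)": 6,
    "plasmid template": 1,
    "ddH2O": 61,
}

total_ul = sum(REACTION_UL.values())


def final_concentration(stock, volume_ul, total_volume_ul):
    """Dilution: final = stock * (volume added / total volume)."""
    return stock * volume_ul / total_volume_ul


primer_final_um = final_concentration(20, 6, total_ul)  # each primer, in uM
dntp_final_mm = final_concentration(10, 5, total_ul)    # each dNTP, in mM
buffer_final_x = final_concentration(5, 20, total_ul)   # buffer strength, in x
enzyme_units = 1 * REACTION_UL["Hi-Fi enzyme (1 U/uL)"]  # units per reaction

print(total_ul, primer_final_um, dntp_final_mm, buffer_final_x, enzyme_units)
```

The volumes add up to the stated 100 μL, giving each primer at 1.2 μM, each dNTP at 0.5 mM, and the reaction solution at 1× in the final mix.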

The PCR amplification conditions were: 98 °C for 3 min; followed by 33 cycles of 98 °C for 20 s, 60 °C for 15 s and 72 °C for 30 s; after cycling, 72 °C for 5 min, then hold at 4 °C.

3. Circularization of PCR amplification product
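The programmed incubation time of this protocol follows from simple arithmetic. The sketch below reads the last step as a 5 min incubation at 72 °C before the 4 °C hold and ignores temperature ramp times, which depend on the instrument; the constant and function names are illustrative.

```python
INITIAL_DENATURATION_S = 3 * 60  # 98 C for 3 min
CYCLES = 33
CYCLE_STEPS_S = {
    "denaturation, 98 C": 20,
    "annealing, 60 C": 15,
    "extension, 72 C": 30,
}
FINAL_EXTENSION_S = 5 * 60  # 72 C for 5 min


def programmed_time_seconds():
    """Sum of programmed hold times, excluding ramping and the 4 C hold."""
    per_cycle = sum(CYCLE_STEPS_S.values())
    return INITIAL_DENATURATION_S + CYCLES * per_cycle + FINAL_EXTENSION_S


total_s = programmed_time_seconds()
print(f"{total_s} s = {total_s / 60:.2f} min")
```

Under these assumptions the programmed time is 2625 s, a little under 44 minutes before ramping is added.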

In this example, the PCR amplification products were purified with magnetic beads, and the purified products were then circularized according to the BGISEQ500 SE50 circularization library construction kit and procedure. The specific circularization steps are described in the kit instructions and are not repeated herein.

Fifth, library sequencing detection and sequencing accuracy detection

To verify that the synthesized libraries of known sequence can be sequenced on the BGISEQ platform, the six libraries of the sequences shown in SEQ ID NO.7 to SEQ ID NO.12 obtained in this example were subjected to SE50+10 sequencing verification using the BGISEQ500 SE50 kit.

DNBs were prepared from the circularization products of the six libraries according to the BGISEQ500 operating procedure. Then 15 μL of each prepared DNB was taken and pooled into a 90 μL DNB system, the chip was prepared according to the standard procedure, and the SE50+10 sequencing mode was selected for sequencing.

Sequencing results show that the reads from the six libraries of the sequences shown in SEQ ID NO.7 to SEQ ID NO.12 can be distinguished by their index sequences, and that the first 50 bp of the sequencing result for each of the six libraries is identical to the actual standard nucleic acid sequence. The first 50 bp results are shown in FIGS. 1 to 6, which correspond in order to the libraries of the sequences shown in SEQ ID NO.7 to SEQ ID NO.12. This indicates that library construction was successful and that base calling by the algorithm is accurate.

Sixth, evaluation of sequencing quality

In order to compare the relationship between sequencing quality and bases, SE100 sequencing was performed on the library of the sequence shown in SEQ ID NO.7, which has a high AT content (the high AT library for short), and the library of the sequence shown in SEQ ID NO.9, which has a high GC content (the high GC library for short), using the BGISEQ500 SE100+10 sequencing kit.

DNB preparation and chip preparation were the same as in "Fifth, library sequencing detection and sequencing accuracy detection", except that only the library of the sequence shown in SEQ ID NO.7 and the library of the sequence shown in SEQ ID NO.9 were prepared and sequenced on the instrument in SE100 mode.

The sequencing quality of the two libraries was analyzed and compared. As shown in Table 2, the high GC library of the sequence shown in SEQ ID NO.9 had a lower Q30 and a higher error rate than the high AT library of the sequence shown in SEQ ID NO.7. Targeted optimization for GC-rich libraries can therefore be carried out in subsequent improvement of the sequencing technology.

Table 2. Comparison of sequencing quality of the two libraries

Name	PredQual	GC content %	Q10 %	Q20 %	Q30 %	EsErr %
High AT library	33	27.05	99.16	98.02	91.44	0.23
High GC library	33	75.47	98.16	94.18	85.05	0.68
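For reference, under the standard Phred convention a quality value Q corresponds to an error probability of 10^(-Q/10), so that Q30 means an estimated error of 1 in 1000. A minimal sketch of how summary statistics of this kind could be computed from per-base quality values is given below. The function names are hypothetical, and averaging the per-base error probabilities is one common way of estimating an error rate; it is assumed here and is not stated in the application to be how the EsErr column was obtained.

```python
from typing import Sequence


def percent_at_least(quals: Sequence[int], threshold: int) -> float:
    """Percentage of bases whose Phred quality is >= threshold (30 for Q30)."""
    if not quals:
        raise ValueError("no quality values given")
    return 100.0 * sum(q >= threshold for q in quals) / len(quals)


def estimated_error_percent(quals: Sequence[int]) -> float:
    """Mean per-base error probability in percent, using p = 10 ** (-Q / 10)."""
    if not quals:
        raise ValueError("no quality values given")
    return 100.0 * sum(10 ** (-q / 10) for q in quals) / len(quals)


# Toy quality values, not taken from the application.
toy_quals = [10, 20, 30, 40]
print(percent_at_least(toy_quals, 30))
print(estimated_error_percent(toy_quals))
```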

In addition, the relationship between bases and quality values was analyzed further. FIG. 7 shows the Q30 distribution of the high GC library. Q30 drops markedly at 60 bp, 68 bp, 81 bp, 91 bp and 97 bp, and the corresponding positions in the sequence share a common feature: where base G is followed by A, the sequencing quality of the A deteriorates. This provides a direction for subsequent optimization of the sequencing technology.
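The kind of comparison described above, between positions where Q30 drops and positions where an A directly follows a G, can be sketched as follows. The function names, the drop threshold and the toy data are hypothetical, and the sketch assumes that the read cycle number and the sequence position use the same 1-based coordinate, which this section does not specify.

```python
from typing import List, Sequence


def a_after_g_positions(sequence: str) -> List[int]:
    """1-based positions of every A that directly follows a G."""
    seq = sequence.lower()
    return [
        i + 1
        for i in range(1, len(seq))
        if seq[i - 1] == "g" and seq[i] == "a"
    ]


def q30_drop_positions(q30_by_cycle: Sequence[float], min_drop: float) -> List[int]:
    """1-based cycles where Q30 falls by at least min_drop versus the previous cycle."""
    return [
        i + 1
        for i in range(1, len(q30_by_cycle))
        if q30_by_cycle[i - 1] - q30_by_cycle[i] >= min_drop
    ]


# Toy data, not taken from FIG. 7.
toy_seq = "ggacga"
toy_q30 = [95.0, 94.0, 80.0, 93.0, 92.0, 78.0]

ga_sites = a_after_g_positions(toy_seq)
drops = q30_drop_positions(toy_q30, min_drop=10.0)
shared = sorted(set(ga_sites) & set(drops))
print(ga_sites, drops, shared)
```

If most of the drop positions also appear among the G-followed-by-A positions, the data would be consistent with the sequence-context effect noted here.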

Therefore, the standard nucleic acids and the libraries based on them can be used in second-generation sequencing to evaluate the base preference and accuracy of sequencing, to detect the accuracy of second-generation sequencing and to evaluate its quality. Targeted optimization based on the sequencing results and the analysis of base characteristics in turn improves the accuracy of nucleic acid sequencing.
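As an illustration of how a sequencing result might be compared with a known sequence to obtain deviations associated with base characteristics, the following sketch tallies the mismatch rate for each pair of adjacent reference bases. It is a simplified example and not the algorithm of the application: the function name is hypothetical, and every read is assumed to be aligned to the reference from its first base, to have the same length as the reference, and to contain no insertions or deletions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def mismatch_rate_by_context(
    reference: str, reads: Iterable[str]
) -> Dict[Tuple[str, str], float]:
    """Mismatch rate per (previous base, current base) context of the reference."""
    ref = reference.lower()
    totals = defaultdict(int)
    errors = defaultdict(int)
    for read in reads:
        read = read.lower()
        if len(read) != len(ref):
            raise ValueError("read length differs from reference length")
        # Position 0 has no preceding base, so counting starts at position 1.
        for i in range(1, len(ref)):
            context = (ref[i - 1], ref[i])
            totals[context] += 1
            if read[i] != ref[i]:
                errors[context] += 1
    return {context: errors[context] / totals[context] for context in totals}


# Toy reference and reads, not taken from the application.
rates = mismatch_rate_by_context("gaga", ["gaga", "gcga"])
print(rates)
```

Contexts with a mismatch rate well above the others would be candidates for the targeted optimization mentioned above.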

The foregoing is a more detailed description of the present application in connection with specific embodiments, and the present application is not intended to be limited to those embodiments. It will be apparent to those skilled in the art that a number of simple derivations or substitutions can be made without departing from the spirit of the disclosure.

SEQUENCE LISTING

<110> Shenzhen Huashengshengsciences institute

<120> library, reagent and application for second-generation sequencing quality evaluation

<130> 17I25566-A23542

<160> 15

<170> PatentIn version 3.3

<210> 1

<211> 150

<212> DNA

<213> Artificial sequence

<400> 1

tacaactaca gataatgggc tggatacatg gaatgattat agatatatta aggaataatg 60

ttaattaatg cctaaattaa ttaatctaag ggggttaata ctatgtgtta attaatctta 120

ttagaatgaa tattattgaa tcaataatta 150

<210> 2

<211> 150

<212> DNA

<213> Artificial sequence

<400> 2

atataatgta atacataata ttaatatatt aattattgta tgattgatat ctattacagt 60

ctagtactga cccgtagaca tatatgcccc cgattaatta cttaggctta ttaataatat 120

ataggaataa taatggaata gcaataatta 150

<210> 3

<211> 150

<212> DNA

<213> Artificial sequence

<400> 3

ccgccgcggt cgcttgtccg gccgccggtc cggcgccggc ggcgcaaagt gccaggccga 60

gccggcgaac cagcggtccg aaaaacacgg acacggtaac ctcaccacga tggccggccg 120

cggcgtccag tgcgcggcgc tagagccggc 150

<210> 4

<211> 150

<212> DNA

<213> Artificial sequence

<400> 4

caaactaccg gcgcggcgct cctccggccg tccgccgccg accggcggcg gcgttccggt 60

gtggcactcc aggtggccgg ttctctgcca agcggcaggc gaaaaatcga cggccaccgc 120

cgaggccgcg gcggagaccg ccggcgcagg 150

<210> 5

<211> 150

<212> DNA

<213> Artificial sequence

<400> 5

gctgttcgcg gccgatgttc gtataagata taagtttggg tatattccag tttatcgatc 60

gtatcgaaat gtatgagttt atacaggtcc tacttcaaca agcggcactt tactaccgtg 120

aagaacaacc ccgcacgacg cctaccaacc 150

<210> 6

<211> 150

<212> DNA

<213> Artificial sequence

<400> 6

gacggattcc ctcgctttct attggctgac agtacaagta acataggttg ggtcggttaa 60

ccctgccgtc acaagtggaa cgatgttaat agttgcggaa ccctatgttc ggcggaatac 120

tagaccagtt cattattata gtgctagcca 150

<210> 7

<211> 244

<212> DNA

<213> Artificial sequence

<400> 7

gatatctgca ggcatagaat gaatattatt gaatcaataa ttaaagtcgg aggccaagcg 60

gtcttaggaa gacaaactag tacgtcaact ccttggctca cagaacgaca tggctacgat 120

ccgactttac aactacagat aatgggctgg atacatggaa tgattataga tatattaagg 180

aataatgtta attaatgcct aaattaatta atctaagggg gttaatactt cagcctgtga 240

tatc 244

<210> 8

<211> 244

<212> DNA

<213> Artificial sequence

<400> 8

gatatctgca ggcatgaata ataatggaat agcaataatt aaagtcggag gccaagcggt 60

cttaggaaga caacgatcag taccaactcc ttggctcaca gaacgacatg gctacgatcc 120

gacttatata atgtaataca taatattaat atattaatta ttgtatgatt gttatctatt 180

acagtctagt actgacccgt agacatatat gcccccgatt aattacttat cagcctgtga 240

tatc 244

<210> 9

<211> 244

<212> DNA

<213> Artificial sequence

<400> 9

gatatctgca ggcatcggcc gcggcgtcca gtgcgcggcg ctagagccgg caagtcggag 60

gccaagcggt cttaggaaga caacgctatg taccaactcc ttggctcaca gaacgacatg 120

gctacgatcc gacttccgcc gcggtcgctt gtccggccgc cggtccggcg ccggcggcgc 180

aaagtgccag gccgagccgg cgaaccagcg gtccgaaaaa cacggacact cagcctgtga 240

tatc 244

<210> 10

<211> 244

<212> DNA

<213> Artificial sequence

<400> 10

gatatctgca ggcatcaccg ccgaggccgc ggcggagacc gccggcgcag gaagtcggag 60

gccaagcggt cttaggaaga caacagagtg taccaactcc ttggctcaca gaacgacatg 120

gctacgatcc gacttcaaac taccggcgcg gcgctcctcc ggccgtccgc cgccgaccgg 180

cggcggcgtt ccggtgtggc actccaggtg gccggttctc tgccaagcgt cagcctgtga 240

tatc 244

<210> 11

<211> 244

<212> DNA

<213> Artificial sequence

<400> 11

gatatctgca ggcatgaaga acaaccccgc acgacgccta ccaaccaagt cggaggccaa 60

gcggtcttag gaagacaact gtatcgtaca actccttggc tcacagaacg acatggctac 120

gatccgactt gctgttcgcg gccgatgttc gtataagata taagtttggg tatattccag 180

tttatcgatc gtatcgaaat gtatgagttt atacaggtcc tacttcaact cagcctgtga 240

tatc 244

<210> 12

<211> 244

<212> DNA

<213> Artificial sequence

<400> 12

gatatctgca ggcatactag accagttcat tattatagtg ctagccaaag tcggaggcca 60

agcggtctta ggaagacaaa catcaacgtc aactccttgg ctcacagaac gacatggcta 120

cgatccgact tgacggattc cctcgctttc tattggctga cagtacaagt aacataggtt 180

gggtcggtta accctgccgt cacaagtgga acgatgttaa tagttgcggt cagcctgtga 240

tatc 244

<210> 13

<211> 15

<212> DNA

<213> Artificial sequence

<400> 13

gatatctgca ggcat 15

<210> 14

<211> 15

<212> DNA

<213> Artificial sequence

<400> 14

gatatcacag gctga 15

<210> 15

<211> 30

<212> DNA

<213> Artificial sequence

<400> 15

atgcctgcag atatcgatat cacaggctga 30
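The relationship between the short oligonucleotides of SEQ ID NO.13 to SEQ ID NO.15 and the ends of the library sequences can be checked with a small script. The sketch below is offered as a consistency check only. The helper name is hypothetical, the sequences are copied from the listing above with spaces removed, and the reading that the sequence of SEQ ID NO.15 bridges the two ends of a library strand is an interpretation of the listed sequences.

```python
_COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of a lowercase or uppercase DNA string."""
    return sequence.lower().translate(_COMPLEMENT)[::-1]


SEQ13 = "gatatctgcaggcat"
SEQ14 = "gatatcacaggctga"
SEQ15 = "atgcctgcagatatcgatatcacaggctga"

# First and last 15 bases of SEQ ID NO.7, copied from the listing.
SEQ7_START = "gatatctgcaggcat"
SEQ7_END = "tcagcctgtgatatc"

# SEQ ID NO.13 equals the 5' end of the library strand.
print(SEQ7_START == SEQ13)
# SEQ ID NO.14 is the reverse complement of the 3' end of the library strand.
print(reverse_complement(SEQ7_END) == SEQ14)
# SEQ ID NO.15 is the reverse complement of the 3' end joined to the 5' end.
print(reverse_complement(SEQ7_END + SEQ7_START) == SEQ15)
```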

Claims

1. A library for quality assessment of next generation sequencing, characterized by: the library is a single-stranded DNA library with known sequences with different base characteristics, and an adapter sequence and an index sequence are connected in the library; the single-stranded DNA library with known sequences with different base characteristics comprises high AT content single-stranded DNA, high GC content single-stranded DNA, poly structure single-stranded DNA and hairpin structure single-stranded DNA; the two ends of the library are provided with universal primer binding sequences;

the high AT content single-stranded DNA refers to single-stranded DNA with the AT content of more than or equal to 72.95 percent;

the high GC content single-stranded DNA refers to a single-stranded DNA having a GC content of 75.47% or more.

2. The library of claim 1, wherein: the single-stranded DNA library consists of a sequence shown by SEQ ID NO.7, a sequence shown by SEQ ID NO.8, a sequence shown by SEQ ID NO.9, a sequence shown by SEQ ID NO.10, a sequence shown by SEQ ID NO.11 and a sequence shown by SEQ ID NO. 12.

3. A cloning vector comprising a plasmid and an insert, characterized in that: the insert comprises the library of claim 1 or 2.

4. The cloning vector of claim 3, wherein: the plasmid is pMD18-T or pMD19-T.

5. An engineered bacterium comprising a recipient bacterium and the cloning vector of claim 3 or 4 introduced and stored in the recipient bacterium.

6. The engineered bacterium of claim 5, wherein: the recipient bacterium is escherichia coli.

7. A reagent for quality assessment of next generation sequencing, characterized by: the reagent comprises the library of claim 1 or 2, the cloning vector of claim 3 or 4, or the engineered bacterium of claim 5 or 6.

8. The reagent according to claim 7, characterized in that: the reagent further comprises primers, wherein the upstream primer sequence is shown as SEQ ID NO.13 and the downstream primer sequence is shown as SEQ ID NO.14.

9. The reagent according to claim 7 or 8, characterized in that: also comprises a splint oligo which is shown as SEQ ID NO. 15.

10. Use of the library of claim 1 or 2, the cloning vector of claim 3 or 4, the engineered bacterium of claim 5 or 6, or the reagent of any one of claims 7 to 9 for assessment of the relationship between bases and sequencing quality, assessment of the base preference and accuracy of an amplification enzyme, assessment of the accuracy of a sequencing enzyme, assessment or improvement of base signal extraction, detection of the accuracy of second-generation sequencing, or detection of the error rate of each step from library construction to sequencing.

11. A method of increasing the accuracy of nucleic acid sequencing, comprising: the method comprises the steps of sequencing a single-stranded DNA library of a known sequence with different base characteristics, comparing a sequencing result with the known sequence, statistically analyzing sequencing deviation existing in the different base characteristics, and correcting a sequencing software algorithm according to the sequencing deviation, so that the nucleic acid sequencing accuracy is improved; the single-stranded DNA library with known sequences with different base characteristics comprises high AT content single-stranded DNA, high GC content single-stranded DNA, poly structure single-stranded DNA and hairpin structure single-stranded DNA;

the poly-structure single-stranded DNA comprises at least one of poly A-structure single-stranded DNA, poly T-structure single-stranded DNA, poly G-structure single-stranded DNA and poly C-structure single-stranded DNA;

12. The method of claim 11, wherein: the single-stranded DNA library is the library of claim 1 or 2.