CN114107290A - Sequencing adapter and sequencing analysis system thereof

Info

Publication number: CN114107290A
Application number: CN202111374708.XA
Authority: CN
Inventors: 欧阳川; 王珺; 周逸文; 王江浩; 刘紫丹
Original assignee: Hangzhou Jieyi Biotechnology Co ltd
Current assignee: Hangzhou Jieyi Biotechnology Co ltd
Priority date: 2021-11-19
Filing date: 2021-11-19
Publication date: 2022-03-01
Also published as: WO2023087527A1

Abstract

The invention relates to the technical field of molecular biology, in particular to nucleic acid sequencing, and specifically to a sequencing adapter and a sequencing analysis system thereof. The sequencing adapter has a partially complementary Y-shaped structure. One strand comprises, in order from 5' to 3': an internal Index sequence, the Index1 sequencing primer binding region sequence, the Index1 sequence, and the P7 sequence that binds the chip probe. The other strand comprises, in order from 5' to 3': the P5 sequence that binds the chip probe, the Read1 sequencing primer binding region sequence, the internal Index sequence, and a T base overhang. The internal Index regions of the two strands are fully complementary, and the sequencing primer binding regions are partially complementary. When adapters carrying internal Index sequences of different lengths and manually balanced base ratios are used in combination, multiple samples can be analyzed while sequencing is still in progress, with high sequencing quality and throughput maintained, which greatly shortens the turnaround time and improves the timeliness of detection.

Description

Sequencing adapter and sequencing analysis system thereof

Technical Field

The invention relates to the technical field of molecular biology, in particular to nucleic acid sequencing, and specifically to a sequencing adapter and a sequencing analysis system thereof.

Background

Since the first clinical case in which leptospirosis was definitively diagnosed by metagenomic next-generation sequencing (mNGS) was published in the New England Journal of Medicine in 2014, mNGS has made considerable progress in identifying new pathogens and diagnosing rare but important pathogens, and its use in critical and severe infections has gained clinical acceptance. In pathogen mNGS, nucleic acid is extracted from a sample taken from the suspected site of infection, and the nucleic acid fragments are ligated to DNA adapters that can hybridize to a sequencing chip. Each adapter contains a tag sequence (Index) that distinguishes different samples. The library is sequenced on a high-throughput sequencer, and the reads are compared against a database containing various pathogens so that the pathogen can be identified quickly. Because the Index tags distinguish the samples, multiple samples can be sequenced in parallel in a single run, which makes full use of the sequencing throughput and reduces cost.

A conventional TruSeq sequencing adapter is shown in FIG. 1a. It has a T base overhang at its end, which pairs with the A base overhang added to the ends of the target fragments in the sample for T-A ligation. The last base of the Read1 sequencing primer is a T, so sequencing begins directly at the insert and the T base itself is not read. After Read1 is finished, the sequencing primer is replaced with the Index sequencing primer to read the tag sequence. For pathogen sequencing, Read1 typically takes about 500 minutes and the Index1 tag takes about 50 minutes. In other words, only after the whole run is complete, about 550 minutes (roughly 9 hours), does the sequencer have the full sequences and the information needed to tell which sample each one belongs to.

In summary, adding the library preparation time (4 hours) to the sequencing time (9-10 hours), about 14 hours elapse from the start of sample preparation to the point at which analysis of each sample can begin. On a sequencer with throughput comparable to the Illumina NextSeq, each run generates approximately 20 G of data, which takes about one hour to analyze. Thus, at least 15 hours are required from the initial sample to the result; the approximate workflow is shown in FIG. 1b. This timeliness is poor and urgently needs improvement.
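The timing figures quoted above can be tallied in a few lines. This is only a back-of-the-envelope sketch: the variable names are illustrative, and the values are the approximate ones stated in the text.

```python
# Approximate durations quoted in the text.
READ1_MIN = 500            # Read1 sequencing, minutes
INDEX1_MIN = 50            # Index1 tag sequencing, minutes
LIBRARY_PREP_H = 4         # library preparation, hours
SEQUENCING_H = (9, 10)     # the text rounds the run to 9-10 hours
ANALYSIS_H = 1             # analysis of about 20 G of data, hours

run_minutes = READ1_MIN + INDEX1_MIN
run_hours = run_minutes / 60

# Hours from the start of sample preparation until analysis can begin.
ready_h = tuple(LIBRARY_PREP_H + s for s in SEQUENCING_H)
# Hours from the start of sample preparation until the result.
result_h = tuple(h + ANALYSIS_H for h in ready_h)

print(f"run: {run_minutes} min = {run_hours:.1f} h")
print(f"ready for analysis after {ready_h[0]}-{ready_h[1]} h")
print(f"result after {result_h[0]}-{result_h[1]} h")
```

With these figures, the upper bounds of the two ranges correspond to the 14 hours and 15 hours quoted in the text.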

Disclosure of Invention

The invention aims to provide a sequencing adapter that allows multiple samples to be sequenced and analyzed while maintaining high sequencing quality and throughput, greatly shortening the turnaround time (TAT) and improving the timeliness of detection.

Analysis of the existing adapter-based detection workflow shows that two time bottlenecks must be addressed to improve timeliness: 1. sequencing is slow and accounts for 50% of the total TAT; 2. analysis takes one hour, and it can start only after sequencing is fully complete, that is, after at least 14 hours, because the data can be split and aligned only once the Index tag of each sample has been read.

In order to achieve the purpose, the invention adopts the following technical scheme:

A sequencing adapter (as shown in FIG. 2) having a partially complementary Y-shaped structure, wherein one strand comprises, in order from 5' to 3': an internal Index sequence, the Index1 sequencing primer binding region sequence, the Index1 sequence, and the P7 sequence that binds the chip probe; and the other strand comprises, in order from 5' to 3': the P5 sequence that binds the chip probe, the Read1 sequencing primer binding region sequence, the internal Index sequence, and a T base overhang. The internal Index regions of the two strands are fully complementary, and the sequencing primer binding regions are partially complementary. An Index2 sequence may also be added between the P5 sequence and the Read1 sequencing primer binding region sequence.

Because an internal Index sequence is added downstream of the Read1 sequencing primer binding region, a new problem arises during sequencing: at one fixed position (the T-A junction), every read yields a T base. As shown in FIG. 3, when adapters with an 8 bp internal Index are used alone, the per-cycle base composition shows a very high proportion of T at the ninth cycle. The fluorescence signal of that single base is then too strong while the other bases give essentially no signal, and the balance among the four bases A/T/C/G is broken. This makes base calling harder for the sequencer: the analysis software judges the base quality at this position to be poor, a large proportion of reads fail quality control, and the yield of usable data drops sharply. For second-generation sequencers, the first few bases are especially important because they are used to locate clusters, so if the whole chip shows the same base in any one of the first ten cycles, both the quality and the quantity of sequencing data are greatly reduced.

To solve this problem, the invention further optimizes the design of the internal Index sequences used to distinguish samples. Internal Index sequences of two to four or more different lengths are designed. The length difference between adjacent longer and shorter internal Index sequences may be one base, two bases, or more, but is preferably one base in order to save sequencing cost and reduce the time spent sequencing the internal Index. In use, adapters with internal Index sequences of different lengths must be combined so that the T bases of the T-A junction do not all fall in the same sequencing cycle. All internal Index sequences used together should also give a substantially balanced base ratio at each position in the sequencing cycle, so that the quality of the first 10 bases is improved as far as possible.

FIG. 4 shows the per-cycle base composition when adapters with three internal Index lengths (6 bp, 7 bp and 8 bp) are mixed so that the junction T is staggered across cycles. As FIG. 4 shows, three cycles have a slightly elevated proportion of T, but the T bases are no longer concentrated in a single cycle, so high-quality sequencing results can be obtained after this optimization.

It is recommended to combine internal Index sequences of at least two, and up to four or more, different lengths to label and sequence multiple samples, and to use the adapters of each internal Index length in balanced proportions. To save sequencing cost and reduce the time spent sequencing the internal Index, the optimal choice is a combination of three internal Index lengths (6 bp, 7 bp and 8 bp), with adapters of each length making up about one third of the total; alternatively, a combination of four lengths (6 bp, 7 bp, 8 bp and 9 bp) is used, with adapters of each length making up about one quarter of the total.

For example, suppose two internal Index lengths are combined, one of 6 bases and one of 7 bases, each used for 50% of the samples. At the seventh cycle, 50% of the reads give T (the T-A junction following the 6-base internal Index) and the other 50% give the seventh base of the 7-base internal Index (which must not be designed as T). At the eighth cycle, 50% of the reads give T (the T-A junction following the 7-base internal Index), while the reads carrying the 6-base Index are already in the insert. From the ninth base onward, all reads are in the insert. With three or four different internal Index lengths, the base ratios are distributed more evenly over the cycles. For example, three internal Index lengths of 6, 7 and 8 bases may be combined, each accounting for 1/3; or four internal Index lengths of 6, 7, 8 and 9 bases may be combined, each accounting for 1/4.
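The staggering effect described above can be worked out directly. The following is a minimal sketch, assuming only that a read starts with the internal Index so that the junction T is read at cycle (Index length + 1). The function name is illustrative, and the result covers the junction T alone, not T bases contributed by Index or insert positions.

```python
from collections import defaultdict

def junction_t_fraction(length_mix):
    """Map sequencing cycle -> fraction of reads showing the T-A junction T.

    length_mix maps internal Index length (bases) to the share of adapters
    with that length. The junction T is read at cycle (length + 1).
    """
    total = sum(length_mix.values())
    per_cycle = defaultdict(float)
    for length, share in length_mix.items():
        per_cycle[length + 1] += share / total
    return dict(per_cycle)

single = junction_t_fraction({8: 1.0})           # 8 bp Index used alone
two = junction_t_fraction({6: 0.5, 7: 0.5})      # 6 bp and 7 bp, 50% each
three = junction_t_fraction({6: 1, 7: 1, 8: 1})  # 6/7/8 bp in equal parts

print(single)
print(two)
print(three)
```

With a single 8 bp Index, the whole junction signal lands on cycle 9; with mixed lengths it is split across consecutive cycles, which is the effect the combination is designed to achieve.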

The length difference between adjacent longer and shorter internal Index sequences may be one base, two bases or more, but is preferably one base, for example a combination of 6, 7 and 8 bases.

It is further preferred in the present invention that all internal Index sequences used, when combined, give a substantially balanced base ratio in each sequencing cycle. Generally, when the number of libraries (or the number of Indexes used) in one sequencing run is 4 or more, the proportion of each of the four bases A, T, C and G in each sequencing cycle of the internal Index sequence is suitably kept between 8% and 50%, and optimally between 12.5% and 37.5%.
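A balance check of this kind might look as follows. This is a sketch under stated assumptions: the four example Index sequences are invented for illustration and are not the ones designed in the patent, and only positions inside the Index are counted (the junction T and insert bases are ignored).

```python
from collections import Counter

def cycle_base_fractions(indexes):
    """Return, per cycle, the A/C/G/T fractions among Indexes that reach it."""
    max_len = max(len(s) for s in indexes)
    table = []
    for pos in range(max_len):
        bases = [s[pos] for s in indexes if len(s) > pos]
        counts = Counter(bases)
        table.append({b: counts[b] / len(bases) for b in "ACGT"})
    return table

def is_balanced(indexes, low=0.08, high=0.50):
    """True if every base fraction at every Index position lies in [low, high]."""
    return all(low <= frac <= high
               for row in cycle_base_fractions(indexes)
               for frac in row.values())

# Hypothetical Index sets, for illustration only.
balanced = ["ACGTAC", "CGTACG", "GTACGT", "TACGTA"]
skewed = ["ACGTAC", "ACGTCA", "AGCTGT", "ATCGTG"]   # every Index starts with A

print(is_balanced(balanced))                 # suitable range, 8% to 50%
print(is_balanced(balanced, 0.125, 0.375))   # optimal range, 12.5% to 37.5%
print(is_balanced(skewed))
```

The default bounds correspond to the suitable range given in the text; passing 0.125 and 0.375 applies the optimal range instead.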

In addition to the above requirements, all internal Index sequences used should also satisfy the following: (1) the minimum Hamming distance between any two internal Index sequences is 3; (2) Index sequences containing three or more identical consecutive bases are excluded; (3) the first two bases of the internal Index are not "GG". In general, the longer the Index, the more distinct Indexes can be formed from A, T, C and G. To provide enough Indexes for multi-sample sequencing while keeping the minimum Hamming distance between any two Index sequences at 3 or more, the internal Index is preferably 6 bases or longer.
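The three design rules above lend themselves to an automated check. The sketch below is illustrative only: the function names and example sequences are hypothetical, and, as an assumption not stated in the text, the Hamming distance is evaluated only between Indexes of equal length.

```python
import re
from itertools import combinations

HOMOPOLYMER = re.compile(r"(.)\1\1")   # three or more identical bases in a row

def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))

def index_problems(indexes, min_distance=3):
    """Return a list of (offender, reason) for every rule violation found."""
    problems = []
    for seq in indexes:
        if HOMOPOLYMER.search(seq):
            problems.append((seq, "homopolymer run of 3 or more"))
        if seq.startswith("GG"):
            problems.append((seq, "starts with GG"))
    # Assumption: only equal-length pairs are compared.
    for a, b in combinations(indexes, 2):
        if len(a) == len(b) and hamming(a, b) < min_distance:
            problems.append(((a, b), "Hamming distance below minimum"))
    return problems

# Hypothetical candidates, for illustration only.
print(index_problems(["ACGTAC", "CGTACG", "GTACGT", "TACGTA"]))
print(index_problems(["GGATCA", "ACCCGT"]))
print(index_problems(["ACGTAC", "ACGTAG"]))
```

An empty list means the candidate set passes all three rules as implemented here; how Indexes of different lengths should be compared is left open by the text and would need to be decided separately.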

Because the way reads are generated has changed, the internal Index is read within the first few cycles after sequencing starts, and each sample can be distinguished from that point on. The sequences of a given sample can therefore be analyzed without waiting for the whole run (9-10 hours) to finish. Moreover, as the number of sequencing cycles increases and the reads grow longer, the invention allows real-time analysis, producing alignment and analysis results for reads of increasing length as sequencing proceeds.

Another object of the present invention is to provide a novel sequencing analysis system (see FIG. 2b) that, based on the adapter structure described above, analyzes data while sequencing is in progress and delivers sequence alignment and analysis results in real time. The system offers real-time, cycle-by-cycle analysis, short analysis time and high accuracy.

The sequencing analysis system of the invention comprises:

1. A sequencing monitoring module: used to monitor the sequencing progress in real time and to trigger analysis tasks.

The sequencing monitoring module scans the sequencing output directory at regular intervals to monitor the sequencing progress. Once sequencing has reached a sufficient length (the shortest length is 22 bp), the monitoring program sends a signal that triggers the subsequent analysis steps. As sequencing continues, the lengthening reads are analyzed in real time, and each new round of analysis can start as soon as the previous one has finished.
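The polling logic of such a monitor can be sketched as below. Everything about the directory layout is an assumption made for illustration: the sketch supposes one folder per cycle named `C<n>.1` inside a lane directory, and treats the presence of a folder as a sign that the cycle has been written, which a real monitor would need to verify. The 22-cycle threshold is taken from the shortest length mentioned in the text.

```python
import re
from pathlib import Path

CYCLE_DIR = re.compile(r"^C(\d+)\.1$")   # assumed naming of per-cycle folders
MIN_READ_LENGTH = 22                      # shortest analyzed length, from the text

def completed_cycles(lane_dir):
    """Highest cycle number for which a cycle folder exists (0 if none)."""
    numbers = [int(m.group(1))
               for p in Path(lane_dir).iterdir()
               if p.is_dir() and (m := CYCLE_DIR.match(p.name))]
    return max(numbers, default=0)

def next_analysis_cycle(cycles_done, last_analyzed):
    """Return the cycle to analyze now, or None if nothing new is ready."""
    if cycles_done >= MIN_READ_LENGTH and cycles_done > last_analyzed:
        return cycles_done
    return None
```

A scheduler would call `completed_cycles` at regular intervals and, whenever `next_analysis_cycle` returns a number and the previous analysis has finished, launch the next round on the reads truncated to that cycle.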

2. A data generation module: used to convert the BCL files generated by sequencing into fastq files and to filter out low-quality sequences;

it also uses an analysis program written specifically for the specially designed adapters to split the sequence data into the corresponding samples.

The data generation module converts the BCL files generated by sequencing into fastq files, performs quality control on the sequencing data, and removes low-quality reads and reads containing adapter sequence, so that only data of reliable quality enter the subsequent analysis. The specially designed adapters, which distinguish samples during sequencing, are also suited to the rapid analysis workflow, in which a dedicated analysis program splits the sequence data into the corresponding samples.
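Splitting reads by internal Index can be illustrated with a short sketch. It assumes exact matching of the Index followed by the junction T at the start of each read. The Index sequences, sample names and function name are hypothetical, and the actual splitting program is not described in the text.

```python
def assign_read(read, index_to_sample):
    """Return (sample, insert) for a read, or (None, None) if unassigned.

    A read is expected to start with an internal Index followed by the
    junction T; the insert is whatever follows.
    """
    # Try longer Indexes first so a short Index cannot shadow a longer one.
    for index in sorted(index_to_sample, key=len, reverse=True):
        if read.startswith(index + "T"):
            return index_to_sample[index], read[len(index) + 1:]
    return None, None

# Hypothetical Indexes of 6, 7 and 8 bp, for illustration only.
index_to_sample = {
    "ACGTAC": "sample_A",
    "CGTACGA": "sample_B",
    "GTACGTCA": "sample_C",
}

print(assign_read("CGTACGATGATTACA", index_to_sample))
print(assign_read("ACGTACTCCGG", index_to_sample))
print(assign_read("TTTTTTTTTTTT", index_to_sample))
```

Because the Index sits at the very start of the read, this assignment can be made as soon as the first few cycles are available, which is what allows analysis to begin before the run ends. A production splitter would presumably also tolerate mismatches, using the minimum Hamming distance of 3 between Indexes.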

3. A data filtering module: used to remove human sequences from the reads that pass quality control.

The data filtering module aligns the quality-controlled reads against a human genome database using fast alignment software and removes the reads that align. The unaligned reads are output as non-human data.

4. A data analysis module: used to align the non-human sequences to a pathogenic microorganism genome database.

The data analysis module aligns the non-human data against the pathogenic microorganism genome database to obtain microbial alignment results. For a read with multiple alignment results, the system selects the reference sequences whose alignment scores fall within the scoring interval [L, U], and takes the lowest common ancestor (LCA) of the taxa to which these reference sequences belong as the final assignment of the read. The scoring interval is determined as follows:

[Formula not reproduced in this text.]

The terms of the formula denote, respectively, the theoretical highest alignment score, the theoretical lowest alignment score, the highest alignment score obtained for the read, and the scoring-interval range parameter, whose default value is 20. When the alignment results are analyzed, the system also records, for each read, information such as whether the species it aligns to is unique and whether the read is completely aligned.
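The LCA step can be sketched as follows. The bounds of the scoring interval [L, U] are passed in as plain inputs, since the sketch does not attempt to reproduce how they are derived. The toy taxonomy, scores and function names are invented for illustration.

```python
def lineage(taxon, parent):
    """Path from a taxon up to the root, inclusive."""
    path = [taxon]
    while parent.get(taxon) is not None:
        taxon = parent[taxon]
        path.append(taxon)
    return path

def lca(taxa, parent):
    """Lowest common ancestor of a non-empty collection of taxa."""
    taxa = list(taxa)
    common = set(lineage(taxa[0], parent))
    for t in taxa[1:]:
        common &= set(lineage(t, parent))
    # Walking up from any taxon, the first shared node is the deepest one.
    return next(node for node in lineage(taxa[0], parent) if node in common)

def classify(hits, lower, upper, parent):
    """hits: list of (taxon, score). Use only hits scoring within [lower, upper]."""
    kept = [taxon for taxon, score in hits if lower <= score <= upper]
    return lca(kept, parent) if kept else None

# Toy taxonomy (child -> parent) and made-up alignment scores.
parent = {
    "root": None,
    "Bacteria": "root",
    "Legionella": "Bacteria",
    "L. pneumophila": "Legionella",
    "L. longbeachae": "Legionella",
    "Citrobacter": "Bacteria",
}
hits = [("L. pneumophila", 40), ("L. longbeachae", 38), ("Citrobacter", 20)]

print(classify(hits, 35, 40, parent))
print(classify(hits, 39, 40, parent))
print(classify(hits, 10, 40, parent))
```

A narrower interval keeps fewer references and yields a more specific assignment, while a wider one pushes the read up to a higher-level taxon.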

5. A report generation module: used to compile statistics on the alignment results and to output an analysis report.

Based on the alignment results, the report generation module counts the number of reads detected for each taxon. For a taxon that has child nodes, it counts the reads assigned directly to the taxon, the reads assigned to the taxon together with all of its child nodes, and the uniquely aligned and completely aligned reads of each taxon.
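The two per-taxon counts (reads on the taxon itself, and reads on the taxon plus all of its child nodes) can be computed in one pass up the taxonomy. The sketch below uses an invented toy taxonomy and read assignments, and leaves out the unique and complete alignment counts for brevity.

```python
from collections import Counter

def taxon_counts(assignments, parent):
    """assignments: one taxon per classified read.

    Returns (direct, cumulative): reads assigned exactly to each taxon, and
    reads assigned to the taxon or any of its descendants.
    """
    direct = Counter(assignments)
    cumulative = Counter()
    for taxon, n in direct.items():
        node = taxon
        while node is not None:
            cumulative[node] += n
            node = parent.get(node)
    return direct, cumulative

# Toy taxonomy (child -> parent) and made-up read assignments.
parent = {
    "root": None,
    "Bacteria": "root",
    "Legionella": "Bacteria",
    "L. pneumophila": "Legionella",
    "Citrobacter": "Bacteria",
}
assignments = ["L. pneumophila"] * 3 + ["Legionella"] * 2 + ["Citrobacter"]

direct, cumulative = taxon_counts(assignments, parent)
for taxon in parent:
    print(taxon, direct[taxon], cumulative[taxon])
```

Reporting both numbers separates reads that could only be resolved to a genus from those assigned to a species beneath it.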

By implementing the above technical solution, the invention has the following advantages over prior-art nucleic acid sequencing:

1. The internal Index is located between the sequencing primer and the insert, so during rapid analysis the Index is read first. Reads from different samples can therefore be separated early in the run, without waiting for sequencing to finish.

2. The Index uses at least two different lengths (preferably three: 6, 7 and 8 bp). Using Indexes of different lengths avoids the situation in which every read yields T in the same cycle, which would reduce sequencing quality.

3. The bases at each position of the Index are required to be evenly distributed.

4. Once the sequencer has produced 22 bp of sequence, the analysis software starts analyzing pathogen information and continues to update the analysis with each cycle, achieving real-time NGS analysis.

5. Previously, results were available only after the instrument had run for at least 11 hours. By combining Index adapters of different lengths with the real-time analysis method, the basic microbial profile of the sample is known about 5 hours after sequencing starts, achieving very rapid next-generation sequencing (NGS) analysis.

Drawings

FIG. 1a is a schematic diagram of a conventional sequencing linker structure and a sequencing process in the prior art;

FIG. 1b is a graph showing the time consumption of each process of a conventional sequencing adapter system according to the prior art;

FIG. 2 is a schematic diagram of the sequencing adapter structure and the sequencing process of the present invention;

FIG. 2b is a schematic flow diagram of a sequencing analysis system using the sequencing adapter of the present invention;

FIG. 3 shows the base ratio at each cycle when sequencing was performed using an internal Index sequence linker of 8bp in length alone;

FIG. 4 shows the base ratios at each cycle number when sequencing was performed using three internal Index sequence linkers of 6bp, 7bp and 8bp length;

FIG. 5 is a detailed sequence structure of the sequencing adapter used in example 1;

FIG. 6 is a comparison of base ratios at each cycle number for sequencing using the sequencing adapters of the invention and conventional Illumina TruSeq adapters of example 1;

FIG. 7 is a graph comparing sequencing quality and final library data volume when sequencing using the sequencing adapters of the present invention and a conventional Illumina TruSeq adapter of example 1;

FIG. 8 shows the number of Legionella pneumophila sequences detected at each analysis cycle in example 2;

FIG. 9 shows the number of Citrobacter koseri sequences detected at each analysis cycle in example 2.

Detailed Description

It should be noted that the following embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that: the technical solutions described in the foregoing embodiments can be modified, or some technical features can be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Example 1

The sequencing adapter of this embodiment has a partially complementary Y-shaped structure. One strand comprises, in order from 5' to 3': an internal Index sequence, the Index1 sequencing primer binding region sequence, the Index1 sequence, and the P7 sequence that binds the chip probe. The other strand comprises, in order from 5' to 3': the P5 sequence that binds the chip probe, the Read1 sequencing primer binding region sequence, the internal Index sequence, and a T base overhang. The internal Index regions of the two strands are fully complementary, and the sequencing primer binding regions are partially complementary. The structure is shown in FIG. 5. Internal Index sequences of three lengths are used: 6 bp, 7 bp and 8 bp.

In this example, 48 internal Index sequences were designed and divided into 16 groups, each group containing internal Index sequences of 6 bp, 7 bp and 8 bp.

The internal Index sequences meet the following requirements: (1) the minimum Hamming distance between any two internal Index sequences is 3; (2) Index sequences containing more than three identical consecutive bases are excluded; (3) the first two bases of the internal Index are not "GG"; (4) the 7th base of a 7 bp Index and the 7th base of an 8 bp Index are not T, and the 8th base of an 8 bp Index is not T; (5) the base ratios at each sequencing position of the Indexes within a combination are manually adjusted to achieve relative balance.

The specific sequence and design are as follows:

153 libraries were constructed with each of the two adapter types, the internal Index adapters described above and the conventional Illumina TruSeq adapters, and were then sequenced in batches. The internal Index adapter libraries were sequenced in 8 runs of about 18-20 libraries each, with adapters of each internal Index length making up about one third of the adapters used in a run; the TruSeq adapter libraries were sequenced in 5 runs of about 30-31 libraries each. The sequencing quality of the two adapter types was compared; the results are shown in FIG. 6 (comparison of base ratios at each cycle) and FIG. 7 (comparison of sequencing quality and final library data volume).

As shown in FIG. 6, the optimized internal Index adapters gave a more balanced base ratio. The proportion of T was only slightly higher at cycle 9 (relative to the TruSeq adapters), and this had no effect on sequencing quality.

As shown in FIG. 7, the optimized internal Index adapters gave a high percentage of qualified clusters and high Q30 scores, and these sequencing quality indicators did not differ significantly from those obtained with TruSeq adapters. When data from the optimized internal Index adapters are split, either the internal Index alone or the internal Index together with Index1 (dual-Index splitting) can be used, and the final amount of library data does not differ noticeably from that obtained with TruSeq.

Example 2

To evaluate the analytical performance of the system, two clinically positive samples were analyzed with the sequencing analysis system of the present invention. The clinical result for sample 1 was Legionella pneumophila infection, and the clinical result for sample 2 was Citrobacter koseri infection. The analysis times and test results for the two samples are shown in Table 1 below. The number of Legionella pneumophila sequences detected at each analysis cycle is shown in FIG. 8, and the number of Citrobacter koseri sequences detected at each analysis cycle is shown in FIG. 9.

TABLE 1 clinical sample assay time statistics

The results show that in the first rapid-analysis report, at a sequencing read length of 22 bp, the system already detected the positive pathogens with good sensitivity; as sequencing progressed, the number of detected pathogen sequences increased slowly and stabilized after several cycles. For samples positive for pathogen infection, the system can therefore detect the pathogen at a very early stage and give reliable analysis results.

Claims

1. A sequencing adapter, characterized in that it has a partially complementary Y-shaped structure, wherein one strand comprises, in order from 5' to 3': an internal Index sequence, an Index1 sequencing primer binding region sequence, an Index1 sequence, and a P7 sequence that binds a chip probe; and the other strand comprises, in order from 5' to 3': a P5 sequence that binds the chip probe, a Read1 sequencing primer binding region sequence, the internal Index sequence, and a T base overhang.

2. The sequencing adapter according to claim 1, wherein, in use, adapters with internal Index sequences of different lengths are combined to label and sequence multiple samples.

3. The sequencing adapter according to claim 2, wherein the length difference between adjacent longer and shorter internal Index sequences is one base.

4. The sequencing adapter according to claim 2, wherein, in use, adapters with internal Index sequences of two to four different lengths are combined to label and sequence multiple samples.

5. The sequencing adapter according to claim 1, wherein, when all internal Index sequences used are combined, the base ratios in each sequencing cycle of the internal Index sequences are substantially balanced.

6. The sequencing adapter according to claim 5, wherein, when the number of internal Indexes used in one sequencing run is greater than or equal to 4, the proportion of each of the four bases A, T, C and G in each sequencing cycle of the internal Index sequence is suitably controlled at 8% to 50%.

7. The sequencing adapter according to claim 6, wherein, when the number of internal Indexes used in one sequencing run is greater than or equal to 4, the proportion of each of the four bases A, T, C and G in each sequencing cycle of the internal Index sequence is optimally controlled at 12.5% to 37.5%.

8. The sequencing adapter according to claim 1, wherein all internal Index sequences used satisfy: (1) the minimum Hamming distance between any two internal Index sequences is 3; (2) Index sequences containing three or more identical consecutive bases are excluded; (3) the first two bases of the internal Index sequence are not "GG".

9. The sequencing adapter according to claim 1, wherein an Index2 sequence may be added between the P5 sequence that binds the chip probe and the Read1 sequencing primer binding region sequence.

10. A sequencing analysis system based on the sequencing adapter described above, characterized in that it comprises:

a sequencing monitoring module, used to monitor the sequencing progress in real time and to trigger analysis tasks;

a data generation module, used to convert the BCL files generated by sequencing into fastq files and to filter out low-quality sequences, and also to split the sequence data into the corresponding samples using an analysis program written specifically for the specially designed adapters;

a data filtering module, used to remove human sequences from the sequences that pass quality control;

a data analysis module, used to align the non-human sequences to a pathogenic microorganism genome database; and

a report generation module, used to compile statistics on the alignment results and to output an analysis report.