CN115717163A

CN115717163A - Molecular coding detection system for monitoring and correcting sequencing pollution and application thereof

Info

Publication number: CN115717163A
Application number: CN202211328995.5A
Authority: CN
Inventors: 庞震国; 李丽莎; 朱振刚; 王霞; 刘萍萍; 汤郡; 张亚飞
Original assignee: Meijie Transformation Medical Research Suzhou Co ltd
Current assignee: Meijie Transformation Medical Research Suzhou Co ltd
Priority date: 2022-10-27
Filing date: 2022-10-27
Publication date: 2023-02-28
Anticipated expiration: 2042-10-27
Also published as: CN115717163B

Abstract

The invention discloses a molecular coding detection system for monitoring and correcting sequencing contamination and an application thereof. The molecular coding detection system comprises at least one insertion coding nucleic acid sequence, wherein the insertion coding nucleic acid sequence comprises a skeleton sequence region with a known sequence and at least one variable coding region; the variable coding region is a random sequence consisting of any one or at least two of A, T, C or G; the variable coding regions are randomly distributed in the skeleton sequence region; and the insertion coding nucleic acid sequence is single-stranded or double-stranded. The invention designs an insertion coding nucleic acid sequence with a specific structure, labels the sample to be tested with it, and analyzes the raw high-throughput sequencing data, so that cross-contamination among samples within a short-term batch and historical environmental contamination arising from long-term batch testing can be identified quickly and effectively.

Description

Molecular coding detection system for monitoring and correcting sequencing pollution and application thereof

Technical Field

The invention belongs to the technical field of gene sequencing, and relates to a molecular coding detection system for monitoring and correcting sequencing pollution and application thereof.

Background

Next-generation sequencing (NGS) has become a key technology for modern biological research and medical diagnosis owing to its very high data throughput and sample capacity, ultra-high sensitivity, ability to detect multiple analytes simultaneously, and low per-sample cost. NGS-based diagnostic products are increasingly being approved by medical regulatory authorities and have been commercialized, standardized and industrialized. However, the long and complex workflow, batch library construction and centralized testing that characterize NGS also make samples prone to contamination, which poses a hidden risk for industrial-scale diagnostics.

Contamination in NGS testing generally comes from three sources: (1) sample-processing contamination, including sample information errors and cross-contamination during sample collection and nucleic acid extraction; (2) contamination during the testing workflow, typically reagent contamination such as adapter index contamination during the complex library construction process, or carry-over or cross-contamination among library construction intermediates, which is especially common when a large number of samples in the same batch are processed in parallel; (3) environmental contamination, caused by high concentrations of contaminating molecules in aerosols in the testing environment.

The existing approach to pooled sequencing on a single run is to label libraries with molecular tags: each library is built independently with an adapter or primer carrying an additional library identification sequence, and the sample data are separated after sequencing by tracing the tag information in the data. Any contamination introduced up to and including pooling is carried into sequencing; it cannot be identified or pre-processed by post-run data quality control, and whether contamination occurred during the handling of a given sample can only be determined after the results have been analyzed. Cross-contamination of the sample tag reagents themselves during library construction can even create artificial false contamination, that is, data contamination. Existing methods for identifying and monitoring sample contamination rely mainly on passive analysis, such as checking the patient's sex or the consistency and impurity of inherited SNPs between a reference sample and the test sample. Whether a sample is contaminated is known only after the analysis is complete, and the source of contamination cannot be traced. These methods also cannot be applied when there is no control sample or when a small targeted sequencing panel is used. Industrial-scale NGS testing therefore needs a new system to solve these sample contamination problems.

In summary, providing a method for monitoring, identifying and correcting contamination of high-throughput sequencing samples is of great significance to the field of gene sequencing.

Disclosure of Invention

Aiming at the defects and actual requirements of the prior art, the invention provides a molecular coding detection system for monitoring and correcting sequencing pollution and application thereof.

In order to achieve the purpose, the invention adopts the following technical scheme:

In a first aspect, the present invention provides a molecular coding detection system for monitoring and correcting sequencing contamination. The molecular coding detection system comprises at least one insertion coding nucleic acid sequence; the insertion coding nucleic acid sequence comprises a skeleton sequence region with a known sequence and at least one variable coding region; the variable coding region is a random sequence composed of any one or at least two of A, T, C or G; the variable coding regions are randomly distributed in the skeleton sequence region; and the insertion coding nucleic acid sequence is single-stranded or double-stranded.

In the invention, an insertion coding nucleic acid sequence with a specific structure is designed. One part is a fixed, known reference skeleton sequence, which is used for sequence mapping during information recovery; the other part is the variable coding region, which encodes sample-specific information for contamination identification. The sample to be tested is labeled with the insertion coding nucleic acid sequence and the raw high-throughput sequencing data are analyzed, so that cross-contamination among samples within a short-term batch and historical environmental contamination arising from long-term batch testing can be identified rapidly and effectively. The insertion coding nucleic acid sequences can also serve as a set of standard NGS reagents for quality assessment of a testing laboratory and for cleaning, correcting and remedying contaminated results without retesting the sample.

In the invention, a known sequence is selected as the skeleton sequence region such that it has no homology with the sample to be tested.

Preferably, the length of the insertion-encoding nucleic acid sequence is 100-2000 bp, including but not limited to 101bp, 102bp, 103bp, 104bp, 105bp, 120bp, 200bp, 220bp, 240bp, 260bp, 280bp, 300bp, 500bp, 800bp, 1000bp, 1200bp, 1300bp, 1400bp, 1600bp, 1700bp, 1800bp, 1900bp, 1950bp, 1980bp, 1990bp, 1995bp, 1998bp or 1999bp, preferably 200-300 bp.

Preferably, the length of the variable coding region is 1-20 bp, including but not limited to 2bp, 3bp, 4bp, 5bp, 6bp, 7bp, 8bp, 10bp, 12bp, 15bp, 16bp, 17bp, 18bp or 19bp, and the number is 1-4.

Preferably, depending on the variable coding region, the insertion coding nucleic acid sequences are classified as insertion coding nucleic acid sequences for identifying inter-batch contamination or insertion coding nucleic acid sequences for identifying intra-batch contamination.

Preferably, the length of the variable coding region in the insertion coding nucleic acid sequence for identifying inter-batch contamination is different from the length of the variable coding region in the insertion coding nucleic acid sequence for identifying intra-batch contamination.

In the present invention, the length of the insertion coding nucleic acid sequence for identifying intra-batch contamination can be designed as required. It may be 100 to 2000 bases, preferably 200 to 300 bases, and more preferably 240 bases. The total length of the variable coding regions is generally 1 to 4 bases, distributed over 1 to 4 positions; preferably, each coding region is 1 base long and the regions are distributed over 4 positions of the nucleic acid sequence.

In the present invention, the variable coding region of the insertion coding nucleic acid sequence for identifying inter-batch contamination may be 1 to 20 bases long, preferably 5 bases. Preferably, the coding region and coding mode for identifying inter-batch contamination differ from those for identifying intra-batch sample contamination; for example, the variable coding region of the inter-batch identification sequence is a contiguous run of bases. More preferably, to prevent signal noise or information loss caused by sequencing errors or uneven sequencing depth, the inter-batch identification sequence carries two independent contiguous regions with the same code, which adds a filtering condition and improves the reliability of the extracted coding information.

Preferably, the length of the variable coding region in the insertion coding nucleic acid sequence for identifying batch-to-batch pollution is 1bp, and the number of the variable coding regions is 4.

Preferably, the length of the variable coding region in the insertion coding nucleic acid sequence for identifying the pollution in the batch is 5bp, and the number of the variable coding regions is 2.

Preferably, the inserted coding nucleic acid sequence for identifying batch-to-batch contamination comprises the sequence shown in SEQ ID NO. 1.

Preferably, the insertion coding nucleic acid sequence for identifying intra-batch contamination comprises the sequence shown in SEQ ID NO. 2.

SEQ ID NO.1：

CTAAATCGGGGGCTCCCTTTAGGGTTCCGATTTAGTGCTTTACGGCACCTCGACTCCNNNNACTTGATTAGGGTGATGGTTCACGTAGTGGGCCATCGCCCTGATAGACGGTTTTTCGCCCTTTGACGTTGGAGTCCACGTTCTTTAATAGTGGACTCTTGTTCCAAACTGGAACAACNNNNAACCCTATCTCGGTCTATTCTTTTGATTTATAAGGGATTTTGCCGATTTCGGCCTATT。

SEQ ID NO.2：

CGTGGCTGGCCACGACGGGCGTTCCTTGCGCAGCTGTGCTCGACGTTGNCACTGAAGCGGGAAGGGACTGGCTGCTATTGGGCGAAGTGCCGGGGCANGATCTCCTGTCATCCCACCTTGCTCCTGCCGAGAAAGTATCCATCATGNCTGATGCAATGCGGCGGCTGCATACGCTTGATCCGGCTACCTGCCCATNCGACCACCAAGCGAAACATCGCATCGAGCGAGCACGTACTCGGA。

Wherein N is any one of A, T, C and G.
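The paired-region filter described above for the inter-batch code (two independent regions carrying the same code) can be sketched as follows. This is only a minimal illustration under stated assumptions, not the pipeline of the invention: it assumes a read (or merged read pair) that spans both NNNN regions of SEQ ID NO.1, it uses exact matching of short flanking bases copied from SEQ ID NO.1, and the names `REGION_1`, `REGION_2` and `extract_paired_code` are hypothetical.

```python
import re

# Flanking skeleton bases copied from SEQ ID NO.1 on either side of each NNNN
# region. Exact flank matching is an assumption made for this sketch.
REGION_1 = re.compile(r"CGACTCC([ACGT]{4})ACTTGATT")
REGION_2 = re.compile(r"GGAACAAC([ACGT]{4})AACCCTAT")


def extract_paired_code(read):
    """Return the 4-base code if both regions are found and agree, else None."""
    first = REGION_1.search(read)
    second = REGION_2.search(read)
    if first is None or second is None:
        return None  # the read does not span both variable coding regions
    if first.group(1) != second.group(1):
        return None  # the two copies disagree, e.g. after a sequencing error
    return first.group(1)


# Toy reads built only from the flanks; real reads carry the full skeleton.
good_read = "CGACTCC" + "AGGT" + "ACTTGATT" + "GGAACAAC" + "AGGT" + "AACCCTAT"
bad_read = "CGACTCC" + "AGGT" + "ACTTGATT" + "GGAACAAC" + "AGGA" + "AACCCTAT"
print(extract_paired_code(good_read), extract_paired_code(bad_read))
```

Requiring both copies to agree discards reads in which one copy was altered by a sequencing error, at the cost of losing reads that cover only one region.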

Preferably, the molecular coding detection system further comprises a coding information recovery system.

Preferably, the encoded information recovery system comprises a probe or primer complementary to the inserted encoding nucleic acid sequence.

Depending on the application scenario, the coding information recovery system can be implemented in different ways, such as liquid-phase hybridization capture or amplicon primer amplification. In some embodiments, recovery probes specific to the insertion coding nucleic acid sequence are added to the hybridization capture probe library. The probes consist of bases matching the insertion coding nucleic acid sequence and, in the variable coding region, preferably of degenerate complementary sequences. The probes may be 50 to 200 bases long, preferably 120 bases, and the number of probes may be any number from 1 to 1000. The working concentration of the recovery probes may be between 0.1 nM and 10 nM. The recovery probes carry one or more biotin labels to facilitate recovery.

Specifically, FIG. 1 is a working schematic of probe-enrichment library construction with a fragmentation step, together with the insertion coding nucleic acid sequence and its recovery system: genomic DNA and the insertion coding nucleic acid sequence 101 are fragmented by ultrasonication into fragments 102 of about 150-200 bp; after end repair, library adapters are ligated to both ends to form the pre-amplification library 103; during liquid-phase hybridization capture, the insertion coding nucleic acid sequence fragments and the genomic fragments containing the target sequence bind to the coding nucleic acid sequence recovery probes and the gene-specific probes 104, respectively, completing the capture. FIG. 2 is the corresponding working schematic for probe-enrichment library construction without a fragmentation step; fragmentation is not needed for some sequencing substrate types such as ctDNA. The substrate DNA, part of which contains the target sequence, is mixed with the coding nucleic acid sequence 201 and ligated to adapters to form the pre-amplification library 202; during liquid-phase hybridization capture, the insertion coding nucleic acid sequence fragments and the genomic fragments containing the target sequence bind to the coding nucleic acid sequence recovery probes and the gene-specific probes 203, respectively, completing the capture.

In other embodiments, the recovery system consists of primers matching 10-30 bases at the 5' end and the 3' end of the insertion coding nucleic acid sequence, preferably 18-25 bases in length; the working concentration of the recovery primers may be 0.1-10 μM. Specifically, FIG. 3 is a working schematic of amplicon-enrichment library construction together with the insertion coding nucleic acid sequence and its recovery system: genomic DNA containing the target sequence is mixed with the coding nucleic acid sequence 301; in the first round of PCR, the target-gene-specific primer pairs, modified at the 5' end with a universal sequence and a sequencing primer sequence, and the insertion coding nucleic acid sequence recovery primer pair bind to the genomic fragments and the coding nucleic acid sequence 302, respectively; in the second round of PCR, amplification primers that carry the P5 or P7 sequence and an index sequence at the 5' end and match the universal sequence at the 5' end of the first-round primers complete the library construction 303.

Preferably, the probes may have a length of 50 to 200bp and a number of 1 to 1000.

Preferably, the length of the primer is 18 to 25bp.

Preferably, the nucleic acid sequence of the probe for identifying an intervening coding nucleic acid sequence of batch-to-batch contamination is selected from the sequences shown in SEQ ID NO.3 and/or SEQ ID NO. 4.

SEQ ID NO.3：

CTAAATCGGGGGCTCCCTTTAGGGTTCCGATTTAGTGCTTTACGGCACCTCGACTCCNNNNACTTGATTAGGGTGATGGTTCACGTAGTGGGCCATCGCCCTGATAGACGGTTTTTCGCC-biotin。

SEQ ID NO.4：

CTTTGACGTTGGAGTCCACGTTCTTTAATAGTGGACTCTTGTTCCAAACTGGAACAACNNNNAACCCTATCTCGGTCTATTCTTTTGATTTATAAGGGATTTTGCCGATTTCGGCCTATT-biotin。

Preferably, the nucleic acid sequence of the probe for identifying an intervening coding nucleic acid sequence of contamination within a lot is selected from the sequences shown in SEQ ID No.5 and/or SEQ ID No. 6.

SEQ ID NO.5：

CGTGGCTGGCCACGACGGGCGTTCCTTGCGCAGCTGTGCTCGACGTTGNCACTGAAGCGGGAAGGGACTGGCTGCTATTGGGCGAAGTGCCGGGGCANGATCTCCTGTCATCCCACCTTG-biotin。

SEQ ID NO.6：

CTCCTGCCGAGAAAGTATCCATCATGNCTGATGCAATGCGGCGGCTGCATACGCTTGATCCGGCTACCTGCCCATNCGACCACCAAGCGAAACATCGCATCGAGCGAGCACGTACTCGGA-biotin。

Preferably, the nucleic acid sequence of the primer for identifying the intervening encoding nucleic acid sequence of batch-to-batch contamination comprises the sequences shown in SEQ ID No.7 and SEQ ID No. 8.

SEQ ID NO.7：

TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCTAAATCGGGGGCTCCCTTTAGG。

SEQ ID NO.8：

GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGAATAGGCCGAAATCGGCAAAATCCCT。

Preferably, the nucleic acid sequence of the primer for identifying the contaminating intervening coding nucleic acid sequence within the batch comprises the sequences shown in SEQ ID NO.9 and SEQ ID NO. 10.

SEQ ID NO.9：

TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCGTGGCTGGCCACGACGGGCGTTCCTT。

SEQ ID NO.10：

GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGTCCGAGTACGTGCTCGCTCGATGCGA。
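The structure of the recovery primers (a 5' fusion sequence followed by a segment matching one end of the insertion coding nucleic acid sequence) can be checked with a short script. The sketch below is illustrative only: the end segments of SEQ ID NO.1 and the primers SEQ ID NO.7 and SEQ ID NO.8 are copied from the listings above, while the split points (33 and 34 bases of fusion sequence) and the helper name `reverse_complement` are assumptions of this example.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (N stays N)."""
    complement = {"A": "T", "T": "A", "C": "G", "G": "C", "N": "N"}
    return "".join(complement[base] for base in reversed(seq))


# First 23 and last 26 bases of SEQ ID NO.1, copied from the listing above.
SEQ1_5P_START = "CTAAATCGGGGGCTCCCTTTAGG"
SEQ1_3P_END = "AGGGATTTTGCCGATTTCGGCCTATT"

SEQ7 = "TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCTAAATCGGGGGCTCCCTTTAGG"
SEQ8 = "GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGAATAGGCCGAAATCGGCAAAATCCCT"

# Assumed split: the first 33 (forward) or 34 (reverse) bases are the fusion
# sequence; the remainder is the template-matching part of the primer.
forward_target = SEQ7[33:]
reverse_target = SEQ8[34:]

# The forward primer copies the 5' end of the template; the reverse primer is
# the reverse complement of the 3' end.
print(forward_target == SEQ1_5P_START)
print(reverse_target == reverse_complement(SEQ1_3P_END))
```

The same check can be repeated for SEQ ID NO.9 and SEQ ID NO.10 against the two ends of SEQ ID NO.2.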

In the present invention, the molecular coding detection system for monitoring and correcting sequencing contamination can be applied throughout the entire high-throughput sequencing workflow, for example starting from sample nucleic acid extraction. To this end, the insertion coding nucleic acid sequence can be pre-loaded into the nucleic acid container in an amount of 1% to 1000%, preferably 10% to 100%, of the number of sample molecules.
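As a rough worked example of this dosing range, the number of coding molecules can be estimated from the DNA input. The sketch below is an illustration under assumptions that are not stated in the source: "sample molecules" is taken to mean haploid human genome copies, one haploid genome is taken to weigh about 3.3 pg, and the function names are hypothetical.

```python
# Assumed approximate mass of one haploid human genome, in picograms.
HAPLOID_GENOME_PG = 3.3


def haploid_copies(dna_ng):
    """Estimate haploid genome copies in a given mass of human DNA (ng)."""
    return dna_ng * 1000.0 / HAPLOID_GENOME_PG


def spike_in_range(dna_ng, low_fraction=0.10, high_fraction=1.00):
    """Return (low, high) coding-molecule counts for the preferred 10%-100% range."""
    copies = haploid_copies(dna_ng)
    return copies * low_fraction, copies * high_fraction


low, high = spike_in_range(200.0)  # 200 ng is the input used in Example 2
print(f"{low:.0f} to {high:.0f} coding molecules")
```

Under these assumptions, the 10% to 100% range for a 200 ng input spans roughly one order of magnitude around a few tens of thousands of copies.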

In a second aspect, the present invention provides the use of the molecular coding detection system of the first aspect for monitoring and correcting sequencing contamination in the preparation of a genetic sequencing product.

The molecular coding detection system for monitoring and correcting sequencing pollution designed by the invention can be effectively applied to preparing sequencing products and used as a component for monitoring and correcting sequencing pollution.

In a third aspect, the present invention provides a sequencing kit comprising the molecular coding detection system of the first aspect for monitoring and correcting sequencing contamination.

In a fourth aspect, the present invention provides the use of the molecular coding detection system of the first aspect for monitoring and correcting sequencing contamination in gene sequencing.

In a fifth aspect, the present invention provides a method of monitoring and correcting sequencing contamination, the method comprising:

mixing the molecular coding detection system for monitoring and correcting sequencing pollution and a sample to be detected, constructing and purifying a library, sequencing the purified library, and performing data analysis and pollution correction according to a sequencing result.

The criterion for calling contamination is as follows: non-sample-unique coding sequences appear in all variable coding regions of the insertion coding nucleic acid sequence, and the number of reads of the suspected contamination sequence exceeds 3.

The contamination correction includes: tracing back the sample to which the contamination points, and removing false-positive mutations of the sample by comparison.

In the present invention, the flow chart of the method for monitoring and correcting sequencing contamination is shown in FIG. 4. The analysis process includes mapping the sequencing data to the reference genome and the reference coding nucleic acid sequences, and filtering the recovered insertion coding nucleic acid sequences and counting their effective depth. The insertion coding nucleic acid sequence data are filtered and contamination is identified in the following steps: read mapping, extraction of the inter-batch and intra-batch variable coding region sequences, removal of duplicate sequences, and contamination identification. The contamination identification conditions are as follows:

(1) Inter-batch contamination: non-sample-unique coding sequences appear in the variable coding regions of the inter-batch insertion coding nucleic acid sequence, the suspected contamination code is present in all the variable coding regions simultaneously, and the number of reads of the suspected contamination sequence exceeds 3;

(2) Intra-batch contamination: (a) non-sample-unique coding sequences appear simultaneously in all variable coding regions of the intra-batch insertion coding nucleic acid sequence, and the number of reads of the suspected contamination coding sequence exceeds 3 in each coding region; and (b) the suspected contamination coding sequence can be traced to a sample within the batch.

The effective depth statistic (contamination index) is the ratio of the effective depth of the target sample's unique code to the total effective depth of all recovered codes.
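The identification conditions and the contamination index above can be sketched in a few lines. This is a simplified illustration, not the analysis software of the invention: it assumes each de-duplicated read has already been reduced to one complete code string covering all variable coding regions, it treats a foreign code as "present in all regions simultaneously" when it is observed as a complete code, and the function and parameter names are hypothetical.

```python
from collections import Counter


def call_contamination(codes, expected_code, batch_codes=None, min_reads=3):
    """Return (suspect codes with read counts, contamination index).

    codes:         one complete code per de-duplicated read
    expected_code: the unique code assigned to this sample
    batch_codes:   optional set of codes used in the same batch; when given,
                   a suspect must be traceable to it (intra-batch condition b)
    min_reads:     a suspect must be supported by MORE than this many reads
    """
    counts = Counter(codes)
    suspects = {}
    for code, n_reads in counts.items():
        if code == expected_code or n_reads <= min_reads:
            continue
        if batch_codes is not None and code not in batch_codes:
            continue
        suspects[code] = n_reads
    total = sum(counts.values())
    index = counts[expected_code] / total if total else 0.0
    return suspects, index


# Toy data: 95 reads of the sample's own code, 4 of one foreign code, 1 of another.
reads = ["AGGT"] * 95 + ["ATAT"] * 4 + ["CCCC"] * 1
suspects, index = call_contamination(reads, expected_code="AGGT")
print(suspects, index)
```

In this toy input only the code seen 4 times passes the read threshold; the code seen once is treated as noise. Passing `batch_codes` adds the traceability requirement used for intra-batch calls.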

Compared with the prior art, the invention has the following beneficial effects:

In the invention, an insertion coding nucleic acid sequence with a specific structure is designed, the sample to be tested is labeled with the insertion coding nucleic acid sequence, and the raw high-throughput sequencing data are analyzed, so that cross-contamination among samples within a short-term batch and historical environmental contamination arising from long-term batch testing can be identified rapidly and effectively. The insertion coding nucleic acid sequences can also serve as a set of standard NGS reagents for quality assessment of a testing laboratory and for cleaning, correcting and remedying contaminated results without retesting the sample.

Drawings

FIG. 1 is a working schematic of probe-enrichment library construction with a fragmentation step, together with the insertion coding nucleic acid sequence and its recovery system;

FIG. 2 is a working schematic of probe-enrichment library construction without a fragmentation step, together with the insertion coding nucleic acid sequence and its recovery system;

FIG. 3 is a working schematic of amplicon-enrichment library construction, together with the insertion coding nucleic acid sequence and its recovery system;

FIG. 4 is a flow chart of a method for monitoring and correcting sequencing contamination;

FIG. 5 is a graph showing the result of performance verification of the molecular coding assay system for monitoring and correcting sequencing contamination according to the present invention.

Detailed Description

To further illustrate the technical means adopted by the present invention and the effects thereof, the present invention is further described below with reference to the embodiments and the accompanying drawings. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and that no limitation of the invention is intended.

Where specific techniques or conditions are not indicated in the examples, they follow the techniques or conditions described in the literature in the field or the product specifications. Reagents or instruments for which no manufacturer is indicated are conventional products available through regular commercial channels.

Example 1

This example designs the molecular coding detection system for monitoring and correcting sequencing contamination.

1. Design of an inserted coding nucleic acid sequence

The insertion coding nucleic acid sequence is double-stranded DNA, and the skeleton sequence is designed from an exogenous artificial sequence. The insertion coding nucleic acid sequence for identifying inter-batch contamination is shown in SEQ ID NO.1. It consists of 240 bases and shows no homology with the human genome in a BLAST search. Bases 58 to 61 and bases 179 to 182 are the variable coding regions, each written as NNNN, where each N is any one of A, T, C and G, and the sequences of bases 58 to 61 and bases 179 to 182 are identical. 256 different insertion coding nucleic acid sequences are designed from the different combinations of N. The capture recovery probes are single-stranded DNA and consist of two 120-base sequences matching bases 1 to 120 and bases 121 to 240 of SEQ ID NO.1, respectively; the positions matching the variable coding regions are degenerate, and each probe carries a biotin label on its 3' terminal base. The specific sequences are shown in SEQ ID NO.3 and SEQ ID NO.4. The amplicon recovery primers for the coding nucleic acid sequence consist of a forward primer and a reverse primer corresponding to the 23 bases at the 5' end and the 26 bases at the 3' end of SEQ ID NO.1, respectively, and the 5' end of each primer carries a fusion sequence matching the library construction primers. The specific sequences are shown in SEQ ID NO.7 and SEQ ID NO.8.

The insertion coding nucleic acid sequence for identifying intra-batch contamination is shown in SEQ ID NO.2. It consists of 240 bases and shows no homology with the human genome in a BLAST search. Bases 49, 98, 147 and 196 are the variable coding regions N: base 49 is A, T, C or G; base 98 is T or C; base 147 is A, C or G; and base 196 is A, T, C or G. 96 different insertion coding nucleic acid sequences are designed from the different combinations of N. The capture recovery probes are single-stranded DNA and consist of two 120-base sequences matching bases 1 to 120 and bases 121 to 240 of SEQ ID NO.2, respectively; the variable coding regions are degenerate bases, and the 3' end carries a biotin label. The specific sequences are shown in SEQ ID NO.5 and SEQ ID NO.6. The amplicon recovery primers consist of a forward primer and a reverse primer corresponding to the 27 bases at the 5' end and the 26 bases at the 3' end of SEQ ID NO.2, respectively, and the 5' end of each primer carries a fusion sequence matching the library construction primers. The specific sequences are shown in SEQ ID NO.9 and SEQ ID NO.10.
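The code counts given above (256 inter-batch codes and 96 intra-batch codes) follow directly from the bases allowed at each variable position. A minimal sketch that enumerates them is shown below; the constant and function names are hypothetical, and the per-position alphabets are taken from the two design paragraphs above.

```python
from itertools import product

# Inter-batch code (SEQ ID NO.1): four positions, each any of A, C, G, T.
INTER_BATCH_ALPHABETS = ["ACGT"] * 4

# Intra-batch code (SEQ ID NO.2): allowed bases at positions 49, 98, 147, 196.
INTRA_BATCH_ALPHABETS = ["ATCG", "TC", "ACG", "ATCG"]


def enumerate_codes(alphabets):
    """List every code that can be formed from the per-position alphabets."""
    return ["".join(bases) for bases in product(*alphabets)]


inter_codes = enumerate_codes(INTER_BATCH_ALPHABETS)
intra_codes = enumerate_codes(INTRA_BATCH_ALPHABETS)
print(len(inter_codes), len(intra_codes))
```

The counts are 4 x 4 x 4 x 4 = 256 and 4 x 2 x 3 x 4 = 96, so one 96-sample batch can give every sample its own intra-batch code.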

2. Chemical synthesis

The 256 insertion coding nucleic acid sequences designed to identify inter-batch contamination and the 96 insertion coding nucleic acid sequences designed to identify intra-batch contamination, together with their matched capture recovery probes and recovery primers, were synthesized by Integrated DNA Technologies and supplied as dry powders.

Stock solution preparation and quantification: the insertion-type double-stranded nucleic acid sequences and their matched recovery probes and primers are dissolved in ultrapure water according to the instructions for the synthesized products to prepare 100 μM stock solutions. The inter-batch and intra-batch insertion-type double-stranded nucleic acids are then serially diluted to the production concentration of the prefabricated solution, 30000 copies/μL; 2 μL of the prefabricated solution is added to the bottom of a 1.5 mL EP tube, and the prefabricated tube is stored in a -80 ℃ freezer.
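The scale of this dilution can be worked out from the stated concentrations. The sketch below is a plain unit conversion, not a protocol from the source: it converts the 100 μM stock to copies per microliter using the Avogadro constant, then derives the overall dilution factor and the copies loaded per tube; the function name is hypothetical.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def copies_per_microliter(micromolar):
    """Convert a concentration in micromolar to molecules per microliter."""
    moles_per_liter = micromolar * 1e-6
    return moles_per_liter * AVOGADRO / 1e6  # 1 L = 1e6 microliters


stock = copies_per_microliter(100.0)      # 100 uM stock solution
working = 30000.0                         # copies per microliter after dilution
dilution_factor = stock / working         # overall fold dilution required
copies_per_tube = working * 2.0           # 2 uL of working solution per tube

print(f"stock: {stock:.3e} copies/uL")
print(f"overall dilution: {dilution_factor:.3e}-fold")
print(f"copies per prefabricated tube: {copies_per_tube:.0f}")
```

The overall dilution is on the order of a billion-fold, which is why it is carried out as a serial dilution rather than in one step.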

Example 2

This example tests capture-enrichment library construction with a fragmentation step (targeted panel). The working principle is shown in FIG. 1, and the test comprises the following steps:

(1) Sample DNA extraction: the tested sample is an FFPE sample, and the DNA extraction kit is the QIAamp DNA FFPE Tissue Kit. After extraction, 200 ng of DNA is quantified and added to a prefabricated tube. In this prefabricated tube, the variable coding region of the insertion coding nucleic acid sequence for identifying inter-batch contamination is AGGT, and the variable coding regions of the insertion coding nucleic acid sequence for identifying intra-batch contamination are A, T, C and C in 5' to 3' order. After the extracted DNA is added to the prefabricated tube, the tube is gently vortexed for 30 s;

(2) Library construction: the library construction reagents used in this example were purchased from NEB, and the probe hybridization reagents from IDT;

a. Nucleic acid fragmentation: the DNA is made up to 50 μL with 1× TE buffer and fragmented with a Covaris M220 ultrasonicator according to the parameters in Table 1;

TABLE 1

Duty cycle	10%
Peak power	75
Cycles per burst	200
Fragmentation duration	100-330 s
Water bath temperature	18-20 ℃

b. Post-fragmentation repair: the reaction mixture is prepared according to Table 2 and incubated at 20 ℃ for 15 min;

TABLE 2

Fragmented FFPE DNA	50 μL
FFPE DNA buffer	6.5 μL
NEBNext FFPE DNA Repair Mix	2 μL
Ultra-pure water	3.5 μL
Total	62 μL

c. Magnetic bead purification and end repair: nucleic acids are purified with AMPure XP Beads, and end repair is performed with the NEBNext Ultra II End Prep kit; the end-repair reaction system and the PCR program are shown in Tables 3 and 4;

TABLE 3

FFPE DNA	50 μL
NEBNext Ultra Ⅱ End Prep Buffer	7 μL
NEBNext Ultra Ⅱ End Prep enzyme mix	3 μL
Total	60 μL

TABLE 4

Step	Temperature	Time
Cycle 1	20 ℃	30 min
Cycle 2	65 ℃	30 min
Cycle 3	4 ℃	Hold

d. Adapter ligation: the reaction is set up according to the reaction system in Table 5 and incubated at 20 ℃ for 15 min;

TABLE 5

DNA Repair Reaction Mixture	60 μL
NEBNext Ultra Ⅱ Ligation Master Mix	30 μL
NEBNext Ligation Enhancer	1 μL
Duplex Seq Adapters	2 μL
Total	93 μL

e. Library fragment size selection and pre-amplification: fragments are size-selected with AMPure XP beads, and the adapter-ligated library is pre-amplified according to the reaction system in Table 6 and the reaction conditions in Table 7;

TABLE 6

NEBNext Ultra Ⅱ Q5 Master Mix	25 μL
UDI Primer Mix	5 μL
Total	30 μL

TABLE 7

f. Hybrid capture: the target sequences and the coding nucleic acid sequences are captured according to the reaction system in Table 8 and the reaction conditions in Table 9;

TABLE 8

2X Hybridization Buffer	8.5 μL
Hybridization Buffer Enhancer	2.7 μL
Targeted gene panel	4 μL
Inter-batch and intra-batch coding nucleic acid recovery probes	1.8 μL
Total	17 μL

TABLE 9

Step	Temperature	Time
Cycle 1	95 ℃	30 s
Cycle 2	65 ℃	4 h
Cycle 3	65 ℃	Hold

g. Streptavidin magnetic bead recovery, and amplification and purification of the capture library: the hybrid-captured sequences are recovered and washed according to the Dynabeads M-270 instructions, and the capture library is amplified according to the reaction system in Table 10 and the reaction conditions in Table 11;

TABLE 10

Library PCR Master Mix (2×)	25 μL
Illumina P5/P7 Primer Mix (10×)	5 μL
Dynabeads	20 μL
Total	50 μL

TABLE 11

h. Library purification and quantification

After amplification, the library was purified with AMPure XP beads; after purification, the library was quantified with Qubit 3.0.

(3) Sequencing

Sequencing is performed on a NovaSeq 6000 high-throughput sequencer with PE150 read length at a sequencing depth of 10000×. The data are mapped to the reference genome and the reference coding nucleic acid sequences, and the recovered insertion coding nucleic acid sequences are filtered and their effective depth is counted. The insertion coding nucleic acid data are filtered and contamination is identified in the following steps: read mapping, extraction of the inter-batch and intra-batch variable coding region sequences, removal of duplicate sequences, and contamination identification. The contamination identification conditions are as follows: 1) inter-batch contamination: a non-sample-unique coding sequence appears in the variable coding regions of the inter-batch insertion coding sequence, the suspected contamination code is present in the first and second variable coding regions simultaneously, and the number of reads of the suspected contamination sequence exceeds 3; 2) intra-batch contamination: (a) non-sample-unique coding sequences appear simultaneously in variable coding regions 1, 2, 3 and 4 of the intra-batch insertion coding sequence, and the number of reads of the suspected contamination coding sequence exceeds 3 in each variable coding region; (b) the suspected contamination coding sequence can be traced to a sample within the batch.

Contamination index statistics: the ratio of the effective depth of the target sample's unique code to the total effective depth of all recovered codes.

Contamination correction: the samples to which the inter-batch and intra-batch contamination point are traced back, and false-positive mutations of the sample are removed by comparison.
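The comparison step of the correction can be pictured with a small sketch. The source does not specify how the comparison is performed, so everything below is an assumption made for illustration: variants are held as a mapping from an identifier to its variant allele frequency, a call is treated as a false positive when it also occurs in the traced source sample and its frequency is at or below a hypothetical cutoff `max_vaf`, and all names and values are placeholders.

```python
def remove_contaminant_variants(sample_variants, source_variants, max_vaf=0.05):
    """Drop low-frequency calls that are also present in the traced source sample.

    sample_variants: dict of variant identifier -> allele frequency in the sample
    source_variants: set of variant identifiers called in the contaminating sample
    max_vaf:         hypothetical cutoff; calls above it are kept as genuine
    """
    kept = {}
    for variant, vaf in sample_variants.items():
        if variant in source_variants and vaf <= max_vaf:
            continue  # shared with the source at low frequency: likely carry-over
        kept[variant] = vaf
    return kept


# Placeholder identifiers and frequencies, not real data.
sample = {"var_A": 0.32, "var_B": 0.01, "var_C": 0.02}
source = {"var_B", "var_X"}
print(remove_contaminant_variants(sample, source))
```

In this toy input only `var_B` is removed: it is shared with the traced source and lies below the cutoff, while `var_C` is kept because the source does not carry it.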

The effective depths of the inter-batch and intra-batch insertion coding sequences in this example are shown in Table 12, which shows that the method can recover a sufficient number of sample unique identification codes.

TABLE 12

Effective depth of sequencing target	5192×
Inter-batch insertion coding sequence effective depth	4567×
Intra-batch insertion coding sequence effective depth	4605×

Example 3

This example performs amplicon library construction (TRB immune repertoire targeted sequencing). The working principle is shown in FIG. 3, and the method comprises the following steps:

1. Sample DNA is extracted. The sample type is a blood sample and the extraction kit is the QIAamp DNA Blood kit. After extraction, the DNA is quantified and 1 μg is added to a prefabricated tube, in which the variable coding region bases of one of the two batch insertion coding nucleic acid sequences (inter-batch and intra-batch) are ATAT and those of the other are T, C and T in 5' to 3' order; after the DNA is added, the tube is gently vortexed for 30 s;

2. The target and coding sequences are amplified and enriched by PCR: the target region is amplified by multiplex PCR using a TRB primer system, the reaction system is prepared according to Table 13, the intra-batch and inter-batch coding nucleic acid primer pairs are shown in SEQ ID NO.7, SEQ ID NO.8, SEQ ID NO.9 and SEQ ID NO.10, and the reaction conditions are shown in Table 14;

TABLE 13

2×Multiplex PCR Buffer	25μL
Multiplex Polymerase	1μL
TRB primer Mix (10μM)	2μL
Inter-batch coding nucleic acid primer working solution	2μL
Intra-batch coding nucleic acid primer working solution	2μL
Ultra-pure water	2μL
DNA (1000ng)	20μL

TABLE 14

3. After purification with AMPure XP beads, library construction PCR is carried out using the reaction system shown in Table 15 and the reaction conditions shown in Table 16, wherein the P5-F and P7-R sequences are shown in SEQ ID NO.11 (aatgatacggcacccagatctacatacgtacatgcgctcgctcgtcggcgcgcgcgcgtc) and SEQ ID NO.12 (caagcagagagaagaccgacatgaagctcgtctcgtgggctcgg), respectively;

TABLE 15

5× Reaction Buffer	10μL
DNA Polymerase	0.5μL
10mM dNTP	1μL
P5-F (10μM)	1μL
P7-R (10μM)	1μL
Nuclease-free water	34.5μL

TABLE 16

4. Library purification and sequencing

After amplification, the library is purified using AMPure XP beads and quantified using Qubit 3.0, and is then sequenced on a NovaSeq 6000 high-throughput sequencer with a PE150 read length; the sequencing amount is 0.3 M reads.

5. Data analysis and depth statistics

The data analysis and depth statistics methods are as described in Example 1. The numbers of effective reads of the inter-batch and intra-batch insertion coding sequences in this example are shown in Table 17, which shows that the amplicon library construction method can recover a sufficient number of sample unique identification codes.

TABLE 17

Inter-batch code effective reads	63425
Intra-batch code effective reads	52583

Example 4

This example verifies contamination identification performance using artificially mixed samples.

Contamination identification capability is verified by preparing samples with artificially simulated contamination at admixture ratios of 0.1%, 0.5%, 1%, 5% and 10%. The measured contamination index data are shown in FIG. 5; the molecular coding detection system for monitoring and correcting sequencing contamination can identify contamination at levels as low as 0.1%.
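The relationship between the admixture ratio and the "more than 3 reads" criterion can be illustrated with simple arithmetic. The sketch below is only an expectation calculation under stated assumptions: it ignores sampling variation, the function names are hypothetical, and the depth of 4567× is borrowed from Table 12 purely as an example because the code depth reached in this example is not stated.

```python
from fractions import Fraction

READ_THRESHOLD = 3  # from the text: suspected reads must exceed 3
RATIOS = [Fraction(s) for s in ("0.001", "0.005", "0.01", "0.05", "0.10")]


def expected_foreign_reads(code_depth: int, ratio: Fraction) -> Fraction:
    """Expected number of foreign coding reads at a given code depth and admixture ratio."""
    return code_depth * ratio


def min_depth_to_detect(ratio: Fraction, threshold: int = READ_THRESHOLD) -> int:
    """Smallest integer code depth whose expected foreign read count exceeds the threshold."""
    return int(threshold // ratio) + 1


for ratio in RATIOS:
    reads = expected_foreign_reads(4567, ratio)  # 4567x is an illustrative depth only
    print(f"{float(ratio):.1%}: expected {float(reads):.2f} reads, "
          f"minimum depth {min_depth_to_detect(ratio)}")
```

Exact fractions are used instead of floats so that the boundary case (exactly 3 expected reads) is not affected by rounding.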

In summary, the invention designs an insertion coding nucleic acid sequence with a specific structure, uses it to label the sample to be tested, and performs analysis on raw high-throughput sequencing data, so that cross-contamination among samples within a short-term batch and historical environmental contamination arising from long-term batch testing can be identified quickly and effectively. The insertion coding nucleic acid sequence can serve as a set of standard NGS reagents for quality evaluation of a testing laboratory and for cleaning, correcting and remedying contaminated test results without retesting the sample.

The applicant states that the present invention is illustrated by the above examples to show the detailed method of the present invention, but the present invention is not limited to the above detailed method, that is, it does not mean that the present invention must rely on the above detailed method to be carried out. It should be understood by those skilled in the art that any modification of the present invention, equivalent substitutions of the raw materials of the product of the present invention, addition of auxiliary components, selection of specific modes, etc., are within the scope and disclosure of the present invention.

Claims

1. A molecular coding detection system for monitoring and correcting sequencing contamination, said molecular coding detection system comprising at least one insertion coding nucleic acid sequence;

the insertion-encoding nucleic acid sequence comprises a framework sequence region with a known sequence and at least one variable encoding region;

the variable coding region is a random sequence consisting of any one or at least two of A, T, C or G;

the variable coding regions are randomly distributed within the framework sequence region;

the inserted coding nucleic acid sequence is single-stranded or double-stranded.

2. The molecular coding detection system for monitoring and correcting sequencing contamination according to claim 1, wherein the length of the inserted coding nucleic acid sequence is 100-2000 bp;

the length of the variable coding region is 1-20 bp, and the number of the variable coding regions is 1-4.

3. The molecular coding detection system for monitoring and correcting sequencing contamination of claim 1, wherein, based on the variable coding regions, the insertion coding nucleic acid sequences are classified as insertion coding nucleic acid sequences for identifying batch-to-batch contamination or insertion coding nucleic acid sequences for identifying intra-batch contamination;

the length of the variable coding region in the insertion coding nucleic acid sequence for identifying batch-to-batch contamination is different from the length of the variable coding region in the insertion coding nucleic acid sequence for identifying intra-batch contamination.

4. The molecular coding detection system for monitoring and correcting sequencing contamination according to claim 3, wherein the length of the variable coding region in the insertion coding nucleic acid sequence for identifying batch contamination is 1bp, and the number of the variable coding regions is 4;

the length of the variable coding region in the insertion type coding nucleic acid sequence for identifying batch pollution is 5bp, and the number of the variable coding regions is 2;

the insertion coding nucleic acid sequence for identifying batch-to-batch pollution comprises a sequence shown in SEQ ID NO. 1;

the inserted coding nucleic acid sequence for identifying the pollution in the batch comprises a sequence shown in SEQ ID NO. 2.

5. The molecular coding assay system for monitoring and correcting sequencing contamination of claim 1, further comprising a coded information recovery system;

the encoded information recovery system includes probes or primers complementary to the inserted encoding nucleic acid sequence.

6. The molecular coding detection system for monitoring and correcting sequencing contamination according to claim 5, wherein the length of the probe is 50-200 bp, and the number of the probes is 1-1000;

the length of the primer is 18-25 bp;

the nucleic acid sequence of the probe for identifying the insertion-type coding nucleic acid sequence of batch-to-batch pollution is selected from the sequences shown in SEQ ID NO.3 and/or SEQ ID NO. 4;

the nucleic acid sequence of the probe for identifying the insertion-type coding nucleic acid sequence of the pollution in the batch is selected from a sequence shown in SEQ ID NO.5 and/or SEQ ID NO. 6;

the nucleic acid sequence of the primer for identifying the insertion-type coding nucleic acid sequence of the batch-to-batch pollution comprises the sequences shown in SEQ ID NO.7 and SEQ ID NO. 8;

the nucleic acid sequence of the primer for identifying the insertion-type coding nucleic acid sequence of the pollution in the batch comprises the sequences shown in SEQ ID NO.9 and SEQ ID NO. 10.

7. Use of the molecular coding detection system of any one of claims 1 to 6 for monitoring and correcting sequencing contamination in the preparation of a genetic sequencing product.

8. A sequencing kit comprising the molecular coding assay system of any one of claims 1 to 6 for monitoring and correcting sequencing contamination.

9. Use of the molecular coding detection system of any one of claims 1 to 6 for monitoring and correcting sequencing contamination in genetic sequencing.

10. A method of monitoring and correcting sequencing contamination, the method comprising:

mixing the molecular coding detection system for monitoring and correcting sequencing pollution according to any one of claims 1 to 6 with a sample to be detected, constructing and purifying a library, sequencing the purified library, and performing data analysis and pollution correction according to a sequencing result;

the standard for judging pollution is as follows: all variable coding regions in the insertion coding nucleic acid sequence have non-sample unique coding sequences, and the number of reads of suspected pollution sequences exceeds 3;