CN115717163A - Molecular coding detection system for monitoring and correcting sequencing pollution and application thereof - Google Patents

Publication number
CN115717163A
CN115717163A CN202211328995.5A CN202211328995A CN115717163A CN 115717163 A CN115717163 A CN 115717163A CN 202211328995 A CN202211328995 A CN 202211328995A CN 115717163 A CN115717163 A CN 115717163A
Authority
CN
China
Prior art keywords
coding
nucleic acid
acid sequence
batch
pollution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211328995.5A
Other languages
Chinese (zh)
Other versions
CN115717163B (en)
Inventor
庞震国
李丽莎
朱振刚
王霞
刘萍萍
汤郡
张亚飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Meijie Transformation Medical Research Suzhou Co ltd
Original Assignee
Meijie Transformation Medical Research Suzhou Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Meijie Transformation Medical Research Suzhou Co ltd filed Critical Meijie Transformation Medical Research Suzhou Co ltd
Priority to CN202211328995.5A priority Critical patent/CN115717163B/en
Publication of CN115717163A publication Critical patent/CN115717163A/en
Application granted granted Critical
Publication of CN115717163B publication Critical patent/CN115717163B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a molecular coding detection system for monitoring and correcting sequencing contamination and application thereof. The molecular coding detection system comprises at least one insertion-type coding nucleic acid sequence, which comprises a backbone sequence region of known sequence and at least one variable coding region. The variable coding region is a random sequence consisting of any one or at least two of A, T, C or G, the variable coding regions are randomly distributed within the backbone sequence region, and the insertion-type coding nucleic acid sequence is single-stranded or double-stranded. The invention designs an insertion-type coding nucleic acid sequence with a specific structure, uses it to label the sample to be tested, and analyzes the raw high-throughput sequencing data, so that cross-contamination among samples within a short-term batch and historical environmental contamination caused by long-term batch testing can be identified quickly and effectively.

Description

Molecular coding detection system for monitoring and correcting sequencing pollution and application thereof
Technical Field
The invention belongs to the technical field of gene sequencing, and relates to a molecular coding detection system for monitoring and correcting sequencing pollution and application thereof.
Background
Next-generation sequencing (NGS) has become an emerging technology for modern biological research and medical diagnosis owing to its high data throughput, large sample capacity, ultra-high sensitivity, ability to detect multiple analysis targets simultaneously, and low per-sample analysis cost. Diagnostic products based on NGS are increasingly approved by medical regulatory authorities and have been commercialized, standardized and industrialized. However, the long and complex workflow, batch library construction and centralized testing of NGS make samples prone to contamination, which poses a hidden risk for industrial-scale diagnostics.
Contamination in NGS testing generally comes from three sources: (1) sample-processing contamination, including sample information errors and cross-contamination occurring during sample collection and nucleic acid extraction; (2) testing-process contamination, generally reagent contamination such as adapter index contamination during the complex library construction process, or carry-over or cross-contamination among library construction intermediates, which is particularly common when a large number of samples in the same batch are processed in parallel; (3) testing-environment contamination, caused by high concentrations of aerosolized contaminating molecules in the testing environment.
The existing pooling method for centralized on-machine sequencing labels libraries with molecular tags: each library is constructed independently using adapters or primers that carry an additional library identification sequence, and sample data are separated after sequencing by tracing back the tag information in the data. Any contamination introduced during pooling is carried into sequencing and cannot be identified or pre-processed by post-sequencing data quality control; whether contamination occurred during the handling of a given sample can be determined only after the data results have been analyzed. Cross-contamination of the sample tag reagents themselves during library construction can even cause artificial false contamination, that is, data contamination. Existing methods for identifying and monitoring sample contamination rely mainly on passive analysis, such as checking the sex of the patient sample or the concordance and heterogeneity of germline SNPs between a reference sample and the test sample. Whether a sample is contaminated is known only after the analysis is complete, the source of contamination cannot be traced, and the analysis cannot be performed for samples without a control or for small targeted sequencing panels. Industrial NGS testing therefore requires a new system to solve these sample contamination problems.
In summary, providing a method for monitoring, identifying and correcting contamination of high-throughput sequencing samples is of great significance to the technical field of gene sequencing.
Disclosure of Invention
Aiming at the defects and actual requirements of the prior art, the invention provides a molecular coding detection system for monitoring and correcting sequencing pollution and application thereof.
In order to achieve the purpose, the invention adopts the following technical scheme:
In a first aspect, the present invention provides a molecular coding detection system for monitoring and correcting sequencing contamination. The molecular coding detection system comprises at least one insertion-type coding nucleic acid sequence, which comprises a backbone sequence region of known sequence and at least one variable coding region. The variable coding region is a random sequence composed of any one or at least two of A, T, C or G, the variable coding regions are randomly distributed within the backbone sequence region, and the insertion-type coding nucleic acid sequence is single-stranded or double-stranded.
In the invention, an insertion-type coding nucleic acid sequence with a specific structure is designed. One part is a fixed, known reference backbone sequence used for read mapping during information recovery; the other part is a variable coding region used to encode sample-specific information for contamination identification. The insertion-type coding nucleic acid sequence is used to label the sample to be tested, and analysis is performed on the raw high-throughput sequencing data, so that cross-contamination among samples within a short-term batch and historical environmental contamination caused by long-term batch testing can be identified rapidly and effectively. The insertion-type coding nucleic acid sequences can also serve as a set of standard NGS reagents for quality assessment of a testing laboratory and for cleaning, correcting and remedying test results without retesting the contaminated sample.
In the invention, a known sequence is selected as the backbone sequence region so as to ensure that it has no homology with the sample to be tested.
Preferably, the length of the insertion-encoding nucleic acid sequence is 100-2000 bp, including but not limited to 101bp, 102bp, 103bp, 104bp, 105bp, 120bp, 200bp, 220bp, 240bp, 260bp, 280bp, 300bp, 500bp, 800bp, 1000bp, 1200bp, 1300bp, 1400bp, 1600bp, 1700bp, 1800bp, 1900bp, 1950bp, 1980bp, 1990bp, 1995bp, 1998bp or 1999bp, preferably 200-300 bp.
Preferably, the variable coding region is 1-20 bp in length, including but not limited to 2bp, 3bp, 4bp, 5bp, 6bp, 7bp, 8bp, 10bp, 12bp, 15bp, 16bp, 17bp, 18bp or 19bp, and the number of variable coding regions is 1-4.
Preferably, depending on the variable coding region, the insertion-type coding nucleic acid sequences are classified as insertion-type coding nucleic acid sequences for identifying inter-batch contamination or insertion-type coding nucleic acid sequences for identifying intra-batch contamination.
Preferably, the length of the variable coding region in the insertion-type coding nucleic acid sequence for identifying inter-batch contamination differs from the length of the variable coding region in the insertion-type coding nucleic acid sequence for identifying intra-batch contamination.
In the present invention, the length of the insertion-type coding nucleic acid sequence for identifying intra-batch contamination can be designed as required. It may be 100 to 2000 bases, preferably 200 to 300 bases, and more preferably 240 bases. The total length of the variable coding regions is generally 1 to 4 bases, distributed over 1 to 4 positions; preferably, each coding region is 1 base long and the regions are distributed over 4 positions of the nucleic acid sequence.
In the present invention, the variable coding region of the insertion-type coding nucleic acid sequence for identifying inter-batch contamination may be 1 to 20 bases long, preferably 5 bases. Preferably, the coding region and coding mode used for identifying inter-batch contamination differ from those used for identifying intra-batch sample contamination. For example, the variable coding region of the batch identification sequence is a contiguous base region; more preferably, it consists of two independent contiguous base regions carrying the same code. This adds a filtering condition during extraction of the coding information and improves its reliability, preventing signal noise or information loss caused by sequencing errors or uneven sequencing depth.
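The duplicated-code filter described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `concordant_codes` and the input format (pairs of already-extracted region strings) are assumptions made for the example.

```python
from collections import Counter

def concordant_codes(reads):
    """Count batch codes, keeping only reads whose two code regions agree.

    `reads` is an iterable of (region1, region2) string pairs that have
    already been extracted from the two variable coding regions of each
    recovered read (hypothetical input format).
    """
    counts = Counter()
    for region1, region2 in reads:
        # A sequencing error in either copy makes the two regions disagree,
        # so such reads are discarded instead of being counted as a new code.
        # Reads with ambiguous base calls (e.g. N) are discarded as well.
        if region1 and region1 == region2 and set(region1) <= set("ATCG"):
            counts[region1] += 1
    return counts

# Toy input: two clean reads, one with an error in the second copy,
# one with an ambiguous base call.
reads = [("AGGT", "AGGT"), ("AGGT", "AGGT"), ("AGGT", "AGCT"), ("ANGT", "ANGT")]
codes = concordant_codes(reads)
```

Requiring both copies to match trades a small loss of depth for protection against single-base errors being read as a different batch code.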
Preferably, the variable coding region in the insertion-type coding nucleic acid sequence for identifying intra-batch contamination is 1 bp in length, and the number of variable coding regions is 4.
Preferably, the variable coding region in the insertion-type coding nucleic acid sequence for identifying inter-batch contamination is 5 bp in length, and the number of variable coding regions is 2.
Preferably, the insertion-type coding nucleic acid sequence for identifying inter-batch contamination comprises the sequence shown in SEQ ID NO.1.
Preferably, the insertion-type coding nucleic acid sequence for identifying intra-batch contamination comprises the sequence shown in SEQ ID NO.2.
SEQ ID NO.1:
CTAAATCGGGGGCTCCCTTTAGGGTTCCGATTTAGTGCTTTACGGCACCTCGACTCCNNNNACTTGATTAGGGTGATGGTTCACGTAGTGGGCCATCGCCCTGATAGACGGTTTTTCGCCCTTTGACGTTGGAGTCCACGTTCTTTAATAGTGGACTCTTGTTCCAAACTGGAACAACNNNNAACCCTATCTCGGTCTATTCTTTTGATTTATAAGGGATTTTGCCGATTTCGGCCTATT。
SEQ ID NO.2:
CGTGGCTGGCCACGACGGGCGTTCCTTGCGCAGCTGTGCTCGACGTTGNCACTGAAGCGGGAAGGGACTGGCTGCTATTGGGCGAAGTGCCGGGGCANGATCTCCTGTCATCCCACCTTGCTCCTGCCGAGAAAGTATCCATCATGNCTGATGCAATGCGGCGGCTGCATACGCTTGATCCGGCTACCTGCCCATNCGACCACCAAGCGAAACATCGCATCGAGCGAGCACGTACTCGGA。
Wherein N is any one of A, T, C and G.
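To illustrate how the N positions in such a template carry the code, the sketch below locates the variable positions in a template and reads the code out of a read aligned end-to-end to it. The short template `TOY_TEMPLATE`, the function names, and the strict no-mismatch rule for backbone bases are assumptions made for this example only; they are not taken from SEQ ID NO.1 or SEQ ID NO.2.

```python
def variable_positions(template):
    """Return the 0-based indices of the variable (N) bases in a template."""
    return [i for i, base in enumerate(template) if base == "N"]

def extract_code(template, read):
    """Read the code bases out of a read aligned end-to-end to the template.

    Returns None if the read length differs from the template or if any
    fixed backbone base mismatches (a deliberately strict toy rule).
    """
    if len(read) != len(template):
        return None
    code = []
    for t, r in zip(template, read):
        if t == "N":
            code.append(r)      # variable position: record the observed base
        elif t != r:
            return None         # backbone mismatch: reject the read
    return "".join(code)

TOY_TEMPLATE = "ACGTNNNNTTGCA"  # hypothetical 13-base template with one NNNN region

positions = variable_positions(TOY_TEMPLATE)
code = extract_code(TOY_TEMPLATE, "ACGTAGGTTTGCA")
rejected = extract_code(TOY_TEMPLATE, "ACGAAGGTTTGCA")
```

A production pipeline would work from alignments and tolerate some backbone mismatches; the strict rule here only keeps the example short.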
Preferably, the molecular coding detection system further comprises a coding information recovery system.
Preferably, the encoded information recovery system comprises a probe or primer complementary to the inserted encoding nucleic acid sequence.
Depending on the application scenario, the coding information recovery system can be implemented in different ways, such as liquid-phase hybridization capture or amplicon primer amplification. In some embodiments, recovery probes specific to the insertion-type coding nucleic acid sequence are added to the hybridization capture probe pool. The probes consist of bases matching the insertion-type coding nucleic acid sequence and, in the variable coding region, preferably of degenerate complementary bases. The probes may be 50 to 200 bases long, preferably 120 bases, and the number of probes may be any number from 1 to 1000. The working concentration of the recovery probes may be between 0.1 nM and 10 nM. The recovery probes carry one or more biotin labels to facilitate recovery.
The working principle of probe-enrichment library construction with a fragmentation step, together with the insertion-type coding nucleic acid sequence and recovery system, is shown in FIG. 1: genomic DNA and the insertion-type coding nucleic acid sequence 101 are sheared by ultrasonication into fragments 102 of about 150-200 bp; after end repair, library adapters are ligated to both ends to form the pre-amplification library 103; during liquid-phase hybridization capture, the insertion-type coding nucleic acid sequence fragments and the genomic fragments containing the target sequence bind to the coding nucleic acid sequence recovery probes and the gene-specific probes 104, respectively, to complete capture.
The working principle of probe-enrichment library construction without a fragmentation step, together with the insertion-type coding nucleic acid sequence and recovery system, is shown in FIG. 2; no fragmentation step is needed for some sequencing sample types, such as ctDNA. The substrate DNA, part of which contains the target sequence, is mixed with the coding nucleic acid sequence 201 and ligated to adapters to form the pre-amplification library 202; during liquid-phase hybridization capture, the insertion-type coding nucleic acid sequence fragments and the genomic fragments containing the target sequence bind to the coding nucleic acid sequence recovery probes and the gene-specific probes 203, respectively, to complete capture.
In other embodiments, the recovery system consists of primers matching 10-30 bases at the 5' end and the 3' end of the insertion-type coding nucleic acid sequence, preferably 18-25 bases in length, and the working concentration of the recovery primers may be 0.1-10 μM. The working principle of amplicon-enrichment library construction, together with the insertion-type coding nucleic acid sequence and recovery system, is shown in FIG. 3: genomic DNA containing the target sequence is mixed with the coding nucleic acid sequence 301; in the first round of PCR, the target gene-specific primer pair and the insertion-type coding nucleic acid sequence recovery primer pair, each modified at the 5' end with a universal sequence and a sequencing primer sequence, bind to the genomic fragments and the coding nucleic acid sequence 302, respectively; the second round of PCR uses amplification primers that carry the P5 or P7 sequence and an index sequence at the 5' end and match the universal sequence at the 5' end of the first-round primers, completing library construction 303.
Preferably, the probes may have a length of 50 to 200bp and a number of 1 to 1000.
Preferably, the length of the primer is 18 to 25bp.
Preferably, the nucleic acid sequence of the probe for the insertion-type coding nucleic acid sequence used to identify inter-batch contamination is selected from the sequences shown in SEQ ID NO.3 and/or SEQ ID NO.4.
SEQ ID NO.3:
CTAAATCGGGGGCTCCCTTTAGGGTTCCGATTTAGTGCTTTACGGCACCTCGACTCCNNNNACTTGATTAGGGTGATGGTTCACGTAGTGGGCCATCGCCCTGATAGACGGTTTTTCGCC-biotin。
SEQ ID NO.4:
CTTTGACGTTGGAGTCCACGTTCTTTAATAGTGGACTCTTGTTCCAAACTGGAACAACNNNNAACCCTATCTCGGTCTATTCTTTTGATTTATAAGGGATTTTGCCGATTTCGGCCTATT-biotin。
Preferably, the nucleic acid sequence of the probe for the insertion-type coding nucleic acid sequence used to identify intra-batch contamination is selected from the sequences shown in SEQ ID NO.5 and/or SEQ ID NO.6.
SEQ ID NO.5:
CGTGGCTGGCCACGACGGGCGTTCCTTGCGCAGCTGTGCTCGACGTTGNCACTGAAGCGGGAAGGGACTGGCTGCTATTGGGCGAAGTGCCGGGGCANGATCTCCTGTCATCCCACCTTG-biotin。
SEQ ID NO.6:
CTCCTGCCGAGAAAGTATCCATCATGNCTGATGCAATGCGGCGGCTGCATACGCTTGATCCGGCTACCTGCCCATNCGACCACCAAGCGAAACATCGCATCGAGCGAGCACGTACTCGGA-biotin。
Preferably, the nucleic acid sequences of the primers for the insertion-type coding nucleic acid sequence used to identify inter-batch contamination comprise the sequences shown in SEQ ID NO.7 and SEQ ID NO.8.
SEQ ID NO.7:
TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCTAAATCGGGGGCTCCCTTTAGG。
SEQ ID NO.8:
GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGAATAGGCCGAAATCGGCAAAATCCCT。
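As a quick consistency check on the listed sequences, the sketch below verifies that the 3' portion of the reverse primer (SEQ ID NO.8) is the reverse complement of the last 26 bases of SEQ ID NO.1, as expected for a primer that anneals to the 3' end of the insertion-type sequence. The helper name `reverse_complement` and the variable names are illustrative assumptions; the two sequence strings are copied from the listings in this document.

```python
# Translation table for complementing unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

# Last 26 bases of SEQ ID NO.1 as listed above.
SEQ1_3PRIME_26 = "AGGGATTTTGCCGATTTCGGCCTATT"

# SEQ ID NO.8 as listed above (fusion sequence followed by the annealing part).
SEQ8 = "GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGAATAGGCCGAAATCGGCAAAATCCCT"

# True when the primer's 3' portion anneals to the 3' end of SEQ ID NO.1.
anneals = SEQ8.endswith(reverse_complement(SEQ1_3PRIME_26))
```

The same check can be repeated for SEQ ID NO.10 against the 3' end of SEQ ID NO.2.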
Preferably, the nucleic acid sequences of the primers for the insertion-type coding nucleic acid sequence used to identify intra-batch contamination comprise the sequences shown in SEQ ID NO.9 and SEQ ID NO.10.
SEQ ID NO.9:
TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCGTGGCTGGCCACGACGGGCGTTCCTT。
SEQ ID NO.10:
GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGTCCGAGTACGTGCTCGCTCGATGCGA。
In the present invention, the molecular coding detection system for monitoring and correcting sequencing contamination can be applied throughout the entire high-throughput sequencing workflow, for example from the start of sample nucleic acid extraction. To this end, the insertion-type coding nucleic acid sequence can be pre-loaded into a nucleic acid container in an amount of 1% to 1000%, preferably 10% to 100%, of the number of sample molecules.
In a second aspect, the present invention provides the use of the molecular coding detection system of the first aspect for monitoring and correcting sequencing contamination in the preparation of a genetic sequencing product.
The molecular coding detection system for monitoring and correcting sequencing pollution designed by the invention can be effectively applied to preparing sequencing products and used as a component for monitoring and correcting sequencing pollution.
In a third aspect, the present invention provides a sequencing kit comprising the molecular coding detection system of the first aspect for monitoring and correcting sequencing contamination.
In a fourth aspect, the present invention provides the use of the molecular coding detection system of the first aspect for monitoring and correcting sequencing contamination in gene sequencing.
In a fifth aspect, the present invention provides a method of monitoring and correcting sequencing contamination, the method comprising:
mixing the molecular coding detection system for monitoring and correcting sequencing contamination with a sample to be tested, constructing and purifying a library, sequencing the purified library, and performing data analysis and contamination correction according to the sequencing result.
The criterion for calling contamination is as follows: coding sequences other than the unique code of the sample appear in all variable coding regions of the insertion-type coding nucleic acid sequence, and the number of reads of the suspected contaminating sequence exceeds 3.
The contamination correction includes: tracing back the sample indicated by the contamination, and removing false-positive mutations of the sample by comparison.
In the present invention, the flow chart of the method for monitoring and correcting sequencing contamination is shown in FIG. 4. The analysis process includes mapping the sequencing data to the reference genome and the reference coding nucleic acid sequences, then filtering the recovered insertion-type coding nucleic acid sequences and computing their effective depth. The steps for filtering the insertion-type coding nucleic acid sequence data and identifying contamination are: data mapping, extraction of the inter-batch and intra-batch variable coding region sequences, removal of duplicate sequences, and contamination identification. The contamination identification conditions are as follows:
(1) Inter-batch contamination: coding sequences other than the unique code of the sample appear in the variable coding regions of the inter-batch insertion-type coding nucleic acid sequence, the suspected contaminating code is present in all variable coding regions simultaneously, and the number of reads of the suspected contaminating sequence exceeds 3;
(2) Intra-batch contamination: (a) coding sequences other than the unique code of the sample appear simultaneously in all variable coding regions of the intra-batch insertion-type coding nucleic acid sequence, and the number of reads of the suspected contaminating coding sequence exceeds 3 in each coding region; and (b) the suspected contaminating coding sequence can be traced to a sample within the batch.
The effective depth statistic (contamination index) is the ratio of the effective depth of the unique code of the target sample to the total effective depth of all recovered codes.
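The identification conditions and the contamination index can be sketched as below. This is a simplified illustration under stated assumptions: each code is treated as the joint value of all variable coding regions of one read, the depths are deduplicated read counts, and the function names and example numbers are hypothetical rather than taken from the source.

```python
READ_THRESHOLD = 3  # from the text: suspected codes must exceed 3 reads

def contamination_index(code_depths, sample_code):
    """Share of the total recovered effective depth carried by the sample's own code."""
    total = sum(code_depths.values())
    return code_depths.get(sample_code, 0) / total if total else None

def suspected_codes(code_depths, sample_code, threshold=READ_THRESHOLD):
    """Codes other than the sample's own code whose depth exceeds the threshold."""
    return sorted(code for code, depth in code_depths.items()
                  if code != sample_code and depth > threshold)

def intra_batch_contaminants(code_depths, sample_code, batch_codes):
    """Condition (b): keep only suspected codes that trace to a sample in the same batch."""
    return [code for code in suspected_codes(code_depths, sample_code)
            if code in batch_codes]

# Hypothetical deduplicated depths per code; not data from the source.
inter_depths = {"AGGT": 970, "ATAT": 25, "CCCC": 2}
intra_depths = {"ATCC": 950, "TCGT": 10, "GTAA": 8}

inter_hits = suspected_codes(inter_depths, "AGGT")
intra_hits = intra_batch_contaminants(intra_depths, "ATCC", {"ATCC", "TCGT"})
index = contamination_index(inter_depths, "AGGT")
```

In this toy input, `CCCC` stays below the read threshold and `GTAA` cannot be traced to a sample in the batch, so neither is reported.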
Compared with the prior art, the invention has the following beneficial effects:
In the invention, an insertion-type coding nucleic acid sequence with a specific structure is designed and used to label the sample to be tested, and analysis is performed on the raw high-throughput sequencing data, so that cross-contamination among samples within a short-term batch and historical environmental contamination caused by long-term batch testing can be identified rapidly and effectively. The insertion-type coding nucleic acid sequences can also serve as a set of standard NGS reagents for quality assessment of a testing laboratory and for cleaning, correcting and remedying test results without retesting the contaminated sample.
Drawings
FIG. 1 is a schematic diagram of probe-enrichment library construction with a fragmentation step, together with the insertion-type coding nucleic acid sequence and recovery system;
FIG. 2 is a schematic diagram of probe-enrichment library construction without a fragmentation step, together with the insertion-type coding nucleic acid sequence and recovery system;
FIG. 3 is a schematic diagram of amplicon-enrichment library construction, together with the insertion-type coding nucleic acid sequence and recovery system;
FIG. 4 is a flow chart of the method for monitoring and correcting sequencing contamination;
FIG. 5 is a graph showing the performance verification results of the molecular coding detection system for monitoring and correcting sequencing contamination according to the present invention.
Detailed Description
To further illustrate the technical means adopted by the present invention and the effects thereof, the present invention is further described below with reference to the embodiments and the accompanying drawings. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and that no limitation of the invention is intended.
Where specific techniques or conditions are not indicated in the examples, they follow the techniques or conditions described in the literature in the field or the product specifications. Reagents or instruments whose manufacturer is not indicated are conventional products commercially available through regular channels.
Example 1
This example designs a molecular coding detection system for monitoring and correcting sequencing contamination.
1. Design of the insertion-type coding nucleic acid sequences
The insertion-type coding nucleic acid sequence is double-stranded DNA, and the backbone sequence is an exogenous artificial sequence. The insertion-type coding nucleic acid sequence for identifying inter-batch contamination is shown in SEQ ID NO.1. It consists of 240 bases, and a BLAST search shows no homology with the human genome. Bases 58 to 61 and bases 179 to 182 are the variable coding regions, denoted NNNN, where each N is any one of A, T, C and G, and the sequences of bases 58 to 61 and bases 179 to 182 are identical. According to the different combinations of N, 256 different insertion-type coding nucleic acid sequences were designed. The capture recovery probes are single-stranded DNA and consist of two 120-base sequences matching bases 1 to 120 and bases 121 to 240 of SEQ ID NO.1, respectively; the positions matching the variable coding regions are degenerate bases, and each probe carries a biotin label on the base at its 3' end. The specific sequences are shown in SEQ ID NO.3 and SEQ ID NO.4. The amplicon recovery primers for the coding nucleic acid sequence consist of a forward primer and a reverse primer corresponding to the 23 bases at the 5' end and the 26 bases at the 3' end of SEQ ID NO.1, respectively; the 5' end of each primer carries a fusion sequence matching the library construction primers. The specific sequences are shown in SEQ ID NO.7 and SEQ ID NO.8.
The insertion-type coding nucleic acid sequence for identifying intra-batch contamination is shown in SEQ ID NO.2. It consists of 240 bases, and a BLAST search shows no homology with the human genome. Bases 49, 98, 147 and 196 are the variable coding regions N: base 49 is coded as A, T, C or G, base 98 as T or C, base 147 as A, C or G, and base 196 as A, T, C or G. According to the different combinations of N, 96 different insertion-type coding nucleic acid sequences were designed. The capture recovery probes are single-stranded DNA and consist of two 120-base sequences matching bases 1 to 120 and bases 121 to 240 of SEQ ID NO.2, respectively; the variable coding regions are degenerate bases, and each probe carries a biotin label at its 3' end. The specific sequences are shown in SEQ ID NO.5 and SEQ ID NO.6. The amplicon recovery primers consist of a forward primer and a reverse primer corresponding to the 27 bases at the 5' end and the 26 bases at the 3' end of SEQ ID NO.2, respectively; the 5' end of each primer carries a fusion sequence matching the library construction primers. The specific sequences are shown in SEQ ID NO.9 and SEQ ID NO.10.
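The code counts quoted above (256 and 96) follow directly from the allowed bases at each variable position, as the short sketch below shows. The variable names are illustrative; the per-position alphabets are those stated in the text.

```python
from itertools import product

# Inter-batch design: one 4-base code (NNNN), duplicated in both regions,
# with every N drawn from A, T, C, G.
inter_codes = ["".join(p) for p in product("ATCG", repeat=4)]

# Intra-batch design: four single-base positions (49, 98, 147, 196),
# each with its own allowed alphabet as stated in the text.
INTRA_ALPHABETS = ["ATCG", "TC", "ACG", "ATCG"]
intra_codes = ["".join(p) for p in product(*INTRA_ALPHABETS)]

print(len(inter_codes))  # 4**4 = 256
print(len(intra_codes))  # 4 * 2 * 3 * 4 = 96
```

Because the two NNNN regions of the inter-batch sequence always carry the same code, duplicating the region does not increase the number of distinct sequences.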
2. Chemical synthesis
The 256 designed insertion-type coding nucleic acid sequences for identifying inter-batch contamination and the 96 designed insertion-type coding nucleic acid sequences for identifying intra-batch contamination, together with their matched capture recovery probes and recovery primers, were synthesized by a commissioned vendor (Integrated DNA Technologies) and supplied as dry powder.
Stock solution preparation and quantification: the insertion-type double-stranded nucleic acid sequences and their matched recovery probes and primers were dissolved in ultrapure water according to the instructions for the synthesized products to prepare 100 μM stock solutions. The inter-batch and intra-batch insertion-type double-stranded nucleic acids were then serially diluted to the production concentration of the prefabricated solution, 30000 copies/μL; 2 μL of the prefabricated solution was added to the bottom of a 1.5 mL EP tube, and the prefabricated tubes were stored in a refrigerator at -80 ℃.
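As a rough plausibility check of the spike-in amount, the sketch below compares the copies pre-loaded per tube (2 μL at 30000 copies/μL) with the number of genome copies in a given DNA input. The source does not state how the "number of sample molecules" is counted; counting haploid human genome equivalents at roughly 3.3 pg each is an assumption of this example, as are the function and constant names.

```python
# Assumption (not from the source): one haploid human genome weighs about 3.3 pg.
HAPLOID_GENOME_PG = 3.3

def spike_in_ratio(dna_ng, copies_per_ul=30000, volume_ul=2.0):
    """Ratio of pre-loaded coding-sequence copies to haploid genome copies."""
    genome_copies = dna_ng * 1000.0 / HAPLOID_GENOME_PG  # ng -> pg -> copies
    spike_copies = copies_per_ul * volume_ul
    return spike_copies / genome_copies

ratio_200ng = spike_in_ratio(200)    # DNA input used for the FFPE sample
ratio_1ug = spike_in_ratio(1000)     # DNA input used for the blood sample
```

Under this assumption, both inputs give a ratio inside the broad 1% to 1000% range stated in the disclosure.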
Example 2
This example tests capture-enrichment library construction with a fragmentation step (targeted panel). The working principle is shown in FIG. 1, and the test comprises the following steps:
(1) Sample DNA extraction: the test sample was an FFPE sample, and the DNA extraction kit was the QIAamp DNA FFPE Tissue Kit. After extraction, 200 ng of DNA was quantified and added to a prefabricated tube. In this prefabricated tube, the variable coding region of the insertion-type coding nucleic acid sequence for identifying inter-batch contamination was AGGT, and the variable coding regions of the insertion-type coding nucleic acid sequence for identifying intra-batch contamination were A, T, C and C in 5' to 3' order. After the extracted DNA was added to the prefabricated tube, the tube was gently vortexed for 30 s;
(2) Library construction: the library construction reagents used in this example were purchased from NEB, and the probe hybridization reagents were from IDT;
a. Nucleic acid fragmentation: the DNA was brought to 50 μL with 1× TE buffer and fragmented using a Covaris M220 ultrasonicator according to the settings in Table 1;
TABLE 1
Parameter | Setting
Duty cycle | 10%
Peak power | 75
Cycles per burst | 200
Fragmentation duration | 100-330 s
Water bath temperature | 18-20 ℃
b. Repair of fragmented DNA: the reaction mixture was prepared according to Table 2 and incubated at 20 ℃ for 15 min;
TABLE 2
Component | Volume
Fragmented FFPE DNA | 50 μL
FFPE DNA buffer | 6.5 μL
NEBNext FFPE DNA Repair Mix | 2 μL
Ultrapure water | 3.5 μL
Total | 62 μL
c. Magnetic bead purification and end repair: nucleic acids were purified using AMPure XP beads, and end repair was performed using the NEBNext Ultra II End Prep kit; the end-repair reaction system and thermal cycling program are shown in Tables 3 and 4;
TABLE 3
Component | Volume
FFPE DNA | 50 μL
NEBNext Ultra Ⅱ End Prep Buffer | 7 μL
NEBNext Ultra Ⅱ End Prep enzyme mix | 3 μL
Total | 60 μL
TABLE 4
Step | Temperature | Time
Cycle 1 | 20 ℃ | 30 min
Cycle 2 | 65 ℃ | 30 min
Cycle 3 | 4 ℃ | Hold
d. Adapter ligation: the ligation reaction was set up according to the reaction system in Table 5 and incubated at 20 ℃ for 15 min;
TABLE 5
Component | Volume
DNA Repair Reaction Mixture | 60 μL
NEBNext Ultra Ⅱ Ligation Master Mix | 30 μL
NEBNext Ligation Enhancer | 1 μL
Duplex Seq Adapters | 2 μL
Total | 93 μL
e. Library fragment size selection and pre-amplification: fragments were size-selected with AMPure XP beads, and the adapter-ligated library was pre-amplified according to the reaction system in Table 6 and the reaction conditions in Table 7;
TABLE 6
Component | Volume
NEBNext Ultra Ⅱ Q5 Master Mix | 25 μL
UDI Primer Mix | 5 μL
Total | 30 μL
TABLE 7
[Table 7 is provided as an image in the original publication: Figure BDA0003912985180000071]
f. Hybridization capture: the target sequences and the coding nucleic acid sequences were captured according to the reaction system in Table 8 and the reaction conditions in Table 9;
TABLE 8
Component | Volume
2X Hybridization Buffer | 8.5 μL
Hybridization Buffer Enhancer | 2.7 μL
Targeted gene panel | 4 μL
Inter-batch and intra-batch coding nucleic acid recovery probes | 1.8 μL
Total | 17 μL
TABLE 9
Step | Temperature | Time
Cycle 1 | 95 ℃ | 30 s
Cycle 2 | 65 ℃ | 4 h
Cycle 3 | 65 ℃ | Hold
g. Streptavidin magnetic bead recovery, and amplification and purification of the capture library: the hybrid-captured sequences were recovered and washed according to the Dynabeads M-270 instructions, and the capture library was amplified according to the reaction system in Table 10 and the reaction conditions in Table 11;
TABLE 10
Component | Volume
Library PCR Master Mix (2×) | 25 μL
Illumina P5/P7 Primer Mix (10×) | 5 μL
Dynabeads | 20 μL
Total | 50 μL
TABLE 11
[Table 11 is provided as an image in the original publication: Figure BDA0003912985180000081]
h. Library purification and quantification
After amplification, the library was purified using AMPure XP beads; after purification, the library was quantified using a Qubit 3.0.
(3) Sequencing
The Novaseq 6000 high-throughput sequencer PE150 is used for reading length to carry out on-machine sequencing, the sequencing depth is 10000 x, data mapping is carried out to a reference genome and a reference coding nucleic acid sequence, the recovered insertion coding nucleic acid sequence is filtered and effectively subjected to depth statistics, and the standards and steps for filtering the insertion coding nucleic acid data and judging pollution are as follows: data replying, extraction of variable coding region sequences among batches and in batches, removal of repeated sequences, pollution identification, and the pollution identification conditions are as follows: 1) Batch contamination, wherein the variable coding region of the batch insertion coding sequence has a non-sample unique coding sequence, and suspected contamination codes exist in the first variable coding region and the second variable coding region simultaneously; and the number of reads of suspected pollution sequences exceeds 3; 2) Batch contamination, (a) batch in-batch insert coding variable coding regions 1, 2, 3, 4 simultaneously present non-sample unique coding sequences, and suspected contamination coding sequence reads exceed 3 in each variable coding region, (b) suspected contamination coding sequences can be traced in the batch of samples.
Contamination index statistics: the ratio of the effective depth of the target sample's unique code to the total effective depth of all recovered codes.
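The index defined above is a simple proportion and can be computed as shown in this sketch. The function name and the dictionary input are hypothetical, and the numbers in the usage lines are made up for illustration.

```python
def contamination_index(code_depths, own_code):
    """Ratio of the sample's own-code depth to the total recovered code depth.

    code_depths : dict mapping each recovered code to its effective depth.
    own_code    : the unique code assigned to the target sample.
    """
    total = sum(code_depths.values())
    if total == 0:
        raise ValueError("no inserted-code reads were recovered")
    return code_depths.get(own_code, 0) / total


if __name__ == "__main__":
    # Illustrative depths only; not data from the source.
    depths = {"ATAT": 995, "GCGC": 5}
    print(contamination_index(depths, "ATAT"))
```

Under this definition a sample with no foreign codes gives a value of 1, and the value falls as foreign codes take a larger share of the recovered depth.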
Contamination correction: the source samples of the inter-batch and intra-batch contamination are traced back, and the false-positive mutations of the affected sample are removed by comparison with them.
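The source does not specify how the comparison is carried out. One possible reading is sketched below under stated assumptions: variant calls are given as dictionaries of allele frequencies, the contaminating fraction is known from the code depths, and a call is treated as contamination-derived when the traced source sample carries it and its frequency in the target is no higher than `slack` times the frequency expected from carry-over. The function name, the `slack` factor and the variant labels are all hypothetical.

```python
def remove_contaminant_calls(target_calls, source_calls, contam_fraction, slack=2.0):
    """Split target variant calls into kept and removed (false-positive) sets.

    target_calls    : {variant_id: allele_frequency} for the affected sample.
    source_calls    : {variant_id: allele_frequency} for the traced source.
    contam_fraction : estimated fraction of target reads from the source.
    slack           : tolerance multiplier (assumed, not from the source).
    """
    kept, removed = {}, {}
    for variant, vaf in target_calls.items():
        source_vaf = source_calls.get(variant)
        explained = (
            source_vaf is not None
            and vaf <= slack * contam_fraction * source_vaf
        )
        (removed if explained else kept)[variant] = vaf
    return kept, removed


if __name__ == "__main__":
    # Placeholder variant labels and frequencies for illustration only.
    target = {"varA": 0.004, "varB": 0.12, "varC": 0.30}
    source = {"varA": 0.45, "varC": 0.50}
    kept, removed = remove_contaminant_calls(target, source, contam_fraction=0.01)
    print(sorted(kept), sorted(removed))
```

Calls shared with the source but far above the expected carry-over frequency are kept, since they are more plausibly genuine variants of the target sample.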
The depths of the inter-batch and intra-batch inserted coding sequences in this example are shown in Table 12, which shows that the method can recover a sufficient number of sample unique identification codes.
TABLE 12
Effective depth of sequencing target 5192×
Effective depth of inter-batch inserted coding sequence 4567×
Effective depth of intra-batch inserted coding sequence 4605×
Example 3
The working principle of the amplicon library construction method (TRB immune repertoire targeted sequencing) performed in this example is shown in FIG. 3, and comprises the following steps:
1. Extracting sample DNA: the tested sample is a blood sample and the DNA extraction kit is the QIAamp DNA blood kit. After extraction, 1 μg of the extracted DNA is quantified and added to a prefabricated tube, in which the bases of the variable coding region of the inter-batch inserted coding nucleic acid sequence are ATAT and the bases of the variable coding regions of the intra-batch inserted coding nucleic acid sequence are T, C and T in order from 5' to 3'. After the DNA is added to the prefabricated tube, the tube is gently vortexed for 30 s;
2. Amplifying and enriching the target and coding sequences by PCR: the target regions are amplified by multiplex PCR using a TRB primer system, and the reaction system is prepared according to Table 13, wherein the intra-batch and inter-batch coding nucleic acid primer pairs are shown as SEQ ID NO.7, SEQ ID NO.8, SEQ ID NO.9 and SEQ ID NO.10, and the reaction conditions are shown in Table 14;
TABLE 13
2×Multiplex PCR Buffer 25μL
Multiplex Polymerase 1μL
TRB primer Mix(10μM) 2μL
Inter-batch coding nucleic acid primer working solution 2μL
Intra-batch coding nucleic acid primer working solution 2μL
Ultra-pure water 2μL
DNA(1000ng) 20μL
TABLE 14
Figure BDA0003912985180000101
3. After purification with AMPure XP beads, library construction PCR is carried out according to the reaction system shown in Table 15 (wherein the P5-F and P7-R sequences are shown as SEQ ID NO.11 (aatgatacggcacccagatctacatacgtacatgcgctcgctcgtcggcgcgcgcgcgtc) and SEQ ID NO.12 (caagcagagagaagaccgacatgaagctcgtctcgtgggctcgg)) and the reaction conditions shown in Table 16;
TABLE 15
5× Reaction Buffer 10μL
DNA Polymerase 0.5μL
10mM dNTP 1μL
P5-F(10μM) 1μL
P7-R(10μM) 1μL
Nuclease-free water 34.5μL
TABLE 16
Figure BDA0003912985180000102
4. Library purification and sequencing
After amplification, the library is purified using AMPure XP beads and quantified using Qubit 3.0, and on-machine sequencing is performed on a NovaSeq 6000 high-throughput sequencer with PE150 read length; the sequencing quantity is 0.3 M reads.
5. Data analysis and depth statistics
The data analysis and depth statistics are performed as described in Example 1. The depths of the inter-batch and intra-batch inserted coding sequences in this example are shown in Table 17, which shows that the amplicon library construction method can recover a sufficient number of sample unique identification codes.
TABLE 17
Inter-batch coding effective reads 63425
Intra-batch coding effective reads 52583
Example 4
This example verifies the contamination identification performance using artificially mixed samples.
To verify the contamination identification capability, artificially simulated contaminated samples were prepared at spike-in ratios of 0.1%, 0.5%, 1%, 5% and 10%. The measured contamination index data are shown in FIG. 5. The molecular coding detection system for monitoring and correcting sequencing contamination can identify contamination at a level as low as 0.1%.
In summary, the invention designs an inserted coding nucleic acid sequence with a specific structure and uses it to label the sample to be tested. Analysis based on the raw high-throughput sequencing data can then rapidly and effectively identify both cross-contamination among samples within a short-term batch and historical environmental contamination arising from long-term batch testing. The inserted coding nucleic acid sequence can be used as a set of standard NGS reagents for quality evaluation of a testing laboratory, and for cleaning, correcting and remedying the results of contaminated samples without retesting.
The applicant states that the present invention is illustrated by the above examples to show the detailed method of the present invention, but the present invention is not limited to the above detailed method, that is, it does not mean that the present invention must rely on the above detailed method to be carried out. It should be understood by those skilled in the art that any modification of the present invention, equivalent substitutions of the raw materials of the product of the present invention, addition of auxiliary components, selection of specific modes, etc., are within the scope and disclosure of the present invention.

Claims (10)

1. A molecular coding detection system for monitoring and correcting sequencing contamination, wherein the molecular coding detection system comprises at least one inserted coding nucleic acid sequence;
the inserted coding nucleic acid sequence comprises a framework sequence region of known sequence and at least one variable coding region;
the variable coding region is a random sequence consisting of any one, or a combination of at least two, of A, T, C and G;
the variable coding regions are randomly distributed within the framework sequence region;
the inserted coding nucleic acid sequence is single-stranded or double-stranded.
2. The molecular coding detection system for monitoring and correcting sequencing contamination according to claim 1, wherein the length of the inserted coding nucleic acid sequence is 100-2000 bp;
the length of the variable coding region is 1-20 bp, and the number of the variable coding regions is 1-4.
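The structural limits of claims 1 and 2 can be illustrated with a small generator. This is a sketch only: the function name, the framework sequence used in the usage lines and the insertion offsets are hypothetical and do not correspond to SEQ ID NO.1 or SEQ ID NO.2.

```python
import random

BASES = "ACGT"


def build_inserted_code(framework, positions, region_len, rng=None):
    """Insert random variable coding regions into a known framework sequence.

    framework  : known framework sequence (hypothetical in this sketch).
    positions  : distinct framework offsets where a variable region is placed.
    region_len : length of each variable region in bp.
    Returns (full_sequence, codes) with codes listed in 5' to 3' order.
    """
    rng = rng or random.Random()
    if not 1 <= len(positions) <= 4:
        raise ValueError("claim 2 allows 1-4 variable coding regions")
    if not 1 <= region_len <= 20:
        raise ValueError("claim 2 allows variable regions of 1-20 bp")
    if len(set(positions)) != len(positions):
        raise ValueError("insertion offsets must be distinct")
    if any(p < 0 or p > len(framework) for p in positions):
        raise ValueError("insertion offset outside the framework")
    codes = [
        "".join(rng.choice(BASES) for _ in range(region_len))
        for _ in positions
    ]
    parts, prev = [], 0
    for pos, code in zip(sorted(positions), codes):
        parts.append(framework[prev:pos])  # framework up to this offset
        parts.append(code)                 # then the variable region
        prev = pos
    parts.append(framework[prev:])
    seq = "".join(parts)
    if not 100 <= len(seq) <= 2000:
        raise ValueError("claim 2 allows a total length of 100-2000 bp")
    return seq, codes


if __name__ == "__main__":
    framework = "ACGT" * 30  # placeholder 120 bp framework
    seq, codes = build_inserted_code(framework, [20, 40, 60, 80], 1, random.Random(0))
    print(len(seq), codes)
```

Passing a seeded `random.Random` makes a design reproducible, which is convenient when the same code has to be looked up again during trace-back.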
3. The molecular coding detection system for monitoring and correcting sequencing contamination of claim 1, wherein, based on the variable coding regions, the inserted coding nucleic acid sequences are classified into inserted coding nucleic acid sequences for identifying inter-batch contamination and inserted coding nucleic acid sequences for identifying intra-batch contamination;
the length of the variable coding region in the inserted coding nucleic acid sequence for identifying inter-batch contamination is different from the length of the variable coding region in the inserted coding nucleic acid sequence for identifying intra-batch contamination.
4. The molecular coding detection system for monitoring and correcting sequencing contamination according to claim 3, wherein the length of the variable coding region in the insertion coding nucleic acid sequence for identifying batch contamination is 1bp, and the number of the variable coding regions is 4;
the length of the variable coding region in the insertion type coding nucleic acid sequence for identifying batch pollution is 5bp, and the number of the variable coding regions is 2;
the inserted coding nucleic acid sequence for identifying inter-batch contamination comprises the sequence shown in SEQ ID NO. 1;
the inserted coding nucleic acid sequence for identifying intra-batch contamination comprises the sequence shown in SEQ ID NO. 2.
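The two designs recited in claim 4 differ greatly in how many distinct codes they can provide. Assuming each position of a variable coding region can independently be A, T, C or G, the number of possible codes is 4 raised to the total number of variable bases, as the following sketch shows (the function name is illustrative).

```python
def code_space(region_len_bp, n_regions):
    """Number of distinct codes for n_regions variable regions of a given length.

    Assumes each variable position can independently be A, T, C or G.
    """
    if region_len_bp < 1 or n_regions < 1:
        raise ValueError("lengths and counts must be positive")
    return 4 ** (region_len_bp * n_regions)


if __name__ == "__main__":
    print(code_space(1, 4))  # four 1 bp regions
    print(code_space(5, 2))  # two 5 bp regions
```

Four 1 bp regions give 256 codes, while two 5 bp regions give 1,048,576.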
5. The molecular coding detection system for monitoring and correcting sequencing contamination of claim 1, further comprising a coded information recovery system;
the encoded information recovery system includes probes or primers complementary to the inserted encoding nucleic acid sequence.
6. The molecular coding detection system for monitoring and correcting sequencing contamination according to claim 5, wherein the length of the probe is 50-200 bp, and the number of the probes is 1-1000;
the length of the primer is 18-25 bp;
the nucleic acid sequence of the probe for the inserted coding nucleic acid sequence for identifying inter-batch contamination is selected from the sequences shown in SEQ ID NO.3 and/or SEQ ID NO. 4;
the nucleic acid sequence of the probe for the inserted coding nucleic acid sequence for identifying intra-batch contamination is selected from the sequences shown in SEQ ID NO.5 and/or SEQ ID NO. 6;
the nucleic acid sequence of the primer for the inserted coding nucleic acid sequence for identifying inter-batch contamination comprises the sequences shown in SEQ ID NO.7 and SEQ ID NO. 8;
the nucleic acid sequence of the primer for the inserted coding nucleic acid sequence for identifying intra-batch contamination comprises the sequences shown in SEQ ID NO.9 and SEQ ID NO. 10.
7. Use of the molecular coding detection system of any one of claims 1 to 6 for monitoring and correcting sequencing contamination in the preparation of a genetic sequencing product.
8. A sequencing kit comprising the molecular coding detection system of any one of claims 1 to 6 for monitoring and correcting sequencing contamination.
9. Use of the molecular coding detection system of any one of claims 1 to 6 for monitoring and correcting sequencing contamination in genetic sequencing.
10. A method of monitoring and correcting sequencing contamination, the method comprising:
mixing the molecular coding detection system for monitoring and correcting sequencing contamination according to any one of claims 1 to 6 with a sample to be tested, constructing and purifying a library, sequencing the purified library, and performing data analysis and contamination correction according to the sequencing result;
the criterion for judging contamination is as follows: all variable coding regions in the inserted coding nucleic acid sequence contain a code other than the sample's unique code, and the number of reads of the suspected contamination sequence exceeds 3;
the contamination correction includes: tracing back the sample to which the contamination points, and removing the false-positive mutations of the sample by comparison.
CN202211328995.5A 2022-10-27 2022-10-27 Molecular coding detection system for monitoring and correcting sequencing pollution and application thereof Active CN115717163B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211328995.5A CN115717163B (en) 2022-10-27 2022-10-27 Molecular coding detection system for monitoring and correcting sequencing pollution and application thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211328995.5A CN115717163B (en) 2022-10-27 2022-10-27 Molecular coding detection system for monitoring and correcting sequencing pollution and application thereof

Publications (2)

Publication Number Publication Date
CN115717163A true CN115717163A (en) 2023-02-28
CN115717163B CN115717163B (en) 2023-10-27

Family

ID=85254369

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211328995.5A Active CN115717163B (en) 2022-10-27 2022-10-27 Molecular coding detection system for monitoring and correcting sequencing pollution and application thereof

Country Status (1)

Country Link
CN (1) CN115717163B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115896255A (en) * 2023-03-08 2023-04-04 中国环境科学研究院 Tracing method using DNA identification code

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014128453A1 (en) * 2013-02-19 2014-08-28 Genome Research Limited Nucleic acid marker molecules for identifying and detecting cross contamination of nucleic acid samples
CN109628568A (en) * 2019-01-10 2019-04-16 上海境象生物科技有限公司 A kind of internal standard and its application polluted for differentiating and calibrating high-flux sequence
JP2019131539A (en) * 2018-01-31 2019-08-08 公益財団法人かずさDna研究所 Detection method of cross-contamination between samples in next-generation sequencing
WO2019212138A1 (en) * 2018-05-03 2019-11-07 주식회사 셀레믹스 Internal control substance for discovering cross-contamination between samples for next generation sequencing
CN111944807A (en) * 2020-08-26 2020-11-17 天津诺禾医学检验所有限公司 Human sequencing sample tracking marker, and monitoring method and monitoring device for human sequencing sample cross contamination
CN113897354A (en) * 2021-08-27 2022-01-07 海宁麦凯医学检验有限公司 Internal standard for sequencing correction and application thereof

Also Published As

Publication number Publication date
CN115717163B (en) 2023-10-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant