CN114150047B

CN114150047B - Method for evaluating base damage, mismatch and variation in sample DNA by first-generation sequencing

Info

Publication number: CN114150047B
Application number: CN202111620536.XA
Authority: CN
Inventors: 罗俊峰; 王一帆; 徐雪; 陈曦; 宋萍
Original assignee: Carrier Gene Technology Suzhou Co ltd
Current assignee: Carrier Gene Technology Suzhou Co ltd
Priority date: 2020-12-29
Filing date: 2021-12-27
Publication date: 2022-11-08
Anticipated expiration: 2041-12-27
Also published as: CN113005188A; CN114150047A

Abstract

The invention relates to a method for evaluating base damage, mismatch and variation in sample DNA by first-generation sequencing. During PCR amplification, the method on the one hand uses molecular tags to label the original DNA molecules that carry damage or mismatches, and on the other hand enriches and amplifies the sampling region, raising damage or mismatch information from about 0.1% to 10-99%. The proportions of base damage, mismatch and variation in the sample DNA are then evaluated separately by an evaluation method based on the enrichment amplification effect and by an evaluation method based on the number of molecular tag types, and the final proportion is judged from the credible results of the two methods. The method uses economical and rapid Sanger sequencing to accurately confirm that the damage or mismatch really exists, helps optimize sample DNA extraction and storage methods, and helps evaluate the quality of the sample DNA.

Description

Method for evaluating base damage, mismatch and variation in sample DNA by first-generation sequencing

Technical Field

The invention belongs to the technical field of gene detection, and particularly relates to a method for evaluating base damage, mismatching and variation in sample DNA by using first-generation sequencing.

Background

With the development of technology, the field of DNA detection, and cancer detection in particular, pays more and more attention to low-proportion mutation information. For example, somatic mutation information at 0.1% is currently one of the important indexes in the liquid biopsy field. Over time the 0.1% index will no longer be sufficient, and once the level reaches 0.01%, the problem of how to distinguish mutation from mismatch and base damage has to be faced.

First, the two concepts of mutation and mismatch need to be clarified. At the level of a single-copy cell, such as a single sperm or ovum, which is haploid, the concept of mutation is difficult to apply. Mutation is more conventionally a population (collective) concept. For example, suppose the base at Chr1:2,000 in the human genome hg19 is C. If 1 sperm cell out of 1000 carries a C > T mutation while the other cells retain wild-type C, we say that a 0.1% C > T mutation has occurred at this position. Within the T-containing sperm cell itself, however, the Chr1:2,000 position is a normal T:A pair, and no mutation has occurred. The mismatch described in this patent, by contrast, means that Chr1:2,000 is not a normal C:G pair but a T:G pair, which does not satisfy the base-pairing principle. This is a pairing error within the double strand, called a base mismatch. If the mismatch is not repaired by the repair system, then after the DNA is replicated once, one of the progeny DNA molecules carries a mutation.

Base damage and base mismatch can be innate or acquired. Innate base mismatch means that, during cell division and proliferation in the organism, errors of the in vivo DNA replication system pair G with a non-C base, and the error is not repaired by the in vivo repair system and is therefore retained. Acquired base damage refers to damage caused by inappropriate or limited technologies, methods and conditions during DNA extraction. For example, cytosine (C) undergoes oxidative damage under oxidizing conditions and is deaminated; the deaminated cytosine is then read as uracil during replication and pairs with A. Similarly, G readily forms 8-oxoG under oxidizing conditions, and 8-oxoG readily pairs with A during replication. Generally, when such damaged bases and mismatches are stably inherited in an organism, mutations are formed; mutations at key positions of key genes, once accumulated to a certain extent, may cause serious diseases such as cancer and may cause drug resistance. Acquired base damage can clearly disturb indexes at the one-in-ten-thousand or one-in-a-thousand level. Evaluating the damage and mismatch of deoxyribonucleotides in a sample is therefore very important, especially at key mutation hot spots, where C > T and G > A changes caused by base damage can produce false-positive interference.

Because damage and mismatch occur with very low probability and at very low proportion, currently known to be about one in ten thousand, they lie below the sensitivity of conventional technical platforms. For example, the error rate of a second-generation sequencing platform is about one in a thousand, so its detection sensitivity is about 1%; the best detection sensitivity of some qPCR-based techniques is 0.2%. Technically, therefore, detecting low-proportion variant information cannot do without molecular tagging. Molecular tagging, however, depends heavily on high-depth sequencing and has a long turnaround time, which hinders the popularization of such tests.

Disclosure of Invention

In order to solve the technical problems, the invention discloses a method for evaluating low-proportion base damage or base mismatch in a DNA sample, and simultaneously can evaluate the proportion of low-proportion base mismatch and base change naturally existing in an organism, thereby being beneficial to optimizing a sample DNA extraction technology and a storage method and helping to evaluate the quality of the sample DNA.

The first objective of the invention is to provide a method for evaluating base damage, mismatch and variation in sample DNA by first-generation sequencing, which comprises the following steps:

S1, adding to a DNA sample a nucleic acid composition capable of inhibiting non-target regions (a non-target region is a region without base damage, mismatch or variation; correspondingly, a target region is one with base damage, mismatch or variation) and an amplification primer carrying an error-correctable molecular tag library, amplifying the DNA sample by PCR, and sequencing the PCR amplification product with a first-generation sequencing technology;

wherein the nucleic acid composition that inhibits non-target regions in the DNA sample is designed based on the sampling region in the DNA sample;

S2, obtaining the sequencing data of the PCR amplification product of step S1, and analyzing the sequencing data with an evaluation method based on enrichment amplification and an evaluation method based on the number of molecular tag types, to obtain the proportions of base damage, mismatch and variation in the sample DNA;

S3, when the evaluation method based on the enrichment amplification effect and the evaluation method based on the number of molecular tag types both give credible results, adopting the result of the evaluation method based on the number of molecular tag types as the proportion of base damage, mismatch and variation in the sample DNA.

Among them, the design method of nucleic acid composition capable of inhibiting non-target region in DNA sample is disclosed in Chinese patent No. 2020115796048.

The design method of the amplification primer of the error-correctable molecular label library is disclosed in Chinese patent with application number 2020115404605.

Further, the evaluation method based on enrichment amplification is analyzed by the following steps:

S01, representing the enrichment amplification effect of each sampling region by an Efold value, calculated as:

Efold = (VRF/VAF) × [(1 - VAF)/(1 - VRF)],

wherein VAF is the initial proportion of variant information in the sample, and VRF is the proportion of variant information in the detection result;

S02, obtaining the Efold value of each sampling region by testing a standard, calculating the VRF value from the peak-height proportions of the different bases in the sequencing data of the PCR amplification product, and, when 5% ≤ VRF ≤ 95%, calculating the VAF value by:

VAF = VRF/(Efold - Efold × VRF + VRF);

when VRF does not satisfy 5% ≤ VRF ≤ 95%, the result of the evaluation method based on enrichment amplification is not credible.
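The VAF formula of S02 follows from solving the Efold definition of S01 for VAF. A short derivation is sketched here in LaTeX; reading Efold as a ratio of odds, VRF/(1 - VRF) over VAF/(1 - VAF), is an interpretation added for illustration and is not wording from the patent:

```latex
\begin{aligned}
E_{fold} &= \frac{VRF}{VAF}\cdot\frac{1-VAF}{1-VRF}
\;\Longrightarrow\;
\frac{VAF}{1-VAF} = \frac{VRF}{E_{fold}\,(1-VRF)} \\[4pt]
VAF &= \frac{VRF}{E_{fold}\,(1-VRF) + VRF}
     = \frac{VRF}{E_{fold} - E_{fold}\times VRF + VRF}
\end{aligned}
```

The last expression is term by term the formula given in S02.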

For example, suppose the proportion of variant information in a standard sample is known to be 0.1%, so VAF = 0.1%. After enrichment amplification, the PCR product is sequenced and found to contain 50% variant information, so VRF = 50%, and:

Efold = (50%/0.1%) × [(1 - 0.1%)/(1 - 50%)] = 999

If a PCR reaction has no enrichment amplification, i.e. VAF = 0.1% and VRF is also 0.1%, then:

Efold = (0.1%/0.1%) × [(1 - 0.1%)/(1 - 0.1%)] = 1

It can be seen that when Efold = 1, the reaction system has no enrichment amplification effect on the variant information. The table below lists the Efold calculated from different VAF and VRF values; for a particular reaction system, Efold reflects an inherent characteristic of that system.

VAF and VRF	Efold
VAF = 0.1%, VRF = 0.1%	1
VAF = 0.1%, VRF = 50%	999
VAF = 0.1%, VRF = 90%	8991
VAF = 1%, VRF = 50%	99
VAF = 1%, VRF = 90%	891
VAF = 1%, VRF = 99%	9801
VAF = 5%, VRF = 99%	1881

As can be seen from the table, when VRF approaches 100%, VAF and Efold are no longer proportional. Compare, for example, VAF = 1%, VRF = 99% with VAF = 5%, VRF = 99%: the enrichment of the reaction is already saturated at VAF = 1%, so an Efold value derived at VAF = 5% is inaccurate. We therefore stipulate that the Efold value of a particular reaction must be obtained within 5% ≤ VRF ≤ 95%.
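The Efold values in the table can be recomputed directly from the S01 formula. The following is a minimal sketch, not the patent's software; the function names `efold` and `vrf_in_linear_range` are hypothetical, and VAF and VRF are passed as fractions rather than percentages.

```python
def efold(vaf: float, vrf: float) -> float:
    """Efold = (VRF/VAF) * ((1 - VAF)/(1 - VRF)); both inputs are fractions."""
    if not (0.0 < vaf < 1.0 and 0.0 < vrf < 1.0):
        raise ValueError("VAF and VRF must lie strictly between 0 and 1")
    return (vrf / vaf) * ((1.0 - vaf) / (1.0 - vrf))


def vrf_in_linear_range(vrf: float) -> bool:
    """The text requires 5% <= VRF <= 95% when an Efold value is calibrated."""
    return 0.05 <= vrf <= 0.95


# (VAF, VRF) pairs taken from the table, written as fractions.
pairs = [(0.001, 0.001), (0.001, 0.5), (0.001, 0.9),
         (0.01, 0.5), (0.01, 0.9), (0.01, 0.99), (0.05, 0.99)]
for vaf, vrf in pairs:
    print(f"VAF={vaf:.1%}  VRF={vrf:.1%}  Efold={efold(vaf, vrf):.0f}  "
          f"usable for calibration={vrf_in_linear_range(vrf)}")
```

The two rows with VRF = 99% are flagged as unusable, which is the saturation case the paragraph describes.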

Step S02 is illustrated by way of example in the following table: when Efold is known, the VAF in the target sample to be tested can be calculated from different VRF values.

It should be noted that if the variant information appears as a homozygous peak in the Sanger signal, the signal may be saturated, i.e. VRF is close to 100%, and there is very likely no direct relationship between VRF and VAF. For example, the Sanger sequencing result may show VRF = 99% when VAF = 5% and also VRF = 99% when VAF = 10%, so that VAF = 5% and VAF = 10% cannot be distinguished. VAF = VRF/(Efold - Efold × VRF + VRF) holds reasonably only in the linear range 5% ≤ VRF ≤ 95%. When VRF > 95% or VRF < 5%, the base damage and/or base mismatch proportion of the target sample is outside the detection range of the method disclosed in this patent.
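The back-calculation of S02, including the rejection of out-of-range VRF values, can be sketched as below. The function name `vaf_from_vrf` and the choice of returning `None` for a non-credible result are assumptions made for illustration; the Efold of 999 is the value from the worked example above.

```python
from typing import Optional


def vaf_from_vrf(vrf: float, efold: float) -> Optional[float]:
    """VAF = VRF / (Efold - Efold*VRF + VRF); None when VRF is outside 5%..95%."""
    if not (0.05 <= vrf <= 0.95):
        return None  # outside the linear range: the result is not credible
    return vrf / (efold - efold * vrf + vrf)


# Efold = 999 corresponds to the worked example (VAF 0.1% enriched to VRF 50%).
for vrf in (0.03, 0.10, 0.50, 0.90, 0.99):
    vaf = vaf_from_vrf(vrf, 999.0)
    print(f"VRF={vrf:.0%} ->", "not credible" if vaf is None else f"VAF={vaf:.4%}")
```

With VRF = 50% the function returns 0.1%, recovering the input of the worked example.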

Further, the evaluation method based on the number of the molecular tag species is analyzed by the following steps:

S001, outputting the number of molecular tag sequence types, UMInum, by a DNA sequence identification method, based on the sequencing data of the PCR amplification product and the known molecular tag sequences;

S002, when UMInum ≤ 10, the proportion Pdm% of base damage, mismatch and variation is calculated as:

Pdm% = UMInum/(Ng × 1000 × 2/6.67) × 100%,

wherein Ng is the mass of DNA input into the reaction, in ng;

when UMInum > 10, the result of the evaluation method based on the number of molecular tag types is not credible.

For example, when Ng = 10 ng and UMInum = 5,

the proportion of base damage and mismatch is Pdm% = 5/2998.5 × 100% = 0.17%.
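The S002 calculation is simple enough to check by hand, but a small sketch makes the credibility rule explicit. The function name `pdm_percent` is hypothetical. The comment reading the denominator as the number of template copies in the input (ng converted to pg, with 6.67 as the pg mass used per genome pair) is an interpretation; the patent only gives the numeric formula.

```python
from typing import Optional

FORMULA_CONSTANT = 6.67  # constant taken verbatim from the Pdm% formula


def pdm_percent(umi_num: int, input_ng: float) -> Optional[float]:
    """Pdm% = UMInum / (Ng * 1000 * 2 / 6.67) * 100; None when UMInum > 10."""
    if umi_num > 10:
        return None  # tag-count method is not credible above 10 tag types
    copies = input_ng * 1000 * 2 / FORMULA_CONSTANT  # interpreted as template copies
    return umi_num / copies * 100


print(pdm_percent(5, 10))    # worked example in the text: about 0.17 (%)
print(pdm_percent(12, 10))   # None: more than 10 tag types
```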

Further, before calculating the VRF value or before outputting the parameter UMInum, the method comprises a step of identifying variant information:

S0001, obtaining the baseline value Noise_c of the Sanger sequencing signal, comprising the following steps:

a) Reading the Sanger AB1 file to obtain the signal value Fluor_cs of each signal sample s of each fluorescence channel c in the file, and the signal sample number S_k of each base k;

Fluor_ck is the maximum of fluorescence channel c within i samples of the signal sample number S_k of base k, calculated as:

Fluor_ck = max{Fluor_cs : s = S_k - i, ..., S_k + i}

wherein i can be an integer within 0 to 5;

b) Each fluorescence channel c thus has one maximum Fluor_ck at each of the N base positions. From these, the maxima of the M bases identified in the first-generation sequencing as corresponding to fluorescence channel c (as given in the Sanger AB1 file) are removed, giving a new set of maxima;

c) Within the new set, the values whose difference from the median exceeds n times the average absolute deviation are removed, wherein n can be 2 to 5, and the average of the remaining maxima, Noise_c, is taken as the background noise baseline of fluorescence channel c;

d) Subtracting the background noise value of the corresponding fluorescence channel from every signal sample of every fluorescence channel to obtain FluorNN_cs (No Noise):

FluorNN_cs = Fluor_cs - Noise_c
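Steps a) to d) can be sketched for a single fluorescence channel as follows. This is an illustrative sketch using only the Python standard library and a synthetic trace; it does not parse an AB1 file. The function names, the toy data, and the choice of measuring the average absolute deviation about the median are assumptions, since the source does not spell out that detail.

```python
from statistics import mean, median


def channel_maxima(trace, sample_pos, i=2):
    """Fluor_ck: maximum of one channel's trace within +/- i samples of each S_k."""
    return [max(trace[max(0, s - i): s + i + 1]) for s in sample_pos]


def noise_baseline(trace, sample_pos, called, channel_base, i=2, n=3):
    """Noise_c for one channel, following steps a) to c)."""
    maxima = channel_maxima(trace, sample_pos, i)
    # b) drop the positions whose called base belongs to this channel
    background = [m for m, b in zip(maxima, called) if b != channel_base]
    # c) drop values further than n * (average absolute deviation) from the median
    med = median(background)
    aad = mean([abs(v - med) for v in background])  # assumption: deviation about the median
    kept = [v for v in background if abs(v - med) <= n * aad]
    return mean(kept)


# Toy trace for the "A" channel: flat background of 10 with a few bumps.
trace = [10] * 70
trace[5], trace[15], trace[35], trace[65] = 900, 12, 200, 950
sample_pos = [5, 15, 25, 35, 45, 55, 65]   # S_k for 7 bases
called = "ACGTCGA"                          # base calls as stored in the AB1 file

noise_a = noise_baseline(trace, sample_pos, called, "A")
fluor_nn = [v - noise_a for v in trace]     # d) FluorNN_cs = Fluor_cs - Noise_c
print(noise_a, fluor_nn[5])
```

In the toy data the bump of 200 at a non-A position is treated as an outlier and removed before averaging, so it does not inflate the baseline.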

S0002, searching for regional signal peaks according to the signal change of each fluorescence channel:

traversing the peaks of the fluorescence channels: when only one channel has a peak within the region of one base width W_k, the region contains one base, whose type is the base type corresponding to that channel; when several channels have peaks within the region of one base width W_k, the region may contain several bases. The base type corresponding to the channel with the highest peak is the main base of the region. For each other channel, the ratio of its peak to the peak of the main base channel is used as the criterion: when the ratio is higher than a threshold, the base type corresponding to that channel is an alternative base type of the region; otherwise it is not. This yields a candidate base sequence A consisting of main bases and alternative bases, with the alternative base types labeled at the positions where alternative bases exist;

wherein the region of one base width W_k is defined as follows: if the Sanger AB1 file contains N bases, the signal sample number of base k is S_k, that of the previous base is S_(k-1), and that of the next base is S_(k+1), then the start position WS_k of the base width region of base k is obtained by the following formula:

the end position WE_k of the base width region of base k is obtained by the following formula:

wherein the peak within the region of one base width W_k is defined as follows: for fluorescence channel c, the find_peaks algorithm of SciPy is applied to the background-subtracted signal values FluorNN_cs in the region s ∈ (WS_k, WE_k) to compute the peaks of the region; if no peak exists, fluorescence channel c has no peak within the region of base width W_k; if one or more peaks exist, the peak with the largest signal value is taken as the peak of fluorescence channel c within the region of base width W_k;
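Once each channel's peak within a base width region is known, the main and alternative base decision of S0002 reduces to a ratio test. The sketch below assumes the per-channel peak heights have already been found (for example with SciPy's `find_peaks`) and are passed in as a dictionary. The function name `call_region` and the threshold value 0.1 are hypothetical; the source does not state the threshold.

```python
def call_region(peaks, threshold=0.1):
    """Return (main_base, alternative_bases) for one base-width region.

    peaks maps a base letter to the peak height of its fluorescence channel;
    channels without a peak in the region are simply absent from the mapping.
    """
    if not peaks:
        return None, []
    main = max(peaks, key=peaks.get)          # channel with the highest peak
    alts = sorted(
        base for base, height in peaks.items()
        if base != main and height / peaks[main] > threshold
    )
    return main, alts


print(call_region({"G": 500.0}))                          # single peak
print(call_region({"C": 800.0, "T": 200.0, "A": 20.0}))   # T passes, A does not
```

A list of such (main, alternatives) pairs, one per base position, corresponds to the candidate base sequence A described in the text.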

S0003, obtaining an IUPAC-encoded candidate base sequence B from the first-generation sequencing result:

the candidate base sequence B represents the full-length sequence of the PCR product and comprises a candidate base sequence B1, a candidate base sequence B2 and a candidate base sequence B3, wherein the candidate base sequence B1 is the sequence at the position of the molecular tag library, the candidate base sequence B2 is the sequence of the sample DNA sampling region, and the candidate base sequence B3 is the remaining sequence outside the molecular tag library position and the sample DNA sampling region; the main base and the alternative bases in the candidate base sequence A are combined using the base coding rule recommended by IUPAC (International Union of Pure and Applied Chemistry) to obtain the IUPAC-encoded candidate base sequence B; for example:

IUPAC coding table:
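The IUPAC base coding rule referred to in S0003 is a fixed lookup from a set of bases to a single letter. A minimal sketch of that lookup, and of merging the main and alternative bases of candidate base sequence A into sequence B, is given below. The codes themselves are the standard IUPAC nucleotide ambiguity codes; the function names and the representation of sequence A as (main, alternatives) pairs are assumptions for illustration.

```python
# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases covered.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def iupac_code(main, alts=()):
    """Combine one main base and its alternative bases into a single IUPAC letter."""
    return IUPAC[frozenset([main, *alts])]


def encode(sequence_a):
    """sequence_a: list of (main_base, [alternative_bases]) pairs, one per position."""
    return "".join(iupac_code(main, alts) for main, alts in sequence_a)


print(encode([("C", []), ("A", ["C"]), ("T", ["A", "C"])]))
```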

S0004, identifying variant information in the first-generation sequencing result:

1) Identifying the information in which the candidate base sequence B differs from the known reference sequence R (i.e. the reference genome sequence, for example hg19) by an alignment information calculation method;

the alignment information calculation method aligns the IUPAC-encoded candidate base sequence B against the known reference sequence R using the sequence alignment algorithm Gotoh's algorithm with the NUC.4.4 IUPAC-code scoring table, and selects the highest-scoring result as the alignment of the candidate base sequence B with the known reference sequence R, giving the alignment information of the two sequences;

2) Using the alignment information calculation method to obtain the alignment information of the candidate base sequences B2 and B3 with the known reference sequence R and aligning the sequences; scanning the aligned candidate base sequences B2 and B3 against the known reference sequence R to obtain the bases in the IUPAC sequence that differ from the known reference sequence R, which constitute the variant information;

wherein Base_k is defined as a given base position, the reference base Base_kr is the base information in the reference sequence, and a base different from the known reference sequence R is Base_km; the base Base_k at a specific position of the candidate base sequences B2 and B3 consists of the reference base Base_kr and the Base_km representing damage, mismatch or variation information.

For example, if a base position Base_k is "M" (corresponding to "A" or "C") in the IUPAC sequence and "A" in the reference sequence, the position is considered to carry variant information of base type "C". The reference base Base_kr is the base information in the reference sequence, "A" in this example, and the base different from the reference sequence R is called Base_km, "C" in this example. Base_km thus carries the information of base damage, mismatch, change or variation. Because Base_km refers to one specific base type, the same position may have several Base_km.
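For one already-aligned position, extracting Base_km from an IUPAC code and the reference base is a set difference. The sketch below covers only that per-position step; the alignment itself (Gotoh's algorithm with the NUC.4.4 table) is not reproduced. The function name `variant_bases` is hypothetical.

```python
# Standard IUPAC nucleotide codes: letter -> bases covered.
IUPAC_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def variant_bases(iupac_code, ref_base):
    """Return the Base_km list: bases seen at the position that differ from Base_kr."""
    observed = set(IUPAC_BASES[iupac_code.upper()])
    return sorted(observed - {ref_base.upper()})


print(variant_bases("M", "A"))   # the example in the text: one Base_km, "C"
print(variant_bases("H", "A"))   # a position with two Base_km
print(variant_bases("A", "A"))   # no variant information
```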

Further, the VRF value is calculated by the following formula:

VRF = Peak(Base_km) / Σ Peak(Base_k),

wherein Peak(Base_km) is the peak fluorescence signal of Base_km, and Σ Peak(Base_k) is the sum of the peak fluorescence signals of the bases at Base_k (including the main base and the alternative bases).
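The VRF of one variant base is thus its share of the total peak signal at that position. A minimal sketch, assuming the peak heights of the main and alternative bases at the position are available as a dictionary (the function name `vrf` and the sample numbers are illustrative only):

```python
def vrf(peaks, variant_base):
    """VRF = Peak(Base_km) / sum of the peaks of all bases called at Base_k.

    peaks holds only the main base and the alternative bases of the position.
    """
    total = sum(peaks.values())
    if total <= 0:
        raise ValueError("peak signals must sum to a positive value")
    return peaks[variant_base] / total


# Reference base A with peak 600, variant base C with peak 200.
print(vrf({"A": 600.0, "C": 200.0}, "C"))
```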

Further, the number of molecular tag sequence types, UMInum, is obtained as follows:

taking the amplification primer adjacent to B1 as the known reference sequence, applying the alignment information calculation method to the candidate base sequence B to obtain the alignment information of the candidate base sequence B with the amplification primer, and aligning the two sequences; obtaining the candidate base sequence B1 from the aligned sequence according to the known length of the B1 sequence;

extracting the N^- information at each position of the candidate base sequence B1 as a characteristic value, the N^- information being the base types not covered by Base_k. For example, if position 1 of the candidate base sequence B1 is W (A/T), the N^- information of position 1 is S (G/C); if position 2 of the candidate base sequence B1 is H (A/T/C), the N^- information of position 2 is G. The set of N^- information of the candidate base sequence B1 is defined as Index_B, and each known sequence in the tag sequence library is defined as an Index_l. Each Index_l is checked position by position against the Index_B information and excluded accordingly; the number of molecular tags remaining in the tag sequence library is UMInum.
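The exclusion logic can be sketched as a filter over the known tag library. The sketch below reads the rule as: a library tag is excluded if, at any position, it carries a base that appears in the N^- information of that position. That reading, the function names, and the toy two-base library are assumptions for illustration (the tags in the patent are 28 nt long); the observed codes "W" and "H" are the two positions used as examples in the text.

```python
# Standard IUPAC nucleotide codes: letter -> bases covered.
IUPAC_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def n_minus(code):
    """N^- information: the bases NOT covered by the IUPAC code at one position."""
    return set("ACGT") - set(IUPAC_BASES[code])


def remaining_tags(observed_b1, tag_library):
    """Keep the library tags that never use an excluded base at any position."""
    index_b = [n_minus(code) for code in observed_b1]
    return [
        tag for tag in tag_library
        if len(tag) == len(index_b)
        and all(base not in excluded for base, excluded in zip(tag.upper(), index_b))
    ]


library = ["AT", "TC", "GA", "AG", "CC", "TA"]   # toy library of 2-nt tags
survivors = remaining_tags("WH", library)         # B1 observed as W then H
print(survivors, "UMInum =", len(survivors))
```

Strictly, such a filter gives the tags that are compatible with the observed signal; Examples 2 and 3 show cases where this set coincides with the tags actually present.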

It is a second object of the present invention to provide an analysis device for evaluating base damage, mismatch and variation in sample DNA by first-generation sequencing, the analysis device comprising:

a data extraction module, used for acquiring base sequence information and fluorescence signal data from a first-generation sequencing AB1 file;

a preprocessing module, used for removing the background noise of the fluorescence signal and generating a candidate base sequence;

an analysis module, used for analyzing and acquiring the variant information in the first-generation sequencing result;

and a tag processing module, used for analyzing and calculating the number of molecular tag types, UMInum, in the PCR product.

It is a third object of the invention to provide a server, comprising one or more processors and a memory,

the memory is used for storing a computer program;

the processor is configured to execute the computer program to implement the method for evaluating base damage, mismatch and variation in sample DNA by first-generation sequencing.

It is a fourth object of the present invention to provide a computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the method for evaluating base damage, mismatch and variation in sample DNA by first-generation sequencing.

By means of the scheme, the invention at least has the following advantages:

During PCR amplification, on the one hand molecular tags are used to label the original DNA molecules that carry damage or mismatches, and on the other hand the sampling region is enriched and amplified, raising damage or mismatch information of less than 0.1% to 10-99%. The proportions of base damage, mismatch and variation in the sample DNA are then evaluated separately by an evaluation method based on the enrichment amplification effect and by an evaluation method based on the number of molecular tag types, and the final proportion is judged from the credible results of the two methods. The method uses economical and rapid Sanger sequencing to accurately confirm that the damage or mismatch really exists, helps optimize sample DNA extraction and storage methods, and helps evaluate the quality of the sample DNA.

The foregoing is a summary of the present invention, and the following is a detailed description of the preferred embodiments of the present invention, so that the technical solutions of the present invention can be more clearly understood.

Drawings

FIG. 1 shows the Sanger evaluation results of base damage in a peripheral blood DNA sample;

FIG. 2 is a position-by-position table of the 100 molecular tag sequences;

FIG. 3 is a schematic diagram of obtaining the N^- information;

FIG. 4 is a schematic diagram of excluding molecular tags in Excel using the base order and the N^- information in Example 2;

FIG. 5 is a schematic diagram of using the N^- information for exclusion and verifying the molecular tags present in Example 2;

FIGS. 6a and 6b are position-by-position tables of the 100 molecular tag sequences in Example 3;

FIG. 7 shows the Sanger sequencing results in Example 3;

FIG. 8 is a schematic diagram of excluding molecular tags in Excel using the base order and the N^- information in Example 3;

FIG. 9 is a schematic diagram of using the N^- information for exclusion and verifying the molecular tags present in Example 3.

Detailed Description

Example 1: first-generation sequencing to evaluate the extent of base damage in sample DNA

1. Sampling regions were set at 4 positions in the human genome and primer pair combinations for PCR were designed as follows:

Name	Seq (5'-3') (SEQ ID NO. 1-13)	deltaG (50 mM, 25 ℃)
DmDe1-FP	CCCTGACAACATAGTTGGAATCA	-27.4
DmDe1-RP	ACTCCAGGATAATACACATCACAGT	-29.2
DmDe1-BL	TGGAATCACTCATGATATCTCGAGCCAT	-34.0
DmDe2-FP	AGCAGTCTCTGCCTCGC	-24.5
DmDe2-RP	AGAAGATTCGGCAGAACTAAGCA	-28.5
DmDe2-BL	CCTCGCCAAGCGGCTCATGTTAATATT	-35.0
DmDe4-FP	AGAAGATGTGGAAAAGTCCCAATG	-28.4
DmDe4-RP	GTGCCCAGGTCAGTGGAT	-24.7
DmDe4-BL	TCCCAATGGAACTATCCGGAACATCCA	-34.1
DmDe6-FP	TCCTTTAACCACATAATTAGAATCATTCTTGA	-33.9
DmDe6-RP	AGTTAGTTTTCACTCTTTACAAGTTAAAATGA	-33.5
DmDe6-BL	ATCATTCTTGATGTCTCTGGCTAGACCAAA	-35.6
UNITag	tgtaaaacgacggccagtaca

Note: the RP sequences in the table are only the specific sequence parts. During preparation, the UNITag sequence is added to build a 5'-tgtaaaacgacggccagtaca-(N28)-RP structure, wherein N28 is the 100 UMI sequences of Example 2.

2. A positive mutant plasmid template was custom-synthesized according to the hg19 reference sequence information. The sequences near the sampling regions in the positive mutant template were as follows:

Name	Seq (5'-3') (SEQ ID NO. 14-17)
Plasmid01	TGGAATCACTCATGATA--TCGAGCCA
Plasmid02	CCTCGCCAAGC--CTCATGTTA
Plasmid04	TCCCAATGGAACTAT--GGAACATCC
Plasmid06	ATCATTCTTGATGTCTCTG--TAGACCAAA

wherein "- -" means a deletion of 2 bases.

3. Preparing a 0.1% variant standard. The standard was prepared as follows: the templates were quantified with Qubit, the theoretical number of molecules was calculated from the molecular mass of the plasmid template, and a 0.1% variant standard was prepared stepwise; it was then corrected and adjusted by ddPCR to obtain a 0.1% standard with a smaller relative error, and subsequently corrected further using the second-generation sequencing results.

4. Efold values for each sampled region were obtained by NGS sequencing

a) Preparation of the 5× Oligo Mix with BL system

Components	Primer concentration (μM)	Volume (μL)
FP	100	20
RP	100	20
BL	100	100
0.1× TE		Make up to 1000 μL
Total		1000 μL

b) Preparation of the 5× Oligo Mix w/o BL system (used as a control when evaluating samples; the amount in the PCR system is the same as for the with BL group)

Components	Primer concentration (μM)	Volume (μL)
FP	100	20
RP	100	20
0.1× TE		Make up to 1000 μL
Total		1000 μL

c) Preparation of the PCR system

Reagent composition	Volume (μL)
5× Oligo Mix with BL	6 μL
2× DNA polymerase Master Mix	15 μL
0.1% standard	300 ng
Nuclease Free Water	Make up to 30 μL

d) UMI-PCR amplification procedure

After the PCR was completed, 1 unit of exonuclease I was added to each reaction, which was incubated at 37 ℃ for 30 minutes and then inactivated at 80 ℃ for 30 minutes. A further 2 μL of 10 μM FP and 2 μL of 10 μM UNITag were added for the subsequent PCR amplification procedure.

e) Subsequent PCR Process

5. The PCR products were built into a library with a commercial second-generation sequencing library construction kit and sequenced on an Illumina platform. The number of molecular tag types among the reads carrying the 2 bp deletion variant and the number of molecular tag types among the reads carrying wild-type information were analyzed; the ratio of the two is the corrected VAF. The number of reads carrying the variant information and the number of reads carrying wild-type information were analyzed; the ratio of the two is the VRF. The Efold value of each sampling position was then calculated.

Sampling region	VAF before NGS correction	VAF after NGS correction	VRF	Efold
DmDe1	0.1%	0.25%	57.2%	533.2
DmDe2	0.1%	0.31%	83.5%	1627.4
DmDe4	0.1%	0.15%	48.4%	624.4
DmDe6	0.1%	0.23%	61.0%	678.5

6. A peripheral blood DNA sample to be evaluated was selected, with DNA input = 30 ng, and run in both a with BL group and a w/o BL group while ensuring no contamination. Comparing the two groups shows the enrichment amplification effect; part of the results are shown in FIG. 1, where the w/o BL group displays wild-type information, meaning no enrichment amplification.

7. From the Efold obtained from the NGS results and the VRF obtained from the Sanger analysis, the VAF in the original sample is calculated according to the formula VAF = VRF/(Efold - Efold × VRF + VRF):

Sampling region	Base position information	Efold	VRF	VAF
DmDe1	9G>A	533.2	73%	0.50%
DmDe2	11C>T	1627.4	9%	0.01%
DmDe2	12C>T	1627.4	11%	0.01%
DmDe2	13G>A	1627.4	47%	0.05%
DmDe2	14C>T	1627.4	29%	0.03%
DmDe4	6T>C	624.4	8%	0.01%
DmDe4	10G>A	624.4	31%	0.07%
DmDe6	10G>A	678.5	50%	0.15%
DmDe6	12G>A	678.5	35%	0.08%

Since several base positions within a sampling region may carry damage or mismatch, we report the final degree of damage or mismatch as a range. For DmDe2, for example, we consider the degree of damage or mismatch to be 0.01% to 0.05%; given that a 30 ng input contains about 9000 copies, the number of original molecules with the detected damage or mismatch is around 1 to 5. C > T and G > A were also the most frequent changes in a large number of tests, which agrees with literature reports that deaminated cytosine and oxidized G are prone to mispairing.
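The "about 9000 copies" and "around 1 to 5 molecules" figures can be reproduced with the same conversion that appears in the Pdm% formula (Ng × 1000 × 2/6.67). The sketch below is only an arithmetic check under that assumption; the function name `input_copies` is hypothetical.

```python
def input_copies(input_ng: float) -> float:
    """Template copies in the input, using the conversion from the Pdm% formula."""
    return input_ng * 1000 * 2 / 6.67


copies = input_copies(30)            # 30 ng peripheral blood DNA
low = copies * 0.01 / 100            # molecules expected at VAF = 0.01%
high = copies * 0.05 / 100           # molecules expected at VAF = 0.05%
print(f"{copies:.0f} copies, about {low:.1f} to {high:.1f} damaged or mismatched molecules")
```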

Example 2: logic demonstration of deriving the number of molecular tags UMInum from the Sanger result

1. 100 molecular tags of known sequence were prepared, each 28 nt long, with every base placed in its own cell, as shown in FIG. 2.

2. Assume that the PCR product contains the 5 molecular tag sequences shown in FIG. 3. After first-generation sequencing, the N^- information at each position can be obtained from the Sanger result, as shown in FIG. 3.

3. The known molecular tag sequences are filtered according to the N^- information. For example, at the 16th base, the molecular tags that do not contain g or t at that position are excluded. After exclusion with the N^- information of positions 1 to 16, only 15 molecular tags remain, as shown in FIG. 4.

4. Exclusion according to the N^- information continues; on reaching the 28th base, 5 molecular tags finally remain, exactly the 5 assumed to be present, as shown in FIG. 5.

5. This example describes the logic of using molecular tags of known sequence, together with the N^- information obtained after Sanger sequencing, to deduce the number of molecular tag types in the PCR product; the actual analysis is carried out by purpose-written software.

Example 3

In order to show the more general utility of the present invention, this example was designed at a different location in the human genome from that of example 1, and the same principle as that of example 1 was applied to the primer design principle of this example, refer to the earlier patent of this company, CN110923325A, primer set, kit and method for detecting EGFR gene mutation, and CN110982884A, primer set, kit and method for detecting AML-related gene mutation;

SSL3-FP：CCAGAAAACAGGCAGGTCTCTC

SSL3-BL：CAGGTCTCTCTGCTCTTGACCGAGC

SSL3-RP：ACAGCAGGCAGTTGGGA

the UNITaq sequence is the same as that in example 1, and the SSL3-RP sequence in this example is only a specific sequence part, and in the preparation process, the UNITaq sequence is added to construct a 5-tgtaaaacgacggctagtaca (N28) -RP structure, wherein N28 is 100 UMI sequences as shown in FIGS. 6a and 6B, and the design part of UMI refers to CN110060734B barcode generation and reading method for high robustness DNA sequencing, and the difference is that the barcode designed in CN110060734B is used for sample differentiation, and the reading mode is more complex, and the scheme of the present invention is used for distinguishing different original molecules in a sample, and simultaneously has a simpler reading and identifying mode.

The experimental method follows Example 1. A positive mutant plasmid template was custom-synthesized according to the hg19 reference sequence; the mutation is the C>G change near position 80 in FIG. 7. The template was prepared as a 0.1% variant standard, and the PCR product was subjected directly to Sanger sequencing, repeated 3 times. The results are shown in FIG. 7, where the horizontal box marks the region where the UMI is located and the vertical box marks the C>G position.

The three Sanger replicates clearly show that the first 4 bases of the UMI are clean single peaks reading CTCA. Applying the same N^- information exclusion approach as in Example 2 narrows the candidates to 6 UMI molecular tags, as shown in FIG. 8.

The N^- information at position 5 is c and a, which does not help to screen the UMIs further. The N^- information at position 6 is t, g and a, which is useful and narrows the candidates to 2 UMI molecular tags, as shown in FIG. 9.

The N^- information at the subsequent positions further confirms that the Sanger result is composed of these two UMI sequences, which are present in the PCR product at a ratio close to 1:1 (about 50% each), indicating that at least two original DNA molecules carry the base mutation.

The above is only a preferred embodiment of the present invention and is not intended to limit it. It should be noted that those skilled in the art can make many modifications and variations without departing from the technical principle of the present invention, and such modifications and variations should also be regarded as falling within the scope of protection of the present invention.

Sequence listing

<110> Carrier Gene Technology (Suzhou) Co., Ltd.

<120> method for evaluating base damage, mismatch and variation in sample DNA using first-generation sequencing

<160> 17

<170> PatentIn version 3.3

<210> 1

<211> 23

<212> DNA

<213> (Artificial sequence)

<400> 1

ccctgacaac atagttggaa tca 23

<210> 2

<211> 25

<212> DNA

<213> (Artificial sequence)

<400> 2

actccaggat aatacacatc acagt 25

<210> 3

<211> 28

<212> DNA

<213> (Artificial sequence)

<400> 3

tggaatcact catgatatct cgagccat 28

<210> 4

<211> 17

<212> DNA

<213> (Artificial sequence)

<400> 4

agcagtctct gcctcgc 17

<210> 5

<211> 23

<212> DNA

<213> (Artificial sequence)

<400> 5

agaagattcg gcagaactaa gca 23

<210> 6

<211> 27

<212> DNA

<213> (Artificial sequence)

<400> 6

cctcgccaag cggctcatgt taatatt 27

<210> 7

<211> 24

<212> DNA

<213> (Artificial sequence)

<400> 7

agaagatgtg gaaaagtccc aatg 24

<210> 8

<211> 18

<212> DNA

<213> (Artificial sequence)

<400> 8

gtgcccaggt cagtggat 18

<210> 9

<211> 27

<212> DNA

<213> (Artificial sequence)

<400> 9

tcccaatgga actatccgga acatcca 27

<210> 10

<211> 32

<212> DNA

<213> (Artificial sequence)

<400> 10

tcctttaacc acataattag aatcattctt ga 32

<210> 11

<211> 32

<212> DNA

<213> (Artificial sequence)

<400> 11

agttagtttt cactctttac aagttaaaat ga 32

<210> 12

<211> 30

<212> DNA

<213> (Artificial sequence)

<400> 12

atcattcttg atgtctctgg ctagaccaaa 30

<210> 13

<211> 21

<212> DNA

<213> (Artificial sequence)

<400> 13

tgtaaaacga cggccagtac a 21

<210> 14

<211> 25

<212> DNA

<213> (Artificial sequence)

<400> 14

tggaatcact catgatatcg agcca 25

<210> 15

<211> 20

<212> DNA

<213> (Artificial sequence)

<400> 15

cctcgccaag cctcatgtta 20

<210> 16

<211> 24

<212> DNA

<213> (Artificial sequence)

<400> 16

tcccaatgga actatggaac atcc 24

<210> 17

<211> 28

<212> DNA

<213> (Artificial sequence)

<400> 17

atcattcttg atgtctctgt agaccaaa 28

Claims

1. A method for evaluating base damage, mismatch and variation in sample DNA using first-generation sequencing, comprising the steps of:

S1, adding to the DNA sample a nucleic acid composition capable of inhibiting non-target regions and amplification primers carrying an error-correctable molecular tag library, amplifying the DNA sample, and sequencing the product obtained after PCR amplification by a first-generation sequencing technology;

wherein the nucleic acid composition for inhibiting the non-target region in the DNA sample is designed according to the sampling region in the DNA sample, and comprises a forward primer, a reverse primer and a Blocker primer for amplifying the DNA sample; the Blocker primer inhibits the amplification of regions without base damage, mismatch or mutation; the reverse primer is connected with a UNITaq sequence and a UMI sequence;

S3, when both the evaluation method based on the enrichment amplification effect and the evaluation method based on the number of molecular tag types give credible results, taking the result of the evaluation method based on the number of molecular tag types as the proportion of base damage, mismatch and variation in the sample DNA;

the evaluation method based on enrichment amplification is carried out through the following steps:

Efold = (VRF/VAF) × [(1 - VAF)/(1 - VRF)],

S02, testing the standard to obtain the Efold value of each sampling region, calculating the VRF value from the peak ratio of the different bases in the sequencing data of the PCR amplification product, and, when the VRF satisfies 5% <= VRF <= 95%, calculating the VAF value according to the following formula:

VAF = VRF/(Efold - Efold × VRF + VRF),

when the VRF does not satisfy 5% <= VRF <= 95%, the result of the evaluation method based on enrichment amplification is not reliable;
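
The two formulas above are inverses of one another, which a short sketch can make concrete. The function names and the calibration numbers below are hypothetical; only the two formulas and the 5% to 95% reliability window come from the claim.

```python
from typing import Optional


def efold(vrf: float, vaf: float) -> float:
    """Enrichment fold: Efold = (VRF/VAF) * [(1 - VAF)/(1 - VRF)]."""
    return (vrf / vaf) * ((1.0 - vaf) / (1.0 - vrf))


def vaf_from_vrf(vrf: float, e: float) -> Optional[float]:
    """VAF = VRF/(Efold - Efold*VRF + VRF); None when VRF is outside 5%..95%."""
    if not (0.05 <= vrf <= 0.95):
        return None  # the enrichment-based result is treated as not reliable
    return vrf / (e - e * vrf + vrf)


# Hypothetical calibration: a 0.1% standard that reads as 50% after enrichment.
e = efold(vrf=0.50, vaf=0.001)
estimate = vaf_from_vrf(0.50, e)
print(e, estimate)
```

Feeding the calibration VRF back through the second formula recovers the VAF of the standard, which is a quick consistency check on the pair of formulas.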

the evaluation method based on the number of molecular tag types is carried out according to the following formula:

Pdm% = UMInum/(Ng × 1000 × 2/6.67) × 100%,

wherein Ng is the mass, in ng, of the DNA added to the reaction;

when UMInum is greater than 10, the result of the evaluation method based on the number of molecular tag types is not credible;
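
As a worked illustration of the Pdm% formula, the sketch below evaluates it for a hypothetical input of 10 ng DNA in which 2 molecular tag types were identified. The function name is an assumption. The denominator appears to convert the DNA mass into a number of template copies (1000 pg per ng, 6.67 pg per genome, 2 copies per genome), but that reading is an interpretation and is not stated in the claim.

```python
from typing import Optional


def pdm_percent(umi_num: int, ng: float) -> Optional[float]:
    """Pdm% = UMInum / (Ng * 1000 * 2 / 6.67) * 100; None when UMInum > 10."""
    if umi_num > 10:
        return None  # the tag-count-based result is treated as not credible
    copies = ng * 1000 * 2 / 6.67
    return umi_num / copies * 100


# Hypothetical input: 2 tag types found in a reaction containing 10 ng DNA.
pdm = pdm_percent(2, 10.0)
print(pdm)
```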

before calculating the VRF value or before outputting the parameter UMInum, the method comprises the following steps of identifying variation information:

S0001, obtaining a baseline value of the Sanger sequencing signal, comprising the following steps:

a) reading the Sanger AB1 file to obtain the signal value of each signal sample of each fluorescence channel in the file and the signal sample number of each base; for each fluorescence channel and each base, calculating the maximum signal value of that channel within a region around the signal sample number at which the base is located, the region being set by a parameter i, where i may be an integer from 0 to 5 [formula not reproduced];

b) for each fluorescence channel, taking the maxima at all base positions and removing the maxima at the bases that first-generation sequencing identified as the base of that fluorescence channel, to obtain a new set of maxima;

c) calculating the median of the new set of maxima, removing the values whose difference from the median exceeds the mean absolute deviation multiplied by a factor, where the factor may be 2 to 5, and taking the average of the remaining maxima as the background noise baseline of the fluorescence channel;

d) subtracting the background noise value of the corresponding fluorescence channel from the signal values of all fluorescence channel signal samples to obtain the background-noise-removed signal values;
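
Steps a) to d) can be sketched on synthetic data as follows. This is an illustration under stated assumptions, not the claimed implementation: the window of ±i samples around each base, the mean absolute deviation being taken about the median, the default values i = 2 and k = 3, and all function names are assumptions, because the formulas themselves are not available in this text. Reading the AB1 file is left out; the trace and the base positions are supplied as plain lists.

```python
from statistics import mean, median
from typing import List, Sequence


def channel_baseline(trace: Sequence[float], peak_pos: Sequence[int],
                     calls: str, base: str, i: int = 2, k: float = 3.0) -> float:
    """Estimate the background noise baseline of one fluorescence channel."""
    # a) maximum of the channel in a window of +/- i samples around each base
    #    (assumed window shape)
    maxima = [max(trace[max(p - i, 0): p + i + 1]) for p in peak_pos]
    # b) drop the bases that were called as this channel's own base
    noise = [m for m, b in zip(maxima, calls) if b != base]
    # c) discard values farther than k mean absolute deviations from the median
    med = median(noise)
    mad = mean(abs(x - med) for x in noise)
    kept = [x for x in noise if abs(x - med) <= k * mad]
    return mean(kept)


def subtract_baseline(trace: Sequence[float], baseline: float) -> List[float]:
    """d) remove the background noise value from every signal sample."""
    return [v - baseline for v in trace]


# Synthetic 'A' channel: flat background of 10 with a few bumps.
trace = [10] * 60
for pos, value in {5: 500, 15: 12, 25: 10, 35: 11, 45: 300, 55: 9}.items():
    trace[pos] = value
peak_pos = [5, 15, 25, 35, 45, 55]  # signal sample number of each base
calls = "ACGTCG"                    # base calls; only the first base is 'A'

baseline = channel_baseline(trace, peak_pos, calls, base="A")
corrected = subtract_baseline(trace, baseline)
print(baseline, corrected[5])
```

In this synthetic trace the value 300 under a base called as C is rejected as an outlier in step c), so a genuine minor peak does not inflate the noise baseline.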

traversing the peaks of the fluorescence channels: when only one channel has a peak within a region one base wide, the region contains one base, and the type of the base is the base type corresponding to the channel with the peak; when several channels have a peak within a region one base wide, the region may contain several bases, the base type corresponding to the channel with the highest peak is the main base of the region, and for each of the other channels the ratio of its peak value to the peak value of the main base channel is calculated; when the ratio is higher than a threshold, the base type corresponding to that channel is an alternative base type of the region, otherwise no alternative base type exists; a candidate base sequence A consisting of main bases and alternative bases is obtained, with the alternative base types labeled at the positions where alternative bases exist;

wherein the region one base wide is defined as follows: if the Sanger AB1 file contains a given number of bases, and a base is located at a given signal sample number, the previous base and the following base each being located at their own signal sample numbers, then the start position and the end position of the one-base-wide region of that base are each obtained from these signal sample numbers by a formula [formulas not reproduced];

wherein the presence of a peak within the region one base wide is determined as follows: the find_peaks algorithm of SciPy is applied to the background-noise-removed signal values of the fluorescence channel within the region to calculate the peaks of the region;
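
The region definition and the main/alternative base call can be sketched together. Several parts are assumptions and are marked as such in the code: the region boundaries are taken as the midpoints to the neighbouring bases because the boundary formulas are not available in this text, the threshold of 0.1 is arbitrary, and a simple strict-local-maximum search stands in for the SciPy find_peaks algorithm named in the claim so that the example has no dependencies.

```python
from typing import Dict, List, Optional, Tuple


def region_bounds(peak_pos: List[int], j: int, n_samples: int) -> Tuple[int, int]:
    """Assumed one-base-wide region: midpoints to the neighbouring bases."""
    start = (peak_pos[j - 1] + peak_pos[j]) // 2 if j > 0 else 0
    end = (peak_pos[j] + peak_pos[j + 1]) // 2 if j + 1 < len(peak_pos) else n_samples
    return start, end


def highest_peak(signal: List[float], start: int, end: int) -> Optional[float]:
    """Height of the tallest strict local maximum in signal[start:end], if any.

    Stand-in for scipy.signal.find_peaks; plateaus are not handled here.
    """
    best = None
    for t in range(max(start, 1), min(end, len(signal) - 1)):
        if signal[t - 1] < signal[t] > signal[t + 1] and signal[t] > 0:
            best = signal[t] if best is None else max(best, signal[t])
    return best


def call_region(channels: Dict[str, List[float]], start: int, end: int,
                threshold: float = 0.1):
    """Return (main base, sorted alternative bases) for one region."""
    peaks = {}
    for base, signal in channels.items():
        height = highest_peak(signal, start, end)
        if height is not None:
            peaks[base] = height
    if not peaks:
        return None, []
    main = max(peaks, key=peaks.get)
    alts = sorted(b for b in peaks if b != main and peaks[b] / peaks[main] > threshold)
    return main, alts


# Synthetic background-corrected traces with bases at samples 5, 15 and 25.
n = 30
channels = {b: [0.0] * n for b in "ACGT"}
channels["A"][4:7] = [300.0, 900.0, 300.0]
channels["G"][14:17] = [400.0, 1000.0, 400.0]
channels["C"][14:17] = [100.0, 300.0, 100.0]  # 30% of the G peak: alternative
channels["T"][15:18] = [20.0, 50.0, 20.0]     # 5% of the G peak: ignored
peak_pos = [5, 15, 25]

start, end = region_bounds(peak_pos, 1, n)
result = call_region(channels, start, end)
print(start, end, result)
```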

the candidate base sequence B represents the full-length sequence of the PCR product and comprises candidate base sequences B1, B2 and B3, wherein B1 is the sequence at the molecular tag library position, B2 is the sequence of the sample DNA sampling region, and B3 is the remaining sequence outside the molecular tag library position and the sample DNA sampling region; the main bases and alternative bases in candidate base sequence A are combined using the IUPAC base coding rules to obtain the IUPAC-encoded candidate base sequence B;
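
Combining a main base with its alternative bases under the IUPAC coding rules amounts to a lookup from a set of bases to a single ambiguity code. The sketch below shows this; the function name and the list-of-tuples input format are assumptions, while the code table is the standard IUPAC nucleotide ambiguity alphabet.

```python
from typing import List, Sequence, Tuple

# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases covered.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def encode(calls: Sequence[Tuple[str, List[str]]]) -> str:
    """Turn (main base, alternative bases) pairs into one IUPAC-encoded string."""
    return "".join(IUPAC[frozenset([main, *alts])] for main, alts in calls)


# Hypothetical candidate base sequence A: four positions, three with alternatives.
sequence_a = [("A", []), ("G", ["C"]), ("T", ["C"]), ("C", ["A", "G"])]
sequence_b = encode(sequence_a)
print(sequence_b)
```

The encoding keeps which bases are present at a position but no longer records which of them was the main base, so any step that needs the main base has to go back to candidate base sequence A.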

identifying the differences between candidate base sequence B and the known reference sequence R by using an alignment information calculation method;

the alignment information calculation method compares the IUPAC-encoded candidate base sequence B with the known reference sequence R by using a sequence alignment algorithm and an IUPAC code alignment score table, and selects the result with the highest alignment score as the alignment of candidate base sequence B with the known reference sequence R, thereby obtaining the alignment information of candidate base sequence B and the known reference sequence R;

using the alignment information calculation method to obtain the alignment information of candidate base sequences B2 and B3 with the known reference sequence R, and aligning the sequences; scanning the aligned candidate base sequences B2 and B3 and the known reference sequence R, the base information in the IUPAC sequence that differs from the known reference sequence R being the variation information;
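
A minimal sketch of the alignment and scanning steps follows. The claim does not name the alignment algorithm or give the score table, so a plain Needleman-Wunsch global alignment is swapped in, with assumed scores (+1 when the reference base is covered by the IUPAC code, -1 otherwise, -2 per gap). Function names are hypothetical, and insertions and deletions are skipped when reporting variation.

```python
from typing import List, Tuple

IUPAC_SETS = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "Y": "CT",
              "S": "CG", "W": "AT", "K": "GT", "M": "AC", "B": "CGT",
              "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT"}


def score(code: str, ref_base: str) -> int:
    """Assumed score table: +1 if the IUPAC code covers the reference base."""
    return 1 if ref_base in IUPAC_SETS[code] else -1


def align(cand: str, ref: str, gap: int = -2) -> Tuple[str, str]:
    """Global alignment of an IUPAC-encoded candidate against a reference."""
    n, m = len(cand), len(ref)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = i * gap
    for j in range(1, m + 1):
        s[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j - 1] + score(cand[i - 1], ref[j - 1]),
                          s[i - 1][j] + gap, s[i][j - 1] + gap)
    a, b, i, j = [], [], n, m
    while i > 0 or j > 0:  # trace back from the bottom-right corner
        if i > 0 and j > 0 and s[i][j] == s[i - 1][j - 1] + score(cand[i - 1], ref[j - 1]):
            a.append(cand[i - 1]); b.append(ref[j - 1]); i -= 1; j -= 1
        elif i > 0 and s[i][j] == s[i - 1][j] + gap:
            a.append(cand[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(ref[j - 1]); j -= 1
    return "".join(reversed(a)), "".join(reversed(b))


def variants(aligned_cand: str, aligned_ref: str) -> List[Tuple[int, str, str]]:
    """List (reference position, reference base, non-reference bases)."""
    out, ref_pos = [], 0
    for c, r in zip(aligned_cand, aligned_ref):
        if r != "-":
            ref_pos += 1
        if c == "-" or r == "-":
            continue  # indels are ignored in this sketch
        extra = sorted(set(IUPAC_SETS[c]) - {r})
        if extra:
            out.append((ref_pos, r, "".join(extra)))
    return out


# Hypothetical sampling-region sequence with a mixed G/C signal at position 3.
aligned = align("ACSTTA", "ACGTTA")
found = variants(*aligned)
print(aligned, found)
```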

wherein, define