CN111243664B - Gene variation detection method based on high-throughput sequencing - Google Patents
Gene variation detection method based on high-throughput sequencing Download PDFInfo
- Publication number
- CN111243664B CN111243664B CN202010222444.5A CN202010222444A CN111243664B CN 111243664 B CN111243664 B CN 111243664B CN 202010222444 A CN202010222444 A CN 202010222444A CN 111243664 B CN111243664 B CN 111243664B
- Authority
- CN
- China
- Prior art keywords
- quality
- reads
- strand
- base
- follows
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
Images
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/50—Mutagenesis
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Landscapes
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Medical Informatics (AREA)
- General Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- Biophysics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Analytical Chemistry (AREA)
- Chemical & Material Sciences (AREA)
- Molecular Biology (AREA)
- Genetics & Genomics (AREA)
- Bioethics (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Epidemiology (AREA)
- Evolutionary Computation (AREA)
- Public Health (AREA)
- Software Systems (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
The invention discloses a gene variation detection method based on high-throughput sequencing. The method comprises the following steps: preprocessing high-throughput sequencing data, mainly comprising base quality correction and read de-duplication; calculating the preference generated by sequencing, and reducing the number of effective reads supporting mutation through the preference to obtain the corrected read number supporting mutation; and calculating the correlation quality of all single samples by using the corrected read number of the supporting mutant types, then obtaining the variation quality of the single samples, and judging according to the variation quality. Experiments prove that the method improves the true positive rate and reduces the false positive rate aiming at the WGS data of the tissue sample, and possibly reduces biological tests.
Description
Technical Field
The invention belongs to the technical field of biological information, and particularly relates to a gene variation detection method based on high-throughput sequencing.
Background
With the development of sequencing technology, the throughput of high-throughput sequencing is larger and larger, and therefore, the generated data is also larger and larger. At the same time, the application of high-throughput sequencing is becoming more and more widespread and important. Today high-throughput sequencing has become very widespread and important in cancer applications, for example, high-throughput sequencing is applied to: tumor early screening, tumor prognosis, tumor classification and tumor medication guidance. At the same time, the large amount of high throughput sequencing data presents many computational challenges. In high throughput sequencing data analysis procedures, the most important step is mutation detection (variant call). Before clinical reports are issued, the mutation detection results reported by mutation detection software (variable caller) generally need to be checked manually.
The problems of missed detection and false positive exist in the detection result of the existing method for detecting variation based on high-throughput sequencing data. Missed testing can lead to misdiagnosis, which can be a serious hazard to the life of the patient (e.g., reducing survival probability). The large number of false positive variations requires increased manual review effort, which results in very high time and cost for manual review. In summary, the consistency between the current software automatic detection result and the result obtained by manual review is poor. In addition, some mutation detection software runs slowly, resulting in more computing resource consumption and resulting in a doctor waiting longer to get a patient's genetic test report.
Disclosure of Invention
In view of the above problems, the present invention provides a method for detecting gene variation based on high throughput sequencing.
In a first aspect, the invention claims a method for detecting genetic variation based on high throughput sequencing.
The gene variation detection method based on high-throughput sequencing provided by the invention supports data of single-ended sequencing and double-ended sequencing.
The gene variation detection method based on high-throughput sequencing provided by the invention can comprise the following steps:
(A) Preprocessing the high-throughput sequencing data, mainly comprising base quality correction and read de-duplication;
(B) Calculating the preference generated by sequencing, and reducing the number of effective reads supporting mutation through the preference to obtain the number of corrected reads supporting mutation;
(C) And calculating the correlation quality (quality) of all single samples by using the corrected read number of the support mutants, then obtaining the variation quality of the single samples, and judging according to the variation quality.
In the method, the genetic variation detected may be an SNV variation or an InDel variation. If no normal control sample is available, the data corresponding to the normal control sample is considered to be a virtual file in the BAM format that does not contain any read, and the virtual file is equivalent to an empty file.
All qualities mentioned in the present invention are used as "twisting B, green P.base-lifting of automated sequence processes using phred.II.Error Prohibites [ J ]. Genome research,1998,8 (3): 186-194 "quality (quality) is defined in the first equation, so that any quality (e.g., base quality and variant quality) has a one-to-one correspondence with probability. The quality and probability are therefore interchangeable.
When the detected genetic variation is an InDel variation, the step (A) further comprises a step of estimating the base quality before the base quality correction; the base quality estimation is performed as follows:
InDelQual(x,y,a,b,c)=min(x+10,y+10,STRQual(a,b))+10×log 10 (c)
where InDelQual is the InDel mass for the same purpose as base mass, x and y are the base masses of the previous and next bases, respectively, at which InDel occurs, a and b are the unit size (repeat size) and the number of units (repeat number), respectively, of a Short Tandem Repeat (STR), and c is the number of bases of InDel. The STRGUAL function is defined as follows:
STRQual(a,b)=44-10×log 10 (1+8×softplus(a×b-8)/(θ×a 2 ));
the softplus function is defined as follows:
softplus (v) = log (1 + exp (v)), where log refers to the natural logarithm.
If InDel is the absence of one STR cell, then θ =0.25, and in addition θ =1.
When the gene variation to be detected is SNV or InDel variation, the base quality correction in step (A) may be performed as follows:
if the high throughput sequencing is paired-end sequencing, then the R1 and R2 ends are combined before deduplication and filtering, and then the base quality of the combined reads is used.
For the overlapping region of the R1 end and the R2 end, the calculation method of the base quality after combination is as follows:
when the bases corresponding to p and q are the same, the following formula is used:
max(p,q);
when the bases corresponding to p and q are different, the following formula is used for calculation:
max(p,q)-min(p,q);
wherein p is the base mass before the correction of the R1 terminal, and q is the base mass before the correction of the R2 terminal.
For the base masses in the non-overlapping regions of the R1 end and the R2 end, the base masses after combination are equal to the base masses before correction.
In the case of single-ended sequencing, pooling is not required (the corrected base mass is equal to the base mass before correction).
When the detected gene variation is SNV or InDel variation, in the step (A), reads can be deduplicated as follows:
reads of PCR repeats or optical repeats are put into a family to remove the repeats.
Specifically, the PCR repeated or optically repeated reads are put into a family to remove the repeated ones according to a method comprising the following steps:
(a1) For each locus (each locus on each chromosome, e.g.: chr1:1, chr 1.
(a2) Each locus x attracts the other loci y, calculating the strength of attraction of x for y, s, which is proportional to the x-terminal depth (the number of reads that start/end for this locus) (proportionality constant = 1), inversely proportional to the y-terminal depth (proportionality constant = 1) and exponentially decaying with the number of bases z between x and y (decay coefficient = 5);
the formula of the attractive force intensity s is as follows;
s=x/y×5 -z ;
if s > 1 (y is strongly attracted to x), then y is considered a PCR or optical repeat of x, thereby placing x and y in a family.
(a3) And (3) filtering: the threshold (threshold) for base mass was calculated as follows:
Threshold(P)=argmax p∈P (p -|{q∈P|q≤p}| );
wherein P is a set of sequencing error probabilities corresponding to all base masses (one base mass to one sequencing error probability); the mathematical symbol e represents set attribution, so P e P means P is any one element of P, and q e P means q is any one element of P.
Filtering the bases on reads according to a base quality threshold (filter): filtering out bases whose base quality is lower than a base quality threshold after correction; if a base is filtered out, the absence of the base is considered, and it is understood that no base is detected.
It should be noted that filtering bases and filtering reads are two concepts.
After filtering, for each family, taking the most common base (the base has the most occurrence times) of the family as the truncated consensus family base, namely the truncated base; here family may be either base or reads.
Then, integrating the repeated reads into a consensus read; it should be mentioned here that the final result after deduplication is the same whether the deduplication is reads or bases. Thus, it is understood that reads can be de-duplicated, and bases can be filtered.
In the step (B), four preferences of depletion bias, position bias, strand bias and mismatch bias are calculated in each chromosome chain; then selecting the highest preference as the integral preference of the chromosome chain; the number of reads supporting mutant is corrected, and the corrected number of reads is equal to the number of reads before correction divided by the overall preference.
When the detected gene variation is SNV or InDel variation, the four preferences of reduction bias, position bias, strand bias and mismatch bias can be calculated according to the following formulas:
Bias(a 1 ,a 2 ,b 1 ,b 2 )=max(0,OR(LS(a 1 ,a 2 ,b 1 ,b 2 ))-1)*(b 2 ÷(b 2 +b 1 ))+1;
in the formula, OR represents an odds ratio; the OR is calculated as follows:
OR(a 1 ,a 2 ,b 1 ,b 2 )=(a 1 ×b 2 )÷(a 2 ×b 1 )。
in the formula, LS represents Laplace smoothness; the LS formula is as follows: LS (a, b, c, d) = Laplace 2 (Laplace 1 (a, b, c, d)); wherein, laplace 1 The formula is as follows: laplace 1 (a,b,c,d)=(a+p,b+p,c+p,d+p);Laplace 2 The formula is as follows: laplace 2 (a,b,c,d)=(a+p×α,b+p×α,c+p×β,d+p×β);
In Laplace 1 Formula and Laplace 2 In the formula, the default value of p is 0.5;
in Laplace 2 In the formulas, α and β are as follows: α = max (0,p × ((a) 1 +a 2 )/(b 1 +b 2 )-1));β=max(0,p×(b 1 +b 2 )/(a 1 +a 2 )-1);
For the reduction bias, the meaning of the parameters is as follows:
a 1 : the number of reads supporting any allele that were deduplicated;
a 2 : after deduplication, the number of reads supporting any allele;
b 1 : the number of reads that support the mutant allele that are deduplicated;
b 2 : after deduplication, the number of reads supporting the mutant allele.
If there are 7 reads before deduplication and 2 are left after deduplication, then 7-2=5 reads are removed from deduplication.
For position bias, the meaning of each parameter is as follows:
for each direction (two left and right directions in total), 1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 66 bases are elongated in direction; the following four values are then substituted:
a 1 : the number of reads that support any allele in reads that have crossed the extended locus;
a 2 : the number of reads that support any allele in reads that do not span the extended locus;
b 1 : after the elongation has been crossedThe number of reads supporting the mutant allele in the reads of the locus;
b 2 : the number of reads that support the mutant allele in reads that do not span the extended locus.
Sequentially substituting 11 extension units to generate a sequence consisting of 11 numerical values; taking every 3 adjacent values in the sequence as an array, thereby generating 9 arrays; taking the minimum value for each array, thereby producing 9 minimum values; the maximum of the 9 minima is taken as position bias.
For strand bias, the following values are substituted:
a 1 : in the forward direction of the chromosome strand S, the number of reads supporting any allele;
a 2 : in the reverse of chromosome strand S, the number of reads supporting any allele;
b 1 : in the forward direction of chromosome strand S, the number of reads supporting the mutant allele;
b 2 : in the reverse direction of the chromosome chain S, the number of reads of the mutant allele is supported.
Meanwhile, calculating base quality imbalance (BQI for short, 1I in the formula) and strand position imbalance (SPI for short, I in the formula, 2I in the formula), calculating 3 values of I by the following method, and then taking the minimum value; then, dividing the strand bias by the minimum value to normalize the strand bias; finally ensuring that the minimum Strand bias is at least 1;
for BQI, c is the average base mass of all alleles of the other strand, d is the average base mass of the mutant alleles of this strand, and I =10 d-c ;
For SPI, c is the average base distance to the end of read for all alleles of the other strand, d is the average base distance to the end of read for mutant alleles suspected of having a preferred strand, andeach read has two ends: left and right. Due to the fact thatThe left and right termini generate two I, respectively.
Let x =1,2,3,.., 11 for mismatch bias; x is the mismatch quantity threshold (as can be inferred by mismatch quantity < x). 1,2, … …,11 are different attempted thresholds.
For each value of x, the following four values are substituted:
a 1 : mismatch numbers < x, number of reads supporting any allele;
a 2 : the mismatch number is more than or equal to x, and the number of reads of any allele is supported;
b 1 : mismatch numbers < x, number of reads supporting mutant alleles;
b 2 : the number of mismatches is more than or equal to x, and the number of reads of mutant alleles is supported;
sequentially substituting 11 x values to generate a sequence consisting of 11 values; taking every 5 adjacent values in the sequence as an array, thereby generating 7 arrays; taking the minimum value for each array, thereby generating 7 minimum values; the maximum of the 7 minima was taken as the mismatch bias.
When the gene variation to be detected is an SNV or InDel variation, the step (C) may be performed as follows:
(C1) For each site where quality needs to be calculated, calculating family quality, single-strand quality, double-strand quality by using the following formulas;
Quality(p,B,b,r)=log r (BIAS(B,p×B,B,b)×(r-1)+1)×10×log 10 (OR(B,p×B,B,b))。
wherein:
(1) Let r =2 and calculate the family quality using the following definition for reads in a family:
p: error probability corresponding to the base quality threshold calculated in the reads deduplication in the step (a 3);
b: the number of reads that actually observed to support any allele from the data;
b: the number of reads supporting the mutant allele was actually observed from the data.
(2) R =1.5 if the high throughput sequencing experiment was pooled using the PCR amplification method, and r =1.25 if the high throughput sequencing experiment was pooled using the capture or WGS method. Comparing that the families on one chromosome chain are on one single-strand; single-strand quality was calculated for one single-strand using the following definitions (maximum quality: 44 for C: G > T: A mutant form, 48 for T: A > C: G mutant form, 52 for other SNV mutant forms, 60 for all InDel mutant forms). If the single-strand quality is greater than the maximum family quality value, then let the single-strand quality equal the maximum family quality value;
p: the error probability corresponding to each possible family quality (i.e., trying all family qualities);
b: the actual observation of family numbers supporting any allele;
b: actual observation of family numbers supporting mutant alleles;
after all p's are tried, the maximum single-strand quality is taken as the final single-strand quality.
(3) Forward or reverse direction on one double-strand for each single-strand; double-strand quality was calculated for a single strand using the following definitions:
v 1 ′=v 1 +v 2 ×min(1,(min(w 1 +w 2 ,w 0 )-w 1 )/w 1 );
wherein v is 1 ' is the final variable mass on a strand, v 1 Is the quality of variation of this chain (i.e., the final single-strand quality), v 2 Is the quality of variation on the other strand, w 1 Is the average value of the family quality, w, of this chain 2 Is the average value of the family quality, w, of the other chain 0 Equal to the maximum similarity quality (maximum similarity quality is as follows: 44 for C: G > T: A mutant form, 48 for T: A > C: G mutant form, 52 for other SNV mutant forms, 60 for all InDel mutant forms) plus 10;
exchange v 1 And v 2 And exchange w 1 And w 2 According to the calculation v 1 ' symmetry calculation of formula v 2 ′;
v 2 ′=v 2 +v 1 ×min(1,(min(w 1 +w 2 ,w 0 )-w 2 )/w 2 )。
(C2) Double-strand quality is defined as the quality of a single sample variation.
(C3) The single-sample mutation quality of the tumor sample and the normal control sample was calculated according to (C1) and (C2) above. TLOD is then calculated. TLOD represents the probability that the variation is derived from neither the body (soma) nor the embryo (germline), as follows:
first, output values of the following functions are calculated.
TNreward(a 1 ,a 2 ,b 1 ,b 2 )=maxH(TNrewardBinomial(a 1 ,a 2 ,b 1 ,b 2 ),PLQTN(b 2 /b 1 ,a 2 /a 1 ));
Wherein, TNrewardBonomial (a) 1 ,a 2 ,b 1 ,b 2 )=10/log(10)×(a1×KLBernoulli(b 2 /b 1 ,a 2 /a 1 ))
maxH (a, b) = a if max (a, -a) < max (b, -b), otherwise maxH (a, b) = b;
PLQTN(t,n)=3×10×log 10 (min(t/n,10 2.5/3 ))。
TNpenal(a 1 ,a 2 ,b 1 ,b 2 )=max(0,min(DSVQnormal,PLQ(a 2 /a 1 ))-12.5×(max(0,OR(a 1 ,a 2 ,b 1 ,b 2 )-1)) 2 ) (ii) a Wherein DSVQnormal refers to the single sample mutation quality of the normal sample, i.e., the double-strand quality of the normal sample.
a 1 、a 2 、b 1 And b 2 The meanings are as follows:
a 1 : pairingSum of family quality of reads supporting any allele in the normal control sample;
a 2 : sum of family quality of reads supporting mutant alleles in the paired normal control sample;
b 1 : family quality sum of reads supporting any allele in the tomor sample;
b 2 : family quality summation of reads supporting mutant alleles in the tomor sample.
Then, TLOD is calculated as follows:
TLOD(a 1 ,a 2 ,b 1 ,b 2 )=min(DSVQtumor,PLQ(b 2 /b 1 ))+TNreward(a 1 ,a 2 ,b 1 ,b 2 )-TNpenal(a 1 ,a 2 ,b 1 ,b 2 )。
wherein PLQ (f) =90+3 × 10 × log 10 (f) In that respect DSVQtomor refers to the single-sample mutation quality of the tomor sample, i.e., the double-strand quality of the tomor sample.
(C4) The NLOD is calculated. NLOD represents the probability of coming variant from the embryo (germline). The method comprises the following specific steps:
first, for each sample (two samples, tumor and normal paired in total), the Genotyp Likelilhood (GL) for each Genotyp (GT) is calculated. GT and GL are defined and described in https: i/samtools. Io/hts-specs/vcfv4.2. Pdf. In summary, homref (homozygous wild type), hetero (heterozygous) and homalt (homozygous mutant) are three GT. The GL of these three GTs is calculated as follows:
GL(f,homref)=max(GLPowerLaw(f,homref),GLBinomial(f,homref));
GL(f,hetero)=max(GLPowerLaw(f,hetero),GLBinomial(f,hetero));
GL(f,homalt)=max(GLPowerLaw(f,homalt),GLBinomial(f,homalt))。
where GLPowerLaw (f, homref) =10 × 3 × min (0,log) 1o (0.02×(1-f)/f));
GLPowerLaw(f,hetero)=10×3×min(0,log 10 (min(f/(1-f),(1-f)/f)));
GLPowerLaw(f,homalt)=10×3×min(0,log 10 (0.02×f/(1-f)));
GLBinomial(f,homref)=10/log(10)×d×KLBernoulli(min(0.02,f),f);
GLBinomial(f,hetero)=10/log(10)×d×KLBernoulli(0.5,f);
GLBinomial(f,homalt)=10/log(10)×d×KLBernoulli(max(1-0.02,f),f);
KLBernoulli(x,y)=y×log(y/x)+(1-y)×log((1-y)/(1-x));
KLBernoulli (x, y) =0 if x =0, y =0, x is indeterminate or y is indeterminate. For example, (0/0) is an indefinite form.
And a refers to the number of reads supporting mutant genes, d refers to the number of reads supporting any genotype, and f = a/d is the mutation frequency.
The NLOD is then calculated. NLOD = max (qtumhorref, qnormalomref).
Wherein Qtomormref = -3-max (GL (f, homalt), GL (f, heter)) on the tomor sample. Qtumorhmoref is calculated using tomor sample data, so it has no relationship to the paired normal sample.
Wherein Qnormalhomref = GL (f, homref) -max (GL (f, homalt), GL (f, hetero)) on paired normal samples. Qnormalhomref is calculated using paired normal sample data, so it has no relation to the tomor sample.
(C5) The final quality of variation was: min (TLOD, NLOD + 31). If the final quality of variation is at least 60, determining as positive and outputting a variant form; otherwise, the result is judged to be negative. A mass of 60 represents one part per million of false positive rate. Mutation detection software MuTect (Cibulskis K, lawrence M S, carter S L, et al. Sensitive detection of physiological point mutations in impure and heterologous samples [ J ]. Nature Biotechnology,2013, 31 (3): 213.) indirectly uses a threshold of 60, while BWA can output a maximum alignment quality of 60. A threshold of 60 is used. The variant output format is VCF. The VCF format is defined in detail in (Danecek P, auton A, abecasis G, et al. The variant call format and VCFtools [ J ]. Bioinformatics,2011, 27 (15): 2156-2158.).
In a second aspect, the invention claims a genetic variation detection system based on high throughput sequencing.
The invention claims a gene variation detection system based on high-throughput sequencing, which comprises a device A, a device B and a device C;
said device a being capable of pre-processing high throughput sequencing data, comprising essentially base quality correction, read de-duplication, according to step (a) of the first aspect above;
said device B is capable of calculating the bias resulting from sequencing according to step (B) in the first aspect above, and obtaining the corrected number of reads supporting mutations by reducing the number of valid reads supporting mutations by the bias;
the device C can calculate all individual sample correlation qualities (qualities) using the corrected read numbers of the support mutants according to the step (C) in the first aspect, and then obtain individual sample variation qualities, and determine the quality through the variation qualities.
In a third aspect, the invention claims any of the following applications:
(I) Use of a method according to the first aspect or a system according to the second aspect for the manufacture of a product for tumor pre-screening, tumor prognosis, tumor classification and/or tumor medication guidance;
the product is capable of performing the steps of the method of the first aspect as described above for detecting genetic variations from high throughput sequencing data.
(II) use of a method according to the first aspect hereinbefore or a system according to the second aspect hereinbefore for the early screening, prognosis, classification and/or guidance of a tumor.
Experiments demonstrate that the present invention increases the true positive rate and decreases the false positive rate, and possibly decreases biological tests, for WGS data of tissue samples.
Due to the adoption of the technical scheme, the invention has the following advantages:
1. there was little missing detection, that is, there were few false negative detection results.
2. The number of false positive mutations generated was small.
3. Software implemented using the algorithms described in the present invention runs faster than most software.
Drawings
FIG. 1 shows a comparison of the performance of the method of the present invention with that of similar software on data with molecular tags.
FIG. 2 is a comparison of the performance of the method of the present invention with that of similar software on data with molecular tags.
FIG. 3 is a comparison of the performance of the method of the present invention with that of similar software on data with molecular tags.
FIG. 4 is a comparison of the performance of the method of the present invention with similar software on data with molecular tags.
FIG. 5 is a comparison of the performance of the method of the present invention compared to similar software on the tomor-normal organization data.
FIG. 6 is a comparison of the performance of the method of the present invention compared to similar software on the tomor-normal organization data.
In each figure, the accuracy = the number of true positives/(number of true positives + number of false positives). Recall = number of true positives/(number of true positives + number of false negatives).
Detailed Description
The experimental procedures used in the following examples are all conventional procedures unless otherwise specified.
Materials, reagents and the like used in the following examples are commercially available unless otherwise specified.
Example 1 establishment of the Gene mutation detection method based on high throughput sequencing
The gene variation detection method based on high-throughput sequencing established by the invention is detailed by taking SNV variation as an example.
1. Sample genomic DNA was subjected to high throughput sequencing.
2. Preprocessing data of high-throughput sequencing, which mainly comprises the following steps: base quality correction and read de-duplication; the specific method comprises the following steps:
the invention adopts the following technical scheme that the method adopts the following steps of' wining B, green P.base-calibrating of automated sequence processing using phred.II.error substrates [ J ]. Genome research,1998,8 (3): 186-194 "quality in the first formula, so that any quality (e.g., base quality and variant quality) has a one-to-one correspondence to probability. The quality and probability are therefore interchangeable.
The base quality correction is as follows:
a. In the case of paired-end sequencing, the R1 and R2 ends are pooled (merge) before de-duplication and filtration, and the base masses of the pooled reads are used.
For the overlapping region of R1 and R2, the combined base mass was calculated as follows:
when the bases corresponding to p and q are the same: max (p, q);
when the bases corresponding to p and q are different: max (p, q) -min (p, q);
wherein p is the base mass before the correction of the R1 terminal, and q is the base mass before the correction of the R2 terminal.
For base masses within the non-overlapping region of R1 and R2, the combined base masses are equal to the base masses before correction.
b. In the case of single-ended sequencing, pooling is not required (the corrected base mass is equal to the base mass before correction).
The reads are deduplicated, as follows:
The method comprises the following steps: placing PCR or optically repeated reads into a family to remove repeats, comprising the following steps:
a) For each locus (each locus on each chromosome, for example: and (2) chr1:1, chr1:2,., chrl:249250621, chr2: 1..) the number of reads that start or stop at this locus (called end depth for short) is found.
b) Each locus x attracts the other loci y, and the strength of attraction of x to y, s, is calculated as proportional to the x-terminal depth (the number of reads that start/end for this locus) (proportionality constant = 1), inversely proportional to the y-terminal depth (proportionality constant = 1) and exponentially decaying with the number of bases z between x and y (decay coefficient = 5).
If y is strongly attracted to x, then y is considered a PCR or optical repeat of x.
The formula of the attractive force intensity s is as follows:
s=x/y×5 -z ;
if s > 1 (y is strongly attracted to x), then y is considered a PCR or optical repeat of x, thereby placing x and y in a family.
c) And (3) filtering: the threshold (threshold) for base mass was calculated as follows:
Threshold(P)=argmax p∈P (p -|{q∈P|q≤p}| );
wherein P is a set of sequencing error probabilities corresponding to all base masses (one base mass to one sequencing error probability); the mathematical symbol e represents set attribution, so that the meaning of P e P means that P is any one element of P, and the meaning of q e P means that q is any one element of P.
Filtering the bases on reads according to a base quality threshold (filter): the bases whose corrected base quality is below the base quality threshold are filtered out. If a base is filtered out, the absence of the base is considered, and it is understood that no base is detected.
It should be noted that filtering bases and filtering reads are two concepts.
After filtering, for each family, the most common base (the base occurs most frequently) of the family is taken as the truncated consensus base, i.e., the base after de-emphasis. Here family may be either base or reads.
Then, the repeated reads are integrated into a consensus read. It should be mentioned here that the final result after deduplication is the same whether the deduplication is reads or bases. Thus, it can be understood that reads can be de-duplicated, as well as bases filtered.
3. Calculating the preference generated by sequencing, and reducing the number of effective reads of the support mutation through the preference to obtain the corrected read number of the support mutation.
1. Calculating the preference generated in the sequencing process, comprising: reduction bias, position bias, strand bias and mismatch bias.
Bias(a 1 ,a 2 ,b 1 ,b 2 )=max(0,OR(LS(a 1 ,a 2 ,b 1 ,b 2 ))-1)*(b 2 ÷(b 2 +b 1 ))+1;
In the formula, OR represents the ratio of ratios; the OR is calculated as follows:
OR(a 1 ,a 2 ,b 1 ,b 2 )=(a 1 ×b 2 )÷(a 2 ×b 1 )。
in the formula, LS represents Laplace smoothness; the LS formula is as follows: LS (a, b, c, d) = Laplace 2 (Laplace 1 (a, b, c, d)); wherein, laplace 1 The formula is as follows: laplace 1 (a,b,c,d)=(a+p,b+p,c+p,d+p);Laplace 2 The formula is as follows: laplace 2 (a,b,c,d)=(a+p×α,b+p×α,c+p×β,d+p×β);
In Laplace 1 Formula and Laplace 2 In the formula, the default value of p is 0.5;
in Laplace 2 In the formulas, α and β are as follows: α = max (0,p × ((a) 1 +a 2 )/(b 1 +b 2 )-1));β=max(0,p×(b 1 +b 2 )/(a 1 +a 2 )-1);
For the reduction bias, the meaning of the parameters is as follows:
a 1 : the number of reads supporting any allele that were deduplicated;
a 2 : after deduplication, the number of reads supporting any allele;
b 1 : the number of reads that support the mutant allele that are deduplicated;
b 2 : de-weightingThe number of reads supporting the mutant allele then follows.
If there are 7 reads before deduplication and 2 are left after deduplication, then 7-2=5 reads are removed from deduplication.
For position bias, the meaning of each parameter is as follows:
for each direction (two left and right directions in total), 1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 66 bases are elongated in direction; the following four values are then substituted:
a 1 : the number of reads that support any allele in reads that have crossed the extended locus;
a 2 : the number of reads that support any allele in reads that do not span the extended locus;
b 1 : the number of reads that support the mutant allele in reads that have crossed the extended locus;
b 2 : the number of reads that support the mutant allele in reads that do not span the post-elongation locus.
Substituting 11 extension units in sequence to generate a sequence consisting of 11 values; taking every 3 adjacent values in the sequence as an array, thereby generating 9 arrays; taking the minimum value for each array, thereby producing 9 minimum values; the maximum of the 9 minima is taken as position bias.
For strand bias, the following values are substituted:
a 1 : in the forward direction of the chromosome strand S, the number of reads supporting any allele;
a 2 : in the reverse of chromosome strand S, the number of reads supporting any allele;
b 1 : in the forward direction of the chromosome strand S, the number of reads supporting the mutant allele;
b 2 : in the reverse direction of the chromosome chain S, the number of reads of the mutant allele is supported.
Meanwhile, calculating base quality imbalance (BQI for short, 1I in the formula) and strand position imbalance (SPI for short, I in the formula, 2I in the formula), calculating 3 values of I by the following method, and then taking the minimum value; then, dividing strand bias by the minimum value to normalize strand bias; finally ensuring that the minimum Strand bias is at least 1;
for BQI, c is the average base mass of all alleles of the other strand, d is the average base mass of the mutant alleles of this strand, and I =10 d-c ;
For SPI, c is the average base distance to the end of read for all alleles of the other strand, d is the average base distance to the end of read for mutant alleles suspected of having a preferred strand, andeach read has two ends: left and right. The left and right termini thus yield two I, respectively.
Let x =1,2,3,.., 11 for mismatch bias; x is the mismatch quantity threshold (as can be inferred by mismatch quantity < x). 1,2, … …,11 are different attempted thresholds.
For each value of x, the following four values are substituted:
a 1 : mismatch numbers < x, number of reads supporting any allele;
a 2 : the mismatch number is more than or equal to x, and the number of reads of any allele is supported;
b 1 : mismatch numbers < x, number of reads supporting mutant alleles;
b 2 : the mismatch number is more than or equal to x, and supports the reads number of mutant alleles;
sequentially substituting 11 x values to generate a sequence consisting of 11 values; taking every 5 adjacent values in the sequence as an array, thereby generating 7 arrays; taking the minimum value for each array, thereby producing 7 minimum values; the maximum of the 7 minima was taken as the mismatch bias.
Four preferences were calculated for each chromosome strand as described above. The highest preference is selected as the global preference of the chromosome chain. And correcting the number of reads supporting mutant type, wherein the corrected number of reads is equal to the number of reads before correction divided by the overall preference.
4. And calculating the correlation quality (quality) of all single samples by using the corrected read number of the support mutants, then obtaining the variation quality of the single samples, and judging according to the variation quality.
The method comprises the following steps:
1. for each site where quality needs to be calculated, the following formulas are used to calculate family quality, single-strand quality, double-strand quality, turor-normal quality and variable quality.
Quality(p,B,b,r)=log r (BIAS(B,p×B,B,b)×(r-1)+1)×10×log 10 (OR(B,p×B,B,b))。
Wherein:
(1) Let r =2 and calculate the family quality using the following definition for reads in a family:
p: error probability corresponding to the base quality threshold calculated in the reads deduplication in the step (a 3);
b: the number of reads that actually observed to support any allele from the data;
b: the number of reads supporting the mutant allele was actually observed from the data.
(2) R =1.5 if the high throughput sequencing experiments were pooled using PCR amplification methods, and r =1.25 if the high throughput sequencing experiments were pooled using capture or WGS methods. Comparing that the families on one chromosome chain are on one single-strand; single-strand quality was calculated for a single strand using the following definitions (maximum quality: 44 for C: G > T: A mutant form, 48 for T: A > C: G mutant form, 52 for the other SNV mutant forms, and 60 for all InDel mutant forms). If the single-strand quality is greater than the maximum similarity quality value, then let the single-strand quality equal the maximum similarity quality value;
p: the error probability corresponding to each possible family quality (i.e., trying all family qualities);
b: actual observation of family numbers supporting any allele;
b: actual observation of family numbers supporting mutant alleles;
after all p's are tried, the maximum single-strand quality is taken as the final single-strand quality.
(3) Forward or reverse direction of each single-strand on a double-strand; double-strand quality was calculated for a single strand using the following definitions:
v 1 ′=v 1 +v 2 ×min(1,(min(w 1 +w 2 ,w 0 )-w 1 )/w 1 );
wherein v is 1 ' is the final variable mass on a strand, v 1 Is the quality of variation of this chain (i.e., the final single-strand quality), v 2 Is the quality of variation on the other strand, w 1 Is the average value of the family quality, w, of this chain 2 Is the average value of the family quality of the other chain, w 0 Equal to the maximum similarity quality (maximum similarity quality is as follows: 44 for C: G > T: A mutant form, 48 for T: A > C: G mutant form, 52 for other SNV mutant forms, 60 for all InDel mutant forms) plus 10;
exchange v 1 And v 2 And exchange w 1 And w 2 According to the calculation v 1 ' symmetry calculation of formula v 2 ′;
v 2 ′=v 2 +v 1 ×min(1,(min(w 1 +w 2 ,w 0 )-w 2 )/w 2 )。
2. Double-strand quality was defined as the single sample variant mass.
3. The single-sample mutation quality of the tumor sample and the normal control sample were calculated according to the above steps 1 and 2, respectively. TLOD is then calculated. TLOD represents the probability that the variation is derived from neither the body (soma) nor the embryo (germline), as follows:
first, output values of the following functions are calculated.
TNreward(a 1 ,a 2 ,b 1 ,b 2 )=maxH(TNrewardBinomial(a 1 ,a 2 ,b 1 ,b 2 ),PLQTN(b 2 /b 1 ,a 2 /a 1 ));
Wherein, TNrewardBonomial (a) 1 ,a 2 ,b 1 ,b 2 )=10/log(10)×(a1×KLBernoulli(b 2 /b 1 ,a 2 /a 1 ))
maxH (a, b) = a if max (a, -a) < max (b, -b), otherwise maxH (a, b) = b;
PLQTN(t,n)=3×10×log 10 (min(t/n,10 2.5/3 ))。
TNpenal(a 1 ,a 2 ,b 1 ,b 2 )=max(0,min(DSVQnormal,PLQ(a 2 /a 1 ))-12.5×(max(0,OR(a 1 ,a 2 ,b 1 ,b 2 )-1)) 2 ) (ii) a Wherein DSVQnormal refers to the single sample mutation quality of the normal sample, i.e., the double-strand quality of the normal sample.
a 1 、a 2 、b 1 And b 2 The meanings are as follows:
a 1 : the sum of the family quality of reads supporting any allele in the paired normal control sample;
a 2 : sum of family quality of reads supporting mutant alleles in the paired normal control sample;
b 1 : family quality sum of reads supporting any allele in the tomor sample;
b 2 : family quality summation of reads supporting mutant alleles in the tomor sample.
Then, TLOD is calculated as follows:
TLOD(a 1 ,a 2 ,b 1 ,b 2 )=min(DSVQtumor,PLQ(b 2 /b 1 ))+TNreward(a 1 ,a 2 ,b 1 ,b 2 )-TNpenal(a 1 ,a 2 ,b 1 ,b 2 )。
wherein PLQ (f) =90+3 × 10 × log 10 (f) In that respect DSVQtomor refers to the single sample mutation quality of a tomor sample, i.e., the double-strand quality of a tomor sample.
(C4) The NLOD is calculated. NLOD represents the probability of coming variant from the embryo (germline). The method comprises the following specific steps:
first, the Genotype Likehood (GL) for each Genotype (GT) was calculated for each sample (two samples, tumor and normal paired in total). The definition and specification of GT and GL are in https: i/samtools. Io/hts-specs/vcfv4.2. Pdf. In summary, homref (homozygous wild type), hetero (heterozygous) and homalt (homozygous mutant) are three GT. The GL of these three GTs is calculated as follows:
GL(f,homref)=max(GLPowerLaw(f,homref),GLBinomial(f,homref));
GL(f,hetero)=max(GLPowerLaw(f,hetero),GLBinomial(f,hetero));
GL(f,homalt)=max(GLPowerLaw(f,homalt),GLBinomial(f,homalt))。
wherein GLPowerLaw (f, homref) =10 × 3 × min (0,log) 10 (0.02×(1-f)/f));
GLPowerLaw(f,hetero)=10×3×min(0,log 10 (min(f/(1-f),(1-f)/f)));
GLPowerLaw(f,homalt)=10×3×min(0,log 10 (0.02×f/(1-f)));
GLBinomial(f,homref)=10/log(10)×d×KLBernoulli(min(0.02,f),f);
GLBinomial(f,hetero)=10/log(10)×d×KLBernoulli(0.5,f);
GLBinomial(f,homalt)=10/log(10)×d×KLBernoulli(max(1-0.02,f),f);
KLBernoulli(x,y)=y×log(y/x)+(1-y)×log((1-y)/(1-x));
KLBernoulli (x, y) =0 if x =0, y =0, x is indeterminate or y is indeterminate. For example, (0/0) is an indefinite form.
And a refers to the number of reads supporting mutant genes, d refers to the number of reads supporting any genotype, and f = a/d is the mutation frequency.
The NLOD is then calculated. NLOD = max (qtumhorref, qnormalomref).
Wherein Qtomormref = -3-max (GL (f, homalt), GL (f, heter)) on the tomor sample. Qtumorhmref is calculated using tomor sample data, so there is no relationship to paired normal samples.
Wherein Qnormalhomref = GL (f, homref) -max (GL (f, homalt), GL (f, hetero)) on paired normal samples. Qnomalhomref was calculated using paired normal sample data, so it has no relation to the tunor sample.
5. Calculating the final quality of variation
The final quality of variation was calculated according to the following formula: min (TLOD, NLOD + 31).
If the final quality of variation is at least 60, determining as positive and outputting a variant form; otherwise, the result is judged to be negative. A mass of 60 represents one part per million of false positive rate. The mutation detection software MuTect (Cibutkis K, lawrence M S, carter S L, et al. Sensitive detection of physiological point mutations in input and heterologous samples [ J ]. Nature biotechnology,2013, 31 (3): 213.) indirectly used a threshold of 60, while the maximum alignment quality that BWA can output was 60. A threshold of 60 is used. The variant output format is VCF. The VCF format is defined in detail in (Danecek P, auton A, abecasis G, et al. The variant call format and VCFtools [ J ]. Bioinformatics,2011, 27 (15): 2156-2158.).
Detecting InDel:
compared with SNV mutation, the method adds a step of 'base quality estimation' before 'base quality correction', and the rest is the same as the detection method of SNV mutation. The "base quality estimation" method is specifically as follows:
InDelQual(x,y,a,b,c)=min(x+10,y+10,STRQual(a,b))+10×log 10 (c)
where InDelQual is the InDel mass for the same purpose as base mass, x and y are the base masses of the previous and next base, respectively, at which InDel occurs, a and b are the unit size (repeat size) and number of units (repeat number), respectively, of the Short Tandem Repeat (STR), and c is the number of bases of InDel. The STRKual function is defined as follows:
STRQual(a,b)=44-10×log 10 (1+8×softplus(a×b-8)/(θ×a 2 ))
the softplus function is defined as follows:
softplus (v) = log (1 + exp (v)), where log refers to the natural logarithm.
If InDel is the absence of one STR cell, then θ =0.25, and in addition θ =1.
Example 2 application example of the method for detecting gene variation based on high throughput sequencing
1. Comparison of the Process of the invention and the SmCounter2 Process
SNV variations were detected on N13532 (SRR 7526729) and N0261 (SRR 7526728) samples, respectively, using the method established in example 1 of the present invention and the existing smCounter2 method (reference "Xu C, gu X, padmanabhan R, et al. SmCounter2: an acid low-frequency variable cassette for targeted sequencing data with an unique molecular identifiers [ J ]. Bioinformatics,2019, 35 (8): 1299-1309.").
N13532 (SRR 7526729) and N0261 (SRR 7526728) samples: data can be found in the following english web sites. https: // www.ncbi.nlm.nih.gov/sra/SRR7526728; https: // www.ncbi.nlm.nih.gov/sra/SRR7526729. In addition, the smCounter2 article is also described. SRR7526728 product number: CDHS-13907Z-562; SRR7526729 product number: CDHS-13532Z-10181.
FIGS. 1 through 4 show the accuracy and recall of the invention and smCounter2 on N13532 (SRR 7526729) and N0261 (SRR 7526728) samples (project number: SRP 153933). As can be seen, the advantages of the present invention are higher F1 fraction and area under the curve.
smCounter2 requires machine learning training by repeated sequencing of the same sample and pre-obtaining some (but not all) known positive mutations of the sample, but the present invention does not require these, and therefore the present invention reduces the requirement for existing data. The present invention improves the true positive rate and reduces the false positive rate for the molecular signature data of blood samples (fig. 1-4).
2. Comparison of Performance of the inventive method, the Mutect2 method and the Strelka2 method on Standard Mixed samples
The method established in the invention example 1 and the existing Mutect2 method and Strelka2 method are adopted to detect the SNV variation on the standard mixed sample respectively.
Strelka2 method: see "Kim S, scheffller K, halpern a L, et al strelkka 2: fast and accurate locking of germline and physiological variations [ J ]. Nature methods,2018, 15 (8): 591. Strelka2 source code is described in https: com/Illumina/strelkka.
Mutect2 method: see "Cibutsis K, lawrence M S, carter S L, et al]Nature biotechnology,2013, 31 (3): 213. "article. Mutect2 source code is inhttps:// github.com/broadinstitute/gatk。
The standard mixed sample contains WGS (whole genome) data for virtual tunor and normal: tumor is a mixture of two samples of NA12878/NA24385 with a mixing ratio of 3/7 with a mean sequencing depth of 90 × and normal is a single sample of pure NA24385 with a mean sequencing depth of 30 ×. Standard mixed sample data is as follows:ftp://ftp- trace.ncbi.nlm.nih.gov/giab/ftp/use_cases/mixtures。
germline variation of NA12878 is all: ftp: v/ftp-trace.ncbi.nlm.nih.gov/gib/ftp/release/NA 12878_ HG001/NISTv3.3.2/GRCh37/HG001_ GRCh37_ GIAB _ highconf _ CG-IllFB-IlGATKHC-Ion-10X-SOLID _ CHROM1-X _ v.3.3.2_ highconf _ PGand RTPHASE transfer.vcf.gz.
Wherein descriptive documentation about these variations is:
ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/NISTv3.3.2/README_NISTv3.3.2.txt。
FIGS. 5 and 6 show the results of performance testing of the inventive method, the Mutect2 method and the Strelka2 method on standard mixed samples. As can be seen, the advantages of the present invention are higher F1 fraction and area under the curve.
Strelka2 requires machine learning training by sequencing other similar samples and pre-deriving all known positive mutations for other similar samples, but this is not required by the present invention. In summary, the present invention increases true positive rate and decreases false positive rate and potentially decreases biological tests for WGS data of tissue samples.
The results of this example show that the method established in example 1 of the present invention (abbreviated as UVC) was compared and evaluated with other software. Evaluation showed very good UVC performance: whether the data are provided with molecular labels or not, the UVC improves the true positive rate and reduces the false positive rate.
Claims (8)
1. A gene variation detection method based on high-throughput sequencing comprises the following steps:
(A) Preprocessing the high-throughput sequencing data, mainly comprising base quality correction and read de-duplication;
(B) Calculating the preference generated by sequencing, and reducing the number of effective reads supporting mutation through the preference to obtain the corrected read number supporting mutation;
(C) Calculating the relative quality of all single samples by using the corrected read number of the support mutant, then obtaining the variation quality of the single sample, and judging through the variation quality;
the step (C) is carried out as follows:
(C1) For each site where quality needs to be calculated, calculating family quality, single-strand quality, double-strand quality by using the following formulas;
Quality(p,B,b,r)=logr(BIAS(B,p×B,B,b)×(r-1)+1)×10×log 10 (OR(B,p×B,B,b));
wherein Bias (B, pxb, B) = max (0, or (LS (B, pxb, B)) -1) × (B ÷ (B + B)) +1;
OR represents an odds ratio; the OR is calculated as follows:
OR(B,p×B,B,b)=(B×b)÷(p×B×B);
LS stands for Laplace smoothing; LS formula is as follows:
LS(B,p×B,B,b)=Laplace 2 (Laplace 1 (B,p×B,B,b));
laplace1 is expressed as: laplace 1 (B,p×B,B,b)=(B+p′,p×B+p′,B+p′,b+p′);
Laplace2 is expressed as: laplace 2 (B,p×B,B,b)=(B+p′×α,p×B+p′×α,B+p′×β,b+p′×β);
In Laplace 1 Formula and Laplace 2 In the formula, the default value of p' is 0.5;
in Laplace 2 In the formulas, α and β are as follows:
α=max(0,p′×((B+p×B)/(B+b)-1));
β=max(0,p′×(B+b)/(B+p×B)-1);
wherein:
(1) Let r =2 and calculate the family quality using the following definition for reads in a family:
p: error probability corresponding to the base quality threshold calculated in the step (A) reads deduplication;
b: the number of reads that actually observed to support any allele from the data;
b: actual observations from the data support the number of reads for the mutant allele;
(2) R =1.5 if the high throughput sequencing experiment is banked using PCR amplification, r =1.25 if the high throughput sequencing experiment is banked using capture or WGS; comparing that the families on one chromosome chain are on one single-strand; for a single-strand, the single-strand quality is calculated using the following definitions:
p: the error probability corresponding to each possible family quality;
b: actual observation of family numbers supporting any allele;
b: actual observation of family numbers supporting mutant alleles;
if the single-strand quality is greater than the maximum similarity quality value, then let the single-strand quality equal the maximum similarity quality value; the maximum family quality is as follows: for C: g > T: the a mutant form was 44, for T: a > C: the G mutant form was 48, 52 for the other SNV mutant forms, 60 for all InDel mutant forms;
after all p's are tried, the maximum single-strand quality is taken as the final single-strand quality;
(3) Forward or reverse direction of each single-strand on a double-strand; for a single-strand, double-strand quality was calculated using the following definitions:
v 1 ′=v 1 +v 2 ×min(1,(min(w 1 +w 2 ,w 0 )-w 1 )/w 1 );
wherein v is 1 ' is the final variable mass on a strand, v 1 Is the variable quality of this chain, i.e.the final single-strand quality, v 2 Is the quality of variation on the other strand, w 1 Is the average value of the family quality, w, of this chain 2 Is the average value of the family quality of the other chain, w 0 Equal to the maximum similarity quality, which is as follows, plus 10: for C: g > T: the a mutant form was 44, for T: a > C: the G mutant form was 48, 52 for the other SNV mutant forms, and 60 for all InDel mutant forms;
exchange v 1 And v 2 And exchange w 1 And w 2 According to the calculation of v 1 ' symmetry calculation of formula v 2 ′;
v 2 ′=v 2 +v 1 ×min(1,(min(w 1 +w 2 ,w 0 )-w 2 )/w 2 );
(C2) Defining double-strand quality as a single sample variation quality;
(C3) Calculating the single sample mutation quality of the tumor sample and the normal control sample according to the (C1) and the (C2) respectively; then calculating TLOD; TLOD represents the probability that the variation is neither somatic nor embryonic, as follows:
firstly, calculating the output values of the following functions;
TNreward(a 1 ,a 2 ,b 1 ,b 2 )=maxH(TNrewardBinomial(a 1 ,a 2 ,b 1 ,b 2 ),PLQTN(b 2 /b 1 ,a 2 /a 1 ));
wherein, TNrewardBonomial (a) 1 ,a 2 ,b 1 ,b 2 )=10/log(10)×(a 1 ×KLBernoulli(b 2 /b 1 ,a 2 /a 1 ) ); wherein, KLBernoulli (b) 2 /b 1 ,a 2 /a 1 )=(a 2 /a 1 )×log((a 2 /a 1 )/(b 2 /b 1 ))+(1-(a 2 /a 1 ))×log((1-(a 2 /a 1 ))/(1-(b 2 /b 1 ) ); if b is 2 /b 1 =0、a 2 /a 1 =0、b 2 /b 1 Is an indefinite form or a 2 /a 1 Is an infinite form, then KLBernoulli (b) 2 /b 1 ,a 2 /a 1 ) =0; maxH (a, b) = a if max (a, -a) < max (b, -b), otherwise maxH (a, b) = b;
PLQTN(t,n)=3×10×log 10 (min(t/n,10 2.5/3 ));
TNpenal(a 1 ,a 2 ,b 1 ,b 2 )=max(0,min(DSVQnormal,PLQ(a 2 /a 1 ))-12.5×(max(0,OR(a 1 ,a 2 ,b 1 ,b 2 )-1)) 2 ) (ii) a Wherein DSVQnormal refers to the single sample mutation quality of normal samples, namely double-strand quality of normal samples; OR represents the ratio of ratios, and the calculation formula of OR is as follows: OR (a) 1 ,a 2 ,b 1 ,b 2 )=(a 1 ×b 2 )÷(a 2 ×b 1 );a 1 、a 2 、b 1 And b 2 The meanings are as follows:
a 1 : family q of reads supporting any allele in the paired normal control sample(ii) a total of the qualitys;
a 2 : the sum of the family quality of reads supporting mutant alleles in the paired normal control sample;
b 1 : family quality sum of reads supporting any allele in the tomor sample;
b 2 : family quality sum of reads supporting mutant alleles in the tomor sample;
then, TLOD is calculated as follows:
TLOD(a 1 ,a 2 ,b 1 ,b 2 )=min(DSVQtumor,PLQ(b 2 /b 1 ))+TNreward(a 1 ,a 2 ,b 1 ,b 2 )-TNpenal(a 1 ,a 2 ,b 1 ,b 2 );
wherein PLQ (f) =90+3 × 10 × log 10 (f) (ii) a DSVQtomor refers to the single-sample mutation quality of a tomor sample, i.e., the double-strand quality of a tomor sample;
(C4) Calculating NLOD; NLOD represents the probability of coming from an embryo for variation, as follows:
firstly, calculating the genotype likelihood of each genotype aiming at each sample; GT for genetic type, GL for genetic type likelihood for short; the GL of three GTs, homozygous wild-type, heterozygous and homozygous mutant, was calculated as follows:
GL formula for homozygous wild type: GL (f, homref) = max (GLPowerLaw (f, homref), GLBinomial (f, homref));
GL calculation formula for heterozygote type: GL (f, hetero) = max (GLPowerLaw (f, hetero), GLBinomial (f, hetero));
GL calculation formula for homozygous mutant: GL (f, homalt) = max (GLPowerLaw (f, homalt), GLBinomial (f, homalt));
where GLPowerLaw (f, homref) =10 × 3 × min (0,log) 10 (0.02×(1-f)/f));
GLPowerLaw(f,hetero)=10×3×min(0,log 10 (min(f/(1-f),(1-f)/f)));
GLPowerLaw(f,homalt)=10×3×min(0,log 10 (0.02×f/(1-f)));
GLBinomial(f,homref)=10/log(10)×d×KLBernoulli(min(0.02,f),f);
GLBinomial(f,hetero)=10/log(10)×d×KLBernoulli(0.5,f);
GLBinomial(f,homalt)=10/log(10)×d×KLBernoulli(max(1–0.02,f),f);
KLBernoulli (x, y) = y × log (y/x) + (1-y) × log ((1-y)/(1-x)); KLBernoulli (x, y) =0 if x =0, y =0, x is indeterminate or y is indeterminate;
and a refers to the number of reads supporting mutant genes, d refers to the number of reads supporting any genotype, f = a/d is the mutation frequency;
then, calculating NLOD; NLOD = max (qtumhorref, qnormalomref);
wherein, on the tomor sample, qtomormref = -3-max (GL (f, homalt), GL (f, hetero));
wherein Qnormalhomref = GL (f, homref) -max (GL (f, homalt), GL (f, hetero));
(C5) The final quality of variation was calculated according to the following formula: min (TLOD, NLOD + 31);
if the final quality of variation is at least 60, determining as positive and outputting a variant form; otherwise, the result is judged to be negative.
2. The method of claim 1, wherein: in the method, the genetic variation detected is an SNV variation or an InDel variation.
3. The method of claim 1, wherein: when the detected genetic variation is an InDel variation, the step (A) further comprises a step of estimating the base quality before the base quality correction; the base quality estimation is performed as follows:
InDelQual(x,y,a,b,c)=min(x+10,y+10,STRQual(a,b))+10×log 10 (c);
wherein InDelQual is InDel mass having the same purpose as base mass, x and y are base mass of a previous base and a next base of the occurrence of InDel, a and b are unit size and unit number of the short tandem repeat sequence, and c is base number of InDel; the STRGUAL function is defined as follows:
STRQual(a,b)=44-10×log 10 (1+8×softplus(a×b-8)/(θ×a 2 ));
the softplus function is defined as follows:
softplus (v) = log (1 + exp (v)), where log refers to the natural logarithm;
if InDel is a deletion of one short tandem repeat unit, then θ =0.25, and in addition θ =1.
4. A method according to any one of claims 1-3, characterized in that: in the step (A), the base quality correction is carried out as follows:
if the high throughput sequencing is double-ended sequencing, merging before de-duplication and filtering, merging the R1 end and the R2 end, and then using the base quality of the merged reads;
for the overlapping region of the R1 end and the R2 end, the base quality after combination is calculated as follows:
when the bases corresponding to p and q are the same, the following formula is used: max (p, q);
when the bases corresponding to p and q are different, the following formula is used for calculation: max (p, q) -min (p, q);
wherein p is the base mass before the end R1 is corrected, and q is the base mass before the end R2 is corrected;
for base masses in non-overlapping regions of the R1 end and the R2 end, the combined base masses are equal to the base masses before correction;
if the single-ended sequencing is carried out, merging is not needed, and the base quality after correction is equal to the base quality before correction;
and/or
In the step (A), read deduplication is performed as follows:
placing PCR repeats or optically repeated reads into a family to remove repeats;
further, PCR repeats or optically repeated reads are placed in a family to remove duplicates according to a method comprising the following steps:
(a1) For each locus, the number of reads that start or stop at that locus, abbreviated as end depth, is found;
(a2) Each locus x attracts other loci y, the attraction strength s of x to y is calculated, the attraction strength s is in direct proportion to the depth of the x end, in inverse proportion to the depth of the y end and in exponential decay relation with the number z of bases between x and y;
the formula of the attractive force intensity s is as follows;
s=x/y×5 -z ;
if s > 1, then y is considered a PCR or optical repeat of x, thereby placing x and y in a family;
(a3) And (3) filtering: the threshold for base mass was calculated as follows:
Threshold(P)=argmax p∈P (p -l{q∈P|q≤p}| );
wherein P is a set of sequencing error probabilities corresponding to all base masses; the mathematical symbol e represents set attribution, so that the meaning of P e P means that P is any one element in P, and the meaning of q e P means that q is any one element in P;
bases on reads were filtered according to base quality thresholds: filtering out bases whose base quality is lower than a base quality threshold after correction;
after filtering, for each family, taking the base with the most frequent family as the base after de-weighting; the repeated reads are integrated into a consistent read.
5. A method according to any one of claims 1-3, characterized in that: in the step (B), four preferences of depletion bias, position bias, strand bias and mismatch bias are calculated in each chromosome chain; then selecting the highest preference as the integral preference of the chromosome chain; the number of reads supporting mutant is corrected, the corrected number of reads being equal to the number of reads before correction divided by the overall preference.
6. The method of claim 5, wherein: the four preferences of depletion bias, position bias, strand bias and mismatch bias are calculated as follows:
Bias(a 1 ,a 2 ,b 1 ,b 2 )=max(0,OR(LS(a 1 ,a 2 ,b 1 ,b 2 ))-1)*(b 2 ÷(b 2 +b 1 ))+1;
wherein OR represents an odds ratio; the OR is calculated as follows:
OR(a 1 ,a 2 ,b 1 ,b 2 )=(a 1 ×b 2 )÷(a 2 ×b 1 );
wherein LS represents Laplace smoothing; the LS formula is as follows: LS (a, b, c, d) = Laplace 2 (Laplace 1 (a, b, c, d)); wherein, laplace 1 The formula is as follows: laplace 1 (a, b, c, d) = (a + p, b + p, c + p, d + p); laplace2 has the following formula: laplace 2 (a,b,c,d)=(a+p×α,b+p×α,c+p×β,d+p×β);
In Laplace 1 Formula and Laplace 2 In the formula, the default value of p is 0.5;
in Laplace 2 In the formulas, α and β are as follows: α = max (0,p × ((a) 1 +a 2 )/(b 1 +b 2 )-1));β=max(0,p×(b 1 +b 2 )/(a 1 +a 2 )-1);
For the reduction bias, the meaning of the parameters is as follows:
a 1 : the number of reads supporting any allele that were deduplicated;
a 2 : after deduplication, the number of reads supporting any allele;
b 1 : the number of reads that support the mutant allele that are deduplicated;
b 2 : number of reads supporting mutant alleles after deduplication;
for position bias, the meaning of each parameter is as follows:
for each direction, extending 1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 66 bases in direction; the following four values are then substituted:
a 1 : the number of reads that support any allele within reads that have spanned the extended locus;
a 2 : the number of reads that support any allele in reads that do not span the extended locus;
b 1 : in reads that have spanned the extended locus, the number of reads supporting the mutant allele;
b 2 : number of reads that support mutant alleles in reads that do not span the extended locus;
sequentially substituting 11 extension units to generate a sequence consisting of 11 numerical values; taking every 3 adjacent values in the sequence as an array, thereby generating 9 arrays; taking the minimum value for each array, thereby producing 9 minimum values; taking the maximum value of the 9 minimum values as position bias;
for strand bias, the following values are substituted:
a 1 : in the forward direction of the chromosome strand S, the number of reads supporting any allele;
a 2 : in the reverse of chromosome chain S, the number of reads supporting any allele;
b 1 : in the forward direction of chromosome strand S, the number of reads supporting the mutant allele;
b 2 : in the reverse direction of the chromosome strand S, the number of reads supporting the mutant allele;
simultaneously calculating base quality inventory and strand position inventory, wherein the base quality inventory is BQI for short, the formula is represented by I, the formula comprises 1I, the strand position inventory is SPI for short, the formula is represented by I and comprises 2I, the numerical values of the 3I are calculated by the following method, and then the minimum numerical value is taken; then, dividing the strand bias by the minimum value to normalize the strand bias; finally ensuring that the minimum Strand bias is at least 1;
for BQI, c is the average base of all alleles of the other strandBase mass, d is the average base mass of the mutant allele of this strand, and I =10 d-c ;
For SPI, c is the average base distance to the end of read for all alleles of the other strand, d is the average base distance to the end of read for mutant alleles suspected of having a preferred strand, andeach read has two ends: left and right sides; the left and right termini thus yield two I, respectively;
let x =1,2,3, …,11 for mismatch bias; x is the mismatch number threshold, 1,2, …,11 are different attempted thresholds;
for each value of x, the following four values are substituted:
a 1 : mismatch numbers < x, number of reads supporting any allele;
a 2 : the mismatch number is more than or equal to x, and the number of reads of any allele is supported;
b 1 : mismatch numbers < x, number of reads supporting mutant alleles;
b 2 : the mismatch number is more than or equal to x, and supports the reads number of mutant alleles;
sequentially substituting 11 x values to generate a sequence consisting of 11 values; taking every 5 adjacent values in the sequence as an array, thereby generating 7 arrays; taking the minimum value for each array, thereby generating 7 minimum values; the maximum of the 7 minima was taken as the mismatch bias.
7. A gene variation detection system based on high-throughput sequencing comprises a device A, a device B and a device C;
the device A is capable of pre-processing high throughput sequencing data, mainly including base quality correction, read de-duplication, according to step (A) of any one of claims 1 to 6;
said device B is capable of calculating the preference generated by sequencing according to the step (B) of any one of claims 1 to 6, and obtaining the corrected read number of the support mutation type by reducing the effective read number of the support mutation type according to the preference;
the device C can calculate the correlation quality of all the individual samples by using the corrected read numbers of the supporting mutants according to the step (C) of any one of claims 1 to 6, and then obtain the variation quality of the individual samples, and perform the judgment by the variation quality.
8. Use of the method of any one of claims 1-6 or the system of claim 7 for the manufacture of a product for tumor pre-screening, tumor prognosis, tumor classification and/or tumor medication guidance.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010222444.5A CN111243664B (en) | 2020-03-26 | 2020-03-26 | Gene variation detection method based on high-throughput sequencing |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010222444.5A CN111243664B (en) | 2020-03-26 | 2020-03-26 | Gene variation detection method based on high-throughput sequencing |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111243664A CN111243664A (en) | 2020-06-05 |
CN111243664B true CN111243664B (en) | 2023-04-18 |
Family
ID=70869208
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010222444.5A Active CN111243664B (en) | 2020-03-26 | 2020-03-26 | Gene variation detection method based on high-throughput sequencing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111243664B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111785325B (en) * | 2020-06-23 | 2021-10-22 | 西北工业大学 | Method for identifying heterogeneous cancer driver genes of mutually exclusive constraint graph Laplace |
CN115458051B (en) * | 2022-09-28 | 2023-03-21 | 北京泛生子基因科技有限公司 | Method, device and computer readable storage medium for simulating small variation in sequencing data and capable of retaining molecular tag information |
CN116741274B (en) * | 2023-02-07 | 2024-07-26 | 杭州联川基因诊断技术有限公司 | Method, device and medium for determining representative sequence in targeted sequencing data |
CN116895332B (en) * | 2023-09-11 | 2023-12-05 | 臻和(北京)生物科技有限公司 | Filtering method for interrupting false positive mutation generated by artificial fragments in library construction by enzyme digestion method |
-
2020
- 2020-03-26 CN CN202010222444.5A patent/CN111243664B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN111243664A (en) | 2020-06-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111243664B (en) | Gene variation detection method based on high-throughput sequencing | |
JP7368483B2 (en) | An integrated machine learning framework for estimating homologous recombination defects | |
US20240304280A1 (en) | Validation methods and systems for sequence variant calls | |
Morgan et al. | The mouse universal genotyping array: from substrains to subspecies | |
EP4073805B1 (en) | Systems and methods for predicting homologous recombination deficiency status of a specimen | |
CN113366122B (en) | Free DNA end characterization | |
US10496679B2 (en) | Computer algorithm for automatic allele determination from fluorometer genotyping device | |
WO2017156290A9 (en) | A novel algorithm for smn1 and smn2 copy number analysis using coverage depth data from next generation sequencing | |
WO2021232388A1 (en) | Method for determining base type of predetermined site in embryonic cell chromosome, and application thereof | |
IL258999A (en) | Methods for detecting copy-number variations in next-generation sequencing | |
CN113748467A (en) | Loss of function calculation model based on allele frequency | |
JP7333838B2 (en) | Systems, computer programs and methods for determining genetic patterns in embryos | |
WO2024140368A1 (en) | Sample cross contamination detection method and device | |
CN114400045B (en) | Method, probe set, kit and system for detecting homologous recombination repair defects based on second-generation sequencing | |
KR20220064952A (en) | SYSTEMS AND METHODS FOR DETERMINING GENOME PLODY | |
CN117497047B (en) | Method, equipment and medium for screening tumor gene markers based on exon sequencing | |
US20240376527A1 (en) | Cell-free dna end characteristics | |
Sarantidis | Algorithms to Explore the Chromosomal Clustering of Genes |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |