CN111243664B - Gene variation detection method based on high-throughput sequencing - Google Patents
Gene variation detection method based on high-throughput sequencing Download PDFInfo
- Publication number
- CN111243664B CN111243664B CN202010222444.5A CN202010222444A CN111243664B CN 111243664 B CN111243664 B CN 111243664B CN 202010222444 A CN202010222444 A CN 202010222444A CN 111243664 B CN111243664 B CN 111243664B
- Authority
- CN
- China
- Prior art keywords
- quality
- reads
- strand
- base
- follows
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
Images
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/50—Mutagenesis
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Landscapes
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Medical Informatics (AREA)
- General Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- Biophysics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Analytical Chemistry (AREA)
- Chemical & Material Sciences (AREA)
- Molecular Biology (AREA)
- Genetics & Genomics (AREA)
- Bioethics (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Epidemiology (AREA)
- Evolutionary Computation (AREA)
- Public Health (AREA)
- Software Systems (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
Description
技术领域technical field
本发明属于生物信息技术领域,具体涉及一种基于高通量测序的基因变异检测方法。The invention belongs to the technical field of biological information, and in particular relates to a gene variation detection method based on high-throughput sequencing.
背景技术Background technique
随着测序技术的发展,高通量测序的通量越来越大,因此产生的数据也越来越大。与此同时,高通量测序的应用也越来越广泛并且越来越重要。如今高通量测序在癌症的应用已经非常广泛且非常重要,例如,高通量测序被应用于:肿瘤早筛、肿瘤预后、肿瘤分类和肿瘤用药指导。与此同时,大量的高通量测序数据带来了很多计算方面的挑战。在高通量测序数据分析流程中,最重要的步骤就是变异检测(variant call)。在临床报告出具之前一般需要人工审核变异检测软件(variant caller)报出的变异检测结果。With the development of sequencing technology, the throughput of high-throughput sequencing is increasing, so the data generated is also increasing. At the same time, the application of high-throughput sequencing is becoming more and more extensive and more and more important. Nowadays, the application of high-throughput sequencing in cancer has been very extensive and very important. For example, high-throughput sequencing is used in early tumor screening, tumor prognosis, tumor classification and tumor drug guidance. At the same time, the large amount of high-throughput sequencing data brings many computational challenges. In the high-throughput sequencing data analysis process, the most important step is variant call. Before the clinical report is issued, it is generally necessary to manually review the variant detection results reported by the variant caller.
目前基于高通量测序数据进行变异检测的方法检测出的结果中都存在漏检和假阳性的问题。漏检有可能导致错误诊断,对患者的生命造成严重危害(比如说,降低生存概率)。而大量的假阳性变异需要增加人工审核的工作,导致人工审核需要的时间和成本变得非常高。总而言之,目前存在的软件自动化检测出的结果跟人工审核得到的结果的一致性比较差。除此之外,某些变异检测软件运行速度缓慢,从而导致更多的计算资源消耗,并且导致医生需要等待更长的时间才能拿到患者的基因检测报告。The current methods of variant detection based on high-throughput sequencing data all have problems of missed detection and false positives in the detection results. Missed detection may lead to wrong diagnosis and cause serious harm to the patient's life (for example, reduce the probability of survival). However, a large number of false positive mutations need to increase the work of manual review, resulting in very high time and cost for manual review. All in all, the consistency between the results of automatic detection by existing software and the results of manual review is relatively poor. In addition, some variant detection software runs slowly, which consumes more computing resources and causes doctors to wait longer to get the patient's genetic test report.
发明内容Contents of the invention
针对上述问题,本发明的目的是提供一种基于高通量测序的基因变异检测方法。In view of the above problems, the object of the present invention is to provide a gene variation detection method based on high-throughput sequencing.
第一方面,本发明要求保护一种基于高通量测序的基因变异检测方法。In the first aspect, the present invention claims a method for detecting gene variation based on high-throughput sequencing.
本发明所提供的基于高通量测序的基因变异检测方法支持单端测序和双端测序的数据。The gene variation detection method based on high-throughput sequencing provided by the present invention supports single-end sequencing and paired-end sequencing data.
本发明所提供的基于高通量测序的基因变异检测方法,可包括如下步骤:The gene variation detection method based on high-throughput sequencing provided by the present invention may include the following steps:
(A)对高通量测序数据进行预处理,主要包括碱基质量校正、read去重;(A) Preprocessing the high-throughput sequencing data, mainly including base quality correction and read deduplication;
(B)计算测序产生的偏好性,通过偏好性减少支持突变的有效read的数量,获得校正后的支持突变型的reads的数量;(B) Calculating the bias generated by sequencing, reducing the number of effective reads that support mutations through biases, and obtaining the corrected number of reads that support mutations;
(C)利用校正后的支持突变型的read数目,计算所有单个样本相关质量(quality),然后得出单个样本变异质量,通过变异质量进行判定。(C) Using the corrected number of reads supporting the mutant type, calculate the relevant quality of all individual samples, and then obtain the variation quality of a single sample, and judge by the variation quality.
在所述方法中,被检测的所述基因变异可为SNV变异或者InDel变异。如果没有可用的normal对照样本,则认为normal对照样本对应的数据是一个不含有任何read的BAM格式的虚拟文件,此虚拟文件等同于空文件。In the method, the detected genetic variation can be SNV variation or InDel variation. If there is no normal control sample available, the data corresponding to the normal control sample is considered to be a virtual file in BAM format that does not contain any reads, and this virtual file is equivalent to an empty file.
所有本发明中提到的质量都采用“Ewing B,Green P.Base-calling ofautomated sequencer traces using phred.II.Error probabilities[J].Genomeresearch,1998,8(3):186-194.”第一个公式中对质量(quality)的定义,因此任何质量(比如说:碱基质量和变异质量)都跟概率有一一对应的关系。因此质量和概率是可互换的。All the qualities mentioned in this invention are adopted "Ewing B, Green P. Base-calling of automated sequencer traces using phred. II. Error probabilities [J]. Genomeresearch, 1998, 8 (3): 186-194." The first The definition of quality in this formula, so any quality (for example: base quality and mutation quality) has a one-to-one correspondence with probability. So mass and probability are interchangeable.
当被检测的所述基因变异为InDel变异时,步骤(A)中,在进行所述碱基质量校正之前还包括碱基质量估计的步骤;所述碱基质量估计按照如下进行:When the detected genetic variation is an InDel variation, in step (A), the step of base quality estimation is also included before performing the base quality correction; the base quality estimation is performed as follows:
InDelQual(x,y,a,b,c)=min(x+10,y+10,STRQual(a,b))+10×log10(c)InDelQual(x, y, a, b, c) = min(x+10, y+10, STRQual(a, b))+10×log 10 (c)
其中InDelQual是跟碱基质量有相同用途的InDel质量,x和y分别是发生InDel的前一个碱基和后一个碱基的碱基质量,a和b分别是短串联重复序列(STR)的单元大小(repeat size)和单元数量(repeat number),c是InDel的碱基数量。STRQual函数定义如下:Among them, InDelQual is the InDel quality that has the same purpose as the base quality, x and y are the base quality of the previous base and the next base where InDel occurs, respectively, and a and b are the units of the short tandem repeat (STR) The size (repeat size) and the number of units (repeat number), c is the number of bases of InDel. The STRQual function is defined as follows:
STRQual(a,b)=44-10×log10(1+8×softplus(a×b-8)/(θ×a2));STRQual(a,b)=44-10×log 10 (1+8×softplus(a×b-8)/(θ×a 2 ));
softplus函数定义如下:The softplus function is defined as follows:
softplus(v)=log(1+exp(v)),其中log指的是自然对数。softplus(v)=log(1+exp(v)), where log refers to the natural logarithm.
如果InDel是一个STR单元的缺失,那么θ=0.25,除此之外,θ=1。If InDel is a deletion of a STR unit, then θ=0.25, otherwise θ=1.
当被检测的所述基因变异为SNV或者InDel变异时,步骤(A)中,均可按照如下进行碱基质量校正:When the detected genetic variation is SNV or InDel variation, in step (A), base quality correction can be performed as follows:
如果所述高通量测序为双端测序,则在去重和过滤前进行合并,将R1端和R2端进行合并,然后使用合并后的reads的碱基质量。If the high-throughput sequencing is paired-end sequencing, merge before deduplication and filtering, merge the R1 and R2 ends, and then use the base quality of the merged reads.
对于R1端和R2端的重叠区内,合并后碱基质量的计算方法如下:For the overlapping region between the R1 and R2 ends, the calculation method of the combined base quality is as follows:
当p和q对应的碱基一样时,按照如下公式计算:When the bases corresponding to p and q are the same, calculate according to the following formula:
max(p,q);max(p,q);
当p和q对应的碱基不一样时,按照如下公式计算:When the bases corresponding to p and q are different, calculate according to the following formula:
max(p,q)-min(p,q);max(p,q)-min(p,q);
其中,p是R1端校正前的碱基质量,q是R2端校正前的碱基质量。Among them, p is the base quality before R1 end correction, and q is the base quality before R2 end correction.
对于R1端和R2端的非重叠区内的碱基质量,合并后的碱基质量等于校正前的碱基质量。For the base quality in the non-overlapping region of R1 and R2 ends, the combined base quality is equal to the base quality before correction.
如果是单端测序,不需要合并(校正后的碱基质量等于校正前的碱基质量)。If it is single-end sequencing, no merging is required (the base quality after correction is equal to the base quality before correction).
当被检测的所述基因变异为SNV或者InDel变异时,步骤(A)中,均可按照如下对reads进行去重:When the detected gene variation is SNV or InDel variation, in step (A), the reads can be deduplicated as follows:
把PCR重复或者光学重复的reads放到一个family(簇)中从而去除重复。Put the reads of PCR duplication or optical duplication into a family (cluster) to remove duplication.
具体可是按照包括如下步骤的方法把PCR重复或者光学重复的reads放到一个family(簇)中从而去除重复的:Specifically, the reads of PCR duplication or optical duplication can be put into a family (cluster) according to the method including the following steps to remove duplication:
(a1)对于每个基因座(每一条染色体上的每一个基因座,例如:chr1:1,chr1:2,...,chr1:249250621,chr2:1,...),找到在这个基因座起始或终止的reads的数目,简称末端深度。(a1) For each locus (every locus on each chromosome, e.g.: chr1:1, chr1:2, ..., chr1:249250621, chr2:1, ...), find in this gene The number of reads starting or ending at the seat, referred to as the end depth.
(a2)每个基因座x吸引其他基因座y,计算x对于y的吸引力强度s,吸引力强度s跟x末端深度(这个基因座起始/终止的reads的数目)成正比(比例常数=1),跟y末端深度成反比(比例常数=1)并且跟x与y之间碱基数目z成指数衰减关系(衰减系数=5);(a2) Each locus x attracts other loci y, and calculates the attraction strength s of x to y. The attraction strength s is proportional to the depth of the end of x (the number of reads starting/terminating at this locus) (proportionality constant =1), which is inversely proportional to the depth of the end of y (proportionality constant=1) and has an exponential decay relationship with the number of bases z between x and y (attenuation coefficient=5);
吸引力强度s的公式如下;The formula for the strength of attraction s is as follows;
s=x/y×5-z;s=x/y×5 -z ;
如果s>1(y被x强烈吸引),那么认为y是x的一个PCR或者光学重复,从而把x和y放到一个family里。If s > 1 (y is strongly attracted to x), then y is considered to be a PCR or optical repeat of x, thus putting x and y into a family.
(a3)过滤:计算碱基质量的阈值(threshold),公式如下:(a3) Filtering: calculate the threshold value (threshold) of base quality, the formula is as follows:
Threshold(P)=argmaxp∈P(p-|{q∈P|q≤p}|);Threshold(P)=argmax p∈P (p -|{q∈P|q≤p}| );
其中,P是所有的碱基质量对应的测序错误概率(一个碱基质量对应一个测序错误概率)的集合;数学符号∈代表集合归属,因此p∈P的意思是指p是P中任何一个元素,而q∈P的意思是指q是P中任何一个元素。Among them, P is the set of sequencing error probabilities corresponding to all base qualities (one base quality corresponds to one sequencing error probability); the mathematical symbol ∈ represents the set belonging, so p∈P means that p is any element in P , and q∈P means that q is any element in P.
根据碱基质量阈值对reads上的碱基进行过滤(filter):把校正后碱基质量低于碱基质量阈值的碱基过滤掉;如果一个碱基被过滤掉了,那么认为这个碱基不存在,可以理解为没有测出任何碱基。Filter the bases on the reads according to the base quality threshold: filter out the bases whose corrected base quality is lower than the base quality threshold; if a base is filtered out, then it is considered that the base is not Existence, it can be understood that no base has been detected.
需要注意的是,过滤碱基和过滤reads是两个概念。It should be noted that filtering bases and filtering reads are two concepts.
过滤之后,对每个family,取family最常见碱基(碱基出现次数最多)作为去重后consensus family碱基,也就是去重后的碱基;在此family既可以是针对碱基的也可以是针对reads的。After filtering, for each family, take the most common base of the family (the number of times the base appears the most) as the consensus family base after deduplication, that is, the base after deduplication; here the family can be either for the base or Can be for reads.
然后,把重复的reads整合成一条consensus read(一致read);在此需要提到无论去重的是reads还是碱基,去重后最终结果一样。因此既可以理解成对reads进行去重,也可以理解成对碱基进行过滤。Then, integrate the repeated reads into a consensus read (consensus read); here it needs to be mentioned that no matter whether the deduplication is reads or bases, the final result is the same after deduplication. Therefore, it can be understood as deduplication of reads, or as filtering of bases.
步骤(B)中,是在每一条染色体链算出deduplication bias,position bias,strand bias和mismatch bias四个偏好性;然后选最高偏好性作为染色体链整体偏好性;对支持突变型的read数目做校正,校正后的read数目等于校正前的read数目除以整体偏好性。In step (B), the four preferences of deduplication bias, position bias, strand bias and mismatch bias are calculated for each chromosome chain; then the highest bias is selected as the overall preference of the chromosome chain; the number of reads supporting the mutant type is corrected , the number of reads after correction is equal to the number of reads before correction divided by the overall preference.
当所述被检测的所述基因变异为SNV或者InDel变异时,均可按照如下公式计算deduplication bias,position bias,strand bias和mismatch bias四个偏好性:When the detected gene variation is SNV or InDel variation, the four preferences of deduplication bias, position bias, strand bias and mismatch bias can be calculated according to the following formula:
Bias(a1,a2,b1,b2)=max(0,OR(LS(a1,a2,b1,b2))-1)*(b2÷(b2+b1))+1;Bias(a 1 , a 2 , b 1 , b 2 )=max(0, OR(LS(a 1 , a 2 , b 1 , b 2 ))-1)*(b 2 ÷(b 2 +b 1 ))+1;
在公式中,OR代表比值比;OR的计算公式如下:In the formula, OR stands for odds ratio; OR is calculated as follows:
OR(a1,a2,b1,b2)=(a1×b2)÷(a2×b1)。OR(a 1 , a 2 , b 1 , b 2 )=(a 1 ×b 2 )÷(a 2 ×b 1 ).
在公式中,LS代表Laplace平滑;LS公式如下:LS(a,b,c,d)=Laplace2(Laplace1(a,b,c,d));其中,Laplace1公式如下:Laplace1(a,b,c,d)=(a+p,b+p,c+p,d+p);Laplace2公式如下:Laplace2(a,b,c,d)=(a+p×α,b+p×α,c+p×β,d+p×β);In the formula, LS represents Laplace smoothing; LS formula is as follows: LS (a, b, c, d)=Laplace 2 (Laplace 1 (a, b, c, d)); Wherein, Laplace 1 formula is as follows: Laplace 1 ( a, b, c, d)=(a+p, b+p, c+p, d+p); Laplace 2 formula is as follows: Laplace 2 (a, b, c, d)=(a+p×α , b+p×α, c+p×β, d+p×β);
在Laplace1公式和Laplace2公式中,p的默认值是0.5;In the Laplace 1 formula and Laplace 2 formula, the default value of p is 0.5;
在Laplace2公式中,α和β公式如下:α=max(0,p×((a1+a2)/(b1+b2)-1));β=max(0,p×(b1+b2)/(a1+a2)-1);In the Laplace 2 formula, the formulas of α and β are as follows: α=max(0, p×((a 1 +a 2 )/(b 1 +b 2 )-1)); β=max(0, p×( b 1 +b 2 )/(a 1 +a 2 )-1);
对于deduplication bias而言,各参数的意义如下:For deduplication bias, the meaning of each parameter is as follows:
a1:去重去掉的支持任何等位基因的reads的数量;a 1 : the number of reads that support any alleles removed by deduplication;
a2:去重之后,支持任何等位基因的reads的数量;a 2 : After deduplication, the number of reads supporting any allele;
b1:去重去掉的支持突变等位基因的reads的数量;b 1 : the number of reads supporting mutant alleles removed by deduplication;
b2:去重之后,支持突变等位基因的reads的数量。b 2 : After deduplication, the number of reads supporting the mutant allele.
如果去重前有7条reads,去重后剩下2条reads,那么去重去掉了7-2=5条reads。If there are 7 reads before deduplication and 2 reads are left after deduplication, then 7-2=5 reads are removed after deduplication.
对于position bias而言,各参数的意义如下:For position bias, the meaning of each parameter is as follows:
针对每个方向(总共左右两个方向),按照方向延长1、3、6、10、15、21、28、36、45、55、66个碱基;然后代入以下四个数值:For each direction (a total of left and right directions), extend 1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 66 bases according to the direction; then substitute the following four values:
a1:在已经跨越延长后基因座的reads里,支持任何等位基因的reads的数量;a 1 : Among the reads that have spanned the extended locus, the number of reads supporting any allele;
a2:在没有跨越延长后基因座的reads里,支持任何等位基因的reads的数量;a 2 : Among the reads that do not span the extended locus, the number of reads supporting any allele;
b1:在已经跨越延长后基因座的reads里,支持突变等位基因的reads的数量;b 1 : among the reads that have spanned the extended locus, the number of reads supporting the mutant allele;
b2:在没有跨越延长后基因座的reads里,支持突变等位基因的reads的数量。b 2 : Number of reads supporting the mutant allele among reads that do not span the extended locus.
依次代入11个延长单位产生一条由11个数值组成的序列;序列中取每3个相邻数值作为一个数组,从而产生9个数组;对每个数组取最小值,从而产生9个最小值;取9个最小值的最大值为position bias。Substituting 11 extended units in turn produces a sequence consisting of 11 values; taking every 3 adjacent values in the sequence as an array, thereby generating 9 arrays; taking the minimum value for each array, thereby generating 9 minimum values; Take the maximum value of the 9 minimum values as position bias.
对于strand bias而言,代入以下数值:For strand bias, substitute the following values:
a1:在染色体链S的正向,支持任何等位基因的reads的数量;a 1 : In the forward direction of the chromosome strand S, the number of reads supporting any allele;
a2:在染色体链S的反向,支持任何等位基因的reads的数量;a 2 : In the reverse direction of the chromosome strand S, the number of reads supporting any allele;
b1:在染色体链S的正向,支持突变等位基因的reads的数量;b 1 : in the forward direction of the chromosome strand S, the number of reads supporting the mutant allele;
b2:在染色体链S的反向,支持突变等位基因的reads的数量。b 2 : In the reverse direction of chromosome strand S, the number of reads supporting the mutant allele.
同时计算base quality imbalance(简称BQI,公式中用I表示,有1个I)和strandposition imbalance(简称SPI,公式中用I表示,有2个I),通过以下方法计算出3个I的数值,然后取最小数值;然后让strand bias除以此最小值,以此来归一化strand bias;最后确保strand bias最低至少是1;Simultaneously calculate base quality imbalance (abbreviated as BQI, represented by I in the formula, with 1 I) and strandposition imbalance (SPI for short, represented by I in the formula, with 2 Is), and calculate the values of 3 Is by the following method, Then take the minimum value; then divide the strand bias by this minimum value to normalize the strand bias; finally ensure that the strand bias is at least 1;
对于BQI,c是另外一条链的所有等位基因的平均碱基质量,d是这一条链的突变等位基因的平均碱基质量,而I=10d-c;For BQI, c is the average base quality of all alleles of another chain, d is the average base quality of the mutant alleles of this chain, and I=10 dc ;
对于SPI,c是另外一条链的所有等位基因的到read末端的平均碱基距离,d是怀疑有偏好性链的突变等位基因的到read末端的平均碱基距离,而每一条read有两个末端:左边和右边。因此左末端和右末端分别产生两个I。For SPI, c is the average base distance to the end of the read for all alleles of the other strand, d is the average base distance to the end of the read for the mutant alleles of the suspected biased strand, and Each read has two ends: left and right. So two I's are generated at the left and right ends respectively.
对于mismatch bias而言,让x=1,2,3,...,11;x是mismatch数量阈值(通过mismatch数量<x可以推断出)。1,2,……,11是不同的尝试的阈值。For the mismatch bias, let x=1, 2, 3, ..., 11; x is the threshold of the number of mismatches (can be inferred by the number of mismatches < x). 1, 2, ..., 11 are thresholds for different attempts.
针对每个x的值,代入以下四个数值:For each value of x, substitute the following four values:
a1:mismatch数量<x,支持任何等位基因的reads数量;a 1 : the number of mismatches < x, the number of reads supporting any allele;
a2:mismatch数量≥x,支持任何等位基因的reads数量;a 2 : The number of mismatches ≥ x, the number of reads supporting any allele;
b1:mismatch数量<x,支持突变等位基因的reads数量;b 1 : the number of mismatches < x, the number of reads supporting the mutant allele;
b2:mismatch数量≥x,支持突变等位基因的reads数量;b 2 : the number of mismatches ≥ x, the number of reads supporting the mutant allele;
依次代入11个x值产生一条由11个数值组成的序列;序列中取每5个相邻数值作为一个数组,从而产生7个数组;对每个数组取最小值,从而产生7个最小值;取7个最小值的最大值为mismatch bias。Substituting 11 x values in turn produces a sequence consisting of 11 values; taking every 5 adjacent values in the sequence as an array, thus generating 7 arrays; taking the minimum value for each array, thereby generating 7 minimum values; Take the maximum value of the 7 minimum values as the mismatch bias.
当所述被检测的所述基因变异为SNV或者InDel变异时,所述步骤(C)均可按照如下进行:When the detected gene variation is SNV or InDel variation, the step (C) can be carried out as follows:
(C1)针对每个需要计算quality的位点,采用如下公式计算family quality,single-strand quality,double-strand quality;(C1) For each site that needs to calculate the quality, use the following formula to calculate family quality, single-strand quality, double-strand quality;
Quality(p,B,b,r)=logr(BIAS(B,p×B,B,b)×(r-1)+1)×10×log10(OR(B,p×B,B,b))。Quality (p, B, b, r) = log r (BIAS (B, p × B, B, b) × (r-1) + 1) × 10 × log 10 (OR (B, p × B, B , b)).
其中:in:
(1)针对一个family里的reads,让r=2并且使用以下定义计算family quality:(1) For the reads in a family, let r=2 and use the following definition to calculate the family quality:
p:步骤(a3)reads去重中计算出的碱基质量的阈值所对应的错误概率;p: error probability corresponding to the threshold value of base quality calculated in step (a3) reads deduplication;
B:从数据实际观测到支持任何等位基因的read数量;B: The number of reads that support any allele from the actual observation of the data;
b:从数据实际观测到支持突变等位基因的read数量。b: Number of reads actually observed from the data supporting the mutant allele.
(2)如果高通量测序实验采用PCR扩增方法建库,则r=1.5,如果高通量测序实验采用捕获或者WGS方法建库,则r=1.25。比对到一条染色体链上的families都在一条single-strand上;针对一条single-strand,使用以下定义计算single-strand quality(最大family quality如下:对于C:G>T:A突变形式是44,对于T:A>C:G突变形式是48,对于其他SNV突变形式是52,对于所有InDel突变形式是60)。如果single-strand quality大于最大family quality数值,则让single-strand quality等于最大family quality数值;(2) If the high-throughput sequencing experiment uses the PCR amplification method to build the library, then r=1.5, and if the high-throughput sequencing experiment uses the capture or WGS method to build the library, then r=1.25. The families compared to a chromosome chain are all on a single-strand; for a single-strand, use the following definition to calculate the single-strand quality (the maximum family quality is as follows: for C: G > T: A mutant form is 44, For T: A>C:G mutant form is 48, for other SNV mutant forms is 52, for all InDel mutant forms is 60). If the single-strand quality is greater than the maximum family quality value, then let the single-strand quality be equal to the maximum family quality value;
p:每个可能的family quality(即尝试所有的family qualities)所对应的错误概率;p: The error probability corresponding to each possible family quality (ie try all family qualities);
B:实际观测到支持任何等位基因的family数量;B: The number of families actually observed to support any allele;
b:实际观测到支持突变等位基因的family数量;b: The number of families actually observed to support the mutant allele;
在尝试所有的p之后,取最大single-strand quality作为最终single-strandquality。After trying all p, take the largest single-strand quality as the final single-strand quality.
(3)每一条single-strand在一条double-strand上的正向或反向;针对一条single-strand,使用以下定义计算double-strand quality:(3) The forward or reverse direction of each single-strand on a double-strand; for a single-strand, use the following definition to calculate the double-strand quality:
v1′=v1+v2×min(1,(min(w1+w2,w0)-w1)/w1);v 1 '=v 1 +v 2 ×min(1, (min(w 1 +w 2 ,w 0 )-w 1 )/w 1 );
其中v1′是某一条链上的最终变异质量,v1是这一条链的变异质量(即最终single-strand quality),v2是另外一条链上的变异质量,w1是这一条链的family quality的平均值,w2是另外一条链的family quality的平均值,w0等于最大family quality(最大familyquality如下:对于C:G>T:A突变形式是44,对于T:A>C:G突变形式是48,对于其他SNV突变形式是52,对于所有InDel突变形式是60)加上10;Among them, v 1 ′ is the final mutation quality on a certain chain, v 1 is the mutation quality of this chain (that is, the final single-strand quality), v 2 is the mutation quality on another chain, and w 1 is the mutation quality of this chain. The average value of family quality, w 2 is the average value of the family quality of another chain, w 0 is equal to the maximum family quality (the maximum family quality is as follows: for C: G>T: A mutant form is 44, for T: A>C: G mutant form is 48, for other SNV mutant forms is 52, for all InDel mutant forms is 60) plus 10;
交换v1和v2并且交换w1和w2,根据计算v1′公式的对称性计算v2′;Exchange v 1 and v 2 and exchange w 1 and w 2 to calculate v 2 ′ according to the symmetry of the formula for calculating v 1 ′;
v2′=v2+v1×min(1,(min(w1+w2,w0)-w2)/w2)。v 2 ′=v 2 +v 1 ×min(1, (min(w 1 +w 2 , w 0 )−w 2 )/w 2 ).
(C2)把double-strand quality定义成单个样本变异质量。(C2) Define double-strand quality as the quality of variation of a single sample.
(C3)按照上述(C1)和(C2),分别计算得到tumor样本和normal对照样本的单样本突变质量。然后计算TLOD。TLOD代表变异既不来源于躯体(soma)也不来源于胚胎(germline)的概率,具体如下:(C3) According to the above (C1) and (C2), calculate the single-sample mutation quality of the tumor sample and the normal control sample, respectively. Then calculate TLOD. TLOD represents the probability that the variation is neither derived from the body (soma) nor from the embryo (germline), as follows:
首先,计算以下函数的输出值。First, calculate the output value of the following function.
TNreward(a1,a2,b1,b2)=maxH(TNrewardBinomial(a1,a2,b1,b2),PLQTN(b2/b1,a2/a1));TNreward(a 1 , a 2 , b 1 , b 2 )=maxH(TNrewardBinomial(a 1 , a 2 , b 1 , b 2 ), PLQTN(b 2 /b 1 , a 2 /a 1 ));
其中,TNrewardBinomial(a1,a2,b1,b2)=10/log(10)×(a1×KLBernoulli(b2/b1,a2/a1))Among them, TNrewardBinomial(a 1 , a 2 , b 1 , b 2 )=10/log(10)×(a1×KL Bernoulli(b 2 /b 1 , a 2 /a 1 ))
如果max(a,-a)<max(b,-b),那么maxH(a,b)=a,否则maxH(a,b)=b;If max(a,-a)<max(b,-b), then maxH(a,b)=a, otherwise maxH(a,b)=b;
PLQTN(t,n)=3×10×log10(min(t/n,102.5/3))。PLQTN(t, n)=3×10×log 10 (min(t/n, 10 2.5/3 )).
TNpenal(a1,a2,b1,b2)=max(0,min(DSVQnormal,PLQ(a2/a1))-12.5×(max(0,OR(a1,a2,b1,b2)-1))2);其中DSVQnormal指的是normal样本的单样本突变质量,也就是normal样本的double-strand quality。TNpenal(a 1 , a 2 , b 1 , b 2 )=max(0, min(DSVQnormal, PLQ(a 2 /a 1 ))-12.5×(max(0, OR(a 1 , a 2 , b 1 , b 2 )-1)) 2 ); where DSVQnormal refers to the single-sample mutation quality of the normal sample, that is, the double-strand quality of the normal sample.
a1、a2、b1和b2含义如下:a 1 , a 2 , b 1 and b 2 have the following meanings:
a1:配对normal对照样本中支持任何等位基因的reads的family quality总和;a 1 : The sum of the family quality of the reads supporting any allele in the paired normal control sample;
a2:配对normal对照样本中支持突变等位基因的reads的family quality总和;a 2 : The sum of the family quality of the reads supporting the mutant allele in the paired normal control sample;
b1:tumor样本中支持任何等位基因的reads的family quality总和;b 1 : the sum of family quality of reads supporting any allele in the tumor sample;
b2:tumor样本中支持突变等位基因的reads的family quality总和。b 2 : The sum of the family quality of the reads supporting the mutant allele in the tumor sample.
然后,计算TLOD,计算公式如下:Then, calculate TLOD, the calculation formula is as follows:
TLOD(a1,a2,b1,b2)=min(DSVQtumor,PLQ(b2/b1))+TNreward(a1,a2,b1,b2)-TNpenal(a1,a2,b1,b2)。TLOD(a 1 ,a 2 ,b 1 ,b 2 )=min(DSVQtumor,PLQ(b 2 /b 1 ))+TNreward(a 1 ,a 2 ,b 1 ,b 2 )-TNpenal(a 1 ,a 2 , b 1 , b 2 ).
其中,PLQ(f)=90+3×10×log10(f)。DSVQtumor指的是tumor样本的单样本突变质量,也就是tumor样本的double-strand quality。Wherein, PLQ(f)=90+3×10×log 10 (f). DSVQtumor refers to the single-sample mutation quality of the tumor sample, that is, the double-strand quality of the tumor sample.
(C4)计算NLOD。NLOD代表来变异来源于胚胎(germline)的概率。具体如下:(C4) Calculate NLOD. NLOD represents the probability that the variation originates from the embryo (germline). details as follows:
首先,针对每个样本(总共有tumor和配对的normal两个样本),计算每个genotype(GT)的genotype likelihood(GL)。GT和GL的定义和说明在https://samtools.github.io/hts-specs/VCFv4.2.pdf中。总而言之,homref(纯合野生型),hetero(杂合型)和homalt(纯合突变型)是三种GT。这三种GT的GL的计算方法如下:First, for each sample (there are two samples of tumor and paired normal in total), the genotype likelihood (GL) of each genotype (GT) is calculated. The definitions and descriptions of GT and GL are in https://samtools.github.io/hts-specs/VCFv4.2.pdf. To sum up, homref (homozygous wild type), hetero (heterozygous type) and homalt (homozygous mutant type) are three types of GT. The calculation methods of the GL of these three GTs are as follows:
GL(f,homref)=max(GLPowerLaw(f,homref),GLBinomial(f,homref));GL(f, homref) = max(GLPowerLaw(f, homref), GLBinomial(f, homref));
GL(f,hetero)=max(GLPowerLaw(f,hetero),GLBinomial(f,hetero));GL(f, hetero) = max(GLPowerLaw(f, hetero), GLBinomial(f, hetero));
GL(f,homalt)=max(GLPowerLaw(f,homalt),GLBinomial(f,homalt))。GL(f, homalt) = max(GLPowerLaw(f, homalt), GLBinomial(f, homalt)).
其中,GLPowerLaw(f,homref)=10×3×min(0,log1o(0.02×(1-f)/f));Among them, GLPowerLaw(f, homref) = 10×3×min(0, log 1o (0.02×(1-f)/f));
GLPowerLaw(f,hetero)=10×3×min(0,log10(min(f/(1-f),(1-f)/f)));GLPowerLaw(f, hetero) = 10×3×min(0, log 10 (min(f/(1-f), (1-f)/f)));
GLPowerLaw(f,homalt)=10×3×min(0,log10(0.02×f/(1-f)));GLPowerLaw(f, homalt) = 10×3×min(0, log 10 (0.02×f/(1-f)));
GLBinomial(f,homref)=10/log(10)×d×KLBernoulli(min(0.02,f),f);GLBinomial(f, homref) = 10/log(10) × d × KL Bernoulli(min(0.02, f), f);
GLBinomial(f,hetero)=10/log(10)×d×KLBernoulli(0.5,f);GLBinomial(f, hetero)=10/log(10)×d×KL Bernoulli(0.5,f);
GLBinomial(f,homalt)=10/log(10)×d×KLBernoulli(max(1-0.02,f),f);GLBinomial(f,homalt)=10/log(10)×d×KL Bernoulli(max(1-0.02,f),f);
KLBernoulli(x,y)=y×log(y/x)+(1-y)×log((1-y)/(1-x));KL Bernoulli(x,y)=y×log(y/x)+(1-y)×log((1-y)/(1-x));
如果x=0、y=0、x是不定式或者y是不定式,那么KLBernoulli(x,y)=0。比如说,(0/0)是不定式。If x=0, y=0, x is an infinitive or y is an infinitive, then KL Bernoulli(x,y)=0. For example, (0/0) is an infinitive.
并且a指的是支持突变型基因的read数量,d指的是支持任何基因型的read数量,f=a/d是突变频率。And a refers to the number of reads supporting a mutant gene, d refers to the number of reads supporting any genotype, and f=a/d is the mutation frequency.
然后计算NLOD。NLOD=max(Qtumorhomref,Qnormalhomref)。Then calculate the NLOD. NLOD=max(Qtumorhomref, Qnormalhomref).
其中在tumor样本上,Qtumorhomref=-3-max(GL(f,homalt),GL(f,hetero))。Qtumorhomref是使用tumor样本数据计算的,所以跟配对的normal样本无任何关系。where on the tumor sample, Qtumorhomref=-3-max(GL(f, homalt), GL(f, hetero)). Qtumorhomref is calculated using tumor sample data, so it has nothing to do with paired normal samples.
其中在配对的normal样本上,Qnormalhomref=GL(f,homref)-max(GL(f,homalt),GL(f,hetero))。Qnormalhomref是使用配对的normal样本数据计算的,所以跟tumor样本无任何关系。Wherein on the paired normal samples, Qnormalhomref=GL(f, homref)-max(GL(f, homalt), GL(f, hetero)). Qnormalhomref is calculated using paired normal sample data, so it has nothing to do with tumor samples.
(C5)最终变异质量是:min(TLOD,NLOD+31)。如果最终变异质量至少是60,则判定为阳性并且输出变异形式;否则判定为阴性。60的质量代表百万分之一的假阳率。变异检测软件MuTect(Cibulskis K,Lawrence M S,Carter S L,et al.Sensitive detection ofsomatic point mutations in impure and heterogeneous cancer samples[J].Naturebiotechnology,2013,31(3):213.)间接地使用了60的阈值,同时BWA能输出的最大比对质量是60。因此使用了60的阈值。变异输出格式为VCF。VCF格式定义在(Danecek P,Auton A,Abecasis G,et al.The variant call format and VCFtools[J].Bioinformatics,2011,27(15):2156-2158.)中有详细描述。(C5) The final mutation quality is: min(TLOD, NLOD+31). If the final mutation quality is at least 60, it is judged as positive and the mutant form is output; otherwise, it is judged as negative. A quality of 60 represents a false positive rate of one in a million. The mutation detection software MuTect (Cibulskis K, Lawrence M S, Carter S L, et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples[J]. Naturebiotechnology, 2013, 31(3): 213.) indirectly used The threshold of 60, and the maximum comparison quality that BWA can output is 60. Therefore a threshold of 60 was used. The mutation output format is VCF. The VCF format definition is described in detail in (Danecek P, Auton A, Abecasis G, et al. The variant call format and VCFtools[J]. Bioinformatics, 2011, 27(15): 2156-2158.).
第二方面,本发明要求保护一种基于高通量测序的基因变异检测系统。In the second aspect, the present invention claims a gene variation detection system based on high-throughput sequencing.
本发明所要求保护的基于高通量测序的基因变异检测系统,包括装置A、装置B和装置C;The gene variation detection system based on high-throughput sequencing as claimed in the present invention includes device A, device B and device C;
所述装置A能够按照前文第一方面中步骤(A)对高通量测序数据进行预处理,主要包括碱基质量校正、read去重;The device A can preprocess the high-throughput sequencing data according to step (A) in the first aspect above, mainly including base quality correction and read deduplication;
所述装置B能够按照前文第一方面中步骤(B)计算测序产生的偏好性,通过偏好性减少支持突变的有效read的数量,获得校正后的支持突变型的reads的数量;The device B can calculate the preference generated by sequencing according to the step (B) in the first aspect above, and reduce the number of effective reads supporting the mutation through the preference to obtain the corrected number of reads supporting the mutation;
所述装置C能够按照前文第一方面中步骤(C)利用校正后的支持突变型的read数目,计算所有单个样本相关质量(quality),然后得出单个样本变异质量,通过变异质量进行判定。The device C can calculate the relevant quality of all individual samples according to the step (C) in the first aspect above by using the corrected number of reads supporting the mutant type, and then obtain the variation quality of a single sample, and judge by the variation quality.
第三方面,本发明要求保护如下任一应用:In the third aspect, the present invention claims any of the following applications:
(I)前文第一方面所述的方法或前文第二方面所述的系统在制备用于肿瘤早筛、肿瘤预后、肿瘤分类和/或肿瘤用药指导的产品中的应用;(1) Application of the method described in the first aspect above or the system described in the second aspect above in the preparation of products for early tumor screening, tumor prognosis, tumor classification and/or tumor drug guidance;
所述产品能够按照前文第一方面所述方法的步骤实现从高通量测序数据中检测基因变异。The product can detect gene variation from high-throughput sequencing data according to the steps of the method described in the first aspect above.
(II)前文第一方面所述的方法或前文第二方面所述的系统在肿瘤早筛、肿瘤预后、肿瘤分类和/或肿瘤用药指导中的应用。(II) Application of the method described in the first aspect above or the system described in the second aspect above in early tumor screening, tumor prognosis, tumor classification and/or tumor drug guidance.
实验证明,针对组织样本的WGS数据,本发明提高了真阳率并且降低了假阳率,并且有可能减少生物学试验。Experiments prove that, for the WGS data of tissue samples, the present invention increases the true positive rate and reduces the false positive rate, and may reduce biological tests.
本发明由于采取以上技术方案,其具有以下优点:The present invention has the following advantages due to the adoption of the above technical scheme:
1、几乎没有漏检,也就是说几乎没有假阴性检测结果。1. Almost no missed detection, that is to say, almost no false negative test results.
2、产生的假阳突变数目很少。2. The number of false positive mutations produced is very small.
3、使用本发明描述的算法实现的软件比大多数软件的运行速度都更快。3. The software implemented using the algorithm described in the present invention runs faster than most software.
附图说明Description of drawings
图1为本发明方法跟类似软件在带有分子标签的数据上做的性能比较。Figure 1 is a performance comparison between the method of the present invention and similar software on data with molecular labels.
图2为本发明方法跟类似软件在带有分子标签的数据上做的性能比较。Figure 2 is a performance comparison between the method of the present invention and similar software on data with molecular labels.
图3为本发明方法跟类似软件在带有分子标签的数据上做的性能比较。Figure 3 is a performance comparison between the method of the present invention and similar software on data with molecular labels.
图4为本发明方法跟类似软件在带有分子标签的数据上做的性能比较。Fig. 4 is a performance comparison between the method of the present invention and similar software on data with molecular labels.
图5为本发明方法跟类似软件在tumor-normal组织数据上做的性能比较。Fig. 5 is the performance comparison of the method of the present invention and similar software on tumor-normal tissue data.
图6为本发明方法跟类似软件在tumor-normal组织数据上做的性能比较。Fig. 6 is the performance comparison done by the method of the present invention and similar software on tumor-normal tissue data.
各图中,精确率=真阳数量/(真阳数量+假阳数量)。召回率=真阳数量/(真阳数量+假阴数量)。In each figure, precision rate=number of true positives/(number of true positives+number of false positives). Recall = Number of True Positives/(Number of True Positives + Number of False Negatives).
具体实施方式Detailed ways
下述实施例中所使用的实验方法如无特殊说明,均为常规方法。The experimental methods used in the following examples are conventional methods unless otherwise specified.
下述实施例中所用的材料、试剂等,如无特殊说明,均可从商业途径得到。The materials and reagents used in the following examples can be obtained from commercial sources unless otherwise specified.
实施例1、本发明基于高通量测序的基因变异检测方法的建立Embodiment 1, the establishment of the gene variation detection method based on high-throughput sequencing of the present invention
以SNV变异为例,详述本发明所建立的基于高通量测序的基因变异检测方法。Taking SNV variation as an example, the high-throughput sequencing-based gene variation detection method established in the present invention is described in detail.
一、将样本基因组DNA进行高通量测序。1. Perform high-throughput sequencing of the genomic DNA of the sample.
二、对高通量测序的数据进行预处理,主要包括:碱基质量校正、read去重;具体方法包括:2. Preprocessing the data of high-throughput sequencing, mainly including: base quality correction, read deduplication; specific methods include:
本发明采用“Ewing B,Green P.Base-calling of automated sequencer tracesusing phred.II.Error probabilities[J].Genome research,1998,8(3):186-194.”第一个公式中对质量(quality)的定义,因此任何质量(比如说:碱基质量和变异质量)都跟概率有一一对应的关系。因此质量和概率是可互换的。The present invention adopts "Ewing B, Green P. Base-calling of automated sequencer tracing using phred.II. Error probabilities [J]. Genome research, 1998, 8 (3): 186-194." Quality ( quality), so any quality (for example: base quality and mutation quality) has a one-to-one correspondence with probability. So mass and probability are interchangeable.
所述碱基质量校正,具体如下: The base quality correction is as follows :
a.如果是双端测序,需在去重和过滤前进行合并,将R1端和R2端进行合并(merge),然后使用合并后的reads的碱基质量。a. If it is paired-end sequencing, merge before deduplication and filtering, merge the R1 and R2 ends, and then use the base quality of the merged reads.
对于R1和R2的重叠区内,合并后碱基质量的计算方法如下:For the overlapping region of R1 and R2, the calculation method of the combined base quality is as follows:
当p和q对应的碱基一样时:max(p,q);When the bases corresponding to p and q are the same: max(p, q);
当p和q对应的碱基不一样时:max(p,q)-min(p,q);When the bases corresponding to p and q are different: max(p, q)-min(p, q);
其中,p是R1端校正前的碱基质量,q是R2端校正前的碱基质量。Among them, p is the base quality before R1 end correction, and q is the base quality before R2 end correction.
对于R1和R2的非重叠区内的碱基质量,合并后的碱基质量等于校正前的碱基质量。For the base quality in the non-overlapping region of R1 and R2, the combined base quality is equal to the base quality before correction.
b.如果是单端测序,不需要合并(校正后的碱基质量等于校正前的碱基质量)。b. If it is single-end sequencing, no merging is required (the base quality after correction is equal to the base quality before correction).
所述reads去重,具体如下: The reads are deduplicated, as follows :
方法包括:把PCR或者光学重复的reads放到一个family(簇)之中从而去除重复,具体步骤包括如下:The method includes: putting PCR or optically repeated reads into a family (cluster) to remove duplication, and the specific steps include the following:
a)对于每个基因座(每一条染色体上的每一个基因座,例如:chr1:1,chr1:2,...,chrl:249250621,chr2:1,...),找到在这个基因座起始或终止的reads的数目(简称末端深度)。a) For each locus (every locus on each chromosome, for example: chr1:1, chr1:2, ..., chrl:249250621, chr2:1, ...), find the The number of reads starting or ending (referred to as terminal depth).
b)每个基因座x吸引其他基因座y,计算x对于y的吸引力强度s,吸引力强度s跟x末端深度(这个基因座起始/终止的reads的数目)成正比(比例常数=1),跟y末端深度成反比(比例常数=1)并且跟x与y之间碱基数目z成指数衰减关系(衰减系数=5)。b) Each locus x attracts other loci y, and calculates the attraction strength s of x to y. The attraction strength s is proportional to the depth of the end of x (the number of reads starting/terminating at this locus) (proportionality constant = 1), which is inversely proportional to the depth of the y end (proportionality constant = 1) and has an exponential decay relationship with the base number z between x and y (attenuation coefficient = 5).
如果y被x强烈吸引,那么y被认为是x的一个PCR或者光学重复。If y is strongly attracted to x, then y is considered a PCR or optical duplicate of x.
吸引力强度s的公式如下:The formula for the strength of attraction s is as follows:
s=x/y×5-z;s=x/y×5 -z ;
如果s>1(y被x强烈吸引),那么认为y是x的一个PCR或者光学重复,从而把x和y放到一个family里。If s > 1 (y is strongly attracted to x), then y is considered to be a PCR or optical repeat of x, thus putting x and y into a family.
c)过滤:计算碱基质量的阈值(threshold),公式如下:c) Filtering: calculate the threshold of base quality (threshold), the formula is as follows:
Threshold(P)=argmaxp∈P(p-|{q∈P|q≤p}|);Threshold(P)=argmax p∈P (p -|{q∈P|q≤p}| );
其中,P是所有的碱基质量对应的测序错误概率(一个碱基质量对应一个测序错误概率)的集合;数学符号∈代表集合归属,因此p∈P的意思是指p是P中任何一个元素,而q∈P的意思是指q是P中任何一个元素。Among them, P is the set of sequencing error probabilities corresponding to all base qualities (one base quality corresponds to one sequencing error probability); the mathematical symbol ∈ represents the set belonging, so p∈P means that p is any element in P , and q∈P means that q is any element in P.
根据碱基质量阈值对reads上的碱基进行过滤(filter):把校正后碱基质量低于碱基质量阈值的碱基过滤掉。如果一个碱基被过滤掉了,那么认为这个碱基不存在,可以理解为没有测出任何碱基。Filter the bases on the reads according to the base quality threshold: filter out the bases whose base quality after correction is lower than the base quality threshold. If a base is filtered out, it is considered that the base does not exist, and it can be understood that no base has been detected.
需要注意的是,过滤碱基和过滤reads是两个概念。It should be noted that filtering bases and filtering reads are two concepts.
过滤之后,对每个family,取family最常见碱基(碱基出现次数最多)作为去重后consensus family碱基,也就是去重后的碱基。在此family既可以是针对碱基的也可以是针对reads的。After filtering, for each family, the most common base (the base with the most occurrences) of the family is taken as the consensus family base after deduplication, that is, the base after deduplication. Here family can be either base-oriented or reads-oriented.
然后,把重复的reads整合成一条consensus read(一致read)。在此需要提到无论去重的是reads还是碱基,去重后最终结果一样。因此既可以理解成对reads进行去重,也可以理解成对碱基进行过滤。Then, integrate the repeated reads into a consensus read (consistent read). It should be mentioned here that no matter whether the deduplication is reads or bases, the final result is the same after deduplication. Therefore, it can be understood as deduplication of reads, or as filtering of bases.
三、计算测序产生的偏好性,通过偏好性减少支持突变的有效reads数量,获得校正后的支持突变型的read数量。3. Calculate the preference generated by sequencing, reduce the number of effective reads that support mutations through preference, and obtain the corrected number of reads that support mutations.
1、计算测序过程中产生的偏好性,包括:deduplication bias,position bias,strand bias和mismatch bias。1. Calculate the bias generated during the sequencing process, including: deduplication bias, position bias, strand bias and mismatch bias.
Bias(a1,a2,b1,b2)=max(0,OR(LS(a1,a2,b1,b2))-1)*(b2÷(b2+b1))+1;Bias(a 1 , a 2 , b 1 , b 2 )=max(0, OR(LS(a 1 , a 2 , b 1 , b 2 ))-1)*(b 2 ÷(b 2 +b 1 ))+1;
在公式中,OR代表比值比;OR的计算公式如下:In the formula, OR stands for odds ratio; OR is calculated as follows:
OR(a1,a2,b1,b2)=(a1×b2)÷(a2×b1)。OR(a 1 , a 2 , b 1 , b 2 )=(a 1 ×b 2 )÷(a 2 ×b 1 ).
在公式中,LS代表Laplace平滑;LS公式如下:LS(a,b,c,d)=Laplace2(Laplace1(a,b,c,d));其中,Laplace1公式如下:Laplace1(a,b,c,d)=(a+p,b+p,c+p,d+p);Laplace2公式如下:Laplace2(a,b,c,d)=(a+p×α,b+p×α,c+p×β,d+p×β);In the formula, LS represents Laplace smoothing; LS formula is as follows: LS (a, b, c, d)=Laplace 2 (Laplace 1 (a, b, c, d)); Wherein, Laplace 1 formula is as follows: Laplace 1 ( a, b, c, d)=(a+p, b+p, c+p, d+p); Laplace 2 formula is as follows: Laplace 2 (a, b, c, d)=(a+p×α , b+p×α, c+p×β, d+p×β);
在Laplace1公式和Laplace2公式中,p的默认值是0.5;In the Laplace 1 formula and Laplace 2 formula, the default value of p is 0.5;
在Laplace2公式中,α和β公式如下:α=max(0,p×((a1+a2)/(b1+b2)-1));β=max(0,p×(b1+b2)/(a1+a2)-1);In the Laplace 2 formula, the formulas of α and β are as follows: α=max(0, p×((a 1 +a 2 )/(b 1 +b 2 )-1)); β=max(0, p×( b 1 +b 2 )/(a 1 +a 2 )-1);
对于deduplication bias而言,各参数的意义如下:For deduplication bias, the meaning of each parameter is as follows:
a1:去重去掉的支持任何等位基因的reads的数量;a 1 : the number of reads that support any alleles removed by deduplication;
a2:去重之后,支持任何等位基因的reads的数量;a 2 : After deduplication, the number of reads supporting any allele;
b1:去重去掉的支持突变等位基因的reads的数量;b 1 : the number of reads supporting mutant alleles removed by deduplication;
b2:去重之后,支持突变等位基因的reads的数量。b 2 : After deduplication, the number of reads supporting the mutant allele.
如果去重前有7条reads,去重后剩下2条reads,那么去重去掉了7-2=5条reads。If there are 7 reads before deduplication and 2 reads are left after deduplication, then 7-2=5 reads are removed after deduplication.
对于position bias而言,各参数的意义如下:For position bias, the meaning of each parameter is as follows:
针对每个方向(总共左右两个方向),按照方向延长1、3、6、10、15、21、28、36、45、55、66个碱基;然后代入以下四个数值:For each direction (a total of left and right directions), extend 1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 66 bases according to the direction; then substitute the following four values:
a1:在已经跨越延长后基因座的reads里,支持任何等位基因的reads的数量;a 1 : Among the reads that have spanned the extended locus, the number of reads supporting any allele;
a2:在没有跨越延长后基因座的reads里,支持任何等位基因的reads的数量;a 2 : Among the reads that do not span the extended locus, the number of reads supporting any allele;
b1:在已经跨越延长后基因座的reads里,支持突变等位基因的reads的数量;b 1 : among the reads that have spanned the extended locus, the number of reads supporting the mutant allele;
b2:在没有跨越延长后基因座的reads里,支持突变等位基因的reads的数量。b 2 : Number of reads supporting the mutant allele among reads that do not span the extended locus.
依次代入11个延长单位产生一条由11个数值组成的序列;序列中取每3个相邻数值作为一个数组,从而产生9个数组;对每个数组取最小值,从而产生9个最小值;取9个最小值的最大值为position bias。Substituting 11 extended units in turn produces a sequence consisting of 11 values; taking every 3 adjacent values in the sequence as an array, thereby generating 9 arrays; taking the minimum value for each array, thereby generating 9 minimum values; Take the maximum value of the 9 minimum values as position bias.
对于strand bias而言,代入以下数值:For strand bias, substitute the following values:
a1:在染色体链S的正向,支持任何等位基因的reads的数量;a 1 : In the forward direction of the chromosome strand S, the number of reads supporting any allele;
a2:在染色体链S的反向,支持任何等位基因的reads的数量;a 2 : In the reverse direction of the chromosome strand S, the number of reads supporting any allele;
b1:在染色体链S的正向,支持突变等位基因的reads的数量;b 1 : in the forward direction of the chromosome strand S, the number of reads supporting the mutant allele;
b2:在染色体链S的反向,支持突变等位基因的reads的数量。b 2 : In the reverse direction of chromosome strand S, the number of reads supporting the mutant allele.
同时计算base quality imbalance(简称BQI,公式中用I表示,有1个I)和strandposition imbalance(简称SPI,公式中用I表示,有2个I),通过以下方法计算出3个I的数值,然后取最小数值;然后让strand bias除以此最小值,以此来归一化strand bias;最后确保strand bias最低至少是1;Simultaneously calculate base quality imbalance (abbreviated as BQI, represented by I in the formula, with 1 I) and strandposition imbalance (SPI for short, represented by I in the formula, with 2 Is), and calculate the values of 3 Is by the following method, Then take the minimum value; then divide the strand bias by this minimum value to normalize the strand bias; finally ensure that the strand bias is at least 1;
对于BQI,c是另外一条链的所有等位基因的平均碱基质量,d是这一条链的突变等位基因的平均碱基质量,而I=10d-c;For BQI, c is the average base quality of all alleles of another chain, d is the average base quality of the mutant alleles of this chain, and I=10 dc ;
对于SPI,c是另外一条链的所有等位基因的到read末端的平均碱基距离,d是怀疑有偏好性链的突变等位基因的到read末端的平均碱基距离,而每一条read有两个末端:左边和右边。因此左末端和右末端分别产生两个I。For SPI, c is the average base distance to the end of the read for all alleles of the other strand, d is the average base distance to the end of the read for the mutant alleles of the suspected biased strand, and Each read has two ends: left and right. So two I's are generated at the left and right ends respectively.
对于mismatch bias而言,让x=1,2,3,...,11;x是mismatch数量阈值(通过mismatch数量<x可以推断出)。1,2,……,11是不同的尝试的阈值。For the mismatch bias, let x=1, 2, 3, ..., 11; x is the mismatch number threshold (it can be inferred by the mismatch number<x). 1, 2, ..., 11 are thresholds for different attempts.
针对每个x的值,代入以下四个数值:For each value of x, substitute the following four values:
a1:mismatch数量<x,支持任何等位基因的reads数量;a 1 : the number of mismatches < x, the number of reads supporting any allele;
a2:mismatch数量≥x,支持任何等位基因的reads数量;a 2 : The number of mismatches ≥ x, the number of reads supporting any allele;
b1:mismatch数量<x,支持突变等位基因的reads数量;b 1 : the number of mismatches < x, the number of reads supporting the mutant allele;
b2:mismatch数量≥x,支持突变等位基因的reads数量;b 2 : the number of mismatches ≥ x, the number of reads supporting the mutant allele;
依次代入11个x值产生一条由11个数值组成的序列;序列中取每5个相邻数值作为一个数组,从而产生7个数组;对每个数组取最小值,从而产生7个最小值;取7个最小值的最大值为mismatch bias。Substituting 11 x values in turn produces a sequence consisting of 11 values; taking every 5 adjacent values in the sequence as an array, thus generating 7 arrays; taking the minimum value for each array, thereby generating 7 minimum values; Take the maximum value of the 7 minimum values as the mismatch bias.
根据上述在每一条染色体链算出四个偏好性。选最高偏好性作为染色体链整体偏好性。对支持突变型的read数目做校正,校正后的reads数目等于校正前的reads数目除以整体偏好性。Four preferences are calculated on each chromosome strand according to the above. Select the highest preference as the overall preference of the chromosome chain. Correct the number of reads that support the mutant type, and the number of reads after correction is equal to the number of reads before correction divided by the overall preference.
四、利用校正后的支持突变型的read数目,计算所有单个样本相关质量(quality),然后得出单个样本变异质量,通过变异质量进行判定。4. Use the corrected number of reads supporting the mutant type to calculate the relevant quality of all individual samples, and then obtain the variation quality of a single sample, and judge by the variation quality.
包括:include:
1、针对每个需要计算quality的位点,采用以下公式计算family quality,single-strand quality,double-strand quality,tumor-normal quality和variantquality。1. For each locus whose quality needs to be calculated, use the following formula to calculate family quality, single-strand quality, double-strand quality, tumor-normal quality and variant quality.
Quality(p,B,b,r)=logr(BIAS(B,p×B,B,b)×(r-1)+1)×10×log10(OR(B,p×B,B,b))。Quality (p, B, b, r) = log r (BIAS (B, p × B, B, b) × (r-1) + 1) × 10 × log 10 (OR (B, p × B, B , b)).
其中:in:
(1)针对一个family里的reads,让r=2并且使用以下定义计算family quality:(1) For the reads in a family, let r=2 and use the following definition to calculate the family quality:
p:步骤(a3)reads去重中计算出的碱基质量的阈值所对应的错误概率;p: error probability corresponding to the threshold value of base quality calculated in step (a3) reads deduplication;
B:从数据实际观测到支持任何等位基因的read数量;B: The number of reads that support any allele from the actual observation of the data;
b:从数据实际观测到支持突变等位基因的read数量。b: Number of reads actually observed from the data supporting the mutant allele.
(2)如果高通量测序实验采用PCR扩增方法建库,则r=1.5,如果高通量测序实验采用捕获或者WGS方法建库,则r=1.25。比对到一条染色体链上的families都在一条single-strand上;针对一条single-strand,使用以下定义计算single-strand quality(最大family quality如下:对于C:G>T:A突变形式是44,对于T:A>C:G突变形式是48,对于其他SNV突变形式是52,对于所有InDel突变形式是60)。如果single-strand quality大于最大family quality数值,则让single-strand quality等于最大family quality数值;(2) If the high-throughput sequencing experiment uses the PCR amplification method to build the library, then r=1.5, and if the high-throughput sequencing experiment uses the capture or WGS method to build the library, then r=1.25. The families compared to a chromosome chain are all on a single-strand; for a single-strand, use the following definition to calculate the single-strand quality (the maximum family quality is as follows: for C: G > T: A mutant form is 44, For T: A>C:G mutant form is 48, for other SNV mutant forms is 52, for all InDel mutant forms is 60). If the single-strand quality is greater than the maximum family quality value, then let the single-strand quality be equal to the maximum family quality value;
p:每个可能的family quality(即尝试所有的family qualities)所对应的错误概率;p: The error probability corresponding to each possible family quality (ie try all family qualities);
B:实际观测到支持任何等位基因的family数量;B: The number of families actually observed to support any allele;
b:实际观测到支持突变等位基因的family数量;b: The number of families actually observed to support the mutant allele;
在尝试所有的p之后,取最大single-strand quality作为最终single-strandquality。After trying all p, take the largest single-strand quality as the final single-strand quality.
(3)每一条single-strand在一条double-strand上的正向或反向;针对一条single-strand,使用以下定义计算double-strand quality:(3) The forward or reverse direction of each single-strand on a double-strand; for a single-strand, use the following definition to calculate the double-strand quality:
v1′=v1+v2×min(1,(min(w1+w2,w0)-w1)/w1);v 1 '=v 1 +v 2 ×min(1, (min(w 1 +w 2 ,w 0 )-w 1 )/w 1 );
其中v1′是某一条链上的最终变异质量,v1是这一条链的变异质量(即最终single-strand quality),v2是另外一条链上的变异质量,w1是这一条链的family quality的平均值,w2是另外一条链的family quality的平均值,w0等于最大family quality(最大familyquality如下:对于C:G>T:A突变形式是44,对于T:A>C:G突变形式是48,对于其他SNV突变形式是52,对于所有InDel突变形式是60)加上10;Among them, v 1 ′ is the final mutation quality on a certain chain, v 1 is the mutation quality of this chain (that is, the final single-strand quality), v 2 is the mutation quality on another chain, and w 1 is the mutation quality of this chain. The average value of family quality, w 2 is the average value of the family quality of another chain, w 0 is equal to the maximum family quality (the maximum family quality is as follows: for C: G>T: A mutant form is 44, for T: A>C: G mutant form is 48, for other SNV mutant forms is 52, for all InDel mutant forms is 60) plus 10;
交换v1和v2并且交换w1和w2,根据计算v1′公式的对称性计算v2′;Exchange v 1 and v 2 and exchange w 1 and w 2 to calculate v 2 ′ according to the symmetry of the formula for calculating v 1 ′;
v2′=v2+v1×min(1,(min(w1+w2,w0)-w2)/w2)。v 2 ′=v 2 +v 1 ×min(1, (min(w 1 +w 2 , w 0 )−w 2 )/w 2 ).
2、把double-strand quality定义成单个样本变异质量。2. Define double-strand quality as the variation quality of a single sample.
3、按照上述步骤1和2,分别计算得到tumor样本和normal对照样本的单样本突变质量。然后计算TLOD。TLOD代表变异既不来源于躯体(soma)也不来源于胚胎(germline)的概率,具体如下:3. According to the above steps 1 and 2, calculate the single-sample mutation quality of the tumor sample and the normal control sample, respectively. Then calculate TLOD. TLOD represents the probability that the variation is neither derived from the body (soma) nor from the embryo (germline), as follows:
首先,计算以下函数的输出值。First, calculate the output value of the following function.
TNreward(a1,a2,b1,b2)=maxH(TNrewardBinomial(a1,a2,b1,b2),PLQTN(b2/b1,a2/a1));TNreward(a 1 , a 2 , b 1 , b 2 )=maxH(TNrewardBinomial(a 1 , a 2 , b 1 , b 2 ), PLQTN(b 2 /b 1 , a 2 /a 1 ));
其中,TNrewardBinomial(a1,a2,b1,b2)=10/log(10)×(a1×KLBernoulli(b2/b1,a2/a1))Among them, TNrewardBinomial(a 1 , a 2 , b 1 , b 2 )=10/log(10)×(a1×KL Bernoulli(b 2 /b 1 , a 2 /a 1 ))
如果max(a,-a)<max(b,-b),那么maxH(a,b)=a,否则maxH(a,b)=b;If max(a,-a)<max(b,-b), then maxH(a,b)=a, otherwise maxH(a,b)=b;
PLQTN(t,n)=3×10×log10(min(t/n,102.5/3))。PLQTN(t, n)=3×10×log 10 (min(t/n, 10 2.5/3 )).
TNpenal(a1,a2,b1,b2)=max(0,min(DSVQnormal,PLQ(a2/a1))-12.5×(max(0,OR(a1,a2,b1,b2)-1))2);其中DSVQnormal指的是normal样本的单样本突变质量,也就是normal样本的double-strand quality。TNpenal(a 1 , a 2 , b 1 , b 2 )=max(0, min(DSVQnormal, PLQ(a 2 /a 1 ))-12.5×(max(0, OR(a 1 , a 2 , b 1 , b 2 )-1)) 2 ); where DSVQnormal refers to the single-sample mutation quality of the normal sample, that is, the double-strand quality of the normal sample.
a1、a2、b1和b2含义如下:a 1 , a 2 , b 1 and b 2 have the following meanings:
a1:配对normal对照样本中支持任何等位基因的reads的family quality总和;a 1 : The sum of the family quality of the reads supporting any allele in the paired normal control sample;
a2:配对normal对照样本中支持突变等位基因的reads的family quality总和;a 2 : The sum of the family quality of the reads supporting the mutant allele in the paired normal control sample;
b1:tumor样本中支持任何等位基因的reads的family quality总和;b 1 : the sum of family quality of reads supporting any allele in the tumor sample;
b2:tumor样本中支持突变等位基因的reads的family quality总和。b 2 : The sum of the family quality of the reads supporting the mutant allele in the tumor sample.
然后,计算TLOD,计算公式如下:Then, calculate TLOD, the calculation formula is as follows:
TLOD(a1,a2,b1,b2)=min(DSVQtumor,PLQ(b2/b1))+TNreward(a1,a2,b1,b2)-TNpenal(a1,a2,b1,b2)。TLOD(a 1 ,a 2 ,b 1 ,b 2 )=min(DSVQtumor,PLQ(b 2 /b 1 ))+TNreward(a 1 ,a 2 ,b 1 ,b 2 )-TNpenal(a 1 ,a 2 , b 1 , b 2 ).
其中,PLQ(f)=90+3×10×log10(f)。DSVQtumor指的是tumor样本的单样本突变质量,也就是tumor样本的double-strand quality。Wherein, PLQ(f)=90+3×10×log 10 (f). DSVQtumor refers to the single-sample mutation quality of the tumor sample, that is, the double-strand quality of the tumor sample.
(C4)计算NLOD。NLOD代表来变异来源于胚胎(germline)的概率。具体如下:(C4) Calculate NLOD. NLOD represents the probability that the variation originates from the embryo (germline). details as follows:
首先,针对每个样本(总共有tumor和配对的normal两个样本),计算每个genotype(GT)的genotype likelihood(GL)。GT和GL的定义和说明在https://samtools.github.io/hts-specs/VCFv4.2.pdf中。总而言之,homref(纯合野生型),hetero(杂合型)和homalt(纯合突变型)是三种GT。这三种GT的GL的计算方法如下:First, for each sample (there are two samples of tumor and paired normal in total), the genotype likelihood (GL) of each genotype (GT) is calculated. The definitions and descriptions of GT and GL are in https://samtools.github.io/hts-specs/VCFv4.2.pdf. To sum up, homref (homozygous wild type), hetero (heterozygous type) and homalt (homozygous mutant type) are three types of GT. The calculation methods of the GL of these three GTs are as follows:
GL(f,homref)=max(GLPowerLaw(f,homref),GLBinomial(f,homref));GL(f, homref) = max(GLPowerLaw(f, homref), GLBinomial(f, homref));
GL(f,hetero)=max(GLPowerLaw(f,hetero),GLBinomial(f,hetero));GL(f, hetero) = max(GLPowerLaw(f, hetero), GLBinomial(f, hetero));
GL(f,homalt)=max(GLPowerLaw(f,homalt),GLBinomial(f,homalt))。GL(f, homalt) = max(GLPowerLaw(f, homalt), GLBinomial(f, homalt)).
其中,GLPowerLaw(f,homref)=10×3×min(0,log10(0.02×(1-f)/f));Among them, GLPowerLaw(f, homref) = 10×3×min(0, log 10 (0.02×(1-f)/f));
GLPowerLaw(f,hetero)=10×3×min(0,log10(min(f/(1-f),(1-f)/f)));GLPowerLaw(f, hetero) = 10×3×min(0, log 10 (min(f/(1-f), (1-f)/f)));
GLPowerLaw(f,homalt)=10×3×min(0,log10(0.02×f/(1-f)));GLPowerLaw(f, homalt) = 10×3×min(0, log 10 (0.02×f/(1-f)));
GLBinomial(f,homref)=10/log(10)×d×KLBernoulli(min(0.02,f),f);GLBinomial(f, homref) = 10/log(10) × d × KL Bernoulli(min(0.02, f), f);
GLBinomial(f,hetero)=10/log(10)×d×KLBernoulli(0.5,f);GLBinomial(f, hetero)=10/log(10)×d×KL Bernoulli(0.5,f);
GLBinomial(f,homalt)=10/log(10)×d×KLBernoulli(max(1-0.02,f),f);GLBinomial(f,homalt)=10/log(10)×d×KL Bernoulli(max(1-0.02,f),f);
KLBernoulli(x,y)=y×log(y/x)+(1-y)×log((1-y)/(1-x));KL Bernoulli(x,y)=y×log(y/x)+(1-y)×log((1-y)/(1-x));
如果x=0、y=0、x是不定式或者y是不定式,那么KLBernoulli(x,y)=0。比如说,(0/0)是不定式。If x=0, y=0, x is an infinitive or y is an infinitive, then KL Bernoulli(x,y)=0. For example, (0/0) is an infinitive.
并且a指的是支持突变型基因的read数量,d指的是支持任何基因型的read数量,f=a/d是突变频率。And a refers to the number of reads supporting a mutant gene, d refers to the number of reads supporting any genotype, and f=a/d is the mutation frequency.
然后计算NLOD。NLOD=max(Qtumorhomref,Qnormalhomref)。Then calculate the NLOD. NLOD=max(Qtumorhomref, Qnormalhomref).
其中在tumor样本上,Qtumorhomref=-3-max(GL(f,homalt),GL(f,hetero))。Qtumorhomref是使用tumor样本数据计算的,所以跟配对的normal样本无任何关系。where on the tumor sample, Qtumorhomref=-3-max(GL(f, homalt), GL(f, hetero)). Qtumorhomref is calculated using tumor sample data, so it has nothing to do with paired normal samples.
其中在配对的normal样本上,Qnormalhomref=GL(f,homref)-max(GL(f,homalt),GL(f,hetero))。Qnormalhomref是使用配对的normal样本数据计算的,所以跟tumor样本无任何关系。Wherein on the paired normal samples, Qnormalhomref=GL(f, homref)-max(GL(f, homalt), GL(f, hetero)). Qnormalhomref is calculated using paired normal sample data, so it has nothing to do with tumor samples.
5、计算最终变异质量5. Calculate the final mutation quality
按照如下公式计算最终变异质量:min(TLOD,NLOD+31)。Calculate the final mutation quality according to the following formula: min(TLOD, NLOD+31).
如果最终变异质量至少是60,则判定为阳性并且输出变异形式;否则判定为阴性。60的质量代表百万分之一的假阳率。变异检测软件MuTect(Cibulskis K,Lawrence M S,Carter S L,et al.Sensitive detection of somatic point mutations in impure andheterogeneous cancer samples[J].Nature biotechnology,2013,31(3):213.)间接地使用了60的阈值,同时BWA能输出的最大比对质量是60。因此使用了60的阈值。变异输出格式为VCF。VCF格式定义在(Danecek P,Auton A,Abecasis G,et al.The variant callformat and VCFtools[J].Bioinformatics,2011,27(15):2156-2158.)中有详细描述。If the final mutation quality is at least 60, it is judged as positive and the mutant form is output; otherwise, it is judged as negative. A quality of 60 represents a false positive rate of one in a million. The mutation detection software MuTect (Cibulskis K, Lawrence M S, Carter S L, et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples[J]. Nature biotechnology, 2013, 31(3): 213.) is used indirectly The threshold of 60 is set, and the maximum comparison quality that BWA can output is 60. Therefore a threshold of 60 was used. The mutation output format is VCF. The VCF format definition is described in detail in (Danecek P, Auton A, Abecasis G, et al. The variant callformat and VCFtools[J]. Bioinformatics, 2011, 27(15): 2156-2158.).
检测InDel:Detect InDel:
与SNV变异相比,在“碱基质量校正”之前增加“碱基质量估计”的步骤,其余均为SNV变异检测方法相同。“碱基质量估计”方法具体如下:Compared with SNV variation, the step of "base quality estimation" is added before "base quality correction", and the rest are the same as SNV variation detection methods. The Base Quality Estimation method is as follows:
InDelQual(x,y,a,b,c)=min(x+10,y+10,STRQual(a,b))+10×log10(c)InDelQual(x, y, a, b, c) = min(x+10, y+10, STRQual(a, b))+10×log 10 (c)
其中InDelQual是跟碱基质量有相同用途的InDel质量,x和y分别是发生InDel的前一个碱基和后一个碱基的碱基质量,a和b分别是短串联重复序列(STR)的单元大小(repeat size)和单元数量(repeat number),c是InDel的碱基数量。STRQual函数定义如下:Among them, InDelQual is the InDel quality that has the same purpose as the base quality, x and y are the base quality of the previous base and the next base where InDel occurs, respectively, and a and b are the units of the short tandem repeat (STR) The size (repeat size) and the number of units (repeat number), c is the number of bases of InDel. The STRQual function is defined as follows:
STRQual(a,b)=44-10×log10(1+8×softplus(a×b-8)/(θ×a2))STRQual(a, b)=44-10×log 10 (1+8×softplus(a×b-8)/(θ×a 2 ))
softplus函数定义如下:The softplus function is defined as follows:
softplus(v)=log(1+exp(v)),其中log指的是自然对数。softplus(v)=log(1+exp(v)), where log refers to the natural logarithm.
如果InDel是一个STR单元的缺失,那么θ=0.25,除此之外,θ=1。If InDel is a deletion of a STR unit, then θ=0.25, otherwise θ=1.
实施例2、本发明基于高通量测序的基因变异检测方法的应用实例Example 2. Application examples of the gene variation detection method based on high-throughput sequencing of the present invention
一、本发明方法和smCounter2方法的比较One, the comparison of the method of the present invention and smCounter2 method
采用本发明实施例1建立的方法和现有的smCounter2方法(参考文献“Xu C,Gu X,Padmanabhan R,et al.smCounter2:an accurate low-frequency variant caller fortargeted sequencing data with unique molecular identifiers[J].Bioinformatics,2019,35(8):1299-1309.”)分别在N13532(SRR7526729)和N0261(SRR7526728)样本上对SNV变异进行检测。Adopt the method that the embodiment of the present invention 1 establishes and existing smCounter2 method (reference "Xu C, Gu X, Padmanabhan R, et al.smCounter2: an accurate low-frequency variant caller for targeted sequencing data with unique molecular identifiers [J] .Bioinformatics, 2019, 35(8):1299-1309.") Detected SNV variation on N13532 (SRR7526729) and N0261 (SRR7526728) samples, respectively.
N13532(SRR7526729)样本和N0261(SRR7526728)样本:数据在以下英文网站可以找到。https://www.ncbi.nlm.nih.gov/sra/SRR7526728;https://www.ncbi.nlm.nih.gov/sra/SRR7526729。除此之外,smCounter2文章中也有描述。SRR7526728产品号:CDHS-13907Z-562;SRR7526729产品号:CDHS-13532Z-10181。N13532 (SRR7526729) sample and N0261 (SRR7526728) sample: data can be found at the following English websites. https://www.ncbi.nlm.nih.gov/sra/SRR7526728; https://www.ncbi.nlm.nih.gov/sra/SRR7526729. In addition, it is also described in the smCounter2 article. SRR7526728 product number: CDHS-13907Z-562; SRR7526729 product number: CDHS-13532Z-10181.
图1至图4显示了本发明和smCounter2在N13532(SRR7526729)和N0261(SRR7526728)样本(项目编号:SRP153933)上的精确率和召回率。由图可见,本发明的优点是更高的F1分数和曲线下面积。Figures 1 to 4 show the precision and recall rates of the present invention and smCounter2 on N13532 (SRR7526729) and N0261 (SRR7526728) samples (item number: SRP153933). As can be seen from the figure, the advantages of the present invention are higher F1 score and area under the curve.
smCounter2需要通过对相同的样本进行重复测序并且预先得到样本的某些(但不是全部)已知阳性突变来做机器学习训练,但是本发明不需要这些,因此本发明减少了对已有数据的要求。针对血液样本的分子标签数据(图1至图4),本发明提高了真阳率并且降低了假阳率。smCounter2 needs to do machine learning training by repeatedly sequencing the same sample and obtaining some (but not all) known positive mutations of the sample in advance, but the present invention does not require these, so the present invention reduces the requirement for existing data . For the molecular label data of blood samples (Fig. 1 to Fig. 4), the present invention improves the true positive rate and reduces the false positive rate.
二、本发明方法、Mutect2方法和Strelka2方法在标准混合样本上的性能比较Two, the performance comparison of the inventive method, the Mutect2 method and the Strelka2 method on the standard mixed sample
采用本发明实施例1建立的方法和现有的Mutect2方法和Strelka2方法分别在标准混合样本上对SNV变异进行检测。The method established in Example 1 of the present invention and the existing Mutect2 method and Strelka2 method are used to detect SNV variation on standard mixed samples.
Strelka2方法:参见“Kim S,Scheffler K,Halpern A L,et al.Strelka2:fastand accurate calling of germline and somatic variants[J].Nature methods,2018,15(8):591.”一文。Strelka2源代码在https://github.com/Illumina/strelka。Strelka2 method: see "Kim S, Scheffler K, Halpern A L, et al. Strelka2: fast and accurate calling of germline and somatic variants [J]. Nature methods, 2018, 15(8): 591." Strelka2 source code is at https://github.com/Illumina/strelka.
Mutect2方法:参见“Cibulskis K,Lawrence M S,Carter S L,et al.Sensitivedetection of somatic point mutations in impure and heterogeneous cancersamples[J].Nature biotechnology,2013,31(3):213.”一文。Mutect2源代码在https:// github.com/broadinstitute/gatk。Mutect2 method: see "Cibulskis K, Lawrence M S, Carter S L, et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancersamples[J]. Nature biotechnology, 2013, 31(3): 213." Mutect2 source code is at https://github.com/broadinstitute/gatk .
标准混合样本含有虚拟的tumor和normal的WGS(全基因组)数据:tumor是平均测序深度为90×的混合比例是3/7的NA12878/NA24385两个样本混合,normal是平均测序深度为30×的纯NA24385单个样本。标准混合样本数据在:ftp://ftp- trace.ncbi.nlm.nih.gov/giab/ftp/use_cases/mixtures。The standard mixed sample contains virtual tumor and normal WGS (whole genome) data: tumor is a mixture of two samples of NA12878/NA24385 with an average sequencing depth of 90×, and the normal is a mixture of NA12878/NA24385 samples with an average sequencing depth of 30× A single sample of pure NA24385. Standard mixture sample data are at: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/use_cases/mixtures .
NA12878的germline变异都在:ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/NISTv3.3.2/GRCh37/HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz。The germline variants of NA12878 are all in: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/NISTv3.3.2/GRCh37/HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_ CHROM1-X_v .3.3.2_highconf_PGandRTGphasetransfer.vcf.gz.
其中关于这些变异的描述性文档在:The descriptive documentation for these variants is at:
ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/NISTv3.3.2/README_NISTv3.3.2.txt。ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/NISTv3.3.2/README_NISTv3.3.2.txt.
图5和图6显示了本发明方法、Mutect2方法和Strelka2方法在标准混合样本上的性能测试结果。由图可见,本发明的优点是更高的F1分数和曲线下面积。Figure 5 and Figure 6 show the performance test results of the method of the present invention, the Mutect2 method and the Strelka2 method on standard mixed samples. As can be seen from the figure, the advantages of the present invention are higher F1 score and area under the curve.
Strelka2需要通过对其他相似的样本进行测序并且预先得到其他相似样本的全部已知阳性突变来做机器学习训练,但是本发明不需要这些。总而言之,针对组织样本的WGS数据,本发明提高了真阳率并且降低了假阳率,并且有可能减少生物学试验。Strelka2 needs to do machine learning training by sequencing other similar samples and obtaining all known positive mutations of other similar samples in advance, but the present invention does not need these. In summary, the present invention increases the true positive rate and reduces the false positive rate for WGS data of tissue samples, and potentially reduces biological testing.
本实施例结果表明,使用本发明实施例1建立的方法(简称,UVC),并且跟其他软件做了比较和评估。评估显示UVC性能非常好:无论数据是否带有分子标签,UVC既提高了真阳率又降低了假阳率。The results of this example show that the method established in Example 1 of the present invention (abbreviated as UVC) is used, and compared and evaluated with other software. The evaluation showed that UVC performed very well: UVC both increased the true positive rate and decreased the false positive rate, regardless of whether the data were molecularly labeled or not.
Claims (8)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010222444.5A CN111243664B (en) | 2020-03-26 | 2020-03-26 | Gene variation detection method based on high-throughput sequencing |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010222444.5A CN111243664B (en) | 2020-03-26 | 2020-03-26 | Gene variation detection method based on high-throughput sequencing |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111243664A CN111243664A (en) | 2020-06-05 |
CN111243664B true CN111243664B (en) | 2023-04-18 |
Family
ID=70869208
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010222444.5A Active CN111243664B (en) | 2020-03-26 | 2020-03-26 | Gene variation detection method based on high-throughput sequencing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111243664B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111785325B (en) * | 2020-06-23 | 2021-10-22 | 西北工业大学 | Mutually-exclusively constrained graph Laplacian approach to heterogeneous cancer driver gene identification |
CN115458051B (en) * | 2022-09-28 | 2023-03-21 | 北京泛生子基因科技有限公司 | Method, device and computer readable storage medium for simulating small variation in sequencing data and capable of retaining molecular tag information |
CN116884492A (en) * | 2023-02-07 | 2023-10-13 | 杭州联川基因诊断技术有限公司 | Method, equipment and medium for mTag class selection of targeted sequencing data |
CN116895332B (en) * | 2023-09-11 | 2023-12-05 | 臻和(北京)生物科技有限公司 | Filtering method for interrupting false positive mutation generated by artificial fragments in library construction by enzyme digestion method |
-
2020
- 2020-03-26 CN CN202010222444.5A patent/CN111243664B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN111243664A (en) | 2020-06-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111243664B (en) | Gene variation detection method based on high-throughput sequencing | |
JP7302081B2 (en) | Variant Classifier Based on Deep Neural Networks | |
KR102273717B1 (en) | Deep learning-based variant classifier | |
JP7368483B2 (en) | An integrated machine learning framework for estimating homologous recombination defects | |
Vollger et al. | Increased mutation and gene conversion within human segmental duplications | |
TWI868095B (en) | Cell-free dna end characteristics | |
EP4073805B1 (en) | Systems and methods for predicting homologous recombination deficiency status of a specimen | |
CN110870016B (en) | Validation methods and systems for sequence variant calling | |
US10496679B2 (en) | Computer algorithm for automatic allele determination from fluorometer genotyping device | |
WO2019200338A1 (en) | Variant classifier based on deep neural networks | |
CN110168648A (en) | The verification method and system of sequence variations identification | |
CN110383385A (en) | Method for detecting mutational burden from tumor samples | |
CN114566214B (en) | Method and detection device, computer-readable storage medium and application for detecting genomic deletion and insertion variation | |
US20230064530A1 (en) | Detection of Genetic Variants in Human Leukocyte Antigen Genes | |
RU2822040C1 (en) | Method of detecting copy number variations (cnv) based on sequencing data of complete human exome and low-coverage genome | |
US20250226056A1 (en) | Variant classifier based on deep neural networks | |
KR20250096737A (en) | Component Mixture Model for Tissue Identification from DNA Samples | |
Sarantidis | Algorithms to Explore the Chromosomal Clustering of Genes | |
Woods et al. | Extended regions of suspected mis-assembly in the rat reference genome |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right |
Effective date of registration: 20250606 Address after: 102206 1st floor, Building 11, Zone 1, Yard 8, Shengshenyuan Road, Zhongguancun Life Science Park, Huilongguan Town, Changping District, Beijing (Changping Demonstration Park) Patentee after: GENETRON HEALTH(BEIJING) LABORATORY Co.,Ltd. Country or region after: China Address before: 102206 building 11, area 1, No.8, shengshengyuan Road, Zhongguancun Life Science Park, Changping District, Beijing Patentee before: GENETRON HEALTH (BEIJING) Co.,Ltd. Country or region before: China |