CN111243664B - Gene variation detection method based on high-throughput sequencing - Google Patents

Gene variation detection method based on high-throughput sequencing Download PDF

Info

Publication number
CN111243664B
CN111243664B CN202010222444.5A CN202010222444A CN111243664B CN 111243664 B CN111243664 B CN 111243664B CN 202010222444 A CN202010222444 A CN 202010222444A CN 111243664 B CN111243664 B CN 111243664B
Authority
CN
China
Prior art keywords
quality
reads
strand
base
follows
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010222444.5A
Other languages
Chinese (zh)
Other versions
CN111243664A (en
Inventor
赵霄飞
王思振
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Genetron Health Beijing Laboratory Co ltd
Original Assignee
Genetron Health Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Genetron Health Beijing Co ltd filed Critical Genetron Health Beijing Co ltd
Priority to CN202010222444.5A priority Critical patent/CN111243664B/en
Publication of CN111243664A publication Critical patent/CN111243664A/en
Application granted granted Critical
Publication of CN111243664B publication Critical patent/CN111243664B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/50Mutagenesis
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Bioethics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a gene variation detection method based on high-throughput sequencing. The method comprises the following steps: preprocessing high-throughput sequencing data, mainly comprising base quality correction and read de-duplication; calculating the preference generated by sequencing, and reducing the number of effective reads supporting mutation through the preference to obtain the corrected read number supporting mutation; and calculating the correlation quality of all single samples by using the corrected read number of the supporting mutant types, then obtaining the variation quality of the single samples, and judging according to the variation quality. Experiments prove that the method improves the true positive rate and reduces the false positive rate aiming at the WGS data of the tissue sample, and possibly reduces biological tests.

Description

一种基于高通量测序的基因变异检测方法A Gene Variation Detection Method Based on High-throughput Sequencing

技术领域technical field

本发明属于生物信息技术领域,具体涉及一种基于高通量测序的基因变异检测方法。The invention belongs to the technical field of biological information, and in particular relates to a gene variation detection method based on high-throughput sequencing.

背景技术Background technique

随着测序技术的发展,高通量测序的通量越来越大,因此产生的数据也越来越大。与此同时,高通量测序的应用也越来越广泛并且越来越重要。如今高通量测序在癌症的应用已经非常广泛且非常重要,例如,高通量测序被应用于:肿瘤早筛、肿瘤预后、肿瘤分类和肿瘤用药指导。与此同时,大量的高通量测序数据带来了很多计算方面的挑战。在高通量测序数据分析流程中,最重要的步骤就是变异检测(variant call)。在临床报告出具之前一般需要人工审核变异检测软件(variant caller)报出的变异检测结果。With the development of sequencing technology, the throughput of high-throughput sequencing is increasing, so the data generated is also increasing. At the same time, the application of high-throughput sequencing is becoming more and more extensive and more and more important. Nowadays, the application of high-throughput sequencing in cancer has been very extensive and very important. For example, high-throughput sequencing is used in early tumor screening, tumor prognosis, tumor classification and tumor drug guidance. At the same time, the large amount of high-throughput sequencing data brings many computational challenges. In the high-throughput sequencing data analysis process, the most important step is variant call. Before the clinical report is issued, it is generally necessary to manually review the variant detection results reported by the variant caller.

目前基于高通量测序数据进行变异检测的方法检测出的结果中都存在漏检和假阳性的问题。漏检有可能导致错误诊断,对患者的生命造成严重危害(比如说,降低生存概率)。而大量的假阳性变异需要增加人工审核的工作,导致人工审核需要的时间和成本变得非常高。总而言之,目前存在的软件自动化检测出的结果跟人工审核得到的结果的一致性比较差。除此之外,某些变异检测软件运行速度缓慢,从而导致更多的计算资源消耗,并且导致医生需要等待更长的时间才能拿到患者的基因检测报告。The current methods of variant detection based on high-throughput sequencing data all have problems of missed detection and false positives in the detection results. Missed detection may lead to wrong diagnosis and cause serious harm to the patient's life (for example, reduce the probability of survival). However, a large number of false positive mutations need to increase the work of manual review, resulting in very high time and cost for manual review. All in all, the consistency between the results of automatic detection by existing software and the results of manual review is relatively poor. In addition, some variant detection software runs slowly, which consumes more computing resources and causes doctors to wait longer to get the patient's genetic test report.

发明内容Contents of the invention

针对上述问题,本发明的目的是提供一种基于高通量测序的基因变异检测方法。In view of the above problems, the object of the present invention is to provide a gene variation detection method based on high-throughput sequencing.

第一方面,本发明要求保护一种基于高通量测序的基因变异检测方法。In the first aspect, the present invention claims a method for detecting gene variation based on high-throughput sequencing.

本发明所提供的基于高通量测序的基因变异检测方法支持单端测序和双端测序的数据。The gene variation detection method based on high-throughput sequencing provided by the present invention supports single-end sequencing and paired-end sequencing data.

本发明所提供的基于高通量测序的基因变异检测方法,可包括如下步骤:The gene variation detection method based on high-throughput sequencing provided by the present invention may include the following steps:

(A)对高通量测序数据进行预处理,主要包括碱基质量校正、read去重;(A) Preprocessing the high-throughput sequencing data, mainly including base quality correction and read deduplication;

(B)计算测序产生的偏好性,通过偏好性减少支持突变的有效read的数量,获得校正后的支持突变型的reads的数量;(B) Calculating the bias generated by sequencing, reducing the number of effective reads that support mutations through biases, and obtaining the corrected number of reads that support mutations;

(C)利用校正后的支持突变型的read数目,计算所有单个样本相关质量(quality),然后得出单个样本变异质量,通过变异质量进行判定。(C) Using the corrected number of reads supporting the mutant type, calculate the relevant quality of all individual samples, and then obtain the variation quality of a single sample, and judge by the variation quality.

在所述方法中,被检测的所述基因变异可为SNV变异或者InDel变异。如果没有可用的normal对照样本,则认为normal对照样本对应的数据是一个不含有任何read的BAM格式的虚拟文件,此虚拟文件等同于空文件。In the method, the detected genetic variation can be SNV variation or InDel variation. If there is no normal control sample available, the data corresponding to the normal control sample is considered to be a virtual file in BAM format that does not contain any reads, and this virtual file is equivalent to an empty file.

所有本发明中提到的质量都采用“Ewing B,Green P.Base-calling ofautomated sequencer traces using phred.II.Error probabilities[J].Genomeresearch,1998,8(3):186-194.”第一个公式中对质量(quality)的定义,因此任何质量(比如说:碱基质量和变异质量)都跟概率有一一对应的关系。因此质量和概率是可互换的。All the qualities mentioned in this invention are adopted "Ewing B, Green P. Base-calling of automated sequencer traces using phred. II. Error probabilities [J]. Genomeresearch, 1998, 8 (3): 186-194." The first The definition of quality in this formula, so any quality (for example: base quality and mutation quality) has a one-to-one correspondence with probability. So mass and probability are interchangeable.

当被检测的所述基因变异为InDel变异时,步骤(A)中,在进行所述碱基质量校正之前还包括碱基质量估计的步骤;所述碱基质量估计按照如下进行:When the detected genetic variation is an InDel variation, in step (A), the step of base quality estimation is also included before performing the base quality correction; the base quality estimation is performed as follows:

InDelQual(x,y,a,b,c)=min(x+10,y+10,STRQual(a,b))+10×log10(c)InDelQual(x, y, a, b, c) = min(x+10, y+10, STRQual(a, b))+10×log 10 (c)

其中InDelQual是跟碱基质量有相同用途的InDel质量,x和y分别是发生InDel的前一个碱基和后一个碱基的碱基质量,a和b分别是短串联重复序列(STR)的单元大小(repeat size)和单元数量(repeat number),c是InDel的碱基数量。STRQual函数定义如下:Among them, InDelQual is the InDel quality that has the same purpose as the base quality, x and y are the base quality of the previous base and the next base where InDel occurs, respectively, and a and b are the units of the short tandem repeat (STR) The size (repeat size) and the number of units (repeat number), c is the number of bases of InDel. The STRQual function is defined as follows:

STRQual(a,b)=44-10×log10(1+8×softplus(a×b-8)/(θ×a2));STRQual(a,b)=44-10×log 10 (1+8×softplus(a×b-8)/(θ×a 2 ));

softplus函数定义如下:The softplus function is defined as follows:

softplus(v)=log(1+exp(v)),其中log指的是自然对数。softplus(v)=log(1+exp(v)), where log refers to the natural logarithm.

如果InDel是一个STR单元的缺失,那么θ=0.25,除此之外,θ=1。If InDel is a deletion of a STR unit, then θ=0.25, otherwise θ=1.

当被检测的所述基因变异为SNV或者InDel变异时,步骤(A)中,均可按照如下进行碱基质量校正:When the detected genetic variation is SNV or InDel variation, in step (A), base quality correction can be performed as follows:

如果所述高通量测序为双端测序,则在去重和过滤前进行合并,将R1端和R2端进行合并,然后使用合并后的reads的碱基质量。If the high-throughput sequencing is paired-end sequencing, merge before deduplication and filtering, merge the R1 and R2 ends, and then use the base quality of the merged reads.

对于R1端和R2端的重叠区内,合并后碱基质量的计算方法如下:For the overlapping region between the R1 and R2 ends, the calculation method of the combined base quality is as follows:

当p和q对应的碱基一样时,按照如下公式计算:When the bases corresponding to p and q are the same, calculate according to the following formula:

max(p,q);max(p,q);

当p和q对应的碱基不一样时,按照如下公式计算:When the bases corresponding to p and q are different, calculate according to the following formula:

max(p,q)-min(p,q);max(p,q)-min(p,q);

其中,p是R1端校正前的碱基质量,q是R2端校正前的碱基质量。Among them, p is the base quality before R1 end correction, and q is the base quality before R2 end correction.

对于R1端和R2端的非重叠区内的碱基质量,合并后的碱基质量等于校正前的碱基质量。For the base quality in the non-overlapping region of R1 and R2 ends, the combined base quality is equal to the base quality before correction.

如果是单端测序,不需要合并(校正后的碱基质量等于校正前的碱基质量)。If it is single-end sequencing, no merging is required (the base quality after correction is equal to the base quality before correction).

当被检测的所述基因变异为SNV或者InDel变异时,步骤(A)中,均可按照如下对reads进行去重:When the detected gene variation is SNV or InDel variation, in step (A), the reads can be deduplicated as follows:

把PCR重复或者光学重复的reads放到一个family(簇)中从而去除重复。Put the reads of PCR duplication or optical duplication into a family (cluster) to remove duplication.

具体可是按照包括如下步骤的方法把PCR重复或者光学重复的reads放到一个family(簇)中从而去除重复的:Specifically, the reads of PCR duplication or optical duplication can be put into a family (cluster) according to the method including the following steps to remove duplication:

(a1)对于每个基因座(每一条染色体上的每一个基因座,例如:chr1:1,chr1:2,...,chr1:249250621,chr2:1,...),找到在这个基因座起始或终止的reads的数目,简称末端深度。(a1) For each locus (every locus on each chromosome, e.g.: chr1:1, chr1:2, ..., chr1:249250621, chr2:1, ...), find in this gene The number of reads starting or ending at the seat, referred to as the end depth.

(a2)每个基因座x吸引其他基因座y,计算x对于y的吸引力强度s,吸引力强度s跟x末端深度(这个基因座起始/终止的reads的数目)成正比(比例常数=1),跟y末端深度成反比(比例常数=1)并且跟x与y之间碱基数目z成指数衰减关系(衰减系数=5);(a2) Each locus x attracts other loci y, and calculates the attraction strength s of x to y. The attraction strength s is proportional to the depth of the end of x (the number of reads starting/terminating at this locus) (proportionality constant =1), which is inversely proportional to the depth of the end of y (proportionality constant=1) and has an exponential decay relationship with the number of bases z between x and y (attenuation coefficient=5);

吸引力强度s的公式如下;The formula for the strength of attraction s is as follows;

s=x/y×5-zs=x/y×5 -z ;

如果s>1(y被x强烈吸引),那么认为y是x的一个PCR或者光学重复,从而把x和y放到一个family里。If s > 1 (y is strongly attracted to x), then y is considered to be a PCR or optical repeat of x, thus putting x and y into a family.

(a3)过滤:计算碱基质量的阈值(threshold),公式如下:(a3) Filtering: calculate the threshold value (threshold) of base quality, the formula is as follows:

Threshold(P)=argmaxp∈P(p-|{q∈P|q≤p}|);Threshold(P)=argmax p∈P (p -|{q∈P|q≤p}| );

其中,P是所有的碱基质量对应的测序错误概率(一个碱基质量对应一个测序错误概率)的集合;数学符号∈代表集合归属,因此p∈P的意思是指p是P中任何一个元素,而q∈P的意思是指q是P中任何一个元素。Among them, P is the set of sequencing error probabilities corresponding to all base qualities (one base quality corresponds to one sequencing error probability); the mathematical symbol ∈ represents the set belonging, so p∈P means that p is any element in P , and q∈P means that q is any element in P.

根据碱基质量阈值对reads上的碱基进行过滤(filter):把校正后碱基质量低于碱基质量阈值的碱基过滤掉;如果一个碱基被过滤掉了,那么认为这个碱基不存在,可以理解为没有测出任何碱基。Filter the bases on the reads according to the base quality threshold: filter out the bases whose corrected base quality is lower than the base quality threshold; if a base is filtered out, then it is considered that the base is not Existence, it can be understood that no base has been detected.

需要注意的是,过滤碱基和过滤reads是两个概念。It should be noted that filtering bases and filtering reads are two concepts.

过滤之后,对每个family,取family最常见碱基(碱基出现次数最多)作为去重后consensus family碱基,也就是去重后的碱基;在此family既可以是针对碱基的也可以是针对reads的。After filtering, for each family, take the most common base of the family (the number of times the base appears the most) as the consensus family base after deduplication, that is, the base after deduplication; here the family can be either for the base or Can be for reads.

然后,把重复的reads整合成一条consensus read(一致read);在此需要提到无论去重的是reads还是碱基,去重后最终结果一样。因此既可以理解成对reads进行去重,也可以理解成对碱基进行过滤。Then, integrate the repeated reads into a consensus read (consensus read); here it needs to be mentioned that no matter whether the deduplication is reads or bases, the final result is the same after deduplication. Therefore, it can be understood as deduplication of reads, or as filtering of bases.

步骤(B)中,是在每一条染色体链算出deduplication bias,position bias,strand bias和mismatch bias四个偏好性;然后选最高偏好性作为染色体链整体偏好性;对支持突变型的read数目做校正,校正后的read数目等于校正前的read数目除以整体偏好性。In step (B), the four preferences of deduplication bias, position bias, strand bias and mismatch bias are calculated for each chromosome chain; then the highest bias is selected as the overall preference of the chromosome chain; the number of reads supporting the mutant type is corrected , the number of reads after correction is equal to the number of reads before correction divided by the overall preference.

当所述被检测的所述基因变异为SNV或者InDel变异时,均可按照如下公式计算deduplication bias,position bias,strand bias和mismatch bias四个偏好性:When the detected gene variation is SNV or InDel variation, the four preferences of deduplication bias, position bias, strand bias and mismatch bias can be calculated according to the following formula:

Bias(a1,a2,b1,b2)=max(0,OR(LS(a1,a2,b1,b2))-1)*(b2÷(b2+b1))+1;Bias(a 1 , a 2 , b 1 , b 2 )=max(0, OR(LS(a 1 , a 2 , b 1 , b 2 ))-1)*(b 2 ÷(b 2 +b 1 ))+1;

在公式中,OR代表比值比;OR的计算公式如下:In the formula, OR stands for odds ratio; OR is calculated as follows:

OR(a1,a2,b1,b2)=(a1×b2)÷(a2×b1)。OR(a 1 , a 2 , b 1 , b 2 )=(a 1 ×b 2 )÷(a 2 ×b 1 ).

在公式中,LS代表Laplace平滑;LS公式如下:LS(a,b,c,d)=Laplace2(Laplace1(a,b,c,d));其中,Laplace1公式如下:Laplace1(a,b,c,d)=(a+p,b+p,c+p,d+p);Laplace2公式如下:Laplace2(a,b,c,d)=(a+p×α,b+p×α,c+p×β,d+p×β);In the formula, LS represents Laplace smoothing; LS formula is as follows: LS (a, b, c, d)=Laplace 2 (Laplace 1 (a, b, c, d)); Wherein, Laplace 1 formula is as follows: Laplace 1 ( a, b, c, d)=(a+p, b+p, c+p, d+p); Laplace 2 formula is as follows: Laplace 2 (a, b, c, d)=(a+p×α , b+p×α, c+p×β, d+p×β);

在Laplace1公式和Laplace2公式中,p的默认值是0.5;In the Laplace 1 formula and Laplace 2 formula, the default value of p is 0.5;

在Laplace2公式中,α和β公式如下:α=max(0,p×((a1+a2)/(b1+b2)-1));β=max(0,p×(b1+b2)/(a1+a2)-1);In the Laplace 2 formula, the formulas of α and β are as follows: α=max(0, p×((a 1 +a 2 )/(b 1 +b 2 )-1)); β=max(0, p×( b 1 +b 2 )/(a 1 +a 2 )-1);

对于deduplication bias而言,各参数的意义如下:For deduplication bias, the meaning of each parameter is as follows:

a1:去重去掉的支持任何等位基因的reads的数量;a 1 : the number of reads that support any alleles removed by deduplication;

a2:去重之后,支持任何等位基因的reads的数量;a 2 : After deduplication, the number of reads supporting any allele;

b1:去重去掉的支持突变等位基因的reads的数量;b 1 : the number of reads supporting mutant alleles removed by deduplication;

b2:去重之后,支持突变等位基因的reads的数量。b 2 : After deduplication, the number of reads supporting the mutant allele.

如果去重前有7条reads,去重后剩下2条reads,那么去重去掉了7-2=5条reads。If there are 7 reads before deduplication and 2 reads are left after deduplication, then 7-2=5 reads are removed after deduplication.

对于position bias而言,各参数的意义如下:For position bias, the meaning of each parameter is as follows:

针对每个方向(总共左右两个方向),按照方向延长1、3、6、10、15、21、28、36、45、55、66个碱基;然后代入以下四个数值:For each direction (a total of left and right directions), extend 1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 66 bases according to the direction; then substitute the following four values:

a1:在已经跨越延长后基因座的reads里,支持任何等位基因的reads的数量;a 1 : Among the reads that have spanned the extended locus, the number of reads supporting any allele;

a2:在没有跨越延长后基因座的reads里,支持任何等位基因的reads的数量;a 2 : Among the reads that do not span the extended locus, the number of reads supporting any allele;

b1:在已经跨越延长后基因座的reads里,支持突变等位基因的reads的数量;b 1 : among the reads that have spanned the extended locus, the number of reads supporting the mutant allele;

b2:在没有跨越延长后基因座的reads里,支持突变等位基因的reads的数量。b 2 : Number of reads supporting the mutant allele among reads that do not span the extended locus.

依次代入11个延长单位产生一条由11个数值组成的序列;序列中取每3个相邻数值作为一个数组,从而产生9个数组;对每个数组取最小值,从而产生9个最小值;取9个最小值的最大值为position bias。Substituting 11 extended units in turn produces a sequence consisting of 11 values; taking every 3 adjacent values in the sequence as an array, thereby generating 9 arrays; taking the minimum value for each array, thereby generating 9 minimum values; Take the maximum value of the 9 minimum values as position bias.

对于strand bias而言,代入以下数值:For strand bias, substitute the following values:

a1:在染色体链S的正向,支持任何等位基因的reads的数量;a 1 : In the forward direction of the chromosome strand S, the number of reads supporting any allele;

a2:在染色体链S的反向,支持任何等位基因的reads的数量;a 2 : In the reverse direction of the chromosome strand S, the number of reads supporting any allele;

b1:在染色体链S的正向,支持突变等位基因的reads的数量;b 1 : in the forward direction of the chromosome strand S, the number of reads supporting the mutant allele;

b2:在染色体链S的反向,支持突变等位基因的reads的数量。b 2 : In the reverse direction of chromosome strand S, the number of reads supporting the mutant allele.

同时计算base quality imbalance(简称BQI,公式中用I表示,有1个I)和strandposition imbalance(简称SPI,公式中用I表示,有2个I),通过以下方法计算出3个I的数值,然后取最小数值;然后让strand bias除以此最小值,以此来归一化strand bias;最后确保strand bias最低至少是1;Simultaneously calculate base quality imbalance (abbreviated as BQI, represented by I in the formula, with 1 I) and strandposition imbalance (SPI for short, represented by I in the formula, with 2 Is), and calculate the values of 3 Is by the following method, Then take the minimum value; then divide the strand bias by this minimum value to normalize the strand bias; finally ensure that the strand bias is at least 1;

对于BQI,c是另外一条链的所有等位基因的平均碱基质量,d是这一条链的突变等位基因的平均碱基质量,而I=10d-cFor BQI, c is the average base quality of all alleles of another chain, d is the average base quality of the mutant alleles of this chain, and I=10 dc ;

对于SPI,c是另外一条链的所有等位基因的到read末端的平均碱基距离,d是怀疑有偏好性链的突变等位基因的到read末端的平均碱基距离,而

Figure BDA0002426561100000051
每一条read有两个末端:左边和右边。因此左末端和右末端分别产生两个I。For SPI, c is the average base distance to the end of the read for all alleles of the other strand, d is the average base distance to the end of the read for the mutant alleles of the suspected biased strand, and
Figure BDA0002426561100000051
Each read has two ends: left and right. So two I's are generated at the left and right ends respectively.

对于mismatch bias而言,让x=1,2,3,...,11;x是mismatch数量阈值(通过mismatch数量<x可以推断出)。1,2,……,11是不同的尝试的阈值。For the mismatch bias, let x=1, 2, 3, ..., 11; x is the threshold of the number of mismatches (can be inferred by the number of mismatches < x). 1, 2, ..., 11 are thresholds for different attempts.

针对每个x的值,代入以下四个数值:For each value of x, substitute the following four values:

a1:mismatch数量<x,支持任何等位基因的reads数量;a 1 : the number of mismatches < x, the number of reads supporting any allele;

a2:mismatch数量≥x,支持任何等位基因的reads数量;a 2 : The number of mismatches ≥ x, the number of reads supporting any allele;

b1:mismatch数量<x,支持突变等位基因的reads数量;b 1 : the number of mismatches < x, the number of reads supporting the mutant allele;

b2:mismatch数量≥x,支持突变等位基因的reads数量;b 2 : the number of mismatches ≥ x, the number of reads supporting the mutant allele;

依次代入11个x值产生一条由11个数值组成的序列;序列中取每5个相邻数值作为一个数组,从而产生7个数组;对每个数组取最小值,从而产生7个最小值;取7个最小值的最大值为mismatch bias。Substituting 11 x values in turn produces a sequence consisting of 11 values; taking every 5 adjacent values in the sequence as an array, thus generating 7 arrays; taking the minimum value for each array, thereby generating 7 minimum values; Take the maximum value of the 7 minimum values as the mismatch bias.

当所述被检测的所述基因变异为SNV或者InDel变异时,所述步骤(C)均可按照如下进行:When the detected gene variation is SNV or InDel variation, the step (C) can be carried out as follows:

(C1)针对每个需要计算quality的位点,采用如下公式计算family quality,single-strand quality,double-strand quality;(C1) For each site that needs to calculate the quality, use the following formula to calculate family quality, single-strand quality, double-strand quality;

Quality(p,B,b,r)=logr(BIAS(B,p×B,B,b)×(r-1)+1)×10×log10(OR(B,p×B,B,b))。Quality (p, B, b, r) = log r (BIAS (B, p × B, B, b) × (r-1) + 1) × 10 × log 10 (OR (B, p × B, B , b)).

其中:in:

(1)针对一个family里的reads,让r=2并且使用以下定义计算family quality:(1) For the reads in a family, let r=2 and use the following definition to calculate the family quality:

p:步骤(a3)reads去重中计算出的碱基质量的阈值所对应的错误概率;p: error probability corresponding to the threshold value of base quality calculated in step (a3) reads deduplication;

B:从数据实际观测到支持任何等位基因的read数量;B: The number of reads that support any allele from the actual observation of the data;

b:从数据实际观测到支持突变等位基因的read数量。b: Number of reads actually observed from the data supporting the mutant allele.

(2)如果高通量测序实验采用PCR扩增方法建库,则r=1.5,如果高通量测序实验采用捕获或者WGS方法建库,则r=1.25。比对到一条染色体链上的families都在一条single-strand上;针对一条single-strand,使用以下定义计算single-strand quality(最大family quality如下:对于C:G>T:A突变形式是44,对于T:A>C:G突变形式是48,对于其他SNV突变形式是52,对于所有InDel突变形式是60)。如果single-strand quality大于最大family quality数值,则让single-strand quality等于最大family quality数值;(2) If the high-throughput sequencing experiment uses the PCR amplification method to build the library, then r=1.5, and if the high-throughput sequencing experiment uses the capture or WGS method to build the library, then r=1.25. The families compared to a chromosome chain are all on a single-strand; for a single-strand, use the following definition to calculate the single-strand quality (the maximum family quality is as follows: for C: G > T: A mutant form is 44, For T: A>C:G mutant form is 48, for other SNV mutant forms is 52, for all InDel mutant forms is 60). If the single-strand quality is greater than the maximum family quality value, then let the single-strand quality be equal to the maximum family quality value;

p:每个可能的family quality(即尝试所有的family qualities)所对应的错误概率;p: The error probability corresponding to each possible family quality (ie try all family qualities);

B:实际观测到支持任何等位基因的family数量;B: The number of families actually observed to support any allele;

b:实际观测到支持突变等位基因的family数量;b: The number of families actually observed to support the mutant allele;

在尝试所有的p之后,取最大single-strand quality作为最终single-strandquality。After trying all p, take the largest single-strand quality as the final single-strand quality.

(3)每一条single-strand在一条double-strand上的正向或反向;针对一条single-strand,使用以下定义计算double-strand quality:(3) The forward or reverse direction of each single-strand on a double-strand; for a single-strand, use the following definition to calculate the double-strand quality:

v1′=v1+v2×min(1,(min(w1+w2,w0)-w1)/w1);v 1 '=v 1 +v 2 ×min(1, (min(w 1 +w 2 ,w 0 )-w 1 )/w 1 );

其中v1′是某一条链上的最终变异质量,v1是这一条链的变异质量(即最终single-strand quality),v2是另外一条链上的变异质量,w1是这一条链的family quality的平均值,w2是另外一条链的family quality的平均值,w0等于最大family quality(最大familyquality如下:对于C:G>T:A突变形式是44,对于T:A>C:G突变形式是48,对于其他SNV突变形式是52,对于所有InDel突变形式是60)加上10;Among them, v 1 ′ is the final mutation quality on a certain chain, v 1 is the mutation quality of this chain (that is, the final single-strand quality), v 2 is the mutation quality on another chain, and w 1 is the mutation quality of this chain. The average value of family quality, w 2 is the average value of the family quality of another chain, w 0 is equal to the maximum family quality (the maximum family quality is as follows: for C: G>T: A mutant form is 44, for T: A>C: G mutant form is 48, for other SNV mutant forms is 52, for all InDel mutant forms is 60) plus 10;

交换v1和v2并且交换w1和w2,根据计算v1′公式的对称性计算v2′;Exchange v 1 and v 2 and exchange w 1 and w 2 to calculate v 2 ′ according to the symmetry of the formula for calculating v 1 ′;

v2′=v2+v1×min(1,(min(w1+w2,w0)-w2)/w2)。v 2 ′=v 2 +v 1 ×min(1, (min(w 1 +w 2 , w 0 )−w 2 )/w 2 ).

(C2)把double-strand quality定义成单个样本变异质量。(C2) Define double-strand quality as the quality of variation of a single sample.

(C3)按照上述(C1)和(C2),分别计算得到tumor样本和normal对照样本的单样本突变质量。然后计算TLOD。TLOD代表变异既不来源于躯体(soma)也不来源于胚胎(germline)的概率,具体如下:(C3) According to the above (C1) and (C2), calculate the single-sample mutation quality of the tumor sample and the normal control sample, respectively. Then calculate TLOD. TLOD represents the probability that the variation is neither derived from the body (soma) nor from the embryo (germline), as follows:

首先,计算以下函数的输出值。First, calculate the output value of the following function.

TNreward(a1,a2,b1,b2)=maxH(TNrewardBinomial(a1,a2,b1,b2),PLQTN(b2/b1,a2/a1));TNreward(a 1 , a 2 , b 1 , b 2 )=maxH(TNrewardBinomial(a 1 , a 2 , b 1 , b 2 ), PLQTN(b 2 /b 1 , a 2 /a 1 ));

其中,TNrewardBinomial(a1,a2,b1,b2)=10/log(10)×(a1×KLBernoulli(b2/b1,a2/a1))Among them, TNrewardBinomial(a 1 , a 2 , b 1 , b 2 )=10/log(10)×(a1×KL Bernoulli(b 2 /b 1 , a 2 /a 1 ))

如果max(a,-a)<max(b,-b),那么maxH(a,b)=a,否则maxH(a,b)=b;If max(a,-a)<max(b,-b), then maxH(a,b)=a, otherwise maxH(a,b)=b;

PLQTN(t,n)=3×10×log10(min(t/n,102.5/3))。PLQTN(t, n)=3×10×log 10 (min(t/n, 10 2.5/3 )).

TNpenal(a1,a2,b1,b2)=max(0,min(DSVQnormal,PLQ(a2/a1))-12.5×(max(0,OR(a1,a2,b1,b2)-1))2);其中DSVQnormal指的是normal样本的单样本突变质量,也就是normal样本的double-strand quality。TNpenal(a 1 , a 2 , b 1 , b 2 )=max(0, min(DSVQnormal, PLQ(a 2 /a 1 ))-12.5×(max(0, OR(a 1 , a 2 , b 1 , b 2 )-1)) 2 ); where DSVQnormal refers to the single-sample mutation quality of the normal sample, that is, the double-strand quality of the normal sample.

a1、a2、b1和b2含义如下:a 1 , a 2 , b 1 and b 2 have the following meanings:

a1:配对normal对照样本中支持任何等位基因的reads的family quality总和;a 1 : The sum of the family quality of the reads supporting any allele in the paired normal control sample;

a2:配对normal对照样本中支持突变等位基因的reads的family quality总和;a 2 : The sum of the family quality of the reads supporting the mutant allele in the paired normal control sample;

b1:tumor样本中支持任何等位基因的reads的family quality总和;b 1 : the sum of family quality of reads supporting any allele in the tumor sample;

b2:tumor样本中支持突变等位基因的reads的family quality总和。b 2 : The sum of the family quality of the reads supporting the mutant allele in the tumor sample.

然后,计算TLOD,计算公式如下:Then, calculate TLOD, the calculation formula is as follows:

TLOD(a1,a2,b1,b2)=min(DSVQtumor,PLQ(b2/b1))+TNreward(a1,a2,b1,b2)-TNpenal(a1,a2,b1,b2)。TLOD(a 1 ,a 2 ,b 1 ,b 2 )=min(DSVQtumor,PLQ(b 2 /b 1 ))+TNreward(a 1 ,a 2 ,b 1 ,b 2 )-TNpenal(a 1 ,a 2 , b 1 , b 2 ).

其中,PLQ(f)=90+3×10×log10(f)。DSVQtumor指的是tumor样本的单样本突变质量,也就是tumor样本的double-strand quality。Wherein, PLQ(f)=90+3×10×log 10 (f). DSVQtumor refers to the single-sample mutation quality of the tumor sample, that is, the double-strand quality of the tumor sample.

(C4)计算NLOD。NLOD代表来变异来源于胚胎(germline)的概率。具体如下:(C4) Calculate NLOD. NLOD represents the probability that the variation originates from the embryo (germline). details as follows:

首先,针对每个样本(总共有tumor和配对的normal两个样本),计算每个genotype(GT)的genotype likelihood(GL)。GT和GL的定义和说明在https://samtools.github.io/hts-specs/VCFv4.2.pdf中。总而言之,homref(纯合野生型),hetero(杂合型)和homalt(纯合突变型)是三种GT。这三种GT的GL的计算方法如下:First, for each sample (there are two samples of tumor and paired normal in total), the genotype likelihood (GL) of each genotype (GT) is calculated. The definitions and descriptions of GT and GL are in https://samtools.github.io/hts-specs/VCFv4.2.pdf. To sum up, homref (homozygous wild type), hetero (heterozygous type) and homalt (homozygous mutant type) are three types of GT. The calculation methods of the GL of these three GTs are as follows:

GL(f,homref)=max(GLPowerLaw(f,homref),GLBinomial(f,homref));GL(f, homref) = max(GLPowerLaw(f, homref), GLBinomial(f, homref));

GL(f,hetero)=max(GLPowerLaw(f,hetero),GLBinomial(f,hetero));GL(f, hetero) = max(GLPowerLaw(f, hetero), GLBinomial(f, hetero));

GL(f,homalt)=max(GLPowerLaw(f,homalt),GLBinomial(f,homalt))。GL(f, homalt) = max(GLPowerLaw(f, homalt), GLBinomial(f, homalt)).

其中,GLPowerLaw(f,homref)=10×3×min(0,log1o(0.02×(1-f)/f));Among them, GLPowerLaw(f, homref) = 10×3×min(0, log 1o (0.02×(1-f)/f));

GLPowerLaw(f,hetero)=10×3×min(0,log10(min(f/(1-f),(1-f)/f)));GLPowerLaw(f, hetero) = 10×3×min(0, log 10 (min(f/(1-f), (1-f)/f)));

GLPowerLaw(f,homalt)=10×3×min(0,log10(0.02×f/(1-f)));GLPowerLaw(f, homalt) = 10×3×min(0, log 10 (0.02×f/(1-f)));

GLBinomial(f,homref)=10/log(10)×d×KLBernoulli(min(0.02,f),f);GLBinomial(f, homref) = 10/log(10) × d × KL Bernoulli(min(0.02, f), f);

GLBinomial(f,hetero)=10/log(10)×d×KLBernoulli(0.5,f);GLBinomial(f, hetero)=10/log(10)×d×KL Bernoulli(0.5,f);

GLBinomial(f,homalt)=10/log(10)×d×KLBernoulli(max(1-0.02,f),f);GLBinomial(f,homalt)=10/log(10)×d×KL Bernoulli(max(1-0.02,f),f);

KLBernoulli(x,y)=y×log(y/x)+(1-y)×log((1-y)/(1-x));KL Bernoulli(x,y)=y×log(y/x)+(1-y)×log((1-y)/(1-x));

如果x=0、y=0、x是不定式或者y是不定式,那么KLBernoulli(x,y)=0。比如说,(0/0)是不定式。If x=0, y=0, x is an infinitive or y is an infinitive, then KL Bernoulli(x,y)=0. For example, (0/0) is an infinitive.

并且a指的是支持突变型基因的read数量,d指的是支持任何基因型的read数量,f=a/d是突变频率。And a refers to the number of reads supporting a mutant gene, d refers to the number of reads supporting any genotype, and f=a/d is the mutation frequency.

然后计算NLOD。NLOD=max(Qtumorhomref,Qnormalhomref)。Then calculate the NLOD. NLOD=max(Qtumorhomref, Qnormalhomref).

其中在tumor样本上,Qtumorhomref=-3-max(GL(f,homalt),GL(f,hetero))。Qtumorhomref是使用tumor样本数据计算的,所以跟配对的normal样本无任何关系。where on the tumor sample, Qtumorhomref=-3-max(GL(f, homalt), GL(f, hetero)). Qtumorhomref is calculated using tumor sample data, so it has nothing to do with paired normal samples.

其中在配对的normal样本上,Qnormalhomref=GL(f,homref)-max(GL(f,homalt),GL(f,hetero))。Qnormalhomref是使用配对的normal样本数据计算的,所以跟tumor样本无任何关系。Wherein on the paired normal samples, Qnormalhomref=GL(f, homref)-max(GL(f, homalt), GL(f, hetero)). Qnormalhomref is calculated using paired normal sample data, so it has nothing to do with tumor samples.

(C5)最终变异质量是:min(TLOD,NLOD+31)。如果最终变异质量至少是60,则判定为阳性并且输出变异形式;否则判定为阴性。60的质量代表百万分之一的假阳率。变异检测软件MuTect(Cibulskis K,Lawrence M S,Carter S L,et al.Sensitive detection ofsomatic point mutations in impure and heterogeneous cancer samples[J].Naturebiotechnology,2013,31(3):213.)间接地使用了60的阈值,同时BWA能输出的最大比对质量是60。因此使用了60的阈值。变异输出格式为VCF。VCF格式定义在(Danecek P,Auton A,Abecasis G,et al.The variant call format and VCFtools[J].Bioinformatics,2011,27(15):2156-2158.)中有详细描述。(C5) The final mutation quality is: min(TLOD, NLOD+31). If the final mutation quality is at least 60, it is judged as positive and the mutant form is output; otherwise, it is judged as negative. A quality of 60 represents a false positive rate of one in a million. The mutation detection software MuTect (Cibulskis K, Lawrence M S, Carter S L, et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples[J]. Naturebiotechnology, 2013, 31(3): 213.) indirectly used The threshold of 60, and the maximum comparison quality that BWA can output is 60. Therefore a threshold of 60 was used. The mutation output format is VCF. The VCF format definition is described in detail in (Danecek P, Auton A, Abecasis G, et al. The variant call format and VCFtools[J]. Bioinformatics, 2011, 27(15): 2156-2158.).

第二方面,本发明要求保护一种基于高通量测序的基因变异检测系统。In the second aspect, the present invention claims a gene variation detection system based on high-throughput sequencing.

本发明所要求保护的基于高通量测序的基因变异检测系统,包括装置A、装置B和装置C;The gene variation detection system based on high-throughput sequencing as claimed in the present invention includes device A, device B and device C;

所述装置A能够按照前文第一方面中步骤(A)对高通量测序数据进行预处理,主要包括碱基质量校正、read去重;The device A can preprocess the high-throughput sequencing data according to step (A) in the first aspect above, mainly including base quality correction and read deduplication;

所述装置B能够按照前文第一方面中步骤(B)计算测序产生的偏好性,通过偏好性减少支持突变的有效read的数量,获得校正后的支持突变型的reads的数量;The device B can calculate the preference generated by sequencing according to the step (B) in the first aspect above, and reduce the number of effective reads supporting the mutation through the preference to obtain the corrected number of reads supporting the mutation;

所述装置C能够按照前文第一方面中步骤(C)利用校正后的支持突变型的read数目,计算所有单个样本相关质量(quality),然后得出单个样本变异质量,通过变异质量进行判定。The device C can calculate the relevant quality of all individual samples according to the step (C) in the first aspect above by using the corrected number of reads supporting the mutant type, and then obtain the variation quality of a single sample, and judge by the variation quality.

第三方面,本发明要求保护如下任一应用:In the third aspect, the present invention claims any of the following applications:

(I)前文第一方面所述的方法或前文第二方面所述的系统在制备用于肿瘤早筛、肿瘤预后、肿瘤分类和/或肿瘤用药指导的产品中的应用;(1) Application of the method described in the first aspect above or the system described in the second aspect above in the preparation of products for early tumor screening, tumor prognosis, tumor classification and/or tumor drug guidance;

所述产品能够按照前文第一方面所述方法的步骤实现从高通量测序数据中检测基因变异。The product can detect gene variation from high-throughput sequencing data according to the steps of the method described in the first aspect above.

(II)前文第一方面所述的方法或前文第二方面所述的系统在肿瘤早筛、肿瘤预后、肿瘤分类和/或肿瘤用药指导中的应用。(II) Application of the method described in the first aspect above or the system described in the second aspect above in early tumor screening, tumor prognosis, tumor classification and/or tumor drug guidance.

实验证明,针对组织样本的WGS数据,本发明提高了真阳率并且降低了假阳率,并且有可能减少生物学试验。Experiments prove that, for the WGS data of tissue samples, the present invention increases the true positive rate and reduces the false positive rate, and may reduce biological tests.

本发明由于采取以上技术方案,其具有以下优点:The present invention has the following advantages due to the adoption of the above technical scheme:

1、几乎没有漏检,也就是说几乎没有假阴性检测结果。1. Almost no missed detection, that is to say, almost no false negative test results.

2、产生的假阳突变数目很少。2. The number of false positive mutations produced is very small.

3、使用本发明描述的算法实现的软件比大多数软件的运行速度都更快。3. The software implemented using the algorithm described in the present invention runs faster than most software.

附图说明Description of drawings

图1为本发明方法跟类似软件在带有分子标签的数据上做的性能比较。Figure 1 is a performance comparison between the method of the present invention and similar software on data with molecular labels.

图2为本发明方法跟类似软件在带有分子标签的数据上做的性能比较。Figure 2 is a performance comparison between the method of the present invention and similar software on data with molecular labels.

图3为本发明方法跟类似软件在带有分子标签的数据上做的性能比较。Figure 3 is a performance comparison between the method of the present invention and similar software on data with molecular labels.

图4为本发明方法跟类似软件在带有分子标签的数据上做的性能比较。Fig. 4 is a performance comparison between the method of the present invention and similar software on data with molecular labels.

图5为本发明方法跟类似软件在tumor-normal组织数据上做的性能比较。Fig. 5 is the performance comparison of the method of the present invention and similar software on tumor-normal tissue data.

图6为本发明方法跟类似软件在tumor-normal组织数据上做的性能比较。Fig. 6 is the performance comparison done by the method of the present invention and similar software on tumor-normal tissue data.

各图中,精确率=真阳数量/(真阳数量+假阳数量)。召回率=真阳数量/(真阳数量+假阴数量)。In each figure, precision rate=number of true positives/(number of true positives+number of false positives). Recall = Number of True Positives/(Number of True Positives + Number of False Negatives).

具体实施方式Detailed ways

下述实施例中所使用的实验方法如无特殊说明,均为常规方法。The experimental methods used in the following examples are conventional methods unless otherwise specified.

下述实施例中所用的材料、试剂等,如无特殊说明,均可从商业途径得到。The materials and reagents used in the following examples can be obtained from commercial sources unless otherwise specified.

实施例1、本发明基于高通量测序的基因变异检测方法的建立Embodiment 1, the establishment of the gene variation detection method based on high-throughput sequencing of the present invention

以SNV变异为例,详述本发明所建立的基于高通量测序的基因变异检测方法。Taking SNV variation as an example, the high-throughput sequencing-based gene variation detection method established in the present invention is described in detail.

一、将样本基因组DNA进行高通量测序。1. Perform high-throughput sequencing of the genomic DNA of the sample.

二、对高通量测序的数据进行预处理,主要包括:碱基质量校正、read去重;具体方法包括:2. Preprocessing the data of high-throughput sequencing, mainly including: base quality correction, read deduplication; specific methods include:

本发明采用“Ewing B,Green P.Base-calling of automated sequencer tracesusing phred.II.Error probabilities[J].Genome research,1998,8(3):186-194.”第一个公式中对质量(quality)的定义,因此任何质量(比如说:碱基质量和变异质量)都跟概率有一一对应的关系。因此质量和概率是可互换的。The present invention adopts "Ewing B, Green P. Base-calling of automated sequencer tracing using phred.II. Error probabilities [J]. Genome research, 1998, 8 (3): 186-194." Quality ( quality), so any quality (for example: base quality and mutation quality) has a one-to-one correspondence with probability. So mass and probability are interchangeable.

所述碱基质量校正,具体如下 The base quality correction is as follows :

a.如果是双端测序,需在去重和过滤前进行合并,将R1端和R2端进行合并(merge),然后使用合并后的reads的碱基质量。a. If it is paired-end sequencing, merge before deduplication and filtering, merge the R1 and R2 ends, and then use the base quality of the merged reads.

对于R1和R2的重叠区内,合并后碱基质量的计算方法如下:For the overlapping region of R1 and R2, the calculation method of the combined base quality is as follows:

当p和q对应的碱基一样时:max(p,q);When the bases corresponding to p and q are the same: max(p, q);

当p和q对应的碱基不一样时:max(p,q)-min(p,q);When the bases corresponding to p and q are different: max(p, q)-min(p, q);

其中,p是R1端校正前的碱基质量,q是R2端校正前的碱基质量。Among them, p is the base quality before R1 end correction, and q is the base quality before R2 end correction.

对于R1和R2的非重叠区内的碱基质量,合并后的碱基质量等于校正前的碱基质量。For the base quality in the non-overlapping region of R1 and R2, the combined base quality is equal to the base quality before correction.

b.如果是单端测序,不需要合并(校正后的碱基质量等于校正前的碱基质量)。b. If it is single-end sequencing, no merging is required (the base quality after correction is equal to the base quality before correction).

所述reads去重,具体如下 The reads are deduplicated, as follows :

方法包括:把PCR或者光学重复的reads放到一个family(簇)之中从而去除重复,具体步骤包括如下:The method includes: putting PCR or optically repeated reads into a family (cluster) to remove duplication, and the specific steps include the following:

a)对于每个基因座(每一条染色体上的每一个基因座,例如:chr1:1,chr1:2,...,chrl:249250621,chr2:1,...),找到在这个基因座起始或终止的reads的数目(简称末端深度)。a) For each locus (every locus on each chromosome, for example: chr1:1, chr1:2, ..., chrl:249250621, chr2:1, ...), find the The number of reads starting or ending (referred to as terminal depth).

b)每个基因座x吸引其他基因座y,计算x对于y的吸引力强度s,吸引力强度s跟x末端深度(这个基因座起始/终止的reads的数目)成正比(比例常数=1),跟y末端深度成反比(比例常数=1)并且跟x与y之间碱基数目z成指数衰减关系(衰减系数=5)。b) Each locus x attracts other loci y, and calculates the attraction strength s of x to y. The attraction strength s is proportional to the depth of the end of x (the number of reads starting/terminating at this locus) (proportionality constant = 1), which is inversely proportional to the depth of the y end (proportionality constant = 1) and has an exponential decay relationship with the base number z between x and y (attenuation coefficient = 5).

如果y被x强烈吸引,那么y被认为是x的一个PCR或者光学重复。If y is strongly attracted to x, then y is considered a PCR or optical duplicate of x.

吸引力强度s的公式如下:The formula for the strength of attraction s is as follows:

s=x/y×5-zs=x/y×5 -z ;

如果s>1(y被x强烈吸引),那么认为y是x的一个PCR或者光学重复,从而把x和y放到一个family里。If s > 1 (y is strongly attracted to x), then y is considered to be a PCR or optical repeat of x, thus putting x and y into a family.

c)过滤:计算碱基质量的阈值(threshold),公式如下:c) Filtering: calculate the threshold of base quality (threshold), the formula is as follows:

Threshold(P)=argmaxp∈P(p-|{q∈P|q≤p}|);Threshold(P)=argmax p∈P (p -|{q∈P|q≤p}| );

其中,P是所有的碱基质量对应的测序错误概率(一个碱基质量对应一个测序错误概率)的集合;数学符号∈代表集合归属,因此p∈P的意思是指p是P中任何一个元素,而q∈P的意思是指q是P中任何一个元素。Among them, P is the set of sequencing error probabilities corresponding to all base qualities (one base quality corresponds to one sequencing error probability); the mathematical symbol ∈ represents the set belonging, so p∈P means that p is any element in P , and q∈P means that q is any element in P.

根据碱基质量阈值对reads上的碱基进行过滤(filter):把校正后碱基质量低于碱基质量阈值的碱基过滤掉。如果一个碱基被过滤掉了,那么认为这个碱基不存在,可以理解为没有测出任何碱基。Filter the bases on the reads according to the base quality threshold: filter out the bases whose base quality after correction is lower than the base quality threshold. If a base is filtered out, it is considered that the base does not exist, and it can be understood that no base has been detected.

需要注意的是,过滤碱基和过滤reads是两个概念。It should be noted that filtering bases and filtering reads are two concepts.

过滤之后,对每个family,取family最常见碱基(碱基出现次数最多)作为去重后consensus family碱基,也就是去重后的碱基。在此family既可以是针对碱基的也可以是针对reads的。After filtering, for each family, the most common base (the base with the most occurrences) of the family is taken as the consensus family base after deduplication, that is, the base after deduplication. Here family can be either base-oriented or reads-oriented.

然后,把重复的reads整合成一条consensus read(一致read)。在此需要提到无论去重的是reads还是碱基,去重后最终结果一样。因此既可以理解成对reads进行去重,也可以理解成对碱基进行过滤。Then, integrate the repeated reads into a consensus read (consistent read). It should be mentioned here that no matter whether the deduplication is reads or bases, the final result is the same after deduplication. Therefore, it can be understood as deduplication of reads, or as filtering of bases.

三、计算测序产生的偏好性,通过偏好性减少支持突变的有效reads数量,获得校正后的支持突变型的read数量。3. Calculate the preference generated by sequencing, reduce the number of effective reads that support mutations through preference, and obtain the corrected number of reads that support mutations.

1、计算测序过程中产生的偏好性,包括:deduplication bias,position bias,strand bias和mismatch bias。1. Calculate the bias generated during the sequencing process, including: deduplication bias, position bias, strand bias and mismatch bias.

Bias(a1,a2,b1,b2)=max(0,OR(LS(a1,a2,b1,b2))-1)*(b2÷(b2+b1))+1;Bias(a 1 , a 2 , b 1 , b 2 )=max(0, OR(LS(a 1 , a 2 , b 1 , b 2 ))-1)*(b 2 ÷(b 2 +b 1 ))+1;

在公式中,OR代表比值比;OR的计算公式如下:In the formula, OR stands for odds ratio; OR is calculated as follows:

OR(a1,a2,b1,b2)=(a1×b2)÷(a2×b1)。OR(a 1 , a 2 , b 1 , b 2 )=(a 1 ×b 2 )÷(a 2 ×b 1 ).

在公式中,LS代表Laplace平滑;LS公式如下:LS(a,b,c,d)=Laplace2(Laplace1(a,b,c,d));其中,Laplace1公式如下:Laplace1(a,b,c,d)=(a+p,b+p,c+p,d+p);Laplace2公式如下:Laplace2(a,b,c,d)=(a+p×α,b+p×α,c+p×β,d+p×β);In the formula, LS represents Laplace smoothing; LS formula is as follows: LS (a, b, c, d)=Laplace 2 (Laplace 1 (a, b, c, d)); Wherein, Laplace 1 formula is as follows: Laplace 1 ( a, b, c, d)=(a+p, b+p, c+p, d+p); Laplace 2 formula is as follows: Laplace 2 (a, b, c, d)=(a+p×α , b+p×α, c+p×β, d+p×β);

在Laplace1公式和Laplace2公式中,p的默认值是0.5;In the Laplace 1 formula and Laplace 2 formula, the default value of p is 0.5;

在Laplace2公式中,α和β公式如下:α=max(0,p×((a1+a2)/(b1+b2)-1));β=max(0,p×(b1+b2)/(a1+a2)-1);In the Laplace 2 formula, the formulas of α and β are as follows: α=max(0, p×((a 1 +a 2 )/(b 1 +b 2 )-1)); β=max(0, p×( b 1 +b 2 )/(a 1 +a 2 )-1);

对于deduplication bias而言,各参数的意义如下:For deduplication bias, the meaning of each parameter is as follows:

a1:去重去掉的支持任何等位基因的reads的数量;a 1 : the number of reads that support any alleles removed by deduplication;

a2:去重之后,支持任何等位基因的reads的数量;a 2 : After deduplication, the number of reads supporting any allele;

b1:去重去掉的支持突变等位基因的reads的数量;b 1 : the number of reads supporting mutant alleles removed by deduplication;

b2:去重之后,支持突变等位基因的reads的数量。b 2 : After deduplication, the number of reads supporting the mutant allele.

如果去重前有7条reads,去重后剩下2条reads,那么去重去掉了7-2=5条reads。If there are 7 reads before deduplication and 2 reads are left after deduplication, then 7-2=5 reads are removed after deduplication.

对于position bias而言,各参数的意义如下:For position bias, the meaning of each parameter is as follows:

针对每个方向(总共左右两个方向),按照方向延长1、3、6、10、15、21、28、36、45、55、66个碱基;然后代入以下四个数值:For each direction (a total of left and right directions), extend 1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 66 bases according to the direction; then substitute the following four values:

a1:在已经跨越延长后基因座的reads里,支持任何等位基因的reads的数量;a 1 : Among the reads that have spanned the extended locus, the number of reads supporting any allele;

a2:在没有跨越延长后基因座的reads里,支持任何等位基因的reads的数量;a 2 : Among the reads that do not span the extended locus, the number of reads supporting any allele;

b1:在已经跨越延长后基因座的reads里,支持突变等位基因的reads的数量;b 1 : among the reads that have spanned the extended locus, the number of reads supporting the mutant allele;

b2:在没有跨越延长后基因座的reads里,支持突变等位基因的reads的数量。b 2 : Number of reads supporting the mutant allele among reads that do not span the extended locus.

依次代入11个延长单位产生一条由11个数值组成的序列;序列中取每3个相邻数值作为一个数组,从而产生9个数组;对每个数组取最小值,从而产生9个最小值;取9个最小值的最大值为position bias。Substituting 11 extended units in turn produces a sequence consisting of 11 values; taking every 3 adjacent values in the sequence as an array, thereby generating 9 arrays; taking the minimum value for each array, thereby generating 9 minimum values; Take the maximum value of the 9 minimum values as position bias.

对于strand bias而言,代入以下数值:For strand bias, substitute the following values:

a1:在染色体链S的正向,支持任何等位基因的reads的数量;a 1 : In the forward direction of the chromosome strand S, the number of reads supporting any allele;

a2:在染色体链S的反向,支持任何等位基因的reads的数量;a 2 : In the reverse direction of the chromosome strand S, the number of reads supporting any allele;

b1:在染色体链S的正向,支持突变等位基因的reads的数量;b 1 : in the forward direction of the chromosome strand S, the number of reads supporting the mutant allele;

b2:在染色体链S的反向,支持突变等位基因的reads的数量。b 2 : In the reverse direction of chromosome strand S, the number of reads supporting the mutant allele.

同时计算base quality imbalance(简称BQI,公式中用I表示,有1个I)和strandposition imbalance(简称SPI,公式中用I表示,有2个I),通过以下方法计算出3个I的数值,然后取最小数值;然后让strand bias除以此最小值,以此来归一化strand bias;最后确保strand bias最低至少是1;Simultaneously calculate base quality imbalance (abbreviated as BQI, represented by I in the formula, with 1 I) and strandposition imbalance (SPI for short, represented by I in the formula, with 2 Is), and calculate the values of 3 Is by the following method, Then take the minimum value; then divide the strand bias by this minimum value to normalize the strand bias; finally ensure that the strand bias is at least 1;

对于BQI,c是另外一条链的所有等位基因的平均碱基质量,d是这一条链的突变等位基因的平均碱基质量,而I=10d-cFor BQI, c is the average base quality of all alleles of another chain, d is the average base quality of the mutant alleles of this chain, and I=10 dc ;

对于SPI,c是另外一条链的所有等位基因的到read末端的平均碱基距离,d是怀疑有偏好性链的突变等位基因的到read末端的平均碱基距离,而

Figure BDA0002426561100000121
每一条read有两个末端:左边和右边。因此左末端和右末端分别产生两个I。For SPI, c is the average base distance to the end of the read for all alleles of the other strand, d is the average base distance to the end of the read for the mutant alleles of the suspected biased strand, and
Figure BDA0002426561100000121
Each read has two ends: left and right. So two I's are generated at the left and right ends respectively.

对于mismatch bias而言,让x=1,2,3,...,11;x是mismatch数量阈值(通过mismatch数量<x可以推断出)。1,2,……,11是不同的尝试的阈值。For the mismatch bias, let x=1, 2, 3, ..., 11; x is the mismatch number threshold (it can be inferred by the mismatch number<x). 1, 2, ..., 11 are thresholds for different attempts.

针对每个x的值,代入以下四个数值:For each value of x, substitute the following four values:

a1:mismatch数量<x,支持任何等位基因的reads数量;a 1 : the number of mismatches < x, the number of reads supporting any allele;

a2:mismatch数量≥x,支持任何等位基因的reads数量;a 2 : The number of mismatches ≥ x, the number of reads supporting any allele;

b1:mismatch数量<x,支持突变等位基因的reads数量;b 1 : the number of mismatches < x, the number of reads supporting the mutant allele;

b2:mismatch数量≥x,支持突变等位基因的reads数量;b 2 : the number of mismatches ≥ x, the number of reads supporting the mutant allele;

依次代入11个x值产生一条由11个数值组成的序列;序列中取每5个相邻数值作为一个数组,从而产生7个数组;对每个数组取最小值,从而产生7个最小值;取7个最小值的最大值为mismatch bias。Substituting 11 x values in turn produces a sequence consisting of 11 values; taking every 5 adjacent values in the sequence as an array, thus generating 7 arrays; taking the minimum value for each array, thereby generating 7 minimum values; Take the maximum value of the 7 minimum values as the mismatch bias.

根据上述在每一条染色体链算出四个偏好性。选最高偏好性作为染色体链整体偏好性。对支持突变型的read数目做校正,校正后的reads数目等于校正前的reads数目除以整体偏好性。Four preferences are calculated on each chromosome strand according to the above. Select the highest preference as the overall preference of the chromosome chain. Correct the number of reads that support the mutant type, and the number of reads after correction is equal to the number of reads before correction divided by the overall preference.

四、利用校正后的支持突变型的read数目,计算所有单个样本相关质量(quality),然后得出单个样本变异质量,通过变异质量进行判定。4. Use the corrected number of reads supporting the mutant type to calculate the relevant quality of all individual samples, and then obtain the variation quality of a single sample, and judge by the variation quality.

包括:include:

1、针对每个需要计算quality的位点,采用以下公式计算family quality,single-strand quality,double-strand quality,tumor-normal quality和variantquality。1. For each locus whose quality needs to be calculated, use the following formula to calculate family quality, single-strand quality, double-strand quality, tumor-normal quality and variant quality.

Quality(p,B,b,r)=logr(BIAS(B,p×B,B,b)×(r-1)+1)×10×log10(OR(B,p×B,B,b))。Quality (p, B, b, r) = log r (BIAS (B, p × B, B, b) × (r-1) + 1) × 10 × log 10 (OR (B, p × B, B , b)).

其中:in:

(1)针对一个family里的reads,让r=2并且使用以下定义计算family quality:(1) For the reads in a family, let r=2 and use the following definition to calculate the family quality:

p:步骤(a3)reads去重中计算出的碱基质量的阈值所对应的错误概率;p: error probability corresponding to the threshold value of base quality calculated in step (a3) reads deduplication;

B:从数据实际观测到支持任何等位基因的read数量;B: The number of reads that support any allele from the actual observation of the data;

b:从数据实际观测到支持突变等位基因的read数量。b: Number of reads actually observed from the data supporting the mutant allele.

(2)如果高通量测序实验采用PCR扩增方法建库,则r=1.5,如果高通量测序实验采用捕获或者WGS方法建库,则r=1.25。比对到一条染色体链上的families都在一条single-strand上;针对一条single-strand,使用以下定义计算single-strand quality(最大family quality如下:对于C:G>T:A突变形式是44,对于T:A>C:G突变形式是48,对于其他SNV突变形式是52,对于所有InDel突变形式是60)。如果single-strand quality大于最大family quality数值,则让single-strand quality等于最大family quality数值;(2) If the high-throughput sequencing experiment uses the PCR amplification method to build the library, then r=1.5, and if the high-throughput sequencing experiment uses the capture or WGS method to build the library, then r=1.25. The families compared to a chromosome chain are all on a single-strand; for a single-strand, use the following definition to calculate the single-strand quality (the maximum family quality is as follows: for C: G > T: A mutant form is 44, For T: A>C:G mutant form is 48, for other SNV mutant forms is 52, for all InDel mutant forms is 60). If the single-strand quality is greater than the maximum family quality value, then let the single-strand quality be equal to the maximum family quality value;

p:每个可能的family quality(即尝试所有的family qualities)所对应的错误概率;p: The error probability corresponding to each possible family quality (ie try all family qualities);

B:实际观测到支持任何等位基因的family数量;B: The number of families actually observed to support any allele;

b:实际观测到支持突变等位基因的family数量;b: The number of families actually observed to support the mutant allele;

在尝试所有的p之后,取最大single-strand quality作为最终single-strandquality。After trying all p, take the largest single-strand quality as the final single-strand quality.

(3)每一条single-strand在一条double-strand上的正向或反向;针对一条single-strand,使用以下定义计算double-strand quality:(3) The forward or reverse direction of each single-strand on a double-strand; for a single-strand, use the following definition to calculate the double-strand quality:

v1′=v1+v2×min(1,(min(w1+w2,w0)-w1)/w1);v 1 '=v 1 +v 2 ×min(1, (min(w 1 +w 2 ,w 0 )-w 1 )/w 1 );

其中v1′是某一条链上的最终变异质量,v1是这一条链的变异质量(即最终single-strand quality),v2是另外一条链上的变异质量,w1是这一条链的family quality的平均值,w2是另外一条链的family quality的平均值,w0等于最大family quality(最大familyquality如下:对于C:G>T:A突变形式是44,对于T:A>C:G突变形式是48,对于其他SNV突变形式是52,对于所有InDel突变形式是60)加上10;Among them, v 1 ′ is the final mutation quality on a certain chain, v 1 is the mutation quality of this chain (that is, the final single-strand quality), v 2 is the mutation quality on another chain, and w 1 is the mutation quality of this chain. The average value of family quality, w 2 is the average value of the family quality of another chain, w 0 is equal to the maximum family quality (the maximum family quality is as follows: for C: G>T: A mutant form is 44, for T: A>C: G mutant form is 48, for other SNV mutant forms is 52, for all InDel mutant forms is 60) plus 10;

交换v1和v2并且交换w1和w2,根据计算v1′公式的对称性计算v2′;Exchange v 1 and v 2 and exchange w 1 and w 2 to calculate v 2 ′ according to the symmetry of the formula for calculating v 1 ′;

v2′=v2+v1×min(1,(min(w1+w2,w0)-w2)/w2)。v 2 ′=v 2 +v 1 ×min(1, (min(w 1 +w 2 , w 0 )−w 2 )/w 2 ).

2、把double-strand quality定义成单个样本变异质量。2. Define double-strand quality as the variation quality of a single sample.

3、按照上述步骤1和2,分别计算得到tumor样本和normal对照样本的单样本突变质量。然后计算TLOD。TLOD代表变异既不来源于躯体(soma)也不来源于胚胎(germline)的概率,具体如下:3. According to the above steps 1 and 2, calculate the single-sample mutation quality of the tumor sample and the normal control sample, respectively. Then calculate TLOD. TLOD represents the probability that the variation is neither derived from the body (soma) nor from the embryo (germline), as follows:

首先,计算以下函数的输出值。First, calculate the output value of the following function.

TNreward(a1,a2,b1,b2)=maxH(TNrewardBinomial(a1,a2,b1,b2),PLQTN(b2/b1,a2/a1));TNreward(a 1 , a 2 , b 1 , b 2 )=maxH(TNrewardBinomial(a 1 , a 2 , b 1 , b 2 ), PLQTN(b 2 /b 1 , a 2 /a 1 ));

其中,TNrewardBinomial(a1,a2,b1,b2)=10/log(10)×(a1×KLBernoulli(b2/b1,a2/a1))Among them, TNrewardBinomial(a 1 , a 2 , b 1 , b 2 )=10/log(10)×(a1×KL Bernoulli(b 2 /b 1 , a 2 /a 1 ))

如果max(a,-a)<max(b,-b),那么maxH(a,b)=a,否则maxH(a,b)=b;If max(a,-a)<max(b,-b), then maxH(a,b)=a, otherwise maxH(a,b)=b;

PLQTN(t,n)=3×10×log10(min(t/n,102.5/3))。PLQTN(t, n)=3×10×log 10 (min(t/n, 10 2.5/3 )).

TNpenal(a1,a2,b1,b2)=max(0,min(DSVQnormal,PLQ(a2/a1))-12.5×(max(0,OR(a1,a2,b1,b2)-1))2);其中DSVQnormal指的是normal样本的单样本突变质量,也就是normal样本的double-strand quality。TNpenal(a 1 , a 2 , b 1 , b 2 )=max(0, min(DSVQnormal, PLQ(a 2 /a 1 ))-12.5×(max(0, OR(a 1 , a 2 , b 1 , b 2 )-1)) 2 ); where DSVQnormal refers to the single-sample mutation quality of the normal sample, that is, the double-strand quality of the normal sample.

a1、a2、b1和b2含义如下:a 1 , a 2 , b 1 and b 2 have the following meanings:

a1:配对normal对照样本中支持任何等位基因的reads的family quality总和;a 1 : The sum of the family quality of the reads supporting any allele in the paired normal control sample;

a2:配对normal对照样本中支持突变等位基因的reads的family quality总和;a 2 : The sum of the family quality of the reads supporting the mutant allele in the paired normal control sample;

b1:tumor样本中支持任何等位基因的reads的family quality总和;b 1 : the sum of family quality of reads supporting any allele in the tumor sample;

b2:tumor样本中支持突变等位基因的reads的family quality总和。b 2 : The sum of the family quality of the reads supporting the mutant allele in the tumor sample.

然后,计算TLOD,计算公式如下:Then, calculate TLOD, the calculation formula is as follows:

TLOD(a1,a2,b1,b2)=min(DSVQtumor,PLQ(b2/b1))+TNreward(a1,a2,b1,b2)-TNpenal(a1,a2,b1,b2)。TLOD(a 1 ,a 2 ,b 1 ,b 2 )=min(DSVQtumor,PLQ(b 2 /b 1 ))+TNreward(a 1 ,a 2 ,b 1 ,b 2 )-TNpenal(a 1 ,a 2 , b 1 , b 2 ).

其中,PLQ(f)=90+3×10×log10(f)。DSVQtumor指的是tumor样本的单样本突变质量,也就是tumor样本的double-strand quality。Wherein, PLQ(f)=90+3×10×log 10 (f). DSVQtumor refers to the single-sample mutation quality of the tumor sample, that is, the double-strand quality of the tumor sample.

(C4)计算NLOD。NLOD代表来变异来源于胚胎(germline)的概率。具体如下:(C4) Calculate NLOD. NLOD represents the probability that the variation originates from the embryo (germline). details as follows:

首先,针对每个样本(总共有tumor和配对的normal两个样本),计算每个genotype(GT)的genotype likelihood(GL)。GT和GL的定义和说明在https://samtools.github.io/hts-specs/VCFv4.2.pdf中。总而言之,homref(纯合野生型),hetero(杂合型)和homalt(纯合突变型)是三种GT。这三种GT的GL的计算方法如下:First, for each sample (there are two samples of tumor and paired normal in total), the genotype likelihood (GL) of each genotype (GT) is calculated. The definitions and descriptions of GT and GL are in https://samtools.github.io/hts-specs/VCFv4.2.pdf. To sum up, homref (homozygous wild type), hetero (heterozygous type) and homalt (homozygous mutant type) are three types of GT. The calculation methods of the GL of these three GTs are as follows:

GL(f,homref)=max(GLPowerLaw(f,homref),GLBinomial(f,homref));GL(f, homref) = max(GLPowerLaw(f, homref), GLBinomial(f, homref));

GL(f,hetero)=max(GLPowerLaw(f,hetero),GLBinomial(f,hetero));GL(f, hetero) = max(GLPowerLaw(f, hetero), GLBinomial(f, hetero));

GL(f,homalt)=max(GLPowerLaw(f,homalt),GLBinomial(f,homalt))。GL(f, homalt) = max(GLPowerLaw(f, homalt), GLBinomial(f, homalt)).

其中,GLPowerLaw(f,homref)=10×3×min(0,log10(0.02×(1-f)/f));Among them, GLPowerLaw(f, homref) = 10×3×min(0, log 10 (0.02×(1-f)/f));

GLPowerLaw(f,hetero)=10×3×min(0,log10(min(f/(1-f),(1-f)/f)));GLPowerLaw(f, hetero) = 10×3×min(0, log 10 (min(f/(1-f), (1-f)/f)));

GLPowerLaw(f,homalt)=10×3×min(0,log10(0.02×f/(1-f)));GLPowerLaw(f, homalt) = 10×3×min(0, log 10 (0.02×f/(1-f)));

GLBinomial(f,homref)=10/log(10)×d×KLBernoulli(min(0.02,f),f);GLBinomial(f, homref) = 10/log(10) × d × KL Bernoulli(min(0.02, f), f);

GLBinomial(f,hetero)=10/log(10)×d×KLBernoulli(0.5,f);GLBinomial(f, hetero)=10/log(10)×d×KL Bernoulli(0.5,f);

GLBinomial(f,homalt)=10/log(10)×d×KLBernoulli(max(1-0.02,f),f);GLBinomial(f,homalt)=10/log(10)×d×KL Bernoulli(max(1-0.02,f),f);

KLBernoulli(x,y)=y×log(y/x)+(1-y)×log((1-y)/(1-x));KL Bernoulli(x,y)=y×log(y/x)+(1-y)×log((1-y)/(1-x));

如果x=0、y=0、x是不定式或者y是不定式,那么KLBernoulli(x,y)=0。比如说,(0/0)是不定式。If x=0, y=0, x is an infinitive or y is an infinitive, then KL Bernoulli(x,y)=0. For example, (0/0) is an infinitive.

并且a指的是支持突变型基因的read数量,d指的是支持任何基因型的read数量,f=a/d是突变频率。And a refers to the number of reads supporting a mutant gene, d refers to the number of reads supporting any genotype, and f=a/d is the mutation frequency.

然后计算NLOD。NLOD=max(Qtumorhomref,Qnormalhomref)。Then calculate the NLOD. NLOD=max(Qtumorhomref, Qnormalhomref).

其中在tumor样本上,Qtumorhomref=-3-max(GL(f,homalt),GL(f,hetero))。Qtumorhomref是使用tumor样本数据计算的,所以跟配对的normal样本无任何关系。where on the tumor sample, Qtumorhomref=-3-max(GL(f, homalt), GL(f, hetero)). Qtumorhomref is calculated using tumor sample data, so it has nothing to do with paired normal samples.

其中在配对的normal样本上,Qnormalhomref=GL(f,homref)-max(GL(f,homalt),GL(f,hetero))。Qnormalhomref是使用配对的normal样本数据计算的,所以跟tumor样本无任何关系。Wherein on the paired normal samples, Qnormalhomref=GL(f, homref)-max(GL(f, homalt), GL(f, hetero)). Qnormalhomref is calculated using paired normal sample data, so it has nothing to do with tumor samples.

5、计算最终变异质量5. Calculate the final mutation quality

按照如下公式计算最终变异质量:min(TLOD,NLOD+31)。Calculate the final mutation quality according to the following formula: min(TLOD, NLOD+31).

如果最终变异质量至少是60,则判定为阳性并且输出变异形式;否则判定为阴性。60的质量代表百万分之一的假阳率。变异检测软件MuTect(Cibulskis K,Lawrence M S,Carter S L,et al.Sensitive detection of somatic point mutations in impure andheterogeneous cancer samples[J].Nature biotechnology,2013,31(3):213.)间接地使用了60的阈值,同时BWA能输出的最大比对质量是60。因此使用了60的阈值。变异输出格式为VCF。VCF格式定义在(Danecek P,Auton A,Abecasis G,et al.The variant callformat and VCFtools[J].Bioinformatics,2011,27(15):2156-2158.)中有详细描述。If the final mutation quality is at least 60, it is judged as positive and the mutant form is output; otherwise, it is judged as negative. A quality of 60 represents a false positive rate of one in a million. The mutation detection software MuTect (Cibulskis K, Lawrence M S, Carter S L, et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples[J]. Nature biotechnology, 2013, 31(3): 213.) is used indirectly The threshold of 60 is set, and the maximum comparison quality that BWA can output is 60. Therefore a threshold of 60 was used. The mutation output format is VCF. The VCF format definition is described in detail in (Danecek P, Auton A, Abecasis G, et al. The variant callformat and VCFtools[J]. Bioinformatics, 2011, 27(15): 2156-2158.).

检测InDel:Detect InDel:

与SNV变异相比,在“碱基质量校正”之前增加“碱基质量估计”的步骤,其余均为SNV变异检测方法相同。“碱基质量估计”方法具体如下:Compared with SNV variation, the step of "base quality estimation" is added before "base quality correction", and the rest are the same as SNV variation detection methods. The Base Quality Estimation method is as follows:

InDelQual(x,y,a,b,c)=min(x+10,y+10,STRQual(a,b))+10×log10(c)InDelQual(x, y, a, b, c) = min(x+10, y+10, STRQual(a, b))+10×log 10 (c)

其中InDelQual是跟碱基质量有相同用途的InDel质量,x和y分别是发生InDel的前一个碱基和后一个碱基的碱基质量,a和b分别是短串联重复序列(STR)的单元大小(repeat size)和单元数量(repeat number),c是InDel的碱基数量。STRQual函数定义如下:Among them, InDelQual is the InDel quality that has the same purpose as the base quality, x and y are the base quality of the previous base and the next base where InDel occurs, respectively, and a and b are the units of the short tandem repeat (STR) The size (repeat size) and the number of units (repeat number), c is the number of bases of InDel. The STRQual function is defined as follows:

STRQual(a,b)=44-10×log10(1+8×softplus(a×b-8)/(θ×a2))STRQual(a, b)=44-10×log 10 (1+8×softplus(a×b-8)/(θ×a 2 ))

softplus函数定义如下:The softplus function is defined as follows:

softplus(v)=log(1+exp(v)),其中log指的是自然对数。softplus(v)=log(1+exp(v)), where log refers to the natural logarithm.

如果InDel是一个STR单元的缺失,那么θ=0.25,除此之外,θ=1。If InDel is a deletion of a STR unit, then θ=0.25, otherwise θ=1.

实施例2、本发明基于高通量测序的基因变异检测方法的应用实例Example 2. Application examples of the gene variation detection method based on high-throughput sequencing of the present invention

一、本发明方法和smCounter2方法的比较One, the comparison of the method of the present invention and smCounter2 method

采用本发明实施例1建立的方法和现有的smCounter2方法(参考文献“Xu C,Gu X,Padmanabhan R,et al.smCounter2:an accurate low-frequency variant caller fortargeted sequencing data with unique molecular identifiers[J].Bioinformatics,2019,35(8):1299-1309.”)分别在N13532(SRR7526729)和N0261(SRR7526728)样本上对SNV变异进行检测。Adopt the method that the embodiment of the present invention 1 establishes and existing smCounter2 method (reference "Xu C, Gu X, Padmanabhan R, et al.smCounter2: an accurate low-frequency variant caller for targeted sequencing data with unique molecular identifiers [J] .Bioinformatics, 2019, 35(8):1299-1309.") Detected SNV variation on N13532 (SRR7526729) and N0261 (SRR7526728) samples, respectively.

N13532(SRR7526729)样本和N0261(SRR7526728)样本:数据在以下英文网站可以找到。https://www.ncbi.nlm.nih.gov/sra/SRR7526728;https://www.ncbi.nlm.nih.gov/sra/SRR7526729。除此之外,smCounter2文章中也有描述。SRR7526728产品号:CDHS-13907Z-562;SRR7526729产品号:CDHS-13532Z-10181。N13532 (SRR7526729) sample and N0261 (SRR7526728) sample: data can be found at the following English websites. https://www.ncbi.nlm.nih.gov/sra/SRR7526728; https://www.ncbi.nlm.nih.gov/sra/SRR7526729. In addition, it is also described in the smCounter2 article. SRR7526728 product number: CDHS-13907Z-562; SRR7526729 product number: CDHS-13532Z-10181.

图1至图4显示了本发明和smCounter2在N13532(SRR7526729)和N0261(SRR7526728)样本(项目编号:SRP153933)上的精确率和召回率。由图可见,本发明的优点是更高的F1分数和曲线下面积。Figures 1 to 4 show the precision and recall rates of the present invention and smCounter2 on N13532 (SRR7526729) and N0261 (SRR7526728) samples (item number: SRP153933). As can be seen from the figure, the advantages of the present invention are higher F1 score and area under the curve.

smCounter2需要通过对相同的样本进行重复测序并且预先得到样本的某些(但不是全部)已知阳性突变来做机器学习训练,但是本发明不需要这些,因此本发明减少了对已有数据的要求。针对血液样本的分子标签数据(图1至图4),本发明提高了真阳率并且降低了假阳率。smCounter2 needs to do machine learning training by repeatedly sequencing the same sample and obtaining some (but not all) known positive mutations of the sample in advance, but the present invention does not require these, so the present invention reduces the requirement for existing data . For the molecular label data of blood samples (Fig. 1 to Fig. 4), the present invention improves the true positive rate and reduces the false positive rate.

二、本发明方法、Mutect2方法和Strelka2方法在标准混合样本上的性能比较Two, the performance comparison of the inventive method, the Mutect2 method and the Strelka2 method on the standard mixed sample

采用本发明实施例1建立的方法和现有的Mutect2方法和Strelka2方法分别在标准混合样本上对SNV变异进行检测。The method established in Example 1 of the present invention and the existing Mutect2 method and Strelka2 method are used to detect SNV variation on standard mixed samples.

Strelka2方法:参见“Kim S,Scheffler K,Halpern A L,et al.Strelka2:fastand accurate calling of germline and somatic variants[J].Nature methods,2018,15(8):591.”一文。Strelka2源代码在https://github.com/Illumina/strelka。Strelka2 method: see "Kim S, Scheffler K, Halpern A L, et al. Strelka2: fast and accurate calling of germline and somatic variants [J]. Nature methods, 2018, 15(8): 591." Strelka2 source code is at https://github.com/Illumina/strelka.

Mutect2方法:参见“Cibulskis K,Lawrence M S,Carter S L,et al.Sensitivedetection of somatic point mutations in impure and heterogeneous cancersamples[J].Nature biotechnology,2013,31(3):213.”一文。Mutect2源代码在https:// github.com/broadinstitute/gatkMutect2 method: see "Cibulskis K, Lawrence M S, Carter S L, et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancersamples[J]. Nature biotechnology, 2013, 31(3): 213." Mutect2 source code is at https://github.com/broadinstitute/gatk .

标准混合样本含有虚拟的tumor和normal的WGS(全基因组)数据:tumor是平均测序深度为90×的混合比例是3/7的NA12878/NA24385两个样本混合,normal是平均测序深度为30×的纯NA24385单个样本。标准混合样本数据在:ftp://ftp- trace.ncbi.nlm.nih.gov/giab/ftp/use_cases/mixturesThe standard mixed sample contains virtual tumor and normal WGS (whole genome) data: tumor is a mixture of two samples of NA12878/NA24385 with an average sequencing depth of 90×, and the normal is a mixture of NA12878/NA24385 samples with an average sequencing depth of 30× A single sample of pure NA24385. Standard mixture sample data are at: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/use_cases/mixtures .

NA12878的germline变异都在:ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/NISTv3.3.2/GRCh37/HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz。The germline variants of NA12878 are all in: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/NISTv3.3.2/GRCh37/HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_ CHROM1-X_v .3.3.2_highconf_PGandRTGphasetransfer.vcf.gz.

其中关于这些变异的描述性文档在:The descriptive documentation for these variants is at:

ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/NISTv3.3.2/README_NISTv3.3.2.txt。ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/NISTv3.3.2/README_NISTv3.3.2.txt.

图5和图6显示了本发明方法、Mutect2方法和Strelka2方法在标准混合样本上的性能测试结果。由图可见,本发明的优点是更高的F1分数和曲线下面积。Figure 5 and Figure 6 show the performance test results of the method of the present invention, the Mutect2 method and the Strelka2 method on standard mixed samples. As can be seen from the figure, the advantages of the present invention are higher F1 score and area under the curve.

Strelka2需要通过对其他相似的样本进行测序并且预先得到其他相似样本的全部已知阳性突变来做机器学习训练,但是本发明不需要这些。总而言之,针对组织样本的WGS数据,本发明提高了真阳率并且降低了假阳率,并且有可能减少生物学试验。Strelka2 needs to do machine learning training by sequencing other similar samples and obtaining all known positive mutations of other similar samples in advance, but the present invention does not need these. In summary, the present invention increases the true positive rate and reduces the false positive rate for WGS data of tissue samples, and potentially reduces biological testing.

本实施例结果表明,使用本发明实施例1建立的方法(简称,UVC),并且跟其他软件做了比较和评估。评估显示UVC性能非常好:无论数据是否带有分子标签,UVC既提高了真阳率又降低了假阳率。The results of this example show that the method established in Example 1 of the present invention (abbreviated as UVC) is used, and compared and evaluated with other software. The evaluation showed that UVC performed very well: UVC both increased the true positive rate and decreased the false positive rate, regardless of whether the data were molecularly labeled or not.

Claims (8)

1. A gene variation detection method based on high-throughput sequencing comprises the following steps:
(A) Preprocessing the high-throughput sequencing data, mainly comprising base quality correction and read de-duplication;
(B) Calculating the preference generated by sequencing, and reducing the number of effective reads supporting mutation through the preference to obtain the corrected read number supporting mutation;
(C) Calculating the relative quality of all single samples by using the corrected read number of the support mutant, then obtaining the variation quality of the single sample, and judging through the variation quality;
the step (C) is carried out as follows:
(C1) For each site where quality needs to be calculated, calculating family quality, single-strand quality, double-strand quality by using the following formulas;
Quality(p,B,b,r)=logr(BIAS(B,p×B,B,b)×(r-1)+1)×10×log 10 (OR(B,p×B,B,b));
wherein Bias (B, pxb, B) = max (0, or (LS (B, pxb, B)) -1) × (B ÷ (B + B)) +1;
OR represents an odds ratio; the OR is calculated as follows:
OR(B,p×B,B,b)=(B×b)÷(p×B×B);
LS stands for Laplace smoothing; LS formula is as follows:
LS(B,p×B,B,b)=Laplace 2 (Laplace 1 (B,p×B,B,b));
laplace1 is expressed as: laplace 1 (B,p×B,B,b)=(B+p′,p×B+p′,B+p′,b+p′);
Laplace2 is expressed as: laplace 2 (B,p×B,B,b)=(B+p′×α,p×B+p′×α,B+p′×β,b+p′×β);
In Laplace 1 Formula and Laplace 2 In the formula, the default value of p' is 0.5;
in Laplace 2 In the formulas, α and β are as follows:
α=max(0,p′×((B+p×B)/(B+b)-1));
β=max(0,p′×(B+b)/(B+p×B)-1);
wherein:
(1) Let r =2 and calculate the family quality using the following definition for reads in a family:
p: error probability corresponding to the base quality threshold calculated in the step (A) reads deduplication;
b: the number of reads that actually observed to support any allele from the data;
b: actual observations from the data support the number of reads for the mutant allele;
(2) R =1.5 if the high throughput sequencing experiment is banked using PCR amplification, r =1.25 if the high throughput sequencing experiment is banked using capture or WGS; comparing that the families on one chromosome chain are on one single-strand; for a single-strand, the single-strand quality is calculated using the following definitions:
p: the error probability corresponding to each possible family quality;
b: actual observation of family numbers supporting any allele;
b: actual observation of family numbers supporting mutant alleles;
if the single-strand quality is greater than the maximum similarity quality value, then let the single-strand quality equal the maximum similarity quality value; the maximum family quality is as follows: for C: g > T: the a mutant form was 44, for T: a > C: the G mutant form was 48, 52 for the other SNV mutant forms, 60 for all InDel mutant forms;
after all p's are tried, the maximum single-strand quality is taken as the final single-strand quality;
(3) Forward or reverse direction of each single-strand on a double-strand; for a single-strand, double-strand quality was calculated using the following definitions:
v 1 ′=v 1 +v 2 ×min(1,(min(w 1 +w 2 ,w 0 )-w 1 )/w 1 );
wherein v is 1 ' is the final variable mass on a strand, v 1 Is the variable quality of this chain, i.e.the final single-strand quality, v 2 Is the quality of variation on the other strand, w 1 Is the average value of the family quality, w, of this chain 2 Is the average value of the family quality of the other chain, w 0 Equal to the maximum similarity quality, which is as follows, plus 10: for C: g > T: the a mutant form was 44, for T: a > C: the G mutant form was 48, 52 for the other SNV mutant forms, and 60 for all InDel mutant forms;
exchange v 1 And v 2 And exchange w 1 And w 2 According to the calculation of v 1 ' symmetry calculation of formula v 2 ′;
v 2 ′=v 2 +v 1 ×min(1,(min(w 1 +w 2 ,w 0 )-w 2 )/w 2 );
(C2) Defining double-strand quality as a single sample variation quality;
(C3) Calculating the single sample mutation quality of the tumor sample and the normal control sample according to the (C1) and the (C2) respectively; then calculating TLOD; TLOD represents the probability that the variation is neither somatic nor embryonic, as follows:
firstly, calculating the output values of the following functions;
TNreward(a 1 ,a 2 ,b 1 ,b 2 )=maxH(TNrewardBinomial(a 1 ,a 2 ,b 1 ,b 2 ),PLQTN(b 2 /b 1 ,a 2 /a 1 ));
wherein, TNrewardBonomial (a) 1 ,a 2 ,b 1 ,b 2 )=10/log(10)×(a 1 ×KLBernoulli(b 2 /b 1 ,a 2 /a 1 ) ); wherein, KLBernoulli (b) 2 /b 1 ,a 2 /a 1 )=(a 2 /a 1 )×log((a 2 /a 1 )/(b 2 /b 1 ))+(1-(a 2 /a 1 ))×log((1-(a 2 /a 1 ))/(1-(b 2 /b 1 ) ); if b is 2 /b 1 =0、a 2 /a 1 =0、b 2 /b 1 Is an indefinite form or a 2 /a 1 Is an infinite form, then KLBernoulli (b) 2 /b 1 ,a 2 /a 1 ) =0; maxH (a, b) = a if max (a, -a) < max (b, -b), otherwise maxH (a, b) = b;
PLQTN(t,n)=3×10×log 10 (min(t/n,10 2.5/3 ));
TNpenal(a 1 ,a 2 ,b 1 ,b 2 )=max(0,min(DSVQnormal,PLQ(a 2 /a 1 ))-12.5×(max(0,OR(a 1 ,a 2 ,b 1 ,b 2 )-1)) 2 ) (ii) a Wherein DSVQnormal refers to the single sample mutation quality of normal samples, namely double-strand quality of normal samples; OR represents the ratio of ratios, and the calculation formula of OR is as follows: OR (a) 1 ,a 2 ,b 1 ,b 2 )=(a 1 ×b 2 )÷(a 2 ×b 1 );a 1 、a 2 、b 1 And b 2 The meanings are as follows:
a 1 : family q of reads supporting any allele in the paired normal control sample(ii) a total of the qualitys;
a 2 : the sum of the family quality of reads supporting mutant alleles in the paired normal control sample;
b 1 : family quality sum of reads supporting any allele in the tomor sample;
b 2 : family quality sum of reads supporting mutant alleles in the tomor sample;
then, TLOD is calculated as follows:
TLOD(a 1 ,a 2 ,b 1 ,b 2 )=min(DSVQtumor,PLQ(b 2 /b 1 ))+TNreward(a 1 ,a 2 ,b 1 ,b 2 )-TNpenal(a 1 ,a 2 ,b 1 ,b 2 );
wherein PLQ (f) =90+3 × 10 × log 10 (f) (ii) a DSVQtomor refers to the single-sample mutation quality of a tomor sample, i.e., the double-strand quality of a tomor sample;
(C4) Calculating NLOD; NLOD represents the probability of coming from an embryo for variation, as follows:
firstly, calculating the genotype likelihood of each genotype aiming at each sample; GT for genetic type, GL for genetic type likelihood for short; the GL of three GTs, homozygous wild-type, heterozygous and homozygous mutant, was calculated as follows:
GL formula for homozygous wild type: GL (f, homref) = max (GLPowerLaw (f, homref), GLBinomial (f, homref));
GL calculation formula for heterozygote type: GL (f, hetero) = max (GLPowerLaw (f, hetero), GLBinomial (f, hetero));
GL calculation formula for homozygous mutant: GL (f, homalt) = max (GLPowerLaw (f, homalt), GLBinomial (f, homalt));
where GLPowerLaw (f, homref) =10 × 3 × min (0,log) 10 (0.02×(1-f)/f));
GLPowerLaw(f,hetero)=10×3×min(0,log 10 (min(f/(1-f),(1-f)/f)));
GLPowerLaw(f,homalt)=10×3×min(0,log 10 (0.02×f/(1-f)));
GLBinomial(f,homref)=10/log(10)×d×KLBernoulli(min(0.02,f),f);
GLBinomial(f,hetero)=10/log(10)×d×KLBernoulli(0.5,f);
GLBinomial(f,homalt)=10/log(10)×d×KLBernoulli(max(1–0.02,f),f);
KLBernoulli (x, y) = y × log (y/x) + (1-y) × log ((1-y)/(1-x)); KLBernoulli (x, y) =0 if x =0, y =0, x is indeterminate or y is indeterminate;
and a refers to the number of reads supporting mutant genes, d refers to the number of reads supporting any genotype, f = a/d is the mutation frequency;
then, calculating NLOD; NLOD = max (qtumhorref, qnormalomref);
wherein, on the tomor sample, qtomormref = -3-max (GL (f, homalt), GL (f, hetero));
wherein Qnormalhomref = GL (f, homref) -max (GL (f, homalt), GL (f, hetero));
(C5) The final quality of variation was calculated according to the following formula: min (TLOD, NLOD + 31);
if the final quality of variation is at least 60, determining as positive and outputting a variant form; otherwise, the result is judged to be negative.
2. The method of claim 1, wherein: in the method, the genetic variation detected is an SNV variation or an InDel variation.
3. The method of claim 1, wherein: when the detected genetic variation is an InDel variation, the step (A) further comprises a step of estimating the base quality before the base quality correction; the base quality estimation is performed as follows:
InDelQual(x,y,a,b,c)=min(x+10,y+10,STRQual(a,b))+10×log 10 (c);
wherein InDelQual is InDel mass having the same purpose as base mass, x and y are base mass of a previous base and a next base of the occurrence of InDel, a and b are unit size and unit number of the short tandem repeat sequence, and c is base number of InDel; the STRGUAL function is defined as follows:
STRQual(a,b)=44-10×log 10 (1+8×softplus(a×b-8)/(θ×a 2 ));
the softplus function is defined as follows:
softplus (v) = log (1 + exp (v)), where log refers to the natural logarithm;
if InDel is a deletion of one short tandem repeat unit, then θ =0.25, and in addition θ =1.
4. A method according to any one of claims 1-3, characterized in that: in the step (A), the base quality correction is carried out as follows:
if the high throughput sequencing is double-ended sequencing, merging before de-duplication and filtering, merging the R1 end and the R2 end, and then using the base quality of the merged reads;
for the overlapping region of the R1 end and the R2 end, the base quality after combination is calculated as follows:
when the bases corresponding to p and q are the same, the following formula is used: max (p, q);
when the bases corresponding to p and q are different, the following formula is used for calculation: max (p, q) -min (p, q);
wherein p is the base mass before the end R1 is corrected, and q is the base mass before the end R2 is corrected;
for base masses in non-overlapping regions of the R1 end and the R2 end, the combined base masses are equal to the base masses before correction;
if the single-ended sequencing is carried out, merging is not needed, and the base quality after correction is equal to the base quality before correction;
and/or
In the step (A), read deduplication is performed as follows:
placing PCR repeats or optically repeated reads into a family to remove repeats;
further, PCR repeats or optically repeated reads are placed in a family to remove duplicates according to a method comprising the following steps:
(a1) For each locus, the number of reads that start or stop at that locus, abbreviated as end depth, is found;
(a2) Each locus x attracts other loci y, the attraction strength s of x to y is calculated, the attraction strength s is in direct proportion to the depth of the x end, in inverse proportion to the depth of the y end and in exponential decay relation with the number z of bases between x and y;
the formula of the attractive force intensity s is as follows;
s=x/y×5 -z
if s > 1, then y is considered a PCR or optical repeat of x, thereby placing x and y in a family;
(a3) And (3) filtering: the threshold for base mass was calculated as follows:
Threshold(P)=argmax p∈P (p -l{q∈P|q≤p}| );
wherein P is a set of sequencing error probabilities corresponding to all base masses; the mathematical symbol e represents set attribution, so that the meaning of P e P means that P is any one element in P, and the meaning of q e P means that q is any one element in P;
bases on reads were filtered according to base quality thresholds: filtering out bases whose base quality is lower than a base quality threshold after correction;
after filtering, for each family, taking the base with the most frequent family as the base after de-weighting; the repeated reads are integrated into a consistent read.
5. A method according to any one of claims 1-3, characterized in that: in the step (B), four preferences of depletion bias, position bias, strand bias and mismatch bias are calculated in each chromosome chain; then selecting the highest preference as the integral preference of the chromosome chain; the number of reads supporting mutant is corrected, the corrected number of reads being equal to the number of reads before correction divided by the overall preference.
6. The method of claim 5, wherein: the four preferences of depletion bias, position bias, strand bias and mismatch bias are calculated as follows:
Bias(a 1 ,a 2 ,b 1 ,b 2 )=max(0,OR(LS(a 1 ,a 2 ,b 1 ,b 2 ))-1)*(b 2 ÷(b 2 +b 1 ))+1;
wherein OR represents an odds ratio; the OR is calculated as follows:
OR(a 1 ,a 2 ,b 1 ,b 2 )=(a 1 ×b 2 )÷(a 2 ×b 1 );
wherein LS represents Laplace smoothing; the LS formula is as follows: LS (a, b, c, d) = Laplace 2 (Laplace 1 (a, b, c, d)); wherein, laplace 1 The formula is as follows: laplace 1 (a, b, c, d) = (a + p, b + p, c + p, d + p); laplace2 has the following formula: laplace 2 (a,b,c,d)=(a+p×α,b+p×α,c+p×β,d+p×β);
In Laplace 1 Formula and Laplace 2 In the formula, the default value of p is 0.5;
in Laplace 2 In the formulas, α and β are as follows: α = max (0,p × ((a) 1 +a 2 )/(b 1 +b 2 )-1));β=max(0,p×(b 1 +b 2 )/(a 1 +a 2 )-1);
For the reduction bias, the meaning of the parameters is as follows:
a 1 : the number of reads supporting any allele that were deduplicated;
a 2 : after deduplication, the number of reads supporting any allele;
b 1 : the number of reads that support the mutant allele that are deduplicated;
b 2 : number of reads supporting mutant alleles after deduplication;
for position bias, the meaning of each parameter is as follows:
for each direction, extending 1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 66 bases in direction; the following four values are then substituted:
a 1 : the number of reads that support any allele within reads that have spanned the extended locus;
a 2 : the number of reads that support any allele in reads that do not span the extended locus;
b 1 : in reads that have spanned the extended locus, the number of reads supporting the mutant allele;
b 2 : number of reads that support mutant alleles in reads that do not span the extended locus;
sequentially substituting 11 extension units to generate a sequence consisting of 11 numerical values; taking every 3 adjacent values in the sequence as an array, thereby generating 9 arrays; taking the minimum value for each array, thereby producing 9 minimum values; taking the maximum value of the 9 minimum values as position bias;
for strand bias, the following values are substituted:
a 1 : in the forward direction of the chromosome strand S, the number of reads supporting any allele;
a 2 : in the reverse of chromosome chain S, the number of reads supporting any allele;
b 1 : in the forward direction of chromosome strand S, the number of reads supporting the mutant allele;
b 2 : in the reverse direction of the chromosome strand S, the number of reads supporting the mutant allele;
simultaneously calculating base quality inventory and strand position inventory, wherein the base quality inventory is BQI for short, the formula is represented by I, the formula comprises 1I, the strand position inventory is SPI for short, the formula is represented by I and comprises 2I, the numerical values of the 3I are calculated by the following method, and then the minimum numerical value is taken; then, dividing the strand bias by the minimum value to normalize the strand bias; finally ensuring that the minimum Strand bias is at least 1;
for BQI, c is the average base of all alleles of the other strandBase mass, d is the average base mass of the mutant allele of this strand, and I =10 d-c
For SPI, c is the average base distance to the end of read for all alleles of the other strand, d is the average base distance to the end of read for mutant alleles suspected of having a preferred strand, and
Figure FDA0004114499840000081
each read has two ends: left and right sides; the left and right termini thus yield two I, respectively;
let x =1,2,3, …,11 for mismatch bias; x is the mismatch number threshold, 1,2, …,11 are different attempted thresholds;
for each value of x, the following four values are substituted:
a 1 : mismatch numbers < x, number of reads supporting any allele;
a 2 : the mismatch number is more than or equal to x, and the number of reads of any allele is supported;
b 1 : mismatch numbers < x, number of reads supporting mutant alleles;
b 2 : the mismatch number is more than or equal to x, and supports the reads number of mutant alleles;
sequentially substituting 11 x values to generate a sequence consisting of 11 values; taking every 5 adjacent values in the sequence as an array, thereby generating 7 arrays; taking the minimum value for each array, thereby generating 7 minimum values; the maximum of the 7 minima was taken as the mismatch bias.
7. A gene variation detection system based on high-throughput sequencing comprises a device A, a device B and a device C;
the device A is capable of pre-processing high throughput sequencing data, mainly including base quality correction, read de-duplication, according to step (A) of any one of claims 1 to 6;
said device B is capable of calculating the preference generated by sequencing according to the step (B) of any one of claims 1 to 6, and obtaining the corrected read number of the support mutation type by reducing the effective read number of the support mutation type according to the preference;
the device C can calculate the correlation quality of all the individual samples by using the corrected read numbers of the supporting mutants according to the step (C) of any one of claims 1 to 6, and then obtain the variation quality of the individual samples, and perform the judgment by the variation quality.
8. Use of the method of any one of claims 1-6 or the system of claim 7 for the manufacture of a product for tumor pre-screening, tumor prognosis, tumor classification and/or tumor medication guidance.
CN202010222444.5A 2020-03-26 2020-03-26 Gene variation detection method based on high-throughput sequencing Active CN111243664B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010222444.5A CN111243664B (en) 2020-03-26 2020-03-26 Gene variation detection method based on high-throughput sequencing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010222444.5A CN111243664B (en) 2020-03-26 2020-03-26 Gene variation detection method based on high-throughput sequencing

Publications (2)

Publication Number Publication Date
CN111243664A CN111243664A (en) 2020-06-05
CN111243664B true CN111243664B (en) 2023-04-18

Family

ID=70869208

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010222444.5A Active CN111243664B (en) 2020-03-26 2020-03-26 Gene variation detection method based on high-throughput sequencing

Country Status (1)

Country Link
CN (1) CN111243664B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111785325B (en) * 2020-06-23 2021-10-22 西北工业大学 Mutually-exclusively constrained graph Laplacian approach to heterogeneous cancer driver gene identification
CN115458051B (en) * 2022-09-28 2023-03-21 北京泛生子基因科技有限公司 Method, device and computer readable storage medium for simulating small variation in sequencing data and capable of retaining molecular tag information
CN116884492A (en) * 2023-02-07 2023-10-13 杭州联川基因诊断技术有限公司 Method, equipment and medium for mTag class selection of targeted sequencing data
CN116895332B (en) * 2023-09-11 2023-12-05 臻和(北京)生物科技有限公司 Filtering method for interrupting false positive mutation generated by artificial fragments in library construction by enzyme digestion method

Also Published As

Publication number Publication date
CN111243664A (en) 2020-06-05

Similar Documents

Publication Publication Date Title
CN111243664B (en) Gene variation detection method based on high-throughput sequencing
JP7302081B2 (en) Variant Classifier Based on Deep Neural Networks
KR102273717B1 (en) Deep learning-based variant classifier
JP7368483B2 (en) An integrated machine learning framework for estimating homologous recombination defects
Vollger et al. Increased mutation and gene conversion within human segmental duplications
TWI868095B (en) Cell-free dna end characteristics
EP4073805B1 (en) Systems and methods for predicting homologous recombination deficiency status of a specimen
CN110870016B (en) Validation methods and systems for sequence variant calling
US10496679B2 (en) Computer algorithm for automatic allele determination from fluorometer genotyping device
WO2019200338A1 (en) Variant classifier based on deep neural networks
CN110168648A (en) The verification method and system of sequence variations identification
CN110383385A (en) Method for detecting mutational burden from tumor samples
CN114566214B (en) Method and detection device, computer-readable storage medium and application for detecting genomic deletion and insertion variation
US20230064530A1 (en) Detection of Genetic Variants in Human Leukocyte Antigen Genes
RU2822040C1 (en) Method of detecting copy number variations (cnv) based on sequencing data of complete human exome and low-coverage genome
US20250226056A1 (en) Variant classifier based on deep neural networks
KR20250096737A (en) Component Mixture Model for Tissue Identification from DNA Samples
Sarantidis Algorithms to Explore the Chromosomal Clustering of Genes
Woods et al. Extended regions of suspected mis-assembly in the rat reference genome

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20250606

Address after: 102206 1st floor, Building 11, Zone 1, Yard 8, Shengshenyuan Road, Zhongguancun Life Science Park, Huilongguan Town, Changping District, Beijing (Changping Demonstration Park)

Patentee after: GENETRON HEALTH(BEIJING) LABORATORY Co.,Ltd.

Country or region after: China

Address before: 102206 building 11, area 1, No.8, shengshengyuan Road, Zhongguancun Life Science Park, Changping District, Beijing

Patentee before: GENETRON HEALTH (BEIJING) Co.,Ltd.

Country or region before: China