WO2022266790A1 - 降低高通量测序中人工引入错误突变的方法及应用 - Google Patents

降低高通量测序中人工引入错误突变的方法及应用 Download PDF

Info

Publication number
WO2022266790A1
WO2022266790A1 PCT/CN2021/101184 CN2021101184W WO2022266790A1 WO 2022266790 A1 WO2022266790 A1 WO 2022266790A1 CN 2021101184 W CN2021101184 W CN 2021101184W WO 2022266790 A1 WO2022266790 A1 WO 2022266790A1
Authority
WO
WIPO (PCT)
Prior art keywords
mutation
site
score
mutations
throughput sequencing
Prior art date
Application number
PCT/CN2021/101184
Other languages
English (en)
French (fr)
Inventor
蒙裕欢
李桂彬
黄晓强
范喜杰
穆亚飞
缪夏萍
袁杰铖
程雅婷
于世辉
梁耀铭
Original Assignee
广州市金域转化医学研究院有限公司
广州金域医学检验集团股份有限公司
广州金域医学检验中心有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广州市金域转化医学研究院有限公司, 广州金域医学检验集团股份有限公司, 广州金域医学检验中心有限公司 filed Critical 广州市金域转化医学研究院有限公司
Priority to PCT/CN2021/101184 priority Critical patent/WO2022266790A1/zh
Publication of WO2022266790A1 publication Critical patent/WO2022266790A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/50Mutagenesis

Definitions

  • the invention relates to the technical field of bioinformatics, in particular to a method and application for reducing artificially introduced error mutations in high-throughput sequencing.
  • NGS technology provides convenience for genetic testing, especially whole exome sequencing, which has an advantage over whole gene sequencing in terms of price. Due to its high cost performance, whole exome sequencing is becoming more and more widely used in medicine.
  • the main function of all-external sequencing is to find pathogenic mutations and provide important data for the diagnosis and treatment of diseases.
  • the mutations screened in all external channels need to undergo repeated filtering for quality control.
  • the quality control of sequencing is already very strict, due to factors such as non-specific amplification in the PCR amplification process, some artificially introduced errors (artefacts) pass the hard quality control. Control still cannot be ruled out, which has a huge impact on the diagnosis of the disease.
  • Site scoring Obtain the high-throughput sequencing data of the samples to be tested, obtain the information of each mutation site, and score each mutation site according to the following formula:
  • Svar is the mutation site score
  • Fq is the population mutation frequency
  • Pop is the total population
  • M is the preset maximum value
  • Score1 is the mutation type score
  • Score2 is the mutation coordinate score
  • Score3 is the pLoF score
  • Score1 The scoring standard of Score1 is: when the mutation is a known mutation, no points will be deducted; when the mutation is an unknown mutation, points will be deducted;
  • the scoring standard of Score2 is: when the mutation is an unknown mutation, according to the position of the refGene where the mutation is located, points will be deducted according to predetermined rules;
  • Score3 The scoring standard of Score3 is: according to the credibility of LOFTEE predicting mutation as loss of function, points are deducted according to predetermined rules;
  • Stotal is the total score of the region fragment, N is the total mutation number of the region fragment;
  • the inventors found that the artificially introduced erroneous mutations have the following characteristics: a large number of indel sites gather at both ends of the captured sequencing region, resulting in a In the region, different people have different indels, resulting in a large number of INDEL aggregations in the population. Based on this previous research, the inventor proposed a method of weighting and scoring the above-mentioned features through a window of a specific length and obtaining the score of the window area segment by weighting, and then screening the confidence interval to eliminate artificial error positions in batches.
  • each mutation site is sorted according to the sequence of chromosome and genome position.
  • the scoring standard of Score1 is: if the mutation site is annotated in dbSNP, ClinVar or gnomAD-exome database, then it is determined that the mutation at the site is known If there is a mutation, no points will be deducted; otherwise, if the mutation is determined to be an unknown mutation, points will be deducted;
  • the scoring standard of Score2 is: when the mutation is an unknown mutation, according to the exonic, splicing, UTR-3, UTR-5, upstream, downstream, intronic, intergenic or ncRNA region of refGene where the mutation is located, according to the predetermined rules deduct points;
  • the scoring standard of Score3 is: According to LOFTEE, the mutation is predicted to be a loss of function (pLoF) high-confidence locus or low-confidence locus, and points are deducted according to predetermined rules.
  • LOFTEE loss of function
  • the full name of the above dbSNP is the single nucleotide polymorphism database, that is, the single nucleotide polymorphism database, which means "a single base pair (base pair) variation in the dna sequence", that is, a, t in the dna sequence , c, g changes, that is, the possibility of two or more nucleotides appearing at a specific and positioned site in the genome, which is the most common type of human heritable variation, therefore, according to each mutation
  • the region where the site is located and the mutation type are marked with known mutations that have been reported as known in the art, which is helpful to distinguish whether the mutation is an artificially introduced error.
  • the above-mentioned dbSNP database is preferably dbSNP150.
  • the above-mentioned LOFTEE predicts that the mutation is a loss-of-function pLoF mutation, which is a putative loss-of-function (pLoF) mutation in the haploid disease gene in the Genome Aggregation Database (gnomAD) manually, and distinguishing the mutation type is also conducive to subsequent analysis and reference.
  • the score when the mutation is located in the exonic region, the score is 0; when the mutation is located in UTR-3, UTR-5, upstream, downstream, intronic, intergenic or ncRNA region, the score is -1; when the mutation is located in splicing area, the score is -2;
  • the main characteristics of artificial errors (false positive insertion and deletion sites are gathered at both ends of the captured sequencing region), so within a specific window range, the largest weight should be insertion and deletion (INDEL), so in The weight of unknown INDEL detected by the window is the largest, and other influencing factors include: SNP, whether it is a known mutation, the type of mutation, whether it is a pLOF mutation annotated by LOEFF, etc.
  • SNP SNP
  • the type of mutation whether it is a pLOF mutation annotated by LOEFF, etc.
  • the above-mentioned known mutations are mutations that have been reported.
  • the frequency is generally used to measure the number of occurrences of this site.
  • M maximum value
  • the site scoring step when the Svar score of the mutation site is greater than 0, the site score is defined as 0. Considering that this method mainly adopts the deduction system for the mutation site, when the Svar score of the mutation site is greater than 0, no points will be deducted, and it is defined as 0, which can more accurately evaluate whether there are artificially introduced error mutations in the region segment .
  • the position of the refGene is determined according to the alignment to the main transcript region of the NCBI refGene, and the pLoF only retains the pLoF mutation sites of Stop-gained, Splice site and Frameshift types that appear in the main transcript.
  • the main transcript is the most recently updated transcript in NCBI refGene, that is, the transcript with the shortest time from the current time.
  • the window size is obtained by the following method: the window size is the interval length covering 95 ⁇ 5% of adjacent mutation sites in the database. Adjacent among the above-mentioned adjacent mutation sites refers to the length of the interval between two pairs, and the sequence length that can cover most of the intervals between two mutation sites is used as the window size, so that artificially introduced error mutations can be correctly identified.
  • the database is dbSNP database, ECAC03 whole exon database and/or genomeAD whole exon database.
  • the window size is 25 ⁇ 5bp.
  • the threshold in the site exclusion step, is the Stotal score of the sequence fragments corresponding to the top 5% of the fragments in each region in the sample to be tested according to the Stotal score sorted from low to high. It can be understood that the above threshold may also be set to a certain score threshold according to practical needs.
  • the present invention also discloses the application of the above-mentioned method for reducing artificially introduced error mutations in high-throughput sequencing to high-throughput sequencing.
  • the high-throughput sequencing is whole exome sequencing.
  • the invention also discloses the application of the above method in the screening of whole exon mutation in disease diagnosis and treatment.
  • the invention also discloses a device for reducing artificially introduced error mutations in high-throughput sequencing, which is characterized in that it includes:
  • Analysis module used to analyze the mutation data of the high-throughput sequencing data of the samples to be tested according to the above-mentioned method for reducing artificially introduced error mutations in high-throughput sequencing.
  • the present invention has the following beneficial effects:
  • a method for reducing artificially introduced error mutations in high-throughput sequencing of the present invention not only identifies and eliminates artificial errors of low-frequency and harmful mutations, but also covers general artificially introduced error mutation sites, which is comprehensive and accurate The advantages.
  • the method of the present invention can overcome the defects of low efficiency and strong experience dependence existing in manual screening through automatic computer analysis and processing, and is suitable for wide popularization and application.
  • Figure 1 is a schematic flow chart of the method for reducing artificially introduced error mutations in high-throughput sequencing in the embodiment
  • Fig. 2 is a schematic diagram of Variants intervals corresponding to different window lengths in the embodiment
  • Fig. 3 is the IGV visualization result of chr9: 35906583 site in the embodiment
  • Fig. 4 is the IGV visualization result of the chr9:32986030 site in the embodiment.
  • each mutation site includes: the chromosome where the site is located, the specific position, the reference genome base, the mutation base types, etc., and sorted according to chromosome and genome positions.
  • the population mutation frequency is an important indicator of the calculated evaluation score of the mutation site of the present invention, which needs to be obtained by calculation. Specifically: input the above-mentioned mutation site information data into the standard format of annovar, that is, add the frequency data in the vcf format file, part of which is shown in the following table.
  • chromosome Location reference base mutant base allele frequency information homozygous/heterozygous chr1 14653 C T 2.33263e-05 GT 1/1 chr1 14907 A G 0.000128295 GT 1/1 chr1 14930 A G 0.000139958 GT 1/1 chr1 14932 G T 2.33263e-05 GT 1/1 chr1 14933 G A 1.16632e-05 GT 1/1 chr1 15903 G GC 6.9979e-05 GT 1/1 chr1 69331 C T 1.16632e-05 GT 1/1 chr1 69336 C T 1.16632e-05 GT 1/1 chr1 69460 C T 1.16632e-05 GT 1/1 chr1 69462 C G 2.33263e-05 GT 1/1 chr1 69474 T G 1.16632e-05 GT 1/1 chr1 69486 C A 2.33263e-05 GT 1/1 chr1 69486 C A 2.3
  • the chromosome and position represent the location of the mutation in the genome
  • the reference base is the base type of the reference genome hg19 at this position
  • the mutation base is the base type detected by this sequencing that is different from the reference genome
  • allele The gene frequency is the frequency of the mutation in the population, that is, the number of occurrences of the mutation divided by (the total number of the population multiplied by 2);
  • the information is GT stands for the abbreviation of Genetype; 1/1 means homozygous, 0/1 means heterozygous.
  • chromosome starting point end position reference base mutant base homozygous/heterozygous type frequency chr1 14653 14653 C T het 2.33263e-05 chr1 14907 14907 A G het 0.000128295 chr1 14930 14930 A G het 0.000139958 chr1 14932 14932 G T het 2.33263e-05 chr1 14933 14933 G A het 1.16632e-05 chr1 15903 15903 - C het 6.9979e-05 chr1 69331 69331 C T het 1.16632e-05 chr1 69336 69336 C T het 1.16632e-05 chr1 69460 69460 C T het 1.16632e-05 chr1 69462 69462 C G het 2.33263e-05 chr1 69474 69474 T G het 1.166
  • Svar is the score of the mutation site
  • Fq is the population frequency
  • Pop is the total number of the population
  • M is the preset maximum value
  • Score1 is the score of the mutation type
  • Score2 is the score of the mutation coordinate
  • Score3 is the score of pLoF.
  • the mutation site is annotated in dbSNP150, ClinVar or gnomAD-exome database, it is judged that the mutation at this site is a known mutation, and the score is 0; when the mutation is an unknown SNV (single nucleotide variants), the score is -1 ; when the mutation is an unknown INDEL (insertion-deletion), the score is -5.
  • Score1 when the mutation is a known mutation, Score2 will not be scored, and when the mutation is an unknown mutation, it will be scored according to the region where the mutation is located in different regions of refGene. For example, when the mutation is located in the exonic region, the score is 0; if the mutation is located in the UTR-3, UTR-5, upstream, downstream, intronic, intergenic, or ncRNA region, the score is -1; if the mutation is located in the splicing region, the score is - 2.
  • the annovar annotations in the above Score1 and Score2 scores can be performed in Ubuntu 18.04.2LTS using the following specific command line: able_annovar.pl all.vcf.avinput dir/humandb/-buildver hg19-out x --otherinfo -remove-protocol refGene,avsnp150,clinvar_20200316,gnomad_exome-operation g,f,f,f-nastring NA.
  • the condition is: only keep the pLoF mutation sites of the Stop-gained, Splice site and Frameshift types that appear in the main transcript, and others do not appear in the main transcript but appear in the non-main transcript The pLoF site of the transcript is ignored.
  • Stop-gained mutation refers to the mutation that changes the original amino acid into a stop codon, so that the protein transcription is terminated early;
  • the Splice site refers to the mutation that will affect the normal transcription of the transcript, resulting in the transcription of the intron region, resulting in the entire protein.
  • Frameshift refers to frameshift mutations, ie, sequences and linker sites at the junctional boundaries of introns and exons in RNA precursors that are recognized by the spliceosome. If there is a mutation at this position, there will be a problem with the splicing of the mRNA, and there will be a problem with the translation of the protein. Splicing is the process of cutting out the intron and retaining the exon during the transcription process. If it is a classic splice site GT-AG mutation, it is more serious and is a pathogenic variant.
  • Score3 is scored according to the final pLoF site. When the mutation is a high-confidence pLoF, the score is +3; when the mutation is a low-confidence pLoF, the score is +2.
  • Svar is the score of the mutation site
  • Fq is the group frequency
  • Pop is the total number of groups, which is 42868 in this embodiment.
  • the inventors analyzed the dbSNP150, EXAC03 all-exon database and genomeAD all-exon database, counted the distance between the mutation sites recorded in the above databases, and obtained the distance between two adjacent mutation sites. For the size of the distance, it was found that with a window size of 25 bp, more than 95% of the mutation intervals could be covered, as shown in FIG. 2 . Therefore, the cutting window size was determined to be 25bp.
  • Cutting was performed with a 25 bp window delayed from each mutation to obtain several fragments of the region including the mutation site.
  • Stotal is the total score of the region segment
  • N is the total mutation number of the region segment.
  • the mutation site in the region segment is an artificially introduced error site, and the mutation is excluded.
  • -418 is used as the screening threshold, which is less than -418 (that is, all fragments are screened according to the 5% score in the order of scores from small to large), and the positive results obtained (determined as artificially introduced error mutations), some of which are scored
  • the artificially introduced error mutations with large and small negative values are shown in the table below.
  • chromosome Location window The total number of mutations in the window total score chr9 35906583 chr9:35906583-35906607 428 -7026708.30830639 chr9 35906584 chr9:35906584-35906608 415 -6726549.28627591 chr9 35906585 chr9:35906585-35906609 414 -6704131.02405082 chr9 35906586 chr9:35906586-35906610 413 -6665636.51330758 chr9 35906580 chr9:35906580-35906604 401 -6078580.34869512 chr9 35906582 chr9:35906582-35906606 398 -5996092.36077493 chr9 35906581 chr9:35906581-35906605 398 -5995694.37824108 chr9 3590
  • Each site corresponds to at least the window starting from its own position, and may correspond to other windows. If the site falls into any window that is judged to be an artificially introduced error, and the other window is not considered to be an artificially introduced error, then The conclusion that the wrong site was identified as artificially introduced remains unchanged.
  • the chr9:35906583-35906607 window corresponding to the chr9:35906583 site, and its partial mutation status are shown in the table below.
  • the chr9:32986030 site corresponds to the chr9:32986030-32986054 window, and its mutation status is shown in the table below and Figure 4.
  • the verification result proves that the method of the present invention can remove most of the false mutations of artificial primers, leaving a large number of real mutations.

Landscapes

  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Chemical & Material Sciences (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Analytical Chemistry (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

一种降低高通量测序中人工引入错误突变的方法及应用,属于生物信息学技术领域。该方法首先建立群体突变频率数据库,整合得到每个突变位点的群体突变频率,随后对各个窗口内的突变位点,按照预设的权重给定分数进行评分,得到每个突变位点的得分Svar后,考虑每个突变位点往后特定窗口大小内出现突变的次数,将这个区域内所有突变的分数相加,然后乘以这个窗口的突变数目,以达到放大的目的,从而突出存在人工引入错误突变的位点,将其识别后排除。从而可实现检出所有类型的人工引入错误突变,且具有效率高,自动化程度高,准确性高和检出全面的优势。

Description

降低高通量测序中人工引入错误突变的方法及应用 技术领域
本发明涉及生物信息学技术领域,特别是涉及一种降低高通量测序中人工引入错误突变的方法及应用。
背景技术
NGS技术为基因检测提供了便利,尤其是全外显子测序,其在价格上比全基因测序更有优势,由于性价比高,全外测序在医学上得到越来越广泛的应用。
全外测序的主要作用是寻找致病的突变,为疾病的诊疗提供重要的数据。全外筛选的突变需经过重重过滤进行质控,虽然现在对于测序的质量控制已经很严格,但是由于PCR扩增过程中的非特异性扩增等因素,部分人工引入的错误(artefacts)通过硬质量控制仍然不能排除,这对疾病的诊断产生了巨大的影响。
现阶段,虽然已有一些方法去除人工错误,但是基本上都需要以整合基因组浏览(IGV)方式,由具有丰富经验的实验人员进行人工进一步的判断,效率很低,且准确性依赖于人员的主观经验,难以标准化处理。
目前有研究推出一种利用LOFTEE排除人工引入错误突变的方法,其主要通过一组过滤器,将提前终止(stop-gained)、移码突变(frameshift)和可变剪切突变(splicing)进行评估,得到低可信和高可信的预测功能缺失突变,但是其主要在于评估和排除上述几种类型的人工错误,对于其他可能存在错误的位点并没有覆盖到,因此依然可能漏掉许多的人工错误位点。
发明内容
基于此,有必要针对上述问题,提供一种降低高通量测序中人工引入错误突变的方法,采用该方法,可通过机器自动化识别,批量排除人工错误位点,提高测序准确性。
1)建立群体突变频率数据库:获取若干样本高通量测序数据,整合得到每个突变位点的群体突变频率Fq,以及群体总数Pop;
2)位点评分:获取待测样本高通量测序数据,获得每个突变位点的信息,将各突变位点按照下述公式进行评分:
Svar=(Fq×Pop×2) max=M×(Score1+Score2+Score3)
其中:Svar为突变位点得分,Fq为群体突变频率,Pop为群体总数,M为预设最大值,Score1为突变类型分数,Score2为突变坐标分数,Score3为pLoF分数;
所述Score1的评分标准为:当突变为已知突变,不扣分;当突变为未知突变,扣分;
所述Score2的评分标准为:当突变为未知突变,则根据该突变所在refGene位置,按照预定规则扣分;
所述Score3的评分标准为:根据LOFTEE预测突变为功能缺失的可信度情况,按照预定规则扣分;
3)区域片段评分:从每个突变延后预定窗口大小划分,得到若干包括突变位点的序列片段;
按照下述公式对每个区域片段进行评分:
Figure PCTCN2021101184-appb-000001
其中:Stotal为区域片段的总评分,N为该区域片段的总突变数;
4)位点排除:如区域片段的Stotal总评分低于预定阈值,则判定该区域片段中的突变位点为人工引入错误位点,排除该突变。
本发明人在长期实践工作的基础上,经过仔细分析、推论和求证后发现,对于人工引入的错误突变存在以下特征:插入缺失位点在捕获的测序区域两端大量聚集,导致了在某个区域内,不同的人出现不同插入缺失,导致种群中出现大量的INDEL聚集现象。以此前期研究基础出发,本发明人提出通过特定长度的窗口对上述特征进行权重评分并加权得到窗口区域片段的分数,再通过置信区间的筛选,批量排除人工错误位点的方法。
上述方法中,首先建立群体突变频率频率数据库,整合得到每个突变位点的突变频率,以及群体总数Pop,采用发明人设计的公式对突变位点进行评分,再按照预定窗口大小对序列进行划分,得到若干区域片段,对各区域片段中的突变位点得分进行整合,考虑每个突变位点往后特定窗口大小内出现突变的次数,将这个区域内所有突变的分数相加,然后乘以这个窗口的突变数目,以达到放大的目的,从而突出存在人工引入错误突变的位点,将其识别后排除。从而可实现检出所有类型的人工引入错误突变的目的,且具有效率高,自动化程度高,准确性高和检出全面的优势。
在其中一个实施例中,所述建立群体突变频率数据库步骤中,将每个突变位点按照染色体和基因组位置顺序进行排序。
在其中一个实施例中,所述位点评分步骤中,所述Score1的评分标准为:如该突变位点在dbSNP、ClinVar或gnomAD-exome数据库有注释,则判定该位点的突变为已知突变,不扣分;否则判定突变为未知突变,扣分;
所述Score2的评分标准为:当突变为未知突变,则根据该突变所在区域为refGene的exonic、splicing、UTR-3、UTR-5、upstream、downstream、intronic、intergenic或ncRNA区位置,按照预定规则扣分;
所述Score3的评分标准为:根据LOFTEE预测为突变为功能缺失(pLoF)高可信度位点或低可信位点,按照预定规则扣分。
可以理解的,上述dbSNP全称为the single nucleotide polymorphism database,即单核苷酸多态性数据库,意思是“dna序列中的单一碱基对(base pair)变异”,也就是dna序列中a、t、c、g的改变,即基因组的一个特异和定位的位点出现两个或多个的核苷酸可能性,它是人类可遗传的变异中最常见的一种,因此,按照每个突变位点所在的区域及突变类型,将已报道为本领域所知悉的已知突变进行标注,有利于区分该突变是否为人工引入错误。上述dbSNP数据库优选dbSNP150。上述LOFTEE预测突变为功能缺失的pLoF突变为通过人工处理基因组聚合数据库(gnomAD)中单倍体疾病基因中的假定功能丧失(pLoF)突变,区分该突变类型,也有利于后续进行分析参考。
在其中一个实施例中,所述位点评分步骤中:
所述Score1的评分标准中,当突变为已知突变,则评分为0;当突变为未知SNV,则评分为-1;当突变为未知INDEL,则评分为-5;
所述Score2的评分标准中,当突变位于exonic区,则评分为0;当突变位于UTR-3、UTR-5、upstream、downstream、intronic、intergenic或ncRNA区,则评分为-1;当突变位于splicing区,则评分为-2;
所述Score3的评分标准中,当突变为高可信的pLoF,则评分为+3;当突变为低可信的pLoF,则评分为+2。
上述评分权重中,针对人工错误的主要特征(假阳性的插入缺失位点在捕获的测序区域两端大量聚集),因此在特定窗口范围内,最大的权重应该是插入缺失(INDEL),因此在窗口检测出未知INDEL的权重最大,其他影响因素还包括:SNP,是否为已知突变,突变的类型,是否为LOEFF注释的pLOF突变等。经过测试,我们将各因素的权重按照上述方式定义,能够较好识别出人工引入错误突变。
在其中一个实施例中,所述位点评分步骤中:M=100。
可以理解的,上述已知突变即为已有报道存在的突变。而对于群体频率,由于在实践中,很难得到某个位点突变的具体人数,因此一般用频率衡量这个位点的出现次数。当频率×人群×2大于100的时候,代表这个位点已经很常见,因此我们设定了一个最大的值,即M=100。
在其中一个实施例中,所述位点评分步骤中:当突变位点的Svar分数大于0,则将此位点得分定义为0。考虑到该方法中对于突变位点主要是采用扣分制,因此当突变位点的Svar分数大于0,则不扣分,定义为0,可更为准确评估区域片段中是否存在人工引入错误突变。
在其中一个实施例中,所述refGene位置根据比对到NCBI refGene主转录本区域确定,所述pLoF仅保留出现在主转录本中Stop-gained、Splice site和Frameshift类型的pLoF突变位点。
在其中一个实施例中,所述主转录本为NCBI refGene中最近更新的转录本,即距离当前时间最短的转录本。
在其中一个实施例中,所述区域片段评分步骤中,所述窗口大小通过以下方法得到:以覆盖数据库中95±5%相邻突变位点的区间长度为窗口大小。上述相邻突变位点中的相邻指两两间的区间长度,以能够覆盖绝大部分两两间突变位点区间的序列长度为窗口大小,可将人 工引入错误突变正确识别。
在其中一个实施例中,所述数据库为dbSNP数据库、ECAC03全外显子数据库和/或genomeAD全外显子数据库。
在其中一个实施例中,所述窗口大小为25±5bp。
本发明人在前期研究中,对dbSNP150,EXAC03全外显子数据库和genomeAD的全外显子数据库进行分析比较后发现,以25bp的窗口大小,可以覆盖超过95%的突变间隔。
在其中一个实施例中,所述位点排除步骤中,所述阈值为该待测样本中各区域片段按照Stotal得分由低至高排序的前5%所对应序列片段的Stotal得分。可以理解的,上述阈值也可以根据实践需要,设定某个确定的分数阈值。
本发明还公开了上述的降低高通量测序中人工引入错误突变的方法在高通量测序中的应用。
在其中一个实施例中,所述高通量测序为全外显子测序。
将上述的降低高通量测序中人工引入错误突变的方法应用于高通量测序中,特别是全外显子的检测中,能够降低人工引入突变错误,为疾病的诊疗提供准确数据,进一步提高了个性化精准医疗的可靠性。
本发明还公开了上述的方法在疾病诊疗全外显子突变筛查中的应用。
本发明还公开了一种降低高通量测序中人工引入错误突变的装置,其特征在于,包括:
分析模块:用于按照上述的降低高通量测序中人工引入错误突变的方法,对待测样本高通量测序数据的突变数据进行分析。
可以理解的,上述装置可按照上述方法进行分析,具体呈现产品形式包括一体化设备,软件模块封装等。
与现有技术相比,本发明具有以下有益效果:
本发明的一种降低高通量测序中人工引入错误突变的方法,不仅仅针对低频及有害突变的人工错误进行识别排除,对于一般的人工引入错误突变位点也能覆盖到,具有全面、准确的优势。
并且,本发明的方法,可通过计算机自动化分析处理,克服了人工筛查存在的效率低、经验依赖性强的缺陷,适宜广泛的推广应用。
附图说明
图1为实施例中降低高通量测序中人工引入错误突变的方法流程示意图;
图2为实施例中不同窗口长度所对应Variants间隔示意图;
图3为实施例中chr9:35906583位点的IGV可视化结果;
图4为实施例中chr9:32986030位点的IGV可视化结果。
具体实施方式
为了便于理解本发明,下面将参照相关附图对本发明进行更全面的描述。附图中给出了 本发明的较佳实施例。但是,本发明可以以许多不同的形式来实现,并不限于本文所描述的实施例。相反地,提供这些实施例的目的是使对本发明的公开内容的理解更加透彻全面。
除非另有定义,本文所使用的所有的技术和科学术语与属于本发明的技术领域的技术人员通常理解的含义相同。本文中在本发明的说明书中所使用的术语只是为了描述具体的实施例的目的,不是旨在于限制本发明。本文所使用的术语“和/或”包括一个或多个相关的所列项目的任意的和所有的组合。
实施例
一种降低高通量测序中人工引入错误突变的方法,流程如图1所示,包括以下步骤:
1、建立群体突变频率数据库(S1)
获取若干样本高通量测序数据,整合得到每个突变位点的突变频率,具体如下:
1.1数据准备。
获取若干全外测序样本数据(本实施例中选择42868例样本),得到其中所有位点的vcf合并文件,每个突变位点信息包括:位点所在染色体、具体位置、参考基因组碱基,突变的碱基类型等,并按照染色体和基因组位置进行排序。
1.2群体突变频率计算。
群体突变频率是计算的本发明突变位点评估分数的重要指标,需要通过计算得到。具体为:将上述突变位点信息数据输入为annovar的标准格式,即在vcf格式文件中加入频率数据,部分展示如下表。
表1.群体突变及频率输入文件格式列表
染色体 位置 参考碱基 突变碱基 等位基因频率 信息 纯合/杂合
chr1 14653 C T 2.33263e-05 GT 1/1
chr1 14907 A G 0.000128295 GT 1/1
chr1 14930 A G 0.000139958 GT 1/1
chr1 14932 G T 2.33263e-05 GT 1/1
chr1 14933 G A 1.16632e-05 GT 1/1
chr1 15903 G GC 6.9979e-05 GT 1/1
chr1 69331 C T 1.16632e-05 GT 1/1
chr1 69336 C T 1.16632e-05 GT 1/1
chr1 69460 C T 1.16632e-05 GT 1/1
chr1 69462 C G 2.33263e-05 GT 1/1
chr1 69474 T G 1.16632e-05 GT 1/1
chr1 69486 C A 2.33263e-05 GT 1/1
chr1 69486 C T 2.33263e-05 GT 1/1
chr1 69491 G A 1.16632e-05 GT 1/1
chr1 69495 C T 1.16632e-05 GT 1/1
chr1 69496 G A 1.16632e-05 GT 1/1
chr1 69496 G T 8.16422e-05 GT 1/1
chr1 69510 C T 1.16632e-05 GT 1/1
chr1 69511 A G 0.942186 GT 1/1
chr1 69513 A G 0.000606485 GT 1/1
chr1 69521 T C 8.16422e-05 GT 1/1
chr1 69522 T C 1.16632e-05 GT 1/1
chr1 69534 T C 0.00271752 GT 1/1
chr1 69543 C A 1.16632e-05 GT 1/1
chr1 69550 G A 9.33053e-05 GT 1/1
chr1 69550 G T 1.16632e-05 GT 1/1
chr1 69552 G A 1.16632e-05 GT 1/1
chr1 69555 T G 1.16632e-05 GT 1/1
chr1 69559 G A 0.000314906 GT 1/1
chr1 69563 A C 6.9979e-05 GT 1/1
其中:染色体和位置代表了该突变在基因组的位置定位,参考碱基为该位置上参考基因组hg19的碱基类型,突变碱基是本次测序检测出与参考基因组不同的碱基类型,等位基因频率即人群中该突变的频率,即该突变出现的次数除以(人群总人数乘以2);信息即GT代表Genetype的缩写;1/1代表纯合、0/1代表杂合。
随后使用annovar软件自带的convert2annovar.pl-format vcf4功能模块,将vcf格式转换成annovar的标准格式,部分示例如下表。
表2.突变及频率数据注释前输入文件格式
染色体 起始位置 终止位置 参考碱基 突变碱基 纯合/杂合类型 频率
chr1 14653 14653 C T het 2.33263e-05
chr1 14907 14907 A G het 0.000128295
chr1 14930 14930 A G het 0.000139958
chr1 14932 14932 G T het 2.33263e-05
chr1 14933 14933 G A het 1.16632e-05
chr1 15903 15903 - C het 6.9979e-05
chr1 69331 69331 C T het 1.16632e-05
chr1 69336 69336 C T het 1.16632e-05
chr1 69460 69460 C T het 1.16632e-05
chr1 69462 69462 C G het 2.33263e-05
chr1 69474 69474 T G het 1.16632e-05
chr1 69486 69486 C A het 2.33263e-05
chr1 69486 69486 C T het 2.33263e-05
chr1 69491 69491 G A het 1.16632e-05
chr1 69495 69495 C T het 1.16632e-05
chr1 69496 69496 G A het 1.16632e-05
chr1 69496 69496 G T het 8.16422e-05
chr1 69510 69510 C T het 1.16632e-05
chr1 69511 69511 A G het 0.942186
chr1 69513 69513 A G het 0.000606485
chr1 69521 69521 T C het 8.16422e-05
chr1 69522 69522 T C het 1.16632e-05
chr1 69534 69534 T C het 0.00271752
chr1 69543 69543 C A het 1.16632e-05
chr1 69550 69550 G A het 9.33053e-05
chr1 69550 69550 G T het 1.16632e-05
chr1 69552 69552 G A het 1.16632e-05
chr1 69555 69555 T G het 1.16632e-05
chr1 69559 69559 G A het 0.000314906
chr1 69563 69563 A C het 6.9979e-05
2、位点评分(S2)
获取待测样本高通量测序数据,获得每个突变位点的信息,将各突变位点按照下述公式进行评分:
Svar=(Fq×Pop×2) max=M×(Score1+Score2+Score3)
其中:Svar为突变位点得分,Fq为群体频率,Pop为群体总数,M为预设最大值,Score1为突变类型分数,Score2为突变坐标分数,Score3为pLoF的分数。
具体步骤如下:
2.1 Score1评分
对上述带有频率的人群突变数据进行annovar注释,注释依据源于dbSNP150,ClinVar数据库和gnomAD-exome数据库。
如该突变位点在dbSNP150、ClinVar或gnomAD-exome数据库有注释,则判定该位点的突变为已知突变,则评分为0;当突变为未知SNV(single nucleotide variants),则评分为-1;当突变为未知INDEL(insertion-deletion),则评分为-5。
2.2 Score2评分
根据Score1评分结果,当突变为已知突变,Score2不评分,当突变为未知突变,则根据该突变所在区域位于refGene的不同区域评分。如:当突变位于exonic区,则评分为0;突变位于UTR-3、UTR-5、upstream、downstream、intronic、intergenic、或ncRNA区,则评分为-1;突变位于splicing区,则评分为-2。
在本实施例中,上述Score1和Score2评分中的annovar注释可在Ubuntu 18.04.2LTS中采用如下具体命令行进行:able_annovar.pl all.vcf.avinput dir/humandb/-buildver hg19-out x--otherinfo-remove-protocol refGene,avsnp150,clinvar_20200316,gnomad_exome-operation g,f,f,f-nastring NA。
2.3 Score3评分
标注单倍体疾病基因中的预测功能丧失pLoF突变,并对得到的pLoF突变数据进行筛选,根据该基因的主要转录本(即NCBI refGene中的基因最近更新的转录本)对pLoF位点进行annovar注释,包括pLoF位点位于基因组的位置,位于基因的具体转录本,所属变异类型,可能存在的氨基酸改变等信息。
Annovar部分注释结果如下表所示。
表3.annovar注释结果(refGene注释)
Figure PCTCN2021101184-appb-000002
Figure PCTCN2021101184-appb-000003
Figure PCTCN2021101184-appb-000004
表4.annovar注释结果(dbSNP150注释结果)
染色体 起始位置 终止位置 参考碱基 突变碱基 dbSNP150注释结果
Chr Start End Ref Alt avsnp150
chr1 14653 14653 C T rs62635297
chr1 14907 14907 A G rs6682375
chr1 14930 14930 A G rs6682385
chr1 14932 14932 G T NA
chr1 14933 14933 G A rs199856693
chr1 15903 15903 - C rs557514207
chr1 69331 69331 C T NA
chr1 69336 69336 C T NA
chr1 69460 69460 C T NA
chr1 69462 69462 C G NA
chr1 69474 69474 T G rs752034042
chr1 69486 69486 C A NA
chr1 69486 69486 C T rs548369610
chr1 69491 69491 G A NA
chr1 69495 69495 C T NA
chr1 69496 69496 G A rs150690004
chr1 69496 69496 G T NA
chr1 69510 69510 C T NA
chr1 69511 69511 A G rs2691305
chr1 69513 69513 A G rs770590115
chr1 69521 69521 T C rs553724620
chr1 69522 69522 T C NA
chr1 69534 69534 T C rs190717287
chr1 69543 69543 C A NA
chr1 69550 69550 G A NA
chr1 69550 69550 G T NA
chr1 69552 69552 G A rs2531266
chr1 69555 69555 T G rs556374459
chr1 69559 69559 G A rs754025211
表5.annovar注释结果(ClinVar注释结果)
Figure PCTCN2021101184-appb-000005
表6.annovar注释结果(gnomAD_exome注释结果)
Figure PCTCN2021101184-appb-000006
对于pLoF位点,根据上述注释信息进行筛选,条件为:仅保留出现在主转录本中Stop-gained、Splice site和Frameshift类型的pLoF突变位点,其他不出现在主转录本而出现在非主转录本的pLoF位点忽略。
上述Stop-gained突变指使原来的氨基酸变成终止密码子,使蛋白转录提前终止的突变;Splice site指会影响转录本的正常转录,导致内含子区域也可能会转录进来,导致了整个蛋白质的结构发生变化的突变;Frameshift指移码突变,即剪接体可识别的RNA前体中内含子和外显子连接边界的序列和接头位点。假如该位置发生突变,导致mRNA的剪切出问题,蛋白质翻译就会出问题,剪切是转录的过程中将内含子剪切掉,保留外显子的过程。如果是经典剪切位点GT-AG突变则比较严重,属于致病性变异。
经过上述筛选,保留pLoF突变位点如下表所示:
表7.pLoF列表
Figure PCTCN2021101184-appb-000007
Figure PCTCN2021101184-appb-000008
根据最终得到pLoF位点进行Score3评分,当突变为高可信的pLoF,则评分为+3;当突变为低可信的pLoF,则评分为+2。
2.4位点评分
获得Score1-Score3评分后,将上述突变位点按照下述公式进行评分:
Svar=(Fq×Pop×2) max=M×(Score1+Score2+Score3)
其中:
Svar为突变位点得分;
Fq为群体频率;
Pop为群体总数,本实施例为42868;
M为预设最大值,本实施例中,M=100。
通过这个公式,我们可以得到每一个位点的分数Svar,当Svar分数大于0,则判定为0。
3、区域片段评分(S3)
3.1窗口大小确定。
本发明人在前期研究中,对dbSNP150,EXAC03全外显子数据库和genomeAD的全外显子数据库进行分析,统计上述数据库中记载突变位点之间的距离,获得相邻两突变位点之间距离大小,发现以25bp的窗口大小,可以覆盖超过95%的突变间隔,如图2所示。因此确定切割窗口大小为25bp。
从每个突变延后25bp窗口大小进行切割,得到若干包括突变位点的区域片段。
3.2序列(区域片段)评分
按照下述公式对每个区域片段进行评分:
Figure PCTCN2021101184-appb-000009
其中:Stotal为区域片段的总评分,N为该区域片段的总突变数。
4、位点排除(S4)
如区域片段的Stotal总评分低于预定阈值,则判定该区域片段中的突变位点为人工引入错误位点,排除该突变。
具体的,将所有区域片段的Stotal总评分进行比较,可通过一个具体数字(如-100或者-50),也可以进行排序,取最低分数的前5%,这些窗口的突变则全为人口引入的错误突变。
本实施例中以-418作为筛选阈值,小于-418(即所有片段根据分数从小到大排序中的5%分数)进行筛选,得出的阳性结果(判定为人工引入错误突变),其中部分得分负值较大及负值较小的人工引入错误突变如下表所示。
表8.人工引入错误突变及其所对应的窗口、窗口总突变数及总分数
染色体 位置 所处窗口 窗口总突变数 总分数
chr9 35906583 chr9:35906583-35906607 428 -7026708.30830639
chr9 35906584 chr9:35906584-35906608 415 -6726549.28627591
chr9 35906585 chr9:35906585-35906609 414 -6704131.02405082
chr9 35906586 chr9:35906586-35906610 413 -6665636.51330758
chr9 35906580 chr9:35906580-35906604 401 -6078580.34869512
chr9 35906582 chr9:35906582-35906606 398 -5996092.36077493
chr9 35906581 chr9:35906581-35906605 398 -5995694.37824108
chr9 35906577 chr9:35906577-35906601 388 -5743785.43456742
chrX 100603410 chrX:100603410-100603434 171 -5728052.73344433
chrX 100603413 chrX:100603413-100603437 170 -5694215.36395906
chr9 35906587 chr9:35906587-35906611 350 -4800809.36996248
chr9 35906589 chr9:35906589-35906613 347 -4665279.90481323
chr9 35906588 chr9:35906588-35906612 342 -4574801.96584414
chr9 35906590 chr9:35906590-35906614 341 -4569608.93720737
chr9 35906592 chr9:35906592-35906616 344 -4437122.81440532
chr9 35906591 chr9:35906591-35906615 340 -4387228.27972579
chr9 35906576 chr9:35906576-35906600 309 -3402299.57000941
chr2 11727844 chr2:11727844-11727868 126 -3396924.90964996
chr2 11727843 chr2:11727843-11727867 123 -3379268.24098448
chr9 35906575 chr9:35906575-35906599 305 -3347277.37297846
chr2 11727854 chr2:11727854-11727878 134 -3326648.77447192
chr2 11727846 chr2:11727846-11727870 134 -3326380.78853077
chr18 77136193 chr18:77136193-77136217 133 -3311664.8494566
chr2 11727847 chr2:11727847-11727871 133 -3301291.06297629
chr2 11727856 chr2:11727856-11727880 132 -3275413.37200303
chr18 77136195 chr18:77136195-77136219 134 -3257772.50770687
chr2 11727857 chr2:11727857-11727881 131 -3249813.66882979
chr9 35906574 chr9:35906574-35906598 298 -3237378.0933243
chr2 11727858 chr2:11727858-11727882 130 -3224225.96512994
chr2 11727861 chr2:11727861-11727885 129 -3215677.48708922
chr18 77136197 chr18:77136197-77136221 134 -3178176.47012576
chr2 11727863 chr2:11727863-11727887 127 -3162266.19129363
chr9 35906571 chr9:35906571-35906595 290 -3121469.8819954
chr18 77136198 chr18:77136198-77136222 133 -3074658.73527408
chr9 70871928 chr9:70871928-70871952 193 -3048926.13846847
chr9 70871929 chr9:70871929-70871953 192 -3031976.64755712
chr18 77136191 chr18:77136191-77136215 124 -3012426.70434945
chr9 70871930 chr9:70871930-70871954 190 -2994693.82716957
chr2 11727864 chr2:11727864-11727888 123 -2979273.69614749
chr9 70871931 chr9:70871931-70871955 188 -2960162.87270482
chr2 11727865 chr2:11727865-11727889 122 -2954807.96948833
chr5 167945074 chr5:167945074-167945098 200 -2930730.88303328
chr18 77136190 chr18:77136190-77136214 118 -2861236.37222784
chr2 11727867 chr2:11727867-11727891 117 -2708755.82961834
chr11 117039153 chr11:117039153-117039177 140 -2604220.84949085
chr1 1247578 chr1:1247578-1247602 280 -2601686.59648442
chr1 1247579 chr1:1247579-1247603 279 -2587373.0933815
chr11 117039151 chr11:117039151-117039175 139 -2584785.31097787
chr11 117039154 chr11:117039154-117039178 139 -2584785.31097787
chr1 1247574 chr1:1247574-1247598 278 -2582547.14519519
chr9 32986030 chr9:32986030-32986054 45 -4409.795665296
注:每个位点均至少对应以自身位置为起点的窗口,以及可能对应其它窗口,如该位点落入任一判定为人工引入错误窗口,另一窗口未被认为属于人工引入错误,则认定为人工引入错 误位点的结论不变。
上述结果看出,本实施例从8245702个突变中筛选掉155338个人工引入错误位点,筛选率为1.88%;保留了8090364个位点,占比98.12%,在正常的筛选范围(大于95%)内。
5、验证
抽取上述判定为存在人工引入错误突变的2个窗口片段进行验证。
5.1片段一
chr9:35906583位点对应的chr9:35906583-35906607窗口,其部分突变情况如下表所示。
表9.chr9:35906583位点对应的窗口(chr9:35906583-35906607)突变情况
Figure PCTCN2021101184-appb-000010
Figure PCTCN2021101184-appb-000011
Figure PCTCN2021101184-appb-000012
Figure PCTCN2021101184-appb-000013
Figure PCTCN2021101184-appb-000014
该窗口有428个突变位点(上表示例性展示了部分位点35906583-35906589),根据dbSNP、ClinVar和gnomAD-exome三大数据库判断,95%25bp长的窗口出现突变位点的数量不超过18个,该窗口远超过此数量,提示存在人工引入错误;且这些突变位点除了存在几 个SNV外,其他均为INDEL,而INDEL的准确性不高,如此多的INDEL在同一区域内,其准确性更是大大降低。该位点所在窗口的IGV可视化结果见图3,图3为2个样本在该区域的reads比对情况,窗口长度为25bp,可以看出该区域的质量不佳,GC含量过高,可信程度不够。因此判断为人工引入的错误。与本发明的方法判断结果相吻合。
5.2片段二
再如chr9:32986030位点对应chr9:32986030-32986054窗口,其突变情况如下表和图4所示。
表10.chr9:32986030位点对应的窗口(chr9:32986030-32986054)突变情况
染色体 起始位置 终止位置 参考碱基 突变碱基
chr9 32986030 32986030 - A
chr9 32986030 32986030 - AAAAAAAAACAA
chr9 32986030 32986030 T -
chr9 32986030 32986030 T A
chr9 32986030 32986030 T C
chr9 32986031 32986031 A -
chr9 32986031 32986032 AA -
chr9 32986031 32986043 AAAAAAAAAACAA -
chr9 32986031 32986053 AAAAAAAAAACAAAAAAAAAAAC -
chr9 32986032 32986053 AAAAAAAAACAAAAAAAAAAAC -
chr9 32986033 32986053 AAAAAAAACAAAAAAAAAAAC -
chr9 32986034 32986053 AAAAAAACAAAAAAAAAAAC -
chr9 32986035 32986053 AAAAAACAAAAAAAAAAAC -
chr9 32986036 32986053 AAAAACAAAAAAAAAAAC -
chr9 32986037 32986037 - AAC
chr9 32986038 32986038 - AC
chr9 32986038 32986053 AAACAAAAAAAAAAAC -
chr9 32986039 32986039 - C
chr9 32986039 32986039 A C
chr9 32986040 32986040 A C
chr9 32986041 32986041 - A
chr9 32986041 32986041 C -
chr9 32986041 32986041 C A
chr9 32986042 32986056 AAAAAAAAAAACAAA -
chr9 32986044 32986044 - C
chr9 32986044 32986056 AAAAAAAAACAAA -
chr9 32986045 32986045 - C
chr9 32986045 32986045 A C
chr9 32986046 32986046 - C
chr9 32986046 32986046 A C
chr9 32986047 32986047 - C
chr9 32986047 32986047 A C
chr9 32986048 32986048 - C
chr9 32986048 32986048 A C
chr9 32986049 32986049 - C
chr9 32986049 32986049 A C
chr9 32986050 32986050 - C
chr9 32986050 32986050 A T
chr9 32986052 32986052 - AC
chr9 32986052 32986053 AC -
chr9 32986053 32986053 C -
chr9 32986053 32986053 C A
chr9 32986054 32986054 - C
chr9 32986054 32986054 A C
chr9 32986054 32986057 AAAA -
由上述结果可知,该窗口有45个突变位点,根据dbSNP、ClinVar和gnomAD-exome三大数据库判断,95%的25bp长窗口出现突变位点的数量不超过18个,该窗口超过此数量;且这些位点非常零散,相隔非常近,ploy A特征显著,非常容易出错;该位点iGV可视化结果见图4,图4为2个样本在该区域的reads比对情况,窗口长度为25bp,可以看出该区域的质量不佳,可信程度不够,因此判断为人工引入的错误。与本发明的方法判断结果相吻合。
验证结果证实,本发明的方法可除去绝大部分的人工引物错误突变,留下大量真实的突变。
以上所述实施例的各技术特征可以进行任意的组合,为使描述简洁,未对上述实施例中的各个技术特征所有可能的组合都进行描述,然而,只要这些技术特征的组合不存在矛盾,都应当认为是本说明书记载的范围。
以上所述实施例仅表达了本发明的几种实施方式,其描述较为具体和详细,但并不能因此而理解为对发明专利范围的限制。应当指出的是,对于本领域的普通技术人员来说,在不脱离本发明构思的前提下,还可以做出若干变形和改进,这些都属于本发明的保护范围。因此,本发明专利的保护范围应以所附权利要求为准。

Claims (16)

  1. 一种降低高通量测序中人工引入错误突变的方法,其特征在于,包括以下步骤:
    1)建立群体突变频率数据库:获取若干样本高通量测序数据,整合得到每个突变位点的群体突变频率Fq,以及群体总数Pop;
    2)位点评分:获取待测样本高通量测序数据,获得每个突变位点的信息,将各突变位点按照下述公式进行评分:
    Svar=(Fq×Pop×2) max=M×(Score1+Score2+Score3)
    其中:Svar为突变位点得分,Fq为群体突变频率,Pop为群体总数,M为预设最大值,Score1为突变类型分数,Score2为突变坐标分数,Score3为pLoF分数;
    所述Score1的评分标准为:当突变为已知突变,不扣分;当突变为未知突变,扣分;
    所述Score2的评分标准为:当突变为未知突变,则根据该突变所在refGene位置,按照预定规则扣分;
    所述Score3的评分标准为:根据LOFTEE预测突变为功能缺失的可信度情况,按照预定规则扣分;
    3)区域片段评分:从每个突变延后预定窗口大小划分,得到若干包括突变位点的区域片段;
    按照下述公式对每个区域片段进行评分:
    Figure PCTCN2021101184-appb-100001
    其中:Stotal为区域片段的总评分,N为该区域片段的总突变数;
    4)位点排除:如区域片段的Stotal总评分低于预定阈值,则判定该区域片段中的突变位点为人工引入错误位点,排除该突变。
  2. 根据权利要求1所述的降低高通量测序中人工引入错误突变的方法,其特征在于,所述建立群体突变频率数据库步骤中,将每个突变位点按照染色体和基因组位置顺序进行排序。
  3. 根据权利要求1所述的降低高通量测序中人工引入错误突变的方法,其特征在于,所述位点评分步骤中,
    所述Score1的评分标准为:如该突变位点在dbSNP、ClinVar或gnomAD-exome数据库有注释,则判定该位点的突变为已知突变,不扣分;否则判定突变为未知突变,扣分;
    所述Score2的评分标准为:当突变为未知突变,则根据该突变所在区域为refGene的exonic、splicing、UTR-3、UTR-5、upstream、downstream、intronic、intergenic或ncRNA区位置,按照预定规则扣分;
    所述Score3的评分标准为:根据LOFTEE预测为突变为功能缺失高可信度位点或低可信位点,按照预定规则扣分。
  4. 根据权利要求3所述的降低高通量测序中人工引入错误突变的方法,其特征在于,所述位点评分步骤中:
    所述Score1的评分标准中,当突变为已知突变,则评分为0;当突变为未知SNV,则评分为-1;当突变为未知INDEL,则评分为-5;
    所述Score2的评分标准中,当突变位于exonic区,则评分为0;当突变位于UTR-3、UTR-5、upstream、downstream、intronic、intergenic或ncRNA区,则评分为-1;当突变位于splicing区,则评分为-2;
    所述Score3的评分标准中,当突变为高可信的pLoF,则评分为+3;当突变为低可信的pLoF,则评分为+2。
  5. 根据权利要求3所述的降低高通量测序中人工引入错误突变的方法,其特征在于,所述位点评分步骤中:M=100。
  6. 根据权利要求3所述的降低高通量测序中人工引入错误突变的方法,其特征在于,所述位点评分步骤中:当突变位点的Svar分数大于0,则将此位点得分定义为0。
  7. 根据权利要求1所述的降低高通量测序中人工引入错误突变的方法,其特征在于,所述refGene位置根据比对到NCBI refGene主转录本区域确定,所述pLoF仅保留出现在主转录本中Stop-gained、Splice site和Frameshift类型的pLoF突变位点。
  8. 根据权利要求7所述的降低高通量测序中人工引入错误突变的方法,其特征在于,所述主转录本为NCBI refGene中最近更新的转录本。
  9. 根据权利要求1所述的降低高通量测序中人工引入错误突变的方法,其特征在于,所述区域片段评分步骤中,所述窗口大小通过以下方法得到:以覆盖数据库中95±5%相邻突变位点的区间长度为窗口大小。
  10. 根据权利要求9所述的降低高通量测序中人工引入错误突变的方法,其特征在于,所述数据库为dbSNP数据库、ECAC03全外显子数据库和/或genomeAD全外显子数据库。
  11. 根据权利要求9所述的降低高通量测序中人工引入错误突变的方法,其特征在于,所述窗口大小为25±5bp。
  12. 根据权利要求1所述的降低高通量测序中人工引入错误突变的方法,其特征在于,所述位点排除步骤中,所述阈值为该待测样本中各区域片段按照Stotal得分由低至高排序的 前5%所对应区域片段的Stotal得分。
  13. 权利要求1-12任一项所述的降低高通量测序中人工引入错误突变的方法在高通量测序中的应用。
  14. 根据权利要求13所述的应用,其特征在于,所述高通量测序为全外显子测序。
  15. 权利要求1-12任一项所述的方法在疾病诊疗全外显子突变筛查中的应用。
  16. 一种降低高通量测序中人工引入错误突变的装置,其特征在于,包括:
    分析模块:用于按照权利要求1-12任一项所述的降低高通量测序中人工引入错误突变的方法,对待测样本高通量测序数据的突变数据进行分析。
PCT/CN2021/101184 2021-06-21 2021-06-21 降低高通量测序中人工引入错误突变的方法及应用 WO2022266790A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/101184 WO2022266790A1 (zh) 2021-06-21 2021-06-21 降低高通量测序中人工引入错误突变的方法及应用

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/101184 WO2022266790A1 (zh) 2021-06-21 2021-06-21 降低高通量测序中人工引入错误突变的方法及应用

Publications (1)

Publication Number Publication Date
WO2022266790A1 true WO2022266790A1 (zh) 2022-12-29

Family

ID=84545024

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/101184 WO2022266790A1 (zh) 2021-06-21 2021-06-21 降低高通量测序中人工引入错误突变的方法及应用

Country Status (1)

Country Link
WO (1) WO2022266790A1 (zh)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107229841A (zh) * 2017-05-24 2017-10-03 重庆金域医学检验所有限公司 一种基因变异评估方法及系统
CN107423578A (zh) * 2017-03-02 2017-12-01 北京诺禾致源科技股份有限公司 检测体细胞突变的装置
CN112233725A (zh) * 2020-10-14 2021-01-15 合肥达徽基因科技有限公司 Atp7b基因突变二代测序自动化分析解读方法和报告系统

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107423578A (zh) * 2017-03-02 2017-12-01 北京诺禾致源科技股份有限公司 检测体细胞突变的装置
CN107229841A (zh) * 2017-05-24 2017-10-03 重庆金域医学检验所有限公司 一种基因变异评估方法及系统
CN112233725A (zh) * 2020-10-14 2021-01-15 合肥达徽基因科技有限公司 Atp7b基因突变二代测序自动化分析解读方法和报告系统

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
KARCZEWSKI, K. J. ET AL.: "The mutational constraint spectrum quantified from variation in 141, 456 humans", NATURE, vol. 581, 27 May 2020 (2020-05-27), XP037533562, DOI: 10.1038/s41586-020-2308-7 *
OVED JOSEPH H., MICHELE P LAMBERT, M ANNA KOWALSKA, MORTIMER PONCZ, KONRAD J KARCZEWSKI : "Population based frequency of naturally occurring loss-of-function", J. THROMB. HAEMOST., vol. 19, no. 1, 23 October 2020 (2020-10-23), pages 248 - 254, XP093016976, DOI: 10.1111/jth.15113 *
SUE RICHARDS, NAZNEEN AZIZ, SHERRI BALE, DAVID BICK, SOMA DAS, JULIE GASTIER-FOSTER, WAYNE W. GRODY, MADHURI HEGDE, ELAINE LYON, E: "Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology", GENETICS IN MEDICINE, NATURE PUBLISHING GROUP US, NEW YORK, vol. 17, no. 5, 1 May 2015 (2015-05-01), New York, pages 405 - 423, XP055331624, ISSN: 1098-3600, DOI: 10.1038/gim.2015.30 *

Similar Documents

Publication Publication Date Title
CN109033749B (zh) 一种肿瘤突变负荷检测方法、装置和存储介质
Shimane et al. An association analysis of HLA-DRB1 with systemic lupus erythematosus and rheumatoid arthritis in a Japanese population: effects of* 09: 01 allele on disease phenotypes
CN109658983B (zh) 一种识别和消除核酸变异检测中假阳性的方法和装置
CN109767810B (zh) 高通量测序数据分析方法及装置
CN108624650B (zh) 判断实体瘤是否适合免疫治疗的方法和检测试剂盒
CN104232777A (zh) 同时确定胎儿核酸含量和染色体非整倍性的方法及装置
AU2016355983B2 (en) Methods for detecting copy-number variations in next-generation sequencing
WO2021232388A1 (zh) 确定胚胎细胞染色体中预定位点碱基类型的方法及其应用
Wang et al. Plasma cell-free RNA characteristics in COVID-19 patients
US20160078168A1 (en) Fusion transcript detection methods and fusion transcripts identified thereby
Verhagen et al. A disease-associated microRNA cluster links inflammatory pathways and an altered composition of leukocyte subsets to noninfectious uveitis
CN111139291A (zh) 一种单基因遗传性疾病高通量测序分析方法
Sembler‐Møller et al. Distinct microRNA expression profiles in saliva and salivary gland tissue differentiate patients with primary Sjögren's syndrome from non‐Sjögren's sicca patients
Chintalapati et al. Using the Neandertal genome to study the evolution of small insertions and deletions in modern humans
CN111223525A (zh) 一种肿瘤外显子测序数据分析方法
CN110838341A (zh) 一种ATAC-seq测序数据的生物信息分析方法
WO2022266790A1 (zh) 降低高通量测序中人工引入错误突变的方法及应用
AU2020364225B2 (en) Fragment size characterization of cell-free DNA mutations from clonal hematopoiesis
CN113470746B (zh) 降低高通量测序中人工引入错误突变的方法及应用
CN114990202B (zh) Snp位点在评估基因组异常的应用及评估基因组异常的方法
CN111128308A (zh) 一种神经精神疾病新发突变信息知识平台
CN116287204A (zh) 检测特征基因的突变情况在制备静脉血栓栓塞症风险检测产品中的应用
CN115171784A (zh) 多基因遗传病症风险基因和风险突变的筛选方法及筛选系统
Sabbagh et al. Clinico-biological refinement of BCL11B-related disorder and identification of an episignature: A series of 20 unreported individuals
CN113764038A (zh) 构建骨髓增生异常综合征转白基因预测模型的方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21946293

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE