WO2022266790A1 - 降低高通量测序中人工引入错误突变的方法及应用 - Google Patents
降低高通量测序中人工引入错误突变的方法及应用 Download PDFInfo
- Publication number
- WO2022266790A1 WO2022266790A1 PCT/CN2021/101184 CN2021101184W WO2022266790A1 WO 2022266790 A1 WO2022266790 A1 WO 2022266790A1 CN 2021101184 W CN2021101184 W CN 2021101184W WO 2022266790 A1 WO2022266790 A1 WO 2022266790A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- mutation
- site
- score
- mutations
- throughput sequencing
- Prior art date
Links
- 230000035772 mutation Effects 0.000 title claims abstract description 231
- 238000000034 method Methods 0.000 title claims abstract description 46
- 238000012163 sequencing technique Methods 0.000 title abstract description 10
- 230000036438 mutation frequency Effects 0.000 claims abstract description 16
- 238000012165 high-throughput sequencing Methods 0.000 claims description 37
- 239000012634 fragment Substances 0.000 claims description 21
- 101150054338 ref gene Proteins 0.000 claims description 14
- 210000000349 chromosome Anatomy 0.000 claims description 10
- 238000012216 screening Methods 0.000 claims description 8
- 201000010099 disease Diseases 0.000 claims description 7
- 208000037265 diseases, disorders, signs and symptoms Diseases 0.000 claims description 7
- 230000037433 frameshift Effects 0.000 claims description 6
- 108020004417 Untranslated RNA Proteins 0.000 claims description 5
- 102000039634 Untranslated RNA Human genes 0.000 claims description 5
- 238000004458 analytical method Methods 0.000 claims description 5
- 238000003745 diagnosis Methods 0.000 claims description 5
- 230000007717 exclusion Effects 0.000 claims description 5
- 238000011144 upstream manufacturing Methods 0.000 claims description 5
- 238000007482 whole exome sequencing Methods 0.000 claims description 4
- 230000008901 benefit Effects 0.000 abstract description 4
- 230000003321 amplification Effects 0.000 abstract description 3
- 238000001514 detection method Methods 0.000 abstract description 3
- 238000003199 nucleic acid amplification method Methods 0.000 abstract description 3
- 108090000623 proteins and genes Proteins 0.000 description 10
- 239000002773 nucleotide Substances 0.000 description 4
- 125000003729 nucleotide group Chemical group 0.000 description 4
- 238000013518 transcription Methods 0.000 description 4
- 230000035897 transcription Effects 0.000 description 4
- 238000012795 verification Methods 0.000 description 4
- 238000012800 visualization Methods 0.000 description 4
- 238000012217 deletion Methods 0.000 description 3
- 230000008569 process Effects 0.000 description 3
- 102000004169 proteins and genes Human genes 0.000 description 3
- 238000003908 quality control method Methods 0.000 description 3
- 238000011160 research Methods 0.000 description 3
- 108700028369 Alleles Proteins 0.000 description 2
- 108700024394 Exon Proteins 0.000 description 2
- 108091028043 Nucleic acid sequence Proteins 0.000 description 2
- 230000002776 aggregation Effects 0.000 description 2
- 238000004220 aggregation Methods 0.000 description 2
- 150000001413 amino acids Chemical class 0.000 description 2
- 238000004364 calculation method Methods 0.000 description 2
- 230000037430 deletion Effects 0.000 description 2
- 239000003814 drug Substances 0.000 description 2
- 238000003780 insertion Methods 0.000 description 2
- 230000037431 insertion Effects 0.000 description 2
- 230000001717 pathogenic effect Effects 0.000 description 2
- 238000012360 testing method Methods 0.000 description 2
- 108020004705 Codon Proteins 0.000 description 1
- 108091092195 Intron Proteins 0.000 description 1
- 238000012408 PCR amplification Methods 0.000 description 1
- 108020005093 RNA Precursors Proteins 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 239000003795 chemical substances by application Substances 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 230000003111 delayed effect Effects 0.000 description 1
- 238000010586 diagram Methods 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 238000011156 evaluation Methods 0.000 description 1
- 238000001914 filtration Methods 0.000 description 1
- 231100000221 frame shift mutation induction Toxicity 0.000 description 1
- 230000002068 genetic effect Effects 0.000 description 1
- 230000007774 longterm Effects 0.000 description 1
- 230000004777 loss-of-function mutation Effects 0.000 description 1
- 108020004999 messenger RNA Proteins 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000004806 packaging method and process Methods 0.000 description 1
- 238000002360 preparation method Methods 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
- 230000000717 retained effect Effects 0.000 description 1
- 210000001324 spliceosome Anatomy 0.000 description 1
- 238000013519 translation Methods 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/50—Mutagenesis
Definitions
- the invention relates to the technical field of bioinformatics, in particular to a method and application for reducing artificially introduced error mutations in high-throughput sequencing.
- NGS technology provides convenience for genetic testing, especially whole exome sequencing, which has an advantage over whole gene sequencing in terms of price. Due to its high cost performance, whole exome sequencing is becoming more and more widely used in medicine.
- the main function of all-external sequencing is to find pathogenic mutations and provide important data for the diagnosis and treatment of diseases.
- the mutations screened in all external channels need to undergo repeated filtering for quality control.
- the quality control of sequencing is already very strict, due to factors such as non-specific amplification in the PCR amplification process, some artificially introduced errors (artefacts) pass the hard quality control. Control still cannot be ruled out, which has a huge impact on the diagnosis of the disease.
- Site scoring Obtain the high-throughput sequencing data of the samples to be tested, obtain the information of each mutation site, and score each mutation site according to the following formula:
- Svar is the mutation site score
- Fq is the population mutation frequency
- Pop is the total population
- M is the preset maximum value
- Score1 is the mutation type score
- Score2 is the mutation coordinate score
- Score3 is the pLoF score
- Score1 The scoring standard of Score1 is: when the mutation is a known mutation, no points will be deducted; when the mutation is an unknown mutation, points will be deducted;
- the scoring standard of Score2 is: when the mutation is an unknown mutation, according to the position of the refGene where the mutation is located, points will be deducted according to predetermined rules;
- Score3 The scoring standard of Score3 is: according to the credibility of LOFTEE predicting mutation as loss of function, points are deducted according to predetermined rules;
- Stotal is the total score of the region fragment, N is the total mutation number of the region fragment;
- the inventors found that the artificially introduced erroneous mutations have the following characteristics: a large number of indel sites gather at both ends of the captured sequencing region, resulting in a In the region, different people have different indels, resulting in a large number of INDEL aggregations in the population. Based on this previous research, the inventor proposed a method of weighting and scoring the above-mentioned features through a window of a specific length and obtaining the score of the window area segment by weighting, and then screening the confidence interval to eliminate artificial error positions in batches.
- each mutation site is sorted according to the sequence of chromosome and genome position.
- the scoring standard of Score1 is: if the mutation site is annotated in dbSNP, ClinVar or gnomAD-exome database, then it is determined that the mutation at the site is known If there is a mutation, no points will be deducted; otherwise, if the mutation is determined to be an unknown mutation, points will be deducted;
- the scoring standard of Score2 is: when the mutation is an unknown mutation, according to the exonic, splicing, UTR-3, UTR-5, upstream, downstream, intronic, intergenic or ncRNA region of refGene where the mutation is located, according to the predetermined rules deduct points;
- the scoring standard of Score3 is: According to LOFTEE, the mutation is predicted to be a loss of function (pLoF) high-confidence locus or low-confidence locus, and points are deducted according to predetermined rules.
- LOFTEE loss of function
- the full name of the above dbSNP is the single nucleotide polymorphism database, that is, the single nucleotide polymorphism database, which means "a single base pair (base pair) variation in the dna sequence", that is, a, t in the dna sequence , c, g changes, that is, the possibility of two or more nucleotides appearing at a specific and positioned site in the genome, which is the most common type of human heritable variation, therefore, according to each mutation
- the region where the site is located and the mutation type are marked with known mutations that have been reported as known in the art, which is helpful to distinguish whether the mutation is an artificially introduced error.
- the above-mentioned dbSNP database is preferably dbSNP150.
- the above-mentioned LOFTEE predicts that the mutation is a loss-of-function pLoF mutation, which is a putative loss-of-function (pLoF) mutation in the haploid disease gene in the Genome Aggregation Database (gnomAD) manually, and distinguishing the mutation type is also conducive to subsequent analysis and reference.
- the score when the mutation is located in the exonic region, the score is 0; when the mutation is located in UTR-3, UTR-5, upstream, downstream, intronic, intergenic or ncRNA region, the score is -1; when the mutation is located in splicing area, the score is -2;
- the main characteristics of artificial errors (false positive insertion and deletion sites are gathered at both ends of the captured sequencing region), so within a specific window range, the largest weight should be insertion and deletion (INDEL), so in The weight of unknown INDEL detected by the window is the largest, and other influencing factors include: SNP, whether it is a known mutation, the type of mutation, whether it is a pLOF mutation annotated by LOEFF, etc.
- SNP SNP
- the type of mutation whether it is a pLOF mutation annotated by LOEFF, etc.
- the above-mentioned known mutations are mutations that have been reported.
- the frequency is generally used to measure the number of occurrences of this site.
- M maximum value
- the site scoring step when the Svar score of the mutation site is greater than 0, the site score is defined as 0. Considering that this method mainly adopts the deduction system for the mutation site, when the Svar score of the mutation site is greater than 0, no points will be deducted, and it is defined as 0, which can more accurately evaluate whether there are artificially introduced error mutations in the region segment .
- the position of the refGene is determined according to the alignment to the main transcript region of the NCBI refGene, and the pLoF only retains the pLoF mutation sites of Stop-gained, Splice site and Frameshift types that appear in the main transcript.
- the main transcript is the most recently updated transcript in NCBI refGene, that is, the transcript with the shortest time from the current time.
- the window size is obtained by the following method: the window size is the interval length covering 95 ⁇ 5% of adjacent mutation sites in the database. Adjacent among the above-mentioned adjacent mutation sites refers to the length of the interval between two pairs, and the sequence length that can cover most of the intervals between two mutation sites is used as the window size, so that artificially introduced error mutations can be correctly identified.
- the database is dbSNP database, ECAC03 whole exon database and/or genomeAD whole exon database.
- the window size is 25 ⁇ 5bp.
- the threshold in the site exclusion step, is the Stotal score of the sequence fragments corresponding to the top 5% of the fragments in each region in the sample to be tested according to the Stotal score sorted from low to high. It can be understood that the above threshold may also be set to a certain score threshold according to practical needs.
- the present invention also discloses the application of the above-mentioned method for reducing artificially introduced error mutations in high-throughput sequencing to high-throughput sequencing.
- the high-throughput sequencing is whole exome sequencing.
- the invention also discloses the application of the above method in the screening of whole exon mutation in disease diagnosis and treatment.
- the invention also discloses a device for reducing artificially introduced error mutations in high-throughput sequencing, which is characterized in that it includes:
- Analysis module used to analyze the mutation data of the high-throughput sequencing data of the samples to be tested according to the above-mentioned method for reducing artificially introduced error mutations in high-throughput sequencing.
- the present invention has the following beneficial effects:
- a method for reducing artificially introduced error mutations in high-throughput sequencing of the present invention not only identifies and eliminates artificial errors of low-frequency and harmful mutations, but also covers general artificially introduced error mutation sites, which is comprehensive and accurate The advantages.
- the method of the present invention can overcome the defects of low efficiency and strong experience dependence existing in manual screening through automatic computer analysis and processing, and is suitable for wide popularization and application.
- Figure 1 is a schematic flow chart of the method for reducing artificially introduced error mutations in high-throughput sequencing in the embodiment
- Fig. 2 is a schematic diagram of Variants intervals corresponding to different window lengths in the embodiment
- Fig. 3 is the IGV visualization result of chr9: 35906583 site in the embodiment
- Fig. 4 is the IGV visualization result of the chr9:32986030 site in the embodiment.
- each mutation site includes: the chromosome where the site is located, the specific position, the reference genome base, the mutation base types, etc., and sorted according to chromosome and genome positions.
- the population mutation frequency is an important indicator of the calculated evaluation score of the mutation site of the present invention, which needs to be obtained by calculation. Specifically: input the above-mentioned mutation site information data into the standard format of annovar, that is, add the frequency data in the vcf format file, part of which is shown in the following table.
- chromosome Location reference base mutant base allele frequency information homozygous/heterozygous chr1 14653 C T 2.33263e-05 GT 1/1 chr1 14907 A G 0.000128295 GT 1/1 chr1 14930 A G 0.000139958 GT 1/1 chr1 14932 G T 2.33263e-05 GT 1/1 chr1 14933 G A 1.16632e-05 GT 1/1 chr1 15903 G GC 6.9979e-05 GT 1/1 chr1 69331 C T 1.16632e-05 GT 1/1 chr1 69336 C T 1.16632e-05 GT 1/1 chr1 69460 C T 1.16632e-05 GT 1/1 chr1 69462 C G 2.33263e-05 GT 1/1 chr1 69474 T G 1.16632e-05 GT 1/1 chr1 69486 C A 2.33263e-05 GT 1/1 chr1 69486 C A 2.3
- the chromosome and position represent the location of the mutation in the genome
- the reference base is the base type of the reference genome hg19 at this position
- the mutation base is the base type detected by this sequencing that is different from the reference genome
- allele The gene frequency is the frequency of the mutation in the population, that is, the number of occurrences of the mutation divided by (the total number of the population multiplied by 2);
- the information is GT stands for the abbreviation of Genetype; 1/1 means homozygous, 0/1 means heterozygous.
- chromosome starting point end position reference base mutant base homozygous/heterozygous type frequency chr1 14653 14653 C T het 2.33263e-05 chr1 14907 14907 A G het 0.000128295 chr1 14930 14930 A G het 0.000139958 chr1 14932 14932 G T het 2.33263e-05 chr1 14933 14933 G A het 1.16632e-05 chr1 15903 15903 - C het 6.9979e-05 chr1 69331 69331 C T het 1.16632e-05 chr1 69336 69336 C T het 1.16632e-05 chr1 69460 69460 C T het 1.16632e-05 chr1 69462 69462 C G het 2.33263e-05 chr1 69474 69474 T G het 1.166
- Svar is the score of the mutation site
- Fq is the population frequency
- Pop is the total number of the population
- M is the preset maximum value
- Score1 is the score of the mutation type
- Score2 is the score of the mutation coordinate
- Score3 is the score of pLoF.
- the mutation site is annotated in dbSNP150, ClinVar or gnomAD-exome database, it is judged that the mutation at this site is a known mutation, and the score is 0; when the mutation is an unknown SNV (single nucleotide variants), the score is -1 ; when the mutation is an unknown INDEL (insertion-deletion), the score is -5.
- Score1 when the mutation is a known mutation, Score2 will not be scored, and when the mutation is an unknown mutation, it will be scored according to the region where the mutation is located in different regions of refGene. For example, when the mutation is located in the exonic region, the score is 0; if the mutation is located in the UTR-3, UTR-5, upstream, downstream, intronic, intergenic, or ncRNA region, the score is -1; if the mutation is located in the splicing region, the score is - 2.
- the annovar annotations in the above Score1 and Score2 scores can be performed in Ubuntu 18.04.2LTS using the following specific command line: able_annovar.pl all.vcf.avinput dir/humandb/-buildver hg19-out x --otherinfo -remove-protocol refGene,avsnp150,clinvar_20200316,gnomad_exome-operation g,f,f,f-nastring NA.
- the condition is: only keep the pLoF mutation sites of the Stop-gained, Splice site and Frameshift types that appear in the main transcript, and others do not appear in the main transcript but appear in the non-main transcript The pLoF site of the transcript is ignored.
- Stop-gained mutation refers to the mutation that changes the original amino acid into a stop codon, so that the protein transcription is terminated early;
- the Splice site refers to the mutation that will affect the normal transcription of the transcript, resulting in the transcription of the intron region, resulting in the entire protein.
- Frameshift refers to frameshift mutations, ie, sequences and linker sites at the junctional boundaries of introns and exons in RNA precursors that are recognized by the spliceosome. If there is a mutation at this position, there will be a problem with the splicing of the mRNA, and there will be a problem with the translation of the protein. Splicing is the process of cutting out the intron and retaining the exon during the transcription process. If it is a classic splice site GT-AG mutation, it is more serious and is a pathogenic variant.
- Score3 is scored according to the final pLoF site. When the mutation is a high-confidence pLoF, the score is +3; when the mutation is a low-confidence pLoF, the score is +2.
- Svar is the score of the mutation site
- Fq is the group frequency
- Pop is the total number of groups, which is 42868 in this embodiment.
- the inventors analyzed the dbSNP150, EXAC03 all-exon database and genomeAD all-exon database, counted the distance between the mutation sites recorded in the above databases, and obtained the distance between two adjacent mutation sites. For the size of the distance, it was found that with a window size of 25 bp, more than 95% of the mutation intervals could be covered, as shown in FIG. 2 . Therefore, the cutting window size was determined to be 25bp.
- Cutting was performed with a 25 bp window delayed from each mutation to obtain several fragments of the region including the mutation site.
- Stotal is the total score of the region segment
- N is the total mutation number of the region segment.
- the mutation site in the region segment is an artificially introduced error site, and the mutation is excluded.
- -418 is used as the screening threshold, which is less than -418 (that is, all fragments are screened according to the 5% score in the order of scores from small to large), and the positive results obtained (determined as artificially introduced error mutations), some of which are scored
- the artificially introduced error mutations with large and small negative values are shown in the table below.
- chromosome Location window The total number of mutations in the window total score chr9 35906583 chr9:35906583-35906607 428 -7026708.30830639 chr9 35906584 chr9:35906584-35906608 415 -6726549.28627591 chr9 35906585 chr9:35906585-35906609 414 -6704131.02405082 chr9 35906586 chr9:35906586-35906610 413 -6665636.51330758 chr9 35906580 chr9:35906580-35906604 401 -6078580.34869512 chr9 35906582 chr9:35906582-35906606 398 -5996092.36077493 chr9 35906581 chr9:35906581-35906605 398 -5995694.37824108 chr9 3590
- Each site corresponds to at least the window starting from its own position, and may correspond to other windows. If the site falls into any window that is judged to be an artificially introduced error, and the other window is not considered to be an artificially introduced error, then The conclusion that the wrong site was identified as artificially introduced remains unchanged.
- the chr9:35906583-35906607 window corresponding to the chr9:35906583 site, and its partial mutation status are shown in the table below.
- the chr9:32986030 site corresponds to the chr9:32986030-32986054 window, and its mutation status is shown in the table below and Figure 4.
- the verification result proves that the method of the present invention can remove most of the false mutations of artificial primers, leaving a large number of real mutations.
Landscapes
- Bioinformatics & Cheminformatics (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Physics & Mathematics (AREA)
- Engineering & Computer Science (AREA)
- Genetics & Genomics (AREA)
- Biotechnology (AREA)
- Biophysics (AREA)
- Chemical & Material Sciences (AREA)
- Molecular Biology (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Bioinformatics & Computational Biology (AREA)
- Analytical Chemistry (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Theoretical Computer Science (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
一种降低高通量测序中人工引入错误突变的方法及应用,属于生物信息学技术领域。该方法首先建立群体突变频率数据库,整合得到每个突变位点的群体突变频率,随后对各个窗口内的突变位点,按照预设的权重给定分数进行评分,得到每个突变位点的得分Svar后,考虑每个突变位点往后特定窗口大小内出现突变的次数,将这个区域内所有突变的分数相加,然后乘以这个窗口的突变数目,以达到放大的目的,从而突出存在人工引入错误突变的位点,将其识别后排除。从而可实现检出所有类型的人工引入错误突变,且具有效率高,自动化程度高,准确性高和检出全面的优势。
Description
本发明涉及生物信息学技术领域,特别是涉及一种降低高通量测序中人工引入错误突变的方法及应用。
NGS技术为基因检测提供了便利,尤其是全外显子测序,其在价格上比全基因测序更有优势,由于性价比高,全外测序在医学上得到越来越广泛的应用。
全外测序的主要作用是寻找致病的突变,为疾病的诊疗提供重要的数据。全外筛选的突变需经过重重过滤进行质控,虽然现在对于测序的质量控制已经很严格,但是由于PCR扩增过程中的非特异性扩增等因素,部分人工引入的错误(artefacts)通过硬质量控制仍然不能排除,这对疾病的诊断产生了巨大的影响。
现阶段,虽然已有一些方法去除人工错误,但是基本上都需要以整合基因组浏览(IGV)方式,由具有丰富经验的实验人员进行人工进一步的判断,效率很低,且准确性依赖于人员的主观经验,难以标准化处理。
目前有研究推出一种利用LOFTEE排除人工引入错误突变的方法,其主要通过一组过滤器,将提前终止(stop-gained)、移码突变(frameshift)和可变剪切突变(splicing)进行评估,得到低可信和高可信的预测功能缺失突变,但是其主要在于评估和排除上述几种类型的人工错误,对于其他可能存在错误的位点并没有覆盖到,因此依然可能漏掉许多的人工错误位点。
发明内容
基于此,有必要针对上述问题,提供一种降低高通量测序中人工引入错误突变的方法,采用该方法,可通过机器自动化识别,批量排除人工错误位点,提高测序准确性。
1)建立群体突变频率数据库:获取若干样本高通量测序数据,整合得到每个突变位点的群体突变频率Fq,以及群体总数Pop;
2)位点评分:获取待测样本高通量测序数据,获得每个突变位点的信息,将各突变位点按照下述公式进行评分:
Svar=(Fq×Pop×2)
max=M×(Score1+Score2+Score3)
其中:Svar为突变位点得分,Fq为群体突变频率,Pop为群体总数,M为预设最大值,Score1为突变类型分数,Score2为突变坐标分数,Score3为pLoF分数;
所述Score1的评分标准为:当突变为已知突变,不扣分;当突变为未知突变,扣分;
所述Score2的评分标准为:当突变为未知突变,则根据该突变所在refGene位置,按照预定规则扣分;
所述Score3的评分标准为:根据LOFTEE预测突变为功能缺失的可信度情况,按照预定规则扣分;
3)区域片段评分:从每个突变延后预定窗口大小划分,得到若干包括突变位点的序列片段;
按照下述公式对每个区域片段进行评分:
其中:Stotal为区域片段的总评分,N为该区域片段的总突变数;
4)位点排除:如区域片段的Stotal总评分低于预定阈值,则判定该区域片段中的突变位点为人工引入错误位点,排除该突变。
本发明人在长期实践工作的基础上,经过仔细分析、推论和求证后发现,对于人工引入的错误突变存在以下特征:插入缺失位点在捕获的测序区域两端大量聚集,导致了在某个区域内,不同的人出现不同插入缺失,导致种群中出现大量的INDEL聚集现象。以此前期研究基础出发,本发明人提出通过特定长度的窗口对上述特征进行权重评分并加权得到窗口区域片段的分数,再通过置信区间的筛选,批量排除人工错误位点的方法。
上述方法中,首先建立群体突变频率频率数据库,整合得到每个突变位点的突变频率,以及群体总数Pop,采用发明人设计的公式对突变位点进行评分,再按照预定窗口大小对序列进行划分,得到若干区域片段,对各区域片段中的突变位点得分进行整合,考虑每个突变位点往后特定窗口大小内出现突变的次数,将这个区域内所有突变的分数相加,然后乘以这个窗口的突变数目,以达到放大的目的,从而突出存在人工引入错误突变的位点,将其识别后排除。从而可实现检出所有类型的人工引入错误突变的目的,且具有效率高,自动化程度高,准确性高和检出全面的优势。
在其中一个实施例中,所述建立群体突变频率数据库步骤中,将每个突变位点按照染色体和基因组位置顺序进行排序。
在其中一个实施例中,所述位点评分步骤中,所述Score1的评分标准为:如该突变位点在dbSNP、ClinVar或gnomAD-exome数据库有注释,则判定该位点的突变为已知突变,不扣分;否则判定突变为未知突变,扣分;
所述Score2的评分标准为:当突变为未知突变,则根据该突变所在区域为refGene的exonic、splicing、UTR-3、UTR-5、upstream、downstream、intronic、intergenic或ncRNA区位置,按照预定规则扣分;
所述Score3的评分标准为:根据LOFTEE预测为突变为功能缺失(pLoF)高可信度位点或低可信位点,按照预定规则扣分。
可以理解的,上述dbSNP全称为the single nucleotide polymorphism database,即单核苷酸多态性数据库,意思是“dna序列中的单一碱基对(base pair)变异”,也就是dna序列中a、t、c、g的改变,即基因组的一个特异和定位的位点出现两个或多个的核苷酸可能性,它是人类可遗传的变异中最常见的一种,因此,按照每个突变位点所在的区域及突变类型,将已报道为本领域所知悉的已知突变进行标注,有利于区分该突变是否为人工引入错误。上述dbSNP数据库优选dbSNP150。上述LOFTEE预测突变为功能缺失的pLoF突变为通过人工处理基因组聚合数据库(gnomAD)中单倍体疾病基因中的假定功能丧失(pLoF)突变,区分该突变类型,也有利于后续进行分析参考。
在其中一个实施例中,所述位点评分步骤中:
所述Score1的评分标准中,当突变为已知突变,则评分为0;当突变为未知SNV,则评分为-1;当突变为未知INDEL,则评分为-5;
所述Score2的评分标准中,当突变位于exonic区,则评分为0;当突变位于UTR-3、UTR-5、upstream、downstream、intronic、intergenic或ncRNA区,则评分为-1;当突变位于splicing区,则评分为-2;
所述Score3的评分标准中,当突变为高可信的pLoF,则评分为+3;当突变为低可信的pLoF,则评分为+2。
上述评分权重中,针对人工错误的主要特征(假阳性的插入缺失位点在捕获的测序区域两端大量聚集),因此在特定窗口范围内,最大的权重应该是插入缺失(INDEL),因此在窗口检测出未知INDEL的权重最大,其他影响因素还包括:SNP,是否为已知突变,突变的类型,是否为LOEFF注释的pLOF突变等。经过测试,我们将各因素的权重按照上述方式定义,能够较好识别出人工引入错误突变。
在其中一个实施例中,所述位点评分步骤中:M=100。
可以理解的,上述已知突变即为已有报道存在的突变。而对于群体频率,由于在实践中,很难得到某个位点突变的具体人数,因此一般用频率衡量这个位点的出现次数。当频率×人群×2大于100的时候,代表这个位点已经很常见,因此我们设定了一个最大的值,即M=100。
在其中一个实施例中,所述位点评分步骤中:当突变位点的Svar分数大于0,则将此位点得分定义为0。考虑到该方法中对于突变位点主要是采用扣分制,因此当突变位点的Svar分数大于0,则不扣分,定义为0,可更为准确评估区域片段中是否存在人工引入错误突变。
在其中一个实施例中,所述refGene位置根据比对到NCBI refGene主转录本区域确定,所述pLoF仅保留出现在主转录本中Stop-gained、Splice site和Frameshift类型的pLoF突变位点。
在其中一个实施例中,所述主转录本为NCBI refGene中最近更新的转录本,即距离当前时间最短的转录本。
在其中一个实施例中,所述区域片段评分步骤中,所述窗口大小通过以下方法得到:以覆盖数据库中95±5%相邻突变位点的区间长度为窗口大小。上述相邻突变位点中的相邻指两两间的区间长度,以能够覆盖绝大部分两两间突变位点区间的序列长度为窗口大小,可将人 工引入错误突变正确识别。
在其中一个实施例中,所述数据库为dbSNP数据库、ECAC03全外显子数据库和/或genomeAD全外显子数据库。
在其中一个实施例中,所述窗口大小为25±5bp。
本发明人在前期研究中,对dbSNP150,EXAC03全外显子数据库和genomeAD的全外显子数据库进行分析比较后发现,以25bp的窗口大小,可以覆盖超过95%的突变间隔。
在其中一个实施例中,所述位点排除步骤中,所述阈值为该待测样本中各区域片段按照Stotal得分由低至高排序的前5%所对应序列片段的Stotal得分。可以理解的,上述阈值也可以根据实践需要,设定某个确定的分数阈值。
本发明还公开了上述的降低高通量测序中人工引入错误突变的方法在高通量测序中的应用。
在其中一个实施例中,所述高通量测序为全外显子测序。
将上述的降低高通量测序中人工引入错误突变的方法应用于高通量测序中,特别是全外显子的检测中,能够降低人工引入突变错误,为疾病的诊疗提供准确数据,进一步提高了个性化精准医疗的可靠性。
本发明还公开了上述的方法在疾病诊疗全外显子突变筛查中的应用。
本发明还公开了一种降低高通量测序中人工引入错误突变的装置,其特征在于,包括:
分析模块:用于按照上述的降低高通量测序中人工引入错误突变的方法,对待测样本高通量测序数据的突变数据进行分析。
可以理解的,上述装置可按照上述方法进行分析,具体呈现产品形式包括一体化设备,软件模块封装等。
与现有技术相比,本发明具有以下有益效果:
本发明的一种降低高通量测序中人工引入错误突变的方法,不仅仅针对低频及有害突变的人工错误进行识别排除,对于一般的人工引入错误突变位点也能覆盖到,具有全面、准确的优势。
并且,本发明的方法,可通过计算机自动化分析处理,克服了人工筛查存在的效率低、经验依赖性强的缺陷,适宜广泛的推广应用。
图1为实施例中降低高通量测序中人工引入错误突变的方法流程示意图;
图2为实施例中不同窗口长度所对应Variants间隔示意图;
图3为实施例中chr9:35906583位点的IGV可视化结果;
图4为实施例中chr9:32986030位点的IGV可视化结果。
为了便于理解本发明,下面将参照相关附图对本发明进行更全面的描述。附图中给出了 本发明的较佳实施例。但是,本发明可以以许多不同的形式来实现,并不限于本文所描述的实施例。相反地,提供这些实施例的目的是使对本发明的公开内容的理解更加透彻全面。
除非另有定义,本文所使用的所有的技术和科学术语与属于本发明的技术领域的技术人员通常理解的含义相同。本文中在本发明的说明书中所使用的术语只是为了描述具体的实施例的目的,不是旨在于限制本发明。本文所使用的术语“和/或”包括一个或多个相关的所列项目的任意的和所有的组合。
实施例
一种降低高通量测序中人工引入错误突变的方法,流程如图1所示,包括以下步骤:
1、建立群体突变频率数据库(S1)
获取若干样本高通量测序数据,整合得到每个突变位点的突变频率,具体如下:
1.1数据准备。
获取若干全外测序样本数据(本实施例中选择42868例样本),得到其中所有位点的vcf合并文件,每个突变位点信息包括:位点所在染色体、具体位置、参考基因组碱基,突变的碱基类型等,并按照染色体和基因组位置进行排序。
1.2群体突变频率计算。
群体突变频率是计算的本发明突变位点评估分数的重要指标,需要通过计算得到。具体为:将上述突变位点信息数据输入为annovar的标准格式,即在vcf格式文件中加入频率数据,部分展示如下表。
表1.群体突变及频率输入文件格式列表
染色体 | 位置 | 参考碱基 | 突变碱基 | 等位基因频率 | 信息 | 纯合/杂合 |
chr1 | 14653 | C | T | 2.33263e-05 | GT | 1/1 |
chr1 | 14907 | A | G | 0.000128295 | GT | 1/1 |
chr1 | 14930 | A | G | 0.000139958 | GT | 1/1 |
chr1 | 14932 | G | T | 2.33263e-05 | GT | 1/1 |
chr1 | 14933 | G | A | 1.16632e-05 | GT | 1/1 |
chr1 | 15903 | G | GC | 6.9979e-05 | GT | 1/1 |
chr1 | 69331 | C | T | 1.16632e-05 | GT | 1/1 |
chr1 | 69336 | C | T | 1.16632e-05 | GT | 1/1 |
chr1 | 69460 | C | T | 1.16632e-05 | GT | 1/1 |
chr1 | 69462 | C | G | 2.33263e-05 | GT | 1/1 |
chr1 | 69474 | T | G | 1.16632e-05 | GT | 1/1 |
chr1 | 69486 | C | A | 2.33263e-05 | GT | 1/1 |
chr1 | 69486 | C | T | 2.33263e-05 | GT | 1/1 |
chr1 | 69491 | G | A | 1.16632e-05 | GT | 1/1 |
chr1 | 69495 | C | T | 1.16632e-05 | GT | 1/1 |
chr1 | 69496 | G | A | 1.16632e-05 | GT | 1/1 |
chr1 | 69496 | G | T | 8.16422e-05 | GT | 1/1 |
chr1 | 69510 | C | T | 1.16632e-05 | GT | 1/1 |
chr1 | 69511 | A | G | 0.942186 | GT | 1/1 |
chr1 | 69513 | A | G | 0.000606485 | GT | 1/1 |
chr1 | 69521 | T | C | 8.16422e-05 | GT | 1/1 |
chr1 | 69522 | T | C | 1.16632e-05 | GT | 1/1 |
chr1 | 69534 | T | C | 0.00271752 | GT | 1/1 |
chr1 | 69543 | C | A | 1.16632e-05 | GT | 1/1 |
chr1 | 69550 | G | A | 9.33053e-05 | GT | 1/1 |
chr1 | 69550 | G | T | 1.16632e-05 | GT | 1/1 |
chr1 | 69552 | G | A | 1.16632e-05 | GT | 1/1 |
chr1 | 69555 | T | G | 1.16632e-05 | GT | 1/1 |
chr1 | 69559 | G | A | 0.000314906 | GT | 1/1 |
chr1 | 69563 | A | C | 6.9979e-05 | GT | 1/1 |
其中:染色体和位置代表了该突变在基因组的位置定位,参考碱基为该位置上参考基因组hg19的碱基类型,突变碱基是本次测序检测出与参考基因组不同的碱基类型,等位基因频率即人群中该突变的频率,即该突变出现的次数除以(人群总人数乘以2);信息即GT代表Genetype的缩写;1/1代表纯合、0/1代表杂合。
随后使用annovar软件自带的convert2annovar.pl-format vcf4功能模块,将vcf格式转换成annovar的标准格式,部分示例如下表。
表2.突变及频率数据注释前输入文件格式
染色体 | 起始位置 | 终止位置 | 参考碱基 | 突变碱基 | 纯合/杂合类型 | 频率 |
chr1 | 14653 | 14653 | C | T | het | 2.33263e-05 |
chr1 | 14907 | 14907 | A | G | het | 0.000128295 |
chr1 | 14930 | 14930 | A | G | het | 0.000139958 |
chr1 | 14932 | 14932 | G | T | het | 2.33263e-05 |
chr1 | 14933 | 14933 | G | A | het | 1.16632e-05 |
chr1 | 15903 | 15903 | - | C | het | 6.9979e-05 |
chr1 | 69331 | 69331 | C | T | het | 1.16632e-05 |
chr1 | 69336 | 69336 | C | T | het | 1.16632e-05 |
chr1 | 69460 | 69460 | C | T | het | 1.16632e-05 |
chr1 | 69462 | 69462 | C | G | het | 2.33263e-05 |
chr1 | 69474 | 69474 | T | G | het | 1.16632e-05 |
chr1 | 69486 | 69486 | C | A | het | 2.33263e-05 |
chr1 | 69486 | 69486 | C | T | het | 2.33263e-05 |
chr1 | 69491 | 69491 | G | A | het | 1.16632e-05 |
chr1 | 69495 | 69495 | C | T | het | 1.16632e-05 |
chr1 | 69496 | 69496 | G | A | het | 1.16632e-05 |
chr1 | 69496 | 69496 | G | T | het | 8.16422e-05 |
chr1 | 69510 | 69510 | C | T | het | 1.16632e-05 |
chr1 | 69511 | 69511 | A | G | het | 0.942186 |
chr1 | 69513 | 69513 | A | G | het | 0.000606485 |
chr1 | 69521 | 69521 | T | C | het | 8.16422e-05 |
chr1 | 69522 | 69522 | T | C | het | 1.16632e-05 |
chr1 | 69534 | 69534 | T | C | het | 0.00271752 |
chr1 | 69543 | 69543 | C | A | het | 1.16632e-05 |
chr1 | 69550 | 69550 | G | A | het | 9.33053e-05 |
chr1 | 69550 | 69550 | G | T | het | 1.16632e-05 |
chr1 | 69552 | 69552 | G | A | het | 1.16632e-05 |
chr1 | 69555 | 69555 | T | G | het | 1.16632e-05 |
chr1 | 69559 | 69559 | G | A | het | 0.000314906 |
chr1 | 69563 | 69563 | A | C | het | 6.9979e-05 |
2、位点评分(S2)
获取待测样本高通量测序数据,获得每个突变位点的信息,将各突变位点按照下述公式进行评分:
Svar=(Fq×Pop×2)
max=M×(Score1+Score2+Score3)
其中:Svar为突变位点得分,Fq为群体频率,Pop为群体总数,M为预设最大值,Score1为突变类型分数,Score2为突变坐标分数,Score3为pLoF的分数。
具体步骤如下:
2.1 Score1评分
对上述带有频率的人群突变数据进行annovar注释,注释依据源于dbSNP150,ClinVar数据库和gnomAD-exome数据库。
如该突变位点在dbSNP150、ClinVar或gnomAD-exome数据库有注释,则判定该位点的突变为已知突变,则评分为0;当突变为未知SNV(single nucleotide variants),则评分为-1;当突变为未知INDEL(insertion-deletion),则评分为-5。
2.2 Score2评分
根据Score1评分结果,当突变为已知突变,Score2不评分,当突变为未知突变,则根据该突变所在区域位于refGene的不同区域评分。如:当突变位于exonic区,则评分为0;突变位于UTR-3、UTR-5、upstream、downstream、intronic、intergenic、或ncRNA区,则评分为-1;突变位于splicing区,则评分为-2。
在本实施例中,上述Score1和Score2评分中的annovar注释可在Ubuntu 18.04.2LTS中采用如下具体命令行进行:able_annovar.pl all.vcf.avinput dir/humandb/-buildver hg19-out x--otherinfo-remove-protocol refGene,avsnp150,clinvar_20200316,gnomad_exome-operation g,f,f,f-nastring NA。
2.3 Score3评分
标注单倍体疾病基因中的预测功能丧失pLoF突变,并对得到的pLoF突变数据进行筛选,根据该基因的主要转录本(即NCBI refGene中的基因最近更新的转录本)对pLoF位点进行annovar注释,包括pLoF位点位于基因组的位置,位于基因的具体转录本,所属变异类型,可能存在的氨基酸改变等信息。
Annovar部分注释结果如下表所示。
表3.annovar注释结果(refGene注释)
表4.annovar注释结果(dbSNP150注释结果)
染色体 | 起始位置 | 终止位置 | 参考碱基 | 突变碱基 | dbSNP150注释结果 |
Chr | Start | End | Ref | Alt | avsnp150 |
chr1 | 14653 | 14653 | C | T | rs62635297 |
chr1 | 14907 | 14907 | A | G | rs6682375 |
chr1 | 14930 | 14930 | A | G | rs6682385 |
chr1 | 14932 | 14932 | G | T | NA |
chr1 | 14933 | 14933 | G | A | rs199856693 |
chr1 | 15903 | 15903 | - | C | rs557514207 |
chr1 | 69331 | 69331 | C | T | NA |
chr1 | 69336 | 69336 | C | T | NA |
chr1 | 69460 | 69460 | C | T | NA |
chr1 | 69462 | 69462 | C | G | NA |
chr1 | 69474 | 69474 | T | G | rs752034042 |
chr1 | 69486 | 69486 | C | A | NA |
chr1 | 69486 | 69486 | C | T | rs548369610 |
chr1 | 69491 | 69491 | G | A | NA |
chr1 | 69495 | 69495 | C | T | NA |
chr1 | 69496 | 69496 | G | A | rs150690004 |
chr1 | 69496 | 69496 | G | T | NA |
chr1 | 69510 | 69510 | C | T | NA |
chr1 | 69511 | 69511 | A | G | rs2691305 |
chr1 | 69513 | 69513 | A | G | rs770590115 |
chr1 | 69521 | 69521 | T | C | rs553724620 |
chr1 | 69522 | 69522 | T | C | NA |
chr1 | 69534 | 69534 | T | C | rs190717287 |
chr1 | 69543 | 69543 | C | A | NA |
chr1 | 69550 | 69550 | G | A | NA |
chr1 | 69550 | 69550 | G | T | NA |
chr1 | 69552 | 69552 | G | A | rs2531266 |
chr1 | 69555 | 69555 | T | G | rs556374459 |
chr1 | 69559 | 69559 | G | A | rs754025211 |
表5.annovar注释结果(ClinVar注释结果)
表6.annovar注释结果(gnomAD_exome注释结果)
对于pLoF位点,根据上述注释信息进行筛选,条件为:仅保留出现在主转录本中Stop-gained、Splice site和Frameshift类型的pLoF突变位点,其他不出现在主转录本而出现在非主转录本的pLoF位点忽略。
上述Stop-gained突变指使原来的氨基酸变成终止密码子,使蛋白转录提前终止的突变;Splice site指会影响转录本的正常转录,导致内含子区域也可能会转录进来,导致了整个蛋白质的结构发生变化的突变;Frameshift指移码突变,即剪接体可识别的RNA前体中内含子和外显子连接边界的序列和接头位点。假如该位置发生突变,导致mRNA的剪切出问题,蛋白质翻译就会出问题,剪切是转录的过程中将内含子剪切掉,保留外显子的过程。如果是经典剪切位点GT-AG突变则比较严重,属于致病性变异。
经过上述筛选,保留pLoF突变位点如下表所示:
表7.pLoF列表
根据最终得到pLoF位点进行Score3评分,当突变为高可信的pLoF,则评分为+3;当突变为低可信的pLoF,则评分为+2。
2.4位点评分
获得Score1-Score3评分后,将上述突变位点按照下述公式进行评分:
Svar=(Fq×Pop×2)
max=M×(Score1+Score2+Score3)
其中:
Svar为突变位点得分;
Fq为群体频率;
Pop为群体总数,本实施例为42868;
M为预设最大值,本实施例中,M=100。
通过这个公式,我们可以得到每一个位点的分数Svar,当Svar分数大于0,则判定为0。
3、区域片段评分(S3)
3.1窗口大小确定。
本发明人在前期研究中,对dbSNP150,EXAC03全外显子数据库和genomeAD的全外显子数据库进行分析,统计上述数据库中记载突变位点之间的距离,获得相邻两突变位点之间距离大小,发现以25bp的窗口大小,可以覆盖超过95%的突变间隔,如图2所示。因此确定切割窗口大小为25bp。
从每个突变延后25bp窗口大小进行切割,得到若干包括突变位点的区域片段。
3.2序列(区域片段)评分
按照下述公式对每个区域片段进行评分:
其中:Stotal为区域片段的总评分,N为该区域片段的总突变数。
4、位点排除(S4)
如区域片段的Stotal总评分低于预定阈值,则判定该区域片段中的突变位点为人工引入错误位点,排除该突变。
具体的,将所有区域片段的Stotal总评分进行比较,可通过一个具体数字(如-100或者-50),也可以进行排序,取最低分数的前5%,这些窗口的突变则全为人口引入的错误突变。
本实施例中以-418作为筛选阈值,小于-418(即所有片段根据分数从小到大排序中的5%分数)进行筛选,得出的阳性结果(判定为人工引入错误突变),其中部分得分负值较大及负值较小的人工引入错误突变如下表所示。
表8.人工引入错误突变及其所对应的窗口、窗口总突变数及总分数
染色体 | 位置 | 所处窗口 | 窗口总突变数 | 总分数 |
chr9 | 35906583 | chr9:35906583-35906607 | 428 | -7026708.30830639 |
chr9 | 35906584 | chr9:35906584-35906608 | 415 | -6726549.28627591 |
chr9 | 35906585 | chr9:35906585-35906609 | 414 | -6704131.02405082 |
chr9 | 35906586 | chr9:35906586-35906610 | 413 | -6665636.51330758 |
chr9 | 35906580 | chr9:35906580-35906604 | 401 | -6078580.34869512 |
chr9 | 35906582 | chr9:35906582-35906606 | 398 | -5996092.36077493 |
chr9 | 35906581 | chr9:35906581-35906605 | 398 | -5995694.37824108 |
chr9 | 35906577 | chr9:35906577-35906601 | 388 | -5743785.43456742 |
chrX | 100603410 | chrX:100603410-100603434 | 171 | -5728052.73344433 |
chrX | 100603413 | chrX:100603413-100603437 | 170 | -5694215.36395906 |
chr9 | 35906587 | chr9:35906587-35906611 | 350 | -4800809.36996248 |
chr9 | 35906589 | chr9:35906589-35906613 | 347 | -4665279.90481323 |
chr9 | 35906588 | chr9:35906588-35906612 | 342 | -4574801.96584414 |
chr9 | 35906590 | chr9:35906590-35906614 | 341 | -4569608.93720737 |
chr9 | 35906592 | chr9:35906592-35906616 | 344 | -4437122.81440532 |
chr9 | 35906591 | chr9:35906591-35906615 | 340 | -4387228.27972579 |
chr9 | 35906576 | chr9:35906576-35906600 | 309 | -3402299.57000941 |
chr2 | 11727844 | chr2:11727844-11727868 | 126 | -3396924.90964996 |
chr2 | 11727843 | chr2:11727843-11727867 | 123 | -3379268.24098448 |
chr9 | 35906575 | chr9:35906575-35906599 | 305 | -3347277.37297846 |
chr2 | 11727854 | chr2:11727854-11727878 | 134 | -3326648.77447192 |
chr2 | 11727846 | chr2:11727846-11727870 | 134 | -3326380.78853077 |
chr18 | 77136193 | chr18:77136193-77136217 | 133 | -3311664.8494566 |
chr2 | 11727847 | chr2:11727847-11727871 | 133 | -3301291.06297629 |
chr2 | 11727856 | chr2:11727856-11727880 | 132 | -3275413.37200303 |
chr18 | 77136195 | chr18:77136195-77136219 | 134 | -3257772.50770687 |
chr2 | 11727857 | chr2:11727857-11727881 | 131 | -3249813.66882979 |
chr9 | 35906574 | chr9:35906574-35906598 | 298 | -3237378.0933243 |
chr2 | 11727858 | chr2:11727858-11727882 | 130 | -3224225.96512994 |
chr2 | 11727861 | chr2:11727861-11727885 | 129 | -3215677.48708922 |
chr18 | 77136197 | chr18:77136197-77136221 | 134 | -3178176.47012576 |
chr2 | 11727863 | chr2:11727863-11727887 | 127 | -3162266.19129363 |
chr9 | 35906571 | chr9:35906571-35906595 | 290 | -3121469.8819954 |
chr18 | 77136198 | chr18:77136198-77136222 | 133 | -3074658.73527408 |
chr9 | 70871928 | chr9:70871928-70871952 | 193 | -3048926.13846847 |
chr9 | 70871929 | chr9:70871929-70871953 | 192 | -3031976.64755712 |
chr18 | 77136191 | chr18:77136191-77136215 | 124 | -3012426.70434945 |
chr9 | 70871930 | chr9:70871930-70871954 | 190 | -2994693.82716957 |
chr2 | 11727864 | chr2:11727864-11727888 | 123 | -2979273.69614749 |
chr9 | 70871931 | chr9:70871931-70871955 | 188 | -2960162.87270482 |
chr2 | 11727865 | chr2:11727865-11727889 | 122 | -2954807.96948833 |
chr5 | 167945074 | chr5:167945074-167945098 | 200 | -2930730.88303328 |
chr18 | 77136190 | chr18:77136190-77136214 | 118 | -2861236.37222784 |
chr2 | 11727867 | chr2:11727867-11727891 | 117 | -2708755.82961834 |
chr11 | 117039153 | chr11:117039153-117039177 | 140 | -2604220.84949085 |
chr1 | 1247578 | chr1:1247578-1247602 | 280 | -2601686.59648442 |
chr1 | 1247579 | chr1:1247579-1247603 | 279 | -2587373.0933815 |
chr11 | 117039151 | chr11:117039151-117039175 | 139 | -2584785.31097787 |
chr11 | 117039154 | chr11:117039154-117039178 | 139 | -2584785.31097787 |
chr1 | 1247574 | chr1:1247574-1247598 | 278 | -2582547.14519519 |
chr9 | 32986030 | chr9:32986030-32986054 | 45 | -4409.795665296 |
注:每个位点均至少对应以自身位置为起点的窗口,以及可能对应其它窗口,如该位点落入任一判定为人工引入错误窗口,另一窗口未被认为属于人工引入错误,则认定为人工引入错 误位点的结论不变。
上述结果看出,本实施例从8245702个突变中筛选掉155338个人工引入错误位点,筛选率为1.88%;保留了8090364个位点,占比98.12%,在正常的筛选范围(大于95%)内。
5、验证
抽取上述判定为存在人工引入错误突变的2个窗口片段进行验证。
5.1片段一
chr9:35906583位点对应的chr9:35906583-35906607窗口,其部分突变情况如下表所示。
表9.chr9:35906583位点对应的窗口(chr9:35906583-35906607)突变情况
该窗口有428个突变位点(上表示例性展示了部分位点35906583-35906589),根据dbSNP、ClinVar和gnomAD-exome三大数据库判断,95%25bp长的窗口出现突变位点的数量不超过18个,该窗口远超过此数量,提示存在人工引入错误;且这些突变位点除了存在几 个SNV外,其他均为INDEL,而INDEL的准确性不高,如此多的INDEL在同一区域内,其准确性更是大大降低。该位点所在窗口的IGV可视化结果见图3,图3为2个样本在该区域的reads比对情况,窗口长度为25bp,可以看出该区域的质量不佳,GC含量过高,可信程度不够。因此判断为人工引入的错误。与本发明的方法判断结果相吻合。
5.2片段二
再如chr9:32986030位点对应chr9:32986030-32986054窗口,其突变情况如下表和图4所示。
表10.chr9:32986030位点对应的窗口(chr9:32986030-32986054)突变情况
染色体 | 起始位置 | 终止位置 | 参考碱基 | 突变碱基 |
chr9 | 32986030 | 32986030 | - | A |
chr9 | 32986030 | 32986030 | - | AAAAAAAAACAA |
chr9 | 32986030 | 32986030 | T | - |
chr9 | 32986030 | 32986030 | T | A |
chr9 | 32986030 | 32986030 | T | C |
chr9 | 32986031 | 32986031 | A | - |
chr9 | 32986031 | 32986032 | AA | - |
chr9 | 32986031 | 32986043 | AAAAAAAAAACAA | - |
chr9 | 32986031 | 32986053 | AAAAAAAAAACAAAAAAAAAAAC | - |
chr9 | 32986032 | 32986053 | AAAAAAAAACAAAAAAAAAAAC | - |
chr9 | 32986033 | 32986053 | AAAAAAAACAAAAAAAAAAAC | - |
chr9 | 32986034 | 32986053 | AAAAAAACAAAAAAAAAAAC | - |
chr9 | 32986035 | 32986053 | AAAAAACAAAAAAAAAAAC | - |
chr9 | 32986036 | 32986053 | AAAAACAAAAAAAAAAAC | - |
chr9 | 32986037 | 32986037 | - | AAC |
chr9 | 32986038 | 32986038 | - | AC |
chr9 | 32986038 | 32986053 | AAACAAAAAAAAAAAC | - |
chr9 | 32986039 | 32986039 | - | C |
chr9 | 32986039 | 32986039 | A | C |
chr9 | 32986040 | 32986040 | A | C |
chr9 | 32986041 | 32986041 | - | A |
chr9 | 32986041 | 32986041 | C | - |
chr9 | 32986041 | 32986041 | C | A |
chr9 | 32986042 | 32986056 | AAAAAAAAAAACAAA | - |
chr9 | 32986044 | 32986044 | - | C |
chr9 | 32986044 | 32986056 | AAAAAAAAACAAA | - |
chr9 | 32986045 | 32986045 | - | C |
chr9 | 32986045 | 32986045 | A | C |
chr9 | 32986046 | 32986046 | - | C |
chr9 | 32986046 | 32986046 | A | C |
chr9 | 32986047 | 32986047 | - | C |
chr9 | 32986047 | 32986047 | A | C |
chr9 | 32986048 | 32986048 | - | C |
chr9 | 32986048 | 32986048 | A | C |
chr9 | 32986049 | 32986049 | - | C |
chr9 | 32986049 | 32986049 | A | C |
chr9 | 32986050 | 32986050 | - | C |
chr9 | 32986050 | 32986050 | A | T |
chr9 | 32986052 | 32986052 | - | AC |
chr9 | 32986052 | 32986053 | AC | - |
chr9 | 32986053 | 32986053 | C | - |
chr9 | 32986053 | 32986053 | C | A |
chr9 | 32986054 | 32986054 | - | C |
chr9 | 32986054 | 32986054 | A | C |
chr9 | 32986054 | 32986057 | AAAA | - |
由上述结果可知,该窗口有45个突变位点,根据dbSNP、ClinVar和gnomAD-exome三大数据库判断,95%的25bp长窗口出现突变位点的数量不超过18个,该窗口超过此数量;且这些位点非常零散,相隔非常近,ploy A特征显著,非常容易出错;该位点iGV可视化结果见图4,图4为2个样本在该区域的reads比对情况,窗口长度为25bp,可以看出该区域的质量不佳,可信程度不够,因此判断为人工引入的错误。与本发明的方法判断结果相吻合。
验证结果证实,本发明的方法可除去绝大部分的人工引物错误突变,留下大量真实的突变。
以上所述实施例的各技术特征可以进行任意的组合,为使描述简洁,未对上述实施例中的各个技术特征所有可能的组合都进行描述,然而,只要这些技术特征的组合不存在矛盾,都应当认为是本说明书记载的范围。
以上所述实施例仅表达了本发明的几种实施方式,其描述较为具体和详细,但并不能因此而理解为对发明专利范围的限制。应当指出的是,对于本领域的普通技术人员来说,在不脱离本发明构思的前提下,还可以做出若干变形和改进,这些都属于本发明的保护范围。因此,本发明专利的保护范围应以所附权利要求为准。
Claims (16)
- 一种降低高通量测序中人工引入错误突变的方法,其特征在于,包括以下步骤:1)建立群体突变频率数据库:获取若干样本高通量测序数据,整合得到每个突变位点的群体突变频率Fq,以及群体总数Pop;2)位点评分:获取待测样本高通量测序数据,获得每个突变位点的信息,将各突变位点按照下述公式进行评分:Svar=(Fq×Pop×2) max=M×(Score1+Score2+Score3)其中:Svar为突变位点得分,Fq为群体突变频率,Pop为群体总数,M为预设最大值,Score1为突变类型分数,Score2为突变坐标分数,Score3为pLoF分数;所述Score1的评分标准为:当突变为已知突变,不扣分;当突变为未知突变,扣分;所述Score2的评分标准为:当突变为未知突变,则根据该突变所在refGene位置,按照预定规则扣分;所述Score3的评分标准为:根据LOFTEE预测突变为功能缺失的可信度情况,按照预定规则扣分;3)区域片段评分:从每个突变延后预定窗口大小划分,得到若干包括突变位点的区域片段;按照下述公式对每个区域片段进行评分:其中:Stotal为区域片段的总评分,N为该区域片段的总突变数;4)位点排除:如区域片段的Stotal总评分低于预定阈值,则判定该区域片段中的突变位点为人工引入错误位点,排除该突变。
- 根据权利要求1所述的降低高通量测序中人工引入错误突变的方法,其特征在于,所述建立群体突变频率数据库步骤中,将每个突变位点按照染色体和基因组位置顺序进行排序。
- 根据权利要求1所述的降低高通量测序中人工引入错误突变的方法,其特征在于,所述位点评分步骤中,所述Score1的评分标准为:如该突变位点在dbSNP、ClinVar或gnomAD-exome数据库有注释,则判定该位点的突变为已知突变,不扣分;否则判定突变为未知突变,扣分;所述Score2的评分标准为:当突变为未知突变,则根据该突变所在区域为refGene的exonic、splicing、UTR-3、UTR-5、upstream、downstream、intronic、intergenic或ncRNA区位置,按照预定规则扣分;所述Score3的评分标准为:根据LOFTEE预测为突变为功能缺失高可信度位点或低可信位点,按照预定规则扣分。
- 根据权利要求3所述的降低高通量测序中人工引入错误突变的方法,其特征在于,所述位点评分步骤中:所述Score1的评分标准中,当突变为已知突变,则评分为0;当突变为未知SNV,则评分为-1;当突变为未知INDEL,则评分为-5;所述Score2的评分标准中,当突变位于exonic区,则评分为0;当突变位于UTR-3、UTR-5、upstream、downstream、intronic、intergenic或ncRNA区,则评分为-1;当突变位于splicing区,则评分为-2;所述Score3的评分标准中,当突变为高可信的pLoF,则评分为+3;当突变为低可信的pLoF,则评分为+2。
- 根据权利要求3所述的降低高通量测序中人工引入错误突变的方法,其特征在于,所述位点评分步骤中:M=100。
- 根据权利要求3所述的降低高通量测序中人工引入错误突变的方法,其特征在于,所述位点评分步骤中:当突变位点的Svar分数大于0,则将此位点得分定义为0。
- 根据权利要求1所述的降低高通量测序中人工引入错误突变的方法,其特征在于,所述refGene位置根据比对到NCBI refGene主转录本区域确定,所述pLoF仅保留出现在主转录本中Stop-gained、Splice site和Frameshift类型的pLoF突变位点。
- 根据权利要求7所述的降低高通量测序中人工引入错误突变的方法,其特征在于,所述主转录本为NCBI refGene中最近更新的转录本。
- 根据权利要求1所述的降低高通量测序中人工引入错误突变的方法,其特征在于,所述区域片段评分步骤中,所述窗口大小通过以下方法得到:以覆盖数据库中95±5%相邻突变位点的区间长度为窗口大小。
- 根据权利要求9所述的降低高通量测序中人工引入错误突变的方法,其特征在于,所述数据库为dbSNP数据库、ECAC03全外显子数据库和/或genomeAD全外显子数据库。
- 根据权利要求9所述的降低高通量测序中人工引入错误突变的方法,其特征在于,所述窗口大小为25±5bp。
- 根据权利要求1所述的降低高通量测序中人工引入错误突变的方法,其特征在于,所述位点排除步骤中,所述阈值为该待测样本中各区域片段按照Stotal得分由低至高排序的 前5%所对应区域片段的Stotal得分。
- 权利要求1-12任一项所述的降低高通量测序中人工引入错误突变的方法在高通量测序中的应用。
- 根据权利要求13所述的应用,其特征在于,所述高通量测序为全外显子测序。
- 权利要求1-12任一项所述的方法在疾病诊疗全外显子突变筛查中的应用。
- 一种降低高通量测序中人工引入错误突变的装置,其特征在于,包括:分析模块:用于按照权利要求1-12任一项所述的降低高通量测序中人工引入错误突变的方法,对待测样本高通量测序数据的突变数据进行分析。
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2021/101184 WO2022266790A1 (zh) | 2021-06-21 | 2021-06-21 | 降低高通量测序中人工引入错误突变的方法及应用 |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2021/101184 WO2022266790A1 (zh) | 2021-06-21 | 2021-06-21 | 降低高通量测序中人工引入错误突变的方法及应用 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022266790A1 true WO2022266790A1 (zh) | 2022-12-29 |
Family
ID=84545024
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/101184 WO2022266790A1 (zh) | 2021-06-21 | 2021-06-21 | 降低高通量测序中人工引入错误突变的方法及应用 |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2022266790A1 (zh) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107229841A (zh) * | 2017-05-24 | 2017-10-03 | 重庆金域医学检验所有限公司 | 一种基因变异评估方法及系统 |
CN107423578A (zh) * | 2017-03-02 | 2017-12-01 | 北京诺禾致源科技股份有限公司 | 检测体细胞突变的装置 |
CN112233725A (zh) * | 2020-10-14 | 2021-01-15 | 合肥达徽基因科技有限公司 | Atp7b基因突变二代测序自动化分析解读方法和报告系统 |
-
2021
- 2021-06-21 WO PCT/CN2021/101184 patent/WO2022266790A1/zh unknown
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107423578A (zh) * | 2017-03-02 | 2017-12-01 | 北京诺禾致源科技股份有限公司 | 检测体细胞突变的装置 |
CN107229841A (zh) * | 2017-05-24 | 2017-10-03 | 重庆金域医学检验所有限公司 | 一种基因变异评估方法及系统 |
CN112233725A (zh) * | 2020-10-14 | 2021-01-15 | 合肥达徽基因科技有限公司 | Atp7b基因突变二代测序自动化分析解读方法和报告系统 |
Non-Patent Citations (3)
Title |
---|
KARCZEWSKI, K. J. ET AL.: "The mutational constraint spectrum quantified from variation in 141, 456 humans", NATURE, vol. 581, 27 May 2020 (2020-05-27), XP037533562, DOI: 10.1038/s41586-020-2308-7 * |
OVED JOSEPH H., MICHELE P LAMBERT, M ANNA KOWALSKA, MORTIMER PONCZ, KONRAD J KARCZEWSKI : "Population based frequency of naturally occurring loss-of-function", J. THROMB. HAEMOST., vol. 19, no. 1, 23 October 2020 (2020-10-23), pages 248 - 254, XP093016976, DOI: 10.1111/jth.15113 * |
SUE RICHARDS, NAZNEEN AZIZ, SHERRI BALE, DAVID BICK, SOMA DAS, JULIE GASTIER-FOSTER, WAYNE W. GRODY, MADHURI HEGDE, ELAINE LYON, E: "Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology", GENETICS IN MEDICINE, NATURE PUBLISHING GROUP US, NEW YORK, vol. 17, no. 5, 1 May 2015 (2015-05-01), New York, pages 405 - 423, XP055331624, ISSN: 1098-3600, DOI: 10.1038/gim.2015.30 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109033749B (zh) | 一种肿瘤突变负荷检测方法、装置和存储介质 | |
Shimane et al. | An association analysis of HLA-DRB1 with systemic lupus erythematosus and rheumatoid arthritis in a Japanese population: effects of* 09: 01 allele on disease phenotypes | |
CN109658983B (zh) | 一种识别和消除核酸变异检测中假阳性的方法和装置 | |
CN109767810B (zh) | 高通量测序数据分析方法及装置 | |
CN108624650B (zh) | 判断实体瘤是否适合免疫治疗的方法和检测试剂盒 | |
CN104232777A (zh) | 同时确定胎儿核酸含量和染色体非整倍性的方法及装置 | |
AU2016355983B2 (en) | Methods for detecting copy-number variations in next-generation sequencing | |
WO2021232388A1 (zh) | 确定胚胎细胞染色体中预定位点碱基类型的方法及其应用 | |
Wang et al. | Plasma cell-free RNA characteristics in COVID-19 patients | |
US20160078168A1 (en) | Fusion transcript detection methods and fusion transcripts identified thereby | |
Verhagen et al. | A disease-associated microRNA cluster links inflammatory pathways and an altered composition of leukocyte subsets to noninfectious uveitis | |
CN111139291A (zh) | 一种单基因遗传性疾病高通量测序分析方法 | |
Sembler‐Møller et al. | Distinct microRNA expression profiles in saliva and salivary gland tissue differentiate patients with primary Sjögren's syndrome from non‐Sjögren's sicca patients | |
Chintalapati et al. | Using the Neandertal genome to study the evolution of small insertions and deletions in modern humans | |
CN111223525A (zh) | 一种肿瘤外显子测序数据分析方法 | |
CN110838341A (zh) | 一种ATAC-seq测序数据的生物信息分析方法 | |
WO2022266790A1 (zh) | 降低高通量测序中人工引入错误突变的方法及应用 | |
AU2020364225B2 (en) | Fragment size characterization of cell-free DNA mutations from clonal hematopoiesis | |
CN113470746B (zh) | 降低高通量测序中人工引入错误突变的方法及应用 | |
CN114990202B (zh) | Snp位点在评估基因组异常的应用及评估基因组异常的方法 | |
CN111128308A (zh) | 一种神经精神疾病新发突变信息知识平台 | |
CN116287204A (zh) | 检测特征基因的突变情况在制备静脉血栓栓塞症风险检测产品中的应用 | |
CN115171784A (zh) | 多基因遗传病症风险基因和风险突变的筛选方法及筛选系统 | |
Sabbagh et al. | Clinico-biological refinement of BCL11B-related disorder and identification of an episignature: A series of 20 unreported individuals | |
CN113764038A (zh) | 构建骨髓增生异常综合征转白基因预测模型的方法 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21946293 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |