WO2023207396A1 - Method for constructing a model for analyzing variant detection results - Google Patents

Method for constructing a model for analyzing variant detection results

Publication number
WO2023207396A1
Authority
WO
WIPO (PCT)
Prior art keywords
data set
positive
mutation
value
site
Prior art date
Application number
PCT/CN2023/081719
Other languages
English (en)
French (fr)
Inventor
唐飞
王中华
孙隽
彭智宇
Original Assignee
天津华大基因科技有限公司
天津华大医学检验所有限公司
Priority date
Filing date
Publication date
Application filed by 天津华大基因科技有限公司, 天津华大医学检验所有限公司
Publication of WO2023207396A1

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6844: Nucleic acid amplification reactions
    • C12Q1/6858: Allele-specific amplification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20: Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30: Detection of binding sites or motifs

Definitions

  • The present invention relates to the field of biology. Specifically, it relates to a method for constructing a model for analyzing variant detection results.
  • cNGS: clinical next-generation sequencing
  • ACMG: American College of Medical Genetics and Genomics
  • CAP: College of American Pathologists
  • Sanger sequencing has long been the main technology for the molecular diagnosis of genetic diseases. However, as the growth of public databases such as ClinVar and OMIM shows, the total number of clinically reported candidate variants is steadily increasing, which multiplies the cost and turnaround time of testing and makes confirming every variant increasingly impractical. The need for machine learning models, trained on large amounts of known data, that identify false-positive variants in cNGS data and reduce the need for orthogonal testing is therefore increasingly urgent.
  • The present invention aims to solve, at least to some extent, at least one of the technical problems in the prior art.
  • To this end, the present invention proposes a method for constructing a model for analyzing variant detection results.
  • The method includes: obtaining a positive sequencing data set of confirmed positive variant sites and a negative sequencing data set of negative variant sites; extracting features of the variant sites from the positive and negative sequencing data sets respectively; and constructing a model from the features obtained in the previous step; wherein the features include at least one of the following: AD0 value: the depth of the first allele in the genotype of the variant site; AD1 value: the depth of the second allele in the genotype of the variant site; AF0 value: the frequency of the first allele in the genotype of the variant site; AF1 value: the frequency of the second allele in the genotype of the variant site; GT value: a single numeric value (specifically 0, 1, 2, or 3); DP value: the sequencing depth; GQ value: the quality value of the variant site genotype; MQ value: the mapping quality of the variant site; QUAL value: the quality value of the variant site likelihood.
  • Variant detection and analysis software can generate dozens of feature parameters.
  • The inventors compared these feature parameters, screened out a set of them, and used them as attributes to build a machine learning model on data sets of confirmed positive and negative variant sites.
  • The resulting model can accurately predict whether a positive variant call is a false positive and can further report the genotype of the variant site, which helps locate likely variants faster and more precisely and reduces the cost and turnaround time of orthogonal experiments.
  • In another aspect, the invention proposes a method for analyzing variant detection results.
  • The method includes: obtaining a candidate positive variant data set; and analyzing the candidate positive variant data set with a machine learning model obtained by the aforementioned construction method, so as to predict whether the positive variant data in the candidate set are false positives and/or the genotype of the variant sites.
  • The method can accurately predict whether a positive variant call is a false positive and can also determine the genotype of the variant, which helps locate likely variants faster and more precisely and reduces the cost and turnaround time of orthogonal experiments.
  • In yet another aspect, the present invention proposes a device for constructing a model for analyzing variant detection results.
  • The device includes: an acquisition module adapted to acquire a positive sequencing data set of confirmed positive variant sites and a negative sequencing data set of negative variant sites; an extraction module adapted to extract features of the variant sites from the positive and negative sequencing data sets respectively; and a construction module adapted to construct a model from the features obtained by the extraction module; wherein the features include at least one of the following: AD0 value: the depth of the first allele in the genotype of the variant site; AD1 value: the depth of the second allele in the genotype of the variant site; AF0 value: the frequency of the first allele in the genotype of the variant site; AF1 value: the frequency of the second allele in the genotype of the variant site; GT value: a single numeric value; DP value: the sequencing depth; GQ value: the quality value of the variant site genotype; MQ value: the mapping quality of the variant site; QUAL value: the quality value of the variant site likelihood.
  • The model obtained with the device can accurately predict whether a positive variant call is a false positive and can also determine the genotype of the variant, which helps locate likely variants faster and more precisely and reduces the cost and turnaround time of orthogonal experiments.
  • In yet another aspect, the present invention provides an executable storage medium.
  • The storage medium stores computer program instructions.
  • When run on a processor, the computer program instructions cause the processor to perform the method of analyzing variant detection results described above. By executing the storage medium of the invention, it is therefore possible to accurately predict whether a positive variant call is a false positive and to determine the genotype of the variant, which helps locate likely variants faster and more precisely and reduces the cost and turnaround time of orthogonal experiments.
  • In yet another aspect, the invention provides an electronic device.
  • The electronic device includes: the aforementioned executable storage medium; and a processor configured to execute the computer program to implement the aforementioned method of analyzing variant detection results. By implementing the electronic device of the invention, it is therefore possible to accurately predict whether a positive variant call is a false positive and to determine the genotype of the variant, which helps locate likely variants faster and more precisely and reduces the cost and turnaround time of orthogonal experiments.
  • In one aspect, the present invention proposes a method for constructing a model for analyzing variant detection results.
  • The method includes: obtaining a positive sequencing data set of confirmed positive variant sites and a negative sequencing data set of negative variant sites; extracting features of the variant sites from the positive and negative sequencing data sets respectively; and constructing a model from the features obtained in the previous step; wherein the features include at least one of the following: AD0 value, AD1 value, AF0 value, AF1 value, GT value, DP value, GQ value, MQ value, and QUAL value.
  • Through extensive experiments, the inventors screened out the nine feature parameters above, all of which are feature parameters in the GATK software (their meanings are given in Table 1), and used them as feature attributes to run machine learning on the positive sequencing data set of confirmed positive variant sites and the negative sequencing data set of negative variant sites, obtaining a prediction model. The resulting model can accurately predict whether a positive variant call is a false positive, which helps locate likely variants faster and more precisely and reduces the cost and turnaround time of orthogonal experiments.
  • The positive sequencing data set of confirmed positive variant sites and the negative sequencing data set of negative variant sites are obtained as follows: obtaining a sequencing data set; using GATK software to compare the sequencing data set with reference data to obtain a candidate positive variant data set; and analyzing the candidate positive variant data set to obtain the positive and negative sequencing data sets.
  • First, clinical gene sequencing data are obtained and compared with reference data (including, for example, alignment, variant detection, annotation, and filtering operations); GATK is used to identify variants, yielding candidate positive variant data that are output as a VCF file.
  • The reference sequence is selected from the human genome hg19.
  • The analysis includes: subjecting the candidate positive variant data set to standard clinical interpretation to obtain a data set of potentially pathogenic variants; and performing orthogonal test analysis on that data set to obtain a positive sequencing data set of confirmed positive variant sites and a negative sequencing data set of negative variant sites, wherein the positive sequencing data set includes an SNV data set and an INDEL data set, each of which includes a homozygous genotype data set and a heterozygous genotype data set.
  • The term "standard clinical interpretation" refers to interpreting the pathogenicity of clinical variants with reference to the 2015 ACMG guidelines.
  • The invention does not strictly limit the method of orthogonal test analysis, provided it can determine whether a potentially pathogenic variant is a true positive or a false positive. Conventional techniques in the field can be used; see, for example, Sanger F. DNA sequencing with chain-terminating inhibitors. 1977 [J]. Biotechnology (Reading, Mass.), 24: 104-108.
  • The model is a random forest classification model, and the threshold is 0.95 ± 0.05.
  • Setting the threshold ensures sufficient accuracy and reduces chance errors.
  • With an adjustable threshold, accuracy can be traded off against the proportion of calls sent for orthogonal testing, provided sufficient accuracy is maintained.
  • The positive sequencing data set of confirmed positive variant sites and the negative sequencing data set of negative variant sites are each divided into a training set and a test set (3:1), a random forest classification model is selected, and the model with the highest accuracy after 5-fold cross-validation is chosen.
  • The method for constructing a model for analyzing variant detection results includes:
  • In another aspect, the invention proposes a method for analyzing variant detection results.
  • The method includes: obtaining a candidate positive variant data set; and analyzing the candidate positive variant data set with a machine learning model obtained by the aforementioned construction method, so as to predict whether the positive variant data in the candidate set are false positives and/or the genotype of the variant sites.
  • The model obtained by the method can accurately predict whether a candidate positive variant call is a false positive and can also determine the genotype of the variant, which helps locate likely variants faster and more precisely and reduces the cost and turnaround time of orthogonal experiments.
  • The candidate positive variant data set is obtained by: obtaining a sequencing data set; and using GATK software to compare the sequencing data set with reference data to obtain the candidate positive variant data set.
  • The model is a random forest classification model.
  • When the confidence of a candidate positive variant is lower than the threshold of the model, the candidate is subjected to orthogonal test analysis in order to determine whether it is a false positive.
  • Data below the threshold are called gray-zone data.
  • For gray-zone data the model's accuracy in predicting false positives is low, so orthogonal experimental verification is required for this part of the data in order to determine accurately whether it is a false positive.
  • In yet another aspect, the present invention proposes a device for constructing a model for analyzing variant detection results.
  • The device includes: an acquisition module adapted to acquire a positive sequencing data set of confirmed positive variant sites and a negative sequencing data set of negative variant sites; an extraction module adapted to extract features of the variant sites from the positive and negative sequencing data sets respectively; and a construction module adapted to construct a model from the features obtained by the extraction module; wherein the features include at least one of the following: AD0 value: the depth of the first allele in the genotype of the variant site; AD1 value: the depth of the second allele in the genotype of the variant site; AF0 value: the frequency of the first allele in the genotype of the variant site; AF1 value: the frequency of the second allele in the genotype of the variant site; GT value: a single numeric value; DP value: the sequencing depth; GQ value: the quality value of the variant site genotype; MQ value: the mapping quality of the variant site; QUAL value: the quality value of the variant site likelihood.
  • The model obtained with the device can accurately predict whether a positive variant call is a false positive and can also determine the genotype of the variant, which helps locate likely variants faster and more precisely and reduces the cost and turnaround time of orthogonal experiments.
  • The acquisition module includes: a sequencing data set acquisition module adapted to acquire a sequencing data set; a comparison processing module adapted to use GATK software to compare the sequencing data set with reference data to obtain a candidate positive variant data set; and an analysis and processing module adapted to analyze the candidate positive variant data set to obtain a positive sequencing data set of confirmed positive variant sites and a negative sequencing data set of negative variant sites.
  • The acquisition module can accurately determine the positive and negative variant site data in the sequencing data set and can also determine the genotype of the positive variant sites.
  • The analysis and processing module includes: a standard clinical interpretation module adapted to perform standard clinical interpretation of the positive variant data to obtain potentially pathogenic variant data; and an orthogonal test analysis module adapted to perform orthogonal test analysis on the potentially pathogenic variant data to obtain a positive sequencing data set of confirmed positive variant sites and a negative sequencing data set of negative variant sites.
  • In yet another aspect, the present invention provides an executable storage medium.
  • The storage medium stores computer program instructions.
  • When run on a processor, the computer program instructions cause the processor to perform the method of analyzing variant detection results described above. By executing the storage medium of the invention, it is therefore possible to accurately predict whether a positive variant call is a false positive and to determine the genotype of the variant, which helps locate likely variants faster and more precisely and reduces the cost and turnaround time of orthogonal experiments.
  • In yet another aspect, the invention provides an electronic device.
  • The electronic device includes: the aforementioned executable storage medium; and a processor configured to execute the computer program to implement the aforementioned method of analyzing variant detection results. By implementing the electronic device of the invention, it is therefore possible to accurately predict whether a positive variant call is a false positive and to determine the genotype of the variant, which helps locate likely variants faster and more precisely and reduces the cost and turnaround time of orthogonal experiments.
  • The VCF file went through the standard clinical interpretation process, which identified 7375 potentially pathogenic variants.
  • Orthogonal experimental verification showed that these variants comprise 5241 SNVs and 2134 INDELs. Among the SNVs there are 3226 Het genotypes, 63 Hom genotypes, and 1952 negative variants; among the INDELs there are 1606 Het genotypes, 138 Hom genotypes, and 390 negative variants.
  • The test set accuracies of the SNV and INDEL models are 94.8% and 93.8% respectively.
  • The accuracies for the different genotypes are shown in Table 3.
  • Given the accuracy that clinical data require, the method applies different thresholds (on the confidence of the random forest result) to the test set to obtain different accuracies and orthogonal experiment proportions (Table 4). Here the accuracy is the number of correct calls divided by the number of calls that meet the threshold, and the orthogonal experiment proportion is the number of calls below the threshold divided by the total number of test samples.
  • Among the thresholds that give sufficient accuracy, the one with the smallest orthogonal experiment proportion is selected as the target threshold. The final threshold is 0.95, adjustable within a range of ± 0.05.
  • In this specification, reference to the terms "one embodiment," "some embodiments," "an example," "specific examples," or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. The schematic use of these terms does not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples, and those skilled in the art may combine different embodiments or examples, and features of different embodiments or examples, described in this specification unless they contradict each other.

Landscapes

  • Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Chemical & Material Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Medical Informatics (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Analytical Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Software Systems (AREA)
  • Organic Chemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Wood Science & Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Zoology (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Microbiology (AREA)
  • Immunology (AREA)
  • Chemical Kinetics & Catalysis (AREA)
  • Biochemistry (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

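The present invention proposes a method for constructing a model for analyzing variant detection results. The method includes: obtaining a positive sequencing data set of confirmed positive variant sites and a negative sequencing data set of negative variant sites; extracting features of the variant sites from the positive and negative sequencing data sets respectively; and constructing a model from the features obtained in the previous step; wherein the features include at least one of the following: AD0 value, AD1 value, AF0 value, AF1 value, GT value, DP value, GQ value, MQ value, and QUAL value.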

Description

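Method for constructing a model for analyzing variant detection results
Technical Field
The present invention relates to the field of biology. Specifically, it relates to a method for constructing a model for analyzing variant detection results.
Background
Clinical next-generation sequencing (cNGS) is widely used to establish molecular diagnoses for patients with genetic diseases. However, known NGS pipelines are subject to random and systematic errors in the sequencing, alignment, and variant calling steps. Because reported variants affect patient care and treatment, the American College of Medical Genetics and Genomics (ACMG) and the College of American Pathologists (CAP) recommend orthogonal confirmation of reported variants to reduce the risk of false-positive results. Sanger sequencing has long been the main technology for the molecular diagnosis of genetic diseases. However, as the growth of public databases such as ClinVar and OMIM shows, the total number of clinically reported candidate variants is steadily increasing, which multiplies the cost and turnaround time of testing and makes confirming every variant increasingly impractical. The need for machine learning models, trained on large amounts of known data, that identify false-positive variants in cNGS data and reduce the need for orthogonal testing is therefore increasingly urgent.
Current research on false-positive variants has the following problems: orthogonal experiments such as Sanger sequencing add substantial cost and turnaround time; the features used by existing models are mostly Boolean flags, which lose information compared with unmodified quantitative metrics; the training sets of existing models contain relatively few false-positive variant calls, which may lead to wide confidence intervals for some false-positive capture rates (especially for SNVs); and, for cost reasons, existing models use too little clinical data, so they are either deliberately complex to cover many scenarios but lack confidence, or confident enough but at high risk of overfitting and applicable to too few scenarios.
Methods for predicting false-positive variants therefore still need further study.
Summary of the Invention
The present invention aims to solve, at least to some extent, at least one of the technical problems in the prior art.
To this end, in one aspect, the present invention proposes a method for constructing a model for analyzing variant detection results. According to an embodiment of the invention, the method includes: obtaining a positive sequencing data set of confirmed positive variant sites and a negative sequencing data set of negative variant sites; extracting features of the variant sites from the positive and negative sequencing data sets respectively; and constructing a model from the features obtained in the previous step; wherein the features include at least one of the following: AD0 value: the depth of the first allele in the genotype of the variant site; AD1 value: the depth of the second allele in the genotype of the variant site; AF0 value: the frequency of the first allele in the genotype of the variant site; AF1 value: the frequency of the second allele in the genotype of the variant site; GT value: a single numeric value (specifically 0, 1, 2, or 3); DP value: the sequencing depth; GQ value: the quality value of the variant site genotype; MQ value: the mapping quality of the variant site; QUAL value: the quality value of the variant site likelihood.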
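The nine features can be illustrated with a minimal sketch that reads them from one single-sample VCF data line. Several points are assumptions, not taken from the source: AD0 and AD1 are read as the first two entries of the FORMAT `AD` field, AF0 and AF1 are computed as each depth divided by their sum, DP and GQ come from the FORMAT column, MQ comes from the INFO column, and the `GT_CODES` mapping is a hypothetical encoding (the source only states that GT is a single value that can be 0, 1, 2, or 3).

```python
from typing import Dict

# Hypothetical numeric encoding of the genotype; the source only says GT is
# a single value that can be 0, 1, 2 or 3. Anything unlisted maps to 3.
GT_CODES = {"0/0": 0, "0/1": 1, "1/1": 2}


def extract_features(vcf_line: str) -> Dict[str, float]:
    """Parse one single-sample VCF data line into the nine named features."""
    cols = vcf_line.rstrip("\n").split("\t")
    # INFO column: semicolon-separated key=value pairs (flags without "=" are skipped).
    info = dict(item.split("=", 1) for item in cols[7].split(";") if "=" in item)
    # FORMAT keys (column 9) paired with the sample values (column 10).
    sample = dict(zip(cols[8].split(":"), cols[9].split(":")))

    depths = [int(x) for x in sample["AD"].split(",")]
    ad0, ad1 = depths[0], depths[1]  # assumption: first two AD entries
    total = ad0 + ad1
    # Treat phased and unphased genotypes alike and ignore allele order.
    genotype = "/".join(sorted(sample["GT"].replace("|", "/").split("/")))

    return {
        "AD0": ad0,
        "AD1": ad1,
        "AF0": ad0 / total if total else 0.0,
        "AF1": ad1 / total if total else 0.0,
        "GT": GT_CODES.get(genotype, 3),
        "DP": int(sample["DP"]),
        "GQ": int(sample["GQ"]),
        "MQ": float(info["MQ"]),
        "QUAL": float(cols[5]),
    }
```

The returned dictionary can be turned into one row of a feature matrix; a real pipeline would also need to handle missing values (".") and multi-sample files, which this sketch does not.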
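Variant detection and analysis software can generate dozens of feature parameters. The inventors compared these parameters, screened out a set of them, and used them as attributes to build a machine learning model on data sets of confirmed positive and negative variant sites. The resulting model can accurately predict whether a positive variant call is a false positive and can further report the genotype of the variant site, which helps locate likely variants faster and more precisely and reduces the cost and turnaround time of orthogonal experiments.
In another aspect, the present invention proposes a method for analyzing variant detection results. According to an embodiment of the invention, the method includes: obtaining a candidate positive variant data set; and analyzing the candidate positive variant data set with a machine learning model obtained by the aforementioned construction method, so as to predict whether the positive variant data in the candidate set are false positives and/or the genotype of the variant sites. The method can thus accurately predict whether a positive variant call is a false positive and can also determine the genotype of the variant, which helps locate likely variants faster and more precisely and reduces the cost and turnaround time of orthogonal experiments.
In yet another aspect, the present invention proposes a device for constructing a model for analyzing variant detection results. According to an embodiment of the invention, the device includes: an acquisition module adapted to acquire a positive sequencing data set of confirmed positive variant sites and a negative sequencing data set of negative variant sites; an extraction module adapted to extract features of the variant sites from the positive and negative sequencing data sets respectively; and a construction module adapted to construct a model from the features obtained by the extraction module; wherein the features include at least one of the following: AD0 value: the depth of the first allele in the genotype of the variant site; AD1 value: the depth of the second allele in the genotype of the variant site; AF0 value: the frequency of the first allele in the genotype of the variant site; AF1 value: the frequency of the second allele in the genotype of the variant site; GT value: a single numeric value; DP value: the sequencing depth; GQ value: the quality value of the variant site genotype; MQ value: the mapping quality of the variant site; QUAL value: the quality value of the variant site likelihood. The model obtained with the device can thus accurately predict whether a positive variant call is a false positive and can also determine the genotype of the variant, which helps locate likely variants faster and more precisely and reduces the cost and turnaround time of orthogonal experiments.
In yet another aspect, the present invention proposes an executable storage medium. According to an embodiment of the invention, the storage medium stores computer program instructions which, when run on a processor, cause the processor to perform the method of analyzing variant detection results described above. By executing the storage medium of the invention, it is thus possible to accurately predict whether a positive variant call is a false positive and to determine the genotype of the variant, which helps locate likely variants faster and more precisely and reduces the cost and turnaround time of orthogonal experiments.
In yet another aspect, the present invention proposes an electronic device. According to an embodiment of the invention, the electronic device includes: the aforementioned executable storage medium; and a processor configured to execute the computer program to implement the aforementioned method of analyzing variant detection results. By implementing the electronic device of the invention, it is thus possible to accurately predict whether a positive variant call is a false positive and to determine the genotype of the variant, which helps locate likely variants faster and more precisely and reduces the cost and turnaround time of orthogonal experiments.
Additional aspects and advantages of the invention will be set forth in part in the description that follows, and in part will become apparent from that description or be learned through practice of the invention.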
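Detailed Description
Embodiments of the invention are described in detail below. The embodiments described below are exemplary, serve only to explain the invention, and are not to be construed as limiting it. Where no specific technique or condition is indicated in an embodiment, the techniques or conditions described in the literature of the field, or the product instructions, are followed. Reagents or instruments for which no manufacturer is indicated are conventional, commercially available products.
Method for constructing a model for analyzing variant detection results
In one aspect, the present invention proposes a method for constructing a model for analyzing variant detection results. According to an embodiment of the invention, the method includes: obtaining a positive sequencing data set of confirmed positive variant sites and a negative sequencing data set of negative variant sites; extracting features of the variant sites from the positive and negative sequencing data sets respectively; and constructing a model from the features obtained in the previous step; wherein the features include at least one of the following: AD0 value, AD1 value, AF0 value, AF1 value, GT value, DP value, GQ value, MQ value, and QUAL value.
Through extensive experiments, the inventors screened out the nine feature parameters above, all of which are feature parameters in the GATK software (their meanings are given in the table below), and used them as feature attributes to run machine learning on the positive sequencing data set of confirmed positive variant sites and the negative sequencing data set of negative variant sites, obtaining a prediction model. The resulting model can accurately predict whether a positive variant call is a false positive, which helps locate likely variants faster and more precisely and reduces the cost and turnaround time of orthogonal experiments.
Table 1. Feature meanings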

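According to an embodiment of the invention, the positive sequencing data set of confirmed positive variant sites and the negative sequencing data set of negative variant sites are obtained as follows: obtaining a sequencing data set; using GATK software to compare the sequencing data set with reference data to obtain a candidate positive variant data set; and analyzing the candidate positive variant data set to obtain the positive and negative sequencing data sets.
First, clinical gene sequencing data are obtained and compared with reference data (including, for example, alignment, variant detection, annotation, and filtering operations); GATK is used to identify variants, yielding candidate positive variant data that are output as a VCF file. The candidate positive variant data are then analyzed again to establish whether each call is a true positive or a false positive, and the data are divided into a positive sequencing data set of positive variant sites and a negative sequencing data set of negative variant sites.
According to an embodiment of the invention, the reference sequence is selected from the human genome hg19.
According to an embodiment of the invention, the analysis includes: subjecting the candidate positive variant data set to standard clinical interpretation to obtain a data set of potentially pathogenic variants; and performing orthogonal test analysis on that data set to obtain a positive sequencing data set of confirmed positive variant sites and a negative sequencing data set of negative variant sites, wherein the positive sequencing data set includes an SNV data set and an INDEL data set, each of which includes a homozygous genotype data set and a heterozygous genotype data set.
The term "standard clinical interpretation" refers to interpreting the pathogenicity of clinical variants with reference to the 2015 ACMG guidelines.
The candidate positive variant data identified by GATK undergo standard clinical interpretation to obtain potentially pathogenic variant data, and these data are then verified by orthogonal testing, which yields the positive sequencing data set of confirmed positive variant sites and the negative sequencing data set of negative variant sites. The positive sequencing data set can be divided into SNV and INDEL variant types, and for both types the genotype of the variant, homozygous (Hom) or heterozygous (Het), can also be determined accurately.
Note that the invention does not strictly limit the method of orthogonal test analysis, provided it can determine whether a potentially pathogenic variant is a true positive or a false positive. Conventional techniques in the field can be used; see, for example, Sanger F. DNA sequencing with chain-terminating inhibitors. 1977 [J]. Biotechnology (Reading, Mass.), 24: 104-108.
According to an embodiment of the invention, the model is a random forest classification model and the threshold is 0.95 ± 0.05. Setting the threshold ensures sufficient accuracy and reduces chance errors. With an adjustable threshold, accuracy can be traded off against the proportion of calls sent for orthogonal testing, provided sufficient accuracy is maintained.
According to a specific embodiment of the invention, the positive sequencing data set of confirmed positive variant sites and the negative sequencing data set of negative variant sites are each divided into a training set and a test set (3:1), a random forest classification model is selected, and the model with the highest accuracy after 5-fold cross-validation is chosen.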
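The 3:1 split and the 5-fold cross-validation described above can be sketched with the standard library alone. This is only an illustration of the index bookkeeping: the function names `split_train_test` and `k_folds` are hypothetical, the shuffling seed is arbitrary, and the random forest itself is left out (in practice a library classifier such as scikit-learn's `RandomForestClassifier` could be fitted on each training fold and scored on the held-out fold).

```python
import random
from typing import List, Sequence, Tuple


def split_train_test(n_items: int, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle item indices and split them 3:1 into training and test sets."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    cut = (3 * n_items) // 4
    return indices[:cut], indices[cut:]


def k_folds(indices: Sequence[int], k: int = 5) -> List[Tuple[List[int], List[int]]]:
    """Return k (train, validation) pairs; each index is validated exactly once."""
    folds = [list(indices[i::k]) for i in range(k)]
    pairs = []
    for i, held_out in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        pairs.append((train, held_out))
    return pairs


train_idx, test_idx = split_train_test(100)
for fold_train, fold_val in k_folds(train_idx):
    # A classifier would be fitted on fold_train and scored on fold_val here;
    # the candidate with the best mean validation accuracy is then kept.
    pass
```

The test set is touched only once, after model selection, so that the reported test accuracy is not biased by the cross-validation search.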
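According to an embodiment of the invention, the method for constructing a model for analyzing variant detection results includes:
1. First, clinical genomic data are obtained and aligned to the human reference genome (hg19), and GATK is used to identify variants and output a VCF file;
2. Potentially pathogenic variants are obtained through standard clinical interpretation, and their correctness is then verified by orthogonal experiments, which also provide the exact genotype: Hom (homozygous), Het (heterozygous), or N (no variant present);
3. The VCF file is then converted into machine learning labels and features; nine features in total are extracted (see Table 1);
4. Depending on the variant type (SNV or INDEL), two separate machine learning classification models are built from the features extracted from the VCF file, and a grid search is used to find the optimal parameters;
5. Based on the above, the data are divided into a training set and a test set (3:1), a random forest classification model is selected, and the model with the highest accuracy after 5-fold cross-validation is chosen.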
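Step 4 above builds one model per variant type, which requires routing each record to the SNV or INDEL group first. The sketch below shows one common way to do that from the REF and ALT alleles of a biallelic record. The source does not say how the variant type is assigned, so the length-based rule, the `OTHER` category, and the function names are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def variant_type(ref: str, alt: str) -> str:
    """Classify a biallelic variant from its REF and ALT alleles."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) != len(alt):
        return "INDEL"
    # Equal-length multi-base substitutions are not covered by the source.
    return "OTHER"


def partition_by_type(
    records: Iterable[Tuple[str, str, dict]]
) -> Dict[str, List[dict]]:
    """Group (ref, alt, features) records so each type can train its own model."""
    groups: Dict[str, List[dict]] = defaultdict(list)
    for ref, alt, features in records:
        groups[variant_type(ref, alt)].append(features)
    return dict(groups)
```

Each resulting group would then go through its own split, grid search, and cross-validation, so that hyperparameters are tuned separately for SNVs and INDELs.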
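Method for analyzing variant detection results
In another aspect, the present invention proposes a method for analyzing variant detection results. According to an embodiment of the invention, the method includes: obtaining a candidate positive variant data set; and analyzing the candidate positive variant data set with a machine learning model obtained by the aforementioned construction method, so as to predict whether the positive variant data in the candidate set are false positives and/or the genotype of the variant sites. The model obtained by the method can thus accurately predict whether a candidate positive variant call is a false positive and can also determine the genotype of the variant, which helps locate likely variants faster and more precisely and reduces the cost and turnaround time of orthogonal experiments.
According to an embodiment of the invention, the candidate positive variant data set is obtained by: obtaining a sequencing data set; and using GATK software to compare the sequencing data set with reference data to obtain the candidate positive variant data set.
According to an embodiment of the invention, the model is a random forest classification model, and when the confidence of a candidate positive variant is lower than the threshold of the model, the candidate is subjected to orthogonal test analysis in order to determine whether it is a false positive. Data below the threshold are called gray-zone data; the model's accuracy in predicting false positives for such data is low, so orthogonal experimental verification is required for this part of the data in order to determine accurately whether it is a false positive.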
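The gray-zone rule above can be sketched as a small triage function. It assumes that "confidence" means the highest class probability returned by the classifier, which the source does not spell out; the function name `triage` and the label names are hypothetical, and the default threshold of 0.95 is the value stated in the source.

```python
from typing import Sequence, Tuple


def triage(
    class_probabilities: Sequence[float],
    labels: Sequence[str],
    threshold: float = 0.95,
) -> Tuple[str, bool]:
    """Return the predicted label and whether orthogonal testing is needed.

    A call whose top class probability is below the threshold falls in the
    gray zone and is flagged for orthogonal (e.g. Sanger) confirmation.
    """
    best = max(range(len(labels)), key=lambda i: class_probabilities[i])
    confidence = class_probabilities[best]
    return labels[best], confidence < threshold


# Labels follow the genotype classes named in the source: N, Het, Hom.
print(triage([0.02, 0.97, 0.01], ["N", "Het", "Hom"]))  # confident call
print(triage([0.40, 0.60, 0.00], ["N", "Het", "Hom"]))  # gray zone
```

Only the flagged calls go to the wet lab, which is where the savings in cost and turnaround time would come from.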
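Those skilled in the art will understand that the features and advantages described above for the method for constructing a model for analyzing variant detection results apply equally to this method for analyzing variant detection results and are not repeated here.
Device for constructing a model for analyzing variant detection results
In yet another aspect, the present invention proposes a device for constructing a model for analyzing variant detection results. According to an embodiment of the invention, the device includes: an acquisition module adapted to acquire a positive sequencing data set of confirmed positive variant sites and a negative sequencing data set of negative variant sites; an extraction module adapted to extract features of the variant sites from the positive and negative sequencing data sets respectively; and a construction module adapted to construct a model from the features obtained by the extraction module; wherein the features include at least one of the following: AD0 value: the depth of the first allele in the genotype of the variant site; AD1 value: the depth of the second allele in the genotype of the variant site; AF0 value: the frequency of the first allele in the genotype of the variant site; AF1 value: the frequency of the second allele in the genotype of the variant site; GT value: a single numeric value; DP value: the sequencing depth; GQ value: the quality value of the variant site genotype; MQ value: the mapping quality of the variant site; QUAL value: the quality value of the variant site likelihood. The model obtained with the device can thus accurately predict whether a positive variant call is a false positive and can also determine the genotype of the variant, which helps locate likely variants faster and more precisely and reduces the cost and turnaround time of orthogonal experiments.
According to an embodiment of the invention, the acquisition module includes: a sequencing data set acquisition module adapted to acquire a sequencing data set; a comparison processing module adapted to use GATK software to compare the sequencing data set with reference data to obtain a candidate positive variant data set; and an analysis and processing module adapted to analyze the candidate positive variant data set to obtain a positive sequencing data set of confirmed positive variant sites and a negative sequencing data set of negative variant sites. The acquisition module can accurately determine the positive and negative variant site data in the sequencing data set and can also determine the genotype of the positive variant sites.
According to an embodiment of the invention, the analysis and processing module includes: a standard clinical interpretation module adapted to perform standard clinical interpretation of the positive variant data to obtain potentially pathogenic variant data; and an orthogonal test analysis module adapted to perform orthogonal test analysis on the potentially pathogenic variant data to obtain a positive sequencing data set of confirmed positive variant sites and a negative sequencing data set of negative variant sites.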
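Executable storage medium
In yet another aspect, the present invention proposes an executable storage medium. According to an embodiment of the invention, the storage medium stores computer program instructions which, when run on a processor, cause the processor to perform the method of analyzing variant detection results described above. By executing the storage medium of the invention, it is thus possible to accurately predict whether a positive variant call is a false positive and to determine the genotype of the variant, which helps locate likely variants faster and more precisely and reduces the cost and turnaround time of orthogonal experiments.
Those skilled in the art will understand that the features and advantages described above for the method of analyzing variant detection results apply equally to this executable storage medium and are not repeated here.
Electronic device
In yet another aspect, the present invention proposes an electronic device. According to an embodiment of the invention, the electronic device includes: the aforementioned executable storage medium; and a processor configured to execute the computer program to implement the aforementioned method of analyzing variant detection results. By implementing the electronic device of the invention, it is thus possible to accurately predict whether a positive variant call is a false positive and to determine the genotype of the variant, which helps locate likely variants faster and more precisely and reduces the cost and turnaround time of orthogonal experiments.
Those skilled in the art will understand that the features and advantages described above for the method of analyzing variant detection results and for the executable storage medium apply equally to this electronic device and are not repeated here.
The solution of the invention is explained below with reference to an example. Those skilled in the art will understand that the following example serves only to illustrate the invention and should not be regarded as limiting its scope. Where no specific technique or condition is indicated in the example, the techniques or conditions described in the literature of the field, or the product instructions, are followed. Reagents or instruments for which no manufacturer is indicated are conventional, commercially available products.
Example 1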
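1. WES data from 5190 clinical patients were obtained; GATK software was used to align the data to the human genome hg19 and to perform variant detection, annotation, and filtering, yielding VCF files;
2. The VCF files went through the standard clinical interpretation process, which identified 7375 potentially pathogenic variants;
3. The 7375 variants were verified by orthogonal experiments (see, for example, Sanger F. DNA sequencing with chain-terminating inhibitors. 1977 [J]. Biotechnology (Reading, Mass.), 24: 104-108), which showed that they comprise 5241 SNVs and 2134 INDELs. Among the SNVs there are 3226 Het genotypes, 63 Hom genotypes, and 1952 negative variants; among the INDELs there are 1606 Het genotypes, 138 Hom genotypes, and 390 negative variants;
4. The data from the previous step were divided into a training set and a test set (3:1), and random forest classification models were built on the training set. All features in the training set were taken as candidate features, principal component analysis was then performed, and the nine features listed in Table 2 were finally determined.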
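The counts reported in step 3 can be checked for internal consistency with a few lines of arithmetic. The numbers are taken directly from the text; the label `N` for negative variants follows the genotype naming used in the description.

```python
# Counts reported in Example 1, step 3.
counts = {
    "SNV": {"Het": 3226, "Hom": 63, "N": 1952},
    "INDEL": {"Het": 1606, "Hom": 138, "N": 390},
}

# Per-type totals and the overall total of verified variants.
totals = {vtype: sum(by_genotype.values()) for vtype, by_genotype in counts.items()}
grand_total = sum(totals.values())

# Share of orthogonally verified calls that turned out negative, per type.
negative_fraction = {
    vtype: by_genotype["N"] / totals[vtype] for vtype, by_genotype in counts.items()
}

print(totals)       # {'SNV': 5241, 'INDEL': 2134}
print(grand_total)  # 7375
```

The per-type sums match the stated 5241 SNVs and 2134 INDELs, and together they give the 7375 potentially pathogenic variants from step 2. Negative calls make up roughly 37% of the SNVs and 18% of the INDELs.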
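Table 2. Feature importance in the random forest classification models built for the SNV and INDEL variant types
The test set accuracies of the SNV and INDEL models are 94.8% and 93.8% respectively; the accuracies for the different genotypes are shown in Table 3.
Table 3. Accuracy for different genotypes in the random forest classification models built for the SNV and INDEL variant types
Given the accuracy that clinical data require, the method applies different thresholds (on the confidence of the random forest result) to the test set to obtain different accuracies and orthogonal experiment proportions (Table 4). Here the accuracy is the number of correct calls divided by the number of calls that meet the threshold, and the orthogonal experiment proportion is the number of calls below the threshold divided by the total number of test samples. Among the thresholds that give sufficient accuracy, the one with the smallest orthogonal experiment proportion is selected as the target threshold; the final threshold is 0.95, adjustable within a range of ±0.05. These results show that the method tolerates noisy data, redundant data, and low-quality data to some degree and is robust.
Table 4. Thresholds and the corresponding proportion of variants requiring orthogonal experiments in the random forest classification models built for the SNV and INDEL variant types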
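The two quantities defined above, and the rule for choosing the target threshold, can be written out as a short sketch. The function names, the toy inputs in the usage note, and the choice of `>=` for "meets the threshold" are assumptions; the definitions of accuracy and orthogonal experiment proportion follow the text. The inputs are assumed to be non-empty.

```python
from typing import Optional, Sequence, Tuple


def threshold_metrics(
    confidences: Sequence[float], correct: Sequence[bool], threshold: float
) -> Tuple[float, float]:
    """Return (accuracy, orthogonal_ratio) for one confidence threshold.

    accuracy         = correct calls / calls meeting the threshold
    orthogonal_ratio = calls below the threshold / all test samples
    """
    kept = [ok for conf, ok in zip(confidences, correct) if conf >= threshold]
    accuracy = sum(kept) / len(kept) if kept else float("nan")
    orthogonal_ratio = (len(confidences) - len(kept)) / len(confidences)
    return accuracy, orthogonal_ratio


def pick_threshold(
    confidences: Sequence[float],
    correct: Sequence[bool],
    candidates: Sequence[float],
    min_accuracy: float,
) -> Optional[float]:
    """Among thresholds with sufficient accuracy, pick the smallest orthogonal ratio."""
    best: Optional[Tuple[float, float]] = None
    for t in sorted(candidates):
        accuracy, ratio = threshold_metrics(confidences, correct, t)
        # A NaN accuracy (no call meets the threshold) never passes this check.
        if accuracy >= min_accuracy and (best is None or ratio < best[1]):
            best = (t, ratio)
    return None if best is None else best[0]
```

For example, with confidences `[0.99, 0.97, 0.96, 0.90, 0.80, 0.60]` and correctness flags `[True, True, True, False, True, False]`, a threshold of 0.95 keeps three calls, all correct, and sends half of the samples to orthogonal testing. What counts as "sufficient accuracy" is a clinical decision that the source does not quantify, so `min_accuracy` is left as a parameter.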

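In this specification, reference to the terms "one embodiment," "some embodiments," "an example," "specific examples," or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. The schematic use of these terms does not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples, and those skilled in the art may combine different embodiments or examples, and features of different embodiments or examples, described in this specification unless they contradict each other.
Although embodiments of the invention have been shown and described above, it is to be understood that they are exemplary and are not to be construed as limiting the invention; those of ordinary skill in the art may change, modify, replace, and vary the above embodiments within the scope of the invention.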

Claims (13)

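  1. A method for constructing a model for analyzing variant detection results, characterized by comprising:
    obtaining a positive sequencing data set of confirmed positive variant sites and a negative sequencing data set of negative variant sites;
    extracting features of the variant sites from the positive sequencing data set and the negative sequencing data set respectively;
    constructing a model from the features obtained in the previous step;
    wherein the features include at least one of the following:
    AD0 value: the depth of the first allele in the genotype of the variant site;
    AD1 value: the depth of the second allele in the genotype of the variant site;
    AF0 value: the frequency of the first allele in the genotype of the variant site;
    AF1 value: the frequency of the second allele in the genotype of the variant site;
    GT value: a single numeric value;
    DP value: the sequencing depth;
    GQ value: the quality value of the variant site genotype;
    MQ value: the mapping quality of the variant site;
    QUAL value: the quality value of the variant site likelihood.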
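  2. The method according to claim 1, characterized in that the positive sequencing data set of confirmed positive variant sites and the negative sequencing data set of negative variant sites are obtained by:
    obtaining a sequencing data set;
    using GATK software to compare the sequencing data set with reference data to obtain a candidate positive variant data set;
    analyzing the candidate positive variant data set to obtain the positive sequencing data set of confirmed positive variant sites and the negative sequencing data set of negative variant sites.
  3. The method according to claim 2, characterized in that the reference sequence is selected from the human genome hg19.
  4. The method according to claim 2, characterized in that the analysis comprises:
    subjecting the candidate positive variant data set to standard clinical interpretation to obtain a data set of potentially pathogenic variants;
    performing orthogonal test analysis on the data set of potentially pathogenic variants to obtain the positive sequencing data set of confirmed positive variant sites and the negative sequencing data set of negative variant sites, wherein the positive sequencing data set includes an SNV variant type data set and an INDEL variant type data set, each of which includes a homozygous genotype data set and a heterozygous genotype data set.
  5. The method according to claim 1, characterized in that the model is a random forest classification model and the threshold is 0.95 ± 0.05.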
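  6. A method for analyzing variant detection results, characterized by comprising:
    obtaining a candidate positive variant data set;
    analyzing the candidate positive variant data set with a machine learning model obtained by the method for constructing a model for analyzing variant detection results according to any one of claims 1 to 5, so as to predict whether the positive variant data in the candidate positive variant data set are false positives and/or the genotype of the variant sites.
  7. The method according to claim 6, characterized in that the candidate positive variant data set is obtained by:
    obtaining a sequencing data set;
    using GATK software to compare the sequencing data set with reference data to obtain the candidate positive variant data set.
  8. The method according to claim 6, characterized in that the model is a random forest classification model, and when the confidence of the candidate positive variant data is lower than the threshold of the model, the candidate positive variant data are subjected to orthogonal test analysis so as to predict whether the positive variant data in the candidate positive variant data set are false positives.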
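  9. A device for constructing a model for analyzing variant detection results, characterized by comprising:
    an acquisition module adapted to acquire a positive sequencing data set of confirmed positive variant sites and a negative sequencing data set of negative variant sites;
    an extraction module adapted to extract features of the variant sites from the positive sequencing data set and the negative sequencing data set respectively;
    a construction module adapted to construct a model from the features obtained by the extraction module;
    wherein the features include at least one of the following:
    AD0 value: the depth of the first allele in the genotype of the variant site;
    AD1 value: the depth of the second allele in the genotype of the variant site;
    AF0 value: the frequency of the first allele in the genotype of the variant site;
    AF1 value: the frequency of the second allele in the genotype of the variant site;
    GT value: a single numeric value;
    DP value: the sequencing depth;
    GQ value: the quality value of the variant site genotype;
    MQ value: the mapping quality of the variant site;
    QUAL value: the quality value of the variant site likelihood.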
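  10. The apparatus according to claim 9, characterized in that the acquisition module comprises:
    a sequencing data set acquisition module, adapted to obtain a sequencing data set;
    a comparison processing module, adapted to compare the sequencing data set with reference data using GATK software to obtain a candidate positive variant data set;
    an analysis processing module, adapted to analyze the candidate positive variant data set to obtain the positive sequencing data set of confirmed positive variant sites and the negative sequencing data set of negative variant sites.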
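  11. The apparatus according to claim 10, characterized in that the analysis processing module comprises:
    a standard clinical interpretation module, adapted to subject the positive variant data to standard clinical interpretation to obtain potentially pathogenic variant data;
    an orthogonal test analysis module, adapted to perform orthogonal test analysis on the potentially pathogenic variant data to obtain the positive sequencing data set of confirmed positive variant sites and the negative sequencing data set of negative variant sites.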
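  12. An executable storage medium, characterized in that the storage medium stores computer program instructions which, when run on a processor, cause the processor to perform the method for analyzing variant detection results according to any one of claims 6 to 8.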
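  13. An electronic device, characterized by comprising:
    the executable storage medium according to claim 12;
    the processor, configured to execute the computer program to implement the method for analyzing variant detection results according to any one of claims 6 to 8.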
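As an informal illustration of the features named in claims 1 and 9 and the routing step in claim 8, the sketch below pulls the nine values out of a single-sample VCF data line and flags low-confidence calls for orthogonal testing. Several points are assumptions rather than anything the claims specify: the function names `extract_features` and `needs_orthogonal_test` are hypothetical; AD0 and AD1 are read as the AD depths of the two alleles named in GT; AF0 and AF1 are computed as each allele's share of the summed AD depths; the "single numeric value" for GT is encoded here as 1 for homozygous and 0 for heterozygous; and the record is assumed to be diploid with GT, AD, DP, GQ and MQ all present.

```python
from typing import Dict


def extract_features(vcf_line: str) -> Dict[str, float]:
    """Extract AD0, AD1, AF0, AF1, GT, DP, GQ, MQ and QUAL from one
    tab-separated, single-sample VCF data line (diploid call assumed)."""
    cols = vcf_line.rstrip("\n").split("\t")
    qual = float(cols[5])  # QUAL column
    # INFO column: keep only key=value entries (flags without '=' are skipped).
    info = dict(kv.split("=", 1) for kv in cols[7].split(";") if "=" in kv)
    # Pair FORMAT keys with the sample's values, e.g. GT:AD:DP:GQ -> 0/1:18,22:40:99.
    sample = dict(zip(cols[8].split(":"), cols[9].split(":")))

    # Allele indices from GT; phased ('|') and unphased ('/') are treated alike.
    gt = [int(a) for a in sample["GT"].replace("|", "/").split("/")]
    ad = [int(x) for x in sample["AD"].split(",")]
    ad0, ad1 = ad[gt[0]], ad[gt[1]]
    ad_total = sum(ad)

    return {
        "AD0": float(ad0),
        "AD1": float(ad1),
        # Assumed definition: allele depth as a fraction of the summed AD depths.
        "AF0": ad0 / ad_total if ad_total else 0.0,
        "AF1": ad1 / ad_total if ad_total else 0.0,
        # Assumed encoding: 1.0 = homozygous, 0.0 = heterozygous.
        "GT": 1.0 if gt[0] == gt[1] else 0.0,
        "DP": float(sample["DP"]),
        "GQ": float(sample["GQ"]),
        "MQ": float(info["MQ"]),
        "QUAL": qual,
    }


def needs_orthogonal_test(confidence: float, threshold: float = 0.95) -> bool:
    """Return True when the model confidence is below the threshold, i.e. the
    call should go to orthogonal test analysis."""
    return confidence < threshold
```

The resulting dictionaries can be assembled into a feature matrix for any random forest implementation. Records with missing values (for example `.` in AD or GQ) are not handled in this sketch and would need explicit treatment.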
PCT/CN2023/081719 2022-04-25 2023-03-15 Method for constructing a model for analyzing variant detection results WO2023207396A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210443091.0 2022-04-25
CN202210443091.0A CN116994647A (zh) 2022-04-25 2022-04-25 Method for constructing a model for analyzing variant detection results

Publications (1)

Publication Number Publication Date
WO2023207396A1 true WO2023207396A1 (zh) 2023-11-02

Family

ID=88517243

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/081719 WO2023207396A1 (zh) 2022-04-25 2023-03-15 Method for constructing a model for analyzing variant detection results

Country Status (2)

Country Link
CN (1) CN116994647A (zh)
WO (1) WO2023207396A1 (zh)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
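CN110268044A (zh) * 2017-03-07 2019-09-20 深圳华大生命科学研究院 Method and apparatus for detecting chromosomal variation
CN108690871A (zh) * 2018-03-29 2018-10-23 深圳裕策生物科技有限公司 Method, apparatus, and storage medium for detecting insertion/deletion mutations based on next-generation sequencing
CN112111565A (zh) * 2019-06-20 2020-12-22 上海其明信息技术有限公司 Method and apparatus for mutation analysis of cell-free DNA sequencing data
CN111304308A (zh) * 2020-03-02 2020-06-19 北京泛生子基因科技有限公司 Method for reviewing gene variant detection results from high-throughput sequencing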

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
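CN117711487A (zh) * 2024-02-05 2024-03-15 广州嘉检医学检测有限公司 Method, system, and readable storage medium for identifying germline SNV and InDel variants
CN117711487B (zh) * 2024-02-05 2024-05-17 广州嘉检医学检测有限公司 Method, system, and readable storage medium for identifying germline SNV and InDel variants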

Also Published As

Publication number Publication date
CN116994647A (zh) 2023-11-03

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 23794832

Country of ref document: EP

Kind code of ref document: A1