CN109616155B

CN109616155B - Data processing system and method for genetic variation pathogenicity classification of coding region

Info

Publication number: CN109616155B
Application number: CN201811374374.4A
Authority: CN
Inventors: 诸峰
Original assignee: Jiangsu University of Science and Technology
Current assignee: Jiangsu University of Science and Technology
Priority date: 2018-11-19
Filing date: 2018-11-19
Publication date: 2023-04-18
Anticipated expiration: 2038-11-19
Also published as: CN109616155A

Abstract

The invention discloses a data processing system and method for pathogenicity classification of genetic variation in coding regions. The system comprises a variant site discovery module, a variant site annotation module, a data and resource loading module, a pathogenicity discrimination and classification module, and a result interpretation and verification module, connected in sequence. The variant site discovery module finds the specific positions of variant sites in the coding region; the variant site annotation module annotates the variant sites and generates a data file corresponding to each variant site; the data and resource loading module reads the data files and the external resource files used for pathogenicity discrimination; the pathogenicity discrimination and classification module calculates the values of all discriminant items for each variant site in the data files and classifies the pathogenicity of the genetic variation; and the result interpretation and verification module interprets the classification results and submits them for manual verification. The invention improves the working efficiency of genetic disease data interpreters.

Description

A data processing system and method for pathogenicity classification of genetic variation in coding regions

Technical field

The invention belongs to the field of bioinformatics and in particular relates to a data processing system and method for pathogenicity classification of genetic variation in coding regions.

Background

The development of next-generation high-throughput sequencing and modern information processing technology has made it possible to sequence target gene sets, whole exomes, and whole genomes, and to process and interpret massive volumes of genetic data. Within this work, functional analysis of genetic variation, and in particular the pathogenicity classification and interpretation of genetic variation in coding regions, is a current focus of gene sequencing data processing and analysis. The results serve fields such as biomedicine and genetics and provide an important basis for applications such as clinical decision support, accurate data interpretation, and genetic counseling.

At present, pathogenicity classification of genetic variants, in China and abroad, is mainly guided by the standards for classifying and interpreting disease variant sites released in 2015 by the American College of Medical Genetics and Genomics (ACMG). The guideline gives 28 evaluation criteria for judging the clinical significance of a variant site. Because the guideline does not explicitly specify the details and parameters of every criterion, different data interpreters may apply the criteria differently in practice, which leads to a high degree of inconsistency between interpretation conclusions. In addition, for every test sample the interpreter must manually query various database resources and compare the relevant evidence item by item, which makes the whole process cumbersome, inefficient, and error-prone. The existing methods are therefore poorly automated and systematized, efficient tools for interpreting sequencing data are lacking, and the need for fast, consistent processing of large-scale sample data is not met.

Summary of the invention

The main purpose of the present invention is to provide a data processing system and method for pathogenicity classification of genetic variation in coding regions. The method enables semi-automatic, systematic processing of large-scale sample data and massive variant site information, speeds up the genetic analysis of hereditary diseases, greatly improves the working efficiency of genetic disease data interpreters, and avoids errors caused by a cumbersome process. It solves the low efficiency and high cost of the prior art. The specific technical solution is as follows:

In one aspect, a data processing system for pathogenicity classification of genetic variation in coding regions is provided. The system is built on the ACMG guideline as its theoretical basis and comprises a variant site discovery module, a variant site annotation module, a data and resource loading module, a pathogenicity discrimination and classification module, and a result interpretation and verification module, connected in sequence, wherein:

The variant site discovery module is used to find the specific positions, within the coding region, of the variant sites to be assessed for pathogenicity; the variant sites include SNPs and small INDELs;

The variant site annotation module is used to annotate the variant sites and to generate a data file corresponding to each variant site; the annotation covers the chromosome on which the variant site lies, the reference allele, the alternate allele, the exon position, rarity, the gene, the amino acid change, the scores and predictions of various computational tools for the deleteriousness of the variant, and variant frequency information in different populations;

The data and resource loading module is used to read the data files and the external resource files for pathogenicity discrimination; the external resource files include the gene lists for pathogenicity discrimination and the ClinVar, OMIM, dbscSNV, and dbNSFP databases;

The pathogenicity discrimination and classification module is used to calculate the values of all discriminant items for each variant site in the data files and to classify the pathogenicity of the genetic variation; the discriminant items include PVS1, PS1, PS4, PM1, PM2, PM4, PM5, PP2, PP3, PP5, BS1, BS2, BP1, BP3, BP4, BP6, BP7, and BA1;

The result interpretation and verification module is used to interpret the classification results and to submit them for manual verification.

Further, the variant site discovery module includes a sequence alignment and mapping unit, a sequence data preprocessing unit, and an SNP and small INDEL discovery unit. The sequence alignment and mapping unit receives raw sequencing data composed of sequence data and maps the sequence data to the reference genome; the sequence data preprocessing unit preprocesses the sequence data mapped to the reference genome; the SNP and small INDEL discovery unit identifies the sites at which the preprocessed sequence data differ from the reference genome and calculates the genotype at each variant site.

Further, the input of the variant site discovery module is a raw sequencing data file in FASTQ format, and its output is a VCF file containing all variant sites.

Further, the variant site discovery module uses the BWA-MEM algorithm to map the raw sequencing data and uses the GATK tools to find the variant sites.

Further, the variant site annotation module includes a site annotation unit, which annotates the SNPs and small INDELs and can also annotate only selected, specified variant sites.

Further, the data and resource loading module includes an annotation data loading unit and an external resource loading unit. The annotation data loading unit reads and stores the data files; the external resource loading unit reads the external resource files.

Further, the pathogenicity discrimination and classification module includes a pathogenicity discrimination unit and a pathogenicity classification unit. The pathogenicity discrimination unit calculates the values of all discriminant items for each variant site; the pathogenicity classification unit classifies all variant sites according to the results calculated by the pathogenicity discrimination unit.
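As an illustration of what the pathogenicity classification unit does with the discriminant item values, the following minimal Python sketch applies the combining rules that the detailed description gives for this unit. The function name `classify`, the dictionary input, and the label strings are assumptions made for illustration. The description does not say how conflicting pathogenic and benign evidence is resolved, so the sketch simply tests the rules in the order in which the description states them.

```python
from collections import Counter


def classify(values):
    """Classify one variant site from its discriminant item values.

    `values` maps an item name such as "PM2" to 0 or 1 (hypothetical input shape).
    """
    n = Counter()
    for name, value in values.items():
        if value:
            # Group items by evidence class: "PVS1" -> "PVS", "PM2" -> "PM".
            n[name.rstrip("0123456789")] += 1
    pvs, ps, pm, pp = n["PVS"], n["PS"], n["PM"], n["PP"]
    ba, bs, bp = n["BA"], n["BS"], n["BP"]

    # Pathogenic rules, in the order given in the description.
    if pvs == 1 and (ps >= 1 or pm >= 2 or (pm == 1 and pp == 1) or pp >= 2):
        return "pathogenic"
    if ps >= 2 or (ps == 1 and (pm >= 3 or (pm == 2 and pp >= 2))):
        return "pathogenic"
    # Likely pathogenic rules.
    if ((pvs == 1 and pm == 1) or (ps == 1 and pm in (1, 2))
            or (ps == 1 and pp >= 2) or pm >= 3 or (pm == 2 and pp >= 2)):
        return "likely pathogenic"
    # Benign and likely benign rules.
    if ba == 1 or bs >= 2:
        return "benign"
    if (bs == 1 and bp == 1) or bp >= 2:
        return "likely benign"
    # Everything else is of uncertain significance.
    return "uncertain significance"
```

For example, `classify({"PVS1": 1, "PM2": 1})` falls through the pathogenic rules and matches the first likely pathogenic rule.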

Further, the result interpretation and verification module includes a result interpretation unit and a verification unit. The result interpretation unit gives the basis for the discrimination results and classification results of the pathogenicity discrimination and classification module; the verification unit compares the classification with reference interpretations and, depending on the comparison result, submits it for further manual review and confirmation.

In another aspect, a data processing method for pathogenicity classification of genetic variation in coding regions is provided, applied to the data processing system described above, characterized in that the method comprises the steps of:

S1. Input raw sequencing data composed of sequence data into the variant site discovery module, map the sequence data to the reference genome with the BWA-MEM algorithm, preprocess the mapped sequence data with the Picard tools, and find the variant sites of the sequence data with the GATK tools;

S2. Use the variant site annotation module to annotate all the variant sites and generate a data file corresponding to each variant site;

S3. Use the data and resource loading module to read the data files and, at the same time, the external resource files used for pathogenicity discrimination;

S4. Based on the ACMG guideline, calculate the values of all discriminant items for each variant site, score the discriminant items, summarize them by score, and classify the pathogenicity based on the summary;

S5. Explain the pathogenicity classification, use the explanation as the basis for the classification, compare the classification results with the interpretations given by ClinVar and InterVar, the genetic variant database and interpretation tools, and, based on the comparison result, submit the classification for further manual review and confirmation.

Further, in step S5, if a classification result is inconsistent with the interpretations given by ClinVar and InterVar, the classification result is submitted for manual review and confirmation; otherwise, the classification of the corresponding pathogenicity is complete.
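The decision in step S5 can be sketched as a small comparison routine. This is only an illustrative Python sketch: the function names, the label normalization, and the choice to ignore a reference source that has no entry for the variant are all assumptions, since the text only states that inconsistent results go to manual review.

```python
def normalize(label):
    """Normalize a classification label for comparison (assumed convention)."""
    return label.strip().lower().replace("_", " ")


def triage(own_label, reference_labels):
    """Decide whether a classification result needs manual review.

    `reference_labels` maps a source name (for example "ClinVar" or "InterVar")
    to its label, or to None when that source has no entry for the variant.
    """
    disagreeing = sorted(
        source
        for source, label in reference_labels.items()
        if label is not None and normalize(label) != normalize(own_label)
    )
    status = "manual_review" if disagreeing else "complete"
    return {"status": status, "disagrees_with": disagreeing}
```

Returning the list of disagreeing sources alongside the status gives the reviewer a starting point for the manual check.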

In the data processing system and method of the present invention, the processing system is formed by connecting, in sequence, the variant site discovery module, the variant site annotation module, the data and resource loading module, the pathogenicity discrimination and classification module, and the result interpretation and verification module. The variant site discovery module finds all genetic variant sites in the coding region; the variant site annotation module annotates all information for the variant sites and generates a data file corresponding to each variant site; the data and resource loading module then reads the data files and the external resource files used for pathogenicity discrimination. Next, the pathogenicity discrimination and classification module calculates the value of each discriminant item for each variant site, scores each discriminant item, summarizes all discriminant items by score, and classifies the pathogenicity of all genetic variants according to the summary. Finally, the result interpretation and verification module gives the basis for the pathogenicity classification of every genetic variant and compares it with the results of genetic variant data interpretation tools; if the comparison is inconsistent, the result is further reviewed and confirmed manually. Compared with the prior art, the invention can process target gene set and whole-exome sequencing data, handling large-scale samples and massive variant site information semi-automatically and systematically. The invention integrates the variant site discovery, variant site annotation, data and resource loading, pathogenicity discrimination and classification, and result interpretation and verification stages, so that the entire data processing flow is standardized and systematic. The invention can speed up the analysis of genetic disease gene data, greatly improve the working efficiency of genetic disease data interpreters, and avoid errors caused by a cumbersome process.

Description of the drawings

Fig. 1 is a schematic block diagram of the structure of the data processing system for pathogenicity classification of genetic variation in coding regions in the embodiment of the present invention;

Fig. 2 is a schematic flow diagram of how the variant site discovery module finds variant sites in the embodiment of the present invention;

Fig. 3 is a schematic flowchart of the annotation of variant site information by the custom annotation unit in the embodiment of the present invention;

Fig. 4 is a schematic flow diagram of the data processing method for pathogenicity classification of genetic variation in coding regions in the embodiment of the present invention.

1-variant site discovery module, 2-variant site annotation module, 3-data and resource loading module, 4-pathogenicity discrimination and classification module, 5-result interpretation and verification module.

Detailed description

To enable those skilled in the art to better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments.

In the data processing system and method of the present invention, the data processing system is built on the ACMG guideline as its theoretical basis. The coding region is all protein-coding regions of the genome, that is, all exon regions of the genome. The data processing involves the BWA-MEM algorithm, the Picard tools, and the GATK tools; the Picard tools include the AddOrReplaceReadGroups and MarkDuplicates algorithms, and the GATK tools include the HaplotypeCaller, GenotypeGVCFs, SelectVariants, VariantRecalibrator, ApplyRecalibration, and CombineVariants algorithms.
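To make the mapping and preprocessing tool chain concrete, the Python sketch below only assembles command lines; it does not run anything. It assumes paired-end FASTQ input, a `picard.jar` invoked through `java -jar` with Picard's legacy `KEY=value` argument style, and placeholder read group values (`lib1`, `illumina`, `unit1`); none of these details are specified in the text. Picard's duplicate-marking tool is invoked as `MarkDuplicates`.

```python
def mapping_and_preprocessing_commands(ref, fastq1, fastq2, sample,
                                       picard_jar="picard.jar"):
    """Return argument lists for mapping and Picard preprocessing.

    File names derived from `sample` are hypothetical. `bwa mem` writes SAM
    to standard output, so the caller is expected to redirect the output of
    the first command to `<sample>.sam`.
    """
    sam = f"{sample}.sam"
    rg_bam = f"{sample}.rg.bam"
    dedup_bam = f"{sample}.dedup.bam"
    picard = ["java", "-jar", picard_jar]
    return [
        # Map the reads to the reference genome with BWA-MEM.
        ["bwa", "mem", ref, fastq1, fastq2],
        # Add read group information and coordinate-sort the alignments.
        picard + ["AddOrReplaceReadGroups", f"I={sam}", f"O={rg_bam}",
                  f"RGID={sample}", "RGLB=lib1", "RGPL=illumina",
                  "RGPU=unit1", f"RGSM={sample}", "SORT_ORDER=coordinate"],
        # Mark duplicates introduced by steps such as PCR amplification.
        picard + ["MarkDuplicates", f"I={rg_bam}", f"O={dedup_bam}",
                  f"M={sample}.dup_metrics.txt"],
    ]
```

Keeping the commands as plain argument lists makes them easy to log or to hand to `subprocess.run` once the paths have been checked.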

With reference to Figs. 1 to 4, the data processing system for pathogenicity classification of genetic variation in coding regions comprises a variant site discovery module 1, a variant site annotation module 2, a data and resource loading module 3, a pathogenicity discrimination and classification module 4, and a result interpretation and verification module 5, connected in sequence. The variant site discovery module 1 is used to find the specific positions of variant sites in the coding region; the variant sites include SNPs and small INDELs. The variant site annotation module 2 is used to annotate the variant sites and to generate a data file corresponding to each variant site; the annotation covers the chromosome on which the variant site lies, the reference allele, the alternate allele, the exon position, rarity, the gene, the amino acid change, the scores and predictions of various computational tools for the deleteriousness of the variant, and variant frequency information in different populations. The data and resource loading module 3 is used to read the data files and the external resource files for pathogenicity discrimination; the external resource files include the gene lists for pathogenicity discrimination and the ClinVar, OMIM, dbscSNV, and dbNSFP databases. The pathogenicity discrimination and classification module 4 is used to calculate the values of all discriminant items for each variant site in the data files and to classify the pathogenicity of the genetic variation; the discriminant items include PVS1, PS1, PS4, PM1, PM2, PM4, PM5, PP2, PP3, PP5, BS1, BS2, BP1, BP3, BP4, BP6, BP7, and BA1. The result interpretation and verification module 5 is used to interpret the classification results and to submit them for manual verification. The data processing method for pathogenicity classification of genetic variation in coding regions comprises the steps of:

S1. Input raw sequencing data composed of sequence data into the variant site discovery module, map the sequence data to the reference genome with the BWA-MEM algorithm, preprocess the mapped sequence data with the Picard tools, and find the variant sites of the sequence data with the GATK tools;

S2. Use the variant site annotation module to annotate all the variant sites and generate a data file corresponding to each variant site;

S3. Use the data and resource loading module to read the data files and, at the same time, the external resource files used for pathogenicity discrimination;

S4. Based on the ACMG guideline, calculate the values of all discriminant items for each variant site, score the discriminant items, summarize them by score, and classify the pathogenicity based on the summary;

S5. Explain the pathogenicity classification, use the explanation as the basis for the classification, compare the classification results with the interpretations given by ClinVar and InterVar, the genetic variant database and interpretation tools, and, based on the comparison result, submit the classification for further manual review and confirmation.

In the embodiment of the present invention, the variant site discovery module 1 includes a sequence alignment and mapping unit, a sequence data preprocessing unit, and an SNP and small INDEL discovery unit. The sequence alignment and mapping unit receives raw sequencing data composed of sequence data and maps it to the reference genome; the sequence data preprocessing unit preprocesses the mapped sequence data; the SNP and small INDEL discovery unit identifies the variant sites relative to the reference genome and calculates the genotype at each variant site. The invention uses the BWA-MEM algorithm to map the raw sequencing data to the reference genome. The preprocessing is carried out with the Picard tools, which include the AddOrReplaceReadGroups and MarkDuplicates algorithms: AddOrReplaceReadGroups adds the sequence data information (the read group fields) to the mapped BAM file, and MarkDuplicates marks duplicates in the sequence data to reduce the bias introduced by data generation steps such as PCR amplification. The Picard tools are also used to sort the sequence data. Finally, the GATK tools are used to recalibrate the base quality scores. The specific process of finding variant sites with the variant site discovery module is as follows:

First, GATK's HaplotypeCaller is run on each sample separately in GVCF mode to produce an intermediate GVCF file. GATK's GenotypeGVCFs is then used to combine the single-sample GVCF files into a multi-sample VCF file, and GATK's SelectVariants is used to separate SNPs from INDELs. Next, GATK's VariantRecalibrator and ApplyRecalibration algorithms are used to recalibrate the quality scores of the variants, which serves to filter the variant sites. Finally, GATK's CombineVariants is used to merge the SNPs and INDELs into a single VCF file.
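The first three GATK steps of this process can be laid out as command lines. The Python sketch below assumes GATK 3.x-style invocation (`java -jar GenomeAnalysisTK.jar -T <tool>`), which fits the mention of CombineVariants, a tool that is not part of GATK 4; the jar name and all file names are placeholders. The VariantRecalibrator, ApplyRecalibration, and CombineVariants steps are deliberately left out because their arguments (training resources, annotations, filter levels) depend on the data and are not given in the text.

```python
GATK = ["java", "-jar", "GenomeAnalysisTK.jar"]  # assumed GATK 3.x jar name


def discovery_plan(ref, bams, cohort_vcf="cohort.vcf"):
    """Return argument lists for per-sample calling, joint genotyping,
    and the SNP/INDEL split. Nothing is executed here."""
    gvcfs = [bam.replace(".bam", ".g.vcf") for bam in bams]
    plan = []
    # 1. HaplotypeCaller on each sample separately, in GVCF mode.
    for bam, gvcf in zip(bams, gvcfs):
        plan.append(GATK + ["-T", "HaplotypeCaller", "-R", ref, "-I", bam,
                            "--emitRefConfidence", "GVCF", "-o", gvcf])
    # 2. GenotypeGVCFs joins the single-sample GVCFs into one VCF.
    joint = GATK + ["-T", "GenotypeGVCFs", "-R", ref]
    for gvcf in gvcfs:
        joint += ["--variant", gvcf]
    plan.append(joint + ["-o", cohort_vcf])
    # 3. SelectVariants separates SNPs from INDELs.
    for kind in ("SNP", "INDEL"):
        plan.append(GATK + ["-T", "SelectVariants", "-R", ref,
                            "-V", cohort_vcf, "-selectType", kind,
                            "-o", f"cohort.{kind.lower()}.vcf"])
    return plan
```

For two input BAM files the plan holds five commands: two HaplotypeCaller runs, one GenotypeGVCFs run, and two SelectVariants runs.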

In the embodiment of the present invention, the variant site annotation module 2 annotates variant sites through the site annotation unit. Specifically, using the dbNSFP database, either all variant sites or selected specific variant sites can be annotated. The process is as follows:

First, the VCF file generated by the variant site discovery module is parsed and the VCF file of each single sample is converted to a tab-delimited format; for INDELs, positions are adjusted to eliminate redundancy. The output file contains the chromosome number of the variant, the start and end coordinates of the variant, the reference allele, and the alternate allele. With this file as input, data resources such as transcript identifiers and the mappings from transcript identifiers to genes and protein identifiers are used to produce an output file containing the gene name, gene region, transcript identifier, protein identifier, and similar information. Next, amino acid and sequence data resources are used to obtain the exon in which the variant lies and to classify the variant type, yielding the base change and the amino acid change of the variant; whether the variant falls at a splice site is also analyzed, and the identifier of that splice site is obtained. Finally, using the dbNSFP database resources, the scores and predictions of the functional prediction tools SIFT, Polyphen2, MutationTaster, LRT, FATHMM, CADD, MetaSVM, Clinvar, and InterVar are obtained for the variant.
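The first conversion step, VCF record to tab-delimited row with INDEL position adjustment, might look like the Python sketch below. The text does not spell out the adjustment rule; the sketch assumes the common convention of trimming the bases shared at the start of the reference and alternate alleles and writing an empty allele as `-`. Function names and the handling of multi-allelic records (one row per alternate allele) are also assumptions.

```python
def normalize_variant(chrom, pos, ref, alt):
    """Return (chrom, start, end, ref, alt) with redundant leading bases removed."""
    # Trim bases shared at the start of REF and ALT (the VCF anchor base).
    while ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    if not ref:
        # Pure insertion: anchor at the base before the inserted sequence.
        return (chrom, pos - 1, pos - 1, "-", alt)
    end = pos + len(ref) - 1
    return (chrom, pos, end, ref, alt or "-")


def vcf_line_to_rows(line):
    """Convert one VCF data line to tab-delimited rows, one per ALT allele."""
    chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
    rows = []
    for alt in alts.split(","):
        fields = normalize_variant(chrom, int(pos), ref, alt)
        rows.append("\t".join(str(f) for f in fields))
    return rows
```

With this convention a VCF deletion written as `ATG` to `A` at position 100 becomes the row `101, 102, TG, -`, so the same event is always represented the same way regardless of how much flanking sequence the caller emitted.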

In the embodiment of the present invention, the data and resource loading module 3 includes an annotation data loading unit and an external resource loading unit. The annotation data loading unit reads and stores the data files; the external resource loading unit reads all the external resource files to be analyzed. For example, it reads the LOF (Loss of Function) gene list; reads the missense gene list; reads all pathogenic variant information in the ClinVar database, where the variant information includes the chromosome, start coordinate, base change, amino acid change, and gene name of each pathogenic variant; reads the relevant information from the OMIM database, including the gene names corresponding to OMIM numbers, the list of OMIM numbers of recessive genetic diseases, the list of OMIM numbers of dominant genetic diseases, and the list of Orpha numbers corresponding to OMIM numbers; obtains the PS4-related variant information from gwasdb, including the chromosome number, position in hg19, SNP number, reference allele, and alternate allele; reads the benign protein domains used for the PM1 judgment; reads the BP1 gene list to obtain the gene names; reads the rmsk ranges used to judge PM4 and BP3, each consisting of a chromosome number, a start position, and an end position; reads the heterozygote and homozygote information used for the BS2 judgment; and loads the dbscSNV data, which include the chromosome number, position, reference allele, alternate allele, ada score, and rf score.
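One of these resources, the rmsk ranges, lends itself to a short worked example. The Python sketch below loads (chromosome, start, end) ranges into a sorted, merged index and uses it for the PM4 and BP3 decisions as the description states them: a non-frameshift INDEL outside a repeat region or a stop-loss variant gives PM4, and a non-frameshift INDEL inside a repeat region gives BP3. The function names, the variant type strings, and the treatment of ranges as 1-based inclusive are assumptions; a table downloaded from UCSC uses 0-based, half-open coordinates and would need converting first.

```python
import bisect
from collections import defaultdict


def load_repeat_regions(rows):
    """Build {chrom: sorted, merged [(start, end), ...]} from rmsk-like rows."""
    raw = defaultdict(list)
    for chrom, start, end in rows:
        raw[chrom].append((int(start), int(end)))
    regions = {}
    for chrom, intervals in raw.items():
        merged = []
        for start, end in sorted(intervals):
            if merged and start <= merged[-1][1]:
                # Overlapping ranges are merged so one lookup is enough.
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        regions[chrom] = merged
    return regions


def in_repeat(regions, chrom, pos):
    """True if `pos` lies inside a repeat range on `chrom`."""
    intervals = regions.get(chrom, [])
    # Index of the last range whose start is <= pos.
    i = bisect.bisect_right(intervals, (pos, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos <= intervals[i][1]


def pm4_bp3(variant_type, repeat):
    """Return (PM4, BP3) as 0/1 values for one variant."""
    nonframeshift = variant_type == "nonframeshift_indel"  # hypothetical label
    pm4 = int((nonframeshift and not repeat) or variant_type == "stoploss")
    bp3 = int(nonframeshift and repeat)
    return pm4, bp3
```

Merging the ranges at load time keeps each lookup to a single binary search, which matters when every variant site in a large VCF file is checked.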

In the embodiment of the present invention, the pathogenicity discrimination and classification module includes a pathogenicity discrimination unit and a pathogenicity classification unit. The pathogenicity discrimination unit calculates the values of all discriminant items for each variant site; the pathogenicity classification unit classifies all variant sites according to the results calculated by the pathogenicity discrimination unit.

For the discriminant item PVS1, first check whether the gene containing the variant site is in the LOF (Loss of Function) gene list, then check the annotation of the variant site; if the variant is a nonsense variant, a frameshift variant, a ±1 or ±2 splice site variant, a start codon variant, or a deletion of one or more exons, PVS1 holds and takes the value 1.

For the discriminant item PS1, first check the type of the variant at the site; if it is a missense variant, check whether the variant is present in the loaded ClinVar pathogenic variant data, then check whether the variant is a different nucleotide change that produces the same amino acid change; if so, PS1 holds and takes the value 1, otherwise it takes the value 0.

For the discriminant item PS4, which asks whether the frequency of the variant in affected individuals is significantly higher than in the control group, first check the variant type; if it is a missense variant, check whether the variant appears in the PS4 resource file list; if so, PS4 holds and takes the value 1, otherwise it takes the value 0.

For the discriminant item PM1, protein domains play a vital role in protein function, so missense variants in these regions tend to be pathogenic. First check the variant type; if it is a missense variant, check whether the variant appears in the PM1 benign protein domain resource file list; if so, PM1 does not hold and takes the value 0, otherwise it takes the value 1.

For the discriminant item PM2, first check whether the variant is absent from control sets such as ESP (Exome Sequencing Project), G1000 (1000 Genomes Project), and ExAC (Exome Aggregation Consortium); if it is absent, PM2 holds and takes the value 1. Then check the OMIM recessive genetic disease resource list; for a recessive genetic disease, check whether the variant occurs at an extremely low frequency in the above control sets. The invention uses 0.005 as the frequency threshold.

For the discriminant item PM4, check the RMSK database from the UCSC Genome Browser; if the variant is a non-frameshift INDEL in a non-repeat region, or the variant type is Stop Loss, PM4 holds and takes the value 1, otherwise it takes the value 0.

For the discriminant item PM5, first check the variant type; if it is a missense variant, check whether the variant is present in the loaded ClinVar pathogenic variant data, then check whether the variant is a different nucleotide change that leads to a different amino acid change; if so, PM5 holds and takes the value 1, otherwise it takes the value 0.

For the discriminant item PP2, first check the variant type; if it is a missense variant, further check whether missense variation is a common cause of disease and whether the gene containing the variant has few benign variants. ClinVar pathogenic data are used here as the basis for the common disease mechanism.
If yes, PP2 is established, and the value is 1, otherwise, the value is 0; for the discriminant item PP3, multiple calculation evidences (sift, polyp2_hvar, lrt, mu_taster, mu_assessor, fathmm, cadd, metasvm) support detrimental effects of mutations, if only 1 computational evidence predicts benign, and less than 3 are unknown classifications; if ≤ 3 computational evidences have unknown outcomes, all others If the calculated evidence can predict harmfulness, then PP3 is established, and the value is 1; otherwise, the value is 0; for the discriminant item PP5, check the clnsig item in the variation label, if the clnsig value is equal to the possible pathogenicity or pathogenicity, then PP5 is established, and the value is 0 The value is 1, otherwise the value is 0; for the discriminant item BP1, first check the mutation type, if it is a missense mutation, further check whether the truncated mutation of the gene where the mutation is located can cause disease, here by checking whether the gene where the mutation is located is in the BP1 gene list If it is, BP1 is established, and the value is 1; otherwise, the value is 0; for the discriminant item BP3, check the RMSK database from the UCSC genome browser, if the mutation belongs to the non-frameshift INDEL of the repeated region, then BP3 is established, The value is 1, otherwise the value is 0; for the discriminant item BP4, multiple pieces of computational evidence (sift, polyp2_hvar, lrt, mu_taster, mu_assessor, fathmm, cadd, metasvm) support the harmful effects of mutations, if only 1 computational evidence predicts Harmful, and less than 3 are unknown classifications; if less than or equal to 3 calculation evidences are unknown results, all other calculation evidences will predict benign, then BP4 is established, and the value is 1, otherwise the value is 0; for the discriminant item BP6 , check the clnsig item in the variation annotation, if the clnsig value is equal to the possible benign or benign, then BP6 
is established, and the value is 1, otherwise the value is 0; for the discriminant item BP7, if the synonymous variation has no effect on splicing, and the nucleoside If the acid position is not highly conserved, then the variant is classified as benign, and BP7 is established with a value of 1. When the predicted variation has no effect on splicing, both dbscSNV_RF_SCORE and dbscSNV_ADA_SCORE should be less than 0.6. The prediction of nucleotide conservation is retrieved from the dbnsfp30a database, and the GERP++ score is required to be greater than 2 to indicate that the nucleotide is highly conserved; for the discriminant item BA1, check whether the allele frequency of ESP, G1000, ExAC or SEC in the variant annotation is high For the specified threshold, the threshold here is set to 0.05, if it is greater than the threshold, BA1 is established, and the value is 1, otherwise it is 0; for the discriminant item BS1, in the ExAC browser, if the adjusted allele frequency is greater than the specified threshold, Here the threshold is set to 0.01, then BS1 is established, the value is 1, otherwise the value is 0; for the discriminant item BS2, first check the variation type, if it is a missense variation, further check whether the variation satisfies a homozygous, dominant, or X-linked diseases, if satisfied, then BS2 is established, and the value is 1, otherwise, the value is 0; the pathogenicity classification unit classifies according to the following rules according to the calculation results of the above-mentioned discrimination units, if PVS1 is equal to 1, and The number of PS is greater than or equal to 1, or the number of PM is greater than or equal to 2, or the number of PM is equal to 1 and the number of PP is equal to 1, or the number of PP is greater than or equal to 2, then the classification result is pathogenic; if If the number of PS is greater than or equal to 2, the classification result is still pathogenic; if the number of PS is equal to 1, if the 
number of PM is greater than or equal to 3, the classification result is still pathogenic; if the number of PS is equal to 1, if If the number of PM is equal to 2 and the number of PP is greater than or equal to 2, the classification result is still pathogenic; if the number of PVS1 is equal to 1 and the number of PM is equal to 1, or the number of PS is equal to 1 and the number of PM is equal to 1 or 2, Or the number of PS is equal to 1 and the number of PP is greater than or equal to 2, or the number of PM is greater than or equal to 3 or the number of PM is equal to 2 and the number of PP is greater than or equal to 2, then the classification result is possibly pathogenic; if the number of BA is equal to 1 Or the number of BS is greater than or equal to 2, then the classification result is benign; if the number of BS is equal to 1 and the number of BP is equal to 1, or the number of BP is greater than or equal to 2, then the classification result is possibly benign; other In this case, the classification result is indeterminate in significance.
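As a non-limiting illustration, the combining rules stated above can be sketched in Python. The names `EvidenceCounts` and `classify` are hypothetical and do not appear in the original text. The sketch also assumes that the rules are tried in the order in which they are listed and that the first matching rule decides the result; the description does not state how a variant with both pathogenic and benign evidence is resolved.

```python
from dataclasses import dataclass


@dataclass
class EvidenceCounts:
    """Number of discrimination items that hold in each category (hypothetical container)."""
    pvs: int = 0
    ps: int = 0
    pm: int = 0
    pp: int = 0
    ba: int = 0
    bs: int = 0
    bp: int = 0


def classify(c: EvidenceCounts) -> str:
    """Apply the listed rules in order; the first rule that matches wins (assumption)."""
    # Pathogenic: PVS1 together with any one of the supporting combinations.
    if c.pvs == 1 and (c.ps >= 1 or c.pm >= 2 or (c.pm == 1 and c.pp == 1) or c.pp >= 2):
        return "pathogenic"
    # Pathogenic: two or more PS items.
    if c.ps >= 2:
        return "pathogenic"
    # Pathogenic: one PS item with PM >= 3, or with PM == 2 and PP >= 2.
    if c.ps == 1 and (c.pm >= 3 or (c.pm == 2 and c.pp >= 2)):
        return "pathogenic"
    # Likely pathogenic combinations.
    if (
        (c.pvs == 1 and c.pm == 1)
        or (c.ps == 1 and c.pm in (1, 2))
        or (c.ps == 1 and c.pp >= 2)
        or c.pm >= 3
        or (c.pm == 2 and c.pp >= 2)
    ):
        return "likely pathogenic"
    # Benign: BA1 holds, or two or more BS items.
    if c.ba == 1 or c.bs >= 2:
        return "benign"
    # Likely benign: one BS with one BP, or two or more BP items.
    if (c.bs == 1 and c.bp == 1) or c.bp >= 2:
        return "likely benign"
    return "uncertain significance"
```

For example, `classify(EvidenceCounts(pvs=1, pm=1))` matches none of the pathogenic rules but does match the first likely pathogenic combination.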

In this embodiment of the invention, the result interpretation and verification module 5 comprises a result interpretation unit and a verification unit. The result interpretation unit presents the discrimination results of the pathogenicity discrimination and classification module together with the basis of the classification results, and provides a visual interface for manual reference, so that the pathogenicity of a genetic variation can be judged according to the actual situation. The verification unit compares the pathogenicity classification of each genetic variation with the results of the ClinVar and InterVar genetic variation data interpretation tools; if the comparison results are inconsistent, the variation is prominently marked, reviewed further manually, and its classification is then confirmed.
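As a non-limiting illustration, the comparison performed by the verification unit can be sketched in Python. The function names `normalize` and `needs_manual_review` are hypothetical. The label normalization, and the choice to ignore a reference tool that reports no label for a variant, are assumptions made for this sketch and are not stated in the original text.

```python
from typing import Optional


def normalize(label: str) -> str:
    """Lower-case a label and unify separators, so 'Likely_benign' equals 'likely benign'."""
    return " ".join(label.replace("_", " ").lower().split())


def needs_manual_review(own: str, clinvar: Optional[str], intervar: Optional[str]) -> bool:
    """Return True when any available reference label differs from the system's own label."""
    # Assumption: a missing reference label (None or empty) is not treated as a disagreement.
    references = [label for label in (clinvar, intervar) if label]
    return any(normalize(label) != normalize(own) for label in references)
```

A variant for which `needs_manual_review` returns `True` would be the one that is marked and passed on for manual review.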

In the data processing system and method for pathogenicity classification of genetic variation in coding regions of the present invention, the processing system is formed by connecting in sequence a variant site discovery module, a variant site annotation module, a data and resource loading module, a pathogenicity discrimination and classification module, and a result interpretation and verification module. The variant site discovery module finds all variant sites in the coding region; the variant site annotation module annotates all information on the variant sites and generates a data file corresponding to each variant site; the data and resource loading module then reads the data files and the external resource files used for pathogenicity discrimination. The pathogenicity discrimination and classification module next calculates the specific value of each discrimination item for each variant site, scores each discrimination item, aggregates all discrimination items according to the scores, and classifies the pathogenicity of all genetic variations according to the aggregate. Finally, the result interpretation and verification module gives the basis for the pathogenicity classification of every genetic variation and compares the classification with the results of the genetic variation data interpretation tools; if the comparison is inconsistent, the classification is further reviewed and confirmed manually. Compared with the prior art, the present invention realizes semi-automatic and systematic processing of large-scale samples and massive variant site information for target gene set and whole-exome sequencing data. The invention integrates the processing performed by the variant site discovery module, the variant site annotation module, the data and resource loading module, the pathogenicity discrimination and classification module, and the result interpretation and verification module, so that the whole data processing workflow is standardized and systematic. The invention can speed up the analysis of genetic disease gene data, greatly improve the working efficiency of those who interpret genetic disease data, and avoid errors caused by a cumbersome processing procedure.
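As a non-limiting illustration, the step in which the values of the discrimination items are aggregated before classification can be sketched in Python. The names `summarize` and `_NAME` are hypothetical. Grouping the items by the letter prefix of their names (PVS, PS, PM, PP, BA, BS, BP) is an assumption inferred from the item names; the original text does not prescribe a data structure.

```python
import re
from collections import Counter
from typing import Mapping

# Item names look like "PVS1", "PM2" or "BP7": a category prefix followed by a number.
_NAME = re.compile(r"^(PVS|PS|PM|PP|BA|BS|BP)(\d+)$")


def summarize(values: Mapping[str, int]) -> Counter:
    """Count, per category prefix, how many discrimination items hold (value 1)."""
    counts = Counter()
    for name, value in values.items():
        match = _NAME.match(name)
        if match is None:
            raise ValueError(f"unrecognized discrimination item name: {name!r}")
        if value == 1:
            counts[match.group(1)] += 1
    return counts
```

Because a `Counter` returns 0 for a missing key, categories with no satisfied item can be read without special handling, for example `summarize({"PM2": 1})["BP"]`.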

The above are only preferred embodiments of the present invention and do not limit its patent scope. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in those embodiments, or make equivalent replacements of some of their technical features. Any equivalent structure made using the contents of the specification and drawings of the present invention, and applied directly or indirectly in other related technical fields, likewise falls within the patent protection scope of the present invention.

Claims

1. A data processing system for pathogenicity classification of genetic variation in coding regions, characterized in that it is constructed on the theoretical basis of the ACMG (American College of Medical Genetics and Genomics) guidelines and comprises a variant site discovery module, a variant site annotation module, a data and resource loading module, a pathogenicity discrimination and classification module, and a result interpretation and verification module which are connected in sequence, the system realizing semi-automatic and systematic processing of large-scale samples and massive variant site information for target gene set and whole-exome sequencing data, wherein:

the variant site discovery module is used for finding the specific positions of genetic variant sites in the coding region; the variant sites comprise SNPs and small-fragment INDELs;

the variant site annotation module is used for annotating the variant sites with information and generating a data file corresponding to each variant site; the annotated information comprises the chromosome on which the variant site is located, the reference allele, the alternate allele, the exon location, the rarity, the gene in which the variant site is located, the amino acid change, the scores and predictions of variant harmfulness computed by various computational tools, and the variant frequency in different populations;

the data and resource loading module is used for reading the data file and the external resource files used for pathogenicity discrimination; the external resource files comprise gene lists for pathogenicity discrimination and the ClinVar, OMIM, dbscSNV and dbNSFP databases; the annotation data loading unit is used for reading and storing the data file;

the pathogenicity discrimination and classification module is used for calculating the values of all discrimination items for each variant site in the data file, scoring each discrimination item, aggregating all discrimination items according to the scores, and classifying the pathogenicity of all genetic variations according to the aggregate; the discrimination items comprise PVS1, PS4, PM1, PM2, PM4, PM5, PP2, PP3, PP5, BS1, BS2, BP1, BP3, BP4, BP6, BP7 and BA1;

the result interpretation and verification module comprises a result interpretation unit and a verification unit, wherein the result interpretation unit is used for providing the discrimination results of the pathogenicity discrimination and classification module and the basis of the classification results, and for providing a visual interface for manual reference, so that the pathogenicity of a genetic variation can be judged according to the actual situation; the verification unit is used for comparing the pathogenicity classification of the genetic variation with the results of the ClinVar and InterVar genetic variation data interpretation tools, and, if the comparison results are inconsistent, the variation is prominently marked, further reviewed manually, and its classification is confirmed.

2. The data processing system for classification of pathogenicity of genetic variation in coding region according to claim 1, wherein the variation site discovery module comprises a sequence alignment and mapping unit, a sequence data preprocessing unit, and a SNPs and small fragment INDELs variation discovery unit; the sequence alignment and mapping unit is used for receiving original sequencing data consisting of sequence data and mapping the sequence data to a reference genome; the sequence data preprocessing unit is used for preprocessing the sequence data mapped on the reference genome; the SNPs and small-fragment INDELs mutation discovery unit is used for identifying mutation sites of the preprocessed sequence data relative to a reference genome and calculating the genotype of each mutation site.

3. The data processing system for classification of pathogenicity of genetic variation in coding region according to claim 2, wherein the input of the variant site discovery module is a raw sequencing data file in fastq format and the output of the variant site discovery module is a vcf format file containing all variant sites.

4. The data processing system for classification of pathogenicity of genetic variation in coding region according to claim 3, wherein said variant site discovery module maps said raw sequencing data using the BWA-MEM algorithm, and performs the search for variant sites using the GATK tool.

5. The data processing system of claim 1, wherein said variant site annotation module comprises a site annotation unit for annotating said SNPs and said small-fragment INDELs and for enabling selection of said variant sites for information annotation; the specific process is as follows: first, the VCF file generated by the variant site discovery module is parsed to output a file containing, for each variant, the chromosome number, the start and end coordinates, the reference allele and the alternate allele; then, this file is used as input to obtain an output file containing the gene name, gene region, transcriptome code and protein code; then, the position of the exon containing the variant is obtained from amino acid and sequence data resources, and the variant type is classified to obtain the base change and the amino acid change of the variant; whether the variant occurs at a splice site is analyzed and the code of the splice site is obtained; and finally, the scores and prediction results for the variant from the SIFT, PolyPhen2, MutationTaster, LRT, FATHMM, CADD, MetaSVM, ClinVar and InterVar functional prediction tools are obtained by means of dbNSFP database resources.

6. The data processing system for classification of pathogenicity of genetic variation in coding region according to claim 1, wherein said data and resource loading module comprises an annotation data loading unit and an external resource loading unit, said annotation data loading unit is configured to read and store said data file; the external resource loading unit is used for reading all external resource files to be analyzed.

7. The data processing system of claim 1, wherein the pathogenicity judging and classifying module comprises a pathogenicity judging unit and a pathogenicity classifying unit, and the pathogenicity judging unit is configured to calculate values of all the judging items in each of the mutation sites; and the pathogenicity classification unit classifies all the mutation sites according to the calculation result of the corresponding pathogenicity judgment unit.

8. A data processing method for classification of pathogenicity of genetic variation of coding region, which is applied to the data processing system for classification of pathogenicity of genetic variation of coding region as claimed in any one of claims 1 to 7, the method comprising the steps of:

S1, inputting raw sequencing data consisting of sequence data to the variant site discovery module, mapping the sequence data to a reference genome using the BWA-MEM algorithm, preprocessing the mapped sequence data using the Picard tool, and finding the variant sites of the sequence data using the GATK tool;

S2, performing information annotation on all the variant sites by using the variant site annotation module to generate a data file corresponding to each variant site;

S3, reading the data file by using the data and resource loading module, and simultaneously reading the external resource files for pathogenicity discrimination;

S4, calculating the values of all discrimination items for each variant site based on the ACMG guidelines, scoring the discrimination items, aggregating them according to the scores, and classifying pathogenicity based on the aggregate;

and S5, explaining the pathogenicity classification and taking the explanation as the classification basis, comparing the classification result with the interpretation results of the ClinVar and InterVar genetic variation data interpretation tools, and submitting the comparison result for further manual review and confirmation.

9. The method of claim 8, wherein in step S5, if the classification result is inconsistent with the interpretation results of the ClinVar and InterVar genetic variation data interpretation tools, the classification result is submitted for manual review and confirmation; otherwise, the pathogenicity classification is complete.