CN109616155B - Data processing system and method for genetic variation pathogenicity classification of coding region - Google Patents


Publication number
CN109616155B
CN109616155B CN201811374374.4A CN201811374374A CN109616155B CN 109616155 B CN109616155 B CN 109616155B CN 201811374374 A CN201811374374 A CN 201811374374A CN 109616155 B CN109616155 B CN 109616155B
Authority
CN
China
Prior art keywords
pathogenicity
data
variation
module
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811374374.4A
Other languages
Chinese (zh)
Other versions
CN109616155A (en)
Inventor
诸峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University of Science and Technology
Original Assignee
Jiangsu University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University of Science and Technology
Priority to CN201811374374.4A
Publication of CN109616155A
Application granted
Publication of CN109616155B
Legal status: Active

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a data processing system and method for classifying the pathogenicity of genetic variation in coding regions. The system comprises a variant site discovery module, a variant site annotation module, a data and resource loading module, a pathogenicity determination and classification module, and a result interpretation and verification module, connected in sequence. The variant site discovery module locates the genetic variant sites in the coding region; the variant site annotation module annotates the variant sites and generates a data file corresponding to each variant site; the data and resource loading module reads the data files and the external resource files used for pathogenicity determination; the pathogenicity determination and classification module calculates the value of every criterion for each variant site in the data file and classifies the pathogenicity of the genetic variation; and the result interpretation and verification module interprets the classification results and submits them for manual verification. The invention improves the working efficiency of personnel who interpret genetic disease data.

Description

Data processing system and method for genetic variation pathogenicity classification of coding region
Technical Field
The invention belongs to the field of bioinformatics, and particularly relates to a data processing system and method for genetic variation pathogenicity classification of coding regions.
Background
With the development of next-generation high-throughput sequencing and modern information processing technology, sequencing of target gene sets, whole exomes and whole genomes, and the processing and interpretation of the resulting massive gene data, have become possible. Functional analysis of genetic variation, especially the classification and interpretation of the pathogenicity of genetic variation in coding regions, is currently a focus of attention in sequencing data processing and analysis. The related results serve biomedicine, genetics and other fields, and provide an important basis for clinical decision support, accurate data interpretation, genetic counseling and the like.
At present, both in China and abroad, pathogenicity classification of genetic variation follows the standards and guidelines for the interpretation of sequence variants published in 2015 by the American College of Medical Genetics and Genomics (ACMG). The guideline gives 28 evaluation criteria for judging the clinical significance of variant sites. Because it does not clearly specify the details and parameters of each criterion, different data interpreters may differ in how they apply it, which leads to considerable inconsistency among interpretation conclusions. In addition, for each test sample, data interpreters must manually query various database resources and compare the relevant evidence item by item, so the whole process is tedious, inefficient and error-prone. The prior-art methods therefore have a very low degree of automation and systematization, lack an efficient tool for interpreting sequencing data, and cannot meet the need to process large-scale sample data quickly and with consistent results.
Disclosure of Invention
The main object of the invention is to provide a data processing system and method for classifying the pathogenicity of genetic variation in coding regions. The system and method enable semi-automatic, systematic processing of large-scale sample data and massive variant site information, accelerate the analysis of genetic disease genes, greatly improve the working efficiency of genetic disease data interpreters, and avoid errors caused by the complexity of the processing procedure, thereby solving the problems of low efficiency and high cost in the prior art. The specific technical scheme is as follows:
In one aspect, a data processing system is provided that is constructed with the ACMG guideline as its theoretical basis and comprises a variant site discovery module, a variant site annotation module, a data and resource loading module, a pathogenicity determination and classification module, and a result interpretation and verification module, connected in sequence, wherein:
the variant site discovery module is used for locating the genetic variant sites in the coding region; the variant sites comprise SNPs and small-fragment INDELs;
the variant site annotation module is used for annotating the variant sites and generating a data file corresponding to each variant site; the annotation comprises the chromosome on which the variant site is located, the reference allele, the alternate allele, the exon position, the rarity, the gene in which the variant site is located, the amino acid change, the scores and predictions of variant deleteriousness produced by various computational tools, and the variant frequency in different populations;
the data and resource loading module is used for reading the data files and the external resource files used for pathogenicity determination; the external resource files comprise the gene lists used for pathogenicity determination and the ClinVar, OMIM, dbscSNV and dbNSFP databases;
the pathogenicity determination and classification module is used for calculating the value of every criterion for each variant site in the data file and classifying the pathogenicity of the genetic variation; the criteria comprise PVS1, PS4, PM1, PM2, PM4, PM5, PP2, PP3, PP5, BS1, BS2, BP1, BP3, BP4, BP6, BP7 and BA1;
and the result interpretation and verification module is used for interpreting the classification results and submitting them for manual verification.
Further, the variant site discovery module comprises a sequence alignment and mapping unit, a sequence data preprocessing unit, and an SNP and small-fragment INDEL discovery unit; the sequence alignment and mapping unit is used for receiving raw sequencing data consisting of sequence data and mapping the sequence data to a reference genome; the sequence data preprocessing unit is used for preprocessing the sequence data mapped to the reference genome; the SNP and small-fragment INDEL discovery unit is used for identifying the variant sites of the preprocessed sequence data relative to the reference genome and calculating the genotype of each variant site.
Further, the input of the variant site discovery module is a raw sequencing data file in fastq format, and its output is a vcf format file containing all variant sites.
Further, the variant site discovery module uses the BWA-MEM algorithm to map the raw sequencing data, and uses the GATK tool to search for variant sites.
Further, the variant site annotation module comprises a site annotation unit, which annotates the SNPs and small-fragment INDELs and can also annotate only selected, designated variant sites.
Further, the data and resource loading module comprises an annotation data loading unit and an external resource loading unit; the annotation data loading unit is used for reading and storing the data file, and the external resource loading unit is used for reading the external resource files.
Further, the pathogenicity determination and classification module comprises a pathogenicity determination unit and a pathogenicity classification unit; the pathogenicity determination unit is used for calculating the value of every criterion for each variant site, and the pathogenicity classification unit classifies all variant sites according to the values calculated by the pathogenicity determination unit.
Further, the result interpretation and verification module comprises a result interpretation unit and a verification unit; the result interpretation unit is used for presenting the determination results of the pathogenicity determination and classification module together with the basis for each classification result, and the verification unit is used for comparing the classifications and submitting the comparison results to a worker for further review and confirmation.
In another aspect, a data processing method for classifying the pathogenicity of genetic variation in coding regions is provided, which is applied to the above data processing system and comprises the steps of:
S1, inputting raw sequencing data consisting of sequence data into the variant site discovery module, mapping the sequence data to a reference genome with the BWA-MEM algorithm, preprocessing the mapped sequence data with the Picard tool, and finding the variant sites of the sequence data with the GATK tool;
S2, annotating all variant sites with the variant site annotation module to generate a data file corresponding to each variant site;
S3, reading the data file with the data and resource loading module, and at the same time reading the external resource files used for pathogenicity determination;
S4, calculating the value of every criterion for each variant site based on the ACMG guideline, scoring the criteria, summarizing the scores, and classifying the pathogenicity based on the summary;
S5, interpreting the pathogenicity classification, using the interpretation as the basis of the classification, comparing the classification result with the interpretations given by the ClinVar genetic variation database and the InterVar interpretation tool, and submitting the comparison result for further manual review and confirmation.
Further, in step S5, if the classification result is inconsistent with the interpretations given by ClinVar and InterVar, the classification result is submitted for manual review and confirmation; otherwise, the classification of the pathogenicity is complete.
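The consistency check in step S5 can be illustrated with a minimal Python sketch. The class name, field names and the case-insensitive string comparison are hypothetical, and ignoring a missing reference interpretation is an assumption; the description only states that inconsistent results are sent for manual review.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Comparison:
    """One variant site with the system's class and two reference classes."""
    variant_id: str
    system_class: str
    clinvar_class: Optional[str]   # None if ClinVar has no entry (assumption)
    intervar_class: Optional[str]  # None if InterVar gave no result (assumption)

    def needs_manual_review(self) -> bool:
        # Only reference interpretations that are present are compared;
        # how to treat missing ones is not specified in the description.
        references = [c for c in (self.clinvar_class, self.intervar_class)
                      if c is not None]
        return any(c.lower() != self.system_class.lower() for c in references)


consistent = Comparison("site-1", "Pathogenic", "pathogenic", "Pathogenic")
conflicting = Comparison("site-2", "Benign", "Pathogenic", None)
print(consistent.needs_manual_review(), conflicting.needs_manual_review())
```

Only the conflicting site would be forwarded to a reviewer; the consistent one is treated as finished.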
The processing system is formed by connecting, in sequence, the variant site discovery module, the variant site annotation module, the data and resource loading module, the pathogenicity determination and classification module, and the result interpretation and verification module. The variant site discovery module finds all genetic variant sites in the coding region; the variant site annotation module annotates all of their information and generates a data file corresponding to each variant site; the data and resource loading module then reads the data file and the external resource files used for pathogenicity determination. Next, the pathogenicity determination and classification module calculates the value of each criterion for every variant site, scores each criterion, summarizes all criteria according to the scores, and classifies the pathogenicity of all genetic variations according to the summary. Finally, the result interpretation and verification module gives the basis for the pathogenicity classification of every genetic variation and compares the classification with the result of a genetic variation data interpretation tool; if the two are inconsistent, the result is reviewed and confirmed manually. Compared with the prior art, the invention enables semi-automatic, systematic processing of large-scale samples and massive variant site information for sequencing data of target gene sets and whole exomes. It integrates the processing steps of variant site discovery, variant site annotation, data and resource loading, pathogenicity determination and classification, and result interpretation and verification, so that the whole data processing flow is standardized and systematic. It accelerates the analysis of genetic disease gene data, greatly improves the working efficiency of genetic disease data interpreters, and avoids errors caused by the complexity of the processing procedure.
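The score-summarize-classify step can be sketched in Python as counting the established criteria per evidence level and applying combination rules. The sketch below is illustrative only: the function names are hypothetical, and it implements only the three "pathogenic" combinations that the detailed description states explicitly, so a False result means "not decided by these rules" rather than "benign".

```python
from typing import Dict


def count_evidence(values: Dict[str, int]) -> Dict[str, int]:
    """Count criteria with value 1 per pathogenic evidence level."""
    counts = {"PVS": 0, "PS": 0, "PM": 0, "PP": 0}
    for name, value in values.items():
        level = name.rstrip("0123456789")  # "PVS1" -> "PVS", "PM2" -> "PM"
        if level in counts and value == 1:
            counts[level] += 1
    return counts


def is_pathogenic(values: Dict[str, int]) -> bool:
    """Apply only the pathogenic rules quoted in the description."""
    c = count_evidence(values)
    very_strong_rule = c["PVS"] == 1 and (
        c["PS"] >= 1
        or c["PM"] >= 2
        or (c["PM"] == 1 and c["PP"] == 1)
        or c["PP"] >= 2
    )
    two_strong_rule = c["PS"] >= 2
    strong_plus_moderate_rule = c["PS"] == 1 and c["PM"] >= 3
    return very_strong_rule or two_strong_rule or strong_plus_moderate_rule


print(is_pathogenic({"PVS1": 1, "PM1": 1, "PM2": 1}))
print(is_pathogenic({"PS4": 1, "PM1": 1}))
```

Stripping the trailing digits keeps benign criteria such as BA1 or BP4 out of the pathogenic counts without a separate lookup table.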
Drawings
FIG. 1 is a block diagram illustrating the structure of a data processing system for classification of pathogenicity of genetic variations in a coding region according to an embodiment of the present invention;
FIG. 2 is a block diagram illustrating a process of the mutation site discovery module according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart illustrating the annotation of information on mutation sites by the custom annotation unit according to the embodiment of the present invention;
FIG. 4 is a block diagram illustrating a flow chart of a data processing method for classifying pathogenicity of genetic variations in a coding region according to an embodiment of the present invention.
Reference numerals: 1, variant site discovery module; 2, variant site annotation module; 3, data and resource loading module; 4, pathogenicity determination and classification module; 5, result interpretation and verification module.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention.
The data processing system for classifying the pathogenicity of genetic variation in coding regions is constructed with the ACMG guideline as its theoretical basis, where the coding region means the coding regions of all proteins on the genome, that is, all exon regions on the genome. For data processing, the invention uses the BWA-MEM algorithm, the Picard tool and the GATK tool. The Picard tool comprises the AddOrReplaceReadGroups and MarkDuplicates algorithms, and the GATK tool comprises the HaplotypeCaller, GenotypeGVCFs, SelectVariants, VariantRecalibrator, ApplyRecalibration and CombineVariants algorithms.
With reference to FIGS. 1 to 4, the data processing system for classifying the pathogenicity of genetic variation in coding regions comprises a variant site discovery module 1, a variant site annotation module 2, a data and resource loading module 3, a pathogenicity determination and classification module 4, and a result interpretation and verification module 5, connected in sequence. The variant site discovery module 1 is used for locating the variant sites in the coding region, the variant sites comprising SNPs and small-fragment INDELs. The variant site annotation module 2 is used for annotating the variant sites and generating a data file corresponding to each variant site; the annotation comprises the chromosome on which the variant site is located, the reference allele, the alternate allele, the exon position, the rarity, the gene in which the variant site is located, the amino acid change, the scores and predictions of variant deleteriousness produced by various computational tools, and the variant frequency in different populations. The data and resource loading module 3 is used for reading the data files and the external resource files used for pathogenicity determination; the external resource files comprise the gene lists used for pathogenicity determination and the ClinVar, OMIM, dbscSNV and dbNSFP databases. The pathogenicity determination and classification module 4 is used for calculating the value of every criterion for each variant site in the data file and classifying the pathogenicity of the genetic variation; the criteria comprise PVS1, PS4, PM1, PM2, PM4, PM5, PP2, PP3, PP5, BS1, BS2, BP1, BP3, BP4, BP6, BP7 and BA1. The result interpretation and verification module 5 is used for interpreting the classification results and submitting them for manual verification.
The data processing method for classifying the pathogenicity of genetic variation in coding regions comprises the following steps: S1, inputting raw sequencing data consisting of sequence data into the variant site discovery module, mapping the sequence data to a reference genome with the BWA-MEM algorithm, preprocessing the mapped sequence data with the Picard tool, and finding the variant sites of the sequence data with the GATK tool; S2, annotating all variant sites with the variant site annotation module to generate a data file corresponding to each variant site; S3, reading the data file with the data and resource loading module, and at the same time reading the external resource files used for pathogenicity determination; S4, calculating the value of every criterion for each variant site based on the ACMG guideline, scoring the criteria, summarizing the scores, and classifying the pathogenicity based on the summary; and S5, interpreting the pathogenicity classification, using the interpretation as the basis of the classification, comparing the classification result with the interpretations given by the ClinVar genetic variation database and the InterVar interpretation tool, and submitting the comparison result for further manual review and confirmation.
In the embodiment of the invention, the variant site discovery module 1 comprises a sequence alignment and mapping unit, a sequence data preprocessing unit, and an SNP and small-fragment INDEL discovery unit. The sequence alignment and mapping unit is used for receiving raw sequencing data consisting of sequence data and mapping it to a reference genome; the sequence data preprocessing unit is used for preprocessing the mapped sequence data; the SNP and small-fragment INDEL discovery unit is used for identifying variant sites relative to the reference genome and calculating the genotype of each variant site. The BWA-MEM algorithm maps the raw sequencing data to the reference genome. Preprocessing is carried out with the Picard tool, which comprises the AddOrReplaceReadGroups and MarkDuplicates algorithms: AddOrReplaceReadGroups adds read group information to the mapped bam file, and MarkDuplicates marks duplicate reads in the sequence data to mitigate bias introduced by data generation steps such as PCR amplification. The sequence data is then sorted with the Picard tool, and finally the base quality scores are recalibrated with the GATK tool. The variant site discovery module finds the variant sites as follows:
first, the GATK HaplotypeCaller method is run independently on each sample in GVCF mode to generate an intermediate file in GVCF format; next, the GATK GenotypeGVCFs method combines the single-sample GVCF files into a multi-sample VCF file; the GATK SelectVariants method is then used to separate SNPs from INDELs; after that, the genetic variants are recalibrated with the GATK VariantRecalibrator and ApplyRecalibration algorithms to filter the variant sites; finally, the GATK CombineVariants method merges the SNPs and INDELs into one VCF file.
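The first three calling steps above can be sketched as Python functions that build the command lines. This is a hedged illustration, not the patented implementation: it assumes GATK 3.x command syntax (suggested by the use of CombineVariants), the jar name and all file names are hypothetical, and the recalibration and merging steps are omitted because their parameters are not given in the description.

```python
from typing import List

# Hypothetical launcher; the jar name and location are assumptions.
GATK3 = ["java", "-jar", "GenomeAnalysisTK.jar"]


def haplotype_caller_cmd(ref: str, bam: str, gvcf: str) -> List[str]:
    """Per-sample calling in GVCF mode."""
    return GATK3 + ["-T", "HaplotypeCaller", "-R", ref, "-I", bam,
                    "--emitRefConfidence", "GVCF", "-o", gvcf]


def genotype_gvcfs_cmd(ref: str, gvcfs: List[str], out_vcf: str) -> List[str]:
    """Joint genotyping of the single-sample GVCF files."""
    cmd = GATK3 + ["-T", "GenotypeGVCFs", "-R", ref]
    for gvcf in gvcfs:
        cmd += ["--variant", gvcf]
    return cmd + ["-o", out_vcf]


def select_variants_cmd(ref: str, vcf: str, variant_type: str,
                        out_vcf: str) -> List[str]:
    """Extract either SNPs or INDELs from the multi-sample VCF."""
    if variant_type not in ("SNP", "INDEL"):
        raise ValueError("variant_type must be 'SNP' or 'INDEL'")
    return GATK3 + ["-T", "SelectVariants", "-R", ref, "-V", vcf,
                    "-selectType", variant_type, "-o", out_vcf]


# Print the commands instead of running them, so the sketch has no side effects.
print(" ".join(haplotype_caller_cmd("ref.fa", "s1.bam", "s1.g.vcf")))
print(" ".join(genotype_gvcfs_cmd("ref.fa", ["s1.g.vcf", "s2.g.vcf"], "all.vcf")))
print(" ".join(select_variants_cmd("ref.fa", "all.vcf", "SNP", "snps.vcf")))
```

Returning argument lists rather than shell strings lets each command be passed directly to `subprocess.run` without quoting issues.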
In the embodiment of the present invention, the variant site annotation module 2 annotates the variant site information through the site annotation unit. Specifically, either all variant sites or designated variant sites can be annotated using the dbNSFP database, as follows:
first, the VCF file generated by the variant site discovery module is parsed, and the single-sample VCF file is converted into a tab-delimited format, with positions adjusted for INDELs to eliminate redundancy; the output file contains the chromosome number of the variant, the start and end coordinates of the variant, the reference allele and the alternate allele; this file is then taken as input, and data resources such as the mappings from transcript identifiers to genes and to protein identifiers are used to obtain an output file containing the gene name, gene region, transcript identifier, protein identifier and similar information; next, amino acid and sequence data resources are used to obtain the exon in which the variant is located and to classify the variant type, yielding the base change and the amino acid change; whether the variant occurs at a splice site is analyzed and the splice site identifier is obtained; finally, the scores and predictions of the SIFT, PolyPhen2, MutationTaster, LRT, FATHMM, CADD, MetaSVM, ClinVar and InterVar function prediction tools for the variant are obtained from the dbNSFP database resources.
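The first conversion step can be illustrated with a short Python sketch. The description does not specify how INDEL positions are adjusted, so the convention below (strip the bases shared by both alleles and write the empty allele as "-") is an assumption modeled on common annotation tools, and the function names are hypothetical.

```python
from typing import Tuple

Row = Tuple[str, int, int, str, str]


def vcf_to_tab(chrom: str, pos: int, ref: str, alt: str) -> Row:
    """Convert one VCF allele pair to (chrom, start, end, ref, alt)."""
    if len(ref) == len(alt):
        # Substitution: coordinates and alleles are kept unchanged.
        return (chrom, pos, pos + len(ref) - 1, ref, alt)
    # INDEL: drop the leading bases that both alleles share.
    shared = 0
    while shared < min(len(ref), len(alt)) and ref[shared] == alt[shared]:
        shared += 1
    ref_rest, alt_rest = ref[shared:], alt[shared:]
    if not ref_rest:
        # Pure insertion: anchor on the last shared reference base.
        anchor = pos + shared - 1
        return (chrom, anchor, anchor, "-", alt_rest)
    start = pos + shared
    return (chrom, start, start + len(ref_rest) - 1, ref_rest, alt_rest or "-")


def to_line(row: Row) -> str:
    """Render one converted variant as a tab-delimited line."""
    return "\t".join(str(field) for field in row)


print(to_line(vcf_to_tab("chr1", 100, "A", "G")))   # substitution
print(to_line(vcf_to_tab("chr1", 100, "AT", "A")))  # deletion of T
print(to_line(vcf_to_tab("chr1", 100, "A", "AT")))  # insertion of T
```

Removing the shared anchor base means the same deletion or insertion is written identically regardless of how much flanking context the VCF record carried.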
In the embodiment of the present invention, the data and resource loading module 3 comprises an annotation data loading unit and an external resource loading unit. The annotation data loading unit is used for reading and storing the data file. The external resource loading unit is used for reading all external resource files needed for the analysis, for example: reading the Loss of Function (LOF) gene list; reading the missense gene list; reading all pathogenic variant information in the ClinVar database, comprising the chromosome, start coordinate, base change, amino acid change and corresponding gene name of each pathogenic variant; reading the relevant information of the OMIM database, comprising the gene name corresponding to each OMIM number, the list of OMIM numbers of recessive genetic diseases, the list of OMIM numbers of dominant genetic diseases and the list of Orpha numbers corresponding to the OMIM numbers; obtaining the variant information related to PS4 from GWASdb, comprising the chromosome number, position in hg19, SNP number, reference allele and alternate allele; reading the benign protein domains used for PM1 determination; reading the BP1 gene list to obtain gene names; reading the rmsk ranges used for PM4 and BP3 determination, specifically the chromosome number, start position and end position; reading the heterozygote and homozygote information used for BS2 determination; and loading the dbscSNV data, comprising the chromosome number, position, reference allele, alternate allele, ada score and rf score.
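As an illustration of the last loading step, the Python sketch below reads dbscSNV-style rows into a lookup table keyed by variant. The column names and the toy rows are hypothetical, not taken from the real dbscSNV file layout; the 0.6 cut-off for both scores is the one the description uses when judging BP7.

```python
import csv
import io
from typing import Dict, TextIO, Tuple

Key = Tuple[str, int, str, str]  # (chromosome, position, ref, alt)


def load_dbscsnv(handle: TextIO) -> Dict[Key, Tuple[float, float]]:
    """Map each variant to its (ada score, rf score) pair."""
    table: Dict[Key, Tuple[float, float]] = {}
    # Column names below are assumptions made for this sketch.
    for row in csv.DictReader(handle, delimiter="\t"):
        key = (row["chr"], int(row["pos"]), row["ref"], row["alt"])
        table[key] = (float(row["ada_score"]), float(row["rf_score"]))
    return table


def no_splicing_effect(ada: float, rf: float, threshold: float = 0.6) -> bool:
    """Both scores must be below the threshold, as required for BP7."""
    return ada < threshold and rf < threshold


# Toy data for demonstration only.
toy = ("chr\tpos\tref\talt\tada_score\trf_score\n"
       "1\t100\tA\tG\t0.10\t0.20\n"
       "1\t200\tC\tT\t0.90\t0.30\n")
scores = load_dbscsnv(io.StringIO(toy))
for key, (ada, rf) in scores.items():
    print(key, no_splicing_effect(ada, rf))
```

Keying the table by the full (chromosome, position, ref, alt) tuple gives constant-time lookup when each annotated variant site is evaluated later.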
In the embodiment of the invention, the pathogenicity determination and classification module comprises a pathogenicity determination unit and a pathogenicity classification unit. The pathogenicity determination unit is used for calculating the value of every criterion for each variant site; the pathogenicity classification unit classifies all variant sites according to the values calculated by the pathogenicity determination unit. The criteria are evaluated as follows:
for criterion PVS1, first check whether the gene containing the variant site is in the LOF (Loss of Function) gene list, then check the annotation of the variant site; if the variant is a nonsense variant, a frameshift variant, a variant at a +/-1 or 2 splice site, an initiation codon variant, or a deletion of one or more exons, PVS1 holds and takes the value 1;
for criterion PS1, first check the variant type; if it is a missense variant, check whether the variant is present in the loaded ClinVar pathogenic variant data, then check whether the variant is a different nucleotide change that results in the same amino acid change; if so, PS1 holds and takes the value 1, otherwise 0;
for criterion PS4, which concerns whether the prevalence of the variant in affected individuals is significantly increased compared with the prevalence in controls, first check the variant type; if it is a missense variant, check whether the variant appears in the PS4 resource file list; if it does, PS4 holds and takes the value 1, otherwise 0;
for criterion PM1, because protein domains play an important role in protein function, missense variants in these regions tend to cause disease; first check the variant type; if it is a missense variant, check whether the variant is in the resource file list of PM1 benign protein domains; if it is, PM1 does not hold and takes the value 0, otherwise 1;
for criterion PM2, first check whether the variant is absent from control populations such as ESP (Exome Sequencing Project), G1000 (1000 Genomes Project) and ExAC (Exome Aggregation Consortium); if so, PM2 holds and takes the value 1; in addition, check the OMIM recessive genetic disease resource list, and if the disease is recessive, check whether the variant occurs at extremely low frequency in the control populations, using 0.005 as the frequency threshold;
for criterion PM4, check the RMSK database from the UCSC Genome Browser; if the variant is a non-frameshift INDEL in a non-repetitive region, or the variant type is stop-loss, PM4 holds and takes the value 1, otherwise 0;
for criterion PM5, first check the variant type; if it is a missense variant, check whether the variant is present in the loaded ClinVar pathogenic variant data, then check whether the variant is a different nucleotide change; if it results in a different amino acid change, PM5 holds and takes the value 1, otherwise 0;
for criterion PP2, first check the variant type; if it is a missense variant, check whether missense variation is a common cause of disease and whether the gene containing the variant has few benign variants, using the ClinVar pathogenic data as the basis for the common disease mechanism; if so, PP2 takes the value 1, otherwise 0;
for criterion PP3, multiple pieces of computational evidence (SIFT, PolyPhen2 HVAR, LRT, MutationTaster, MutationAssessor, FATHMM, CADD, MetaSVM) support a deleterious effect of the variant; if only 1 piece of computational evidence predicts benign, fewer than 3 are of unknown classification; if at most 3 pieces of computational evidence give unknown results and all the others predict deleterious, PP3 holds and takes the value 1, otherwise 0;
for criterion PP5, check the clnsig field in the variant annotation; if clnsig is likely pathogenic or pathogenic, PP5 holds and takes the value 1, otherwise 0;
for criterion BP1, first check the variant type; if it is a missense variant, check whether truncating variants in the gene containing the variant are what cause disease, which is done by checking whether the gene is in the BP1 gene list; if it is, BP1 holds and takes the value 1, otherwise 0;
for criterion BP3, check the RMSK database from the UCSC Genome Browser; if the variant is a non-frameshift INDEL in a repetitive region, BP3 holds and takes the value 1, otherwise 0;
for criterion BP4, multiple pieces of computational evidence (SIFT, PolyPhen2 HVAR, LRT, MutationTaster, MutationAssessor, FATHMM, CADD, MetaSVM) suggest that the variant has no deleterious effect; if only 1 piece of computational evidence predicts deleterious, fewer than 3 are of unknown classification; if at most 3 pieces of computational evidence give unknown results and all the others predict benign, BP4 holds and takes the value 1, otherwise 0;
for criterion BP6, check the clnsig field in the variant annotation; if clnsig is likely benign or benign, BP6 holds and takes the value 1, otherwise 0;
for criterion BP7, if a synonymous variant has no effect on splicing and the nucleotide position is not highly conserved, the variant is classified as benign and BP7 holds, taking the value 1; for the variant to be predicted to have no effect on splicing, both dbscSNV_RF_SCORE and dbscSNV_ADA_SCORE must be less than 0.6; the prediction of nucleotide conservation is retrieved from the dbnsfp30a database, where a GERP++ score greater than 2 indicates that the nucleotide is highly conserved;
for criterion BA1, check whether the allele frequency in ESP, G1000, ExAC or SEC in the variant annotation is higher than a specified threshold, set to 0.05; if it is, BA1 holds and takes the value 1, otherwise 0;
for criterion BS1, in the ExAC browser, if the adjusted allele frequency is greater than a specified threshold, set to 0.01, BS1 holds and takes the value 1, otherwise 0;
for criterion BS2, first check the variant type; if it is a missense variant, check whether the variant satisfies a homozygote, dominant or X-linked disease; if so, BS2 holds and takes the value 1, otherwise 0.
The pathogenicity classification unit classifies according to the values calculated by the determination unit and the following rules: if PVS1 equals 1 and, in addition, the number of PS is greater than or equal to 1, or the number of PM is greater than or equal to 2, or the number of PM equals 1 and the number of PP equals 1, or the number of PP is greater than or equal to 2, the classification result is pathogenic; if the number of PS is greater than or equal to 2, the classification result is also pathogenic; if the number of PS equals 1 and the number of PM is greater than or equal to 3, the classification result is also pathogenic; if the number of PS equals 1 and the number of PM equals 2 
and the number of the PP is more than or equal to 2, the classification result is still pathogenic; if the number of PVS1 is equal to 1 and the number of PM is equal to 1, or the number of PS is equal to 1 and the number of PM is equal to 1 or 2, or the number of PS is equal to 1 and the number of PP is equal to or more than 2, or the number of PM is equal to or more than 3, or the number of PM is equal to or more than 2 and the number of PP is equal to or more than 2, the classification result is possibly pathogenic; if the number of BA is equal to 1 or the number of BS is more than or equal to 2, the classification result is benign; if the number of the BSs is equal to 1 and the number of the BPs is equal to 1 or the number of the BPs is more than or equal to 2, the classification result is possibly benign; in other cases, the classification result is uncertain.
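The threshold-based discriminants and the combination rules described above can be sketched in Python as follows. This is only an illustrative sketch, not the implementation of the invention: the function names, the dictionary layout and the argument names are hypothetical, scores are assumed to be already available as numbers, PVS1 is read as being combined with each of the alternatives that follow it, and the rules are applied in the order in which the text states them. The labels "likely pathogenic" and "likely benign" are the usual ACMG names for the results also rendered as "possibly pathogenic" and "possibly benign".

```python
from collections import Counter


def ba1(allele_freqs, threshold=0.05):
    """BA1 = 1 if any population allele frequency exceeds the threshold."""
    return int(any(f is not None and f > threshold for f in allele_freqs))


def bs1(exac_adjusted_af, threshold=0.01):
    """BS1 = 1 if the adjusted ExAC allele frequency exceeds the threshold."""
    return int(exac_adjusted_af is not None and exac_adjusted_af > threshold)


def bp7(is_synonymous, rf_score, ada_score, gerp,
        splice_cutoff=0.6, gerp_cutoff=2.0):
    """BP7 = 1 for a synonymous variant with no predicted splicing effect
    (both dbscSNV scores below 0.6) at a position that is not highly
    conserved (GERP++ not greater than 2)."""
    no_splice_effect = rf_score < splice_cutoff and ada_score < splice_cutoff
    not_conserved = gerp <= gerp_cutoff
    return int(is_synonymous and no_splice_effect and not_conserved)


def count_by_group(discriminants):
    """Count established discriminants (value 1) per group, e.g. PM2 -> PM."""
    counts = Counter()
    for name, value in discriminants.items():
        if value == 1:
            counts[name.rstrip("0123456789")] += 1
    return counts


def classify(discriminants):
    """Apply the combination rules in the order given in the description."""
    c = count_by_group(discriminants)
    pvs, ps, pm, pp = c["PVS"], c["PS"], c["PM"], c["PP"]
    ba, bs, bp = c["BA"], c["BS"], c["BP"]

    if pvs == 1 and (ps >= 1 or pm >= 2 or (pm == 1 and pp == 1) or pp >= 2):
        return "pathogenic"
    if ps >= 2:
        return "pathogenic"
    if ps == 1 and (pm >= 3 or (pm == 2 and pp >= 2)):
        return "pathogenic"
    if ((pvs == 1 and pm == 1) or (ps == 1 and pm in (1, 2))
            or (ps == 1 and pp >= 2) or pm >= 3 or (pm >= 2 and pp >= 2)):
        return "likely pathogenic"
    if ba == 1 or bs >= 2:
        return "benign"
    if (bs == 1 and bp == 1) or bp >= 2:
        return "likely benign"
    return "uncertain"


# Hypothetical usage: values would normally come from the annotation file.
evidence = {
    "BA1": ba1([0.001, 0.08]),
    "BS1": bs1(0.002),
    "BP7": bp7(True, rf_score=0.1, ada_score=0.2, gerp=1.3),
}
print(classify(evidence))
```

Because the description does not say how conflicting pathogenic and benign evidence is resolved, the sketch simply returns the first rule that matches; a real system might prefer to report such cases as uncertain.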
In the embodiment of the invention, the result interpretation and verification module 5 comprises a result interpretation unit and a verification unit. The result interpretation unit provides the discrimination results of the pathogenicity discrimination and classification module together with the basis for each classification result, and provides a visual interface for manual reference, so that the pathogenicity of a genetic variation can be judged according to the actual situation. The verification unit compares the pathogenicity classification of each genetic variation with the results of the ClinVar and InterVar genetic variation data interpretation tools; if the results are inconsistent, the variation is prominently flagged and submitted for further manual review, which determines the final classification.
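The comparison performed by the verification unit can be sketched as follows. This is a minimal illustration under stated assumptions: the record fields (`id`, `classification`, `clinvar`, `intervar`) and both function names are hypothetical, and labels are compared after a simple normalization of case and underscores rather than through any official ClinVar or InterVar vocabulary mapping.

```python
def normalize(label):
    """Normalize a classification label for comparison (assumed convention)."""
    return label.strip().lower().replace("_", " ")


def flag_discordant(variants):
    """Return the variants whose classification disagrees with the ClinVar
    or InterVar result, so that they can be reviewed manually."""
    flagged = []
    for v in variants:
        ours = normalize(v["classification"])
        # Only compare against sources that actually report a result.
        references = {
            source: normalize(v[source])
            for source in ("clinvar", "intervar")
            if v.get(source)
        }
        disagreeing = sorted(s for s, label in references.items() if label != ours)
        if disagreeing:
            flagged.append(
                {"id": v["id"], "ours": ours, "disagrees_with": disagreeing}
            )
    return flagged


# Hypothetical records for illustration only.
variants = [
    {"id": "v1", "classification": "Pathogenic",
     "clinvar": "pathogenic", "intervar": "Likely_pathogenic"},
    {"id": "v2", "classification": "benign",
     "clinvar": "Benign", "intervar": None},
]
for item in flag_discordant(variants):
    print(item)
```

Variants that are not returned by `flag_discordant` would be treated as concordant; a missing reference result is skipped rather than counted as a disagreement, which is a design choice of this sketch.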
The processing system is formed by connecting in sequence a variation site discovery module, a variation site annotation module, a data and resource loading module, a pathogenicity discrimination and classification module, and a result interpretation and verification module. The variation site discovery module finds all variation sites in the coding region; the variation site annotation module annotates the information of each variation site and generates a data file corresponding to each site; the data and resource loading module then reads these data files and the external resource files used for pathogenicity discrimination. Next, the pathogenicity discrimination and classification module calculates the value of each discriminant for each variation site, scores each discriminant, summarizes all discriminants according to the scores, and classifies the pathogenicity of all genetic variations according to the summary. Finally, the result interpretation and verification module gives the classification basis for the pathogenicity of all genetic variations and compares the classification with the results of the genetic variation data interpretation tools; if they are inconsistent, the classification is further reviewed and confirmed manually. Compared with the prior art, the invention realizes semi-automatic and systematic processing of large-scale samples and massive variation site information for sequencing data of target gene sets and whole exomes; it integrates the processing carried out by the variation site discovery module, the variation site annotation module, the data and resource loading module, the pathogenicity discrimination and classification module and the result interpretation and verification module, so that the whole data processing flow is standardized and systematic; and it can speed up the analysis of genetic disease gene data, greatly improve the working efficiency of personnel who interpret genetic disease data, and avoid errors caused by the complexity of the processing.
Although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments described in the foregoing detailed description, or equivalent changes may be made in some of the features of the embodiments described above. All equivalent structures made by using the contents of the specification and the attached drawings of the invention can be directly or indirectly applied to other related technical fields, and are also within the protection scope of the patent of the invention.

Claims (9)

1. A data processing system for pathogenicity classification of genetic variation in a coding region, characterized in that it is constructed with the ACMG (American College of Medical Genetics and Genomics) guidelines as its theoretical basis and comprises a variation site discovery module, a variation site annotation module, a data and resource loading module, a pathogenicity discrimination and classification module and a result interpretation and verification module connected in sequence, so as to realize semi-automatic and systematic processing of large-scale samples and massive variation site information for sequencing data of a target gene set and a whole exome, wherein:
the variation site discovery module is used for finding the specific positions of the genetic variation sites in the coding region; the variation sites comprise SNPs and small-fragment INDELs;
the variation site annotation module is used for annotating the variation sites with information and generating a data file corresponding to each variation site; the information annotation comprises the chromosome where the variation site is located, the reference allele, the alternate allele, the exon position, the rarity, the gene where the variation site is located, the amino acid change, the scores and predictions of variation harmfulness given by multiple computational tools, and the variation frequency in different populations;
the data and resource loading module is used for reading the data files and the external resource files used for pathogenicity discrimination; the external resource files comprise gene lists for pathogenicity discrimination and the ClinVar, OMIM, dbscSNV and dbNSFP databases; the annotation data loading unit is used for reading and storing the data files;
the pathogenicity discrimination and classification module is used for calculating the values of all discriminants for each variation site in the data file, scoring each discriminant, summarizing all discriminants according to the scores, and classifying the pathogenicity of all genetic variations according to the summary; the discriminants comprise PVS1, PS4, PM1, PM2, PM4, PM5, PP2, PP3, PP5, BS1, BS2, BP1, BP3, BP4, BP6, BP7 and BA1;
the result interpretation and verification module comprises a result interpretation unit and a verification unit, wherein the result interpretation unit is used for providing the discrimination results of the pathogenicity discrimination and classification module and the basis for each classification result, and for providing a visual interface for manual reference, so that the pathogenicity of a genetic variation can be judged according to the actual situation; the verification unit is used for comparing the pathogenicity classification of each genetic variation with the results of the ClinVar and InterVar genetic variation data interpretation tools, and, if the results are inconsistent, for prominently flagging the variation and submitting it for further manual review, which determines the classification.
2. The data processing system for classification of pathogenicity of genetic variation in coding region according to claim 1, wherein the variation site discovery module comprises a sequence alignment and mapping unit, a sequence data preprocessing unit, and a SNPs and small fragment INDELs variation discovery unit; the sequence alignment and mapping unit is used for receiving original sequencing data consisting of sequence data and mapping the sequence data to a reference genome; the sequence data preprocessing unit is used for preprocessing the sequence data mapped on the reference genome; the SNPs and small-fragment INDELs mutation discovery unit is used for identifying mutation sites of the preprocessed sequence data relative to a reference genome and calculating the genotype of each mutation site.
3. The data processing system for classification of pathogenicity of genetic variation in coding region according to claim 2, wherein the input of the variant site discovery module is a raw sequencing data file in fastq format and the output of the variant site discovery module is a vcf format file containing all variant sites.
4. The data processing system for classification of pathogenicity of genetic variation in coding region according to claim 3, wherein said variation site discovery module performs mapping of said raw sequencing data using BWA-MEM algorithm; the mutation site discovery module realizes search operation of the mutation site by using a GATK tool.
5. The data processing system of claim 1, wherein said variation site annotation module comprises a site annotation unit for annotating said SNPs and said small-fragment INDELs and for selecting said variation sites for information annotation; the specific process is as follows: first, the VCF file generated by the variation site discovery module is parsed, and a file is output containing the chromosome number of each variation, the start and end coordinates of the variation, the reference allele and the alternate allele; then, this file is used as input to obtain an output file containing the gene name, the gene region, the transcript code and the protein code; then, the position of the exon where the variation lies is obtained from amino acid and sequence data resources, and the variation is classified by type to obtain the base change and the amino acid change; it is analyzed whether the variation occurs at a splice site, and the code of the splice site is obtained; finally, the scores and predictions for the variation from the SIFT, polyphen2, mutationTaster, LRT, FATHMM, CADD, metaSVM, ClinVar and InterVar function prediction tools are obtained by means of dbNSFP database resources.
6. The data processing system for classification of pathogenicity of genetic variation in coding region according to claim 1, wherein said data and resource loading module comprises an annotation data loading unit and an external resource loading unit, said annotation data loading unit is configured to read and store said data file; the external resource loading unit is used for reading all external resource files to be analyzed.
7. The data processing system of claim 1, wherein the pathogenicity judging and classifying module comprises a pathogenicity judging unit and a pathogenicity classifying unit, and the pathogenicity judging unit is configured to calculate values of all the judging items in each of the mutation sites; and the pathogenicity classification unit classifies all the mutation sites according to the calculation result of the corresponding pathogenicity judgment unit.
8. A data processing method for classification of pathogenicity of genetic variation of coding region, which is applied to the data processing system for classification of pathogenicity of genetic variation of coding region as claimed in any one of claims 1 to 7, the method comprising the steps of:
S1, inputting raw sequencing data consisting of sequence data into the variation site discovery module, mapping the sequence data to a reference genome using the BWA-MEM algorithm, preprocessing the mapped sequence data using the Picard tool, and finding the variation sites of the sequence data using the GATK tool;
S2, annotating all the variation sites with information using the variation site annotation module to generate a data file corresponding to each variation site;
S3, reading the data files using the data and resource loading module, and at the same time reading the external resource files used for pathogenicity discrimination;
S4, calculating the values of all discriminants for each variation site based on the ACMG guidelines, scoring the discriminants, summarizing them according to the scores, and classifying pathogenicity based on the summary;
S5, interpreting the pathogenicity classification, taking the interpretation as the classification basis, comparing the classification result with the interpretation results of the ClinVar and InterVar genetic variation database interpretation tools, and submitting the comparison result for further manual review and confirmation.
9. The method of claim 8, wherein in step S5, if the classification result is inconsistent with the interpretation results of the ClinVar and InterVar genetic variation database interpretation tools, the classification result is submitted for manual review and confirmation; otherwise, the pathogenicity classification is complete.
CN201811374374.4A 2018-11-19 2018-11-19 Data processing system and method for genetic variation pathogenicity classification of coding region Active CN109616155B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811374374.4A CN109616155B (en) 2018-11-19 2018-11-19 Data processing system and method for genetic variation pathogenicity classification of coding region

Publications (2)

Publication Number Publication Date
CN109616155A CN109616155A (en) 2019-04-12
CN109616155B true CN109616155B (en) 2023-04-18

Family

ID=66004147

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811374374.4A Active CN109616155B (en) 2018-11-19 2018-11-19 Data processing system and method for genetic variation pathogenicity classification of coding region

Country Status (1)

Country Link
CN (1) CN109616155B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111863132A (en) * 2019-04-29 2020-10-30 广州欧蒙未一医学检验实验室有限公司 Method and system for screening pathogenic variation
CN110245685B (en) * 2019-05-15 2022-03-25 清华大学 Method, system and storage medium for predicting pathogenicity of genome single-site variation
CN110379458A (en) * 2019-07-15 2019-10-25 中国人民解放军陆军军医大学第一附属医院 Pathogenicity variation site determination method, device, computer equipment and storage medium
CN110544508B (en) * 2019-07-29 2023-03-10 荣联科技集团股份有限公司 Method and device for analyzing monogenic genetic disease genes and electronic equipment
CN110957006B (en) * 2019-12-14 2023-08-11 杭州联川基因诊断技术有限公司 Interpretation method of BRCA1/2 gene variation
CN112380167B (en) * 2020-11-17 2024-06-25 深圳市和讯华谷信息技术有限公司 Batch data verification method and device, computer equipment and storage medium
CN114429785B (en) * 2022-04-01 2022-07-19 普瑞基准生物医药(苏州)有限公司 Automatic classification method and device for genetic variation and electronic equipment
CN117373696B (en) * 2023-12-08 2024-03-01 神州医疗科技股份有限公司 Automatic genetic disease interpretation system and method based on literature evidence library

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105861697A (en) * 2016-05-13 2016-08-17 万康源(天津)基因科技有限公司 System for detecting potential pathogenic variants of exome based on family
CN105925685A (en) * 2016-05-13 2016-09-07 万康源(天津)基因科技有限公司 Exome potential pathogenic mutation detection method based on family line
CN106599613A (en) * 2016-12-15 2017-04-26 博奥生物集团有限公司 Method for judging genetic tumor variation site classification
CN108108592A (en) * 2017-12-29 2018-06-01 北京聚道科技有限公司 A kind of construction method of machine learning model for the pathogenic marking of hereditary variation

Also Published As

Publication number Publication date
CN109616155A (en) 2019-04-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant