CN115458055A

CN115458055A - Analysis method, system and equipment for predicting mutation pathogenicity based on target characteristics

Info

Publication number: CN115458055A
Application number: CN202211290688.2A
Authority: CN
Inventors: 吴南; 赵恒强; 杜华康; 赵森
Original assignee: Peking Union Medical College Hospital Chinese Academy of Medical Sciences
Current assignee: Peking Union Medical College Hospital Chinese Academy of Medical Sciences
Priority date: 2022-10-21
Filing date: 2022-10-21
Publication date: 2022-12-09

Abstract

The invention relates to a method, a system and equipment for predicting mutation pathogenicity based on target features. The method comprises the following steps: obtaining a mutation of a gene to be predicted; mapping the mutation onto the protein structure of the gene to be predicted to obtain a mutation-mapped protein structure; extracting features from the mutation-mapped protein structure to obtain the structural features of the mutant protein; performing feature selection on those structural features to obtain a feature selection result; and finally inputting the feature selection result into a classification prediction model to classify the mutation as a pathogenic mutation or a benign mutation. The method aims to predict mutation pathogenicity from target features among the structural features of the mutant protein, and to explore the predictive power and potential application value of those features for mutation pathogenicity.

Description

Analysis method, system and equipment for predicting mutation pathogenicity based on target characteristics

Technical Field

The invention relates to the field of gene data analysis, and in particular to a method, equipment, a system and a computer-readable storage medium for analyzing and predicting mutation pathogenicity based on target features, and to applications thereof.

Background

The human genome comprises about 3.16 billion base pairs, encodes about 30,000 genes, and exhibits on average one variation per 500-1000 bases. The protein-coding regions alone harbor a large number of differences between individuals, and to date about 6.5 million missense variations have been observed. In fact, across the genomes of the roughly 8 billion people living on earth, a variation may exist at every protein position except those where variation is lethal. The clinically relevant variations among them have highly variable effects on protein structure and function, and only a small fraction are pathogenic.

The AlphaFold2 protein structure generation model expands the structural coverage of the human proteome from 17% to 98.5%, providing extensive and accurate structural information with unprecedented potential to help predict mutational effects. To assess the pathogenicity of genetic mutations, many state-of-the-art variant effect predictors have been developed, such as SIFT, REVEL and EVE. Given the close relationship between the structure and function of proteins, the structural context of a mutation is promising information that is independent of mutation frequency and evolutionary conservation. The degree of burial or exposure of residues in the 3D structure is critical for protein folding and stability. Thus, specific structural features of mutant proteins are powerful predictors of the pathogenicity of mutations.

Disclosure of Invention

The method disclosed by the invention combines information on the gene to be predicted with predicted protein structure information, mines the variation features in the gene data based on target features, and then predicts the pathogenicity of mutations in the gene to be predicted, thereby exploring the predictive power and potential application value of these features for variant pathogenicity and addressing related life science problems.

The application discloses a method for analyzing and predicting mutation pathogenicity based on target features, which comprises the following steps:

obtaining mutation of a gene to be predicted, wherein the gene to be predicted comprises a wild type gene;

inputting the gene to be predicted into a protein structure generation model to obtain the protein structure of the gene to be predicted;

mapping the mutation of the gene to be predicted to the protein structure of the gene to be predicted to obtain the protein structure mapped with the mutation;

carrying out feature extraction on the protein structure mapped with mutation to obtain structural features of the mutant protein;

carrying out feature selection on the structural features of the mutant protein to obtain a feature selection result;

and inputting the characteristic selection result into a classification prediction model to obtain a classification result of the mutation into a pathogenic mutation or a benign mutation.

Further, obtaining the mutation of the gene to be predicted comprises preprocessing the mutation; optionally, the preprocessing comprises using an algorithm to predict the effect of the mutation on splicing and excluding mutations whose scores exceed a threshold; preferably, the SpliceAI algorithm is used to predict the effect of the mutation on splicing, and mutations with a score greater than 0.5 are excluded;

optionally, the preprocessing further comprises excluding mutations in proteins whose number of residues exceeds a threshold;

optionally, the preprocessing comprises excluding mutations that cannot be mapped to the structure due to isoform inconsistencies;

optionally, the preprocessing comprises excluding mutations whose clinical significance is "conflicting interpretations of pathogenicity".

Further, the protein structure generation model comprises one or more of the following models: AlphaFold, AlphaFold2, ProteoGAN; preferably, the protein structure generation model is AlphaFold2.

Further, the classification prediction model obtains the classification result of the mutation into a pathogenic mutation or a benign mutation through one or more of the following structural characteristics of the mutant protein, wherein the target characteristics comprise: protein level characteristics, residue level characteristics, mutation level characteristics.

Optionally, the protein level features include thermodynamic features and protein volume features, the residue level features include secondary structure-related features and RSA, and the mutation level features include thermodynamic features and features from empirical rules that determine whether a mutation has a significant effect on protein structure;

preferably, the thermodynamic features include the total unfolding energy and its components, such as van der Waals clashes, hydrogen bond energy and side chain entropy;

preferably, the protein volume features include total volume, void volume and van der Waals volume;

optionally, the secondary structure-related features are assigned using the DSSP program, which generates the features DSSP (H), DSSP (E), DSSP (G), DSSP (I), DSSP (T), DSSP (S), DSSP (C);

optionally, the target features among the structural features of the mutant protein comprise total unfolding energy, van der Waals clashes, hydrogen bond energy, side chain entropy, ionization energy, disulfide bond cleavage, total protein volume, void volume, van der Waals volume, RSA, ΔΔG, DSSP (H), DSSP (E), DSSP (G), DSSP (I), DSSP (T), DSSP (S), DSSP (C);

preferably, the target features among the structural features of the mutant protein include RSA, ΔΔG, side chain entropy, ionization energy and van der Waals clashes.

Further, the classification prediction model is implemented based on a machine learning algorithm, and the processing comprises feature selection, feature fusion, feature calculation and model optimization; the machine learning algorithm comprises any one or more of the following algorithms: GBDT, GBM, SVM, RF, AdaBoost and Apriori; the model optimization comprises one or more of the following methods: the steepest descent method, Newton's method and the quasi-Newton method.
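
As a minimal sketch of the steepest descent option named above, the following fits a small logistic classifier by batch gradient descent on the log-loss. It is an illustrative stand-in, not the GBDT, GBM, SVM or RF models listed in the text; the two features (RSA and ΔΔG), the toy values, the learning rate and the number of epochs are all assumptions.

```python
import math


def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch steepest (gradient) descent on the log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        # Step against the mean gradient.
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, x):
    """Probability that a variant with feature vector x is pathogenic."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Toy data (invented): [RSA, ddG]; label 1 = pathogenic, 0 = benign.
X = [[0.05, 2.5], [0.10, 1.8], [0.15, 3.0], [0.70, 0.2], [0.85, 0.1], [0.60, 0.4]]
y = [1, 1, 1, 0, 0, 0]
w, b = train_logistic(X, y)
```

Newton or quasi-Newton updates would replace only the weight-update step; the loss and the prediction function stay the same.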

Further, the step of inputting the feature selection result into a classification prediction model to obtain a classification result of the mutation into a pathogenic mutation or a benign mutation further comprises:

inputting the feature selection result into a classification prediction model to obtain a classification result of the mutation into a pathogenic mutation or a benign mutation, and outputting a first classification result;

inputting the mutation of the gene to be predicted into a predictor to obtain a mutation effect score of the gene to be predicted, and obtaining a second classification result which is a pathogenic mutation or a benign mutation based on the mutation effect score;

performing weighted fusion on the first classification result and the second classification result to obtain a fused classification result of which the mutation is pathogenic mutation or benign mutation;

the predictor predicts the mutation effect of the gene to be predicted to obtain the mutation effect score of the gene to be predicted;

optionally, the predictor includes any one or more of the following software: Enformer, SIFT, PROVEAN, EVE, DEOGEN2, MutPred, MutationAssessor, PolyPhen-2, ClinPred, REVEL;

preferably, the predictors are PROVEAN, EVE, DEOGEN2 and MutPred;

optionally, the variation effect score of the gene to be predicted comprises one or more of the following scores: sequence conservation score, allele mutation frequency score, missense mutation score, frameshift mutation score, nonsense mutation score, indel mutation score;

preferably, the variation effect score of the gene to be predicted comprises a sequence conservation score, an allele mutation frequency score and a missense mutation score;

the weighted fusion adopts any one or more of the following algorithms: Boosting, XGBoost, AdaBoost and Voting ensemble algorithms.
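
The weighted fusion of the first and second classification results can be illustrated with a simple weighted soft vote. This is only a sketch under stated assumptions: every score is assumed to have been rescaled to the range 0 to 1 with higher values meaning more likely pathogenic, and the weights, the 0.5 decision threshold and the function names are hypothetical rather than taken from the text.

```python
def fuse_scores(structure_prob, predictor_scores, weights):
    """Weighted soft vote over the structure-based probability and predictor scores.

    structure_prob   -- first classification result, as a probability in [0, 1]
    predictor_scores -- dict of predictor name -> score rescaled to [0, 1]
    weights          -- dict of source name -> non-negative weight (hypothetical values)
    """
    values = {"structure": structure_prob, **predictor_scores}
    total_weight = sum(weights[name] for name in values)
    return sum(weights[name] * score for name, score in values.items()) / total_weight


def classify(score, threshold=0.5):
    """Turn a fused score into a pathogenic/benign call."""
    return "pathogenic" if score >= threshold else "benign"


# Invented example: the structure-based model is given twice the weight of each predictor.
weights = {"structure": 2.0, "PROVEAN": 1.0, "EVE": 1.0}
fused = fuse_scores(0.9, {"PROVEAN": 0.6, "EVE": 0.3}, weights)
label = classify(fused)
```

A boosting-based fusion, as also named above, would learn the combination from labeled data instead of using fixed weights.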

A pathogenicity analysis device for predicting mutations based on target features, the device comprising:

a memory and a processor;

the memory is to store program instructions;

the processor is configured to invoke program instructions that, when executed, perform the pathogenicity analysis method for predicting mutations based on target features described above.

A pathogenicity analysis system for predicting mutations based on target features, comprising:

the acquisition module is used for acquiring the gene mutation to be predicted;

the structure generation module is used for mapping the obtained mutation of the gene to be predicted to the protein structure of the gene to be predicted to obtain the protein structure mapped with the mutation;

the structural feature extraction module is used for extracting the features of the protein structure mapped with mutation to obtain the structural features of the mutant protein;

the structural feature selection module is used for carrying out feature selection on the structural features of the mutant protein to obtain a feature selection result;

the classification module is used for inputting the feature selection result into a classification prediction model to obtain a classification result of the mutation into a pathogenic mutation or a benign mutation;

the classification prediction model obtains classification results of the mutations into pathogenic mutations or benign mutations through one or more of the following structural characteristics of the mutant proteins, wherein the structural characteristics of the mutant proteins comprise: protein level characteristics, residue level characteristics, mutation level characteristics.

A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the above-mentioned pathogenicity analysis method for predicting a mutation based on a target feature.

The correlation between structural features and variant pathogenicity is evaluated according to the type of feature. For binary features, a contingency table is built and a chi-square test is used to determine whether the two binary variables are associated. For continuous features, the correlation between the feature and variant pathogenicity is examined by logistic regression analysis. The strength of association between features and variant pathogenicity is quantified using the odds ratio (OR) and its 95% confidence interval (CI). All statistical analyses and data visualization were performed using the R packages h2o, caret, pROC, forestplot, ggpubr, ggsci, viridis and cutpointr; a P value < 0.05 is considered statistically significant.
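
For a binary feature, the contingency-table analysis described above can be sketched as follows. The sketch assumes a 2x2 table, a Wald (Woolf) confidence interval for the odds ratio, and a Pearson chi-square statistic without continuity correction; the text does not state which variants were used, and the counts are invented.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald 95% CI for a 2x2 table.

    a = feature present, pathogenic    b = feature present, benign
    c = feature absent, pathogenic     d = feature absent, benign
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (1 degree of freedom, no continuity correction)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def chi_square_p_value_1df(stat):
    """Upper-tail P value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


# Invented counts for illustration only.
odds_ratio, ci_low, ci_high = odds_ratio_ci(30, 10, 20, 40)
stat = chi_square_2x2(30, 10, 20, 40)
p_value = chi_square_p_value_1df(stat)
```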

The use of the above-described device for assisting the selection of a mutation pathogenicity analysis scheme; optionally, the scheme comprises mutation pathogenicity analysis that is influenced by changes in the type and number of structural features of the mutant protein; optionally, the structural features of the mutant protein comprise three levels of features: protein level features, residue level features and mutation level features; preferably, the target features among the structural features of the mutant protein comprise RSA, ΔΔG, side chain entropy, ionization energy and van der Waals clashes;

the use of the above-described device for prediction of the pathogenicity of a mutation; optionally, the use has a positive impact on the study of diabetes, vascular, skeletal, brain function and anti-aging;

the use of the above-described apparatus for classifying a genetic mutation or predicting a property of a mutation; optionally, the classification of the genetic mutation comprises classification results of the mutation into a pathogenic mutation or a benign mutation; optionally, the attribute includes a gene family to which the mutation of the gene to be predicted belongs, the gene family includes an ion channel and collagen, and when the gene family to which the mutation of the gene to be predicted belongs is an ion channel gene, RSA in structural features of the mutant protein is selected for prediction.

The use of the above system for assisting diagnosis of the occurrence and development of mutation pathogenicity; optionally, the occurrence and development of mutation pathogenicity are related to changes in the type and number of gene mutation features;

the use of the above system for prediction of the pathogenicity of a mutation; optionally, the use has a positive impact on the study of diabetes, vascular, skeletal, brain function and anti-aging;

the use of the above system for the classification of gene mutations or for predicting the attributes of mutations; optionally, the classification of the genetic mutation comprises classification results of the mutation into a pathogenic mutation or a benign mutation; optionally, the attribute includes a gene family to which the mutation of the gene to be predicted belongs, the gene family includes an ion channel and collagen, and when the gene family to which the mutation of the gene to be predicted belongs is an ion channel gene, RSA in structural features of the mutant protein is selected for prediction.

The invention uses a machine learning algorithm trained on high-quality, clinically significant mutation data. Based on the structural features of the mutant protein, it pinpoints specific structural features by identifying key regions that are susceptible to genetic variation, which better reflects the spatial clustering of pathogenic variants and can markedly improve performance. It also explores and determines the clustering effect of the significant features among the structural features of pathogenic mutant proteins, is highly innovative in the field of life science, and can beneficially promote the analysis and study of pathogenicity in gene data.

The application has the advantages that:

1. the biological regularities hidden in gene data are mined in depth based on target features: the mutation of the gene to be predicted is mapped to the protein structure to obtain a mutation-mapped protein structure, features are extracted to obtain the structural features of the mutant protein, and five important features among them (RSA, ΔΔG, side chain entropy, ionization energy and van der Waals clashes) are selected to quantify the pathogenicity classification result, which improves the precision and depth of data analysis;

2. the application discloses that a feature selection result for the mutation of the gene to be predicted is obtained based on key target features and input into a classification prediction model, which classifies the mutation as pathogenic or benign and outputs a first classification result; the variation effect score of the gene to be predicted is then taken into account to obtain a second classification result (pathogenic or benign); finally, the first and second classification results are fused by weighting to obtain a fused classification of the mutation as pathogenic or benign;

3. the application discloses pathogenicity analysis equipment and a pathogenicity analysis system for predicting mutations based on target features; by interpreting gene data in depth through key target features among the structural features of the mutant protein, and by analyzing the mutation pathogenicity of the gene to be predicted jointly with other variant effect predictors (VEPs), they better reflect the spatial clustering of pathogenic variants and markedly improve performance, so that they can be applied more accurately to assisting the diagnosis of the occurrence and development of diseases related to the gene data and to assisting the selection of schemes.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flow chart of predicting pathogenicity of a gene mutation based on a target characteristic according to an embodiment of the present invention;

FIG. 2 is a SIGMA score chart for predicting the pathogenicity of a mutation based on a target feature provided by an embodiment of the invention;

FIG. 3 is a diagram of structural features of a protein based on AlphaFold2 mapping mutation provided by an embodiment of the invention;

FIG. 4 is a graph illustrating the importance of target features in predicting the pathogenicity of variants according to an embodiment of the present invention;

FIG. 5 is a diagram analyzing the performance of SIGMA in predicting the pathogenicity of mutations based on target features, provided by an embodiment of the invention;

FIG. 6 is a diagram analyzing the performance of SIGMA+ in predicting the pathogenicity of mutations based on target features, provided by an embodiment of the invention;

FIG. 7 is a SIGMA+ score plot for predicting the pathogenicity of mutations based on target features, provided by an embodiment of the invention;

FIG. 8 is a schematic diagram of a pathogenicity analysis device for predicting mutations based on target features according to an embodiment of the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention.

In some of the flows described in the specification and claims of the present invention and in the above-described figures, a number of operations are included that occur in a particular order, but it should be clearly understood that these operations may be performed out of order or in parallel with the order in which they occur, and that the order of the operations, such as S101, S102, etc., is merely used to distinguish between the various operations, and the order itself does not represent any order of performance. Additionally, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel.

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.

FIG. 1 is a schematic flow chart of the method for predicting pathogenicity of gene mutation based on target characteristics, specifically, the method comprises the following steps:

S101: obtaining the mutation of the gene to be predicted.

In one embodiment, the gene to be predicted includes a wild-type gene, and the mutation of the gene to be predicted is obtained from publicly available gene data; optionally, the mutation information of the gene to be predicted is also obtained by determining the genetic information of a sample with a high-throughput sequencing method. Optionally, the mutation of the gene to be predicted comes from one or more of the following data sets: dbSNP, dbVar, gnomAD, RefSeq, ExAC, wherein dbSNP, dbVar and RefSeq are all from the ClinVar database. A total of 27,165 benign and 22,957 pathogenic missense variants were retrieved from the gnomAD and ClinVar databases. For ClinVar, single-nucleotide missense variants with a review status of at least one star were retained (reviewed by expert panel; criteria provided, multiple submitters, no conflicts; criteria provided, single submitter).

Further, step S101 also includes preprocessing the mutation data. Optionally, the preprocessing comprises using an algorithm to predict the impact of the mutation on splicing and excluding mutations with scores above a threshold; preferably, the impact of the mutation on splicing is predicted using the SpliceAI algorithm, and mutations with SpliceAI scores greater than 0.5 are excluded.

SpliceAI predicts splicing changes caused by single-nucleotide variants. It predicts splice sites (position and probability of aberrant splicing) from an arbitrary pre-mRNA sequence, and can thereby predict cryptic splicing caused by mutations in non-coding regions.

Optionally, the preprocessing also includes excluding mutations in proteins whose number of residues exceeds a threshold, e.g., proteins with more than 2700 residues are excluded;

optionally, the preprocessing includes excluding mutations whose clinical significance is "conflicting interpretations of pathogenicity".
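
The preprocessing filters of step S101 can be sketched as a single predicate over variant records. The record layout, field names and example values below are hypothetical; only the SpliceAI cutoff of 0.5, the 2700-residue limit and the exclusion of conflicting interpretations come from the text, and the SpliceAI score is assumed to have been computed beforehand.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    variant_id: str              # hypothetical identifier
    spliceai_score: float        # precomputed SpliceAI score for the variant
    protein_length: int          # number of residues in the affected protein
    clinical_significance: str   # e.g. a ClinVar-style significance label


def passes_preprocessing(v, splice_cutoff=0.5, max_residues=2700):
    """Return True if the variant survives all three exclusion filters."""
    if v.spliceai_score > splice_cutoff:
        return False  # predicted to affect splicing
    if v.protein_length > max_residues:
        return False  # protein exceeds the residue-count threshold
    if "conflicting" in v.clinical_significance.lower():
        return False  # conflicting interpretations of pathogenicity
    return True


# Invented records for illustration.
variants = [
    Variant("v1", 0.02, 450, "Pathogenic"),
    Variant("v2", 0.81, 450, "Benign"),
    Variant("v3", 0.10, 3100, "Pathogenic"),
    Variant("v4", 0.05, 900, "Conflicting interpretations of pathogenicity"),
]
kept = [v.variant_id for v in variants if passes_preprocessing(v)]
```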

S102: mapping the mutation of the gene to be predicted to the protein structure of the gene to be predicted to obtain the protein structure mapped with the mutation.

In one embodiment, the protein structure of the gene to be predicted is obtained by inputting the gene to be predicted into a protein structure generation model. Preferably, the gene to be predicted is a wild-type gene. The protein structure generation model refers to a model or software capable of generating the protein structure of a gene or a polypeptide based on its amino acid sequence. Optionally, the protein structure generation model comprises one or more of the following models: VAE, AlphaFold2, ProteoGAN; preferably, the protein structure generation model is AlphaFold2.

A VAE (variational autoencoder) is a generative model built on a joint probability distribution.

AlphaFold accelerates and enables large-scale discovery, including solving the structure of the nuclear pore complex. AlphaFold predictions can have very high confidence and have yielded, for example, a nine-helix topology that provides clues about protein function, which is crucial for unraveling the causes of rare genetic diseases.

AlphaFold2 has been used to uncover a gating mechanism in glucose-6-phosphatase, to find a binding site for inhibiting that enzyme, and to model proteins such as Wolframin, a transmembrane protein located in the endoplasmic reticulum (ER). It raises prediction accuracy to the atomic level, allows enzyme active sites to be determined more quickly and accurately, and has produced about 350,000 predicted structures.

ProteoGAN is a conditional generative adversarial network that outperforms both classical and recent deep learning baselines in protein sequence generation. Its main use is to extend protein screening with candidates that lie farther from the known sequence space than was previously possible, yet are more likely to be functional than the comparably novel candidates of other approaches.

In one embodiment, preferably, the protein structure of the gene to be predicted is obtained by inputting the wild-type gene into the AlphaFold2 protein structure generation model. The protein structure data set comes from the AlphaFold Protein Structure Database (AlphaFold DB, https://alphafold.ebi.ac.uk/), which contains approximately 365,000 structure predictions. AlphaFold2 provides proteins longer than 2700 residues as 1400-amino-acid fragments, and these proteins are excluded. All mutations should map to the predicted 3D structure of the protein; mutations that cannot be mapped to the structure due to isoform inconsistencies are excluded.
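
One simple way to detect the isoform inconsistencies mentioned above is to check that the wild-type residue named in the mutation matches the residue at that position in the sequence of the predicted structure. The sketch below assumes mutations written in one-letter form such as "K2R" and a structure sequence supplied as a plain string; both conventions and the function name are assumptions for illustration.

```python
import re

# One-letter missense notation: wild-type residue, 1-based position, mutant residue.
MISSENSE_PATTERN = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def map_mutation(structure_sequence, mutation):
    """Return the 0-based residue index if the mutation maps onto the structure.

    Returns None when the notation is not recognized, the position lies outside
    the sequence, or the wild-type residue disagrees with the structure sequence
    (for example because the variant was annotated on a different isoform).
    """
    match = MISSENSE_PATTERN.match(mutation)
    if match is None:
        return None
    wild_type, position = match.group(1), int(match.group(2))
    if position < 1 or position > len(structure_sequence):
        return None
    if structure_sequence[position - 1] != wild_type:
        return None
    return position - 1


# Invented sequence for illustration.
sequence = "MKTAYIAK"
mapped_index = map_mutation(sequence, "K2R")     # maps to index 1
mismatch = map_mutation(sequence, "A2R")         # wild-type residue disagrees
out_of_range = map_mutation(sequence, "K99R")    # position beyond the sequence
```

Variants for which the function returns None would be excluded before feature extraction.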

S103: carrying out feature extraction on the protein structure mapped with mutation to obtain the structural features of the mutant protein.

In one embodiment, the structural features of the mutant protein comprise protein-level features, residue-level features and mutation-level features; optionally, the feature extraction uses FoldX.

FoldX uses a full atomic description of the protein structure to rapidly and quantitatively assess the importance of interactions for the stability of proteins and protein complexes. Among current energy functions, FoldX strikes a good balance between accuracy and speed, so it can readily be used in protein design algorithms and in protein structure and folding pathway prediction, where a fast and accurate energy function is required.

In a specific embodiment, the mutation of the gene to be predicted is mapped onto the wild-type protein structure generated by AlphaFold2 to obtain the mutation-mapped protein structure, and 57 structural features of the mutant protein are extracted at three levels: protein level, residue level and mutation level. The 57 features cover thermodynamic features, protein volume features, secondary structure-related features, the relative solvent accessibility of the mutated residue, and features from empirical rules that determine whether a mutation has a significant effect on protein structure; pathogenic variants exhibit greater changes in protein stability than benign variants.

Optionally, the protein level features include 16 thermodynamic features and 3 protein volume features; the thermodynamic features include the total unfolding energy and its 15 components, such as van der Waals clashes, hydrogen bond energy and side chain entropy, and the protein volume features include the total volume, void volume and van der Waals volume.

Optionally, the residue level features include eight secondary structure-related features and the relative solvent accessibility (RSA) of the mutated residue, which characterize the structural context of the mutated residue. For each mutated residue, the secondary structure is assigned using the DSSP (Dictionary of Secondary Structure of Proteins) program, yielding features corresponding to eight secondary structure types: DSSP (H), DSSP (E), DSSP (G), DSSP (I), DSSP (T), DSSP (S), DSSP (C).
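
The residue level features can be sketched as a one-hot encoding of the DSSP code plus a normalized accessibility value. The sketch uses the feature names listed above as its default; DSSP also defines a B state (isolated beta-bridge), which a caller can add through the `states` argument. Treating a blank or "-" code as C, and taking the maximum accessible surface area from a reference scale chosen by the caller, are assumptions.

```python
LISTED_STATES = ("H", "E", "G", "I", "T", "S", "C")


def dssp_features(code, states=LISTED_STATES):
    """One-hot encode a DSSP secondary structure code as binary features.

    A blank or '-' code (no assigned structure) is treated as C. Any code that
    is not in `states` yields all zeros.
    """
    code = code.strip()
    if code in ("", "-"):
        code = "C"
    return {f"DSSP({state})": int(code == state) for state in states}


def relative_solvent_accessibility(asa, max_asa):
    """RSA = accessible surface area of the residue / maximum for its residue type."""
    if max_asa <= 0:
        raise ValueError("max_asa must be positive")
    return asa / max_asa


helix = dssp_features("H")
coil = dssp_features("-")
bridge = dssp_features("B")                       # not in the default list
rsa = relative_solvent_accessibility(50.0, 200.0)  # invented areas
```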

Optionally, the mutation level features include 16 thermodynamic features describing changes in protein stability after mutation and 13 features from empirical rules that determine whether a mutation has a significant effect on protein structure. The 16 thermodynamic features comprise the unfolding free energy difference (ΔΔG) between the wild-type and mutant proteins and its 15 components. The 13 empirical-rule features combine the structural context and physicochemical properties of the wild-type and mutated residues, including glycine bending, proline in the alpha helix, substitution of residues in the alpha helix by proline/glycine, cysteine residues, energy charge loss, energy hydrophobicity replacement, outlier replacement, disulfide bond cleavage, buried salt bridge cleavage, buried proline, buried residues, energy replacement glycine, unfolding free energy of the protein, and unfolding free energy difference between the mutated proteins.
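
The mutation level thermodynamic features can be illustrated as term-by-term differences between the energy terms of the mutant and wild-type structures. The sketch assumes both sets of terms are already available as dictionaries (for example, parsed from the output of an energy function such as FoldX); the key names, the numeric values and the sign convention (mutant minus wild type) are assumptions.

```python
def energy_deltas(wild_type_terms, mutant_terms):
    """Return mutant-minus-wild-type differences for every shared energy term."""
    shared = wild_type_terms.keys() & mutant_terms.keys()
    return {f"delta_{name}": mutant_terms[name] - wild_type_terms[name]
            for name in sorted(shared)}


# Invented values in kcal/mol; key names are hypothetical.
wild_type_terms = {"total_energy": 10.0, "vdw_clashes": 1.0, "sidechain_entropy": 5.0}
mutant_terms = {"total_energy": 12.5, "vdw_clashes": 3.0, "sidechain_entropy": 5.25}

deltas = energy_deltas(wild_type_terms, mutant_terms)
ddg = deltas["delta_total_energy"]  # the ΔΔG-style feature under this convention
```

Under this convention the total-energy difference plays the role of ΔΔG, and the remaining differences are its components.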

FIG. 3 shows the structural features of the mutant protein provided by an embodiment of the present invention. It shows that the structural features of the mutant protein are related to variant pathogenicity, and in particular that the relative solvent accessibility (RSA) and the unfolding free energy difference (ΔΔG) are significantly related to variant pathogenicity. As shown in FIG. 3A, the RSA of pathogenic variants is significantly lower than that of benign variants (P < 2.2e-16, Mann-Whitney U test), consistent with the fact that most proteins tolerate mutations at buried positions less well than mutations at exposed positions. As shown in FIG. 3B, ΔΔG measures the effect of a single amino acid substitution on protein stability, and pathogenic variants show greater changes in protein stability than benign variants (P < 2.2e-16, Mann-Whitney U test). As shown in FIG. 3C, disulfide bond cleavage has the highest positive predictive value of all structural features (odds ratio [OR] = 93.8, 95% confidence interval [CI] = 44.5-198, P < 2.2e-16, Pearson's chi-square test), and almost all (98.72%) missense mutations that disrupt disulfide bonds are pathogenic, supporting an important role for disulfide bonds in protein function. As shown in FIG. 3D, the correlation between the type of secondary structure in which a variant lies and its pathogenicity is consistent with prior knowledge: mutations in loops or irregular regions of the protein (DSSP-C) tend to be benign (OR = 0.32, 95% CI = 0.31-0.34, P < 2.2e-16, Pearson's chi-square test). In contrast, mutations in regular secondary structure tend to be pathogenic, especially mutations in alpha helices (DSSP-H; OR = 1.73, 95% CI = 1.66-1.79, P < 2.2e-16, Pearson's chi-square test) or beta sheets (DSSP-E; OR = 1.97, 95% CI = 1.87-2.08, P < 2.2e-16, Pearson's chi-square test).

S104: and (3) carrying out feature selection on the structural features of the mutant protein to obtain a feature selection result.

In one example, the feature selection result refers to the target features among the structural features of the mutant protein, including total unfolding energy, van der Waals clashes, hydrogen bond energy, side chain entropy, ionization energy, disulfide bond breakage, total protein volume, void volume, van der Waals volume, RSA, ΔΔG, DSSP (H), DSSP (E), DSSP (G), DSSP (I), DSSP (T), DSSP (S), DSSP (C).

FIG. 4 shows an evaluation of the importance of the target features in predicting the pathogenicity of variation. Preferably, the target features among the structural features of the mutant protein include RSA, ΔΔG, side chain entropy, ionization energy, van der Waals clashes, and the like. FIG. 4A shows the ten most important features affecting discriminatory power: the residue-level feature RSA contributes the most, followed by the two mutation-level features (ΔΔG and ΔvdW). Seven of these ten most important features are protein-level features, demonstrating the important role of protein stability in predicting the pathogenicity of a variation. FIGS. 4B and 4C show gene set enrichment analysis (GSEA) plots of the ion channel gene family (B) and the collagen gene family (C), respectively. Notably, RSA alone can predict variant pathogenicity with an accuracy > 0.8 in 76 of the 200 genes evaluated (genes with at least 10 pathogenic and 10 benign variants), with especially good prediction of ion channel gene mutations.

In addition, as seen in FIG. 4, RSA (relative solvent accessibility) gave the highest classification accuracy for ion channel genes (P = 8.67e-04); this target feature alone was not satisfactory for 11% of the genes (21 of 200 genes, accuracy < 0.5), 11% of which belong to the collagen gene family (P = 1.99e-06; FIG. 4C). Here, ΔΔG denotes the difference in unfolding free energy between the wild-type and mutant proteins; ΔG, the unfolding free energy of the protein; vdW, van der Waals.
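The per-gene evaluation of a single feature described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure of the embodiment: the RSA cutoff of 0.25, the rule "RSA below the cutoff is called pathogenic", the function name and the toy records are all hypothetical. Only the requirement of at least 10 pathogenic and 10 benign variants per gene comes from the text.

```python
from collections import defaultdict

RSA_CUTOFF = 0.25    # assumed burial cutoff; not specified in the source
MIN_PER_CLASS = 10   # from the text: at least 10 pathogenic and 10 benign variants


def per_gene_accuracy(variants, cutoff=RSA_CUTOFF, min_per_class=MIN_PER_CLASS):
    """Accuracy of an RSA-only classifier, computed separately for each gene.

    variants: iterable of (gene, rsa, is_pathogenic) tuples.
    A variant is called pathogenic when its RSA is below the cutoff.
    """
    by_gene = defaultdict(list)
    for gene, rsa, is_pathogenic in variants:
        by_gene[gene].append((rsa, is_pathogenic))

    accuracy = {}
    for gene, rows in by_gene.items():
        n_pathogenic = sum(1 for _, label in rows if label)
        n_benign = len(rows) - n_pathogenic
        if n_pathogenic < min_per_class or n_benign < min_per_class:
            continue  # too few labeled variants to evaluate this gene
        correct = sum(1 for rsa, label in rows if (rsa < cutoff) == label)
        accuracy[gene] = correct / len(rows)
    return accuracy


# Toy records with hypothetical gene names and RSA values.
toy = (
    [("GENE_A", 0.05, True)] * 10
    + [("GENE_A", 0.60, False)] * 10
    + [("GENE_B", 0.10, True)] * 3
)
acc = per_gene_accuracy(toy)
print(acc)
```

Genes whose accuracy exceeds 0.8 or falls below 0.5 could then be grouped by gene family, as in the ion channel and collagen analyses.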

S105: inputting the feature selection result into a classification prediction model to obtain a classification result of the mutation as a pathogenic mutation or a benign mutation.

In one embodiment, the classification prediction model classifies the mutation as pathogenic or benign on the basis of one or more of the following structural features of the mutant protein: protein-level features, residue-level features, and mutation-level features. FIG. 2 shows a SIGMA score chart for predicting pathogenicity based on target features; that is, when the classification prediction model is the SIGMA model, the classification result of the mutation as a pathogenic mutation or a benign mutation is obtained.

The SIGMA (Structure-Informed Genetic Missense Mutation Assessor) model was developed on the basis of the close relationship between protein structure and function. It is implemented mainly with machine learning algorithms such as the gradient boosting machine, and evaluates the effect of missense variants in the context of protein structure.

Further, the classification prediction model is determined as follows:

acquiring a mutation data set, wherein the mutation data set comprises label information;

mapping the mutation data to a protein structure to obtain a protein structure mapped with mutation;

performing feature selection on the mutant protein structural feature data by adopting a machine learning method, and screening out target features in the mutant protein structural features as a feature selection result, wherein the target features in the mutant protein structural features comprise protein level features, residue level features and mutation level features;

and selecting target characteristics in the structural characteristics of the mutant protein to establish a classification prediction model.

Specifically, the classification prediction model SIGMA is constructed as follows: acquiring a mutation data set that comprises label information; mapping the mutation data to a protein structure to obtain the protein structure mapped with the mutation; performing feature extraction on the protein structure mapped with the mutation to obtain structural features of the mutant protein; processing the structural features of the mutant protein with a machine learning algorithm, screening out the target features among them as the feature selection result, and obtaining a predicted classification result; and optimizing the machine learning algorithm according to the predicted classification result and the actual result to obtain a final mutation classification result of the gene to be predicted, and outputting the trained SIGMA model.

Optionally, the mutation data set comprises one or more of the following data sets: gnomAD, HumVar, ExoVar, PredictSNP, VariBench, SwissVar, Humsavar, ClinVar.

In one specific example, the SIGMA model was constructed from a final labeled data set containing 27,165 benign (negative) and 22,957 pathogenic (positive) mutations as the "gold standard" data set, which was divided into 80% for training and 20% for testing; in addition, 27,928 mutations retained from the gnomAD database were used for model evaluation. The SIGMA model may be trained as follows: acquiring a mutation data set containing label information (for example, pathogenic/likely pathogenic variants are labeled positive, and benign/likely benign variants are labeled negative); repeating the protein structure generation process and the feature extraction process to obtain a feature set; inputting the feature set into the SIGMA model to obtain a classification result for each sample; calculating the loss between the classification result and the true value; and obtaining the trained SIGMA model by optimization using machine learning techniques such as back propagation, a loss function, an optimizer and the like.
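The labeling rule and the 80%/20% division described above can be sketched as follows. The source only states the split ratio; splitting within each class (a stratified split), the fixed random seed, the clinical-significance strings and the function names are assumptions made for this illustration.

```python
import random

POSITIVE = {"pathogenic", "likely pathogenic"}  # labeled positive (1)
NEGATIVE = {"benign", "likely benign"}          # labeled negative (0)


def to_label(significance):
    """Map a clinical-significance string to 1, 0, or None (not usable)."""
    s = significance.strip().lower()
    if s in POSITIVE:
        return 1
    if s in NEGATIVE:
        return 0
    return None


def stratified_split(records, train_fraction=0.8, seed=0):
    """Split (variant_id, significance) records into train and test lists.

    The split is performed within each class so that both parts keep the
    same class balance (an assumption; the source only gives 80%/20%).
    Records without a usable label are dropped.
    """
    rng = random.Random(seed)
    train, test = [], []
    for label in (0, 1):
        group = [(vid, label) for vid, sig in records if to_label(sig) == label]
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


# Hypothetical records for illustration only.
records = (
    [(f"p{i}", "Pathogenic") for i in range(10)]
    + [(f"b{i}", "Likely benign") for i in range(10)]
    + [("u0", "Uncertain significance")]
)
train, test = stratified_split(records)
print(len(train), len(test))
```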

Further, the label information is defined such that pathogenic/likely pathogenic variants are labeled positive, whereas benign/likely benign variants are labeled negative. For the gnomAD mutations, preprocessing applies SpliceAI to the gnomAD database; for deep intronic regions (more than 50 nt away from exon-intron boundaries) prediction is difficult, and only 56% deletion is observed when the threshold is set to 0.8. Therefore, the effect of each mutation on splicing was predicted and mutations with SpliceAI scores greater than 0.5 were excluded, because mutations that introduce cryptic splice sites mainly affect mRNA splicing rather than protein structure; the selected common missense mutations (maximum allele frequency across all gnomAD populations > 0.05) were labeled negative. In addition, the GOF/LOF database labeled 193 gain-of-function (GOF) and 921 loss-of-function (LOF) pathogenic missense variants.

Further, obtaining the mutation data set further comprises preprocessing the mutation data set. Optionally, the preprocessing comprises using an algorithm to predict the effect of the mutation on splicing and excluding mutations with a score above a threshold; preferably, the effect of the mutation on splicing is predicted using the SpliceAI algorithm and mutations with SpliceAI scores greater than 0.5 are excluded;

optionally, the preprocessing further comprises excluding mutations in proteins whose number of residues exceeds a threshold of 2700.
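The preprocessing filters described above can be sketched as follows. The thresholds (SpliceAI score 0.5, 2700 residues, allele frequency 0.05) come from the text; the `Variant` record, its field names and the `preprocess` function are hypothetical, and the SpliceAI scores are assumed to have been computed beforehand by the SpliceAI tool itself.

```python
from dataclasses import dataclass, replace
from typing import List, Optional

SPLICEAI_MAX = 0.5   # mutations scoring above this are excluded
MAX_RESIDUES = 2700  # residue-count threshold
COMMON_AF = 0.05     # maximum population allele frequency for "common"


@dataclass(frozen=True)
class Variant:
    variant_id: str
    spliceai_score: float      # precomputed splicing-effect score
    protein_length: int        # number of residues in the protein
    max_population_af: float   # maximum allele frequency across populations
    label: Optional[int] = None  # 1 = pathogenic, 0 = benign, None = unlabeled


def preprocess(variants: List[Variant]) -> List[Variant]:
    """Apply the exclusion filters and label common variants as negative."""
    kept = []
    for v in variants:
        if v.spliceai_score > SPLICEAI_MAX:
            continue  # likely acts through splicing, not protein structure
        if v.protein_length > MAX_RESIDUES:
            continue  # residue count exceeds the threshold
        if v.label is None and v.max_population_af > COMMON_AF:
            v = replace(v, label=0)  # common missense variant: negative
        kept.append(v)
    return kept


# Hypothetical variants for illustration only.
variants = [
    Variant("v1", 0.90, 500, 0.0),
    Variant("v2", 0.10, 3000, 0.0),
    Variant("v3", 0.10, 500, 0.20),
    Variant("v4", 0.10, 500, 0.0001, label=1),
]
kept = preprocess(variants)
print([(v.variant_id, v.label) for v in kept])
```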

Further, the machine learning algorithm includes any one or more of the following algorithms: GBDT, GBM, SVM, RF, AdaBoost, and Apriori; the machine learning algorithm is optimized by any one or more of the following methods: the steepest descent method, Newton's method, and the quasi-Newton method.

In one embodiment, step S105 in fig. 1 further includes:

inputting the feature selection result into a classification prediction model to obtain a classification result of mutation into pathogenic mutation or benign mutation as a first classification result;

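inputting the mutation of the gene to be predicted into a predictor to obtain a mutation effect score of the gene to be predicted, and obtaining a second classification result of the mutation as a pathogenic mutation or a benign mutation based on the mutation effect score;

performing weighted fusion on the first classification result and the second classification result to obtain a fused classification result of the mutation as a pathogenic mutation or a benign mutation;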

optionally, the weighted fusion uses any one or more of the following algorithms: Boosting, XGBoost, AdaBoost and Voting ensemble algorithms.

Boosting is a mainstream family of ensemble learning techniques in machine learning. XGBoost, which often outperforms deep neural networks in data analysis competitions, is an efficient implementation of the gradient boosting algorithm in the Boosting family.

The Voting ensemble algorithm is one of the most common combination strategies for classification problems in ensemble learning. The basic idea is to select the class that is output by the largest number of the machine learning algorithms.
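The voting idea and a simple weighted fusion of two classification results can be sketched as follows. This is a minimal illustration, not the fusion used in the embodiment: it assumes that every score lies on a common 0 to 1 scale on which higher means more likely pathogenic, and the weights, the 0.5 decision threshold and the function names are hypothetical.

```python
from collections import Counter


def majority_vote(labels):
    """Hard voting: return the class output by the most classifiers.

    On a tie, Counter.most_common keeps the class that was seen first.
    """
    return Counter(labels).most_common(1)[0][0]


def weighted_fusion(scores, weights, threshold=0.5):
    """Soft voting: weighted average of per-predictor pathogenicity scores.

    scores:  dict mapping predictor name to a score in [0, 1]
    weights: dict mapping predictor name to a non-negative weight
    Returns the fused score and the fused class.
    """
    total = sum(weights[name] for name in scores)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    fused = sum(scores[name] * weights[name] for name in scores) / total
    return fused, ("pathogenic" if fused >= threshold else "benign")


# Hypothetical scores and weights for illustration only.
vote = majority_vote(["pathogenic", "benign", "pathogenic"])
fused_score, fused_class = weighted_fusion(
    scores={"SIGMA": 0.9, "predictor": 0.4},
    weights={"SIGMA": 0.6, "predictor": 0.4},
)
print(vote, round(fused_score, 2), fused_class)
```

In practice the weights would be learned from labeled data, for example by one of the boosting algorithms listed above, rather than fixed by hand.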

Optionally, the predictor includes one or more of the following software: Enformer, PROVEAN, EVE, DEOGEN2, MutPred, Mutation Assessor, SIFT, ClinPred, PolyPhen2, REVEL; preferably, the predictors are PROVEAN, EVE, DEOGEN2 and MutPred.

PROVEAN is a tool for predicting whether protein sequence variation affects protein function; it identifies whether a non-synonymous mutation or an InDel affects the biological function of a protein on the basis of evolutionary conservation, a neural network model and the BLOSUM62 amino acid substitution scoring matrix. The prediction score ranges from -14 to 14 with a threshold of -2.5: a score from -14 to -2.5 is predicted as Deleterious, and a score from -2.5 to 14 is predicted as Neutral. The smaller the score, the more likely the variant is harmful, and vice versa.
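The threshold rule described above can be written as a short sketch. The cutoff of -2.5 comes from the text; treating a score exactly equal to the cutoff as Deleterious is an assumption made here, and the function name is hypothetical.

```python
PROVEAN_CUTOFF = -2.5  # threshold stated in the text


def provean_call(score, cutoff=PROVEAN_CUTOFF):
    """Classify a PROVEAN score: at or below the cutoff is Deleterious."""
    return "Deleterious" if score <= cutoff else "Neutral"


# Hypothetical scores for illustration only.
calls = [provean_call(s) for s in (-6.1, -2.5, -1.0, 3.2)]
print(calls)
```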

EVE estimates the likelihood that each single amino acid variation is benign or pathogenic, learning the pathogenic propensity of human missense variants from the distribution of sequence variation across species. It is superior to other computational prediction models in predicting clinical effects, and performs as well as or better than current gold-standard high-throughput experiments that test the impact of mutations on biological function.

DEOGEN2 contains heterogeneous information about the molecular effects of the mutation, the domains involved, the relatedness of the genes, and the interactions in which it participates. This extensive context information is non-linearly mapped into a single harmfulness score for each mutation.

The function of MutPred is to predict the pathogenicity and molecular mechanism of amino acid substitutions, with an amino acid sequence in FASTA format as the main input. MutPred is a collection of machine learning tools that can predict the pathogenicity of protein-coding mutations and infer the molecular mechanism of disease.

REVEL is an ensemble method for predicting missense mutations. It combines the prediction results of multiple tools (MutPred, FATHMM, VEST, PolyPhen, SIFT, PROVEAN, MutationAssessor, MutationTaster, LRT, GERP, SiPhy, phyloP and phastCons), applies a random forest algorithm, and has a good prediction effect on rare missense mutations. The predicted score for a single mutation ranges from 0 to 1, and the threshold can be set according to the required sensitivity and specificity.

In one example, the mutation data contained 20,047 pathogenic and 20,148 benign variants, and also included 27,928 mutations in six proteins characterized by DMS experiments; 27 computational VEPs, such as EVE and DEOGEN2, were used. The EVE score for each mutation was obtained from https://evemodel.org/. The predictions of the other VEPs for each mutation were retrieved from the dbNSFP database (version 4.1a). These VEPs fall into two categories: individual predictors (n = 16), which are independent of the other VEPs, and meta-predictors (n = 11), which integrate the results of other VEPs as input features.

FIG. 5 is an analysis diagram of the mutation pathogenicity prediction performance of the SIGMA model based on target features according to an embodiment of the present invention. The SIGMA model uses a gradient boosting machine (GBM) to train a classifier that distinguishes pathogenic variants from benign variants, and uses a grid search to tune the parameters. Specifically, for the evaluation shown in FIG. 5, the GBM-based SIGMA model for predicting the pathogenicity of missense mutations was trained on a training data set of 40,195 mutations with structural features. FIG. 5A shows the distribution of SIGMA scores for the 20,047 pathogenic and 20,148 benign variants in the training set; the quantitative SIGMA scores (from 0 to 1) for the potential pathogenicity of the variants, calculated using out-of-fold prediction, follow a bimodal distribution, indicating that SIGMA is strongly discriminative. For the training set, the area under the receiver operating characteristic (ROC) curve (AUC) shown in FIG. 5B was 0.944 (95% CI = 0.942-0.946). Consistently, as shown in FIG. 5C, high prediction accuracy was obtained on the test set (AUC = 0.933, 95% CI = 0.928-0.938). As shown in FIG. 5D, when the AUCs of SIGMA and 16 individual variant effect predictors (VEPs) were compared on the test set, SIGMA outperformed all individual VEPs, whose AUCs ranged from 0.779 (FATHMM) to 0.929 (MutPred). FIG. 5E shows the correlation between VEP predictions and deep mutational scanning (DMS) measurements, where Spearman's correlation is calculated between the functional scores of the DMS experiments and the prediction scores of SIGMA and the 16 individual VEPs. The closest competitor is DEOGEN2 (overall rho = 0.387, Spearman correlation analysis; FIG. 5E), which contains extensive heterogeneous information, and the performance ranking of the recently developed predictor EVE rises from sixth place on the labeled data set to third place on the DMS data set (FIG. 5E).
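The AUC values reported above can be illustrated with a minimal sketch. It computes the AUC as the probability that a randomly chosen pathogenic variant receives a higher score than a randomly chosen benign variant, with ties counted as one half, which is equivalent to the area under the ROC curve. The function name and the toy scores are hypothetical, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: sequence of 1 (pathogenic) or 0 (benign)
    scores: predicted pathogenicity scores, higher = more pathogenic
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and scores for illustration only.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.3, 0.6, 0.55, 0.1]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

Applying the same function to the scores of each predictor on a shared test set would give a ranking comparable in spirit to FIG. 5D.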

The method is used to assist in selecting a gene mutation pathogenicity analysis scheme, and predicts the classification result of whether a mutation is pathogenic according to the gene information of the mutation to be predicted. Specifically, FIG. 6 is an analysis diagram of the mutation pathogenicity prediction performance of SIGMA+ based on target features provided by the embodiment of the present invention, and FIG. 7 is a SIGMA+ score diagram for predicting mutation pathogenicity based on target features provided by the embodiment of the present invention; both show a good classification effect. Here, SIGMA+ is the combination of all five predictors: SIGMA, DEOGEN2, EVE, PROVEAN, and MutPred. FIG. 6A illustrates the potential of using SIGMA in combination with other predictors: combining SIGMA with four high-performing individual VEPs, namely DEOGEN2, EVE, PROVEAN, and MutPred, yields a more comprehensive predictor of variant pathogenicity, with the various combinations of these five predictors enhancing the prediction performance (AUC ranging from 0.931 to 0.957). FIG. 6B specifically demonstrates the high predictive power of SIGMA+ for the pathogenicity of variation (AUC = 0.964), enabling clear discrimination between benign and pathogenic variations. FIG. 6C shows the correlation between VEP predictions and deep mutational scanning (DMS) measurements. As seen in FIG. 6, SIGMA contributed the most to the combination, while DEOGEN2 contributed the least, indicating that SIGMA has strong discriminatory power. Therefore, it may be more advantageous to mine additional information sources than to iteratively build meta-predictors using more and more existing predictors.
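The correlation between predictor scores and DMS measurements can be quantified with Spearman's rank correlation. The following is a minimal sketch that assigns average ranks to tied values and then computes the Pearson correlation of the ranks; the function names and the toy values are hypothetical. Depending on how a DMS functional score is defined, the sign of rho may be negative for a good predictor, so the magnitude is often what is compared.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical predictor scores and DMS functional scores.
rho = spearman_rho([1, 2, 3, 4], [10, 20, 30, 25])
print(round(rho, 3))
```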

The embodiment of the invention provides a pathogenicity analysis system based on target characteristic prediction mutation, which comprises:

the acquisition module is used for acquiring mutation of a gene to be predicted;

the structural feature extraction module is used for carrying out feature extraction on the protein structure mapped with mutation to obtain the structural feature of the mutant protein;

the structure characteristic selection module is used for carrying out characteristic selection on the structural characteristics of the mutant protein to obtain a characteristic selection result;

the classification prediction model obtains classification results of the mutations into pathogenic mutations or benign mutations through one or more of the following structural characteristics of the mutant proteins, wherein the structural characteristics of the mutant proteins comprise the following target characteristics: protein level characteristics, residue level characteristics, mutation level characteristics.
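The way the modules listed above hand data to one another can be sketched as a small pipeline. This is only an architectural illustration: the class name, the callable signatures, the feature names, the mutation string and the stub functions are all hypothetical, and a real system would plug a structure-based feature extractor and a trained classification prediction model into the two callables.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Features = Dict[str, float]


@dataclass
class PathogenicityPipeline:
    """Chains feature extraction, feature selection and classification."""

    extract_features: Callable[[str], Features]  # structural feature extraction
    selected_features: List[str]                 # result of feature selection
    classify: Callable[[List[float]], float]     # returns a score in [0, 1]
    threshold: float = 0.5                       # assumed decision threshold

    def predict(self, mutation: str) -> str:
        features = self.extract_features(mutation)
        # Keep only the target features, in a fixed order.
        vector = [features[name] for name in self.selected_features]
        score = self.classify(vector)
        return "pathogenic" if score >= self.threshold else "benign"


# Stub components with made-up values, for illustration only.
def stub_extract(mutation: str) -> Features:
    return {"RSA": 0.05, "ddG": 3.2, "unused_feature": 1.0}


def stub_classify(vector: List[float]) -> float:
    rsa = vector[0]
    return 1.0 if rsa < 0.25 else 0.0


pipeline = PathogenicityPipeline(
    extract_features=stub_extract,
    selected_features=["RSA", "ddG"],
    classify=stub_classify,
)
result = pipeline.predict("p.Cys100Tyr")
print(result)
```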

FIG. 8 shows a pathogenicity analysis device for predicting mutations based on target features according to an embodiment of the present invention, including: a memory and a processor; the memory is used for storing program instructions; the processor is configured to invoke the program instructions, which, when executed, implement the pathogenicity analysis method for predicting mutations based on target features described above.

The apparatus may further include: an input device and an output device. The memory, processor, input device, and output device may be connected by a bus or other means, such as the bus connection shown in fig. 8.

The present invention provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the pathogenicity analysis method for predicting a mutation based on a target feature.

It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the system, the apparatus and the module described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative; for example, the division of the module is only a logic function division, and there may be another division manner in actual implementation; for example, multiple modules or components may be combined or may be integrated into another system, or some features may be omitted, or not implemented.

Modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.

In addition, functional modules in the embodiments of the present invention may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware form, and can also be realized in a software functional module form.

Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable storage medium, and the storage medium may include: a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic or optical disk, or the like.

It will be understood by those skilled in the art that all or part of the steps of the methods in the above embodiments may be implemented by a program instructing the relevant hardware, and the program may be stored in a computer-readable storage medium, where the storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.

While the invention has been described in detail with reference to certain embodiments, it will be apparent to one skilled in the art that the invention may be practiced without these specific details.

Claims

1. A pathogenicity analysis method for predicting mutations based on target characteristics, comprising:

inputting the gene to be predicted into a protein structure generation model to obtain the protein structure of the gene to be predicted; mapping the mutation of the gene to be predicted to the protein structure of the gene to be predicted to obtain the protein structure mapped with the mutation;

performing feature extraction on the protein structure mapped with the mutation to obtain structural features of the mutant protein;

inputting the feature selection result into a classification prediction model to obtain a classification result of the mutation into a pathogenic mutation or a benign mutation;

the classification prediction model obtains classification results of mutation into pathogenic mutation or benign mutation through one or more combinations of target characteristics in the following mutant protein structural characteristics, wherein the target characteristics in the mutant protein structural characteristics comprise: protein level characteristics, residue level characteristics, mutation level characteristics.

2. The pathogenicity analysis method for predicting a mutation based on a feature of interest according to claim 1, wherein the protein-level features comprise thermodynamic features and protein volume features, the residue-level features comprise secondary structure-related features and RSA, and the mutation-level features comprise thermodynamic features and features derived from empirical rules for determining whether a mutation has a significant effect on protein structure;

optionally, the target features among the structural features of the mutant protein include total unfolding energy, van der Waals clashes, hydrogen bond energy, side chain entropy, ionization energy, disulfide bond breakage, total wild-type protein volume, void volume, van der Waals volume, RSA, ΔΔG, DSSP (H), DSSP (E), DSSP (G), DSSP (I), DSSP (T), DSSP (S), DSSP (C).

3. The pathogenicity analysis method based on target feature prediction mutation of claim 1, wherein the feature selection result is input into a classification prediction model to obtain a classification result of mutation into pathogenic mutation or benign mutation, and the method further comprises:

inputting the feature selection result into a classification prediction model to obtain a classification result of mutation into pathogenic mutation or benign mutation, and outputting a first classification result;

inputting the mutation of the gene to be predicted into a predictor to obtain a mutation effect score of the gene to be predicted, and obtaining a second classification result of the mutation into a pathogenic mutation or a benign mutation based on the mutation effect score; and performing weighted fusion on the first classification result and the second classification result to obtain a fused classification result of which the mutation is pathogenic mutation or benign mutation.

4. The pathogenicity analysis method based on target feature prediction mutation of claim 3, characterized in that the predictor predicts the mutation effect of the gene to be predicted to obtain the mutation effect score of the gene to be predicted; the predictor comprises any one or more of the following software: Enformer, PROVEAN, EVE, DEOGEN2, MutPred, Mutation Assessor, SIFT, ClinPred, PolyPhen2, REVEL;

preferably, the predictors include PROVEAN, EVE, DEOGEN2, and MutPred.

5. The pathogenicity analysis method for predicting mutation based on target characteristic of claim 3, wherein the score of variation effect of the gene to be predicted comprises one or more of the following scores: sequence conservation score, allele mutation frequency score, missense mutation score, frameshift mutation score, nonsense mutation score, indel mutation score;

preferably, the score of the variation effect of the gene to be predicted comprises a sequence conservation score, an allele mutation frequency score and a missense mutation score.

6. The pathogenicity analysis method based on target characteristic prediction mutation of claim 3, characterized in that the weighted fusion adopts any one or more of the following algorithms: Boosting, XGBoost, AdaBoost and Voting ensemble algorithms.

7. The pathogenicity analysis method based on target characteristic prediction mutation according to claim 1, wherein the protein structure generation model comprises any one or more of the following generation models: AlphaFold, AlphaFold2, ProteoGAN; preferably, the protein structure generation model is AlphaFold2.

8. A pathogenicity analysis device for predicting a mutation based on a target feature, the device comprising: a memory and a processor; the memory is used to store program instructions; the processor is configured to invoke the program instructions, which, when executed, implement the pathogenicity analysis method for predicting a mutation based on a target feature of any one of claims 1 to 7.

9. A pathogenicity analysis system for predicting mutations based on target characteristics, the system comprising: the acquisition module is used for acquiring mutation of a gene to be predicted;

the classification module is used for inputting the feature selection result into a classification prediction model to obtain a classification result of the mutation as a pathogenic mutation or a benign mutation.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the pathogenicity analysis method for predicting a mutation based on a target feature of any one of claims 1 to 7.