CN106636398B

CN106636398B - Construction method of Alzheimer disease onset risk prediction model

Info

Publication number: CN106636398B
Application number: CN201611190992.4A
Authority: CN
Inventors: 蒋庆华; 刘桂友; 胡杨; 王亚东
Original assignee: Harbin Institute of Technology
Current assignee: Harbin Institute of Technology
Priority date: 2016-12-21
Filing date: 2016-12-21
Publication date: 2021-01-29
Anticipated expiration: 2036-12-21
Also published as: CN106636398A

Abstract

The invention belongs to the field of medical detection, and particularly discloses a construction method of an Alzheimer disease onset risk prediction model. The improved wGRS method can further improve the accuracy of the prediction of the Alzheimer disease onset risk. Therefore, the method considers the important influence of the interaction between the SNPs on the Alzheimer's disease, applies the interaction between the SNPs to the prediction of the onset risk of the Alzheimer's disease, and further improves the accuracy of the prediction of the onset risk of the Alzheimer's disease.

Description

Construction method of Alzheimer disease onset risk prediction model

Technical Field

The invention relates to the field of medical detection, in particular to a construction method of an Alzheimer disease onset risk prediction model.

Background

Alzheimer's disease is a degenerative disease of the nervous system, clinically characterized by dementia manifestations such as memory decline and cognitive decline. Alzheimer's disease is currently understood to result from the combined action of genes and environmental factors, with genes playing the major role.

At present, the proportion of patients suffering from Alzheimer disease rises year by year, and the daily life of people is seriously influenced. In recent years, genome-wide association studies and candidate gene studies have found a large number of Alzheimer's disease-susceptible polymorphic sites. Therefore, it is very important to establish a corresponding model through the genotype data of the Alzheimer disease individuals and the normal control individuals, and further predict the onset risk of the Alzheimer disease of the individuals.

If the genotype data of a person is determined, the model can be used to calculate the risk of developing Alzheimer's disease. If the risk of the disease is high, healthy living, exercise and nutrition balance schemes need to be formulated, so that the risk of the disease is reduced.

Genetic risk scoring (GRS) is an effective method for analyzing the relationship between single nucleotide polymorphisms (SNPs) and the clinical phenotypes of complex diseases. A single SNP has only a weak effect on disease, and the method combines the weak effects of several SNPs. GRS assumes that every risk allele has the same effect on disease and simply sums the number of risk alleles. In practice, the effects of different risk alleles are unlikely to be equal, which led to the weighted genetic risk score (wGRS).

The weighted GRS can be expressed as:

$$\mathrm{wGRS} = \sum_{i=1}^{n} \beta_i S_i$$

where $\beta_i$ is the weight of the i-th SNP, $S_i$ is the number of risk alleles of the i-th SNP, and n is the number of SNPs. The algorithm assumes that each risk allele influences the disease to a different degree and expresses that degree by assigning each risk allele a corresponding weight. wGRS is more widely applied than GRS in the prediction and evaluation of complex diseases.
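The difference between the unweighted and the weighted score can be illustrated with a minimal Python sketch. The function names, the three weights and the two individuals are hypothetical and are not taken from the invention; they only show that two individuals with the same number of risk alleles can receive different weighted scores.

```python
from typing import Sequence


def grs(risk_allele_counts: Sequence[int]) -> int:
    """Unweighted GRS: total number of risk alleles across all SNPs."""
    return sum(risk_allele_counts)


def wgrs(risk_allele_counts: Sequence[int], weights: Sequence[float]) -> float:
    """Weighted GRS: sum over SNPs of weight_i * risk_allele_count_i."""
    if len(risk_allele_counts) != len(weights):
        raise ValueError("one weight is required per SNP")
    return sum(b * s for b, s in zip(weights, risk_allele_counts))


# Hypothetical data: three SNPs, two individuals with the same allele count.
weights = [0.1, 0.5, 0.9]   # illustrative weights, not values from the patent
person_a = [2, 0, 0]        # two risk alleles at the weakest SNP
person_b = [0, 0, 2]        # two risk alleles at the strongest SNP

print(grs(person_a), grs(person_b))                       # identical counts
print(wgrs(person_a, weights), wgrs(person_b, weights))   # different scores
```

Both individuals have a GRS of 2, while the weighted score separates them according to which SNP carries the risk alleles.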

Current research shows that interactions between SNPs have an important influence on the onset of Alzheimer's disease, yet these interactions are ignored when wGRS is used for risk prediction.

Disclosure of Invention

The present invention is directed to overcoming the problems in the prior art, and providing a method for constructing an Alzheimer disease risk prediction model, which is based on the genotype data of Alzheimer Disease (AD) individuals and normal individuals, establishes a more accurate Alzheimer disease risk prediction model, and predicts the Alzheimer disease risk by using the model and the genotype data of the individuals.

The technical scheme of the invention is as follows: a method for constructing an Alzheimer disease onset risk prediction model comprises the following steps:

(1) acquiring genotype data of an Alzheimer disease individual and a normal control individual;

for Alzheimer's disease, first performing gene sequencing on the autosomes of a large number of Alzheimer's disease patients and normal individuals to obtain their original SNP genotype data; performing quality control on the original SNP genotype data by removing SNPs that have a minor allele frequency (MAF) of less than 0.02, fail the Hardy-Weinberg equilibrium test, have a typing success rate of less than 75%, or are located in a linkage disequilibrium region; the typing success rate over all SNPs of a sample must also exceed 75%, otherwise the sample fails the per-sample genotype missing-rate control and is removed from the SNP genotype data; SNP genotype data meeting these conditions are retained for further analysis;

(2) after SNP genotype data which do not meet the control conditions are removed, scoring is carried out on the retained SNP genotype data; scoring the SNP genotype data by 0, 1 and 2 according to the number of high-risk alleles contained in the SNP genotype data, and expressing the corresponding SNP genotype data by adopting the scores of 0, 1 and 2;

for SNP genotype data, a genotype homozygous for the high-risk allele (two high-risk alleles) is scored 2, a heterozygous genotype (one high-risk allele) is scored 1, and a genotype homozygous for the low-risk allele (no high-risk allele) is scored 0;

(3) selecting the SNPs that are significantly associated with Alzheimer's disease and the SNP-SNP pairs whose interaction is significantly associated with the disease; an SNP with an association level p < 0.05 with Alzheimer's disease is considered significantly associated with the disease;

disease status is coded as 1 for individuals with Alzheimer's disease and 0 for normal individuals; SNPs significantly associated with Alzheimer's disease are obtained by single-factor logistic regression adjusted for age and sex, and SNP-SNP pairs significantly associated with Alzheimer's disease after Bonferroni correction are obtained by the Lasso multiple regression method;

(4) obtaining the SNPs that independently influence Alzheimer's disease and the SNP-SNP pairs that independently influence the disease through the interaction between SNPs;

the odds ratio (OR) is an index of the strength of association between disease and exposure and, like the relative risk (RR), expresses the disease risk of exposed individuals as a multiple of that of unexposed individuals; multi-factor logistic regression is performed on the significantly associated SNPs and SNP-SNP pairs to obtain the SNPs and SNP-SNP pairs that independently influence Alzheimer's disease, together with their OR values, 95% confidence intervals and the constant term $\alpha$ of the logistic regression; the weight $\beta$ of each SNP and SNP-SNP pair is obtained by taking the natural logarithm of its OR value;

(5) establishing an improved wGRS model from the SNPs and SNP-SNP pairs that independently influence Alzheimer's disease; each SNP and SNP-SNP pair is taken as a variable S, and with the weight $\beta$ obtained for each SNP and SNP-SNP pair, the improved wGRS is the sum of the products of each variable and its weight, namely

$$\mathrm{wGRS} = \sum_{i=1}^{n} \beta_i S_i$$

where n is the number of SNPs and SNP-SNP pairs, $\beta_i$ is the weight of the i-th variable, and $S_i$ is the i-th variable; the weight $\beta_i$ is the natural logarithm of the OR value of the corresponding SNP or SNP-SNP pair; when all the SNPs and SNP-SNP pairs that independently influence Alzheimer's disease are included in the wGRS, the Alzheimer's disease risk model is

$$\operatorname{logit} P(D = 1 \mid G) = \alpha + \mathrm{wGRS}$$

where D = 1 denotes that the person has Alzheimer's disease, G denotes the person's SNP genotype data, P(D = 1 | G) is the probability, calculated from the person's SNP genotype data, that the person has Alzheimer's disease, and $\alpha$ is the constant term of the logistic regression;

(6) predicting risk of Alzheimer's disease;

to predict a person's risk of Alzheimer's disease, only the person's genotype data need to be measured; the risk is then calculated with the Alzheimer's disease risk model of step (5).
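The 0, 1, 2 scoring of step (2) can be sketched in Python as follows. This is a minimal illustration that assumes genotypes are given as two-letter strings and that the high-risk allele of each SNP is already known; the function name `score_genotype` and the example alleles are hypothetical.

```python
def score_genotype(genotype: str, risk_allele: str) -> int:
    """Return 0, 1 or 2: the number of copies of risk_allele in a diploid genotype."""
    alleles = genotype.upper()
    if len(alleles) != 2:
        raise ValueError(f"expected a two-letter genotype, got {genotype!r}")
    return alleles.count(risk_allele.upper())


# Hypothetical SNP whose high-risk allele is assumed to be "A".
scores = [score_genotype(g, "A") for g in ["AA", "AG", "GA", "GG"]]
print(scores)  # [2, 1, 1, 0]
```

The homozygous high-risk genotype scores 2, both heterozygous orderings score 1, and the homozygous low-risk genotype scores 0.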

Preferably, the quality control of the original SNP genotype data in the step (1) comprises the following specific steps:

1) removing SNPs with a minor allele frequency (MAF) of less than 0.02 from the original SNP genotype data;

2) removing SNPs that fail the Hardy-Weinberg equilibrium test;

3) requiring the typing success rate of each SNP across all samples to exceed 75%, and removing SNPs that fail this SNP typing success rate control;

4) for each sample in the genome-wide association analysis, requiring the typing success rate over all SNPs of the sample to exceed 75%, and removing from the analysis data any sample that fails this per-sample genotype missing-rate control;

5) removing SNPs located in a linkage disequilibrium region; the remaining SNP genotype data are carried forward to the next analysis step.
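The MAF and typing success rate filters listed above can be sketched in Python. The sketch assumes that genotypes are already coded 0, 1 or 2 with `None` marking a failed call, and that the SNP-level and sample-level filters are each evaluated on the full data set; the source does not state the order of application. The Hardy-Weinberg and linkage disequilibrium filters are omitted, and all names and data are hypothetical.

```python
from typing import List, Optional, Tuple

# Rows are samples, columns are SNPs; None marks a missing genotype call.
Genotypes = List[List[Optional[int]]]


def snp_call_rate(data: Genotypes, j: int) -> float:
    """Fraction of samples with a non-missing call at SNP column j."""
    return sum(row[j] is not None for row in data) / len(data)


def sample_call_rate(row: List[Optional[int]]) -> float:
    """Fraction of SNPs with a non-missing call in one sample."""
    return sum(g is not None for g in row) / len(row)


def minor_allele_frequency(data: Genotypes, j: int) -> float:
    """MAF at SNP column j, computed from the 0/1/2 counts of one allele."""
    calls = [row[j] for row in data if row[j] is not None]
    if not calls:
        return 0.0
    freq = sum(calls) / (2 * len(calls))
    return min(freq, 1.0 - freq)


def quality_control(
    data: Genotypes, maf_min: float = 0.02, call_rate_min: float = 0.75
) -> Tuple[List[int], List[int]]:
    """Return (kept sample indices, kept SNP indices)."""
    kept_snps = [
        j
        for j in range(len(data[0]))
        if minor_allele_frequency(data, j) >= maf_min
        and snp_call_rate(data, j) > call_rate_min
    ]
    kept_samples = [
        i for i, row in enumerate(data) if sample_call_rate(row) > call_rate_min
    ]
    return kept_samples, kept_snps


# Hypothetical data: 5 samples x 5 SNPs.
data = [
    [0, 1, 0, 1, 2],
    [1, 2, 0, None, 1],
    [2, 0, 0, None, 0],
    [1, 1, 0, 2, 1],
    [1, None, None, None, 1],
]
print(quality_control(data))  # ([0, 1, 2, 3], [0, 1, 4])
```

In this example the third SNP is removed because it is monomorphic, the fourth SNP because only 40% of samples were typed, and the last sample because only 40% of its SNPs were typed.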

Preferably, the step (3) specifically includes the following steps:

(1) after the SNP genotype data are scored, the SNP genotype of each sample is represented by 0, 1 or 2; in the single-factor logistic regression analysis, a single SNP is the independent variable, the disease status of the sample (0 or 1) is the dependent variable, and age and sex are covariates; the analysis yields the association level (p value), the odds ratio and the 95% confidence interval of the SNP with Alzheimer's disease; an SNP with an association level p < 0.05 is considered significantly associated with the disease and is retained;

(2) SNP-SNP pairs significantly associated with Alzheimer's disease after Bonferroni correction are obtained by the Lasso multiple regression method.
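The Bonferroni correction applied to the SNP-SNP pairs can be illustrated with a short Python sketch. The source does not state how the per-pair p values are derived from the Lasso multiple regression or how many tests are counted; the sketch assumes that an unadjusted p value is already available for every candidate pair and that the number of tests equals the number of candidate pairs. The SNP identifiers and p values are hypothetical.

```python
from itertools import combinations
from typing import Dict, Tuple

Pair = Tuple[str, str]


def bonferroni_adjust(p_values: Dict[Pair, float]) -> Dict[Pair, float]:
    """Bonferroni-adjusted p values: p * m, capped at 1, for m tests."""
    m = len(p_values)
    return {pair: min(1.0, p * m) for pair, p in p_values.items()}


snps = ["snpA", "snpB", "snpC"]                # hypothetical SNP identifiers
pairs = list(combinations(snps, 2))            # all candidate SNP-SNP pairs
raw_p = dict(zip(pairs, [0.001, 0.02, 0.30]))  # hypothetical unadjusted p values

adjusted = bonferroni_adjust(raw_p)
significant = [pair for pair, p in adjusted.items() if p < 0.05]
print(significant)  # [('snpA', 'snpB')]
```

With three candidate pairs, an unadjusted p value of 0.02 becomes 0.06 after correction and no longer passes the 0.05 level, which is the effect the correction is intended to have.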

Preferably, the step (4) specifically includes the following steps:

1) when multi-factor logistic regression is performed on the significantly associated SNPs and SNP-SNP pairs, each significantly associated SNP is represented by its score of 0, 1 or 2, each significantly associated SNP-SNP pair is represented by the product of the scores of its two SNPs, and each SNP and SNP-SNP pair is treated as one variable; the multi-factor logistic regression yields, for each variable, the association level (p value) with Alzheimer's disease, the odds ratio (OR) and the 95% confidence interval, as well as the constant term $\alpha$ of the logistic regression; variables with an association level p < 0.05 are considered to have an independent influence on Alzheimer's disease;

2) taking the natural logarithm of the OR value of each SNP and SNP-SNP pair to obtain its weight $\beta$, so that each SNP and each SNP-SNP pair has its own weight $\beta$.
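The two operations described above, representing an SNP-SNP pair by the product of the two scores and deriving a weight from an odds ratio, can be sketched in Python. The function names, the example scores and the odds ratios are hypothetical and serve only as an illustration.

```python
import math


def pair_variable(score_a: int, score_b: int) -> int:
    """An SNP-SNP pair is represented by the product of the two 0/1/2 scores."""
    return score_a * score_b


def weight_from_odds_ratio(odds_ratio: float) -> float:
    """Weight = ln(OR): OR > 1 gives a positive weight, OR < 1 a negative one."""
    if odds_ratio <= 0:
        raise ValueError("an odds ratio must be positive")
    return math.log(odds_ratio)


# Hypothetical individual: scores at two SNPs and the derived pair variable.
s1, s2 = 2, 1
variables = [s1, s2, pair_variable(s1, s2)]

# Hypothetical odds ratios for the three variables (not values from the patent).
weights = [weight_from_odds_ratio(x) for x in (1.5, 0.8, 2.0)]
print(variables, weights)
```

Because the pair variable is a product, it is 0 whenever either SNP carries no high-risk allele and reaches 4 only when both SNPs are homozygous for the high-risk allele.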

The invention has the beneficial effects that: the embodiment of the invention provides a method for constructing an Alzheimer disease onset risk prediction model, which provides an improved wGRS method based on the existing wGRS, and not only takes the action of a single SNP into consideration when calculating the wGRS, but also takes the interaction between the SNPs into consideration. The improved wGRS method can further improve the accuracy of the prediction of the Alzheimer disease onset risk. Therefore, the method considers the important influence of the interaction between the SNPs on the Alzheimer's disease, applies the interaction between the SNPs to the prediction of the onset risk of the Alzheimer's disease, and further improves the accuracy of the prediction of the onset risk of the Alzheimer's disease.

Drawings

FIG. 1 is a flow chart of a method of the present invention;

FIG. 2 is the ROC curve for prediction on the original samples.

Detailed Description

An embodiment of the present invention will be described in detail below with reference to the accompanying drawings, but it should be understood that the scope of the present invention is not limited to the embodiment.

As shown in FIG. 1, an embodiment of the present invention provides a method for constructing an Alzheimer's disease onset risk prediction model. When genotype data are used to predict Alzheimer's disease risk, the invention also uses the interactions between SNPs. The aim is to train an Alzheimer's disease risk model on the genotype data of Alzheimer's disease individuals and normal control individuals, and then to predict Alzheimer's disease risk from the model and the genotype data of the individual to be tested. The method comprises the following steps:

the quality control of the original SNP genotype data comprises the following specific steps:

1) in association studies, a small MAF reduces statistical power and leads to false negative results; SNPs with a minor allele frequency (MAF) of less than 0.02 are removed from the original SNP genotype data;

2) ideally, allele frequencies and genotype frequencies remain stable from generation to generation, that is, the population stays in genetic equilibrium; the significance level (p value) of the Hardy-Weinberg equilibrium test is 1 × 10^-6, and SNPs that fail the Hardy-Weinberg equilibrium test are removed during quality control of the original SNP genotype data;

3) the typing success rate of each SNP across all samples must generally exceed 75%, otherwise the SNP fails quality control; SNPs that fail this SNP typing success rate control are removed;

4) for each sample in the genome-wide association analysis, the typing success rate over all SNPs of the sample must generally exceed 75%, otherwise the sample fails quality control; samples that fail this per-sample genotype missing-rate control are removed from the analysis data;

5) SNPs located in a linkage disequilibrium region are removed during quality control of the original SNP genotype data; after quality control, the remaining SNP genotype data are carried forward to the next analysis step.
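The Hardy-Weinberg equilibrium check of item 2) can be sketched in Python. The source gives the significance level (1 × 10^-6) but does not name the test statistic; the sketch assumes the common chi-square goodness-of-fit test with one degree of freedom on the three genotype counts. The function names and example counts are hypothetical.

```python
import math


def hwe_p_value(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Chi-square (1 d.f.) p value for Hardy-Weinberg equilibrium from genotype counts."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotype calls")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic SNP: nothing to test
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_hwe(n_aa: int, n_ab: int, n_bb: int, threshold: float = 1e-6) -> bool:
    """Keep the SNP when the HWE p value is not below the threshold."""
    return hwe_p_value(n_aa, n_ab, n_bb) >= threshold


print(passes_hwe(25, 50, 25))  # counts match the expected proportions exactly
print(passes_hwe(50, 0, 50))   # no heterozygotes at all: strong deviation
```

An exact test is often preferred for small counts; the chi-square form is used here only because it can be written with the standard library alone.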

the step (3) specifically comprises the following steps:

a) after the SNP genotype data are scored, the SNP genotype of each sample is represented by 0, 1 or 2; in the single-factor logistic regression analysis, a single SNP is the independent variable, the disease status of the sample (0 or 1) is the dependent variable, and age and sex are covariates; the analysis yields the association level (p value), the odds ratio and the 95% confidence interval of the SNP with Alzheimer's disease; an SNP with an association level p < 0.05 is considered significantly associated with the disease and is retained;

b) SNP-SNP pairs significantly associated with Alzheimer's disease after Bonferroni correction are obtained by the Lasso multiple regression method.
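The single-factor logistic regression of item a) can be sketched in Python with the standard library only. This is a simplified illustration: it fits an intercept and one SNP score by Newton-Raphson and omits the age and sex covariates that the method specifies, so it is not the full analysis. The function name, the Wald test for the p value and the example data are assumptions.

```python
import math
from typing import Sequence, Tuple


def fit_single_snp(
    x: Sequence[int], y: Sequence[int], iterations: int = 25
) -> Tuple[float, float, float]:
    """Fit logit P(y=1) = a + b*x by Newton-Raphson; return (b, se_b, p_value)."""
    a = b = 0.0
    h00 = det = 1.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = p * (1.0 - p)
            g0 += yi - p            # gradient with respect to a
            g1 += (yi - p) * xi     # gradient with respect to b
            h00 += w                # entries of the information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    # Standard error from the information matrix of the last iteration;
    # once the fit has converged, the final update is negligible.
    se_b = math.sqrt(h00 / det)
    z = b / se_b
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided Wald test
    return b, se_b, p_value


# Hypothetical data: SNP scores 0/1/2 and disease status 0/1.
x = [0] * 40 + [1] * 40 + [2] * 12
y = [1] * 10 + [0] * 30 + [1] * 20 + [0] * 20 + [1] * 8 + [0] * 4

beta, se, p_value = fit_single_snp(x, y)
odds_ratio = math.exp(beta)
ci_95 = (math.exp(beta - 1.96 * se), math.exp(beta + 1.96 * se))
print(odds_ratio, ci_95, p_value, p_value < 0.05)
```

The fitted coefficient is the natural logarithm of the per-allele odds ratio, which is why the same quantity later serves as the weight of the variable.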

the step (4) specifically comprises the following steps:

n is the number of SNPs and SNP-SNP pairs, $\beta_i$ is the weight of the i-th variable, and $S_i$ is the i-th variable;

(6) predicting risk of Alzheimer's disease;

The embodiment is based on SNP genotype data of 229 Alzheimer's disease patients and 318 normal individuals from a Chinese population, downloaded from the following web page:

(http://journals.plos.org/plosone/article/asset?unique&id=info:doi/10.1371/journal.pone.0144898.s002)

One SNP that did not satisfy Hardy-Weinberg equilibrium was removed. All genotype data were converted to 0, 1 and 2 according to the number of high-risk alleles, with the aim of obtaining the SNPs significantly associated with Alzheimer's disease through single-factor logistic regression analysis. Because the genotype data contain no information on age, sex and the like, the 13 SNPs that the original authors found to be significantly associated with Alzheimer's disease after adjusting for age, sex and the like were adopted directly. The details are shown in Table 1:

Table 1. The 13 SNPs significantly associated with AD

The Lasso multiple regression (LMR) method was used to find SNP pairs significantly associated with Alzheimer's disease. The results show that the pairs rs6656401-rs3865444, rs28834970-rs6656401 and rs28834970-rs3865444 are significantly associated with AD (p < 0.05).

Multi-factor logistic regression was performed on the 13 significantly associated SNPs and the 3 SNP pairs to obtain the SNPs and SNP pairs that independently influence Alzheimer's disease (p < 0.05), together with their OR values and 95% confidence intervals (not adjusted for age, sex and the like). The corresponding weights $\beta$ were obtained by taking the natural logarithm of the OR values. Table 2 lists the SNPs and SNP pairs that independently influence AD.

Table 2. SNPs and SNP pairs independently affecting AD

The improved wGRS was therefore calculated from the SNPs and SNP pairs that independently influence Alzheimer's disease:

$$\mathrm{wGRS} = -0.456\,V_1 + 0.339\,V_2 - 0.464\,V_3 + 0.374\,V_4 - 0.754\,V_5 + 0.367\,V_6 + 0.667\,V_7 - 0.308\,V_8 - 0.398\,V_9 + 1.1\,V_{10}$$

and the Alzheimer's disease risk model is

$$\operatorname{logit} P(D = 1 \mid G) = 0.772 + \mathrm{wGRS}$$
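The model of this embodiment can be evaluated for one individual with the following Python sketch. The ten coefficients and the constant term are those stated above; the input vector is hypothetical, and the assignment of V1 to V10 to specific SNPs and SNP pairs is given by Table 2 and is not assumed here. The probability is obtained by inverting the logit, P = 1 / (1 + exp(-(α + wGRS))).

```python
import math
from typing import Sequence

# Coefficients of V1..V10 and the constant term, as stated in the embodiment.
BETAS = [-0.456, 0.339, -0.464, 0.374, -0.754, 0.367, 0.667, -0.308, -0.398, 1.1]
ALPHA = 0.772


def improved_wgrs(v: Sequence[float]) -> float:
    """Improved wGRS: sum of beta_i * V_i over the ten variables."""
    if len(v) != len(BETAS):
        raise ValueError(f"expected {len(BETAS)} variables, got {len(v)}")
    return sum(b * x for b, x in zip(BETAS, v))


def risk_probability(v: Sequence[float]) -> float:
    """Invert logit P(D=1|G) = ALPHA + wGRS to obtain P(D=1|G)."""
    return 1.0 / (1.0 + math.exp(-(ALPHA + improved_wgrs(v))))


# Hypothetical variable values for one individual (illustration only).
v = [1, 0, 2, 1, 0, 1, 0, 0, 1, 2]
print(improved_wgrs(v), risk_probability(v))
```

A single-SNP variable takes the values 0, 1 or 2, whereas a pair variable, being a product of two scores, can take the values 0, 1, 2 or 4.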

To test the predictive accuracy of this model, the improved wGRS was used to predict the original samples (229 Alzheimer's disease individuals and 318 normal control individuals). The prediction results are shown in Table 3:

Table 3. Prediction of the original samples by the improved wGRS (classification cutoff 0.5)

The corresponding ROC curve is shown in fig. 2.

The area under the ROC curve is 0.721 (95% CI: 0.679-0.764).
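The area under the ROC curve can be computed directly from predicted probabilities and true disease status, as in the following Python sketch. It uses the rank interpretation of the AUC (the probability that a randomly chosen case receives a higher score than a randomly chosen control, with ties counted as one half). The function name and data are hypothetical, and the source does not state how its AUC or confidence interval was computed.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the fraction of case-control pairs in which the case scores higher."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("both cases and controls are required")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5  # ties count as half a win
    return wins / (len(cases) * len(controls))


# Hypothetical predicted probabilities and disease status (1 = case, 0 = control).
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
labels = [1, 1, 1, 0, 0, 0]
print(roc_auc(scores, labels))
```

The pairwise comparison costs one operation per case-control pair, which remains small for 229 cases and 318 controls.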

If the influence of interactions between SNPs on the disease is not considered and a wGRS is built directly from the 13 significant SNPs to predict the original samples, the results are as shown in Table 4:

Table 4. Prediction of the original samples by wGRS (classification cutoff 0.5)

Thus, SNPs and SNP pairs significantly associated with Alzheimer's disease were used as disease-affecting factors, and the SNPs and SNP pairs that independently influence Alzheimer's disease, with their OR values, were obtained by multi-factor logistic regression. The accuracy of Alzheimer's disease risk prediction was 68.7% with the improved wGRS, compared with 66.4% when only the SNPs significantly associated with Alzheimer's disease were used and interactions between SNPs were not considered. The improved wGRS method provided by the invention takes into account the influence of interactions between SNPs on the onset of Alzheimer's disease and raises the prediction accuracy by 2.3 percentage points. If information such as age and sex were adjusted for in the multi-factor logistic regression that identifies the SNPs and SNP pairs independently influencing Alzheimer's disease, the improved wGRS would be expected to predict Alzheimer's disease risk more accurately.
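The accuracy figures above are the fraction of samples classified correctly at the 0.5 classification point. A minimal Python sketch follows; the function name and example data are hypothetical, and treating a probability of exactly 0.5 as a positive prediction is an assumption that the source does not specify.

```python
from typing import Sequence


def accuracy(
    probabilities: Sequence[float], labels: Sequence[int], cutoff: float = 0.5
) -> float:
    """Fraction of samples whose thresholded prediction equals the true status."""
    predictions = [1 if p >= cutoff else 0 for p in probabilities]
    correct = sum(int(pred == y) for pred, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical predicted probabilities and true disease status.
probs = [0.9, 0.4, 0.6, 0.2]
truth = [1, 1, 0, 0]
print(accuracy(probs, truth))

# Difference between the two accuracies reported above, in percentage points.
improvement = round(68.7 - 66.4, 1)
print(improvement)
```

Because both models are evaluated on the same samples that were used to fit them, the reported accuracies describe the fit to the original samples.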

In summary, the method for constructing the model for predicting the risk of developing alzheimer's disease provided by the embodiments of the present invention provides an improved wGRS method based on the existing wGRS, and not only the effect of a single SNP is considered in calculating the wGRS, but also the interaction between SNPs is considered. The improved wGRS method can further improve the accuracy of the prediction of the Alzheimer disease onset risk. Therefore, the method considers the important influence of the interaction between the SNPs on the Alzheimer's disease, applies the interaction between the SNPs to the prediction of the onset risk of the Alzheimer's disease, and further improves the accuracy of the prediction of the onset risk of the Alzheimer's disease.

The above disclosure is only for a few specific embodiments of the present invention, however, the present invention is not limited to the above embodiments, and any variations that can be made by those skilled in the art are intended to fall within the scope of the present invention.

Claims

1. A method for constructing an Alzheimer disease onset risk prediction model is characterized by comprising the following steps:

the step (3) specifically comprises the following steps:

1) after the SNP genotype data are scored, the SNP genotype of each sample is represented by 0, 1 or 2; in the single-factor logistic regression analysis, a single SNP is the independent variable, the disease status of the sample (0 or 1) is the dependent variable, and age and sex are covariates; the analysis yields the association level (p value), the odds ratio and the 95% confidence interval of the SNP with Alzheimer's disease; an SNP with an association level p < 0.05 is considered significantly associated with the disease and is retained;

2) obtaining SNP-SNP pairs significantly associated with Alzheimer's disease after Bonferroni correction by the Lasso multiple regression method;

(4) obtaining the SNPs that independently influence Alzheimer's disease and the SNP-SNP pairs that independently influence the disease through the interaction between SNPs;

the odds ratio (OR) is an index of the strength of association between disease and exposure and expresses the disease risk of exposed individuals as a multiple of that of unexposed individuals; multi-factor logistic regression is performed on the significantly associated SNPs and SNP-SNP pairs to obtain the SNPs and SNP-SNP pairs that independently influence Alzheimer's disease, together with their OR values, 95% confidence intervals and the constant term $\alpha$ of the logistic regression; the weight $\beta$ of each SNP and SNP-SNP pair is obtained by taking the natural logarithm of its OR value;

the step (4) specifically comprises the following steps:

2) taking the natural logarithm of the OR value of each SNP and SNP-SNP pair to obtain its weight $\beta$, so that each SNP and each SNP-SNP pair has its own weight $\beta$;

where n is the number of SNPs and SNP-SNP pairs, $\beta_i$ is the weight of the i-th variable, and $S_i$ is the i-th variable;

the weight $\beta_i$ is the natural logarithm of the OR value of the corresponding SNP or SNP-SNP pair that independently influences Alzheimer's disease; when all the SNPs and SNP-SNP pairs that independently influence Alzheimer's disease are included in the wGRS, the Alzheimer's disease risk model is

$$\operatorname{logit} P(D = 1 \mid G) = \alpha + \mathrm{wGRS}$$

where D = 1 denotes that the person has Alzheimer's disease, G denotes the person's SNP genotype data, P(D = 1 | G) is the probability, calculated from the person's SNP genotype data, that the person has Alzheimer's disease, and $\alpha$ is the constant term of the logistic regression; wherein

$$\mathrm{wGRS} = \sum_{i=1}^{n} \beta_i S_i$$

n is the number of SNPs and SNP-SNP pairs, $\beta_i$ is the weight of the i-th variable, and $S_i$ is the i-th variable.

2. The method for constructing an Alzheimer's disease onset risk prediction model according to claim 1, wherein the quality control of the original SNP genotype data in step (1) comprises the following steps:

2) removing SNPs that fail the Hardy-Weinberg equilibrium test;

4) for each sample in the genome-wide association analysis, requiring the typing success rate over all SNPs of the sample to exceed 75%, and removing from the analysis data any sample that fails this per-sample genotype missing-rate control;