CN106636398B - Construction method of Alzheimer disease onset risk prediction model - Google Patents


Info

Publication number
CN106636398B
Authority
CN
China
Prior art keywords
snp
disease
alzheimer
genotype data
alzheimer disease
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611190992.4A
Other languages
Chinese (zh)
Other versions
CN106636398A (en)
Inventor
蒋庆华
刘桂友
胡杨
王亚东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology
Priority to CN201611190992.4A
Publication of CN106636398A
Application granted
Publication of CN106636398B

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00: Oligonucleotides characterized by their use
    • C12Q2600/156: Polymorphic or mutational markers

Abstract

The invention belongs to the field of medical detection and discloses a construction method of an Alzheimer disease onset risk prediction model based on an improved weighted genetic risk score (wGRS) method. The method takes into account the important influence of the interaction between SNPs on Alzheimer's disease and applies this interaction to the prediction of onset risk, thereby further improving the accuracy of Alzheimer's disease onset risk prediction.

Description

Construction method of Alzheimer disease onset risk prediction model
Technical Field
The invention relates to the field of medical detection, in particular to a construction method of an Alzheimer disease onset risk prediction model.
Background
Alzheimer's disease is a degenerative disease of the nervous system, clinically characterized by manifestations of dementia such as memory decline and cognitive decline. Alzheimer's disease is currently regarded as the result of the combined action of genes and environmental factors, with genes playing the major role.
At present, the proportion of people suffering from Alzheimer's disease is rising year by year, seriously affecting daily life. In recent years, genome-wide association studies and candidate gene studies have identified a large number of polymorphic sites conferring susceptibility to Alzheimer's disease. It is therefore important to build a model from the genotype data of Alzheimer's disease individuals and normal control individuals and to use it to predict an individual's risk of developing Alzheimer's disease.
Once a person's genotype data have been determined, the model can be used to calculate that person's risk of developing Alzheimer's disease. If the risk is high, a plan for healthy living, exercise and balanced nutrition can be drawn up to reduce it.
Genetic risk scoring (GRS) is an effective method for analyzing the relationship between single nucleotide polymorphisms (SNPs) and the clinical phenotypes of complex diseases. A single SNP has only a weak effect on disease, and GRS integrates the weak effects of several SNPs. GRS assumes that every risk allele has the same effect on disease and simply sums the number of risk alleles. In practice the effects of different risk alleles are unlikely to be equal, which led to the weighted genetic risk score (wGRS).
The weighted GRS can be expressed as:
$$\mathrm{wGRS} = \sum_{i=1}^{n} w_i S_i$$
($w_i$ represents the weight of the i-th SNP, $S_i$ indicates the number of risk alleles of the i-th SNP, and n is the number of SNPs). The algorithm considers that each risk allele has a different influence on the disease and indicates the degree of influence of each SNP by assigning the corresponding weight to its risk allele; wGRS is therefore more widely applied than GRS in the prediction and evaluation of complex diseases.
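The difference between the unweighted and the weighted score can be illustrated with a minimal Python sketch. The function names, the three-SNP individual and the weights below are hypothetical and chosen only for illustration; they do not come from the patent.

```python
def grs(risk_allele_counts):
    """Unweighted GRS: total number of risk alleles across all SNPs."""
    return sum(risk_allele_counts)


def wgrs(risk_allele_counts, weights):
    """Weighted GRS: sum over SNPs of (weight of SNP i) * (risk alleles of SNP i)."""
    if len(risk_allele_counts) != len(weights):
        raise ValueError("one weight per SNP is required")
    return sum(w * s for w, s in zip(weights, risk_allele_counts))


# Hypothetical individual with three SNPs and made-up weights.
counts = [2, 0, 1]
weights = [0.30, 0.10, 0.50]
print(grs(counts))            # every risk allele counts equally
print(wgrs(counts, weights))  # each SNP contributes according to its weight
```

With equal weights of 1 the two scores coincide, which is one way to see GRS as a special case of wGRS.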
Current research shows that the interaction between SNPs has an important influence on the onset of Alzheimer's disease, yet this interaction is ignored when wGRS is used for risk prediction.
Disclosure of Invention
The present invention aims to overcome the problems in the prior art by providing a method for constructing an Alzheimer disease risk prediction model. Based on the genotype data of Alzheimer disease (AD) individuals and normal individuals, the method establishes a more accurate Alzheimer disease risk prediction model, which is then used together with an individual's genotype data to predict that individual's Alzheimer disease risk.
The technical scheme of the invention is as follows: a method for constructing an Alzheimer disease onset risk prediction model comprises the following steps:
(1) acquiring genotype data of an Alzheimer disease individual and a normal control individual;
for Alzheimer disease, the autosomes of a large number of Alzheimer disease patients and normal individuals are first sequenced to obtain the original SNP genotype data of both groups; quality control is then performed on the original SNP genotype data, removing SNPs that have a minor allele frequency (MAF) of less than 0.02, that fail the Hardy-Weinberg equilibrium test, that have a typing success rate of less than 75%, or that are located in a linkage disequilibrium region; in addition, the typing success rate over all SNPs of each sample must be more than 75%, and samples that fail this per-sample genotype missing-rate control are removed from the SNP genotype data; SNP genotype data meeting these conditions are retained for further analysis;
(2) after the SNP genotype data that fail quality control are removed, the retained SNP genotype data are scored; each SNP genotype is scored 0, 1 or 2 according to the number of high-risk alleles it contains, and the score is used to represent that genotype;
a genotype homozygous for the high-risk allele (two copies) is scored 2, a heterozygous genotype (one copy of the high-risk allele) is scored 1, and a genotype homozygous for the low-risk allele (no copies) is scored 0;
(3) screening out the SNPs that are significantly associated with Alzheimer disease and the SNP-SNP pairs that are significantly associated with the disease through the interaction between SNPs; an SNP whose association with Alzheimer disease has p < 0.05 is considered significantly associated with the disease;
disease status is coded 1 for Alzheimer disease patients and 0 for normal individuals; the SNPs significantly associated with Alzheimer disease are obtained by a single-factor logistic regression algorithm adjusted for age and sex, and the SNP-SNP pairs significantly associated with Alzheimer disease after Bonferroni correction are obtained by a Lasso multiple regression method;
(4) obtaining the SNPs that independently affect Alzheimer disease and the SNP-SNP pairs that independently affect the disease through the interaction between SNPs;
the odds ratio (OR) is an index of the strength of association between disease and exposure and, like the relative risk (RR), indicates how many times higher the disease risk of exposed individuals is than that of unexposed individuals; multi-factor logistic regression is performed on the significantly associated SNPs and SNP-SNP pairs to obtain the SNPs and SNP-SNP pairs that independently affect Alzheimer's disease, together with their OR values, 95% confidence intervals and the constant term alpha of the logistic regression; the weight value beta of each SNP and SNP-SNP pair is obtained by taking the natural logarithm of its OR value;
(5) establishing an improved wGRS model from the SNPs and SNP-SNP pairs that independently affect Alzheimer disease; each SNP and each SNP-SNP pair is taken as a variable S and, using the weight value beta obtained for each of them, the improved wGRS is expressed as the sum of the products of each variable and its weight, namely
$$\mathrm{wGRS} = \sum_{i=1}^{n} \beta_i S_i$$
where n is the number of SNPs and SNP-SNP pairs, $\beta_i$ represents the weight value of the i-th variable and $S_i$ represents the i-th variable; the weight $\beta_i$ is obtained by taking the natural logarithm of the OR value of the corresponding SNP or SNP-SNP pair that independently affects Alzheimer disease; when all SNPs and SNP-SNP pairs that independently affect Alzheimer disease are included in the wGRS model, the Alzheimer disease risk model is $\operatorname{logit} P(D = 1 \mid G) = \alpha + \mathrm{wGRS}$, where D = 1 denotes that the person has Alzheimer disease, G denotes the SNP genotype data of the person, $P(D = 1 \mid G)$ is the probability, calculated from the person's SNP genotype data, that the person has Alzheimer disease, and $\alpha$ is the constant term of the logistic regression; wherein
$$\mathrm{wGRS} = \sum_{i=1}^{n} \beta_i S_i$$
n is the number of SNPs and SNP-SNP pairs, $\beta_i$ represents the weight value of the i-th variable, and $S_i$ represents the i-th variable;
(6) predicting risk of Alzheimer's disease;
to predict a person's risk of Alzheimer's disease, only the person's genotype data need to be measured; the risk is then calculated with the Alzheimer's disease risk model of step (5).
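Steps (5) and (6) together imply a direct way to turn the score into a probability. The following worked derivation uses only the symbols defined in step (5) and the standard inverse of the logit function; it is offered as a reading aid rather than as part of the claimed method.

```latex
\begin{aligned}
\operatorname{logit} P(D = 1 \mid G)
  &= \ln\frac{P(D = 1 \mid G)}{1 - P(D = 1 \mid G)}
   = \alpha + \mathrm{wGRS}
   = \alpha + \sum_{i=1}^{n} \beta_i S_i, \\
P(D = 1 \mid G)
  &= \frac{1}{1 + e^{-(\alpha + \mathrm{wGRS})}} .
\end{aligned}
```

One consequence is that a predicted probability of at least 0.5 corresponds exactly to $\alpha + \mathrm{wGRS} \ge 0$.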
Preferably, the quality control of the original SNP genotype data in the step (1) comprises the following specific steps:
1) removing SNPs whose minor allele frequency (MAF) is less than 0.02 from the original SNP genotype data;
2) removing SNPs that fail the Hardy-Weinberg equilibrium test;
3) requiring the typing success rate of each SNP across all samples to be more than 75%, and removing SNPs that fail this SNP typing success rate control;
4) for genome-wide association analysis, each sample is also checked: in general, the typing success rate over all SNPs of a sample must be more than 75%, and samples that fail this per-sample genotype missing-rate control are removed from the analysis data;
5) removing SNPs located in a linkage disequilibrium region; the remaining SNP genotype data proceed to the next step of analysis.
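The frequency and call-rate filters in items 1), 3) and 4) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: genotypes are coded as the number of copies (0, 1 or 2) of one allele with `None` for a failed call, samples are filtered before SNPs (the text does not fix an order), and the Hardy-Weinberg and linkage disequilibrium filters are left out. All function names are hypothetical.

```python
def call_rate(calls):
    """Fraction of non-missing genotype calls in a list (0.0 for an empty list)."""
    if not calls:
        return 0.0
    return sum(g is not None for g in calls) / len(calls)


def minor_allele_frequency(column):
    """MAF of one SNP from 0/1/2 allele counts, ignoring missing calls."""
    called = [g for g in column if g is not None]
    if not called:
        return 0.0
    freq = sum(called) / (2 * len(called))
    return min(freq, 1.0 - freq)


def quality_control(genotypes, maf_min=0.02, call_rate_min=0.75):
    """genotypes[sample][snp] is 0, 1, 2 or None.

    Returns (kept_sample_indices, kept_snp_indices).
    Assumption: samples are filtered first, then SNPs on the kept samples.
    """
    n_snps = len(genotypes[0])
    kept_samples = [i for i, row in enumerate(genotypes)
                    if call_rate(row) > call_rate_min]
    kept_snps = []
    for j in range(n_snps):
        column = [genotypes[i][j] for i in kept_samples]
        if (call_rate(column) > call_rate_min
                and minor_allele_frequency(column) >= maf_min):
            kept_snps.append(j)
    return kept_samples, kept_snps


# Hypothetical toy data: 5 samples x 5 SNPs.
toy = [
    [0, 1, 2, 1, 0],
    [0, 2, 1, None, 1],
    [0, 1, 0, None, 2],
    [0, 0, 1, 1, 1],
    [None, None, None, 2, None],
]
print(quality_control(toy))
```

In this toy data the last sample is dropped for its low call rate, the first SNP is dropped because it is monomorphic, and the fourth SNP is dropped because half of its calls are missing.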
Preferably, the step (3) specifically includes the following steps:
(1) after scoring, the SNP genotype of each sample is represented by 0, 1 or 2; in the single-factor logistic regression analysis, a single SNP is taken as the independent variable, the disease status of the sample (0 or 1) as the dependent variable, and age and sex as covariates; this yields the association level (p value), the odds ratio and the 95% confidence interval of the SNP with respect to Alzheimer's disease; an SNP with an association level p < 0.05 is considered significantly associated with the disease and is retained;
(2) the SNP-SNP pairs significantly associated with Alzheimer's disease after Bonferroni correction are obtained by the Lasso multiple regression method.
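The single-SNP association test in item (1) can be sketched with a small Newton-Raphson fit. This is a simplified illustration, not the procedure used in the patent: it omits the age and sex covariates, uses a two-sided Wald test for the p value, and runs on made-up data. The function name and the toy sample are assumptions.

```python
import math


def single_snp_logistic(scores, status, iterations=25):
    """Fit logit P(status = 1) = a + b * score by Newton-Raphson.

    Returns (odds_ratio, ci_low, ci_high, p_value) for the SNP.
    Assumes the data are not perfectly separated by the score.
    """
    a = b = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(scores, status):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            w = p * (1.0 - p)
            g0 += y - p          # gradient with respect to the intercept
            g1 += (y - p) * x    # gradient with respect to the slope
            h00 += w             # entries of the information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    # After convergence the last information matrix is effectively the one
    # at the optimum, so it is reused for the standard error of the slope.
    se = math.sqrt(h00 / det)
    z = b / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return (math.exp(b), math.exp(b - 1.96 * se),
            math.exp(b + 1.96 * se), p_value)


# Hypothetical sample: 10 controls (status 0) followed by 10 cases (status 1).
scores = [0] * 6 + [1] * 3 + [2] * 1 + [0] * 1 + [1] * 3 + [2] * 6
status = [0] * 10 + [1] * 10
odds_ratio, low, high, p_value = single_snp_logistic(scores, status)
print(odds_ratio, (low, high), p_value)
```

Under the p < 0.05 rule quoted above, this hypothetical SNP would be retained for the next step.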
Preferably, the step (4) specifically includes the following steps:
1) in the multi-factor logistic regression analysis of the significantly associated SNPs and SNP-SNP pairs, each significantly associated SNP is represented by its score 0, 1 or 2, each significantly associated SNP-SNP pair is represented by the product of the scores of its two SNPs, and each such SNP or SNP-SNP pair is treated as one variable; the multi-factor logistic regression yields, for each variable, the association level p value, the OR value and the 95% confidence interval with respect to Alzheimer's disease, as well as the constant term alpha of the logistic regression; variables with an association level p < 0.05 are considered to have an independent effect on Alzheimer's disease;
2) the natural logarithm of the OR value of each SNP and each SNP-SNP pair is taken to obtain its weight value beta, so that every SNP and SNP-SNP pair has its own weight value beta.
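How the model variables and weights of step (4) fit together can be sketched as below. The helper names are hypothetical, the genotype scores and the odds ratio of 1.5 are made-up values, and the rs identifiers are used only as example labels.

```python
import math


def build_variables(genotype_scores, snp_ids, pairs):
    """Model variables for one individual.

    Each single SNP keeps its 0/1/2 score; each SNP-SNP pair is represented
    by the product of the scores of its two SNPs.
    """
    variables = {snp: genotype_scores[snp] for snp in snp_ids}
    for first, second in pairs:
        variables[f"{first}-{second}"] = (
            genotype_scores[first] * genotype_scores[second]
        )
    return variables


def weight_from_odds_ratio(odds_ratio):
    """Weight beta of a variable: the natural logarithm of its odds ratio."""
    return math.log(odds_ratio)


# Hypothetical genotype scores for one individual.
scores = {"rs6656401": 2, "rs3865444": 1, "rs28834970": 0}
pairs = [("rs6656401", "rs3865444"), ("rs28834970", "rs6656401")]
print(build_variables(scores, list(scores), pairs))
print(weight_from_odds_ratio(1.5))  # made-up OR, positive weight
```

An OR above 1 gives a positive weight, an OR below 1 a negative weight, and an OR of exactly 1 a weight of zero, so a variable with no effect drops out of the score.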
The invention has the following beneficial effects: building on the existing wGRS, the embodiment of the invention provides an improved wGRS method for constructing an Alzheimer disease onset risk prediction model, in which the wGRS takes into account not only the effect of each single SNP but also the interaction between SNPs. Because the interaction between SNPs has an important influence on Alzheimer's disease, applying it to the prediction of onset risk further improves the accuracy of that prediction.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is the ROC curve of the prediction for the original samples.
Detailed Description
An embodiment of the present invention will be described in detail below with reference to the accompanying drawings, but it should be understood that the scope of the present invention is not limited to the embodiment.
As shown in FIG. 1, an embodiment of the present invention provides a method for constructing an Alzheimer disease onset risk prediction model that uses the interaction between SNPs when predicting Alzheimer disease risk from genotype data. An Alzheimer disease risk model is first obtained by training on the genotype data of Alzheimer disease individuals and normal control individuals, and the model is then used together with the genotype data of an individual to be tested to predict that individual's Alzheimer disease risk. The method comprises the following steps:
(1) acquiring genotype data of an Alzheimer disease individual and a normal control individual;
for Alzheimer disease, the autosomes of a large number of Alzheimer disease patients and normal individuals are first sequenced to obtain the original SNP genotype data of both groups; quality control is then performed on the original SNP genotype data, removing SNPs that have a minor allele frequency (MAF) of less than 0.02, that fail the Hardy-Weinberg equilibrium test, that have a typing success rate of less than 75%, or that are located in a linkage disequilibrium region; in addition, the typing success rate over all SNPs of each sample must be more than 75%, and samples that fail this per-sample genotype missing-rate control are removed from the SNP genotype data; SNP genotype data meeting these conditions are retained for further analysis;
the quality control of the original SNP genotype data comprises the following specific steps:
1) in association studies, a small MAF reduces statistical power and leads to false negative results; SNPs whose minor allele frequency (MAF) is less than 0.02 are therefore removed from the original SNP genotype data;
2) ideally, allele frequencies and genotype frequencies remain stable during inheritance, i.e. genetic equilibrium is maintained; the significance level (p value) of the Hardy-Weinberg equilibrium test is $1 \times 10^{-6}$, and SNPs that fail the Hardy-Weinberg equilibrium test are removed during quality control of the original SNP genotype data;
3) in general, the typing success rate of each SNP across all samples must be more than 75%, otherwise the SNP does not pass quality control; SNPs that fail this SNP typing success rate control are removed;
4) for genome-wide association analysis, each sample is also checked: in general, the typing success rate over all SNPs of a sample must be more than 75%, otherwise the sample does not pass quality control; samples that fail this per-sample genotype missing-rate control are removed from the analysis data;
5) SNPs located in a linkage disequilibrium region are removed during quality control of the original SNP genotype data; after quality control, the remaining SNP genotype data proceed to the next step of analysis.
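The Hardy-Weinberg filter in item 2) can be sketched with a one-degree-of-freedom chi-square test on genotype counts. This is an assumption about the form of the test, since the text names only the equilibrium test and its significance level; the sketch also assumes that an SNP is removed when its p value falls below that level. The function names and the example counts are hypothetical.

```python
import math


def hardy_weinberg_p_value(n_aa, n_ab, n_bb):
    """Chi-square test (1 d.f.) of Hardy-Weinberg equilibrium.

    Arguments are the observed counts of the three genotypes AA, AB and BB.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele B
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-square variable with 1 d.f.: erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_hwe(n_aa, n_ab, n_bb, threshold=1e-6):
    """Keep the SNP unless its HWE p value is below the threshold."""
    return hardy_weinberg_p_value(n_aa, n_ab, n_bb) >= threshold


print(passes_hwe(25, 50, 25))  # counts exactly at equilibrium proportions
print(passes_hwe(50, 0, 50))   # no heterozygotes at all
```

For small samples or rare alleles an exact test is often preferred over the chi-square approximation; the sketch uses the approximation only because it needs nothing beyond the standard library.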
(2) After the SNP genotype data that fail quality control are removed, the retained SNP genotype data are scored; each SNP genotype is scored 0, 1 or 2 according to the number of high-risk alleles it contains, and the score is used to represent that genotype;
a genotype homozygous for the high-risk allele (two copies) is scored 2, a heterozygous genotype (one copy of the high-risk allele) is scored 1, and a genotype homozygous for the low-risk allele (no copies) is scored 0;
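The 0/1/2 scoring rule can be written as a short Python function. The two-letter genotype strings and the function name are assumptions made for this illustration; the patent does not specify a data format.

```python
def risk_allele_score(genotype, risk_allele):
    """Number of copies (0, 1 or 2) of the high-risk allele in a genotype.

    `genotype` is a two-character string such as "AG"; `risk_allele` is the
    single-character high-risk allele for that SNP.
    """
    if len(genotype) != 2:
        raise ValueError("expected a diploid genotype with two alleles")
    return sum(1 for allele in genotype if allele == risk_allele)


# Hypothetical SNP whose high-risk allele is "A".
for genotype in ("AA", "AG", "GG"):
    print(genotype, risk_allele_score(genotype, "A"))
```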
(3) Screening out the SNPs that are significantly associated with Alzheimer disease and the SNP-SNP pairs that are significantly associated with the disease through the interaction between SNPs; an SNP whose association with Alzheimer's disease has p < 0.05 is considered significantly associated with the disease;
disease status is coded 1 for Alzheimer disease patients and 0 for normal individuals; the SNPs significantly associated with Alzheimer disease are obtained by a single-factor logistic regression algorithm adjusted for age and sex, and the SNP-SNP pairs significantly associated with Alzheimer disease after Bonferroni correction are obtained by a Lasso multiple regression (LMR) method;
the step (3) specifically comprises the following steps:
a) after scoring, the SNP genotype of each sample is represented by 0, 1 or 2; in the single-factor logistic regression analysis, a single SNP is taken as the independent variable, the disease status of the sample (0 or 1) as the dependent variable, and age and sex as covariates; this yields the association level (p value), the odds ratio and the 95% confidence interval of the SNP with respect to Alzheimer's disease; an SNP with an association level p < 0.05 is considered significantly associated with the disease and is retained;
b) the SNP-SNP pairs significantly associated with Alzheimer's disease after Bonferroni correction are obtained by the Lasso multiple regression method.
(4) Obtaining the SNPs that independently affect Alzheimer disease and the SNP-SNP pairs that independently affect the disease through the interaction between SNPs;
the odds ratio (OR) is an index of the strength of association between disease and exposure and, like the relative risk (RR), indicates how many times higher the disease risk of exposed individuals is than that of unexposed individuals; multi-factor logistic regression is performed on the significantly associated SNPs and SNP-SNP pairs to obtain the SNPs and SNP-SNP pairs that independently affect Alzheimer's disease, together with their OR values, 95% confidence intervals and the constant term alpha of the logistic regression; the weight value beta of each SNP and SNP-SNP pair is obtained by taking the natural logarithm of its OR value;
the step (4) specifically comprises the following steps:
1) in the multi-factor logistic regression analysis of the significantly associated SNPs and SNP-SNP pairs, each significantly associated SNP is represented by its score 0, 1 or 2, each significantly associated SNP-SNP pair is represented by the product of the scores of its two SNPs, and each such SNP or SNP-SNP pair is treated as one variable; the multi-factor logistic regression yields, for each variable, the association level p value, the OR value and the 95% confidence interval with respect to Alzheimer's disease, as well as the constant term alpha of the logistic regression; variables with an association level p < 0.05 are considered to have an independent effect on Alzheimer's disease;
2) the natural logarithm of the OR value of each SNP and each SNP-SNP pair is taken to obtain its weight value beta, so that every SNP and SNP-SNP pair has its own weight value beta.
(5) Establishing an improved wGRS model from the SNPs and SNP-SNP pairs that independently affect Alzheimer disease; each SNP and each SNP-SNP pair is taken as a variable S and, using the weight value beta obtained for each of them, the improved wGRS is expressed as the sum of the products of each variable and its weight, namely
$$\mathrm{wGRS} = \sum_{i=1}^{n} \beta_i S_i$$
where n is the number of SNPs and SNP-SNP pairs, $\beta_i$ represents the weight value of the i-th variable and $S_i$ represents the i-th variable; the weight $\beta_i$ is obtained by taking the natural logarithm of the OR value of the corresponding SNP or SNP-SNP pair that independently affects Alzheimer disease; when all SNPs and SNP-SNP pairs that independently affect Alzheimer disease are included in the wGRS model, the Alzheimer disease risk model is $\operatorname{logit} P(D = 1 \mid G) = \alpha + \mathrm{wGRS}$, where D = 1 denotes that the person has Alzheimer disease, G denotes the SNP genotype data of the person, $P(D = 1 \mid G)$ is the probability, calculated from the person's SNP genotype data, that the person has Alzheimer disease, and $\alpha$ is the constant term of the logistic regression; wherein
$$\mathrm{wGRS} = \sum_{i=1}^{n} \beta_i S_i$$
n is the number of SNPs and SNP-SNP pairs, $\beta_i$ represents the weight value of the i-th variable, and $S_i$ represents the i-th variable;
(6) predicting risk of Alzheimer's disease;
to predict a person's risk of Alzheimer's disease, only the person's genotype data need to be measured; the risk is then calculated with the Alzheimer's disease risk model of step (5).
In this embodiment, genotype data were downloaded from the following web page:
(http://journals.plos.org/plosone/article/asset?unique&id=info:doi/10.1371/journal.pone.0144898.s002). The data cover the SNP genotypes of 229 Alzheimer's disease individuals and 318 normal individuals from a Chinese population; one SNP that did not satisfy Hardy-Weinberg equilibrium was removed. All genotype data were converted to 0, 1 or 2 according to the number of high-risk alleles, and the SNPs significantly associated with Alzheimer's disease were obtained by single-factor logistic regression analysis. Because the genotype data contain no information on age, sex and so on, the 13 SNPs that the original authors found to be significantly associated with Alzheimer's disease after correcting for age, sex and other information were adopted directly. The detailed information is shown in Table 1:
TABLE 1 The 13 SNPs significantly associated with AD
Figure GDA0002838711380000101
The LMR method was used to find the SNP pairs significantly associated with Alzheimer's disease; the results show that rs6656401-rs3865444, rs28834970-rs6656401 and rs28834970-rs3865444 are significantly associated with AD (p < 0.05).
Multi-factor logistic regression was carried out on the 13 significantly associated SNPs and the 3 SNP pairs to obtain the SNPs and SNP pairs that independently affect Alzheimer's disease (p < 0.05), together with the corresponding OR values and 95% confidence intervals (not corrected for age, sex or similar information); the corresponding weights beta were obtained by taking the natural logarithm of the OR values. Table 2 lists the SNPs and SNP pairs that independently affect AD.
TABLE 2 SNPs and SNP pairs independently affecting AD
Figure GDA0002838711380000111
Thus, the improved wGRS is calculated from the SNPs and SNP pairs that independently affect Alzheimer's disease:
$$\mathrm{wGRS} = V_1 \times (-0.456) + V_2 \times 0.339 + V_3 \times (-0.464) + V_4 \times 0.374 + V_5 \times (-0.754) + V_6 \times 0.367 + V_7 \times 0.667 + V_8 \times (-0.308) + V_9 \times (-0.398) + V_{10} \times 1.1$$
and the Alzheimer's disease risk model is
$$\operatorname{logit} P(D = 1 \mid G) = 0.772 + \mathrm{wGRS}$$
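As a minimal sketch, the fitted model can be evaluated in Python as shown below. The ten weights and the constant term are the values quoted in the text, with the garbled V9 term read as a weight of -0.398 and the constant read as +0.772 (both readings are assumptions). It is also assumed that V1 to V10 are the variables of Table 2, each an SNP score or an SNP-SNP product; the function names and the example inputs are hypothetical.

```python
import math

# Weights for V1..V10 and the constant term as quoted in the text
# (the V9 weight is read as -0.398 and the constant as +0.772).
WEIGHTS = [-0.456, 0.339, -0.464, 0.374, -0.754,
           0.367, 0.667, -0.308, -0.398, 1.1]
ALPHA = 0.772


def improved_wgrs(values):
    """Improved wGRS: sum of weight * variable over V1..V10."""
    if len(values) != len(WEIGHTS):
        raise ValueError("expected one value per model variable")
    return sum(w * v for w, v in zip(WEIGHTS, values))


def ad_probability(values):
    """P(D = 1 | G) from logit P = ALPHA + wGRS."""
    return 1.0 / (1.0 + math.exp(-(ALPHA + improved_wgrs(values))))


def classify(values, cutoff=0.5):
    """1 (predicted Alzheimer's disease) if the probability reaches the cutoff."""
    return 1 if ad_probability(values) >= cutoff else 0


# Hypothetical input: every variable equal to 0.
baseline = [0] * 10
print(improved_wgrs(baseline), ad_probability(baseline), classify(baseline))
```

Whether a probability exactly equal to the classification point counts as a positive prediction is not stated in the text; the sketch treats it as positive.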
To test the predictive accuracy of this model, the improved wGRS was used to predict the original samples (229 Alzheimer's disease individuals and 318 normal control individuals); the prediction results are shown in Table 3:
TABLE 3 Prediction results of the improved wGRS on the original samples (classification point 0.5)
Figure GDA0002838711380000112
The corresponding ROC curve is shown in fig. 2.
The area under the ROC curve (AUC) was 0.721 (95% CI: 0.679-0.764).
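The area under the ROC curve can be computed without plotting, as the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen control. The sketch below uses this pairwise definition on made-up labels and scores; it does not reproduce the reported value of 0.721, which depends on the original data, and the function name is hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of case-control pairs ranked correctly.

    `labels` holds 1 for cases and 0 for controls; ties count as one half.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Hypothetical predicted probabilities for two cases and two controls.
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

The pairwise loop is quadratic in the sample size, which is acceptable for a few hundred individuals; a rank-based formula would be the usual choice for larger data sets.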
If the influence of the interaction between SNPs on the disease is not considered and a wGRS built directly from the 13 significant SNPs is used to predict the original samples, the results are as shown in Table 4:
TABLE 4 Prediction results of the wGRS on the original samples (classification point 0.5)
Figure GDA0002838711380000121
In this embodiment, therefore, the SNPs and SNP pairs significantly associated with Alzheimer's disease were used as disease-affecting factors, and the SNPs and SNP pairs that independently affect Alzheimer's disease, with their OR values, were obtained by multi-factor logistic regression. The accuracy of Alzheimer's disease risk prediction with the improved wGRS was 68.7%. The accuracy when only the SNPs significantly associated with Alzheimer's disease were used, without considering the interaction between SNPs, was 66.4%. The improved wGRS method provided by the invention thus takes the influence of SNP interactions on the onset of Alzheimer's disease into account and raises the prediction accuracy by 2.3 percentage points. If age, sex and similar information were corrected for when performing the multi-factor logistic regression to obtain the SNPs and SNP pairs that independently affect Alzheimer's disease, the improved wGRS would be expected to predict Alzheimer's disease risk more accurately still.
In summary, the method for constructing an Alzheimer's disease onset risk prediction model provided by the embodiments of the present invention builds on the existing wGRS to provide an improved wGRS method, in which the wGRS takes into account not only the effect of each single SNP but also the interaction between SNPs. Because the interaction between SNPs has an important influence on Alzheimer's disease, applying it to the prediction of onset risk further improves the accuracy of that prediction.
The above disclosure is only for a few specific embodiments of the present invention, however, the present invention is not limited to the above embodiments, and any variations that can be made by those skilled in the art are intended to fall within the scope of the present invention.

Claims (2)

1. A method for constructing an Alzheimer disease onset risk prediction model is characterized by comprising the following steps:
(1) acquiring genotype data of an Alzheimer disease individual and a normal control individual;
for Alzheimer disease, the autosomes of a large number of Alzheimer disease patients and normal individuals are first sequenced to obtain the original SNP genotype data of both groups; quality control is then performed on the original SNP genotype data, removing SNPs that have a minor allele frequency (MAF) of less than 0.02, that fail the Hardy-Weinberg equilibrium test, that have a typing success rate of less than 75%, or that are located in a linkage disequilibrium region; in addition, the typing success rate over all SNPs of each sample must be more than 75%, and samples that fail this per-sample genotype missing-rate control are removed from the SNP genotype data; SNP genotype data meeting these conditions are retained for further analysis;
(2) after the SNP genotype data that fail quality control are removed, the retained SNP genotype data are scored; each SNP genotype is scored 0, 1 or 2 according to the number of high-risk alleles it contains, and the score is used to represent that genotype;
a genotype homozygous for the high-risk allele (two copies) is scored 2, a heterozygous genotype (one copy of the high-risk allele) is scored 1, and a genotype homozygous for the low-risk allele (no copies) is scored 0;
(3) screening out the SNPs that are significantly associated with Alzheimer disease and the SNP-SNP pairs that are significantly associated with the disease through the interaction between SNPs; an SNP whose association with Alzheimer's disease has p < 0.05 is considered significantly associated with the disease;
disease status is coded 1 for Alzheimer disease patients and 0 for normal individuals; the SNPs significantly associated with Alzheimer disease are obtained by a single-factor logistic regression algorithm adjusted for age and sex, and the SNP-SNP pairs significantly associated with Alzheimer disease after Bonferroni correction are obtained by a Lasso multiple regression method;
the step (3) specifically comprises the following steps:
1) after the SNP genotype data are scored, the SNP genotype of each sample is represented by 0, 1 and 2; when carrying out single-factor logistic regression analysis, taking a single SNP as an independent variable, taking the diseased states 0 and 1 of a sample as a dependent variable, and taking the age and the sex as covariates; obtaining the association level, the ratio and the 95% confidence interval of the SNP and the Alzheimer disease; (ii) remaining if an SNP with an association level p <0.05 with Alzheimer's disease is considered to be significantly associated with the disease;
2) obtaining SNP-SNP pairs which are obviously related to Alzheimer disease after Bonferroni correction by utilizing a Lasso multiple regression method;
(4) obtaining SNPs that have an independent effect on Alzheimer's disease and SNP-SNP pairs that have an independent effect on the disease through the interaction between the SNPs;
the odds ratio (OR) is an index of the strength of association between a disease and an exposure, and indicates how many times higher the disease risk of exposed individuals is than that of unexposed individuals; multi-factor logistic regression analysis is performed on the significantly associated SNPs and SNP-SNP pairs to obtain the SNPs and SNP-SNP pairs that independently affect Alzheimer's disease, their OR values, their 95% confidence intervals and the logistic regression constant term α; the weight β of each SNP and SNP-SNP pair is obtained by taking the natural logarithm of its OR value;
step (4) specifically comprises the following steps:
1) in the multi-factor logistic regression analysis of the significantly associated SNPs and SNP-SNP pairs, the genotype data of each significantly associated SNP are represented by 0, 1 or 2, each significantly associated SNP-SNP pair is represented by the product of the genotype scores of its two SNPs, and each significantly associated SNP and SNP-SNP pair is treated as one variable; the association level (p value), the OR value and the 95% confidence interval between each variable and Alzheimer's disease, together with the constant term α of the logistic regression, are obtained by the multi-factor logistic regression algorithm; variables with an association level of p < 0.05 are considered to have an independent effect on Alzheimer's disease;
2) taking the natural logarithm of the OR value of each SNP and each SNP-SNP pair to obtain its weight β, so that each SNP and each SNP-SNP pair has its own corresponding weight β;
(5) establishing an improved wGRS model using the SNPs and SNP-SNP pairs that have an independent effect on Alzheimer's disease; taking each SNP and SNP-SNP pair as a variable S, and using the obtained weight β of each SNP and SNP-SNP pair, the improved wGRS model is expressed as the sum of the products of each variable and its weight, namely
$$\mathrm{wGRS} = \sum_{i=1}^{n} \beta_i S_i$$
where n is the number of SNPs and SNP-SNP pairs, β_i is the weight of the i-th variable, and S_i is the i-th variable;
the weight β_i of each SNP and SNP-SNP pair that has an independent effect on Alzheimer's disease is obtained by taking the natural logarithm of its OR value; when all SNPs and SNP-SNP pairs that have an independent effect on Alzheimer's disease are included in the wGRS model, the model of Alzheimer's disease risk is logit P(D = 1 | G) = α + wGRS, where D = 1 denotes that the individual has Alzheimer's disease, G denotes the SNP genotype data of the individual, P(D = 1 | G) is the probability, calculated from the individual's SNP genotype data, that the individual has Alzheimer's disease, and α is the constant term of the logistic regression; wherein
$$\mathrm{wGRS} = \sum_{i=1}^{n} \beta_i S_i$$
n is the number of SNPs and SNP-SNP pairs, β_i is the weight of the i-th variable, and S_i is the i-th variable.
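As an illustration only, and not part of the claim, the scoring and risk calculation of steps (2), (4) and (5) can be sketched in Python. The function names, the odds ratios, the constant term and the example genotypes below are all hypothetical values chosen for demonstration; in practice the odds ratios and the constant term would come from the multi-factor logistic regression of step (4).

```python
import math

def score_genotype(genotype, risk_allele):
    """Score a two-allele genotype as 0, 1 or 2 copies of the high-risk allele."""
    return sum(1 for allele in genotype if allele == risk_allele)

def wgrs(variables, odds_ratios):
    """Weighted genetic risk score: sum of beta_i * S_i with beta_i = ln(OR_i)."""
    return sum(math.log(or_i) * s_i for or_i, s_i in zip(odds_ratios, variables))

def risk_probability(alpha, score):
    """Invert logit P = alpha + wGRS to obtain the probability P."""
    return 1.0 / (1.0 + math.exp(-(alpha + score)))

# Hypothetical individual with two SNPs and one SNP-SNP pair.
snp_a = score_genotype("AG", risk_allele="A")  # heterozygous
snp_b = score_genotype("TT", risk_allele="T")  # homozygous high-risk
variables = [snp_a, snp_b, snp_a * snp_b]      # the pair is the product of the two scores

# Hypothetical odds ratios and constant term (assumptions, not values from the patent).
odds_ratios = [1.5, 1.2, 1.1]
alpha = -2.0

score = wgrs(variables, odds_ratios)
probability = risk_probability(alpha, score)
print(f"wGRS = {score:.4f}, P(D = 1 | G) = {probability:.4f}")
```

Representing an SNP-SNP pair as the product of the two genotype scores follows step (4); the logistic inverse simply converts the logit back to a probability between 0 and 1.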
2. The method for constructing an Alzheimer's disease onset risk prediction model according to claim 1, wherein the quality control of the original SNP genotype data in step (1) comprises the following steps:
1) removing SNPs with a minor allele frequency (MAF) of less than 0.02 from the original SNP genotype data;
2) removing SNPs that fail the Hardy-Weinberg equilibrium test;
3) requiring the typing success rate of each SNP across all samples to be more than 75%, and removing SNPs that do not meet this SNP typing success rate control;
4) for genome-wide association analysis, requiring the typing success rate over all SNPs of each sample to be tested to be more than 75%; when quality control is performed on the SNP genotype data of the samples, samples that do not meet this per-sample genotype missing-rate control are removed from the analysis data;
5) removing SNPs located in linkage disequilibrium regions; the remaining SNP genotype data are used for further analysis.
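As an illustration only, and not part of the claims, quality-control steps 1) to 4) can be sketched in Python. Several things here are assumptions: genotypes are coded as 0/1/2 allele counts with None for a failed call, the Hardy-Weinberg test is a one-degree-of-freedom chi-square test with a hypothetical significance threshold of 1e-4 (the claims do not state one), a rate exactly equal to 75% is kept, and linkage disequilibrium pruning (step 5) is omitted. All function names and the toy data are hypothetical.

```python
import math

def call_rate(calls):
    """Fraction of non-missing genotype calls (None marks a failed call)."""
    return sum(c is not None for c in calls) / len(calls)

def minor_allele_frequency(calls):
    """MAF computed from 0/1/2 allele counts, ignoring missing calls."""
    observed = [c for c in calls if c is not None]
    freq = sum(observed) / (2 * len(observed))
    return min(freq, 1.0 - freq)

def hwe_p_value(calls):
    """Chi-square (1 d.f.) goodness-of-fit test for Hardy-Weinberg equilibrium."""
    observed = [c for c in calls if c is not None]
    n = len(observed)
    p = sum(observed) / (2 * n)
    q = 1.0 - p
    expected = [n * q * q, 2 * n * p * q, n * p * p]
    counts = [observed.count(g) for g in (0, 1, 2)]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(counts, expected) if e > 0)
    return math.erfc(math.sqrt(chi2 / 2.0))  # upper tail of chi-square with 1 d.f.

def quality_control(snps, maf_min=0.02, call_min=0.75, hwe_alpha=1e-4):
    """Return indices of SNPs and samples that pass the filters.

    `snps` is a list of per-SNP call lists, one entry per sample.
    The call-rate check runs first, so the later checks never see an empty SNP.
    """
    kept_snps = [
        i for i, calls in enumerate(snps)
        if call_rate(calls) >= call_min
        and minor_allele_frequency(calls) >= maf_min
        and hwe_p_value(calls) >= hwe_alpha
    ]
    n_samples = len(snps[0])
    kept_samples = [
        j for j in range(n_samples)
        if call_rate([calls[j] for calls in snps]) >= call_min
    ]
    return kept_snps, kept_samples

# Toy data: 5 SNPs typed in 8 samples (hypothetical values).
snps = [
    [0, 1, 2, 1, 0, 1, 1, 2],
    [0, 0, 0, 0, 0, 0, 0, 0],           # monomorphic, fails the MAF filter
    [1, None, None, None, 0, 1, 2, 1],  # low typing success rate
    [1, 1, 0, None, 1, 2, 1, 0],
    [2, 1, 0, 1, 1, 0, 1, 2],
]
kept_snps, kept_samples = quality_control(snps)
print(kept_snps, kept_samples)
```

In this sketch the per-sample typing success rate is computed over all SNPs before any SNP is removed; a real pipeline might apply the filters in a different order or iterate them.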
CN201611190992.4A 2016-12-21 2016-12-21 Construction method of Alzheimer disease onset risk prediction model Active CN106636398B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611190992.4A CN106636398B (en) 2016-12-21 2016-12-21 Construction method of Alzheimer disease onset risk prediction model

Publications (2)

Publication Number Publication Date
CN106636398A CN106636398A (en) 2017-05-10
CN106636398B true CN106636398B (en) 2021-01-29

Family

ID=58834537

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611190992.4A Active CN106636398B (en) 2016-12-21 2016-12-21 Construction method of Alzheimer disease onset risk prediction model

Country Status (1)

Country Link
CN (1) CN106636398B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109280695A (en) * 2017-07-20 2019-01-29 浙江金华中科分数生命科技有限公司 Utilize the polygenes score analysis method of human-body biological Samples Estimates complex disease onset risk
CN108172296A (en) * 2018-01-23 2018-06-15 上海其明信息技术有限公司 A kind of method for building up of database and the Risk Forecast Method of genetic disease
CN108256293A (en) * 2018-02-09 2018-07-06 哈尔滨工业大学深圳研究生院 A kind of statistical method and system of the disease association assortment of genes
CN108897985A (en) * 2018-05-04 2018-11-27 上海市内分泌代谢病研究所 A kind of method and its application of Glycohemoglobin HbA1c genetic locus scoring
CN108913776B (en) * 2018-08-14 2023-03-17 天佳吉瑞基因科技有限公司 Screening method and kit for DNA molecular markers related to radiotherapy and chemotherapy injury
CN109712716B (en) * 2018-12-25 2021-08-31 广州医科大学附属第一医院 Disease influence factor determination method, system and computer equipment
CN109468376A (en) * 2018-12-29 2019-03-15 青海省人民医院 Acute and chronic altitude sickness tumor susceptibility gene early warning detection kit
CN110349623A (en) * 2019-01-17 2019-10-18 哈尔滨工业大学 Based on the senile dementia ospc gene and site selection method for improving Mendelian randomization
US11621087B2 (en) * 2019-09-24 2023-04-04 International Business Machines Corporation Machine learning for amyloid and tau pathology prediction
CN111180012A (en) * 2019-12-27 2020-05-19 哈尔滨工业大学 Gene identification method based on empirical Bayes and Mendelian randomized fusion
CN112280863B (en) * 2020-11-06 2024-01-12 南京普恩瑞生物科技有限公司 Method and kit for targeting drug apatinib effectiveness
CN112489801A (en) * 2020-12-04 2021-03-12 北京睿思昆宁科技有限公司 Method, device and equipment for determining disease risk
CN113160887B (en) * 2021-04-23 2022-06-14 哈尔滨工业大学 Screening method of tumor neoantigen fused with single cell TCR sequencing data
CN113506631A (en) * 2021-08-06 2021-10-15 中国医学科学院基础医学研究所 Risk prediction method for improving diagnosis accuracy of chronic obstructive pulmonary acute exacerbation state

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103154272A (en) * 2010-08-25 2013-06-12 香港中文大学 Methods and kits for predicting the risk of diabetes associated complications using genetic markers and arrays
WO2016061246A1 (en) * 2014-10-14 2016-04-21 Wake Forest University Health Sciences Methods and compositions for correlating genetic markers with cancer risk

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
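Evaluation of genetic risk score models in the presence of interaction and linkage disequilibrium; Ronglin Che et al.; 《ORIGINAL RESEARCH ARTICLE》; 20130723; Vol. 4; pp. 1-10 *
Methods and strategies for genetic risk prediction using lung cancer GWAS data; Duan Weiwei et al.; 《Chinese Journal of Health Statistics》; 20150825; Vol. 32, No. 04; pp. 554-557 *
Comparison of type 2 diabetes onset risk prediction models based on environmental and genetic risk; Zhang Liuwei et al.; 《Chinese Journal of Prevention and Control of Chronic Diseases》; 20160215; Vol. 24, No. 02; pp. 84-88 *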

Also Published As

Publication number Publication date
CN106636398A (en) 2017-05-10

Similar Documents

Publication Publication Date Title
CN106636398B (en) Construction method of Alzheimer disease onset risk prediction model
Okada et al. Deep whole-genome sequencing reveals recent selection signatures linked to evolution and disease risk of Japanese
Guo et al. Global genetic differentiation of complex traits shaped by natural selection in humans
Lin et al. DNA methylation levels at individual age-associated CpG sites can be indicative for life expectancy
Tan et al. Twin methodology in epigenetic studies
US20200027557A1 (en) Multimodal modeling systems and methods for predicting and managing dementia risk for individuals
Kim et al. Quantitative measures of healthy aging and biological age
WO2008067551A2 (en) Genetic analysis systems and methods
Liu et al. Identification of genetic and epigenetic marks involved in population structure
Wu et al. A novel method for identifying nonlinear gene–environment interactions in case–control association studies
Timmins et al. Genome-wide association study of self-reported walking pace suggests beneficial effects of brisk walking on health and survival
Li et al. ATOM: a powerful gene-based association test by combining optimally weighted markers
Branch et al. The genetic basis of spatial cognitive variation in a food-caching bird
Iwasaki et al. Inclusion of a genetic risk score into a validated risk prediction model for colorectal cancer in Japanese men improves performance
US20220367063A1 (en) Polygenic risk score for in vitro fertilization
Zhu et al. Detection of copy number variation and selection signatures on the X chromosome in Chinese indigenous sheep with different types of tail
Varón-González et al. Epistasis regulates the developmental stability of the mouse craniofacial shape
CN116486913B (en) System, apparatus and medium for de novo predictive regulatory mutations based on single cell sequencing
Cummings et al. Genome-wide scan identifies a quantitative trait locus at 4p15. 3 for serum urate
Knutson et al. MATS: a novel multi-ancestry transcriptome-wide association study to account for heterogeneity in the effects of cis-regulated gene expression on complex traits
Nustad et al. Modeling dependency structures in 450k DNA methylation data
Xu et al. The interplay between host genetics and the gut microbiome reveals common and distinct microbiome features for human complex diseases
Tournoud et al. A strategy to build and validate a prognostic biomarker model based on RT-qPCR gene expression and clinical covariates
Boomsma Twin, association and current “omics” studies
Yan et al. GWAS-based machine learning for prediction of age-related macular degeneration Risk

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant