CN106636398B - Construction method of Alzheimer disease onset risk prediction model - Google Patents
Construction method of Alzheimer disease onset risk prediction model
- Publication number
- CN106636398B (application CN201611190992.4A)
- Authority
- CN
- China
- Prior art keywords
- snp
- disease
- alzheimer
- genotype data
- alzheimer disease
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6876—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
- C12Q1/6883—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q2600/00—Oligonucleotides characterized by their use
- C12Q2600/156—Polymorphic or mutational markers
Abstract
The invention belongs to the field of medical detection, and particularly discloses a construction method of an Alzheimer disease onset risk prediction model. The improved wGRS method can further improve the accuracy of the prediction of the Alzheimer disease onset risk. Therefore, the method considers the important influence of the interaction between the SNPs on the Alzheimer's disease, applies the interaction between the SNPs to the prediction of the onset risk of the Alzheimer's disease, and further improves the accuracy of the prediction of the onset risk of the Alzheimer's disease.
Description
Technical Field
The invention relates to the field of medical detection, in particular to a construction method of an Alzheimer disease onset risk prediction model.
Background
Alzheimer's disease is a degenerative disease of the nervous system, clinically characterized by dementia symptoms such as memory loss and cognitive decline. Current science regards Alzheimer's disease as the result of the combined action of genetic and environmental factors, among which genetic factors play the major role.
At present, the proportion of people suffering from Alzheimer's disease is rising year by year, and the disease seriously affects daily life. In recent years, genome-wide association studies and candidate gene studies have found a large number of polymorphic loci associated with susceptibility to Alzheimer's disease. It is therefore important to build a model from the genotype data of Alzheimer's disease individuals and normal control individuals and to use it to predict an individual's risk of developing Alzheimer's disease.
Once a person's genotype data have been determined, the model can be used to calculate that person's risk of developing Alzheimer's disease. If the risk is high, a plan for healthy living, exercise and balanced nutrition should be drawn up in order to reduce it.
The genetic risk score (GRS) is an effective method for analyzing the relationship between single nucleotide polymorphisms (SNPs) and the clinical phenotypes of complex diseases. A single SNP has only a weak effect on disease, and the method combines the weak effects of several SNPs. The GRS assumes that every risk allele has the same effect on disease and simply adds up the number of risk alleles. In practice, the effects of different risk alleles are unlikely to be equal, which led to the weighted genetic risk score (wGRS).
The weighted GRS can be expressed as:

$$\mathrm{wGRS} = \sum_{i=1}^{n} \beta_i S_i$$

where $\beta_i$ represents the weight of the i-th SNP, $S_i$ indicates the number of risk alleles of the i-th SNP, and n is the number of SNPs. The algorithm assumes that each risk allele influences the disease to a different degree and expresses that degree by assigning each risk allele a corresponding weight. The wGRS is more widely used than the GRS in the prediction and evaluation of complex diseases.
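The difference between the unweighted and the weighted score can be illustrated with a minimal Python sketch. The function names and the three example SNPs with their weights are hypothetical and serve only to show the arithmetic; they are not values from the patent.

```python
from typing import Sequence


def grs(risk_allele_counts: Sequence[int]) -> int:
    """Unweighted GRS: total number of risk alleles across all SNPs."""
    return sum(risk_allele_counts)


def wgrs(weights: Sequence[float], risk_allele_counts: Sequence[int]) -> float:
    """Weighted GRS: sum over all SNPs of beta_i * S_i."""
    if len(weights) != len(risk_allele_counts):
        raise ValueError("weights and counts must have the same length")
    return sum(b * s for b, s in zip(weights, risk_allele_counts))


# Hypothetical individual: risk-allele counts S_i for three SNPs.
counts = [2, 0, 1]
# Hypothetical weights beta_i (made up for illustration).
betas = [0.10, 0.25, 0.40]

print("GRS :", grs(counts))
print("wGRS:", wgrs(betas, counts))
```

Two individuals with the same number of risk alleles receive the same GRS, whereas their wGRS differs when the alleles they carry have different weights.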
Current research shows that interactions between SNPs have an important influence on the onset of Alzheimer's disease, yet the wGRS ignores these interactions when it is used for risk prediction.
Disclosure of Invention
The present invention is directed to overcoming the problems in the prior art, and providing a method for constructing an Alzheimer disease risk prediction model, which is based on the genotype data of Alzheimer Disease (AD) individuals and normal individuals, establishes a more accurate Alzheimer disease risk prediction model, and predicts the Alzheimer disease risk by using the model and the genotype data of the individuals.
The technical scheme of the invention is as follows: a method for constructing an Alzheimer disease onset risk prediction model comprises the following steps:
(1) acquiring genotype data of an Alzheimer disease individual and a normal control individual;
for Alzheimer's disease, gene sequencing is first performed on the autosomes of a large number of Alzheimer's disease patients and normal individuals to obtain their original SNP genotype data; quality control is performed on the original SNP genotype data, removing SNPs whose minor allele frequency (MAF) is less than 0.02, SNPs that fail the Hardy-Weinberg equilibrium test, SNPs whose typing success rate is less than 75%, and SNPs located in a linkage disequilibrium region; for each sample, the typing success rate over all SNPs must be greater than 75%, otherwise the sample fails the per-sample genotype missing-rate control and is removed from the SNP genotype data; SNP genotype data meeting these conditions are retained for further analysis;
(2) after SNP genotype data which do not meet the control conditions are removed, scoring is carried out on the retained SNP genotype data; scoring the SNP genotype data by 0, 1 and 2 according to the number of high-risk alleles contained in the SNP genotype data, and expressing the corresponding SNP genotype data by adopting the scores of 0, 1 and 2;
for SNP genotype data, a genotype homozygous for the high-risk allele (two high-risk alleles) is scored 2, a heterozygous genotype (one high-risk allele) is scored 1, and a genotype homozygous for the low-risk allele (no high-risk allele) is scored 0;
(3) SNPs whose association with Alzheimer's disease reaches p < 0.05 are considered significantly associated with the disease; SNPs significantly associated with Alzheimer's disease, and SNP-SNP pairs significantly associated with the disease through interaction between the SNPs, are screened out;
1 denotes an Alzheimer's disease patient and 0 denotes a normal control; SNPs significantly associated with Alzheimer's disease after adjustment for age and sex are obtained by single-factor logistic regression, and SNP-SNP pairs significantly associated with Alzheimer's disease after Bonferroni correction are obtained by the Lasso multiple regression method;
(4) obtaining SNPs that independently influence Alzheimer's disease and SNP-SNP pairs that independently influence the disease through interaction between the SNPs;
the odds ratio (OR) is an index of the strength of association between disease and exposure: it is the ratio of the odds of disease among exposed individuals to the odds among unexposed individuals and, like the relative risk (RR), indicates how much more likely the disease is among the exposed; multi-factor logistic regression analysis is carried out on the significantly associated SNPs and SNP-SNP pairs to obtain the SNPs and SNP-SNP pairs that independently influence Alzheimer's disease, together with their OR values, 95% confidence intervals and the logistic regression constant term α; the weight β of each SNP and SNP-SNP pair is obtained by taking the natural logarithm of its OR value;
(5) establishing an improved wGRS model using the SNPs and SNP-SNP pairs that independently influence Alzheimer's disease; each SNP and SNP-SNP pair is taken as a variable S and, using the weight β obtained for each SNP and SNP-SNP pair, the improved wGRS model is expressed as the sum of the products of each variable and its weight, namely

$$\mathrm{wGRS} = \sum_{i=1}^{n} \beta_i S_i$$

where n is the number of SNPs and SNP-SNP pairs, $\beta_i$ is the weight of the i-th variable, and $S_i$ is the i-th variable; the weight $\beta_i$ is the natural logarithm of the OR value of the corresponding SNP or SNP-SNP pair; when all SNPs and SNP-SNP pairs that independently influence Alzheimer's disease are included in the wGRS model, the Alzheimer's disease risk model is

$$\operatorname{logit} P(D = 1 \mid G) = \alpha + \mathrm{wGRS}$$

where D = 1 denotes that the person has Alzheimer's disease, G denotes the person's SNP genotype data, P(D = 1 | G) is the probability that the person has Alzheimer's disease as calculated from the person's SNP genotype data, and α is the constant term of the logistic regression;
(6) predicting risk of Alzheimer's disease;
to predict a person's risk of Alzheimer's disease, only that person's genotype data need to be measured; the risk of that person suffering from Alzheimer's disease is then calculated with the Alzheimer's disease risk model of step (5).
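Step (6) amounts to inverting the logit of step (5), i.e. P = 1 / (1 + exp(-(α + wGRS))). A minimal Python sketch of this conversion follows; the function name `risk_probability` is a hypothetical helper, and the numeric inputs in the usage lines are made up for illustration.

```python
import math


def risk_probability(alpha: float, wgrs: float) -> float:
    """Invert logit P = alpha + wGRS to obtain P in (0, 1)."""
    z = alpha + wgrs
    # Use the numerically stable form of the logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Hypothetical inputs: a constant term and a score for one individual.
print(risk_probability(alpha=0.0, wgrs=0.0))   # logit of 0 corresponds to P = 0.5
print(risk_probability(alpha=-1.0, wgrs=2.5))
```

A larger wGRS always gives a larger predicted probability, because the logistic function is strictly increasing.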
Preferably, the quality control of the original SNP genotype data in the step (1) comprises the following specific steps:
1) removing from the original SNP genotype data the SNPs whose minor allele frequency (MAF) is less than 0.02;
2) removing SNPs that fail the Hardy-Weinberg equilibrium test;
3) the typing success rate of a given SNP across all samples needs to be greater than 75%; SNPs that fail this SNP typing success rate control are removed;
4) in genome-wide association analysis, each sample is also checked: generally, the typing success rate over all SNPs of a sample needs to be greater than 75%, and samples that fail this per-sample genotype missing-rate control are removed from the analysis data;
5) removing SNPs located in a linkage disequilibrium region; the remaining SNP genotype data proceed to the next analysis step.
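The per-SNP and per-sample filters of steps 1) to 4) can be sketched in Python as follows. This is an illustration under stated assumptions, not the patent's implementation: genotypes are assumed to be already coded as 0, 1 or 2 copies of one allele with `None` for a failed call; the Hardy-Weinberg test is assumed to be a one-degree-of-freedom chi-square goodness-of-fit test (the patent does not name the test); the significance threshold of 1 × 10⁻⁶ is the one given in the detailed description; a rate of exactly 75% is treated as passing; and linkage disequilibrium pruning (step 5) is not shown. All function names are hypothetical.

```python
import math
from typing import Optional, Sequence

Genotype = Optional[int]  # 0, 1 or 2 copies of one allele; None = failed call


def call_rate(genos: Sequence[Genotype]) -> float:
    """Fraction of successfully typed genotypes (non-empty input assumed)."""
    return sum(g is not None for g in genos) / len(genos)


def minor_allele_frequency(genos: Sequence[Genotype]) -> float:
    """Frequency of the rarer allele among the called genotypes."""
    called = [g for g in genos if g is not None]
    freq = sum(called) / (2 * len(called))
    return min(freq, 1.0 - freq)


def hwe_p_value(genos: Sequence[Genotype]) -> float:
    """Chi-square (1 df) p-value for deviation from Hardy-Weinberg proportions."""
    called = [g for g in genos if g is not None]
    n = len(called)
    observed = [called.count(k) for k in (0, 1, 2)]
    p = (observed[1] + 2 * observed[2]) / (2 * n)
    q = 1.0 - p
    expected = [n * q * q, 2 * n * p * q, n * p * p]
    if min(expected) == 0:  # monomorphic SNP: nothing to test
        return 1.0
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 df.
    return math.erfc(math.sqrt(chi2 / 2.0))


def snp_passes_qc(genos: Sequence[Genotype], min_maf: float = 0.02,
                  min_call_rate: float = 0.75, hwe_alpha: float = 1e-6) -> bool:
    """Apply the call-rate, MAF and Hardy-Weinberg filters to one SNP."""
    if call_rate(genos) < min_call_rate:
        return False
    if minor_allele_frequency(genos) < min_maf:
        return False
    return hwe_p_value(genos) >= hwe_alpha


def sample_passes_qc(sample_genos: Sequence[Genotype],
                     min_call_rate: float = 0.75) -> bool:
    """Per-sample check: typing success rate over all SNPs of that sample."""
    return call_rate(sample_genos) >= min_call_rate
```

`snp_passes_qc` is called once per SNP with that SNP's genotypes across all samples, and `sample_passes_qc` once per sample with that sample's genotypes across all SNPs.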
Preferably, the step (3) specifically includes the following steps:
(1) after the SNP genotype data have been scored, the SNP genotype of each sample is represented by 0, 1 or 2; in the single-factor logistic regression analysis, a single SNP is taken as the independent variable, the disease status of the sample (0 or 1) as the dependent variable, and age and sex as covariates; this yields the association level (p value), the odds ratio and the 95% confidence interval of the SNP with respect to Alzheimer's disease; an SNP whose association with Alzheimer's disease reaches p < 0.05 is considered significantly associated with the disease and is retained;
(2) SNP-SNP pairs significantly associated with Alzheimer's disease after Bonferroni correction are obtained by the Lasso multiple regression method.
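The Bonferroni correction mentioned in sub-step (2) divides the significance level by the number of tests performed. The patent does not state over how many tests the correction is applied; the Python sketch below assumes, purely for illustration, that every pairwise combination of the candidate SNPs is tested once. The function names and the p-values in the usage lines are hypothetical.

```python
from itertools import combinations
from typing import Dict, Hashable, List, Sequence, Tuple


def candidate_pairs(snp_ids: Sequence[Hashable]) -> List[Tuple[Hashable, Hashable]]:
    """All unordered SNP-SNP pairs that could be tested for interaction."""
    return list(combinations(snp_ids, 2))


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold after Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def bonferroni_significant(p_values: Dict[Hashable, float],
                           alpha: float = 0.05) -> Dict[Hashable, bool]:
    """Flag each test whose raw p-value is below alpha / number of tests."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return {name: p < threshold for name, p in p_values.items()}


# With 13 candidate SNPs there are 78 possible pairs.
pairs = candidate_pairs([f"snp{i}" for i in range(1, 14)])
print(len(pairs), bonferroni_threshold(0.05, len(pairs)))

# Hypothetical raw p-values for two pair tests.
print(bonferroni_significant({"pairA": 0.0001, "pairB": 0.04}))
```

An equivalent formulation multiplies each raw p-value by the number of tests and compares the result with 0.05.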
Preferably, the step (4) specifically includes the following steps:
1) in the multi-factor logistic regression analysis of the significantly associated SNPs and SNP-SNP pairs, the genotype data of each significantly associated SNP are represented by 0, 1 or 2, each significantly associated SNP-SNP pair is represented by the product of the genotype scores of its two SNPs, and each significantly associated SNP and SNP-SNP pair is treated as one variable; the multi-factor logistic regression yields, for each variable, the association level (p value), the odds ratio (OR) and the 95% confidence interval with respect to Alzheimer's disease, as well as the constant term α of the logistic regression; variables with p < 0.05 are considered to have an independent influence on Alzheimer's disease;
2) taking the natural logarithm of the OR value of each SNP and each SNP-SNP pair gives the weight β of that SNP or SNP-SNP pair, so that every SNP and SNP-SNP pair has its own weight β.
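The two operations of this step, forming an interaction variable as the product of two genotype scores and converting an odds ratio into a weight, can be sketched in Python as follows. The helper names and the genotype scores in the usage lines are hypothetical; the two rs identifiers are taken from the embodiment only to label the example.

```python
import math
from typing import Dict, List, Sequence, Tuple


def weight_from_odds_ratio(odds_ratio: float) -> float:
    """beta = ln(OR); OR > 1 gives a positive weight, OR < 1 a negative one."""
    if odds_ratio <= 0:
        raise ValueError("odds ratio must be positive")
    return math.log(odds_ratio)


def build_variables(scores: Dict[str, int],
                    single_snps: Sequence[str],
                    snp_pairs: Sequence[Tuple[str, str]]) -> List[int]:
    """Return the variable vector S: single-SNP scores followed by pair products."""
    values = [scores[snp] for snp in single_snps]
    # A pair variable is the product of the two 0/1/2 scores (0, 1, 2 or 4).
    values += [scores[a] * scores[b] for a, b in snp_pairs]
    return values


# Hypothetical genotype scores for one individual.
scores = {"rs6656401": 2, "rs3865444": 1}
print(build_variables(scores, ["rs6656401"], [("rs6656401", "rs3865444")]))
print(weight_from_odds_ratio(1.0))  # OR = 1 means no effect, so beta = 0
```

Because the pair variable is a product, it is zero whenever either SNP carries no high-risk allele, so the interaction weight contributes only when both SNPs carry at least one.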
The invention has the beneficial effects that: the embodiment of the invention provides a method for constructing an Alzheimer disease onset risk prediction model, which provides an improved wGRS method based on the existing wGRS, and not only takes the action of a single SNP into consideration when calculating the wGRS, but also takes the interaction between the SNPs into consideration. The improved wGRS method can further improve the accuracy of the prediction of the Alzheimer disease onset risk. Therefore, the method considers the important influence of the interaction between the SNPs on the Alzheimer's disease, applies the interaction between the SNPs to the prediction of the onset risk of the Alzheimer's disease, and further improves the accuracy of the prediction of the onset risk of the Alzheimer's disease.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is the ROC curve of the predictions for the original samples.
Detailed Description
An embodiment of the present invention will be described in detail below with reference to the accompanying drawings, but it should be understood that the scope of the present invention is not limited to the embodiment.
As shown in fig. 1, an embodiment of the present invention provides a method for constructing an alzheimer disease onset risk prediction model, and when predicting an alzheimer disease risk by using genotype data, the present invention predicts an alzheimer disease onset risk by using an interaction relationship between SNPs; the invention aims to obtain an Alzheimer disease risk model by utilizing the genotype data of an Alzheimer disease individual and a normal control individual through training, and then predict the Alzheimer disease risk by utilizing the model and the genotype data of an individual to be detected. The method comprises the following steps:
(1) acquiring genotype data of an Alzheimer disease individual and a normal control individual;
for Alzheimer's disease, gene sequencing is first performed on the autosomes of a large number of Alzheimer's disease patients and normal individuals to obtain their original SNP genotype data; quality control is performed on the original SNP genotype data, removing SNPs whose minor allele frequency (MAF) is less than 0.02, SNPs that fail the Hardy-Weinberg equilibrium test, SNPs whose typing success rate is less than 75%, and SNPs located in a linkage disequilibrium region; for each sample, the typing success rate over all SNPs must be greater than 75%, otherwise the sample fails the per-sample genotype missing-rate control and is removed from the SNP genotype data; SNP genotype data meeting these conditions are retained for further analysis;
the quality control of the original SNP genotype data comprises the following specific steps:
1) in association studies, a small MAF reduces statistical power and leads to false-negative results; SNPs whose minor allele frequency (MAF) is less than 0.02 are therefore removed from the original SNP genotype data;
2) ideally, allele frequencies and genotype frequencies remain stable from generation to generation, i.e., the population maintains genetic equilibrium; the significance threshold (p value) of the Hardy-Weinberg equilibrium test is $1 \times 10^{-6}$, and SNPs that fail the Hardy-Weinberg equilibrium test are removed;
3) generally, the typing success rate of a given SNP across all samples needs to be greater than 75%, otherwise the SNP fails quality control; SNPs that fail this SNP typing success rate control are removed;
4) in genome-wide association analysis, each sample is also checked: generally, the typing success rate over all SNPs of a sample needs to be greater than 75%, otherwise the sample fails quality control; samples that fail this per-sample genotype missing-rate control are removed from the analysis data;
5) SNPs located in a linkage disequilibrium region are removed; after quality control, the remaining SNP genotype data proceed to the next analysis step.
(2) After SNP genotype data which do not meet the control conditions are removed, scoring is carried out on the retained SNP genotype data; scoring the SNP genotype data by 0, 1 and 2 according to the number of high-risk alleles contained in the SNP genotype data, and expressing the corresponding SNP genotype data by adopting the scores of 0, 1 and 2;
for SNP genotype data, a genotype homozygous for the high-risk allele (two high-risk alleles) is scored 2, a heterozygous genotype (one high-risk allele) is scored 1, and a genotype homozygous for the low-risk allele (no high-risk allele) is scored 0;
(3) SNPs whose association with Alzheimer's disease reaches p < 0.05 are considered significantly associated with the disease; SNPs significantly associated with Alzheimer's disease, and SNP-SNP pairs significantly associated with the disease through interaction between the SNPs, are screened out;
1 denotes an Alzheimer's disease patient and 0 denotes a normal control; SNPs significantly associated with Alzheimer's disease after adjustment for age and sex are obtained by single-factor logistic regression, and SNP-SNP pairs significantly associated with Alzheimer's disease after Bonferroni correction are obtained by the Lasso multiple regression method;
the step (3) specifically comprises the following steps:
a) after the SNP genotype data have been scored, the SNP genotype of each sample is represented by 0, 1 or 2; in the single-factor logistic regression analysis, a single SNP is taken as the independent variable, the disease status of the sample (0 or 1) as the dependent variable, and age and sex as covariates; this yields the association level (p value), the odds ratio and the 95% confidence interval of the SNP with respect to Alzheimer's disease; an SNP whose association with Alzheimer's disease reaches p < 0.05 is considered significantly associated with the disease and is retained;
b) SNP-SNP pairs significantly associated with Alzheimer's disease after Bonferroni correction are obtained by the Lasso multiple regression method.
(4) Obtaining SNPs that independently influence Alzheimer's disease and SNP-SNP pairs that independently influence the disease through interaction between the SNPs;
the odds ratio (OR) is an index of the strength of association between disease and exposure: it is the ratio of the odds of disease among exposed individuals to the odds among unexposed individuals and, like the relative risk (RR), indicates how much more likely the disease is among the exposed; multi-factor logistic regression analysis is carried out on the significantly associated SNPs and SNP-SNP pairs to obtain the SNPs and SNP-SNP pairs that independently influence Alzheimer's disease, together with their OR values, 95% confidence intervals and the logistic regression constant term α; the weight β of each SNP and SNP-SNP pair is obtained by taking the natural logarithm of its OR value;
the step (4) specifically comprises the following steps:
1) in the multi-factor logistic regression analysis of the significantly associated SNPs and SNP-SNP pairs, the genotype data of each significantly associated SNP are represented by 0, 1 or 2, each significantly associated SNP-SNP pair is represented by the product of the genotype scores of its two SNPs, and each significantly associated SNP and SNP-SNP pair is treated as one variable; the multi-factor logistic regression yields, for each variable, the association level (p value), the odds ratio (OR) and the 95% confidence interval with respect to Alzheimer's disease, as well as the constant term α of the logistic regression; variables with p < 0.05 are considered to have an independent influence on Alzheimer's disease;
2) taking the natural logarithm of the OR value of each SNP and each SNP-SNP pair gives the weight β of that SNP or SNP-SNP pair, so that every SNP and SNP-SNP pair has its own weight β.
(5) Establishing an improved wGRS model using the SNPs and SNP-SNP pairs that independently influence Alzheimer's disease; each SNP and SNP-SNP pair is taken as a variable S and, using the weight β obtained for each SNP and SNP-SNP pair, the improved wGRS model is expressed as the sum of the products of each variable and its weight, namely

$$\mathrm{wGRS} = \sum_{i=1}^{n} \beta_i S_i$$

where n is the number of SNPs and SNP-SNP pairs, $\beta_i$ is the weight of the i-th variable, and $S_i$ is the i-th variable; the weight $\beta_i$ is the natural logarithm of the OR value of the corresponding SNP or SNP-SNP pair; when all SNPs and SNP-SNP pairs that independently influence Alzheimer's disease are included in the wGRS model, the Alzheimer's disease risk model is

$$\operatorname{logit} P(D = 1 \mid G) = \alpha + \mathrm{wGRS}$$

where D = 1 denotes that the person has Alzheimer's disease, G denotes the person's SNP genotype data, P(D = 1 | G) is the probability that the person has Alzheimer's disease as calculated from the person's SNP genotype data, and α is the constant term of the logistic regression;
(6) predicting risk of Alzheimer's disease;
to predict a person's risk of Alzheimer's disease, only that person's genotype data need to be measured; the risk of that person suffering from Alzheimer's disease is then calculated with the Alzheimer's disease risk model of step (5).
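Before the model of step (5) can be applied to an individual as in step (6), the measured genotypes have to be converted to the 0, 1, 2 scores of step (2). A minimal Python sketch of that conversion follows; it assumes genotypes are given as two-letter strings such as "AG" (optionally written "A/G") and that the high-risk allele of each SNP is already known. The function name and the example alleles are hypothetical.

```python
def score_genotype(genotype: str, risk_allele: str) -> int:
    """Count the copies (0, 1 or 2) of the high-risk allele in a diploid genotype."""
    alleles = genotype.upper().replace("/", "")
    if len(alleles) != 2:
        raise ValueError(f"expected two alleles, got {genotype!r}")
    return alleles.count(risk_allele.upper())


# Hypothetical SNP whose high-risk allele is "A".
for g in ("AA", "AG", "G/G"):
    print(g, score_genotype(g, "A"))
```

The resulting scores are the values $S_i$ of the single-SNP variables; the SNP-SNP pair variables are obtained from them by multiplication.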
The embodiment uses data downloaded from the following web page:
(http://journals.plos.org/plosone/article/asset?unique&id=info:doi/10.1371/journal.pone.0144898.s002). SNP genotype data for 229 Alzheimer's disease patients and 318 normal individuals from a Chinese population were downloaded, and one SNP that did not satisfy Hardy-Weinberg equilibrium was removed. All genotype data were converted to 0, 1 or 2 according to the number of high-risk alleles, and the SNPs significantly associated with Alzheimer's disease were sought by single-factor logistic regression analysis. Because the downloaded genotype data contain no information on age, sex and the like, the 13 SNPs that the original authors reported as significantly associated with Alzheimer's disease after adjusting for age, sex and the like were adopted directly. The detailed information is shown in Table 1:
Table 1. The 13 SNPs significantly associated with AD
The Lasso multiple regression (LMR) method was used to find SNP pairs significantly associated with Alzheimer's disease; the results show that rs6656401-rs3865444, rs28834970-rs6656401 and rs28834970-rs3865444 are significantly associated with AD (p < 0.05).
Multi-factor logistic regression was carried out on the 13 significantly associated SNPs and the 3 SNP pairs (without adjustment for age, sex and the like) to obtain the SNPs and SNP pairs that independently affect Alzheimer's disease (p < 0.05), together with their OR values and 95% confidence intervals; the corresponding weights β were obtained by taking the natural logarithm of the OR values. Table 2 lists the SNPs and SNP pairs that independently affect AD.
TABLE 2 SNPs and SNP pairs independently affecting AD
Thus, the improved wGRS was calculated using the SNPs and SNP pairs that independently affect Alzheimer's disease:

$$\mathrm{wGRS} = -0.456\,V_1 + 0.339\,V_2 - 0.464\,V_3 + 0.374\,V_4 - 0.754\,V_5 + 0.367\,V_6 + 0.667\,V_7 - 0.308\,V_8 - 0.398\,V_9 + 1.1\,V_{10}$$

and the Alzheimer's disease risk model is

$$\operatorname{logit} P(D = 1 \mid G) = 0.772 + \mathrm{wGRS}$$

where $V_1$ to $V_{10}$ denote the ten variables (SNPs and SNP pairs) that independently affect AD.
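As a worked illustration, the ten weights and the constant term quoted above can be applied to one individual in a few lines of Python. The weights and the constant 0.772 are copied from the text as printed; the vector of variable values is hypothetical, because the mapping of V1 to V10 onto specific SNPs and SNP pairs is given in Table 2, which is not reproduced here. The function names are likewise hypothetical.

```python
import math
from typing import Sequence

# Weights for V1..V10 and the constant term, as printed in the text.
WEIGHTS = [-0.456, 0.339, -0.464, 0.374, -0.754,
           0.367, 0.667, -0.308, -0.398, 1.1]
ALPHA = 0.772


def improved_wgrs(values: Sequence[float]) -> float:
    """Sum of beta_i * V_i over the ten variables."""
    if len(values) != len(WEIGHTS):
        raise ValueError("expected one value per weight")
    return sum(b * v for b, v in zip(WEIGHTS, values))


def predicted_probability(values: Sequence[float]) -> float:
    """P(D = 1 | G) obtained by inverting logit P = ALPHA + wGRS."""
    return 1.0 / (1.0 + math.exp(-(ALPHA + improved_wgrs(values))))


def classify(values: Sequence[float], cutoff: float = 0.5) -> int:
    """1 = predicted Alzheimer's disease, 0 = predicted normal control."""
    return 1 if predicted_probability(values) >= cutoff else 0


# Hypothetical individual (values of V1..V10 chosen only for illustration).
example = [1, 0, 2, 1, 0, 1, 0, 1, 0, 0]
print(improved_wgrs(example), predicted_probability(example), classify(example))
```

The cutoff of 0.5 matches the classification point used in the prediction tables of the embodiment.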
To test the predictive accuracy of this model, the improved wGRS was used to predict the original samples (229 Alzheimer's disease patients and 318 normal control individuals); the prediction results are shown in Table 3:
Table 3. Prediction results of the improved wGRS on the original samples (classification cutoff 0.5)
The corresponding ROC curve is shown in fig. 2.
The area under the ROC curve (AUC) was 0.721, with a 95% confidence interval of 0.679 to 0.764.
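For readers who wish to reproduce this kind of evaluation, the area under the ROC curve and the accuracy at a fixed cutoff can be computed from true labels and predicted probabilities as sketched below in Python. The sketch uses the rank-based (Mann-Whitney) definition of the AUC and does not compute the confidence interval; the function names and the four-sample data set are hypothetical and are not the patent's data.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC = probability that a random case scores higher than a random control."""
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5  # ties count as half
    return wins / (len(cases) * len(controls))


def accuracy(labels: Sequence[int], scores: Sequence[float],
             cutoff: float = 0.5) -> float:
    """Fraction of samples classified correctly at the given cutoff."""
    predictions = [1 if s >= cutoff else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Hypothetical toy data: 1 = patient, 0 = control.
labels = [1, 0, 1, 0]
scores = [0.9, 0.1, 0.4, 0.6]
print(roc_auc(labels, scores), accuracy(labels, scores))
```

The pairwise comparison is quadratic in the number of samples, which is unproblematic for a data set of a few hundred individuals.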
If the influence of the interaction between SNPs on the disease is not considered and the 13 significant SNPs are used directly to build a wGRS for predicting the original samples, the results are as shown in Table 4:
Table 4. Prediction results of the wGRS on the original samples (classification cutoff 0.5)
Therefore, SNPs and SNP pairs significantly associated with Alzheimer's disease were used as disease-affecting factors, and the SNPs and SNP pairs that independently affect Alzheimer's disease, together with their OR values, were obtained by multi-factor logistic regression. The accuracy of Alzheimer's disease risk prediction with the improved wGRS was 68.7%. The accuracy of prediction using only the SNPs significantly associated with Alzheimer's disease, without considering the interaction between SNPs, was 66.4%. The improved wGRS method provided by the invention takes full account of the influence of SNP interactions on the onset of Alzheimer's disease and raises the prediction accuracy by 2.3 percentage points (from 66.4% to 68.7%). If information such as age and sex were adjusted for in the multi-factor logistic regression used to obtain the SNPs and SNP pairs that independently affect Alzheimer's disease, the improved wGRS would be expected to predict Alzheimer's disease risk more accurately still.
In summary, the method for constructing the model for predicting the risk of developing alzheimer's disease provided by the embodiments of the present invention provides an improved wGRS method based on the existing wGRS, and not only the effect of a single SNP is considered in calculating the wGRS, but also the interaction between SNPs is considered. The improved wGRS method can further improve the accuracy of the prediction of the Alzheimer disease onset risk. Therefore, the method considers the important influence of the interaction between the SNPs on the Alzheimer's disease, applies the interaction between the SNPs to the prediction of the onset risk of the Alzheimer's disease, and further improves the accuracy of the prediction of the onset risk of the Alzheimer's disease.
The above disclosure is only for a few specific embodiments of the present invention, however, the present invention is not limited to the above embodiments, and any variations that can be made by those skilled in the art are intended to fall within the scope of the present invention.
Claims (2)
1. A method for constructing an Alzheimer disease onset risk prediction model is characterized by comprising the following steps:
(1) acquiring genotype data of an Alzheimer disease individual and a normal control individual;
for Alzheimer's disease, gene sequencing is first performed on the autosomes of a large number of Alzheimer's disease patients and normal individuals to obtain their original SNP genotype data; quality control is performed on the original SNP genotype data, removing SNPs whose minor allele frequency (MAF) is less than 0.02, SNPs that fail the Hardy-Weinberg equilibrium test, SNPs whose typing success rate is less than 75%, and SNPs located in a linkage disequilibrium region; for each sample, the typing success rate over all SNPs must be greater than 75%, otherwise the sample fails the per-sample genotype missing-rate control and is removed from the SNP genotype data; SNP genotype data meeting these conditions are retained for further analysis;
(2) after SNP genotype data which do not meet the control conditions are removed, scoring is carried out on the retained SNP genotype data; scoring the SNP genotype data by 0, 1 and 2 according to the number of high-risk alleles contained in the SNP genotype data, and expressing the corresponding SNP genotype data by adopting the scores of 0, 1 and 2;
for SNP genotype data, a genotype homozygous for the high-risk allele (two high-risk alleles) is scored 2, a heterozygous genotype (one high-risk allele) is scored 1, and a genotype homozygous for the low-risk allele (no high-risk allele) is scored 0;
(3) SNPs whose association with Alzheimer's disease reaches p < 0.05 are considered significantly associated with the disease; SNPs significantly associated with Alzheimer's disease, and SNP-SNP pairs significantly associated with the disease through interaction between the SNPs, are screened out;
1 denotes an Alzheimer's disease patient and 0 denotes a normal control; SNPs significantly associated with Alzheimer's disease after adjustment for age and sex are obtained by single-factor logistic regression, and SNP-SNP pairs significantly associated with Alzheimer's disease after Bonferroni correction are obtained by the Lasso multiple regression method;
the step (3) specifically comprises the following steps:
1) after the SNP genotype data are scored, the SNP genotype of each sample is represented by 0, 1 and 2; when carrying out single-factor logistic regression analysis, taking a single SNP as an independent variable, taking the diseased states 0 and 1 of a sample as a dependent variable, and taking the age and the sex as covariates; obtaining the association level, the ratio and the 95% confidence interval of the SNP and the Alzheimer disease; (ii) remaining if an SNP with an association level p <0.05 with Alzheimer's disease is considered to be significantly associated with the disease;
2) obtaining SNP-SNP pairs which are obviously related to Alzheimer disease after Bonferroni correction by utilizing a Lasso multiple regression method;
(4) obtaining the SNPs that independently affect Alzheimer's disease and the SNP-SNP pairs that independently affect the disease through the interaction between the two SNPs;
the odds ratio (OR) is an index of the strength of association between disease and exposure, indicating how many times the disease risk of an exposed person is that of a non-exposed person; multi-factor logistic regression is applied to the significantly associated SNPs and SNP-SNP pairs to obtain the SNPs and SNP-SNP pairs that independently affect Alzheimer's disease, together with their odds ratios (OR), their 95% confidence intervals and the logistic regression constant term α; the weight β of each SNP and SNP-SNP pair is obtained by taking the natural logarithm of its OR;
the step (4) specifically comprises the following steps:
1) in the multi-factor logistic regression analysis of the significantly associated SNPs and SNP-SNP pairs, each significantly associated SNP is represented by its genotype score 0, 1 or 2, each significantly associated SNP-SNP pair is represented by the product of the genotype scores of its two SNPs, and each such SNP or SNP-SNP pair is treated as one variable; the multi-factor logistic regression yields, for each variable, the association level (p value) with Alzheimer's disease, the odds ratio (OR) and the 95% confidence interval, as well as the constant term α of the logistic regression; variables with p < 0.05 are considered to have an independent effect on Alzheimer's disease;
2) the natural logarithm of the OR of each SNP and each SNP-SNP pair is taken to obtain its weight β, so that each SNP and each SNP-SNP pair has its own corresponding weight β;
(5) establishing an improved wGRS model using the SNPs and SNP-SNP pairs that independently affect Alzheimer's disease; each SNP and SNP-SNP pair is taken as a variable S, and, using the weight β obtained for each SNP and SNP-SNP pair, the improved wGRS model is expressed as the sum of the products of each variable and its weight, namely $\mathrm{wGRS} = \sum_{i=1}^{n} \beta_i S_i$, where n is the number of SNPs and SNP-SNP pairs, $\beta_i$ is the weight of the i-th variable, and $S_i$ is the i-th variable;
the weight $\beta_i$ is obtained by taking the natural logarithm of the OR of each SNP and SNP-SNP pair that independently affects Alzheimer's disease; when all such SNPs and SNP-SNP pairs are included in the wGRS model, the Alzheimer's disease risk model is $\operatorname{logit} P(D = 1 \mid G) = \alpha + \mathrm{wGRS}$, where D = 1 denotes that the person has Alzheimer's disease, G denotes the person's SNP genotype data, $P(D = 1 \mid G)$ is the probability, calculated from the person's SNP genotype data, that the person has Alzheimer's disease, and $\alpha$ is the constant term of the logistic regression; here $\mathrm{wGRS} = \sum_{i=1}^{n} \beta_i S_i$, where n is the number of SNPs and SNP-SNP pairs, $\beta_i$ is the weight of the i-th variable, and $S_i$ is the i-th variable.
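The scoring, weighting and risk steps of claim 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation. The function names, the allele letters, and the odds ratios and constant term in the usage lines are all hypothetical. The Wald form of the 95% confidence interval is also an assumption, since the claim does not say how the interval is computed. Only the 0/1/2 coding, the product coding of SNP-SNP pairs, the weight as the natural logarithm of the OR, and the relation logit P = alpha + wGRS are taken from the claim.

```python
import math

def score_genotype(genotype, risk_allele):
    """Count high-risk alleles in a two-allele genotype string: 0, 1 or 2."""
    if len(genotype) != 2:
        raise ValueError("expected a genotype with exactly two alleles")
    return sum(1 for allele in genotype if allele == risk_allele)

def weight_from_or(odds_ratio):
    """Weight beta is the natural logarithm of the odds ratio."""
    return math.log(odds_ratio)

def odds_ratio_ci(beta, std_err, z=1.96):
    """OR and an assumed Wald 95% interval from a logistic coefficient."""
    return (math.exp(beta),
            math.exp(beta - z * std_err),
            math.exp(beta + z * std_err))

def wgrs(variables, betas):
    """Improved wGRS: sum of beta_i * S_i over SNPs and SNP-SNP pairs."""
    if len(variables) != len(betas):
        raise ValueError("one weight is needed per variable")
    return sum(b * s for b, s in zip(betas, variables))

def risk_probability(alpha, score):
    """Invert logit P = alpha + wGRS to obtain the probability P."""
    return 1.0 / (1.0 + math.exp(-(alpha + score)))

# Hypothetical example: two SNPs and their interaction pair.
snp1 = score_genotype("AG", risk_allele="A")   # heterozygous, score 1
snp2 = score_genotype("TT", risk_allele="T")   # homozygous high-risk, score 2
pair = snp1 * snp2                             # SNP-SNP pair coded as a product
betas = [weight_from_or(o) for o in (1.5, 1.2, 1.1)]  # made-up odds ratios
score = wgrs([snp1, snp2, pair], betas)
prob = risk_probability(alpha=-2.0, score=score)      # made-up constant term
```

In practice the odds ratios and the constant term would come from the multi-factor logistic regression of step (4), fitted with a statistics package rather than entered by hand.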
2. The method for constructing a model for predicting the onset risk of Alzheimer's disease according to claim 1, wherein the quality control of the original SNP genotype data in step (1) comprises the following steps:
1) removing from the original SNP genotype data any SNP whose minor allele frequency (MAF) is less than 0.02;
2) removing SNPs that fail the Hardy-Weinberg equilibrium test;
3) requiring the typing success rate (call rate) of each SNP across all samples to be greater than 75%, and removing SNPs that do not meet this requirement;
4) for each sample in the genome-wide association analysis, requiring the typing success rate across all SNPs of that sample to be greater than 75%, and removing from the analysis data any sample that does not meet this genotype missing-rate requirement;
5) removing SNPs located in linkage disequilibrium regions; the remaining SNP genotype data are carried forward to subsequent analysis.
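The quality control filters of claim 2 can be sketched in Python as follows. This is a sketch under stated assumptions, not the patented procedure. The function names, the use of `None` for a failed genotype call, the chi-square form of the Hardy-Weinberg test and its cut-off `hwe_alpha=1e-4` are all hypothetical, since the claim does not specify them. The thresholds of 0.02 for MAF and 75% for typing success rate are taken from the claim. Linkage disequilibrium pruning (step 5) is omitted because it needs pairwise correlation between SNPs.

```python
import math

def minor_allele_frequency(scores):
    """MAF from 0/1/2 allele counts; None marks a failed genotype call."""
    called = [s for s in scores if s is not None]
    if not called:
        return 0.0
    freq = sum(called) / (2 * len(called))
    return min(freq, 1 - freq)

def call_rate(scores):
    """Fraction of successfully typed genotypes in a non-empty list."""
    return sum(s is not None for s in scores) / len(scores)

def hwe_p_value(scores):
    """Chi-square (1 d.f.) Hardy-Weinberg test p-value (assumed test form)."""
    called = [s for s in scores if s is not None]
    n = len(called)
    if n == 0:
        return 1.0
    p = sum(called) / (2 * n)
    if p in (0.0, 1.0):
        return 1.0  # monomorphic SNP: the test is undefined, so do not reject
    expected = [n * (1 - p) ** 2, 2 * n * p * (1 - p), n * p ** 2]
    observed = [called.count(g) for g in (0, 1, 2)]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of chi-square with 1 d.f. is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

def passes_snp_qc(scores, maf_min=0.02, call_min=0.75, hwe_alpha=1e-4):
    """Apply steps 1) to 3); hwe_alpha is an assumed cut-off."""
    return (call_rate(scores) > call_min
            and minor_allele_frequency(scores) >= maf_min
            and hwe_p_value(scores) >= hwe_alpha)

def filter_samples(matrix, call_min=0.75):
    """Step 4): keep samples (rows) whose call rate exceeds call_min."""
    return [row for row in matrix if call_rate(row) > call_min]

# Hypothetical SNPs, each typed in 100 samples.
good = [0] * 49 + [1] * 42 + [2] * 9                # MAF 0.3, in equilibrium
rare = [0] * 99 + [1]                               # MAF 0.005, below 0.02
poorly_typed = [None] * 30 + [0] * 35 + [1] * 30 + [2] * 5  # call rate 0.70
kept = [name for name, snp in
        (("good", good), ("rare", rare), ("poorly_typed", poorly_typed))
        if passes_snp_qc(snp)]
```

With these made-up inputs only the SNP labelled `good` is retained. The other two fail the MAF and typing success rate filters respectively.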
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611190992.4A CN106636398B (en) | 2016-12-21 | 2016-12-21 | Construction method of Alzheimer disease onset risk prediction model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106636398A (en) | 2017-05-10 |
CN106636398B (en) | 2021-01-29 |
Family
ID=58834537
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611190992.4A Active CN106636398B (en) | 2016-12-21 | 2016-12-21 | Construction method of Alzheimer disease onset risk prediction model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106636398B (en) |
Families Citing this family (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109280695A (en) * | 2017-07-20 | 2019-01-29 | 浙江金华中科分数生命科技有限公司 | Utilize the polygenes score analysis method of human-body biological Samples Estimates complex disease onset risk |
CN108172296A (en) * | 2018-01-23 | 2018-06-15 | 上海其明信息技术有限公司 | A kind of method for building up of database and the Risk Forecast Method of genetic disease |
CN108256293A (en) * | 2018-02-09 | 2018-07-06 | 哈尔滨工业大学深圳研究生院 | A kind of statistical method and system of the disease association assortment of genes |
CN108897985A (en) * | 2018-05-04 | 2018-11-27 | 上海市内分泌代谢病研究所 | A kind of method and its application of Glycohemoglobin HbA1c genetic locus scoring |
CN108913776B (en) * | 2018-08-14 | 2023-03-17 | 天佳吉瑞基因科技有限公司 | Screening method and kit for DNA molecular markers related to radiotherapy and chemotherapy injury |
CN109712716B (en) * | 2018-12-25 | 2021-08-31 | 广州医科大学附属第一医院 | Disease influence factor determination method, system and computer equipment |
CN109468376A (en) * | 2018-12-29 | 2019-03-15 | 青海省人民医院 | Acute and chronic altitude sickness tumor susceptibility gene early warning detection kit |
CN110349623A (en) * | 2019-01-17 | 2019-10-18 | 哈尔滨工业大学 | Based on the senile dementia ospc gene and site selection method for improving Mendelian randomization |
US11621087B2 (en) * | 2019-09-24 | 2023-04-04 | International Business Machines Corporation | Machine learning for amyloid and tau pathology prediction |
CN111180012A (en) * | 2019-12-27 | 2020-05-19 | 哈尔滨工业大学 | Gene identification method based on empirical Bayes and Mendelian randomized fusion |
CN112280863B (en) * | 2020-11-06 | 2024-01-12 | 南京普恩瑞生物科技有限公司 | Method and kit for targeting drug apatinib effectiveness |
CN112489801A (en) * | 2020-12-04 | 2021-03-12 | 北京睿思昆宁科技有限公司 | Method, device and equipment for determining disease risk |
CN113160887B (en) * | 2021-04-23 | 2022-06-14 | 哈尔滨工业大学 | Screening method of tumor neoantigen fused with single cell TCR sequencing data |
CN113506631A (en) * | 2021-08-06 | 2021-10-15 | 中国医学科学院基础医学研究所 | Risk prediction method for improving diagnosis accuracy of chronic obstructive pulmonary acute exacerbation state |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103154272A (en) * | 2010-08-25 | 2013-06-12 | 香港中文大学 | Methods and kits for predicting the risk of diabetes associated complications using genetic markers and arrays |
WO2016061246A1 (en) * | 2014-10-14 | 2016-04-21 | Wake Forest University Health Sciences | Methods and compositions for correlating genetic markers with cancer risk |
Non-Patent Citations (3)
Title |
---|
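Evaluation of genetic risk score models in the presence of interaction and linkage disequilibrium; Ronglin Che et al.; 《ORIGINAL RESEARCH ARTICLE》; 20130723; vol. 4; pp. 1-10 * |
Methods and strategies for genetic risk prediction using lung cancer GWAS data; Duan Weiwei et al.; Chinese Journal of Health Statistics (《中国卫生统计》); 20150825; vol. 32, no. 04; pp. 554-557 * |
Comparison of type 2 diabetes onset risk prediction models based on environmental and genetic risk; Zhang Liuwei et al.; Chinese Journal of Prevention and Control of Chronic Diseases (《中国慢性病预防与控制》); 20160215; vol. 24, no. 02; pp. 84-88 * |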
Also Published As
Publication number | Publication date |
---|---|
CN106636398A (en) | 2017-05-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106636398B (en) | Construction method of Alzheimer disease onset risk prediction model | |
Okada et al. | Deep whole-genome sequencing reveals recent selection signatures linked to evolution and disease risk of Japanese | |
Guo et al. | Global genetic differentiation of complex traits shaped by natural selection in humans | |
Lin et al. | DNA methylation levels at individual age-associated CpG sites can be indicative for life expectancy | |
Tan et al. | Twin methodology in epigenetic studies | |
US20200027557A1 (en) | Multimodal modeling systems and methods for predicting and managing dementia risk for individuals | |
Kim et al. | Quantitative measures of healthy aging and biological age | |
WO2008067551A2 (en) | Genetic analysis systems and methods | |
Liu et al. | Identification of genetic and epigenetic marks involved in population structure | |
Wu et al. | A novel method for identifying nonlinear gene–environment interactions in case–control association studies | |
Timmins et al. | Genome-wide association study of self-reported walking pace suggests beneficial effects of brisk walking on health and survival | |
Li et al. | ATOM: a powerful gene-based association test by combining optimally weighted markers | |
Branch et al. | The genetic basis of spatial cognitive variation in a food-caching bird | |
Iwasaki et al. | Inclusion of a genetic risk score into a validated risk prediction model for colorectal cancer in Japanese men improves performance | |
US20220367063A1 (en) | Polygenic risk score for in vitro fertilization | |
Zhu et al. | Detection of copy number variation and selection signatures on the X chromosome in Chinese indigenous sheep with different types of tail | |
Varón-González et al. | Epistasis regulates the developmental stability of the mouse craniofacial shape | |
CN116486913B (en) | System, apparatus and medium for de novo predictive regulatory mutations based on single cell sequencing | |
Cummings et al. | Genome-wide scan identifies a quantitative trait locus at 4p15. 3 for serum urate | |
Knutson et al. | MATS: a novel multi-ancestry transcriptome-wide association study to account for heterogeneity in the effects of cis-regulated gene expression on complex traits | |
Nustad et al. | Modeling dependency structures in 450k DNA methylation data | |
Xu et al. | The interplay between host genetics and the gut microbiome reveals common and distinct microbiome features for human complex diseases | |
Tournoud et al. | A strategy to build and validate a prognostic biomarker model based on RT-qPCR gene expression and clinical covariates | |
Boomsma | Twin, association and current “omics” studies | |
Yan et al. | GWAS-based machine learning for prediction of age-related macular degeneration Risk |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||