CN113506594A - Construction method, device and application of multi-gene genetic risk comprehensive score of coronary heart disease - Google Patents


Info

Publication number
CN113506594A
Authority
CN
China
Prior art keywords
heart disease
coronary heart
sub
phenotype
snp
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110579230.8A
Other languages
Chinese (zh)
Other versions
CN113506594B (en)
Inventor
顾东风
鲁向锋
黄建凤
王来元
陈恕凤
刘钟应
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuwai Hospital of CAMS and PUMC
Original Assignee
Fuwai Hospital of CAMS and PUMC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuwai Hospital of CAMS and PUMC
Priority to CN202110579230.8A (CN113506594B)
Publication of CN113506594A
Priority to PCT/CN2022/095221 (WO2022247903A1)
Application granted
Publication of CN113506594B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/40 Population genetics; Linkage disequilibrium
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biotechnology (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Genetics & Genomics (AREA)
  • Ecology (AREA)
  • Molecular Biology (AREA)
  • Physiology (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention provides a method and a device for constructing a polygenic genetic risk comprehensive score (metaPRS) for coronary heart disease, and applications thereof. The construction method comprises the following steps: screening a set of SNPs associated with coronary heart disease and/or with coronary heart disease-related phenotypes; genotyping the selected SNPs in each individual; extracting, from genome-wide association study (GWAS) results, the risk alleles, effect values and P values of the genotyped SNPs for a plurality of sub-phenotypes, constructing a plurality of candidate sub-phenotype PRSs and selecting the optimal PRS for each sub-phenotype; determining a weight for each sub-phenotype PRS; converting the weights of the sub-phenotype PRSs into SNP-level weights; and constructing the coronary heart disease polygenic genetic risk comprehensive score, metaPRS. The invention is of significance for predicting the risk of coronary heart disease onset and for refined risk stratification.

Description

Construction method, device and application of multi-gene genetic risk comprehensive score of coronary heart disease
Technical Field
The invention relates to a method and a device for constructing a polygenic genetic risk comprehensive score (metaPRS) of coronary heart disease and application thereof.
Background
The development of cardiovascular disease (CVD) is influenced by a combination of genetic and environmental factors.
In the primary prevention of cardiovascular disease, risk prediction and assessment play a crucial role. Genetic factors, as stable and quantifiable lifelong markers, have long been expected to be useful in disease risk assessment and to promote precision prevention of cardiovascular disease. Over the last 10 years, genome-wide association studies (GWAS) have identified hundreds of regions significantly associated with coronary heart disease and with coronary heart disease-related phenotypes (blood lipid levels, blood pressure, type 2 diabetes, and BMI). Recently, polygenic risk scores (PRS) for coronary heart disease, which integrate information from multiple genetic variants, have been developed and assessed for their clinical utility in predicting coronary heart disease risk (Eur. Heart J. 37, 561-567 (2016); Nat. Genet. 50, 1219-1224 (2018); J. Am. Coll. Cardiol. 72, 1883-1893 (2018); Eur. Heart J. 37, 3267-3278 (2016); JAMA 323, 627-635 (2020); JAMA 323, 645 (2020); JAMA Cardiol. 3, 693-702 (2018); N. Engl. J. Med. 375, 2349-8 (2016)). However, almost all of these genetic scores were constructed in European populations, and differences in allele frequencies and in linkage disequilibrium patterns among populations mean that scores derived in European populations cannot be applied directly to East Asian and Chinese populations. In addition, heterogeneity can arise from differences in lifestyle, other risk factors, and potential gene-environment interactions among populations. Studies have reported that the predictive performance of these genetic scores declines significantly in other ethnic groups.
Therefore, there is an urgent need to develop genetic risk scores for East Asian populations, particularly the Chinese population.
Disclosure of Invention
The invention aims to provide a method for constructing a polygenic genetic risk score for coronary heart disease.
The invention also aims to provide a device for constructing the polygenic genetic risk score for coronary heart disease.
Specifically, in one aspect, the invention provides a method for constructing a polygenic genetic risk comprehensive score for coronary heart disease, the method comprising the following steps:
(1) screening a set of single nucleotide polymorphism (SNP) sites associated, at genome-wide significance, with coronary heart disease or with coronary heart disease-related phenotypes; wherein the coronary heart disease-related phenotypes comprise: blood pressure, type 2 diabetes, blood lipids, obesity, and stroke;
(2) genotyping the SNP sites of step (1);
(3) extracting, from genome-wide association study (GWAS) results, the risk alleles, effect values and P values of the genotyped SNPs for a plurality of sub-phenotypes, constructing a plurality of candidate sub-phenotype PRSs, and selecting the optimal PRS for each sub-phenotype;
(4) determining a weight for each sub-phenotype PRS;
(5) converting the weights of the sub-phenotype PRSs into SNP-level weights;
(6) constructing the coronary heart disease polygenic genetic risk comprehensive score, metaPRS.
According to a specific embodiment of the invention, in the method for constructing the polygenic genetic risk score for coronary heart disease, the coronary heart disease-related phenotype blood pressure comprises: systolic blood pressure, diastolic blood pressure, pulse pressure, mean arterial pressure, and hypertension; the coronary heart disease-related phenotype obesity comprises: body mass index, waist circumference, and waist-to-hip ratio; and the coronary heart disease-related phenotype blood lipids comprises: total cholesterol, low density lipoprotein cholesterol, triglycerides, and high density lipoprotein cholesterol.
According to a specific embodiment of the invention, in the method for constructing the polygenic genetic risk score for coronary heart disease, the plurality of sub-phenotypes comprises: coronary heart disease, body mass index, blood pressure, type 2 diabetes, total cholesterol, low density lipoprotein cholesterol, triglycerides, high density lipoprotein cholesterol, and stroke. That is, the plurality of candidate sub-phenotype PRSs constructed in the method are PRSs for: coronary heart disease, stroke, type 2 diabetes, blood pressure, body mass index, total cholesterol, low density lipoprotein cholesterol, triglycerides, and high density lipoprotein cholesterol.
According to a specific embodiment of the invention, in the method for constructing the polygenic genetic risk score for coronary heart disease, the set of SNP sites consists of sites that genome-wide association studies have found to be associated, at genome-wide significance, with coronary heart disease or with a coronary heart disease-related phenotype (a coronary heart disease-related risk factor). Specifically, the set comprises SNP sites associated with coronary heart disease or stroke and SNP sites associated with blood pressure, type 2 diabetes, blood lipids, and obesity; SNP sites associated with clinical phenotypes of arteriosclerosis may optionally be further included. According to a specific embodiment of the invention, the polygenic genetic risk score for coronary heart disease is used for assessing the risk of coronary heart disease in East Asian populations, and the SNP sites in the set may come from all populations, for example European and East Asian populations, wherein the SNP sites associated with clinical phenotypes such as blood pressure, type 2 diabetes, blood lipids, obesity, and arteriosclerosis may come mainly from East Asian populations.
According to a specific embodiment of the invention, in the method for constructing the polygenic genetic risk score for coronary heart disease, the cohort population that is genotyped is an East Asian population.
According to a specific embodiment of the invention, in the method for constructing the polygenic genetic risk score for coronary heart disease, the genotyping is carried out using multiplex polymerase chain reaction (PCR) targeted amplicon sequencing. The median sequencing depth was 982×.
According to a specific embodiment of the invention, in the method for constructing the polygenic genetic risk score for coronary heart disease, SNPs with a genotype call rate lower than 95% can be excluded during genotyping to obtain a set of qualified SNPs.
According to a specific embodiment of the invention, in the method for constructing the polygenic genetic risk score for coronary heart disease, the risk alleles, effect values and P values of the genotyped SNPs for the plurality of sub-phenotypes are extracted from the results of large-scale genome-wide association studies in East Asian populations.
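The 95% filter described above can be sketched as follows. This is a minimal illustration, assuming a small in-memory genotype table in which a failed call is stored as `None`; the function names, SNP labels and data layout are hypothetical and are not taken from the patent.

```python
def call_rate(genotypes, snp_index):
    """Fraction of individuals with a non-missing genotype at one SNP."""
    calls = [row[snp_index] for row in genotypes]
    return sum(call is not None for call in calls) / len(calls)


def qualified_snps(genotypes, snp_ids, min_call_rate=0.95):
    """Keep SNPs whose call rate is not lower than min_call_rate."""
    return [
        snp_id
        for index, snp_id in enumerate(snp_ids)
        if call_rate(genotypes, index) >= min_call_rate
    ]


# Hypothetical data: 20 individuals (rows) x 2 SNPs (columns); entries are
# risk-allele counts, or None when the genotype call failed.
snp_ids = ["snp_a", "snp_b"]
genotypes = [[1, 2] for _ in range(18)] + [[0, None], [2, None]]

kept = qualified_snps(genotypes, snp_ids)
print(kept)
```

Here `snp_b` is missing in 2 of 20 individuals (call rate 90%), so only `snp_a` is retained.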
According to a specific embodiment of the invention, in the method for constructing the polygenic genetic risk score for coronary heart disease, the process of constructing each sub-phenotype PRS comprises:
dividing the SNPs into multiple groups according to the extracted P values and, for each group, pruning the SNPs at r^2 < 0.2 with the clumping command of the PLINK software, based on the cohort population data, to obtain multiple SNP combinations;
using the genotype data, weighting each individual's SNP risk allele counts (0, 1, or 2) by the corresponding effect values and summing them to construct a plurality of candidate PRSs that incorporate different SNP combinations; assessing the association of these candidate PRSs with coronary heart disease with a logistic regression model; and selecting the score with the largest odds ratio (OR) per one-standard-deviation increase in PRS as the optimal sub-phenotype PRS.
According to a more specific embodiment of the invention, in the above process of constructing each sub-phenotype PRS, the SNPs can be divided into N groups according to the extracted P values, where N is greater than or equal to 2. For example, 9, 10, 11 or 12 groups can be selected from the P-value thresholds 0.5, 0.4, 0.3, 0.2, 0.1, 0.05, 0.01, 10^-3, 10^-4, 10^-5, 10^-6 and 10^-7.
According to a more specific embodiment of the invention, in the above process of constructing each sub-phenotype PRS, when the SNPs are divided into N groups according to the extracted P values, pruning at linkage disequilibrium r^2 < 0.2 yields N SNP combinations, i.e., N candidate PRSs incorporating different SNP combinations can be constructed.
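The scoring and selection steps above can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the summary statistics and genotypes are invented, SNPs are selected with `P < threshold`, LD pruning is skipped, and the logistic regression is unadjusted (the text adjusts for age and sex). All function and variable names are hypothetical.

```python
import math


def prs(allele_counts, effects):
    """Sum of risk-allele counts (0, 1 or 2) weighted by SNP effect values."""
    return sum(effects[snp] * allele_counts[snp] for snp in effects)


def standardize(values):
    """Rescale values to mean 0 and (sample) standard deviation 1."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
    return [(v - mean) / sd for v in values]


def odds_ratio_per_sd(scores, is_case, iterations=25):
    """OR per 1-SD increase from an unadjusted logistic regression,
    fitted by Newton's method on the standardized score."""
    z = standardize(scores)
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(z, is_case):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return math.exp(b1)


# Hypothetical GWAS summary statistics: SNP -> (effect value, P value).
gwas = {"snp1": (0.20, 1e-8), "snp2": (0.10, 0.03), "snp3": (0.05, 0.4)}

# Hypothetical training set: risk-allele counts and case status (1 = case).
people = [
    ({"snp1": 0, "snp2": 0, "snp3": 1}, 0),
    ({"snp1": 0, "snp2": 1, "snp3": 0}, 0),
    ({"snp1": 1, "snp2": 0, "snp3": 2}, 0),
    ({"snp1": 1, "snp2": 1, "snp3": 1}, 1),
    ({"snp1": 1, "snp2": 2, "snp3": 0}, 0),
    ({"snp1": 2, "snp2": 1, "snp3": 1}, 1),
    ({"snp1": 2, "snp2": 0, "snp3": 2}, 1),
    ({"snp1": 0, "snp2": 2, "snp3": 1}, 1),
]
is_case = [case for _, case in people]
thresholds = [0.5, 0.05, 1e-5]

odds_ratios = {}
for threshold in thresholds:
    effects = {snp: beta for snp, (beta, p) in gwas.items() if p < threshold}
    scores = [prs(counts, effects) for counts, _ in people]
    odds_ratios[threshold] = odds_ratio_per_sd(scores, is_case)

# The candidate with the largest OR per SD is kept as the optimal PRS.
best_threshold = max(odds_ratios, key=odds_ratios.get)
print(best_threshold, odds_ratios[best_threshold])
```

In a real analysis the regression would include the covariates named in the text, and each candidate SNP set would first be pruned for linkage disequilibrium.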
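How N P-value thresholds lead to N pruned SNP combinations can be sketched with a simplified greedy clumping routine. This is an assumption-laden stand-in for the PLINK clumping step, not a reimplementation of it: PLINK additionally restricts comparisons to a physical distance window, which is omitted here, and the P values and pairwise r^2 values below are invented.

```python
def clump(p_values, r2, p_threshold, r2_threshold=0.2):
    """Greedy clumping: visit SNPs from most to least significant and keep a
    SNP only if its r^2 with every SNP already kept is below r2_threshold."""
    candidates = sorted(
        (snp for snp, p in p_values.items() if p < p_threshold),
        key=p_values.get,
    )
    kept = []
    for snp in candidates:
        if all(r2.get(frozenset((snp, other)), 0.0) < r2_threshold for other in kept):
            kept.append(snp)
    return kept


# Hypothetical summary statistics and pairwise LD (unlisted pairs: r^2 = 0).
p_values = {"snp1": 1e-9, "snp2": 1e-4, "snp3": 0.03, "snp4": 0.3}
r2 = {
    frozenset(("snp1", "snp2")): 0.8,
    frozenset(("snp3", "snp4")): 0.1,
}

thresholds = [0.5, 0.05, 1e-3]
snp_sets = {t: clump(p_values, r2, t) for t in thresholds}
for t, snps in snp_sets.items():
    print(t, snps)
```

Each threshold yields one SNP combination, so N thresholds give N candidate PRSs; `snp2` is always dropped because it is in strong LD with the more significant `snp1`.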
In the invention, the pairwise correlation coefficients (r) and P values between all sub-phenotype PRSs can further be calculated by Pearson correlation analysis.
According to a specific embodiment of the invention, in the method for constructing the polygenic genetic risk score for coronary heart disease, a portion of the overall cohort population can be selected in a predetermined proportion as a training set (the remainder can serve as a validation set). The construction of the sub-phenotype PRSs and the determination of the weight of each sub-phenotype PRS are carried out in the training set.
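A minimal sketch of the pairwise Pearson calculation is shown below. The PRS values are invented and the names are hypothetical; the P value for each coefficient is omitted because it needs a t-distribution routine outside the standard library.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical sub-phenotype PRS values for four individuals.
prs_by_phenotype = {
    "CAD": [1.0, 2.0, 3.0, 4.0],
    "LDL-C": [1.0, 3.0, 2.0, 4.0],
    "BMI": [4.0, 3.0, 2.0, 1.0],
}

names = list(prs_by_phenotype)
pairwise_r = {
    (a, b): pearson_r(prs_by_phenotype[a], prs_by_phenotype[b])
    for i, a in enumerate(names)
    for b in names[i + 1:]
}
for pair, r in pairwise_r.items():
    print(pair, round(r, 3))
```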
According to a specific embodiment of the invention, in the method for constructing the polygenic genetic risk score for coronary heart disease, the process of determining the weight of each sub-phenotype PRS comprises:
converting each sub-phenotype PRS into a standardized score with a mean of 0 and a standard deviation of 1;
using the training set, entering the standardized PRS of each sub-phenotype, together with the covariates to be adjusted for (age and sex), into an elastic-net logistic regression model, selecting the model with the highest AUC as the final model, and taking the coefficients of the PRSs in the final model (β1, …, βn; n PRSs in total) as the weights.
In some embodiments of the invention, an elastic-net logistic regression model, which accounts for the correlation between the individual sub-phenotype PRSs, is used to evaluate the association of the 9 (i.e., n = 9) sub-phenotype PRSs with coronary heart disease, and the OR values estimated by elastic-net logistic regression are compared with those estimated by univariate logistic regression. Further, the invention constructs and validates the coronary heart disease metaPRS by integrating the 9 sub-phenotype PRSs and converting the weights of the sub-phenotype PRSs into SNP-level weights.
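The weighting step can be illustrated with a small elastic-net logistic regression fitted by proximal gradient descent. This is a sketch under several assumptions that are not in the source: the penalty is parameterized as `lam * (l1_ratio * |b|_1 + (1 - l1_ratio) / 2 * |b|_2^2)`, the covariates (age and sex) are left out, the AUC is computed on the same toy data used for fitting, and only two invented sub-phenotype PRSs are used. All names are hypothetical.

```python
import math


def standardize(values):
    """Rescale values to mean 0 and (sample) standard deviation 1."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
    return [(v - mean) / sd for v in values]


def soft_threshold(value, amount):
    """Proximal operator of the L1 penalty."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0


def fit_elastic_net_logistic(rows, labels, lam, l1_ratio=0.5, lr=0.1, epochs=2000):
    """Minimize mean log-loss plus the elastic-net penalty on the
    coefficients; the intercept is not penalized."""
    n, p = len(rows), len(rows[0])
    intercept, coef = 0.0, [0.0] * p
    for _ in range(epochs):
        grad0, grad = 0.0, [0.0] * p
        for row, y in zip(rows, labels):
            linear = intercept + sum(b * x for b, x in zip(coef, row))
            resid = 1.0 / (1.0 + math.exp(-linear)) - y
            grad0 += resid / n
            for j, x in enumerate(row):
                grad[j] += resid * x / n
        intercept -= lr * grad0
        coef = [
            soft_threshold(b - lr * (g + lam * (1 - l1_ratio) * b), lr * lam * l1_ratio)
            for b, g in zip(coef, grad)
        ]
    return intercept, coef


def auc(scores, labels):
    """Probability that a random case outranks a random control (ties count 1/2)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


# Hypothetical training data: two sub-phenotype PRSs for eight individuals.
prs_cad = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
prs_bp = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
rows = list(zip(standardize(prs_cad), standardize(prs_bp)))

fits = {}
for lam in [10.0, 0.01]:  # hypothetical penalty grid
    intercept, coef = fit_elastic_net_logistic(rows, labels, lam)
    linear = [intercept + sum(b * x for b, x in zip(coef, row)) for row in rows]
    fits[lam] = (auc(linear, labels), coef)

best_lam = max(fits, key=lambda lam: fits[lam][0])
weights = fits[best_lam][1]  # one coefficient (weight) per sub-phenotype PRS
print(best_lam, weights)
```

With the heavy penalty every coefficient is shrunk to zero and the model cannot discriminate, so the lightly penalized model is selected by AUC and its coefficients serve as the PRS weights.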
According to the specific embodiment of the invention, in the method for constructing the multi-gene genetic risk score of coronary heart disease, the process of converting the weight of the sub-phenotype PRS into the weight of the SNP level is carried out according to the following model:
Figure BDA0003085403870000041
wherein σ1, …, σn are the standard deviations of the n sub-phenotype PRSs in the training set, and αj1, …, αjn are the effect values of the j-th SNP for each sub-phenotype; if the j-th SNP is not included in the k-th score, its effect value αjk is set to 0.
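The formula image is not reproduced in this text. One form consistent with the definitions above follows from expanding a weighted sum of standardized sub-phenotype PRSs; it is given here as a hedged reconstruction rather than as the patent's exact expression. The symbols μ_k (training-set mean of the k-th PRS) and N_j (risk-allele count of the j-th SNP) are assumed notation.

```latex
% Assumed notation (not in the source): \mu_k is the training-set mean of the
% k-th sub-phenotype PRS; N_j is the risk-allele count of the j-th SNP.
\begin{aligned}
\mathrm{PRS}_k &= \sum_j \alpha_{jk} N_j, \\
\sum_{k=1}^{n} \beta_k \,\frac{\mathrm{PRS}_k - \mu_k}{\sigma_k}
  &= \sum_j \Bigl( \sum_{k=1}^{n} \frac{\beta_k}{\sigma_k}\,\alpha_{jk} \Bigr) N_j
     \;-\; \sum_{k=1}^{n} \frac{\beta_k \mu_k}{\sigma_k}, \\
\beta_{\mathrm{snp}\_j} &= \frac{\beta_1}{\sigma_1}\,\alpha_{j1} + \dots + \frac{\beta_n}{\sigma_n}\,\alpha_{jn}.
\end{aligned}
```

The last term of the second line is the same constant for every individual, so it does not affect how individuals are ranked by the score.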
According to the specific embodiment of the invention, in the method for constructing the coronary heart disease polygene genetic risk score, the constructed coronary heart disease polygene genetic risk comprehensive score metaPRS is as follows:
metaPRS = Σ_i β_snp_i × N_i
wherein β_snp_i is the effect value of the i-th SNP, and N_i is the number of effect alleles of the i-th SNP carried by the individual.
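The conversion to SNP-level weights and the final sum can be sketched together. The sketch assumes that the SNP-level weight of a SNP is the sum over sub-phenotypes of (β_k / σ_k) × α_k, which is what a weighted sum of standardized PRSs implies; the numbers and names are invented for illustration.

```python
def snp_level_weights(alpha, beta, sigma):
    """alpha[snp] lists the SNP's effect value in each of the n sub-phenotype
    PRSs (0.0 when the SNP is not part of that score)."""
    return {
        snp: sum(b / s * a for b, s, a in zip(beta, sigma, effects))
        for snp, effects in alpha.items()
    }


def meta_prs(allele_counts, weights):
    """metaPRS = sum over SNPs of SNP-level weight x effect-allele count."""
    return sum(weights[snp] * count for snp, count in allele_counts.items())


# Hypothetical inputs for n = 2 sub-phenotype PRSs.
beta = [0.4, 0.1]    # PRS weights from the regression model
sigma = [0.5, 0.2]   # training-set standard deviation of each PRS
alpha = {"snp1": [0.2, 0.0], "snp2": [0.1, 0.05]}

weights = snp_level_weights(alpha, beta, sigma)
person = {"snp1": 2, "snp2": 1}
score = meta_prs(person, weights)
print(weights, score)
```

As a consistency check, the same individual has sub-phenotype PRSs of 0.5 and 0.05, and 0.4 / 0.5 × 0.5 + 0.1 / 0.2 × 0.05 gives the same value as the SNP-level sum.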
According to the specific embodiment of the invention, the method for constructing the multi-gene genetic risk comprehensive score of coronary heart disease can further comprise the process of evaluating the effect of the constructed metaPRS on the risk prediction and stratification of coronary heart disease.
According to a specific embodiment of the invention, in the method for constructing the polygenic genetic risk score for coronary heart disease, preferably, the 20th and 80th percentiles of the metaPRS of all individuals in the cohort population are used as cut points to divide individuals into low, intermediate and high genetic risk groups for coronary heart disease onset.
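The percentile-based grouping can be sketched as follows. The percentile interpolation method and the handling of values exactly at a cut point are assumptions, since the text does not specify them; the scores are invented.

```python
import statistics


def risk_groups(meta_prs_values):
    """Assign low / intermediate / high genetic risk using the 20th and 80th
    percentiles of the cohort's metaPRS distribution as cut points."""
    cuts = statistics.quantiles(meta_prs_values, n=5, method="inclusive")
    p20, p80 = cuts[0], cuts[3]
    groups = []
    for value in meta_prs_values:
        if value < p20:
            groups.append("low")
        elif value > p80:
            groups.append("high")
        else:
            groups.append("intermediate")
    return p20, p80, groups


# Hypothetical metaPRS values for eleven individuals.
scores = [float(v) for v in range(11)]
p20, p80, groups = risk_groups(scores)
print(p20, p80, groups)
```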
In another aspect, the present invention further provides a device for constructing a multiple gene genetic risk composite score of coronary heart disease, the device comprising:
a genotyping module for genotyping;
a sub-phenotype PRS construction module for extracting, from genome-wide association study (GWAS) results, the risk alleles, effect values and P values of the genotyped SNPs for a plurality of sub-phenotypes, constructing candidate sub-phenotype PRSs and selecting the optimal sub-phenotype PRSs;
a model training module for determining a weight for each sub-phenotypic PRS in a training set;
a metaPRS construction module for converting the weight of the sub-phenotypic PRS into the weight of the SNP level and constructing a coronary heart disease polygenic genetic risk composite score (metaPRS).
According to the specific embodiment of the invention, the device for constructing the multi-gene genetic risk comprehensive score of coronary heart disease can also optionally comprise an SNP screening module for screening a set of Single Nucleotide Polymorphism (SNP) sites related to coronary heart disease or related to coronary heart disease-related phenotype.
According to a specific embodiment of the invention, in the device for constructing the polygenic genetic risk comprehensive score for coronary heart disease, the genotyping module can also be used for excluding SNPs with a genotype call rate lower than 95% after genotyping.
According to the specific embodiment of the invention, in the device for constructing the multi-gene genetic risk comprehensive score of coronary heart disease, optionally, the metaPRS construction module can be further used for evaluating the effect of the constructed metaPRS on risk prediction and stratification of coronary heart disease.
In another aspect, the invention further provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, evaluates an individual's risk of coronary heart disease onset using the polygenic genetic risk comprehensive score for coronary heart disease constructed by the method of the invention.
In an embodiment of the invention, in order to accurately estimate the effect values of the association between genetic variants and the risk of coronary artery disease (CAD) in the East Asian population, the invention carried out a genome-wide association study in 51,531 coronary heart disease cases and 215,934 controls. Genetic information for 9 phenotypes (coronary heart disease and its related phenotypes) was then integrated to construct the polygenic genetic risk score in 2800 coronary heart disease cases and 2055 healthy controls, and the score was finally validated and evaluated in a prospective cohort of 41,271 Chinese individuals. The constructed polygenic genetic risk score has good predictive value for the occurrence of coronary heart disease. Individuals at high genetic risk (top 20% of genetic risk) had a risk of developing coronary heart disease about 3 times that of individuals at low genetic risk (bottom 20% of genetic risk) (HR: 2.93, 95% CI: 2.44-3.51), and the final risk of coronary heart disease in the two groups was 16.0% and 5.8%, respectively. The predictive effect was similar in men and women. The study demonstrates that the polygenic genetic risk comprehensive score enables refined stratification of coronary heart disease risk, and that the method has important application prospects in constructing the polygenic genetic risk comprehensive score for coronary heart disease and in the primary prevention of coronary heart disease.
Drawings
FIG. 1 is the study flow chart of the invention. PRS, polygenic risk score.
FIG. 2 shows the sequencing depth of the 588 successfully genotyped variant sites.
FIG. 3 shows the association with coronary heart disease, in the training set, of PRSs weighted by effect values from East Asian and European GWAS. Odds ratios (OR) and 95% confidence intervals (CI) were calculated using logistic regression models adjusted for age and sex. Scores were calculated using the effect values from coronary heart disease GWAS data of the East Asian population and of the European UK Biobank, respectively, as SNP weights. Different P-value thresholds (0.5, 0.4, 0.3, 0.2, 0.1, 0.05, 0.01, 10^-3, 10^-4, 10^-5, 10^-6, 10^-7) were set to construct 12 PRSs containing different SNP combinations (linkage disequilibrium r^2 < 0.2).
FIG. 4 shows the association of sub-phenotype PRSs (per one-standard-deviation increase) with CAD in the training set at different P-value thresholds. Odds ratios (OR) and 95% confidence intervals (CI) were calculated using logistic regression adjusted for age and sex.
FIG. 5 is a correlation plot of the sub-phenotype PRSs and metaPRS in the prospective cohort. *P<0.05, **P<10^-3, ***P<10^-10.
FIG. 6 shows the association of sub-phenotype polygenic risk scores (per one-standard-deviation increase) with coronary heart disease in the training set. Odds ratios (OR) and 95% confidence intervals (CI) were calculated using logistic regression and elastic-net logistic regression, respectively, adjusted for age and sex.
FIG. 7 shows the hazard ratios of metaPRS and the sub-phenotype PRSs (per one-standard-deviation increase) for CAD onset in the prospective cohort. Analyses used a Cox model with age as the time scale, adjusted for cohort source and sex.
FIG. 8 shows the relative and absolute risk of coronary heart disease onset in different genetic risk groups (<20%, 20%-80%, >80%). HRs and 95% CIs for the genetic risk groups and the cumulative incidence of coronary heart disease were estimated using a Cox model that accounts for competing risks, with age as the time scale and adjustment for sex and cohort source. The dashed lines represent the 95% CI. CAD, coronary heart disease; HR, hazard ratio; CI, confidence interval.
FIG. 9 shows the relative and absolute risk of coronary heart disease onset in different genetic risk groups (<20%, 20%-80%, >80%), stratified by sex. HRs and 95% CIs for the genetic risk groups and the cumulative incidence of coronary heart disease were estimated using a Cox model that accounts for competing risks, with age as the time scale and adjustment for sex and cohort source. The dashed lines represent the 95% CI. CAD, coronary heart disease; HR, hazard ratio; CI, confidence interval.
Detailed Description
For a more clear understanding of the technical features, objects and advantages of the present invention, reference is now made to the following detailed description taken in conjunction with the accompanying specific embodiments, and the technical solutions of the present invention are described, with the understanding that these examples are provided for the purpose of illustration only and are not intended to limit the scope of the present invention. Various changes and/or modifications within the spirit of the invention, which are readily contemplated by those skilled in the art, are deemed to be within the scope of the invention. In the examples, each raw reagent material is commercially available, and the experimental method without specifying the specific conditions is a conventional method and a conventional condition well known in the art, or a condition recommended by an instrument manufacturer.
Example 1
Research design process and research population
The study design flow is shown in FIG. 1. The inventors developed a polygenic risk score (PRS) for CAD in 2800 CAD patients and 2055 healthy controls (Table 1) and then validated it in a large-scale prospective cohort. CAD cases in the training set came from Fuwai Hospital, Chinese Academy of Medical Sciences. The diagnosis of myocardial infarction (MI) strictly followed diagnostic criteria based on signs, symptoms, electrocardiogram and cardiac enzyme activity. Coronary heart disease was diagnosed on the basis of a previously diagnosed myocardial infarction, stenosis of more than 50% of the left main coronary artery, or stenosis of more than 70% of at least one major epicardial vessel.
The validation cohort comprised three sub-cohorts of the China-PAR study: the International Collaborative Study of Cardiovascular Disease in Asia (InterASIA), the China Multi-Center Collaborative Study of Cardiovascular Epidemiology (China MUCA-1998), and the Community Intervention of Metabolic Syndrome in China and Chinese Family Health Study (CIMIC) (Yang, X. et al. Predicting the 10-Year Risks of Atherosclerotic Cardiovascular Disease in Chinese Population: The China-PAR Project (Prediction for ASCVD Risk in China). Circulation 134, 1430-1440 (2016)). Briefly, the China MUCA-1998, InterASIA and CIMIC baselines were established in 1998, 2000-. According to a unified protocol, InterASIA and China MUCA-1998 were first followed up in 2007-2008, and all three cohorts were followed up uniformly in 2012-2015 and 2018-2020. In this study, blood samples and primary covariate data were collected from a total of 43,582 participants independent of the training set. After excluding 561 individuals with a high genotype missing rate (>5.0%) or low mean sequencing depth (<30×), 1352 individuals aged <30 or >75 years at baseline, and 398 individuals diagnosed with coronary heart disease at baseline, a total of 41,271 participants were finally enrolled.
All studies were approved by the ethics review committee of Fuwai Hospital, Chinese Academy of Medical Sciences. Each participant signed an informed consent form prior to data collection.
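The participant flow above can be checked with a few lines of arithmetic. The dictionary labels are paraphrases of the exclusion criteria, and the variable names are illustrative only.

```python
collected = 43_582  # participants with blood samples and covariate data

exclusions = {
    "genotype missing rate > 5.0% or mean sequencing depth < 30x": 561,
    "age < 30 or > 75 years at baseline": 1_352,
    "coronary heart disease diagnosed at baseline": 398,
}

enrolled = collected - sum(exclusions.values())
print(enrolled)
```

The result matches the 41,271 participants reported as finally enrolled.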
TABLE 1 training set general information
Figure BDA0003085403870000071
Figure BDA0003085403870000081
The values are mean (SD) or N (%).
Data collection and risk factor definition
Important information during baseline and follow-up visits was collected by trained investigators under strict quality control. Standard questionnaires were used to collect personal information (gender, date of birth, etc.), lifestyle information (eating habits, physical activity, etc.), disease history and CAD family history. Participants also received physical examinations (weight, height, blood pressure, etc.) and provided fasting blood samples for measurement of blood lipid and blood glucose levels.
To obtain information on disease outcomes and mortality during follow-up, researchers followed up participants or their proxies and collected the participants' medical records (or evidence of death). Two committee members blinded to the baseline information independently verified each event. Discrepancies were resolved by discussion with additional committee members until consensus was reached. Incident coronary heart disease was defined as the first occurrence of unstable angina, non-fatal acute myocardial infarction, or coronary death. Fatal events caused by myocardial infarction or other coronary artery disease were defined as coronary heart disease deaths. Follow-up time in years was the interval between the baseline date and the date of coronary heart disease occurrence, death, or last visit.
Genetic variation site selection and genotyping
The invention first selected 600 genetic variant sites found in genome-wide association studies to be associated at genome-wide significance (P < 5×10^-8) with coronary heart disease (n = 212) or with coronary heart disease-related risk factors, including stroke (n = 42), blood pressure (n = 56), blood lipids (n = 130), T2D (n = 90), and obesity (n = 79) (Table 2). All genetic variant site information is provided in Table 3. In short, for coronary heart disease the invention selected all genetic variant sites reported in East Asian and European populations; for the other risk factors, it focused mainly on genetic variant sites reported in East Asian populations.
Training set samples were genotyped using Infinium Multi-Ethnic Genotyping Array (MEGA) chips to obtain genetic variant information at the selected sites. In the cohort population, samples were genotyped using multiplex PCR targeted amplicon sequencing. Multiplex primers were designed for each variant using routine procedures in the field, and the amplified target regions were sequenced on an Illumina HiSeq X Ten sequencer. After removal of 12 variant sites that had a call rate <95% or were missing from the training data set, 588 variants or their proxy sites were successfully genotyped, with an average call rate of 99.9% and a median sequencing depth of 982× (FIG. 2). To evaluate the reproducibility of genotyping, 1648 samples were genotyped repeatedly, and the concordance rate exceeded 99.4%.
TABLE 2 sources of selected genetic variations in this study
Figure BDA0003085403870000091
CAD, coronary heart disease; SBP, systolic blood pressure; DBP, diastolic pressure; PP, pulse pressure; MAP, mean arterial pressure; HTN, hypertension; T2D, type 2 diabetes; BMI, body mass index; WC, waist circumference; WHR, waist-hip ratio; TC, total cholesterol; LDL-C, low density lipoprotein cholesterol; TG, triglycerides; HDL-C, high density lipoprotein cholesterol.
Construction of MetaPRS
(1) Extracting SNP effect values from GWAS result data, and calculating PRS of each sub-phenotype
According to the invention, genetic scores for 9 CAD-related phenotypes are constructed from effect values of large-scale genome-wide association studies in East Asian populations. To estimate accurately the CAD effect values of the selected variants in East Asians, the invention carries out a genome-wide association study of coronary heart disease in the East Asian population with a total sample size of 267,465 (51,531 patients with coronary heart disease and 215,934 controls without coronary heart disease). For the other 8 phenotypes (stroke, type 2 diabetes, blood pressure, body mass index, total cholesterol, LDL cholesterol, triglycerides and HDL cholesterol), the invention obtains the risk allele, effect value and P value of each locus for each sub-phenotype from large published genome-wide association studies in East Asian populations. A detailed list of the selected studies is shown in Table 3.
TABLE 3 sources of summarized data for multigene risk score calculation
Figure BDA0003085403870000101
GWAS, genome-wide association study; EWAS, exome-wide association study; BP, blood pressure; CAD, coronary heart disease; T2D, type 2 diabetes; BMI, body mass index; TC, total cholesterol; LDL-C, low density lipoprotein cholesterol; TG, triglycerides; HDL-C, high density lipoprotein cholesterol.
Taking the sub-phenotype CAD as an example, the invention integrates large-scale coronary heart disease case-control genomic data from East Asian and Chinese populations and carries out a genome-wide association study of coronary heart disease in 51,531 coronary heart disease patients and 215,934 controls without coronary heart disease. The association results of the different sub-cohorts are combined by meta-analysis with a fixed-effect model, yielding the risk allele, effect value and P value of each detected SNP. Based on the extracted P values, 12 sets of SNPs are screened at the thresholds 0.5, 0.4, 0.3, 0.2, 0.1, 0.05, 0.01, 10^-3, 10^-4, 10^-5, 10^-6 and 10^-7. Each set is then pruned at linkage disequilibrium r^2 < 0.2 with the clumping command of plink software (version 1.9), based on cohort population data, finally giving 12 SNP combinations. Using training set genotype data, each individual's SNP risk allele counts (0, 1 or 2) are weighted by the corresponding effect values and summed to construct 12 candidate PRSs containing different SNP combinations. The association of each candidate PRS with coronary heart disease is evaluated with a logistic regression model, and the score with the largest odds ratio (OR) per one standard deviation increase in PRS is selected as the best PRS for coronary heart disease. For the other 8 phenotypes, SNP effect values are obtained from the literature for the corresponding phenotypes listed in Table 3, and the other 8 sub-phenotype PRSs are constructed by the same procedure. The SNP sites and effect values used by the best sub-phenotype PRSs are shown in Table 4.
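The threshold screening and best-score selection can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: LD clumping is assumed to have been done already, SNPs are kept when P is strictly below the threshold, the logistic regression is a hand-written one-predictor Newton fit without covariates, and all function names, SNP IDs and toy values are hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import Dict, List, Optional, Tuple

# P-value thresholds listed in the text.
THRESHOLDS = [0.5, 0.4, 0.3, 0.2, 0.1, 0.05, 0.01, 1e-3, 1e-4, 1e-5, 1e-6, 1e-7]


def prs(counts: Dict[str, int], effects: Dict[str, float]) -> float:
    """Weighted sum of risk allele counts (0, 1 or 2) over the SNPs in effects."""
    return sum(beta * counts.get(snp, 0) for snp, beta in effects.items())


def candidate_sets(summary: Dict[str, Tuple[float, float]]) -> Dict[float, Dict[str, float]]:
    """summary maps SNP -> (effect value, P value); keep SNPs with P < threshold (assumed strict)."""
    return {t: {s: b for s, (b, p) in summary.items() if p < t} for t in THRESHOLDS}


def or_per_sd(scores: List[float], cases: List[int], iters: int = 50) -> float:
    """Odds ratio per 1-SD increase from a one-predictor logistic fit (Newton's method)."""
    mu, sd = mean(scores), pstdev(scores)
    if sd == 0:
        return 1.0  # a constant score carries no information
    z = [(s - mu) / sd for s in scores]
    b0 = b1 = 0.0
    for _ in range(iters):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * x))) for x in z]
        g0 = sum(y - pi for y, pi in zip(cases, p))            # gradient, intercept
        g1 = sum((y - pi) * x for y, pi, x in zip(cases, p, z))  # gradient, slope
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * x for wi, x in zip(w, z))
        h11 = sum(wi * x * x for wi, x in zip(w, z))
        det = h00 * h11 - h01 * h01
        if det == 0:
            break
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return math.exp(b1)


def best_threshold(genos: List[Dict[str, int]], cases: List[int],
                   summary: Dict[str, Tuple[float, float]]) -> Optional[Tuple[float, float]]:
    """Return (threshold, OR per SD) of the candidate PRS with the largest OR."""
    best = None
    for t, effects in candidate_sets(summary).items():
        if not effects:
            continue
        odds = or_per_sd([prs(g, effects) for g in genos], cases)
        if best is None or odds > best[1]:
            best = (t, odds)
    return best


# Toy data: snp1 tracks case status, snp2 cancels it out when included.
rows = [(2, 1)] * 3 + [(2, 0)] + [(1, 1)] * 2 + [(1, 0)] * 2 + [(0, 1)] + [(0, 0)] * 3
genos = [{"snp1": d, "snp2": 2 - d} for d, _ in rows]
cases = [y for _, y in rows]
summary = {"snp1": (0.5, 1e-8), "snp2": (0.5, 0.3)}
best = best_threshold(genos, cases, summary)
print(best)
```

In the toy data the two loosest thresholds include both SNPs and give a constant score, so the first threshold that drops snp2 is selected. Ties between thresholds are resolved in favour of the first one encountered, which is an arbitrary choice of this sketch.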
(2) Calculating weights for individual sub-phenotypic PRSs in a training set
The 9 sub-phenotype PRSs are each converted to a score with a mean of 0 and a standard deviation of 1. Using the training set, the 9 standardized sub-phenotype PRSs and the covariates to be adjusted for (age, sex) are entered together into an elastic net logistic regression model (cv.glmnet function, R package "glmnet"). The model evaluates a series of different penalties by 10-fold cross-validation (alpha set to 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 or 1.0) with the parameter type.measure set to "auc", and the model with the highest AUC (area under the receiver operating characteristic curve) is selected automatically as the final model. The coefficients of the PRSs in this model (β1, …, β9) are used as weights. Table 5 provides the weight of each sub-phenotype PRS; the weights for TG, HDL-C and LDL-C are 0.
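The standardization step can be sketched as below; the elastic net fit itself is not reproduced. The sketch assumes the sample standard deviation (n − 1 denominator), which the text does not specify, and the sub-phenotype names and raw scores are hypothetical toy values. The training-set standard deviation of each score is kept because step (3) needs it.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple


def standardize(scores: List[float]) -> Tuple[List[float], float, float]:
    """Return z-scores plus the training-set mean and SD used to compute them."""
    mu, sigma = mean(scores), stdev(scores)  # sample SD is an assumption
    return [(s - mu) / sigma for s in scores], mu, sigma


# Hypothetical raw PRS values for two sub-phenotypes in a tiny training set.
raw = {
    "CAD": [1.0, 2.0, 3.0, 4.0, 5.0],
    "BP": [10.0, 10.5, 9.5, 11.0, 9.0],
}
z: Dict[str, List[float]] = {}
sigmas: Dict[str, float] = {}
for name, values in raw.items():
    z[name], _, sigmas[name] = standardize(values)

print({name: round(sd, 4) for name, sd in sigmas.items()})
```

The standardized columns, together with age and sex, would then be passed to an elastic net logistic regression such as the cv.glmnet call named in the text.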
(3) Conversion of weight of sub-phenotypic PRS into weight of SNP level
Figure BDA0003085403870000111
The formula above converts PRS-level weights to SNP-level weights, where σ1, …, σ9 are the standard deviations of the sub-phenotype PRSs in the training set and αj1, …, αj9 are the effect values of the jth SNP for each sub-phenotype. If a SNP is not included in the kth score, its effect value αjk is set to 0.
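The formula itself survives only as an image reference, so the sketch below rests on an assumption: that it is the weighted sum implied by the definitions above, βsnp_j = (β1/σ1)·αj1 + … + (β9/σ9)·αj9, where βk is the elastic net coefficient of the kth sub-phenotype PRS from step (2). The function name and the toy numbers are hypothetical.

```python
from typing import Dict, List


def snp_level_weights(
    beta: List[float],              # coefficient of each sub-phenotype PRS (step 2)
    sigma: List[float],             # training-set SD of each sub-phenotype PRS
    alpha: Dict[str, List[float]],  # SNP -> effect value in each score (0 if not included)
) -> Dict[str, float]:
    """Assumed form: beta_snp_j = sum over k of (beta_k / sigma_k) * alpha_jk."""
    scale = [b / s for b, s in zip(beta, sigma)]
    return {
        snp: sum(c * a for c, a in zip(scale, effects))
        for snp, effects in alpha.items()
    }


# Toy example with two sub-phenotype scores and two SNPs.
beta = [0.5, 0.25]
sigma = [2.0, 0.5]
alpha = {
    "snp1": [0.2, 0.0],  # used only by the first score
    "snp2": [0.1, 0.4],  # used by both scores
}
weights = snp_level_weights(beta, sigma, alpha)
print(weights)
```

A SNP that appears only in sub-phenotype scores whose weight is 0 ends up with a SNP-level weight of 0 under this form.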
(4) Calculating metaPRS
The metaPRS of an individual is calculated as metaPRS = ∑ βsnp_i × Ni, where βsnp_i is the effect value of the ith SNP (i.e. the SNP-level weight obtained in step 3) and Ni is the number of effect alleles of the ith SNP carried by the individual.
After these steps, a total of 510 SNPs had a non-zero final weight and were included in the metaPRS calculation; information and weights for all eligible SNPs are provided in Table 4.
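The score calculation in this step is a weighted sum, sketched below. The SNP IDs, weights and allele counts are hypothetical toy values, and treating a missing genotype as 0 effect alleles is an assumption of this sketch rather than something stated in the text.

```python
from typing import Dict


def meta_prs(weights: Dict[str, float], allele_counts: Dict[str, int]) -> float:
    """metaPRS = sum over SNPs of (SNP-level weight x number of effect alleles)."""
    # A SNP absent from allele_counts contributes 0 (assumption).
    return sum(w * allele_counts.get(snp, 0) for snp, w in weights.items())


# Hypothetical SNP-level weights; snp3 has weight 0 and drops out of the score.
weights = {"snp1": 0.05, "snp2": 0.225, "snp3": 0.0}
nonzero = {snp: w for snp, w in weights.items() if w != 0}

individual = {"snp1": 2, "snp2": 1, "snp3": 2}
score = meta_prs(nonzero, individual)
print(round(score, 3))
```

Dropping zero-weight SNPs before scoring leaves the result unchanged, which mirrors the statement that only SNPs with a non-zero final weight enter the calculation.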
(5) MetaPRS cut-point partitioning
The 20th and 80th percentiles of the metaPRS of all individuals in the cohort population are used as cut points to divide individuals into low, medium and high genetic risk groups for coronary heart disease.
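The grouping can be sketched as follows. The percentile method (Python's "inclusive" quantile interpolation), the handling of scores exactly on a cut point (assigned to the medium group) and the toy cohort are assumptions; the text fixes only the 20% and 80% cut points and the three group names.

```python
from statistics import quantiles
from typing import List, Tuple


def cut_points(cohort_scores: List[float]) -> Tuple[float, float]:
    """20th and 80th percentiles of the cohort metaPRS distribution."""
    q = quantiles(cohort_scores, n=5, method="inclusive")  # 20th, 40th, 60th, 80th
    return q[0], q[3]


def risk_group(score: float, low_cut: float, high_cut: float) -> str:
    """Below the 20th percentile: low; above the 80th: high; otherwise medium."""
    if score < low_cut:
        return "low"
    if score > high_cut:
        return "high"
    return "medium"


# Toy cohort: scores 0, 1, ..., 100.
cohort = [float(i) for i in range(101)]
low_cut, high_cut = cut_points(cohort)
groups = [risk_group(s, low_cut, high_cut) for s in cohort]
print(low_cut, high_cut, groups.count("low"), groups.count("medium"), groups.count("high"))
```

New individuals would be classified against the cut points derived from the cohort, not against cut points recomputed from their own sample.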
TABLE 4 information and weights of SNPs determined by the invention
Figure BDA0003085403870000121
Figure BDA0003085403870000131
Figure BDA0003085403870000141
Figure BDA0003085403870000151
Figure BDA0003085403870000161
Figure BDA0003085403870000171
Figure BDA0003085403870000181
Figure BDA0003085403870000191
Figure BDA0003085403870000201
Figure BDA0003085403870000211
Figure BDA0003085403870000221
TABLE 5 Weight of each sub-phenotype PRS in the coronary heart disease polygenic genetic risk composite score (metaPRS)
Sub-phenotype | PRS weight
Coronary heart disease | 0.452
Blood pressure | 0.074
Body mass index | 0.072
Diabetes mellitus | 0.064
Total cholesterol | 0.038
Stroke | 0.004
Low density lipoprotein cholesterol | 0
High density lipoprotein cholesterol | 0
Triglycerides | 0
Statistical analysis
For continuous variables, population characteristics are described as mean (standard deviation); for categorical variables, as number (percentage). The polygenic genetic scores were divided into three groups (low, medium and high genetic risk) at the < 20%, 20%-80% and > 80% quantiles. Hazard ratios (HRs) and their 95% confidence intervals (CIs) for coronary heart disease events in the different genetic risk groups were estimated with a Cox proportional hazards regression model adjusted for age and sex, accounting for cohort source and for the competing risk of non-coronary death. The lifetime risk (to age 80) of coronary heart disease in the different genetic risk groups was assessed with a Cox proportional hazards regression model using age as the time scale. The analysis used the 'survfit.coxph' function in the R package survival. All P values reported in this study were uncorrected, and a two-sided P value < 0.05 was considered statistically significant. Statistical analysis was performed in R software (R Foundation for Statistical Computing, Vienna, Austria, version 3.5.0) or SAS statistical software (SAS Institute Inc, Cary, NC, version 9.4).
Baseline information for the prospective cohort
Table 6 shows baseline information for the 41,271 subjects in the cohort population. The mean age at baseline was 52.3 years (standard deviation, 10.6 years), and 42.5% were male. The current smoking rate was higher in men than in women. During 534,701 person-years of follow-up (mean 13.0 years), a total of 1303 coronary heart disease events occurred.
TABLE 6 Baseline information for the prospective cohort
Figure BDA0003085403870000231
The values are mean (SD) or N (%). CAD, coronary heart disease.
Prediction of coronary heart disease by polygenic genetic risk scoring
The invention first set 12 thresholds (0.5, 0.4, 0.3, 0.2, 0.1, 0.05, 0.01, 10^-3, 10^-4, 10^-5, 10^-6, 10^-7) on the P values of the East Asian coronary heart disease GWAS to screen 12 different SNP combinations. PRSs for coronary heart disease were then calculated in the training set using GWAS results from the European population as SNP effect values, and their strength of association with coronary heart disease was evaluated. As shown in Fig. 3, for all 12 PRSs containing different SNP combinations, the OR (95% CI) per SD increase was significantly lower with European effect values than with effect values from the East Asian coronary heart disease GWAS. The study therefore used GWAS effect values from the East Asian population to construct the PRS of each sub-phenotype. The strength of association of each candidate sub-phenotype PRS with coronary heart disease in the training set is shown in Fig. 4, and the score with the largest OR was selected as the final PRS for that sub-phenotype.
The 9 sub-phenotype PRSs were correlated with one another to varying degrees (Fig. 5). Their association with coronary heart disease was further evaluated with an elastic net logistic regression model, which accounts for the correlation between the individual sub-phenotype PRSs. The ORs estimated by elastic net logistic regression are compared with those estimated by univariate logistic regression in Fig. 6 (the weights of LDL-C, TG and HDL-C are 0 in Fig. 6). Finally, the coronary heart disease metaPRS was constructed by integrating the 9 sub-phenotype PRSs and was validated in the cohort population.
Compared with the sub-phenotype PRSs, metaPRS showed the strongest association with coronary heart disease risk (Fig. 7): the HR for coronary heart disease per one standard deviation increase in metaPRS was 1.44 (95% CI: 1.36-1.52; P = 2.84×10^-39). The association of metaPRS with coronary heart disease was independent of dyslipidemia, hypertension, BMI, diabetes, smoking status and family history of coronary heart disease (Table 7).
TABLE 7 Hazard ratio of metaPRS for coronary heart disease events after adjustment for coronary heart disease risk factors (per one standard deviation increase in metaPRS)
Model | HR (95% CI) | P value
metaPRS | 1.44 (1.36, 1.52) | 2.84×10^-39
metaPRS + dyslipidemia | 1.42 (1.34, 1.50) | 2.54×10^-35
metaPRS + hypertension | 1.41 (1.34, 1.49) | 2.78×10^-35
metaPRS + diabetes mellitus | 1.43 (1.36, 1.51) | 1.33×10^-37
metaPRS + body mass index | 1.42 (1.35, 1.50) | 1.74×10^-36
metaPRS + smoking | 1.44 (1.36, 1.52) | 4.55×10^-39
metaPRS + CAD family history | 1.44 (1.36, 1.52) | 9.52×10^-39
metaPRS + 6 common CAD risk factors | 1.39 (1.32, 1.47) | 2.75×10^-31
CAD, coronary heart disease; PRS, polygenic risk score; HR, hazard ratio; CI, confidence interval.
metaPRS was grouped at the 20% and 80% quantiles. Compared with individuals at low genetic risk (bottom 20% of metaPRS), individuals at high genetic risk (top 20% of metaPRS) had about 3 times the risk of coronary heart disease events (HR 2.93, 95% CI: 2.44-3.51) (Fig. 8). The cumulative risk of developing coronary heart disease before age 80 was 5.8% in the low-risk group and 16.0% in the high-risk group. Analysis stratified by sex gave similar results (Fig. 9).

Claims (10)

1. A method for constructing a multi-gene genetic risk comprehensive score of coronary heart disease comprises the following steps:
(1) screening a set of Single Nucleotide Polymorphism Sites (SNPs) associated with coronary heart disease and/or associated with a phenotype associated with coronary heart disease; wherein the coronary heart disease associated phenotype comprises: blood pressure, type 2 diabetes, blood lipids, obesity, and stroke;
(2) genotyping based on the single nucleotide polymorphic sites in step (1);
(3) extracting, from genome-wide association study results, the risk alleles, effect values and P values of the detected SNPs for a plurality of sub-phenotypes, constructing a plurality of candidate sub-phenotype PRSs and screening the optimal sub-phenotype PRSs;
(4) determining a weight for each sub-phenotypic PRS;
(5) converting the weight of the sub-phenotypic PRS into a weight at the SNP level;
(6) constructing a coronary heart disease polygene genetic risk comprehensive score metaPRS.
2. The method of claim 1, wherein coronary heart disease-associated phenotype blood pressure comprises: systolic, diastolic, pulse, mean arterial, and hypertension; coronary heart disease associated phenotypic obesity including body mass index, waist circumference and waist-hip ratio; coronary heart disease-associated phenotypic lipids including total cholesterol, low density lipoprotein cholesterol, triglycerides and high density lipoprotein cholesterol;
preferably, the plurality of sub-phenotypes comprises: coronary heart disease, body mass index, blood pressure, type 2 diabetes, total cholesterol, low density lipoprotein cholesterol, triglycerides, high density lipoprotein cholesterol, and stroke.
3. The method according to claim 2, wherein the multiple genetic risk composite score for coronary heart disease is used for assessing the risk of coronary heart disease in the east asian population, and the single nucleotide polymorphism sites found to have significant genome-wide association with coronary heart disease or a coronary heart disease-related phenotype in the genome-wide association study are included in the set of single nucleotide polymorphism sites.
4. The method according to claim 1 or 3, wherein in step (2), the cohort population for genotyping is an east Asian population; preferably, genotyping is performed using multiplex polymerase chain reaction targeted amplicon sequencing techniques.
5. The method according to claim 1, wherein in step (3), the process of constructing each candidate sub-phenotypic PRS comprises:
dividing the SNPs into multiple sets according to the extracted P values and, for each set, pruning at r^2 < 0.2 with the plink software clumping command based on cohort population data to obtain multiple SNP combinations;
using genotype data, weighting each individual's SNP risk allele counts (0, 1 or 2) by the corresponding effect values and summing them to construct a plurality of candidate PRSs containing different SNP combinations, evaluating the association of the candidate PRSs with coronary heart disease with a logistic regression model, and selecting the score with the largest odds ratio (OR) per one standard deviation increase in PRS as the best sub-phenotype PRS.
6. The method according to claim 1, wherein in step (4), the process of determining the weight of each sub-phenotypic PRS comprises:
converting each sub-phenotype PRS into a standardized score with a mean value of 0 and a standard deviation of 1;
using the training set, entering the standardized PRS of each sub-phenotype together with the covariates to be adjusted for (age and sex) into an elastic net logistic regression model, selecting the model with the highest AUC as the final model, and taking the coefficients of the PRSs in the final model (β1, …, βn) as weights.
7. The method according to claim 1, wherein the process of converting the weight of the sub-phenotypic PRS into the weight of the SNP level in step (5) is performed according to the following model:
Figure FDA0003085403860000021
wherein σ1, …, σn are the standard deviations of the sub-phenotype PRSs in the training set, and αj1, …, αjn are the effect values of the jth SNP for each sub-phenotype; if a SNP is not included in the kth score, its effect value αjk is set to 0.
8. The method according to claim 1, wherein in step (6), the constructed coronary heart disease polygenic genetic risk composite score metaPRS is:
metaPRS=∑βsnp_i×Ni
wherein βsnp_i is the effect value of the ith SNP, and Ni is the number of effect alleles of the ith SNP carried by the individual;
preferably, 20% and 80% percentiles of metaPRS of all individuals in the cohort group are used as cut points to divide the individual coronary heart disease genetic morbidity risk into low, medium and high risk groups.
9. An apparatus for constructing a multi-gene genetic risk composite score for coronary heart disease, the apparatus comprising:
a genotyping module for genotyping;
a sub-phenotype PRS construction module for extracting, from genome-wide association study results, the risk alleles, effect values and P values of the detected SNPs for a plurality of sub-phenotypes, constructing candidate sub-phenotype PRSs and screening the optimal sub-phenotype PRSs;
a model training module for determining a weight for each sub-phenotypic PRS in a training set;
the metaPRS construction module is used for converting the weight of the sub-phenotype PRS into the weight of the SNP level and constructing the multi-gene genetic risk comprehensive score metaPRS of the coronary heart disease; optionally, the metaPRS construction module is further used for evaluating the effect of the constructed metaPRS on the prediction and stratification of coronary heart disease onset risk.
10. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to perform the method of any one of claims 1 to 8 to evaluate the risk of coronary heart disease in an individual.
CN202110579230.8A 2021-05-26 2021-05-26 Construction method, device and application of polygene genetic risk comprehensive score of coronary heart disease Active CN113506594B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110579230.8A CN113506594B (en) 2021-05-26 2021-05-26 Construction method, device and application of polygene genetic risk comprehensive score of coronary heart disease
PCT/CN2022/095221 WO2022247903A1 (en) 2021-05-26 2022-05-26 Polygenic risk score for coronary heart disease, construction method therefor, and application thereof in combination with clinical risk assessment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110579230.8A CN113506594B (en) 2021-05-26 2021-05-26 Construction method, device and application of polygene genetic risk comprehensive score of coronary heart disease

Publications (2)

Publication Number Publication Date
CN113506594A true CN113506594A (en) 2021-10-15
CN113506594B CN113506594B (en) 2023-02-03

Family

ID=78008724

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110579230.8A Active CN113506594B (en) 2021-05-26 2021-05-26 Construction method, device and application of polygene genetic risk comprehensive score of coronary heart disease

Country Status (1)

Country Link
CN (1) CN113506594B (en)

Also Published As

Publication number Publication date
CN113506594B (en) 2023-02-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant