CN109585017B - Risk prediction algorithm model and device for age-related macular degeneration - Google Patents

Risk prediction algorithm model and device for age-related macular degeneration

Info

Publication number
CN109585017B
Authority
CN
China
Prior art keywords
amd
mutation
chromosome position
risk
chromosome
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910101067.7A
Other languages
Chinese (zh)
Other versions
CN109585017A (en)
Inventor
王丽君
高军晖
袁卫兰
龚建兵
刘慧敏
林灵
许骋
张英霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Biotecan Medical Diagnostics Co ltd
Shanghai Biotecan Biology Medicine Technology Co ltd
Original Assignee
Shanghai Biotecan Medical Diagnostics Co ltd
Shanghai Biotecan Biology Medicine Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Biotecan Medical Diagnostics Co ltd and Shanghai Biotecan Biology Medicine Technology Co ltd
Priority to CN201910101067.7A
Publication of CN109585017A
Application granted
Publication of CN109585017B

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G16H50/70: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Investigating Or Analysing Biological Materials (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The application provides a risk prediction algorithm model and a device for age-related macular degeneration (AMD). Specifically, the application genotypes 7 related single nucleotide polymorphisms (SNPs), converts the genotypes into odds ratio (OR) values, combines them with 7 items of clinical information, and constructs a risk prediction algorithm model and device using a machine learning method. The application can assist clinicians in predicting AMD in advance and diagnosing it early, and is of great clinical significance for reducing the incidence of AMD and improving the treatment rate of the disease.

Description

Risk prediction algorithm model and device for age-related macular degeneration
Technical Field
The application relates to the field of medical biological detection, in particular to a risk prediction algorithm model and a risk prediction device for age-related macular degeneration (AMD).
Background
Age-related macular degeneration (AMD) is a major cause of blindness in the elderly. The disease has a complex etiology related to age, sex, smoking, race, heredity and other factors; it causes irreversible vision loss, and no effective treatment is currently available. AMD is common: meta-analysis results show that the overall incidence of AMD worldwide is 8.01%, and that the incidence in European, African and Asian populations is 11.2%, 7.1% and 6.8%, respectively. Among the elderly in China, the incidence of early AMD and advanced AMD is 4.7%-9.2% and 0.2%-1.9%, respectively. The number of AMD patients worldwide was predicted to reach 196 million by 2020 and 288 million by 2040. As population aging accelerates in China, AMD shows a marked upward trend.
AMD results from a combination of environmental and genetic factors, with genetic factors accounting for a high proportion, 45-70%, of the risk of developing the disease. If genetic and environmental factors are considered together and combined with conventional and auxiliary AMD examinations such as visual acuity, intraocular pressure, fundus examination, fundus angiography and optical coherence tomography, the accuracy of AMD diagnosis and the effectiveness of risk assessment can be greatly improved, which also benefits the prevention, early detection and treatment of AMD.
Thus, there is an urgent need in the art to develop a reliable method for early prediction and diagnosis of AMD.
Disclosure of Invention
The application aims to provide a risk prediction algorithm model and a risk prediction device for age-related macular degeneration (AMD).
In a first aspect of the application, there is provided a set of biomarkers, said set comprising biomarkers selected from the group consisting of: rs2338104, rs754203, or combinations thereof.
In another preferred embodiment, the biomarker panel is a biomarker panel for diagnosing age-related macular degeneration (AMD), further comprising a biomarker selected from the group consisting of: rs2284664, rs2071277, rs1999930, rs10490924, rs5749482, or a combination thereof.
In another preferred embodiment, the biomarker panel is a biomarker panel for diagnosing age-related macular degeneration (AMD), comprising a biomarker selected from Table A:
Table A
Number        Chromosome position    Mutant base
rs2338104     12:109457363           C>G
rs754203      14:99691630            A>G
rs2284664     1:196733395            C>T
rs2071277     6:32203906             T>C
rs1999930     6:116065971            C>T
rs10490924    10:122454932           G>T
rs5749482     22:32663679            C>G
In another preferred embodiment, the biomarker panel is used for diagnosing AMD, or for preparing a kit or reagent for assessing the risk (susceptibility) of a subject developing AMD or for diagnosing (including early diagnosis and/or assisted diagnosis) AMD in a subject.
In another preferred embodiment, the collection comprises biomarkers selected from table B:
table B
In another preferred embodiment, the collection comprises biomarkers b 1-b 2.
In another preferred embodiment, the collection further comprises biomarkers b 3-b 7.
In another preferred embodiment, the collection further comprises a biomarker: rs551397, rs800292, rs10737680, rs3753396, rs1410996, rs2284664, rs1065489, or a combination thereof.
In another preferred embodiment, the biomarker or set of biomarkers is derived from blood, plasma, serum, or an oral swab sample.
In another preferred embodiment, each biomarker is detected by PCR.
In another preferred embodiment, amplification of the DNA fragment and single base extension are performed using fluorescent quantitative PCR.
In another preferred embodiment, detection of the biomarkers is performed using the MassARRAY Analyzer system.
In another preferred embodiment, the PCR comprises QPCR, fluorescent quantitative PCR.
In another preferred embodiment, the collection is used for the assessment or diagnosis of risk of developing AMD.
In another preferred embodiment, said assessing the risk of developing AMD in a subject comprises early screening for AMD.
In a second aspect of the application there is provided a combination of reagents for use in the assessment or diagnosis of risk of developing AMD, the combination of reagents comprising reagents for detecting individual biomarkers in a collection according to the first aspect of the application.
In a third aspect of the application there is provided a kit comprising a collection according to the first aspect of the application and/or a combination of reagents according to the second aspect of the application.
In another preferred embodiment, each biomarker in the collection according to the first aspect of the application is used as a standard.
In another preferred embodiment, the kit further comprises instructions for use.
In a fourth aspect of the application, there is provided the use of a biomarker panel for the manufacture of a kit for the assessment or diagnosis of risk of developing AMD, wherein the biomarker panel comprises two biomarkers selected from the group consisting of: rs2338104, rs754203, or combinations thereof.
In another preferred embodiment, for use in the assessment or diagnosis of risk of developing AMD, the biomarker panel further comprises a biomarker selected from the group consisting of: rs2284664, rs2071277, rs1999930, rs10490924, rs5749482, or a combination thereof.
In another preferred embodiment, the evaluating comprises the steps of:
(1) Providing a sample derived from the subject to be tested, and detecting the SNP typing value (namely A1 or A2 of Table 2) of each biomarker in the set in the sample;
(2) Comparing the site information measured in step (1) with a reference data set;
preferably, the reference data set comprises data for each biomarker in the collection, derived from AMD patients and healthy controls;
in another preferred embodiment, the sample is selected from the group consisting of: blood, plasma, serum, and buccal swabs.
In another preferred embodiment, comparing the site information measured in step (1) with a reference data set further includes the step of building a supervised machine learning multivariate statistical model, preferably an Xgboost analysis model, to output the likelihood of disease.
In another preferred embodiment, if the likelihood of disease is > 0.5, the subject is determined to be at risk of, or suffering from, AMD.
In another preferred embodiment, the method further comprises the step of processing the sample prior to step (1).
In a fifth aspect of the application, there is provided a method for assessing or diagnosing risk of developing AMD in a subject, comprising the steps of:
(1) Providing a sample derived from the subject to be tested, and detecting the site information (such as SNP typing values, namely A1 or A2 of Table 2) of each biomarker in the set in the sample;
(2) Comparing the site information measured in step (1) with a reference data set;
preferably, the reference data set comprises data for each biomarker in the collection, derived from AMD patients and healthy controls;
in another preferred embodiment, the sample is selected from the group consisting of: blood, plasma, serum, and buccal swabs.
In another preferred embodiment, comparing the site information measured in step (1) with a reference data set further includes the step of building a supervised ensemble machine learning model to output the likelihood of disease; preferably, the machine learning model is an Xgboost analysis model.
In another preferred embodiment, if the likelihood of disease is > 0.5, the subject is determined to be at risk of, or suffering from, AMD.
In another preferred embodiment, the method further comprises the step of processing the sample prior to step (1).
In a sixth aspect of the application, there is provided a method of screening for a candidate compound for assessing or diagnosing risk of developing AMD comprising the steps of:
(1) In a test group, administering a test compound to a subject to be tested, and detecting the level V1 of each biomarker in the collection in a sample derived from the subject in the test group; in a control group, administering a blank control (including vehicle) to a subject to be tested, and detecting the level V2 of each biomarker in the collection in a sample derived from the subject in the control group;
(2) Comparing the level V1 detected in the previous step with the level V2 to determine whether the test compound is a candidate compound for treating AMD, wherein the set comprises two or more biomarkers selected from the group consisting of: rs2338104, rs1999930, rs10490924.
In another preferred embodiment, the subject has AMD.
In another preferred embodiment, if the level V1 of one or more biomarkers selected from subset H is significantly lower than the level V2, the test compound is indicated as a candidate compound for treating AMD.
In another preferred embodiment, the term "significantly lower" means that the ratio of level V1/level V2 is 0.8 or less, preferably 0.6 or less, more preferably 0.4 or less.
In a seventh aspect of the application, there is provided the use of a set of biomarkers for screening candidate compounds for assessing or diagnosing risk of developing AMD and/or for assessing the therapeutic effect of a candidate compound on AMD, wherein the set of biomarkers is selected from the group consisting of two biomarkers of: rs2338104, rs754203, or combinations thereof.
In another preferred embodiment, the biomarker further comprises: rs2284664, rs2071277, rs1999930, rs10490924, rs5749482, or a combination thereof.
In an eighth aspect of the application, there is provided an AMD early stage auxiliary screening system, the system comprising:
(a) An AMD-related disease feature input module for inputting AMD-related disease features of a subject;
wherein the AMD-related disease features comprise site information (e.g., SNP typing values, i.e., A1 or A2 of Table 2) for two or more sites selected from the following group A: rs2284664, rs2071277, rs1999930, rs10490924, rs2338104, rs754203, rs5749482, or a combination thereof;
(b) A processing module for scoring the input AMD-related disease features according to a preset criterion to obtain a risk score, and comparing the risk score with a risk threshold for AMD-related disease to obtain an auxiliary screening result, wherein a risk score above the risk threshold indicates that the subject's risk of AMD-related disease is higher than that of the normal population; and
(c) An output module for outputting the auxiliary screening result.
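The three modules (a) to (c) can be pictured with a minimal Python sketch. Everything named here is hypothetical: `ScreeningResult`, `screen`, `report`, and the caller-supplied `score_fn` are illustrative stand-ins, and the default threshold of 0.5 is an assumption borrowed from the likelihood cut-off mentioned in the other embodiments, not a value this aspect specifies.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ScreeningResult:
    risk_score: float
    elevated: bool  # True when the score exceeds the risk threshold


def screen(features: Dict[str, float],
           score_fn: Callable[[Dict[str, float]], float],
           risk_threshold: float = 0.5) -> ScreeningResult:
    """Processing module: score the input features and compare with the threshold.

    `score_fn` is a placeholder for whatever trained model produces the risk
    score; the threshold default of 0.5 is an assumption.
    """
    score = score_fn(features)
    return ScreeningResult(risk_score=score, elevated=score > risk_threshold)


def report(result: ScreeningResult) -> str:
    """Output module: turn the screening result into a short message."""
    level = "higher than" if result.elevated else "not higher than"
    return f"risk score {result.risk_score:.2f}: risk {level} the normal population"
```

A caller would pass the collected SNP and clinical features as a dictionary together with a scoring function, then hand the result to `report`.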
In another preferred embodiment, the AMD-related disease features in (a) further include: age, diabetes status, body mass index (BMI), renal injury status, atherosclerosis, alcohol consumption, and whether the subject is frequently outdoors.
In another preferred embodiment, the subject is a human.
In another preferred embodiment, the subject comprises an infant, adolescent or adult.
In another preferred embodiment, in the processing module, the risk score processing is performed as follows:
in another preferred embodiment, the feature input module comprises a sample collector.
In another preferred embodiment, the feature input module is selected from the group consisting of: a MassARRAY Analyzer 4 system typing output module, an Askme module.
In another preferred embodiment, the module for determining and processing the AMD related disease includes a processor and a memory, wherein the memory stores the risk threshold data or model of the AMD related disease based on the AMD related disease characteristics.
In another preferred embodiment, the output module includes a reporting system (e.g., an Askme reporting system).
It is understood that within the scope of the present application, the above-described technical features of the present application and the technical features specifically described below (e.g., in the examples) may be combined with each other to constitute new or preferred technical solutions. For reasons of space, they are not described one by one here.
Drawings
Fig. 1 shows the technical route of the present application.
FIG. 2 shows the experimental procedure for SNP genotyping using the MassARRAY Analyzer 4 system.
FIG. 3 shows the ROC curves of the Logistic regression, random forest, Adaboost and Xgboost classifiers. The training and test sets were randomly split 1000 times and the test-set results averaged; the feature variables comprise clinical information and site information (SNP+CC).
FIG. 4 shows the ROC curves of the average test-set predictions when Xgboost learning and prediction are repeated 1000 times. "CC" means the feature variables are clinical information only, "SNP" means the feature variables are SNP sites only, and "SNP+CC" means the feature variables comprise both clinical information and site information.
FIG. 5 shows the importance scores of the top 10 feature variables output by Xgboost.
FIG. 6 shows the number of variables versus the ROC-AUC score. Importance (feature importance) scores of the feature variables are obtained from the Xgboost model and used to optimize the model again: feature variables are added one at a time in descending order of importance score and input into the model for training and testing, to find the number of variables that gives the best test ROC-AUC. As shown in the figure, the best ROC-AUC corresponds to 4 variables, i.e. the ROC-AUC score is highest when the four feature variables with the highest importance scores are used as input.
FIG. 7 shows the ROC curve, averaged over 1000 test sets, with Xgboost as the machine learning model and age, rs754203, rs2338104 and diabetes as variables.
Detailed Description
The present inventors have conducted extensive and intensive studies and have developed, for the first time, a risk prediction algorithm model and device for age-related macular degeneration (AMD). The application uses the odds ratio (OR) values of 7 related SNPs, combined with 7 items of clinical information, and constructs a risk prediction algorithm model and device using a machine learning method. The application can assist clinicians in predicting AMD in advance and diagnosing it early, and is of great clinical significance for reducing the incidence of AMD and improving the treatment rate of the disease. The present application has been completed on the basis of this finding.
Terminology
rs2338104: sequence TGAAAAAGTTCTAAAATTAGATAGT[C/G]GTTATGGCCTCACAACTTGTGAATA, chromosome position 12:109457363, located in the gene KCTD10.
rs754203: sequence GTGCTGTCCTGGGGCCCAGGAGCCC[C/T]GGGGGCAAGGCTCTGCCCTGTTGCT, chromosome position 14:99691630, located in the gene CYP46A1.
rs2284664: sequence AGAAAAATACCAGTCTCCATAGATC[A/G/T]TAAAGCAAATAGATGGTCTTAAAAT, chromosome position 1:196733395, located in the gene CFH.
rs2071277: sequence GGCAGTGACTGATGCAGTGTGTGAC[A/G]TCTAATCTCCCCCATAATTACAGGC, chromosome position 6:32203906, located in the gene NOTCH4.
rs1999930: sequence ATAGGACAGATTCTAGATTTTCCTT[A/C/G/T]TGATACAGAGAAATATAAGACATAA, chromosome position 6:116065971, located in the gene FRK.
rs10490924: sequence TTTATCACACTCCATGATCCCAGCT[G/T]CTAAAATCCACACTGAGCTCTGCTT, chromosome position 10:122454932, located in the gene ARMS2.
rs5749482: sequence TGGGAACTGACTAATACAGCATGTA[C/G]GAACTATGAAATATGAATTGTGTAA, chromosome position 22:32663679, located in the genes LOC105373002 and SYN3.
Age-related macular degeneration (Age-related macular degeneration, AMD)
AMD is an aging-related change in the structure of the macular region. It mainly manifests as a reduced capacity of retinal pigment epithelium (RPE) cells to phagocytose and digest the outer segment disc membranes of photoreceptor cells. As a result, residual bodies of incompletely digested disc membranes are retained in the basal cytoplasm, discharged out of the cell, and deposited on Bruch's membrane to form drusen. Secondary pathological changes then lead to macular degeneration, or to rupture of Bruch's membrane, through which choroidal capillaries grow beneath the RPE and beneath the neurosensory retina, forming choroidal neovascularization. Because the walls of the new vessels are structurally abnormal, they leak and bleed, triggering a further series of secondary pathological changes. AMD mostly occurs in people over 45 years of age, its prevalence increases with age, and it is currently an important cause of blindness in the elderly.
Single nucleotide polymorphism (Single nucleotide polymorphism, SNP)
SNP mainly refers to DNA sequence polymorphism at the genomic level caused by variation of a single nucleotide. SNPs are widely present in the human genome, on average one in every … base pairs, and their total number is estimated at 3 million or more. SNPs are binary markers, caused by single-base transitions or transversions, and also by base insertions or deletions. SNPs may lie either within gene sequences or in non-coding sequences outside genes.
Xgboost
Xgboost is a boosting-type supervised ensemble learning model composed of a plurality of associated CART trees. CART is a binary decision tree. Each time a branch is made, every threshold of every feature column is tried exhaustively; the feature column and threshold that reduce impurity the most are found according to the Gini coefficient, and the samples are divided into two branches according to feature column <= threshold and feature column > threshold, each branch containing the samples that meet its condition. Branching continues in the same way until all samples under a branch belong to the same class or a preset termination condition is reached; if the class in a final leaf node is not unique, the class of the majority of its samples is used as the class of the leaf node. Xgboost can be expressed as the following formula:
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F$$
where $\hat{y}_i$ is the predicted value, $F$ denotes the set of all possible CART trees, and $f$ denotes a specific CART tree.
The objective function of the model is the following formula:
$$Obj(\theta) = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$
where $\sum_{i} l(y_i, \hat{y}_i)$ is the loss function and $\sum_{k} \Omega(f_k)$ is the regularization term. The point at which $Obj(\theta)$ takes its minimum is the predicted value of the node, and the minimum function value is the minimum loss. Xgboost uses additive training to optimize the objective function step by step: the first tree is optimized, then the second, and so on until all K trees have been optimized.
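The CART branching rule described above (try every threshold of every feature column and keep the split that lowers Gini impurity the most) can be sketched in a few lines of plain Python. The function names `gini` and `best_split` are illustrative only. Note that this follows the text's description of a classification tree split; the XGBoost library itself scores candidate splits with a gradient-based gain rather than Gini impurity.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure or empty node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Exhaustively search (column, threshold) for the lowest weighted Gini impurity.

    Returns (impurity, column, threshold), or None if no split separates the rows.
    Samples with value <= threshold go left, the rest go right.
    """
    n = len(rows)
    best = None
    for col in range(len(rows[0])):
        for thr in sorted({r[col] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[col] <= thr]
            right = [y for r, y in zip(rows, labels) if r[col] > thr]
            if not left or not right:
                continue  # not a real split
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or impurity < best[0]:
                best = (impurity, col, thr)
    return best
```

For four samples with one informative column, `best_split([[1, 5], [2, 5], [3, 5], [4, 5]], [0, 0, 1, 1])` picks column 0 with threshold 2, which separates the two classes completely.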
ROC-AUC
ROC-AUC is a method for evaluating the accuracy of a model. The ROC curve is the receiver operating characteristic curve, plotted with the false positive rate on the horizontal axis and the true positive rate on the vertical axis; it is a comprehensive index reflecting sensitivity and specificity as continuous variables. AUC is the area under the ROC curve. The ROC-AUC value generally lies between 0.5 and 1.0, and the closer it is to 1.0, the better the diagnostic performance: accuracy is low at 0.5-0.7, moderate at 0.7-0.9, and high above 0.9. AUC = 0.5 indicates that the diagnostic method is completely ineffective and has no diagnostic value. AUC < 0.5 does not fit real situations and rarely occurs in practice.
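As an illustration of the AUC just defined, the sketch below uses the equivalent pairwise formulation: the AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name `roc_auc` is illustrative; labels are assumed to be coded 1 (disease) and 0 (control).

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))
```

With labels `[0, 0, 1, 1]` and scores `[0.1, 0.4, 0.35, 0.8]`, three of the four positive-negative pairs are ordered correctly, so the function returns 0.75.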
The main advantages of the application include:
1) The method predicts the AMD risk value from site information and clinical data for the first time in the clinical field, and is suitable for high-throughput sample testing;
2) The application predicts the risk of AMD at future ages and can show how changes in lifestyle habits and other behaviors affect the risk value, thereby serving to prevent and warn of AMD.
The application will be further illustrated with reference to specific examples. It is to be understood that these examples are illustrative of the present application and are not intended to limit its scope. Experimental procedures for which specific conditions are not given in the examples below generally follow conventional conditions, such as those described in Sambrook et al., Molecular Cloning: A Laboratory Manual (New York: Cold Spring Harbor Laboratory Press, 1989), or those recommended by the manufacturer. Percentages and parts are by weight unless otherwise indicated.
Example 1.
From data on 108 candidate SNP loci, the 7 AMD-related loci required by the algorithm model and device were screened by statistical analysis.
SNP statistics and clinical informatics analysis were carried out on the recruited experimental training group and control group; after extensive screening, 108 SNP loci were identified, as shown in Table 1. SNP typing data were obtained by the following steps:
1. Sample collection: one of the two collection methods below is used.
a) Blood sample collection: 2-4 mL of whole blood is collected in an EDTA anticoagulant tube.
b) Oral swab collection: a nylon flocked oral swab is used to scrape the palate and the mucosa on both sides of the oral cavity of the person to be tested until the whole nylon flocked part of the swab is wet; the sampled swab is placed in a test tube containing sample preservation solution (1-2 mL) for storage.
2. Sample transportation: ice packs are added to a foam box containing the samples for low-temperature transportation.
3. Amplification of the DNA fragments and single base extension were performed using a 7500 fluorescent quantitative PCR instrument. The dye MIX is prepared first: enough for several extra wells is prepared, and after preparation it is stored at -20 °C. Next, the walls of the mix centrifuge tubes are labelled as dye-method or probe-method so that the two are not confused. The reagents are then added in sequence: MIXTURE (17 μL), primer 1 (1 μL), sample (2 μL). Finally, the plate is sealed with film and loaded onto the instrument.
4. SNP genotyping was performed using the MassARRAY Analyzer system; the procedure is shown in FIG. 2.
5. SNP loci related to AMD were obtained by genome-wide SNP association analysis (GWAS), where the association analysis includes the following assumptions:
1) Genotypic Model: assuming A is the minor allele and a is the major allele, the 3 different genotypes (AA, Aa, aa) have different effects.
2) Dominant Model: the AA/Aa genotypes have a different effect from the aa genotype.
3) Recessive Model: the AA genotype has a different effect from the Aa/aa genotypes.
4) Allelic Model: the alleles A and a have different effects.
Based on the above assumptions, a chi-square value is calculated: $\chi^2 = \sum \frac{(O - E)^2}{E}$, where O is the observed frequency and E is the expected frequency. Taking assumption 2) as an example: for the AA or Aa genotype (either one suffices) in normal subjects, the squared difference between the observed and expected frequencies divided by the expected frequency gives the value V1; the value V2 for AA or Aa in patients is calculated in the same way; the values V3 for aa in normal subjects and V4 for aa in patients are obtained likewise; and the chi-square value is V1+V2+V3+V4. A p value for the association is obtained from the chi-square value, and 14 associated sites were obtained by screening for p values smaller than 0.05.
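The chi-square calculation for a two-group, two-category comparison (for example, AA/Aa versus aa under the dominant model, or A versus a under the allelic model) can be sketched as follows. This is a minimal illustration, not the software actually used in the study: the function name `chi_square_2x2` and the example counts are hypothetical, expected counts are derived from the row and column totals in the usual way, and the p value uses the identity for one degree of freedom, p = erfc(sqrt(chi2 / 2)).

```python
import math


def chi_square_2x2(case_a, case_b, ctrl_a, ctrl_b):
    """Pearson chi-square (1 degree of freedom) for a 2x2 table of counts.

    Rows are patients and controls; columns are the two genotype (or allele)
    categories being compared. Returns (chi_square, p_value).
    """
    observed = [[case_a, case_b], [ctrl_a, ctrl_b]]
    total = case_a + case_b + ctrl_a + ctrl_b
    row_sums = [sum(row) for row in observed]
    col_sums = [case_a + ctrl_a, case_b + ctrl_b]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected  # V1..V4 in the text
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

For instance, `chi_square_2x2(10, 20, 20, 10)` gives a chi-square value of 20/3 (about 6.67) and a p value just under 0.01, which would pass the p < 0.05 screen.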
Some of the 14 associated loci are collinear on the chromosome, and 7 loci with high collinearity were eliminated by the following algorithm: a window of 50 SNP loci is taken and moved 5 SNPs at a time; for each locus, the multiple correlation coefficient R² with the other loci is computed, and then the index 1/(1-R²); if the index is greater than 2, the SNP locus is excluded. The sites rs551397, rs800292, rs10737680, rs3753396, rs1410996, rs2284664 and rs1065489 were eliminated, and the sites rs2284664, rs2071277, rs1999930, rs10490924, rs2338104, rs754203 and rs5749482 were finally obtained.
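The index 1/(1-R²) is the variance inflation factor (VIF), and the window-based pruning can be sketched in plain Python as below. This is a simplified sketch under stated assumptions, not the study's actual implementation: genotypes are assumed to be coded as minor-allele counts (0, 1, 2) with every SNP polymorphic, VIFs are read from the diagonal of the inverse correlation matrix (a standard identity), a tiny ridge term is added so that perfectly collinear SNPs do not make the matrix singular, and within each window the SNP with the largest VIF is dropped repeatedly until all remaining VIFs are at or below the threshold. All function names are illustrative.

```python
import math


def _corr_matrix(cols, ridge=1e-8):
    """Pearson correlation matrix of the columns, with a small ridge on the diagonal."""
    n = len(cols[0])
    cent = [[v - sum(c) / n for v in c] for c in cols]
    norms = [math.sqrt(sum(v * v for v in c)) for c in cent]  # assumes no constant column
    k = len(cols)
    return [[sum(a * b for a, b in zip(cent[i], cent[j])) / (norms[i] * norms[j])
             + (ridge if i == j else 0.0) for j in range(k)] for i in range(k)]


def _invert(m):
    """Gauss-Jordan inversion with partial pivoting."""
    k = len(m)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(k)] for i, row in enumerate(m)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        d = aug[c][c]
        aug[c] = [v / d for v in aug[c]]
        for r in range(k):
            if r != c:
                f = aug[r][c]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[c])]
    return [row[k:] for row in aug]


def vifs(cols):
    """VIF of each column = diagonal of the inverse correlation matrix = 1 / (1 - R^2)."""
    inv = _invert(_corr_matrix(cols))
    return [inv[i][i] for i in range(len(cols))]


def prune_by_vif(genotypes, window=50, step=5, threshold=2.0):
    """Return indices of SNPs kept after sliding-window VIF pruning.

    `genotypes` is a list of samples, each a list of minor-allele counts per SNP.
    """
    n_snps = len(genotypes[0])
    cols = [[float(row[j]) for row in genotypes] for j in range(n_snps)]
    kept = set(range(n_snps))
    start = 0
    while start < n_snps:
        idx = [j for j in range(start, min(start + window, n_snps)) if j in kept]
        while len(idx) > 1:
            v = vifs([cols[j] for j in idx])
            worst = max(range(len(idx)), key=lambda i: v[i])
            if v[worst] <= threshold:
                break
            kept.discard(idx[worst])  # drop the most collinear SNP and re-check
            del idx[worst]
        start += step
    return sorted(kept)
```

The window size 50, step 5 and threshold 2 are the values given in the text; tools such as PLINK offer a comparable VIF-based pruning option with the same three parameters.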
After the above procedure, the desired sites were obtained and the information is shown in Table 2.
The SNP genotype data are cleaned and converted into corresponding values based on the odds ratio (OR) calculated for the 7 AMD-related loci. The OR is the ratio of the odds of carrying the allele in the disease group to the odds in the control group, where odds are the ratio of the probability that an event occurs to the probability that it does not. The formula is as follows:
OR=(nA/na)/(mA/ma)=(nA×ma)/(mA×na)
where A is the minor allele, nA is the number of A alleles in the disease group, na is the number of non-A alleles in the disease group, mA is the number of A alleles in the control group, and ma is the number of non-A alleles in the control group. The OR is interpreted as follows:
a) OR > 1, indicates that the frequency of A in the case group is greater than that in the non-case group, i.e., A has a higher risk of developing disease.
b) OR <1 indicates that A is less frequent in the case group than in the non-case group, i.e., A has a protective effect.
c) The more closely the disease is associated with the A allele, the greater the value of the ratio.
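The OR formula can be checked with a short Python sketch. The function names are illustrative, and the second helper assumes that allele frequencies (such as the F_A and F_U columns of Table 2) are given instead of raw counts.

```python
def odds_ratio(n_A, n_a, m_A, m_a):
    """OR from allele counts: (nA/na)/(mA/ma) = (nA*ma)/(mA*na).

    n_A, n_a: counts of allele A and of non-A alleles in the disease group;
    m_A, m_a: the same counts in the control group.
    """
    return (n_A * m_a) / (m_A * n_a)


def odds_ratio_from_freq(f_case, f_control):
    """OR from the frequency of allele A in patients and in controls."""
    return (f_case / (1.0 - f_case)) / (f_control / (1.0 - f_control))
```

Using the rs2338104 frequencies from Table 2, `odds_ratio_from_freq(0.4206, 0.2905)` gives approximately 1.773, matching the OR reported for that locus.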
TABLE 1 numbering of initially selected SNP loci
(unified numbering of dbSNP of the National Center for Biotechnology Information (NCBI) database)
TABLE 2 genome-wide SNP correlation analysis (GWAS) technique to obtain AMD-related SNP site information
CHR  SNP         A1  F_A      F_U       A2  CHISQ  P         OR      SE      L95     U95
1    rs2284664   T   0.2687   0.3762    C   4.25   0.03924   0.6091  0.2414  0.3795  0.9777
6    rs2071277   C   0.3672   0.4471    T   7.591  0.022470  0.7175  0.2304  0.4568  1.127
6    rs1999930   T   0.03676  0.004587  C   5.204  0.02253   8.282   1.101   0.9571  71.67
10   rs10490924  T   0.5397   0.4231    G   4.286  0.03842   1.599   0.2273  1.024   2.496
12   rs2338104   G   0.4206   0.2905    C   5.951  0.01471   1.773   0.2359  1.117   2.816
14   rs754203    G   0.2868   0.3773    A   7.925  0.019020  0.6636  0.2352  0.4186  1.052
22   rs5749482   G   0.2353   0.3636    C   6.42   0.01128   0.5385  0.246   0.3325  0.872
The first column CHR is chromosomal information of loci, the second column is the number of SNP loci, the third column (A1) is the hypo-genotype, the fourth column f_a is the frequency observed by A1 genotype disease, the fifth column f_u is the frequency observed by A1 allele in healthy people, the sixth column is the other allele, i.e. the main allele (A2), the seventh column CHISQ is chi square value, the eighth column P is P value obtained by conversion of chi square value, the ninth column OR is OR risk value, and the remaining ten, eleven and twelve are standard errors of OR value and upper and lower values of 95% confidence interval.
In the subsequent analysis, each genotype is replaced by a value derived from the OR of the minor allele. For example, if A is the minor allele and a is the major allele, a genotype containing one minor allele (Aa) is replaced by the OR, a genotype containing two minor alleles (AA) is replaced by the OR squared, and a genotype without the minor allele (aa) is replaced by 1.
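The encoding rule amounts to raising the OR to the number of minor alleles in the genotype. A minimal sketch follows; the function name and the genotype string format (two allele letters such as "GT") are assumptions made for illustration, while the allele and OR values come from the rs10490924 row of Table 2.

```python
def encode_genotype(genotype, minor_allele, odds_ratio):
    """Encode a two-letter genotype as OR ** (number of minor alleles)."""
    if len(genotype) != 2:
        raise ValueError("genotype must consist of exactly two alleles")
    minor_count = genotype.count(minor_allele)
    return odds_ratio ** minor_count


# rs10490924 in Table 2: minor allele T, major allele G, OR = 1.599
for genotype in ("GG", "GT", "TT"):
    print(genotype, encode_genotype(genotype, "T", 1.599))
```

Under this encoding a protective allele (OR < 1) lowers the feature value with each copy, and a risk allele raises it.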
Example 2.
Thirteen items of clinical survey data were collected from the subjects by questionnaire, including age, height and body mass index (BMI), hypertension, hyperlipidemia, diabetes, renal injury, whether the subject is often outdoors, whether the subject is vegetarian, never having smoked, never having drunk alcohol, atherosclerosis, eye surgery, and sex.
Example 3.
Machine learning algorithms can be divided into three categories: supervised learning, unsupervised learning, and semi-supervised learning. Supervised learning derives a function from known pairs of input and output data, so that new inputs are mapped to an appropriate output, such as a class. The samples in this application were clinically diagnosed and therefore carry class labels, so the model was explored and selected among supervised machine learning classification models. Three input data sets were used: the data of all samples with SNP site information only (SNP), the data of all samples with clinical information only (CC), and the combined SNP site and clinical data (SNP+CC). The diagnosis of each sample was used as the output class label.
The algorithm construction is carried out according to the following steps:
a) All data were randomly split into a 75% training set and a 25% test set.
b) A machine learning classifier was constructed. Using SNP+CC as input data, logistic regression, random forest, AdaBoost and XGBoost were tried in turn.
c) The hyperparameters were tuned by cross-validation, and the parameters with the best score were selected.
d) Results were verified with the test set.
e) Model evaluation. The above procedure was repeated 1000 times, and the area under the mean receiver operating characteristic curve (ROC-AUC) on the test set was calculated. XGBoost, which had the highest ROC-AUC score, was chosen as the best model (see FIG. 3).
f) Feature-variable screening. The clinical information (CC), the site information (SNP) and the combined site and clinical information (SNP+CC) were each used as input data and classified with XGBoost, repeated 1000 times. The mean ROC curves on the test set are shown in FIG. 4; the ROC-AUC of SNP+CC was the highest.
g) Further optimization of feature screening. The XGBoost model yields a feature-importance score for each variable (see FIG. 5 for the top 10). The model was screened again according to these scores: variables were added one at a time in descending order of importance, and the model was trained and tested at each step, giving the relationship between the number of variables and the ROC-AUC score (see FIG. 6). The results showed that the model trained and tested on the 4 most important variables (age, rs754203, rs2338104, diabetes) gave the highest ROC-AUC score.
h) With XGBoost as the machine learning model and age, rs754203, rs2338104 and diabetes as input variables, the mean ROC-AUC over 1000 runs was 0.800 ± 0.06.
i) The model was saved for AMD risk prediction on subsequently measured data.
j) Risk value output: for input test data, the trained model predicts a probability between 0 (control) and 1 (AMD). The predicted probability of class 1 (AMD) is taken as the risk value, and a subject whose risk value exceeds 0.5 is judged to have AMD.
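The evaluation loop in steps a), d), e) and j) can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the application's implementation: the classifier is passed in as a hypothetical `fit` callable standing in for XGBoost, hyperparameter tuning and feature screening are omitted, and all function names are invented for this example.

```python
import random
import statistics


def split_75_25(samples, rng):
    """Shuffle a copy of the samples and cut it into 75% train, 25% test."""
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = round(0.75 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def roc_auc(labels, scores):
    """Probability that a random case outscores a random control (ties 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one case and one control")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def repeated_auc(samples, fit, repeats=1000, seed=0):
    """Mean and standard deviation of the test-set ROC-AUC over random splits.

    samples: list of (features, label) pairs, label 0 (control) or 1 (AMD).
    fit: callable taking the training samples and returning a function that
         maps features to a risk value in [0, 1]; a stand-in for the classifier.
    """
    rng = random.Random(seed)
    aucs = []
    for _ in range(repeats):
        train, test = split_75_25(samples, rng)
        predict = fit(train)
        labels = [label for _, label in test]
        if len(set(labels)) < 2:
            continue  # ROC-AUC is undefined when the test set has one class
        aucs.append(roc_auc(labels, [predict(x) for x, _ in test]))
    return statistics.mean(aucs), statistics.pstdev(aucs)


def classify(risk_value, threshold=0.5):
    """Apply the rule of step j): a risk value above 0.5 is judged as AMD."""
    return "AMD" if risk_value > threshold else "control"
```

Passing the classifier in as `fit` keeps the split, scoring and averaging logic independent of the model, so the same loop could compare several candidate classifiers as in step b).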
All documents mentioned in this disclosure are incorporated by reference in this disclosure as if each were individually incorporated by reference. Further, it will be appreciated that various changes and modifications may be made by those skilled in the art after reading the above teachings, and such equivalents are intended to fall within the scope of the application as defined in the appended claims.
Sequence listing
<110> Shanghai Biotecan Biology Medicine Technology Co., Ltd.
Shanghai Biotecan Medical Diagnostics Co., Ltd.
<120> A risk prediction algorithm model and device for age-related macular degeneration
<130> P2018-2112
<160> 7
<170> SIPOSequenceListing 1.0
<210> 1
<211> 52
<212> DNA
<213> Artificial sequence (Artificial Sequence)
<400> 1
tgaaaaagtt ctaaaattag atagtcggtt atggcctcac aacttgtgaa ta 52
<210> 2
<211> 52
<212> DNA
<213> Artificial sequence (Artificial Sequence)
<400> 2
gtgctgtcct ggggcccagg agcccctggg ggcaaggctc tgccctgttg ct 52
<210> 3
<211> 53
<212> DNA
<213> Artificial sequence (Artificial Sequence)
<400> 3
agaaaaatac cagtctccat agatcagtta aagcaaatag atggtcttaa aat 53
<210> 4
<211> 52
<212> DNA
<213> Artificial sequence (Artificial Sequence)
<400> 4
ggcagtgact gatgcagtgt gtgacagtct aatctccccc ataattacag gc 52
<210> 5
<211> 54
<212> DNA
<213> Artificial sequence (Artificial Sequence)
<400> 5
ataggacaga ttctagattt tccttacgtt gatacagaga aatataagac ataa 54
<210> 6
<211> 52
<212> DNA
<213> Artificial sequence (Artificial Sequence)
<400> 6
tttatcacac tccatgatcc cagctgtcta aaatccacac tgagctctgc tt 52
<210> 7
<211> 52
<212> DNA
<213> Artificial sequence (Artificial Sequence)
<400> 7
tgggaactga ctaatacagc atgtacggaa ctatgaaata tgaattgtgt aa 52

Claims (5)

1. A set of biomarkers, wherein the set comprises 5 biomarkers selected from the group consisting of rs2338104, rs754203, rs5749482, rs2284664 and rs10490924, wherein:
rs2338104 is at chromosome position 12:109457363, where C is mutated to G;
rs754203 is at chromosome position 14:99691630, where A is mutated to G;
rs5749482 is at chromosome position 22:32663679, where C is mutated to G;
rs2284664 is at chromosome position 1:196733795, where C is mutated to T;
rs10490924 is at chromosome position 10:12245932, where G is mutated to T;
the set further comprises 2 biomarkers selected from the group consisting of rs2071277 and rs1999930, wherein:
rs2071277 is at chromosome position 6:32203906, where T is mutated to C;
rs1999930 is at chromosome position 6:116065971, where C is mutated to T.
2. A combination of reagents for assessing or diagnosing the risk of developing age-related macular degeneration (AMD), comprising reagents for detecting each biomarker in the set of claim 1.
3. A kit comprising the set of claim 1 and/or the combination of reagents of claim 2.
4. Use of a set of biomarkers for the preparation of a kit for assessing or diagnosing the risk of age-related macular degeneration (AMD), wherein the set of biomarkers comprises 5 biomarkers selected from the group consisting of rs2338104, rs754203, rs5749482, rs2284664 and rs10490924;
wherein:
rs2338104 is at chromosome position 12:109457363, where C is mutated to G;
rs754203 is at chromosome position 14:99691630, where A is mutated to G;
rs5749482 is at chromosome position 22:32663679, where C is mutated to G;
rs2284664 is at chromosome position 1:196733795, where C is mutated to T;
rs10490924 is at chromosome position 10:12245932, where G is mutated to T;
the set of biomarkers further comprises 2 biomarkers selected from the group consisting of rs2071277 and rs1999930, wherein:
rs2071277 is at chromosome position 6:32203906, where T is mutated to C;
rs1999930 is at chromosome position 6:116065971, where C is mutated to T.
5. An early-stage auxiliary screening system for age-related macular degeneration (AMD), the system comprising:
(a) an AMD-related disease feature input module for inputting AMD-related disease features of a subject;
wherein the AMD-related disease features comprise information on 5 loci selected from the group consisting of rs2338104, rs754203, rs5749482, rs2284664 and rs10490924;
(b) a processing module for scoring the input AMD-related disease features according to a preset criterion to obtain a risk score, and comparing the risk score with a risk threshold for AMD-related disease to obtain an auxiliary screening result, wherein a risk score above the risk threshold indicates that the subject's risk of AMD-related disease is higher than that of the normal population; and
(c) an auxiliary screening result output module for outputting the auxiliary screening result;
wherein:
rs2338104 is at chromosome position 12:109457363, where C is mutated to G;
rs754203 is at chromosome position 14:99691630, where A is mutated to G;
rs5749482 is at chromosome position 22:32663679, where C is mutated to G;
rs2284664 is at chromosome position 1:196733795, where C is mutated to T;
rs10490924 is at chromosome position 10:12245932, where G is mutated to T;
and the AMD-related disease features further comprise information on 2 loci selected from the group consisting of rs2071277 and rs1999930, wherein:
rs2071277 is at chromosome position 6:32203906, where T is mutated to C;
rs1999930 is at chromosome position 6:116065971, where C is mutated to T.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910101067.7A CN109585017B (en) 2019-01-31 2019-01-31 Risk prediction algorithm model and device for age-related macular degeneration

Publications (2)

Publication Number Publication Date
CN109585017A (en) 2019-04-05
CN109585017B (en) 2023-12-12

Family

ID=65918525

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910101067.7A Active CN109585017B (en) 2019-01-31 2019-01-31 Risk prediction algorithm model and device for age-related macular degeneration

Country Status (1)

Country Link
CN (1) CN109585017B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant