CN109585017B

CN109585017B - Risk prediction algorithm model and device for age-related macular degeneration

Info

Publication number: CN109585017B
Application number: CN201910101067.7A
Authority: CN
Inventors: 王丽君; 高军晖; 袁卫兰; 龚建兵; 刘慧敏; 林灵; 许骋; 张英霞
Original assignee: Shanghai Biotecan Medical Diagnostics Co ltd; Shanghai Biotecan Biology Medicine Technology Co ltd
Current assignee: Shanghai Biotecan Medical Diagnostics Co ltd; Shanghai Biotecan Biology Medicine Technology Co ltd
Priority date: 2019-01-31
Filing date: 2019-01-31
Publication date: 2023-12-12
Anticipated expiration: 2039-01-31
Also published as: CN109585017A

Abstract

The application provides a risk prediction algorithm model and device for age-related macular degeneration (AMD). Specifically, seven AMD-related single nucleotide polymorphisms (SNPs) are genotyped, the genotypes are converted into odds ratio (OR) values, these are combined with seven items of clinical information, and a risk prediction algorithm model and device are constructed using a machine learning method. The application can assist clinicians in predicting AMD in advance and diagnosing it early, and is of great clinical significance for reducing the incidence of AMD and improving the treatment rate.

Description

Risk prediction algorithm model and device for age-related macular degeneration

Technical Field

The application relates to the field of medical biological detection, and in particular to a risk prediction algorithm model and a risk prediction device for age-related macular degeneration (AMD).

Background

Age-related macular degeneration (AMD) is a major cause of blindness in the elderly. The disease has a complex etiology related to age, sex, smoking, race, heredity and other factors; it causes irreversible vision loss, and no effective treatment is currently available. AMD is common: meta-analysis results show that the overall incidence of AMD worldwide is 8.01%, and that the incidence in European, African and Asian populations is 11.2%, 7.1% and 6.8%, respectively. The incidence of early and advanced AMD among the elderly in China is 4.7%-9.2% and 0.2%-1.9%, respectively. The number of AMD patients worldwide was predicted to reach 196 million by 2020 and 288 million by 2040. With the accelerating aging of the population in China, AMD shows a marked upward trend.

AMD results from a combination of environmental and genetic factors, with genetic factors accounting for a high proportion, up to 45-70%, of the risk of developing the disease. If genetic and environmental factors are considered together and combined with conventional and auxiliary AMD examinations such as visual acuity, intraocular pressure, fundus examination, fundus angiography and optical coherence tomography, the accuracy of AMD diagnosis and the effectiveness of risk assessment can be greatly improved, which also benefits the prevention, early detection and treatment of AMD.

Thus, there is an urgent need in the art to develop a reliable method for early prediction and diagnosis of AMD.

Disclosure of Invention

The application aims to provide a risk prediction algorithm model and a risk prediction device for age-related macular degeneration (AMD).

In a first aspect of the application, there is provided a set of biomarkers, said set comprising biomarkers selected from the group consisting of: rs2338104, rs754203, or combinations thereof.

In another preferred embodiment, the biomarker panel is a biomarker panel for diagnosing age-related macular degeneration (AMD), further comprising a biomarker selected from the group consisting of: rs2284664, rs2071277, rs1999930, rs10490924, rs5749482, or a combination thereof.

In another preferred embodiment, the biomarker panel is a biomarker panel for diagnosing age-related macular degeneration (AMD), comprising a biomarker selected from Table A:

Table A

Number	Chromosome location	Mutant bases
rs2338104	12:109457363	C>G
rs754203	14:99691630	A>G
rs2284664	1:196733395	C>T
rs2071277	6:32203906	T>C
rs1999930	6:116065971	C>T
rs10490924	10:122454932	G>T
rs5749482	22:32663679	C>G

In another preferred embodiment, the biomarker panel is used for diagnosing AMD, or for preparing a kit or reagent for assessing the risk (susceptibility) of a subject developing AMD or for diagnosing (including early diagnosis and/or assisted diagnosis) AMD in a subject.

In another preferred embodiment, the collection comprises biomarkers selected from table B:

Table B

In another preferred embodiment, the collection comprises biomarkers b1-b2.

In another preferred embodiment, the collection further comprises biomarkers b3-b7.

In another preferred embodiment, the collection further comprises a biomarker: rs551397, rs800292, rs10737680, rs3753396, rs1410996, rs2284664, rs1065489, or a combination thereof.

In another preferred embodiment, the biomarker or set of biomarkers is derived from blood, plasma, serum, or an oral swab sample.

In another preferred embodiment, each biomarker is detected by PCR.

In another preferred embodiment, amplification of the DNA fragment and single base extension are performed using fluorescent quantitative PCR.

In another preferred embodiment, detection of the biomarkers is performed using the MassARRAY Analyzer system.

In another preferred embodiment, the PCR comprises QPCR, fluorescent quantitative PCR.

In another preferred embodiment, the collection is used for the assessment or diagnosis of risk of developing AMD.

In another preferred embodiment, said assessing the risk of developing AMD in a subject comprises early screening for AMD.

In a second aspect of the application there is provided a combination of reagents for use in the assessment or diagnosis of risk of developing AMD, the combination of reagents comprising reagents for detecting individual biomarkers in a collection according to the first aspect of the application.

In a third aspect of the application there is provided a kit comprising a collection according to the first aspect of the application and/or a combination of reagents according to the second aspect of the application.

In another preferred embodiment, each biomarker in the collection according to the first aspect of the application is used as a standard.

In another preferred embodiment, the kit further comprises instructions.

In a fourth aspect of the application, there is provided the use of a biomarker panel for the manufacture of a kit for the assessment or diagnosis of risk of developing AMD, wherein the biomarker panel comprises two biomarkers selected from the group consisting of: rs2338104, rs754203, or combinations thereof.

In another preferred embodiment, for use in the assessment or diagnosis of risk of developing AMD, the biomarker panel further comprises a biomarker selected from the group consisting of: rs2284664, rs2071277, rs1999930, rs10490924, rs5749482, or a combination thereof.

In another preferred embodiment, the evaluating comprises the steps of:

(1) Providing a sample derived from the subject to be tested, and detecting the SNP typing value (namely A1 or A2 of Table 2) of each biomarker of the set in the sample;

(2) Comparing the site information measured in step (1) with a reference data set;

preferably, the reference data set comprises data for each biomarker in the collection derived from AMD patients and healthy controls.

In another preferred embodiment, the sample is selected from the group consisting of: blood, plasma, serum, and buccal swabs.

In another preferred embodiment, the comparing the site information measured in step (1) with a reference data set further includes the step of creating a multivariate statistical model of supervised machine learning, preferably an Xgboost analysis model, to output the likelihood of illness.

In another preferred embodiment, if the likelihood of disease is > 0.5, the subject is determined to be at risk of or suffering from AMD disease.

In another preferred embodiment, the method further comprises the step of treating the sample prior to step (1).

In a fifth aspect of the application, there is provided a method for assessing or diagnosing risk of developing AMD in a subject, comprising the steps of:

(1) Providing a sample derived from the subject to be tested, and detecting the site information (such as the SNP typing value, namely A1 or A2 of Table 2) of each biomarker of the set in the sample;

(2) Comparing the site information measured in step (1) with a reference data set;

preferably, the reference data set comprises data for each biomarker in the collection derived from AMD patients and healthy controls.

In another preferred embodiment, comparing the site information measured in step (1) with a reference data set further includes the step of building a supervised ensemble machine learning model to output the likelihood of illness; preferably, the machine learning model is an Xgboost analysis model.

In a sixth aspect of the application, there is provided a method of screening for a candidate compound for assessing or diagnosing risk of developing AMD comprising the steps of:

(1) Administering a test compound to a subject to be tested in a test set, detecting the level V1 of each biomarker in a collection in a sample derived from the subject in the test set; in a control group, administering a blank (including vehicle) to a subject to be tested, and detecting the level V2 of each biomarker in the collection in a sample derived from the subject in the control group;

(2) Comparing the level V1 detected in the previous step with the level V2 to determine whether the test compound is a candidate compound for treating AMD, wherein the set comprises two or more biomarkers selected from the group consisting of: rs2338104, rs1999930, rs10490924.

In another preferred embodiment, the subject has AMD.

In another preferred embodiment, if the level V1 of one or more biomarkers selected from subset H is significantly lower than the level V2, the test compound is indicated as a candidate compound for treating AMD.

In another preferred embodiment, the term "significantly lower" means that the ratio of level V1 to level V2 is 0.8 or less, preferably 0.6 or less, more preferably 0.4 or less.
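The ratio criterion can be illustrated with a minimal sketch. The function name `is_candidate` and the example levels are hypothetical; only the cut-offs 0.8, 0.6 and 0.4 come from the text.

```python
def is_candidate(v1: float, v2: float, max_ratio: float = 0.8) -> bool:
    """Return True when level V1 (test group) is lower than level V2
    (control group) by the stated margin, i.e. V1 / V2 <= max_ratio."""
    if v2 <= 0:
        raise ValueError("control level V2 must be positive")
    return v1 / v2 <= max_ratio


# Hypothetical levels, for illustration only.
print(is_candidate(3.0, 5.0))                 # ratio 0.6 against the 0.8 cut-off
print(is_candidate(3.0, 5.0, max_ratio=0.4))  # same ratio against the strictest cut-off
```

Passing `max_ratio=0.6` or `max_ratio=0.4` corresponds to the "preferably" and "more preferably" variants.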

In a seventh aspect of the application, there is provided the use of a set of biomarkers for screening candidate compounds for assessing or diagnosing risk of developing AMD and/or for assessing the therapeutic effect of a candidate compound on AMD, wherein the set of biomarkers is selected from the group consisting of two biomarkers of: rs2338104, rs754203, or combinations thereof.

In another preferred embodiment, the biomarker further comprises: rs2284664, rs2071277, rs1999930, rs10490924, rs5749482, or a combination thereof.

In an eighth aspect of the application, there is provided an AMD early stage auxiliary screening system, the system comprising:

(a) An AMD-related disease feature input module for inputting the AMD-related disease features of a subject;

wherein the AMD-related disease features comprise site information (e.g., SNP typing values, i.e., A1 or A2 of Table 2) for two or more members of the following group A: rs2284664, rs2071277, rs1999930, rs10490924, rs2338104, rs754203, rs5749482, or a combination thereof;

(b) A processing module for scoring the input AMD-related disease features according to a preset criterion to obtain a risk score, and comparing the risk score with a risk threshold for AMD-related disease to obtain an auxiliary screening result, wherein a risk score above the risk threshold indicates that the subject's risk of AMD-related disease is higher than that of the normal population; and

(c) An output module for outputting the auxiliary screening result.
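The three modules can be pictured as a small pipeline. The following is only a sketch under stated assumptions: the names `input_module`, `processing_module`, `output_module` and `toy_score` are hypothetical, the scoring function is a placeholder and not the trained model of the application, and the default threshold of 0.5 is borrowed from the likelihood cut-off stated elsewhere in this document.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict

Features = Dict[str, float]

# Group A sites listed in module (a).
GROUP_A = {"rs2284664", "rs2071277", "rs1999930", "rs10490924",
           "rs2338104", "rs754203", "rs5749482"}


@dataclass
class ScreeningResult:
    risk_score: float
    elevated: bool


def input_module(raw: Features) -> Features:
    """(a) Accept the features of one subject; require at least two group A sites."""
    if len(GROUP_A & raw.keys()) < 2:
        raise ValueError("site information for two or more group A SNPs is required")
    return dict(raw)


def processing_module(features: Features,
                      score_fn: Callable[[Features], float],
                      risk_threshold: float = 0.5) -> ScreeningResult:
    """(b) Score the features and compare the score with the risk threshold."""
    score = score_fn(features)
    return ScreeningResult(score, score > risk_threshold)


def output_module(result: ScreeningResult) -> str:
    """(c) Render the auxiliary screening result."""
    level = "higher than" if result.elevated else "not higher than"
    return f"risk score {result.risk_score:.3f}: risk {level} the normal population"


def toy_score(features: Features) -> float:
    """Hypothetical placeholder: treat the product of the values as odds."""
    odds = math.prod(features.values())
    return odds / (1.0 + odds)


subject = input_module({"rs2338104": 1.773, "rs754203": 1.0})
result = processing_module(subject, toy_score)
print(output_module(result))
```

Keeping the scorer as a parameter lets a stored model be swapped in without changing the input or output modules.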

In another preferred embodiment, the AMD-related disease features in module (a) further include: age, diabetes status, body mass index (BMI), renal injury status, atherosclerosis, alcohol consumption, and whether the subject is frequently outdoors.

In another preferred embodiment, the subject is a human.

In another preferred embodiment, the subject comprises an infant, adolescent or adult.

In another preferred embodiment, in the processing module, the risk score processing is performed as follows:

In another preferred embodiment, the feature input module comprises a sample collector.

In another preferred embodiment, the feature input module is selected from the group consisting of: a MassARRAY Analyzer 4 system typing output module and an Askme module.

In another preferred embodiment, the module for determining and processing the AMD related disease includes a processor and a memory, wherein the memory stores the risk threshold data or model of the AMD related disease based on the AMD related disease characteristics.

In another preferred embodiment, the output module includes a reporting system (e.g., an Askme reporting system).

It is understood that within the scope of the present application, the above-described technical features of the present application and the technical features specifically described below (e.g., in the examples) may be combined with each other to constitute new or preferred technical solutions. For reasons of space, these combinations are not described one by one here.

Drawings

FIG. 1 shows the technical route of the present application.

FIG. 2 shows the experimental procedure for SNP genotyping using the MassARRAY Analyzer 4 system.

FIG. 3 shows ROC curves for the logistic regression, random forest, Adaboost and Xgboost classifiers. The training and test sets were randomly split 1000 times, and the curves are the averaged test-set results; the feature variables comprise clinical information and site information (SNP+CC).

FIG. 4 shows ROC curves of the averaged test-set predictions when Xgboost learning and prediction are repeated 1000 times. "CC" means that the feature variables are clinical information only, "SNP" means that the feature variables are SNP sites only, and "SNP+CC" means that the feature variables comprise both clinical information and site information.

FIG. 5 shows the importance scores of the top 10 feature variables output by Xgboost.

FIG. 6 shows the number of variables versus the ROC-AUC score. Feature importance scores of the variables are obtained from the Xgboost model and used to re-optimize the model: feature variables are added one at a time in descending order of importance score, and the model is trained and tested at each step to find the number of variables that gives the best test ROC-AUC. As shown in the figure, the best ROC-AUC corresponds to 4 variables, i.e., the ROC-AUC score is highest when the four feature variables with the highest importance scores are used as input variables.

FIG. 7 shows the ROC curve averaged over 1000 test sets, with Xgboost as the machine learning model and age, rs754203, rs2338104 and diabetes as variables.

Detailed Description

The present inventors have conducted extensive and intensive studies and have developed, for the first time, a risk prediction algorithm model and device for age-related macular degeneration (AMD). The application uses the odds ratio (OR) values of 7 related SNPs, combines them with 7 items of clinical information, and uses a machine learning method to construct a risk prediction algorithm model and device. The application can assist clinicians in predicting AMD in advance and diagnosing it early, and is of great clinical significance for reducing the incidence of AMD and improving the treatment rate. The present application has been completed on this basis.

Terminology

rs2338104: sequence

TGAAAAAGTTCTAAAATTAGATAGT [ C/G ] GTTATGGCCTCACAACTTGTGAATA, chromosome position 12:109457363, located in the gene KCTD10

rs754203: sequence

GTGCTGTCCTGGGGCCCAGGAGCCC [ C/T ] GGGGGCAAGGCTCTGCCCTGTTGCT, chromosome position 14:99691630, located in the gene CYP46A1

rs2284664: sequence

AGAAAAATACCAGTCTCCATAGATC [ A/G/T ] TAAAGCAAATAGATGGTCTTAAAAT, chromosome position 1:196733795, located in the gene CFH

rs2071277: sequence

GGCAGTGACTGATGCAGTGTGTGAC [ A/G ] TCTAATCTCCCCCATAATTACAGGC, chromosome position 6:32203906, located in the gene NOTCH4

rs1999930: sequence

ATAGGACAGATTCTAGATTTTCCTT [ A/C/G/T ] TGATACAGAGAAATATAAGACATAA, chromosome position 6:116065971, located in the gene FRK

rs10490924: sequence

TTTATCACACTCCATGATCCCAGCT [ G/T ] CTAAAATCCACACTGAGCTCTGCTT, chromosome position 10:122454932, located in the gene ARMS2

rs5749482: sequence

TGGGAACTGACTAATACAGCATGTA [ C/G ] GAACTATGAAATATGAATTGTGTAA, chromosome position 22:32663679, located in the genes LOC105373002 and SYN3

Age-related macular degeneration (AMD)

AMD is an aging-related change in the structure of the macular area. It mainly manifests as a reduced capacity of retinal pigment epithelial (RPE) cells to phagocytose and digest the outer segment disc membranes of photoreceptor cells. As a result, residual bodies of incompletely digested disc membranes are retained in the basal cytoplasm of the cells, discharged out of the cells, and deposited on Bruch's membrane to form drusen. Secondary pathological changes then lead to macular degeneration or to rupture of Bruch's membrane, through which choroidal capillaries enter the sub-RPE and sub-neuroretinal spaces to form choroidal neovascularization. Because the walls of the new vessels are structurally abnormal, they leak and bleed, triggering a further series of secondary pathological changes. Age-related macular degeneration mostly occurs in people over 45 years of age, its prevalence increases with age, and it is currently an important cause of blindness in the elderly.

Single nucleotide polymorphism (SNP)

An SNP is a DNA sequence polymorphism at the genomic level caused by variation of a single nucleotide. SNPs are widely and densely distributed in the human genome, and their estimated total number is 3 million or more. SNPs are biallelic markers caused by single-base transitions or transversions, and also by base insertions or deletions. SNPs may lie either within gene sequences or in non-coding sequences outside genes.

Xgboost

Xgboost is a boosting-based supervised ensemble learning model composed of multiple associated CART trees. CART is a binary decision tree. At each split, every threshold of every feature column is enumerated, and the feature column and threshold that reduce impurity the most according to the Gini coefficient are selected; the samples are then divided into two branches according to feature ≤ threshold and feature > threshold, each branch containing the samples that satisfy its condition. Splitting continues in the same way until all samples under a branch belong to the same class or a preset termination condition is reached; if the classes in a final leaf node are not unique, the class of the majority of samples is taken as the class of that leaf node. Xgboost can be expressed by the following formula:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F$$

where $\hat{y}_i$ is the predicted value, $F$ denotes the set of all possible CART trees, and $f_k$ denotes a specific CART tree.

The objective function of the model is:

$$Obj(\theta) = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

where $\sum_{i} l(y_i, \hat{y}_i)$ is the loss function and $\sum_{k} \Omega(f_k)$ is the regularization term. The point at which $Obj(\theta)$ takes its minimum is the predicted value of the node, and the minimum function value is the minimum loss. Xgboost uses additive training to optimize the objective function step by step: the first tree is optimized, then the second, and so on until all K trees are optimized.
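The CART splitting rule described in this section (enumerate every threshold of every feature and keep the split with the lowest Gini impurity) can be sketched as follows. This is a minimal illustration with hypothetical toy data, not the implementation used in the application; the xgboost library itself scores candidate splits with a gradient-based gain rather than Gini impurity.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Exhaustively search for the (feature index, threshold) pair that
    minimises the size-weighted Gini impurity of the two branches.
    Returns (impurity, feature index, threshold), or None if no split exists."""
    best = None
    n = len(rows)
    for j in range(len(rows[0])):
        for t in sorted({row[j] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[j] <= t]
            right = [y for row, y in zip(rows, labels) if row[j] > t]
            if not left or not right:
                continue  # a split must send samples down both branches
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or impurity < best[0]:
                best = (impurity, j, t)
    return best


# Hypothetical toy data: column 0 is age, column 1 an OR-encoded SNP value.
rows = [[52, 1.0], [58, 1.773], [63, 1.0], [71, 1.773], [76, 3.14], [80, 1.0]]
labels = [0, 0, 0, 1, 1, 1]
print(best_split(rows, labels))
```

In this toy data the age column separates the two classes perfectly, so the search selects it over the SNP column.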

ROC-AUC

ROC-AUC is a method for evaluating model accuracy. The ROC curve is the receiver operating characteristic curve, plotted with the false positive rate on the horizontal axis and the true positive rate on the vertical axis; it is a composite index reflecting sensitivity and specificity for a continuous variable. AUC is the area under the ROC curve. The AUC normally lies between 0.5 and 1.0, and the closer it is to 1.0, the better the diagnostic performance: accuracy is low at 0.5-0.7, moderate at 0.7-0.9, and high above 0.9. AUC = 0.5 indicates that the diagnostic method is completely ineffective and has no diagnostic value. AUC < 0.5 does not fit the real situation and rarely occurs in practice.
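As a worked illustration, the AUC can be computed directly from its rank interpretation: the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half. The function names and the toy scores below are hypothetical, and the label "moderate" for the 0.7-0.9 band is an assumed reading of the middle band.

```python
def roc_auc(y_true, y_score):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def interpret(auc):
    """Map an AUC value to the accuracy bands described in the text."""
    if auc > 0.9:
        return "high"
    if auc >= 0.7:
        return "moderate"
    if auc > 0.5:
        return "low"
    return "no diagnostic value"


auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc, interpret(auc))
```

Three of the four positive-negative pairs are ranked correctly here, which gives an AUC of 0.75.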

The main advantages of the application include:

1) The application is the first in the clinical field to predict an AMD risk value from site information and clinical data, and is suitable for high-throughput sample testing;

2) The application predicts the risk of AMD at future ages and can indicate how changes in lifestyle and other behaviors affect the risk value, thereby serving to prevent and warn of AMD.

The application will be further illustrated with reference to specific examples. It is to be understood that these examples are illustrative of the present application and are not intended to limit its scope. Experimental procedures for which specific conditions are not given in the examples below generally follow routine conditions, such as those described in Sambrook et al., Molecular Cloning: A Laboratory Manual (New York: Cold Spring Harbor Laboratory Press, 1989), or those recommended by the manufacturer. Percentages and parts are by weight unless otherwise indicated.

Example 1.

From the data for 108 candidate SNP loci, the 7 AMD-related loci required by the algorithm model and device were selected by statistical analysis.

SNP statistics and clinical informatics analysis were carried out on the recruited experimental (training) group and the control group. 108 SNP loci were identified through extensive screening and are listed in Table 1. SNP typing data were obtained by the following steps:

1. Sample collection: the following two collection modes are used.

a) Blood sample collection: whole blood is collected in 2-4 mL EDTA anticoagulant tubes.

b) Oral swab collection: a nylon flocked oral swab is used to scrape the palate and the mucosa on both sides of the subject's oral cavity until the entire nylon flocked portion of the swab is wet, and the sampled swab is placed in a test tube containing sample preservation solution (1-2 mL) for storage.

2. Sample transportation: ice packs are placed in the foam box containing the samples for low-temperature transport.

3. Amplification of the DNA fragments and single-base extension were performed on a 7500 fluorescent quantitative PCR instrument. First, the dye MIX is prepared: enough for several extra wells is made up, and the prepared mix is stored at -20 °C. Second, the walls of the centrifuge tubes holding the mixes are labeled as dye method or probe method, so that the two are not confused. Then the reagents are added in sequence: MIXTURE (17 μL), primer 1 (1 μL) and sample (2 μL). Finally, the plate is sealed with film and loaded onto the instrument.

4. SNP genotyping was performed using the MassARRAY Analyzer system; the procedure is shown in FIG. 2.

5. SNP loci associated with AMD were obtained by genome-wide SNP association analysis (GWAS). The association analysis uses the following models:

1) Genotypic model: assuming A is the minor allele and a is the major allele, the 3 genotypes (AA, Aa, aa) have different effects.

2) Dominant model: AA/Aa has a different effect than aa.

3) Recessive model: AA has a different effect than Aa/aa.

4) Allelic model: A and a have different effects.

Based on the above models, a chi-square value is calculated:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where O is the observed frequency and E is the expected frequency. Taking model 2) as an example: for the AA or Aa genotype (either one) in normal subjects, the squared difference between the observed and expected frequency divided by the expected frequency gives V1; the same calculation for AA or Aa in patients gives V2; likewise, the aa genotype gives V3 in normal subjects and V4 in patients; the chi-square value is V1+V2+V3+V4. A p value for the association is obtained from the chi-square value, and 14 associated sites were selected with p < 0.05.
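The chi-square calculation for the dominant model can be sketched as below. The counts are hypothetical, and the function names are illustrative. The p-value conversion uses the closed forms of the chi-square survival function for 1 and 2 degrees of freedom; under that assumption the 1-df form appears consistent with the CHISQ/P pair of rs2284664 in Table 2, and the 2-df form with those of rs2071277 and rs754203.

```python
import math


def chi_square_2x2(case_carrier, case_non, ctrl_carrier, ctrl_non):
    """Pearson chi-square for a 2x2 table: the sum over the four cells of
    (O - E)^2 / E, with E derived from the row and column totals."""
    observed = [[case_carrier, case_non], [ctrl_carrier, ctrl_non]]
    total = case_carrier + case_non + ctrl_carrier + ctrl_non
    row_totals = [sum(row) for row in observed]
    col_totals = [case_carrier + ctrl_carrier, case_non + ctrl_non]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


def p_value(chi2, df=1):
    """Upper-tail probability of the chi-square distribution (df 1 or 2 only)."""
    if df == 1:
        return math.erfc(math.sqrt(chi2 / 2.0))
    if df == 2:
        return math.exp(-chi2 / 2.0)
    raise ValueError("only df=1 and df=2 are supported in this sketch")


# Hypothetical counts: carriers (AA or Aa) versus aa, in patients and controls.
chi2 = chi_square_2x2(30, 20, 20, 30)
print(chi2, p_value(chi2))
```

With these counts every expected cell is 25, so each of V1 to V4 equals 1 and the chi-square value is 4.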

Some of the 14 associated loci lie on the same chromosome and are collinear, so 7 highly collinear loci were eliminated using the following algorithm: a window of 50 SNP loci is defined and moved 5 SNPs at a time; for each locus, the multiple correlation coefficient R² with the other loci in the window is calculated, followed by the index 1/(1-R²); if the index is greater than 2, the SNP locus is excluded. The sites rs551397, rs800292, rs10737680, rs3753396, rs1410996, rs2284664 and rs1065489 were eliminated, leaving the sites rs2284664, rs2071277, rs1999930, rs10490924, rs2338104, rs754203 and rs5749482.
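The index 1/(1-R²) is the variance inflation factor (VIF). The following sketch shows one way the sliding-window pruning could be carried out with NumPy. It rests on several assumptions that the text does not specify: genotypes are coded as minor-allele counts (0, 1, 2), R² comes from an ordinary least-squares fit with an intercept, and within each window the locus with the largest VIF is dropped repeatedly until none exceeds the threshold. The function names and the synthetic data are hypothetical.

```python
import numpy as np


def vif(X, j):
    """Variance inflation factor 1 / (1 - R^2) of column j regressed on the others."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    design = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    if ss_tot == 0.0:
        return float("inf")  # a constant column is treated as fully redundant
    r2 = 1.0 - ss_res / ss_tot
    return float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2)


def prune(X, window=50, step=5, threshold=2.0):
    """Return the column indices kept after sliding-window VIF pruning."""
    n_snps = X.shape[1]
    removed = set()
    for start in range(0, n_snps, step):
        win = [c for c in range(start, min(start + window, n_snps))
               if c not in removed]
        while len(win) > 1:
            vifs = [vif(X[:, win], k) for k in range(len(win))]
            worst = int(np.argmax(vifs))
            if vifs[worst] <= threshold:
                break
            removed.add(win.pop(worst))
    return [c for c in range(n_snps) if c not in removed]


# Synthetic genotypes: column 2 duplicates column 0, column 1 is independent.
rng = np.random.default_rng(0)
a = rng.integers(0, 3, size=200).astype(float)
b = rng.integers(0, 3, size=200).astype(float)
X = np.column_stack([a, b, a])
kept = prune(X)
print(kept)
```

One of the two duplicated columns is removed, while the independent column is retained.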

After the above procedure, the desired sites were obtained and the information is shown in Table 2.

The SNP genotype data are cleaned and converted into corresponding values based on the OR (odds ratio) calculated for each of the 7 AMD-related loci. The odds of an event is the ratio of the probability that it occurs to the probability that it does not occur, and the OR is the ratio of the odds in the case group to the odds in the control group. The formula is as follows:

OR＝(nA/na)/(mA/ma)＝(nA×ma)/(mA×na)

where A is the minor allele, nA is the number of A alleles in the case group, na is the number of non-A alleles in the case group, mA is the number of A alleles in the control group, and ma is the number of non-A alleles in the control group. The OR is interpreted as follows:

a) OR > 1, indicates that the frequency of A in the case group is greater than that in the non-case group, i.e., A has a higher risk of developing disease.

b) OR <1 indicates that A is less frequent in the case group than in the non-case group, i.e., A has a protective effect.

c) The more closely the disease is associated with the A allele, the greater the value of the ratio.
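The OR formula can be checked numerically with a short sketch. The function names and the allele counts are hypothetical. The frequency-based variant uses the F_A and F_U columns of Table 2, and the confidence interval assumes that the SE column is the standard error of ln(OR), an assumption that reproduces the L95 and U95 values listed for rs2338104.

```python
import math


def odds_ratio(n_A, n_a, m_A, m_a):
    """OR = (nA / na) / (mA / ma) = (nA * ma) / (mA * na), from allele counts."""
    return (n_A * m_a) / (m_A * n_a)


def odds_ratio_from_freq(f_case, f_ctrl):
    """OR from the minor-allele frequency in cases (F_A) and controls (F_U)."""
    return (f_case / (1.0 - f_case)) / (f_ctrl / (1.0 - f_ctrl))


def confidence_interval(or_value, se_log_or, z=1.96):
    """95% confidence interval, assuming se_log_or is the standard error of ln(OR)."""
    log_or = math.log(or_value)
    return (math.exp(log_or - z * se_log_or), math.exp(log_or + z * se_log_or))


# Hypothetical allele counts: 40 A / 60 a in cases, 25 A / 75 a in controls.
print(odds_ratio(40, 60, 25, 75))

# rs2338104 from Table 2: F_A = 0.4206, F_U = 0.2905, SE = 0.2359.
or_rs2338104 = odds_ratio_from_freq(0.4206, 0.2905)
print(or_rs2338104, confidence_interval(1.773, 0.2359))
```

An OR above 1, as for rs2338104, marks the minor allele as a risk allele; an OR below 1, as for rs754203, marks it as protective.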

TABLE 1 numbering of initially selected SNP loci

(unified numbering of dbSNP of the National Center for Biotechnology Information (NCBI) database)

TABLE 2 genome-wide SNP correlation analysis (GWAS) technique to obtain AMD-related SNP site information

CHR	SNP	A1	F_A	F_U	A2	CHISQ	P	OR	SE	L95	U95
1	rs2284664	T	0.2687	0.3762	C	4.25	0.03924	0.6091	0.2414	0.3795	0.9777
6	rs2071277	C	0.3672	0.4471	T	7.591	0.022470	0.7175	0.2304	0.4568	1.127
6	rs1999930	T	0.03676	0.004587	C	5.204	0.02253	8.282	1.101	0.9571	71.67
10	rs10490924	T	0.5397	0.4231	G	4.286	0.03842	1.599	0.2273	1.024	2.496
12	rs2338104	G	0.4206	0.2905	C	5.951	0.01471	1.773	0.2359	1.117	2.816
14	rs754203	G	0.2868	0.3773	A	7.925	0.019020	0.6636	0.2352	0.4186	1.052
22	rs5749482	G	0.2353	0.3636	C	6.42	0.01128	0.5385	0.246	0.3325	0.872

The first column (CHR) is the chromosome of the locus; the second column (SNP) is the SNP number; the third column (A1) is the minor allele; the fourth column (F_A) is the frequency of the A1 allele observed in patients; the fifth column (F_U) is the frequency of the A1 allele observed in healthy people; the sixth column (A2) is the other allele, i.e. the major allele; the seventh column (CHISQ) is the chi-square value; the eighth column (P) is the p value derived from the chi-square value; the ninth column (OR) is the odds ratio; and the tenth, eleventh and twelfth columns (SE, L95, U95) are the standard error of the OR and the lower and upper bounds of its 95% confidence interval.

Each genotype is subsequently replaced by a value based on the OR of the minor allele. For example, assuming A is the minor allele and a is the major allele, a genotype with one minor allele (Aa) is replaced by the OR value, a genotype with two minor alleles (AA) is replaced by the square of the OR value, and a genotype without the minor allele (aa) is replaced by 1.
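This encoding amounts to raising the OR to the number of minor-allele copies in the genotype. A minimal sketch follows; the function name `encode_genotype` and the dictionary layout are hypothetical, while the A1 alleles and OR values are taken from Table 2.

```python
# (minor allele A1, OR) per SNP, copied from Table 2.
OR_TABLE = {
    "rs2284664": ("T", 0.6091),
    "rs2071277": ("C", 0.7175),
    "rs1999930": ("T", 8.282),
    "rs10490924": ("T", 1.599),
    "rs2338104": ("G", 1.773),
    "rs754203": ("G", 0.6636),
    "rs5749482": ("G", 0.5385),
}


def encode_genotype(snp: str, genotype: str) -> float:
    """Replace a two-letter genotype by OR ** (number of minor-allele copies):
    no copy gives 1, one copy gives OR, two copies give OR squared."""
    if len(genotype) != 2:
        raise ValueError("genotype must consist of two alleles, e.g. 'CG'")
    minor, or_value = OR_TABLE[snp]
    copies = genotype.upper().count(minor)
    return or_value ** copies


for g in ("CC", "CG", "GG"):
    print(g, encode_genotype("rs2338104", g))
```

The encoded values can then be placed alongside the clinical variables as input features for the classifier.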

Example 2.

Thirteen items of clinical survey data were obtained from the questionnaires completed by the subjects: age, body mass index (BMI), hypertension, hyperlipidemia, diabetes, renal injury, whether the subject is frequently outdoors, whether the subject is vegetarian, whether the subject has never smoked, whether the subject has never drunk alcohol, atherosclerosis, eye surgery, and sex.

Example 3.

Machine learning algorithms can be divided into three categories: supervised learning, unsupervised learning, and semi-supervised learning. Supervised learning generates a function from the correspondence between a portion of the input data and the output data, mapping inputs to appropriate outputs, as in classification. The sample data of the application are clinically diagnosed and carry class labels, so supervised machine learning classification models were explored and selected. The data of all samples with SNP site information only (SNP), the data of all samples with clinical information only (CC), and the combined data of SNP sites and clinical information (SNP+CC) are used as input data, and the diagnosis of each sample is used as the output class label.

The algorithm construction is carried out according to the following steps:

a) All data were randomly split into 75% training set and 25% test set.

b) A machine learning classifier is constructed. Using snp+cc as input data, logistic regression, random forest, adaboost, and Xgboost were tried successively.

c) Cross-verifying the tuning parameters and selecting the parameters with the best scores.

d) Results were verified with the test set.

e) And (5) evaluating a model. The above procedure was repeated 1000 times and the area under the curve (ROC-AUC) of the mean subject curve of the test set was calculated. Xgboost with the highest ROC-AUC score was chosen as the best model (see FIG. 3).

f) And (5) screening characteristic variables. The clinical information (CC), the site information (SNP) and the site information (SNP+CC) are respectively taken as input data, classified by Xgboost, repeated 1000 times, and the average subject curve of the test set is shown in FIG. 4, so that the ROC-AUC of the SNP+CC is the highest.

g) Feature screening is further optimized. The Xgboost model obtains a Feature-importance (importance-importance) score of the variable Feature (see, for example, fig. 5 for the first 10), optimizes the screening model again according to the score, increases the number of variables from large to small, trains and tests the model one by one, and thus obtains a relationship graph of the number of variables and the ROC-AUC score (see, fig. 6). The results showed that the data input for the 4 most important variables (age, rs754203, rs2338104, diabetes) trained and tested the model, which gave the highest ROC-AUC score.

h) XGBoost was used as the machine learning model, with age, rs754203, rs2338104, and diabetes as input variables, yielding a mean ROC-AUC of 0.800 ± 0.06 over 1000 repetitions.

i) The model is saved for AMD risk prediction on subsequently measured data.

j) Risk value output: the trained algorithm model predicts, for the input test data, a probability between 0 (control) and 1 (AMD). The predicted probability of class 1 (AMD) is taken as the risk value, and a risk value exceeding 0.5 is judged as AMD.
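The evaluation protocol of steps a), d), e), g), h) and j) can be sketched as follows. This is a dependency-free illustration under stated assumptions, not the implementation of the application: a small gradient-descent logistic regression stands in for XGBoost, cross-validated tuning (step c) is omitted, the feature ranking passed to `sweep_top_k` is assumed to be given rather than computed from importance scores, the data are synthetic, and every function name is hypothetical.

```python
import math
import random
import statistics

def predict_risk(w, row):
    """Probability of class 1 (AMD) under a logistic model; the last weight is the bias."""
    z = sum(wj * v for wj, v in zip(w, row + [1.0]))
    z = max(-30.0, min(30.0, z))  # keep exp() in a safe range
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, epochs=50, lr=0.1):
    """Stochastic gradient descent logistic regression (stand-in for XGBoost)."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for row, label in zip(X, y):
            err = predict_risk(w, row) - label
            for j, v in enumerate(row + [1.0]):
                w[j] -= lr * err * v
    return w

def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (case, control) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def split_75_25(X, y, rng):
    """Step a: random 75% / 25% train/test split."""
    idx = list(range(len(X)))
    rng.shuffle(idx)
    cut = int(0.75 * len(idx))
    train, test = idx[:cut], idx[cut:]
    return ([X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test], [y[i] for i in test])

def repeated_auc(X, y, repeats, seed=0):
    """Steps d, e, h: mean and standard deviation of test-set ROC-AUC over repeated splits."""
    rng = random.Random(seed)
    aucs = []
    for _ in range(repeats):
        X_tr, y_tr, X_te, y_te = split_75_25(X, y, rng)
        if len(set(y_te)) < 2:  # ROC-AUC is undefined without both classes
            continue
        w = fit_logistic(X_tr, y_tr)
        aucs.append(roc_auc(y_te, [predict_risk(w, r) for r in X_te]))
    return statistics.mean(aucs), statistics.stdev(aucs)

def sweep_top_k(X, y, ranked_columns, repeats):
    """Step g: mean ROC-AUC using the k highest-ranked columns, for k = 1..n."""
    results = {}
    for k in range(1, len(ranked_columns) + 1):
        X_k = [[row[c] for c in ranked_columns[:k]] for row in X]
        results[k] = repeated_auc(X_k, y, repeats)[0]
    return results

def classify(risk, threshold=0.5):
    """Step j: a risk value above the threshold is judged as AMD."""
    return "AMD" if risk > threshold else "control"

# Synthetic data (made up): two standardized features shifted by the label.
rng = random.Random(42)
y = [rng.randint(0, 1) for _ in range(200)]
X = [[rng.gauss(label, 1.0), rng.gauss(0.5 * label, 1.0)] for label in y]

mean_auc, sd_auc = repeated_auc(X, y, repeats=20)  # the application repeats 1000 times
print(f"ROC-AUC: {mean_auc:.3f} +/- {sd_auc:.3f}")
print(sweep_top_k(X, y, ranked_columns=[0, 1], repeats=10))

model = fit_logistic(X, y)  # step i would persist this model for later use
risk = predict_risk(model, [1.5, 0.8])
print(f"risk = {risk:.2f} -> {classify(risk)}")
```

Reporting the mean and standard deviation over repeated random splits, rather than a single split, reduces the dependence of the score on one particular partition of a small clinical sample.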

All documents mentioned in this disclosure are incorporated by reference in this disclosure as if each were individually incorporated by reference. Further, it will be appreciated that various changes and modifications may be made by those skilled in the art after reading the above teachings, and such equivalents are intended to fall within the scope of the application as defined in the appended claims.

Sequence listing

<110> Shanghai Biotecan Biology Medicine Technology Co., Ltd.

SHANGHAI BIOTECAN MEDICAL DIAGNOSTICS Co.,Ltd.

<120> A Risk prediction Algorithm model and apparatus for age-related macular degeneration

<130> P2018-2112

<160> 7

<170> SIPOSequenceListing 1.0

<210> 1

<211> 52

<212> DNA

<213> Artificial sequence (Artificial Sequence)

<400> 1

tgaaaaagtt ctaaaattag atagtcggtt atggcctcac aacttgtgaa ta 52

<210> 2

<211> 52

<212> DNA

<213> Artificial sequence (Artificial Sequence)

<400> 2

gtgctgtcct ggggcccagg agcccctggg ggcaaggctc tgccctgttg ct 52

<210> 3

<211> 53

<212> DNA

<213> Artificial sequence (Artificial Sequence)

<400> 3

agaaaaatac cagtctccat agatcagtta aagcaaatag atggtcttaa aat 53

<210> 4

<211> 52

<212> DNA

<213> Artificial sequence (Artificial Sequence)

<400> 4

ggcagtgact gatgcagtgt gtgacagtct aatctccccc ataattacag gc 52

<210> 5

<211> 54

<212> DNA

<213> Artificial sequence (Artificial Sequence)

<400> 5

ataggacaga ttctagattt tccttacgtt gatacagaga aatataagac ataa 54

<210> 6

<211> 52

<212> DNA

<213> Artificial sequence (Artificial Sequence)

<400> 6

tttatcacac tccatgatcc cagctgtcta aaatccacac tgagctctgc tt 52

<210> 7

<211> 52

<212> DNA

<213> Artificial sequence (Artificial Sequence)

<400> 7

tgggaactga ctaatacagc atgtacggaa ctatgaaata tgaattgtgt aa 52

Claims

1. A set of biomarkers, wherein said set comprises 5 biomarkers selected from the group consisting of: rs2338104, rs754203, rs5749482, rs2284664 and rs10490924, wherein,

rs2338104, chromosome position 12:109457363, the C at this position is mutated to G;

rs754203, chromosome position 14:99691630, the A at this position is mutated to G;

rs5749482, chromosome position 22:32663679, the C at this position is mutated to G;

rs2284664, chromosome position 1:196733795, the C at this position is mutated to T;

rs10490924, chromosome position 10:12245932, the G at this position is mutated to T;

the collection further comprises 2 biomarkers selected from the group consisting of: rs2071277 and rs1999930;

wherein, rs2071277, chromosome position 6:32203906, the T at this position is mutated to C;

rs1999930, chromosome position 6:116065971, the C at this position is mutated to T.

2. A combination of reagents for use in the assessment or diagnosis of risk of developing age-related macular degeneration (AMD), comprising reagents for detecting each biomarker in the collection of claim 1.

3. A kit comprising the collection of claim 1 and/or the combination of reagents of claim 2.

4. Use of a set of biomarkers for the preparation of a kit for the assessment or diagnosis of risk of age-related macular degeneration (AMD), wherein the set of biomarkers comprises 5 biomarkers selected from the group consisting of: rs2338104, rs754203, rs5749482, rs2284664 and rs10490924;

wherein,

rs754203, chromosome position 14:99691630, the A at this position is mutated to G;

rs2284664, chromosome position 1:196733795, the C at this position is mutated to T;

rs10490924, chromosome position 10:12245932, the G at this position is mutated to T;

the biomarker panel further comprises 2 biomarkers selected from the group consisting of: rs2071277 and rs1999930;

5. An early-stage auxiliary screening system for age-related macular degeneration (AMD), the system comprising:

wherein said AMD associated disease trait comprises 5 locus information selected from the group consisting of: rs2338104, rs754203, rs5749482, rs2284664 and rs10490924;

(c) The auxiliary screening result output module is used for outputting the auxiliary screening result;

wherein,

rs754203, chromosome position 14:99691630, the A at this position is mutated to G;

rs2284664, chromosome position 1:196733795, the C at this position is mutated to T;

rs10490924, chromosome position 10:12245932, the G at this position is mutated to T;

and said AMD related disease trait further comprises 2 locus information selected from the group consisting of: rs2071277 and rs1999930;