CN115762635A

CN115762635A - System for screening and risk prediction of liver cancer based on molecular marker and application

Info

Publication number: CN115762635A
Application number: CN202211250017.3A
Authority: CN
Inventors: 付小斌
Original assignee: First Affiliated Hospital Of Hebei North University
Current assignee: First Affiliated Hospital Of Hebei North University
Priority date: 2021-12-27
Filing date: 2021-12-27
Publication date: 2023-03-07
Also published as: CN114283883B; CN114283883A; CN115719613A

Abstract

The application discloses a method for establishing a liver cancer tumor screening model based on molecular markers, which comprises the steps of obtaining an SNP data set associated with liver cancer; based on the data set, screening to obtain SNPs for modeling as model variables; calculating relative risk values of different genotypes of each model variable; and obtaining the tumor screening model based on the relative risk value. The application discloses a liver cancer screening, risk prediction and/or diagnosis method based on a tumor screening model constructed by the method, and a related kit, system, device, computer readable storage medium and equipment, so that early accurate screening of liver cancer of each stage is realized.

Description

System for screening and risk prediction of liver cancer based on molecular marker and application

Technical Field

The application relates to the technical field of gene detection, in particular to a system for screening and/or predicting risk of liver cancer based on molecular markers.

Background

Primary prevention of liver cancer is etiological prevention. The traditional etiological factors of liver cancer mainly include viral infection (hepatitis B virus and hepatitis C virus), exposure to aflatoxin and microcystin, smoking, alcohol abuse and other unhealthy living habits. Non-alcoholic fatty liver disease has been found to have become the main cause of liver cancer in developed countries. In addition, studies show that diabetes is an independent risk factor for liver cancer, and that the prevalence of liver cancer in the high-BMI population is 5 times higher than that in the normal population. In response to the traditional causes, the country has taken corresponding measures: free hepatitis B vaccination of all newborns was introduced in 2005; chronic hepatitis B can be effectively controlled with standard treatment; and HCV can be completely eliminated by antiviral treatment of chronic hepatitis C.

Tertiary prevention of liver cancer mainly refers to the improvement of clinical treatment methods and the research and development of novel drugs. The liver cancer diagnosis method commonly used in clinical practice at present combines clinical symptoms with imaging examination, including ultrasound, X-ray computed tomography (CT), magnetic resonance imaging (MRI), digital subtraction angiography (DSA) and nuclear medicine imaging (PET-CT, SPECT-CT), with liver puncture biopsy performed if necessary. These clinical diagnostic methods are of low sensitivity, costly or invasive, and can hardly meet the need for early screening of populations. The international health organization recommends that patients with liver cirrhosis undergo B-mode ultrasound examination and AFP testing twice a year so as to achieve early diagnosis of liver cancer. However, current research shows that the sensitivity of AFP is low: AFP levels are not elevated in about 40% of liver cancer patients, especially in patients with early liver cancer, and the European Association for the Study of the Liver (EASL) no longer recommends AFP as a diagnostic index for liver cancer [19-21]. Ultrasound, as an imaging diagnosis method, depends strongly on the technical level of the physician; in addition, a tumor can be detected by ultrasound only after it has grown to a certain size. Its sensitivity for diagnosing early liver cancer, particularly small liver cancer, is therefore low. An objective liver cancer screening method that is noninvasive, easy to popularize and suitable for early diagnosis is therefore urgently needed.

Disclosure of Invention

In view of the above, the present application aims to provide at least one improved liver cancer screening model, so as to realize a noninvasive, easily popularized and suitable early diagnosis method for liver cancer.

In a first aspect, the embodiment of the present application discloses a method for establishing a liver cancer tumor screening model based on molecular markers, comprising:

obtaining an SNP dataset associated with liver cancer;

based on the data set, screening to obtain SNPs for modeling as model variables;

calculating relative risk values of different genotypes of each model variable; and

obtaining the tumor screening model based on the relative risk value.

In the embodiment of the present application, the step of screening SNPs for modeling specifically includes: according to the linkage disequilibrium analysis result of each SNP on different chromosomes, selecting the SNPs whose spacing is within 50 Mb and whose r² > 0.9 in consecutive analysis as the SNPs for model construction.
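The screening rule can be sketched in Python as follows. This is only an illustration under stated assumptions: the `SnpPair` record, the function name `select_model_snps` and the example values are hypothetical, and "consecutive analysis" is interpreted here simply as keeping every SNP that belongs to at least one pair satisfying both thresholds.

```python
from dataclasses import dataclass

MAX_SPACING_BP = 50_000_000  # 50 Mb spacing limit taken from the text
MIN_R2 = 0.9                 # r² threshold taken from the text


@dataclass(frozen=True)
class SnpPair:
    """Pairwise LD result for two SNPs on one chromosome (hypothetical record)."""
    snp_a: str
    snp_b: str
    distance_bp: int
    r2: float


def select_model_snps(pairs):
    """Return the sorted names of SNPs found in at least one qualifying pair."""
    selected = set()
    for pair in pairs:
        if pair.distance_bp <= MAX_SPACING_BP and pair.r2 > MIN_R2:
            selected.update((pair.snp_a, pair.snp_b))
    return sorted(selected)


# Made-up example values, not data from the application.
example_pairs = [
    SnpPair("snp_1", "snp_2", 12_000_000, 0.95),  # passes both criteria
    SnpPair("snp_2", "snp_3", 80_000_000, 0.97),  # spacing too large
    SnpPair("snp_3", "snp_4", 5_000_000, 0.40),   # r² too low
]
print(select_model_snps(example_pairs))
```

Keeping the thresholds as named constants makes it easy to test other cut-offs without touching the selection logic.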

In the embodiment of the present application, after the step of screening SNPs for modeling, the method further includes:

obtaining, according to the individual effect of each SNP in the data set, the individual effect value and the phenotypic parameter of each SNP locus for the occurrence of liver cancer; wherein the individual effect value is the statistical probability, in the data set, of having liver cancer for a single SNP; and the phenotypic parameter is the statistical frequency, in the data set, with which individuals in whom the single SNP is dominantly inherited suffer from liver cancer;

calculating the individual effect value of each single individual by Logistic regression analysis, and correcting and weighting it to obtain the genetic score; and

calculating, from the individual effect value, the phenotypic parameter and the genetic score, the SNPs-based liver cancer weighted risk screening score of each individual, and judging the cancer risk of each individual according to the liver cancer weighted risk screening score.

In the present embodiment, the SNPs screened for modeling include at least one of TAGA rs15945924, FBXW rs11744825, RANBP1 rs17033807, GNA rs5741536, TY rs8896114, TGM rs239809, DUOX rs4539964, RE rs4362209, AT rs10819989, ATP7 rs5251533, TY rs4579862.

In the examples of the present application, the SNPs screened for modeling included TAGA rs15945924, FBXW rs11744825, RANBP1 rs17033807, GNA rs5741536, TY rs8896114, TGM rs239809, DUOX rs4539964, RE rs4362209, AT rs10819989, ATP7 rs5251533, TY rs4579862.

In the present embodiment, the SNPs screened for modeling include FBXW rs11744825, GNA rs5741536, TY rs8896114, TGM rs239809, DUOX rs4539964, RE rs4362209, ATP7 rs5251533, TY rs4579862.

In the examples of the present application, SNPs screened for modeling included TAGA rs15945924, FBXW rs11744825, GNA rs5741536, TY rs8896114, TGM rs239809, DUOX rs4539964, RE rs4362209, AT rs10819989, ATP7 rs5251533, MUTY rs4579862.

In a second aspect, the present application discloses a method for screening, risk prediction and/or diagnosis of liver cancer, the method comprising the step of using a screening model of liver cancer tumor, wherein the screening model of liver cancer tumor is constructed by the construction method of the first aspect.

In a third aspect, the present application discloses a kit for screening, risk prediction and/or diagnosis of liver cancer, the kit comprising a reagent for detecting genotyping in a tumor screening model constructed according to the construction method of the first aspect.

In a fourth aspect, embodiments of the present application disclose a system or device for liver cancer screening, risk prediction and/or diagnosis, the system or device comprising:

the acquisition module is used for acquiring an SNP data set associated with the liver cancer;

the screening module is used for screening SNPs used for modeling based on the data set to serve as model variables;

the calculation module is used for calculating the relative risk value of different genotypes of each model variable;

the construction module is used for constructing and obtaining the liver cancer tumor screening model based on the relative risk value; and

and the data analysis module is used for inputting the relative risk value of the SNP used for modeling of the individual to be tested into the liver cancer tumor screening model constructed according to the construction method of the first aspect so as to obtain a prediction result.

Compared with the prior art, the application has at least the following beneficial effects:

the liver cancer tumor screening model provided by the application not only provides a tumor screening model construction method capable of obtaining a more accurate prediction result, but also provides a new tumor prediction index combination, and the prediction effect superior to that of the prior art is realized. In addition, the method for screening the liver cancer tumor by using the model provided by the application does not depend on the progress degree of the tumor, has no obvious difference in the prediction effect of tumor patients in different stages, is applicable to each stage of the tumor, and can solve the problem that the early and extremely early tumors are difficult to screen.

Drawings

FIG. 1 is a graph of the r2 distribution of linkage disequilibrium analysis of 30 SNPs on chromosome 6 provided in the examples of the present application.

FIG. 2 is a r2 distribution diagram of linkage disequilibrium analysis of 30 SNPs on chromosome 12 provided in the examples of the present application.

FIG. 3 is a graph of the r2 distribution of linkage disequilibrium analysis of 30 SNPs on chromosome 5 provided in the examples of the present application.

FIG. 4 is a graph of the r2 distribution of linkage disequilibrium analysis of 30 SNPs on chromosome 10 provided in the examples of the present application.

FIG. 5 is a graph of the r2 distribution of linkage disequilibrium analysis of 30 SNPs on chromosome 19 provided in the examples of the present application.

FIG. 6 is a graph of the r2 distribution of linkage disequilibrium analysis of 30 SNPs on chromosome 20 provided in the examples of the present application.

FIG. 7 is a graph of the r2 distribution of linkage disequilibrium analysis of 30 SNPs on chromosome 11 provided in the examples of the present application.

FIG. 8 is a r2 distribution diagram of linkage disequilibrium analysis of 30 SNPs on chromosome 1 provided in the examples of the present application.

FIG. 9 is a graph of the r2 distribution of linkage disequilibrium analysis of 30 SNPs on chromosome 7 provided in the examples of the present application.

FIG. 10 is a graph of r2 distribution from linkage disequilibrium analysis of 30 SNPs on chromosome 15 provided in the examples of the present application.

FIG. 11 is a r2 distribution diagram of linkage disequilibrium analysis of 30 SNPs on chromosome X provided in the examples of the present application.

FIG. 12 is a graph of the r2 distribution of linkage disequilibrium analysis of 30 SNPs on chromosome 9 provided in the examples of the present application.

FIG. 13 is a graph of r2 distribution from linkage disequilibrium analysis of 30 SNPs on chromosome 17 provided in the examples of the present application.

FIG. 14 is a graph of r2 distribution from linkage disequilibrium analysis of 30 SNPs on chromosome 13 provided in the examples of the present application.

FIG. 15 is a graph of the r2 distribution of linkage disequilibrium analysis of 30 SNPs on chromosome 12 as provided in the examples of the present application.

FIG. 16 is a graph of the r2 distribution of linkage disequilibrium analysis of 30 SNPs on chromosome 18 as provided in the examples of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more clearly understood, the present application is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. Reagents not individually described in detail in this application are conventional and commercially available; methods not specifically described in detail are all routine experimental methods and are known from the prior art.

Liver cancer gene data acquisition

1. Data retrieval

(1) Database for search

Literature searches were performed using the NCBI, GenBank, PubMed, EMBASE, Cochrane Library, Web of Science, CNKI (Chinese), Wanfang (Chinese), VIP (Chinese) and CBM (Chinese) databases. To further avoid omitting potential risk factors, the present application also examined relevant bibliographies, such as Novel markers in viral live disease and platelet cancer (author: Rode, Anthony Philip; publication date: 2013).

(2) Retrieval strategy

The combination of the searched Chinese key words is as follows: "risk factors", "cirrhosis", "liver cancer" or "liver", "cancer" or "tumor"; "polymorphism", "single nucleotide polymorphism" and "genetic variation"; "fusion gene", "clinical index" and "gene molecular marker".

The English keywords for retrieval are combined as follows: "risk factor" and "liver cancer", "hazard" and "liver cancer", "pharmaceutical disease", "pharmaceutical", "tumor or cancer", "Neoplasms", "carcinoma", "tumor", "polymorphism", "single nucleotide polymorphism", "SNP", "fused gene", "Clinical index", "Clinical parameters", "variant", "variation", "China", "Chinese", "assay".

(3) Data inclusion criteria

Studies taking humans as the research object and analyzing the relationship between genetic factors and susceptibility to liver cancer by means of the odds ratio (OR);

exclusion criteria: meta analysis or review, data studies published in summary form only, sample size of control or case groups less than 10 and Minor Allele Frequency (MAF) of control groups less than 1%.
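As an illustration of how the exclusion criteria could be applied to a list of candidate studies, a minimal Python sketch follows. The `Study` record, its field names and the example entries are assumptions introduced here; only the thresholds (fewer than 10 cases or controls, control MAF below 1%) come from the text.

```python
from dataclasses import dataclass


@dataclass
class Study:
    """Summary of one candidate publication (hypothetical fields)."""
    title: str
    is_meta_or_review: bool
    abstract_only: bool
    n_cases: int
    n_controls: int
    control_maf: float  # minor allele frequency in controls, as a fraction


def is_excluded(study):
    """Return True if the study meets any exclusion criterion from the text."""
    return (
        study.is_meta_or_review
        or study.abstract_only
        or study.n_cases < 10
        or study.n_controls < 10
        or study.control_maf < 0.01
    )


# Made-up example entries, not studies from the application.
studies = [
    Study("case-control A", False, False, 250, 300, 0.12),
    Study("review B", True, False, 0, 0, 0.0),
    Study("small cohort C", False, False, 8, 40, 0.20),
    Study("rare variant D", False, False, 120, 150, 0.005),
]
included = [s.title for s in studies if not is_excluded(s)]
print(included)
```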

(4) Document prescreening

The literature entries retrieved from the online public databases were imported into NoteExpress and the bibliography was stored; duplicate entries across databases were removed using the duplicate-checking function. Literature not meeting the inclusion criteria was then excluded on the basis of title, abstract and keywords, in combination with the full text.

2. Document data extraction

Data were extracted from the literature meeting the inclusion criteria. The main information extracted comprised: sample size, study area, sample origin, gross type of liver cancer (massive, nodular and diffuse), histological type of liver cancer (hepatocellular, cholangiocellular and mixed), study design, and genetic factors (genotype, allele, molecular marker, etc.).

3. Document data quality assessment

Quality assessment of the included literature data was performed using the Venice criteria (Ioannidis JP, Boffetta P, Little J, et al. Assessment of cumulative evidence on genetic associations: interim guidelines [J]. Int J Epidemiol, 2008, 37(1): 120-132.). The confidence level of the resulting literature data is determined by three parameters, namely the number of documents, heterogeneity and bias, each graded A = strong, B = moderate or C = weak, and the combined grades are classified as: (1) AAA: strong evidence; (2) AAB, ABA, ABB, BAA, BBA, BBB, BAB: intermediate evidence; (3) all remaining combinations: evidence of low confidence.
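The grading rule described above can be expressed compactly. The following Python sketch assumes the three grades are supplied as a three-letter string in the order given in the text; the function name `venice_evidence_level` is hypothetical.

```python
def venice_evidence_level(grades):
    """Map a three-letter grade string such as "ABA" to an evidence level."""
    grades = grades.upper()
    if len(grades) != 3 or any(g not in "ABC" for g in grades):
        raise ValueError("expected three letters drawn from A, B and C")
    if grades == "AAA":
        return "strong"
    if "C" not in grades:
        # Every A/B combination other than AAA is listed as intermediate.
        return "intermediate"
    return "weak"


for example in ("AAA", "BAB", "ABC"):
    print(example, venice_evidence_level(example))
```

The second branch relies on the fact that the seven intermediate combinations listed are exactly the A/B combinations other than AAA.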

In addition to the Venice criteria, significant association results were evaluated by calculating the false positive report probability (FPRP) (Assessing the probability that a positive report is false: an approach for molecular epidemiology studies [J]. J Natl Cancer Inst, 2004, 96(6): 434-442.), so as to avoid spurious associations in which the result appears strong and the sample size is large but the calculated false positive probability is still high.

4. Statistical analysis

The association between genetic factors, non-genetic factors and the risk of liver cancer was evaluated by pooled analysis using Review Manager 5.3.5 (Cochrane Collaboration, Oxford, UK). Genetic and non-genetic factors are treated as different variables, and a pooled analysis is performed for a variable if three or more independent data sets are available for it. For a genetic-factor variable, an allele model can be calculated using the genetic models of genome-wide association analysis (GWAS), so as to discover and validate the effect of the variable across the available independent data sets.

Finally, for each available independent data set included, observed and expected values were analyzed and normality was assessed with Q-Q (quantile-quantile) plots in SPSS 21.0 software. A general trend distribution analysis was performed using Visual Studio 2013 to observe the pooled OR values and the cumulative frequency distribution of the possible combinations of all variables.

The relative risk (RR) for a given exposure is obtained by dividing the risk in the part of the population exposed to the risk factor by the risk in the unexposed part. If the incidence is very low (a rare disease, generally less than 1/10000), RR is approximately equal to OR (RR ≈ OR), and the OR is used in place of the RR estimate for the Meta analysis.

The pathogenic effect of each risk factor is evaluated using the attributable risk percent (ARP) and the population attributable risk percent (PARP) as indexes.

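$$\mathrm{ARP} = \left|\frac{\mathrm{OR}-1}{\mathrm{OR}}\right| \times 100\%; \qquad \mathrm{PARP} = \left|\frac{P_e(\mathrm{OR}-1)}{P_e(\mathrm{OR}-1)+1}\right| \times 100\%$$

In the formula, OR is the odds ratio and $P_e$ is the risk allele (factor) frequency in the control group or population.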
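A short Python sketch of these two indexes follows. The function and parameter names are hypothetical: the odds ratio appears as `odds_ratio` and the control-group or population risk allele (factor) frequency as `p_exposed`; the example inputs are made up for illustration.

```python
def attributable_risk_percent(odds_ratio):
    """ARP = |(OR - 1) / OR| x 100%."""
    if odds_ratio <= 0:
        raise ValueError("odds ratio must be positive")
    return abs((odds_ratio - 1.0) / odds_ratio) * 100.0


def population_attributable_risk_percent(odds_ratio, p_exposed):
    """PARP = |P_e (OR - 1) / [P_e (OR - 1) + 1]| x 100%."""
    excess = p_exposed * (odds_ratio - 1.0)
    return abs(excess / (excess + 1.0)) * 100.0


# Made-up inputs: OR = 2 and a risk allele frequency of 25% in controls.
print(attributable_risk_percent(2.0))
print(population_attributable_risk_percent(2.0, 0.25))
```

With an odds ratio of 1 both indexes are zero, which is a convenient sanity check when applying the formulas to pooled estimates.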

The average risk of a single SNP in the population, i.e., the genetic score, is calculated from the genotype frequencies of genetic variation in the human genome haplotype map (HapMap) and the pooled OR: genetic score $= (1-P)^2\,\mathrm{OR}^2 + 2P(1-P)\,\mathrm{OR} + P^2$, where $P$ is the risk allele frequency.
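The genetic score expression can be evaluated directly. The sketch below follows the expression exactly as written in the text, (1 − P)²·OR² + 2P(1 − P)·OR + P², and does not attempt to reinterpret it; the function name and the example inputs are assumptions.

```python
def genetic_score(odds_ratio, risk_allele_freq):
    """Average population risk of one SNP, using the expression given in the text."""
    p = risk_allele_freq
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    return (1.0 - p) ** 2 * odds_ratio ** 2 + 2.0 * p * (1.0 - p) * odds_ratio + p ** 2


# Made-up inputs for illustration only.
print(genetic_score(2.0, 0.5))
```

When the odds ratio equals 1 the three terms reduce to the Hardy-Weinberg genotype frequencies, so the score is 1 for any allele frequency.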

Q-Q plots are used to determine whether two data sets come from populations with a common distribution. All P values were two-sided, with P < 0.05 considered statistically significant. Regression analysis, sensitivity analysis and publication bias analysis were performed using STATA 13.1 (StataCorp, College Station, TX, USA).

2. Results

Table 1 shows the statistics of genetic factors and liver cancer risk and evidence rank analysis. Tables 2 and 3 show the statistical results of the correlation analysis of the SNPs of each gene and the risk of liver cancer. As can be seen from Table 2, of the 30 SNPs associated with liver cancer onset, 3 SNPs (LT. RTM. Rs5246916, ZNF 35. RTM. Rs5246916, ARFGAP. RTM. Rs 4718842) were rated as high quality, 27 SNPs (TAGA. RTM. Rs15945924, FBXW. RTM. Rs11744825, HAPLN. RTM. Rs8294854, RHOBTB. RTM. Rs6267063, NLRP. RTM. Rs5545282, MSH. RTM. Rs7995235, RANBP 1. RTM. RS17033807, GNA. RTM. 17033807, TY. RTM. Rs58 zxft 58, CSMD 6258, CASS RS17033807, NRD RS17033807, TFR RS17033807, TGM RS17033807, DUOX RS17033807, A RS17033807, AGPAT RS17033807, TSC RS17033807, BRCA RS17033807, RE RS17033807, AT RS17033807, BRCA RS17033807, CDK RS17033807, PINB SERXFT 6258 zxft 58, ATP7 zxft 6258, POL RS17033807, CDK RS17033807, and the like. The 30 SNPs are named sequentially in Table 2, see column 1 brackets.

TABLE 1 analysis of SNPs and risk and evidence of liver cancer

TABLE 2 analysis of association of SNPs genes with liver cancer Risk (1)

TABLE 3 correlation analysis of SNPs genes with liver cancer Risk (2)

Establishment and evaluation of liver cancer tumor risk screening model

1. Methods

1. Data selection for correlation analysis

After quality control, the remaining 235 individuals (each dataset in table 1 as one individual) and 30 valid SNPs were used for linkage disequilibrium analysis in subsequent studies to obtain association data.

LD measurement: the degree of linkage disequilibrium is usually measured by D' and $r^2$; this study uses $r^2$ as the measure of LD. $r^2$ indicates the degree of statistical and genetic correlation between two loci ($0 < r^2 < 1$); it is insensitive to changes in gene frequency and performs stably. $r^2$ is calculated as:

$$r^2 = \frac{(P_{A1B1} - P_{A1} P_{B1})^2}{P_{A1}(1-P_{A1})\,P_{B1}(1-P_{B1})}$$

In the formula, $P_{A1}$ and $P_{B1}$ are the frequencies of the first allele at the two marker loci, and $P_{A1B1}$ is the frequency of the haplotype formed by these alleles.

$r^2$ between SNPs was calculated using Haploview software, and subsequent statistical analysis was performed using R software.
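For readers who want to check individual r² values by hand, the formula can be sketched in Python as below. This is not how Haploview is invoked; the function name `ld_r_squared` and the example frequencies are assumptions for illustration.

```python
def ld_r_squared(p_a1b1, p_a1, p_b1):
    """r² from the A1-B1 haplotype frequency and the two first-allele frequencies."""
    denominator = p_a1 * (1.0 - p_a1) * p_b1 * (1.0 - p_b1)
    if denominator == 0.0:
        raise ValueError("r² is undefined when an allele frequency is 0 or 1")
    d = p_a1b1 - p_a1 * p_b1  # linkage disequilibrium coefficient D
    return d * d / denominator


# Made-up frequencies: complete linkage, then independence.
print(ld_r_squared(0.5, 0.5, 0.5))
print(ld_r_squared(0.25, 0.5, 0.5))
```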

2. Model building process

(1) According to the SNP spacing and $r^2$ distribution maps on chromosomes 6, 12, 5, 10, 19, 20, 11, 1, 7, 15, X, 9, 17, 13, 15, 12 and 18, the SNPs whose spacing is within 50 Mb and whose $r^2 > 0.9$ in consecutive analysis are selected as the SNPs for model construction.

(2) According to the individual effect of each SNP in the data set, the individual effect value (l) and the phenotypic parameter (f) of each SNP locus for liver cancer occurrence are obtained. The individual effect value is the statistical probability, in the data set, of having liver cancer for a single SNP; the phenotypic parameter is the statistical frequency, in the data set, with which individuals in whom the single SNP is dominantly inherited develop liver cancer.

(3) The individual effect value of each single individual is calculated by Logistic regression analysis, then corrected and weighted to obtain the genetic score (W).

(4) The genotype of a single SNP of an individual is taken as the variable, with allele type, heterozygous type, homozygous type, dominant type and recessive type considered as five variables. The genotypes of a given SNP are AA dominant, AA recessive, AB dominant, AB recessive and BB, where A is the risk allele and B is the non-risk allele, and the corresponding risk values are $l^2 f^2$, $l^2(1-f)^2$, $l(1-l)(1-f)$, $l(1-l)$ and $(1-l)^2$, respectively.

(5) The relative risk value of each SNP is calculated as $[\,l^2 f^2 + l^2(1-f)^2 + l(1-l)(1-f) + l(1-l) + (1-l)^2\,]/W$.

(6) The SNPs-based liver cancer weighted risk screening score M of each individual is: M = SNP1 × SNP2 × SNP3 × ... × SNPn, where SNP1 to SNPn are the relative risk values of the n SNPs screened.

Through the steps, the liver cancer weighted risk screening scores of the n SNPs of each individual can be obtained, and the cancer risk of each individual can be judged according to the liver cancer weighted risk screening scores.
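Steps (4) to (6) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the function names are hypothetical, the inputs l, f and W are taken as already estimated, and the example numbers are made up.

```python
from math import prod


def genotype_risk_values(l, f):
    """Risk values of the five genotype classes named in step (4)."""
    return {
        "AA dominant": l ** 2 * f ** 2,
        "AA recessive": l ** 2 * (1.0 - f) ** 2,
        "AB dominant": l * (1.0 - l) * (1.0 - f),
        "AB recessive": l * (1.0 - l),
        "BB": (1.0 - l) ** 2,
    }


def snp_relative_risk(l, f, w):
    """Step (5): sum of the five genotype risk values divided by the genetic score W."""
    if w == 0:
        raise ValueError("genetic score W must be non-zero")
    return sum(genotype_risk_values(l, f).values()) / w


def weighted_risk_score(relative_risks):
    """Step (6): M is the product of the relative risk values of the n SNPs."""
    return prod(relative_risks)


# Made-up inputs for two SNPs: (l, f, W).
snp_inputs = [(0.5, 0.5, 1.0), (0.4, 0.6, 0.9)]
relative_risks = [snp_relative_risk(l, f, w) for l, f, w in snp_inputs]
print(relative_risks, weighted_risk_score(relative_risks))
```

Because M is a product, a single SNP with a relative risk value far from 1 can dominate the score, which is worth keeping in mind when choosing which SNPs enter the model.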

3. Logistic regression analysis

The Logistic regression model is suitable for data whose dependent variable is categorical. The model takes the logarithm of the ratio of the probability that an event occurs to the probability that it does not occur as the dependent variable and fits it linearly; the regression coefficients are estimated by the maximum likelihood method.

In this study, Logistic regression fitting was carried out on the training set with the glm function in R 3.6.2; the bestglm function in the bestglm package was used to select the variables of the optimal model by ten-fold cross validation and by the minimum BIC (Bayesian Information Criterion) method, and the two models were compared with the anova function. The predict function was then used for diagnostic prediction on the validation set samples, the confusionMatrix and misClassError functions in the InformationValue package were used to form a fourfold (2×2) table, and the error rate or coincidence rate was calculated. Finally, a nomogram of the model was drawn with the regplot function in the regplot package, and the ROC curve was drawn with the pROC package. Two ROC curves were compared using the following methods:
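The fourfold table and the error or coincidence rate described above can be illustrated without any statistics package. The Python sketch below is not a reproduction of the R functions named in the text; the function names, the 0.5 probability cut-off and the example data are assumptions.

```python
def fourfold_table(y_true, y_prob, threshold=0.5):
    """Count true/false positives and negatives at the given probability cut-off."""
    table = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, prob in zip(y_true, y_prob):
        predicted = 1 if prob > threshold else 0
        if predicted == 1 and truth == 1:
            table["tp"] += 1
        elif predicted == 1 and truth == 0:
            table["fp"] += 1
        elif predicted == 0 and truth == 0:
            table["tn"] += 1
        else:
            table["fn"] += 1
    return table


def error_rate(table):
    """Share of samples whose predicted class differs from the true class."""
    return (table["fp"] + table["fn"]) / sum(table.values())


def coincidence_rate(table):
    """Share of samples whose predicted class matches the true class."""
    return 1.0 - error_rate(table)


# Made-up labels (1 = liver cancer, 0 = healthy) and predicted probabilities.
labels = [1, 1, 0, 0, 1, 0]
probabilities = [0.9, 0.4, 0.2, 0.6, 0.8, 0.1]
table = fourfold_table(labels, probabilities)
print(table, error_rate(table), coincidence_rate(table))
```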

(1) Group comparison: the two ROC curves are obtained from different observed subjects and the two samples used are completely independent. The test formula is: $Z = \dfrac{A_1 - A_2}{\sqrt{SE^2(A_1) + SE^2(A_2)}}$

(2) Paired comparison: the two diagnostic methods use the same sample, each observed subject being tested by both methods, and their diagnostic performance is then compared. The test formula is: $Z = \dfrac{A_1 - A_2}{\sqrt{SE^2(A_1) + SE^2(A_2) - 2\,\mathrm{Cov}(A_1, A_2)}}$

In the formulas, $A_1$ and $A_2$ are the areas under the two sample ROC curves, $SE^2(A_1)$ and $SE^2(A_2)$ are the squares of the standard errors of those areas, and $\mathrm{Cov}(A_1, A_2)$ is the estimated covariance of the two areas, which can be calculated by the nonparametric method given by DeLong. With a large sample, Z approximately follows the standard normal distribution; at test level α, if $|Z| > Z_{\alpha/2}$, the two diagnostic methods can be considered different. Both comparisons can be carried out in MedCalc software.
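The comparison can be sketched numerically as below, assuming the usual form of this Z test: Z = (A1 − A2) / sqrt(SE²(A1) + SE²(A2) − 2·Cov(A1, A2)), with the covariance set to zero for independent samples. The function names and example values are hypothetical, and the covariance itself is taken as given rather than computed by DeLong's method.

```python
import math


def roc_z_statistic(auc1, se1, auc2, se2, cov=0.0):
    """Z statistic for the difference of two AUCs (cov = 0 for independent samples)."""
    variance = se1 ** 2 + se2 ** 2 - 2.0 * cov
    if variance <= 0.0:
        raise ValueError("variance of the AUC difference must be positive")
    return (auc1 - auc2) / math.sqrt(variance)


def two_sided_p_value(z):
    """Two-sided P value of Z under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Made-up AUCs and standard errors for two independent samples.
z = roc_z_statistic(0.90, 0.03, 0.80, 0.04)
print(z, two_sided_p_value(z))
```

A positive covariance in the paired design shrinks the variance of the difference, so the same AUC gap gives a larger |Z| than in the independent design.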

4. Validation subjects

Cases recruited from southern, central and northern China were divided into a model population and validation populations 1 to 3. The model population was used to divide the M scores obtained from the established model into risk ranges, and the validation populations were used to verify the accuracy of the model. The inclusion criteria for all liver cancer cases were: diagnosed with primary liver cancer, no prior radiotherapy or chemotherapy, and pathological confirmation. Individuals in the case and control populations were not related to one another. All patients signed informed consent, and the study was approved by the ethics committee.

TABLE 4

5. DNA extraction and genotyping

After 5-10 μg of DNA was extracted from the serum of each case or healthy control, the DNA was fragmented into fragments of 100-400 bp, and a 2 × 180 bp small-fragment DNA library was then constructed using SeqCap EZ Human Exome Library v3.0 (Roche). Genotyping was performed by the PCR-RFLP method, and the primers for the 30 corresponding SNPs were provided by Shanghai Ministry of Engineers.

2. Results

Based on the genes and SNPs retrieved in Tables 1 to 3, the $r^2$ distributions of the SNPs on the different chromosomes were plotted, as shown in the figures. For SNPs on different chromosomes, those with spacing within 50 Mb and $r^2 > 0.9$ in consecutive analysis were counted, and the results are shown in Table 5.

TABLE 5 analysis of linkage disequilibrium of SNPs

Different weighted genetic risk scoring models can be constructed based on the SNPs selected in table 5, as shown in table 6.

TABLE 6 weighted genetic Risk Scoring model

TABLE 7 M quantile distribution of model population

Calculating the M scores of the model population according to the models generated in the examples 1 to 3 and the comparative examples 1 to 2, and counting the arrangement of the M scores, wherein the results are shown in Table 7, and the range of 0.35 to 0.45 does not include 0.35 but includes 0.45 in the Table 7; "/" denotes no case; "+" indicates the number of real cancer cases of the corresponding model population in Table 4, and "-" indicates the number of real health cases of the corresponding model population in Table 4. In Table 7, the percentage of cases with M values exceeding 0.35 to the total number of cases of the model population was used as the prediction positive rate in example 1, and the percentage of cases with M values exceeding 0.3 to the total number of cases of the model population was used as the prediction positive rate in examples 2 to 3 and comparative examples 1 to 2. In Table 7, the positive match rate is equal to the percentage of the predicted positive rate of each example or comparative example to the liver cancer true positive rate (59.39%) of the model population.
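The two rates defined above can be computed as in the following Python sketch. The function names and the list of M scores are made up for illustration; only the cut-offs (0.3 or 0.35) and the true positive rate of the model population (59.39%) come from the text.

```python
def predicted_positive_rate(m_scores, threshold):
    """Percentage of cases whose M score exceeds the threshold."""
    if not m_scores:
        raise ValueError("at least one M score is required")
    positives = sum(1 for m in m_scores if m > threshold)
    return 100.0 * positives / len(m_scores)


def positive_match_rate(predicted_rate, true_positive_rate):
    """Predicted positive rate as a percentage of the true positive rate."""
    return 100.0 * predicted_rate / true_positive_rate


# Made-up M scores; 59.39 is the true positive rate quoted for the model population.
m_scores = [0.12, 0.28, 0.31, 0.42, 0.55, 0.18, 0.36, 0.25, 0.61, 0.30]
rate = predicted_positive_rate(m_scores, threshold=0.3)
print(rate, positive_match_rate(rate, 59.39))
```

The strict comparison `m > threshold` mirrors the wording "M values exceeding 0.3", so a score of exactly 0.30 is not counted as predicted positive.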

As can be seen from the results in Table 7, the positive coincidence rates of comparative examples 1 and 2 both exceed 90%, but truly healthy cases appear in comparative examples 1 and 2 at M values of 0.5 to 0.6, which indicates that the accuracy of comparative examples 1 and 2 is not as good as that of examples 1 and 3. For the models constructed from SNPs in the examples of the present application, a calculated M value greater than 0.3 or 0.35 indicates that the individual is at risk of developing cancer; otherwise, the individual is not.

The models obtained in examples 1 to 3 and comparative examples 1 to 2 were further verified using validation population 1, and the results are shown in Table 8 (the predicted positive rate is calculated for M values greater than 0.3). For validation population 1, the positive coincidence rates of examples 1 to 3 and comparative examples 1 to 2 were all high. However, in the M quantile distributions of comparative examples 1 to 2, truly positive liver cancer cases occurred at M values of 0.3 or less, and a large number of healthy control cases occurred at M values from 0.3 to 0.35, showing that the M-quantile models of comparative examples 1 to 2 make errors in the risk evaluation of both liver cancer positive cases and healthy negative cases, whereas the M-quantile models of examples 1 to 3 have smaller errors.

TABLE 8 M quantile distribution of validation population 1

The models obtained in examples 1 to 3 and comparative examples 1 to 2 were further verified using validation population 2, and the results are shown in Table 9 (the predicted positive rate is calculated for M values greater than 0.3). For validation population 2, the positive coincidence rates of examples 1 to 3 were all high. Moreover, in the M quantile distributions of comparative examples 1 to 2, truly positive liver cancer cases occurred at M values of 0.3 or less, and a large number of healthy control cases occurred at M values from 0.3 to 0.35, so the M-quantile models of comparative examples 1 to 2 have certain errors in the risk evaluation of liver cancer positive cases and healthy negative cases, whereas the M-quantile models of examples 1 to 3 have smaller errors.

TABLE 9 M quantile distribution of validation population 2

Analysis with the Logistic regression model provided by comparative example 3 gave liver cancer positive coincidence rates of 84.36% and 81.24% in validation populations 1 and 2 respectively, which are not as high as the positive coincidence rates of the evaluation models provided by the embodiments of the present application.

This demonstrates that the SNPs-based model for liver cancer risk assessment and screening provided by the embodiments of the present application is the better model, with higher screening accuracy.

The above description is only a preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application.

Claims

1. A system for liver cancer screening and risk prediction, the system comprising:

the data analysis module is used for inputting the relative risk value of the SNP used for modeling of the individual to be tested into the liver cancer tumor screening model to obtain a prediction result;

the construction method of the liver cancer tumor screening model comprises the following steps:

obtaining an SNP dataset associated with liver cancer;

obtaining the tumor screening model based on the relative risk value;

the step of screening to obtain SNPs for modeling specifically comprises the following steps:

according to the linkage disequilibrium analysis result of each SNP on different chromosomes, selecting the SNPs whose spacing is within 50 Mb and whose $r^2 > 0.9$ in consecutive analysis as the SNPs for model construction; $r^2$ is calculated as: $r^2 = (P_{A1B1} - P_{A1} P_{B1})^2 / [P_{A1}(1-P_{A1})\,P_{B1}(1-P_{B1})]$; wherein $P_{A1}$ and $P_{B1}$ are the frequencies of the first allele at the two SNP marker loci, and $P_{A1B1}$ is the frequency of the haplotype formed by these alleles;

wherein the relative risk value is $[\,L^2 f^2 + L^2(1-f)^2 + L(1-L)(1-f) + L(1-L) + (1-L)^2\,]/W$, where L represents the individual effect value for liver cancer occurrence, f is the phenotypic parameter, and W is the genetic score, which is obtained by calculating the individual effect value of each single individual by logistic regression analysis and then correcting and weighting it;

the model formula of the tumor screening model is as follows:

M＝SNP2×SNP8×SNP9×SNP14×SNP15×SNP20×SNP26×SNP27；

SNP1 represents the relative risk value for SNP TAGA rs15945924;

SNP2 represents the relative risk value for SNP FBXW rs11744825;

SNP7 represents the relative risk value for SNP RANBP1 rs17033807;

SNP8 represents the relative risk value for SNP GNA rs5741536;

SNP9 represents the relative risk value for SNP CSMD rs3411226;

SNP14 represents the relative risk value for SNP TGM rs239809;

SNP15 represents the relative risk value for SNP DUOX rs4539964;

SNP20 represents the relative risk value for SNP RE rs4362209;

SNP21 represents the relative risk value for SNP AT rs10819989;

SNP26 represents the relative risk value for SNP ATP7 rs5251533;

SNP27 represents the relative risk value for SNP MUTY rs4579862.

2. The system of claim 1, wherein the step of screening for SNPs for modeling further comprises:

and calculating to obtain the SNPs-based liver cancer weighted risk screening score of each individual according to the individual effect value, the phenotypic parameter and the genetic score, and judging the cancer risk of each individual according to the liver cancer weighted risk screening score.

3. The system of claim 1, wherein the SNPs screened for modeling include at least one of TAGA rs15945924, FBXW rs11744825, RANBP1 rs17033807, GNA rs5741536, TY rs8896114, TGM rs239809, DUOX rs4539964, RE rs4362209, AT rs10819989, ATP7 rs5251533, MUTY rs4579862.