CN117672521A - Construction method of model for evaluating possibility of suffering from hepatocellular carcinoma of subject - Google Patents


Publication number
CN117672521A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311678114.7A
Other languages
Chinese (zh)
Inventor
周俭
樊嘉
胡捷
孙云帆
杨欣荣
孙惠川
邱双健
陈丽萌
彭海翔
温冬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jusbio Sciences Shanghai Co ltd
Zhongshan Hospital Fudan University
Original Assignee
Jusbio Sciences Shanghai Co ltd
Zhongshan Hospital Fudan University

Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The application provides a method, a system and a kit for constructing a model for evaluating the possibility of a subject suffering from hepatocellular carcinoma, and an electronic device and a computer readable medium storing computer program code for applying the construction method. The construction method comprises the steps of obtaining a plurality of independent variables, screening and data processing the plurality of independent variables, further screening effective independent variables in the regression model training process, reducing the number of the independent variables, obtaining the relation among the independent variables through correlation analysis, and replacing and combining the independent variables with replacement relation to obtain a plurality of prediction models. The prediction model can obtain a good prediction result by using fewer effective independent variables, and has the advantages of high prediction efficiency, wide application range and low cost.

Description

Construction method of model for evaluating possibility of suffering from hepatocellular carcinoma of subject
Technical Field
The present application relates generally to the fields of biomedical science, biological research, bioinformatics and hepatocellular carcinoma prediction, and in particular to a method for constructing a model for evaluating the possibility of a subject suffering from hepatocellular carcinoma, as well as a system, a kit, an electronic device and a computer readable medium storing computer program code.
Background
Liver cancer is a malignant tumor that ranked third among global causes of cancer death in 2020. Primary liver cancer mainly comprises three pathological types: hepatocellular carcinoma (HCC), intrahepatic cholangiocarcinoma (ICC) and combined hepatocellular-cholangiocarcinoma (cHCC-CCA). The three differ greatly in pathogenesis, biological behavior, histopathology, treatment and prognosis; HCC accounts for 75%-85% of cases and ICC for 10%-15%. HCC is a cancer with a poor prognosis, and the prognosis depends on the disease stage. HCC patients who do not undergo surgery have a 5-year survival rate of <5%, whereas the postoperative survival rate is 60%-70%. For patients with a tumor size <2 cm who undergo surgical resection, the 5-year survival rate can reach 86%. However, the 3-year survival rate of early-stage patients (tumor size <5 cm) who receive no treatment is only 17%-21%. This shows that early detection is important for both treatment and patient survival.
Serum alpha-fetoprotein (AFP), the alpha-fetoprotein L3 isoform (AFP-L3%), des-gamma-carboxy prothrombin (DCP, also known as PIVKA-II) and miRNA7 are all serological markers currently in common use for diagnosing HCC, but a single serological marker has insufficient diagnostic sensitivity and specificity for early liver cancer.
This study is based on a retrospective analysis of the phenotype and clinical information of about 20,000 liver disease patients treated at a tertiary hospital over 4 years. The sensitivity and specificity of common liver cancer biomarkers were analyzed statistically in HCC patients and healthy individuals. The sensitivity of serum alpha-fetoprotein (AFP) was 48.68% and its specificity 95.97%; the sensitivity of the alpha-fetoprotein L3 isoform (AFP-L3) was 45.78% and its specificity 89.54%; the sensitivity of des-gamma-carboxy prothrombin (DCP) was 69.98% and its specificity 85.73%; the sensitivity of miRNA7 was 68.44% and its specificity 68.86%. In the small-HCC group with a single tumor smaller than 2 cm, the detection sensitivity of AFP was only 39.93%, that of AFP-L3 only 34.71% and that of DCP only 37.74%, whereas that of miRNA7 reached 69.42%. miRNA7 therefore has a marked advantage in detection sensitivity in the HCC population, especially the small-HCC population, but its specificity in the healthy population is insufficient, which narrows its range of application.
miRNA7 is an existing liver cancer diagnostic marker based on a combination of plasma microRNAs. It consists of nucleic acid molecules encoding hsa-miR-122, hsa-miR-192, hsa-miR-21, hsa-miR-223, hsa-miR-26a, hsa-miR-27a, hsa-miR-801 and hsa-miR-1228. With hsa-miR-1228 as the internal reference, a Logit value is calculated from the quantitative results of the 8 microRNAs in plasma using a model formula, and HCC negativity or positivity is judged against a threshold of -0.5. Because miRNA7 detects 8 markers, its reagent and testing costs are high, and the resulting price prevents wide use in practice. Simplifying and optimizing miRNA7 on the basis of these 8 markers would reduce the testing cost, benefit more patients and larger populations, and allow its application value to be better realized.
Therefore, it is desirable to establish a detection method that is simultaneously accurate, fast and low-cost across diverse populations, so as to facilitate routine screening and monitoring of populations at high risk of liver cancer and to promote early discovery, early diagnosis and early treatment of liver cancer, thereby improving the cure rate and survival rate of liver cancer patients.
Disclosure of Invention
To solve the above technical problem, in one aspect, the present application provides a method for constructing a model for evaluating the possibility of a subject suffering from hepatocellular carcinoma, including:
step S10: obtaining a plurality of independent variables for the model, the plurality of independent variables comprising: the expression level in plasma of any one of the microRNAs hsa-miR-122, hsa-miR-192, hsa-miR-21, hsa-miR-223, hsa-miR-26a, hsa-miR-27a and hsa-miR-801, and the age (Age), sex (Gender), tumor marker test results and hematological index test results of the subject, wherein the tumor marker test results comprise alpha-fetoprotein (AFP), and the hematological index test results comprise any one of platelet count (PLT), total bilirubin (TB), gamma-glutamyl transferase (GGT), prothrombin time (PT), international normalized ratio (INR), aspartate aminotransferase (AST) and alanine aminotransferase (ALT); and structuring the plurality of independent variables;
step S20: dividing the subjects into a hepatocellular carcinoma group and a non-hepatocellular carcinoma group, encoded as 1 and 0 respectively, to form the dependent variable for the model;
step S30: screening the plurality of independent variables to obtain effective independent variables;
step S40: performing correlation analysis on the plurality of independent variables to determine the relationships among them;
step S50: comparing the correlation coefficients between each independent variable and the dependent variable before and after Log10 conversion, and determining whether to perform Log10 conversion on that independent variable;
step S60: dividing the data into a training set and a test set, constructing a regression model containing the independent variables from the training set, and testing the performance of the regression model on the test set, wherein the data comprise the plurality of independent variables and the dependent variable;
step S70: further screening the effective independent variables according to any one of the p-value of each independent variable in the regression analysis result, the coefficient value, and the AUC values of the regression model on the training set and the test set, to obtain screened effective independent variables, wherein, when the p-value of an independent variable is less than 0.05, an independent variable whose coefficient value is greater than a first threshold is taken as an effective independent variable;
step S80: repeating step S70 until the AUC value of the regression model on the training set decreases to approximately 0.8, wherein the number of screened effective independent variables is greater than or equal to 2 and the screened effective independent variables include at least one microRNA, so as to obtain a plurality of first models;
step S90: according to the relationships among the plurality of independent variables obtained in step S40, replacing and combining the independent variables having a replacement relationship in the plurality of first models, and constructing regression models from the replaced and combined independent variables to obtain a plurality of second models, wherein the model comprises any one of the first models and the second models.
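Step S50 can be illustrated with a minimal sketch. The text states only that the correlation coefficients before and after Log10 conversion are compared; the decision rule used here (convert when the absolute Pearson correlation with the dependent variable increases) is an assumption, and the function names and sample values are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def should_log10(values, labels):
    """Assumed rule: use Log10 if it strengthens |r| with the 0/1 label."""
    r_raw = pearson_r(values, labels)
    r_log = pearson_r([math.log10(v) for v in values], labels)
    return abs(r_log) > abs(r_raw)


# Hypothetical, strongly right-skewed marker values (all must be > 0).
marker = [2.0, 5.0, 10.0, 100.0, 1000.0, 10000.0]
is_hcc = [0, 0, 0, 1, 1, 1]  # dependent variable coded as in step S20
use_log = should_log10(marker, is_hcc)
print(use_log)
```

For a skewed variable such as this one, the logarithm spreads out the low values and compresses the extreme ones, so the linear correlation with the 0/1 label becomes stronger and the sketch selects the converted variable.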
In a second aspect, the present application provides a system for assessing the likelihood of a subject suffering from hepatocellular carcinoma, comprising:
a data acquisition module for acquiring sample data of a subject, wherein the sample data comprise the expression level in plasma of any one of the microRNAs hsa-miR-122, hsa-miR-192, hsa-miR-21, hsa-miR-223, hsa-miR-26a, hsa-miR-27a and hsa-miR-801, and the age (Age), sex (Gender), tumor marker test results and hematological index test results of the subject, wherein the tumor marker test results comprise alpha-fetoprotein (AFP), and the hematological index test results comprise any one of platelet count (PLT), total bilirubin (TB), gamma-glutamyl transferase (GGT), prothrombin time (PT), international normalized ratio (INR), aspartate aminotransferase (AST) and alanine aminotransferase (ALT);
a data preprocessing module for preprocessing the sample data, including removing duplicate samples, unifying units, and dividing the sample data into different analysis groups according to the types and number of indices;
a model recommendation module for recommending, according to the types and number of indices contained in the different analysis groups, a plurality of prediction models with larger AUC values from among the 41 prediction models established by the construction method;
a model selection module for allowing a user to select, from the prediction models recommended by the model recommendation module, one or more prediction models to be output for risk assessment and calculation;
a risk assessment module for calculating the corresponding predicted values according to the one or more prediction models output by the model selection module, giving an appropriate Logit threshold or risk score threshold according to the department or population information corresponding to the sample data, and classifying the degree of risk as any one of high possibility, medium possibility and low possibility according to the Logit threshold or risk score threshold.
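The threshold-based grading performed by the risk assessment module can be sketched as follows. This is only an illustration: the use of two cut-offs, the department names and every numeric threshold below are hypothetical, since the text does not state the actual values.

```python
def risk_level(score, low_cut, high_cut):
    """Map a Logit value or risk score to a three-level risk grade."""
    if low_cut > high_cut:
        raise ValueError("low_cut must not exceed high_cut")
    if score >= high_cut:
        return "high"
    if score >= low_cut:
        return "medium"
    return "low"


# Hypothetical cut-offs per department or population (not from the source).
HYPOTHETICAL_CUTS = {
    "hepatology": (-1.0, 0.5),
    "health_check": (-0.5, 1.0),
}

low_cut, high_cut = HYPOTHETICAL_CUTS["health_check"]
levels = [risk_level(s, low_cut, high_cut) for s in (-2.0, 0.0, 1.5)]
print(levels)
```

Keeping the cut-offs in a lookup keyed by department or population mirrors the description that the threshold is chosen from the information attached to the sample data rather than fixed for all subjects.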
In a third aspect, the present application provides a kit for assessing the likelihood of a subject suffering from hepatocellular carcinoma, comprising a predictive model constructed using the construction method described above.
The present application proposes in a fourth aspect an electronic device comprising: a memory for storing instructions executable by the processor; and a processor for executing the instructions to implement the construction method as described above.
The present application proposes in a fifth aspect a computer readable medium storing computer program code which, when executed by a processor, implements a construction method as described above.
The construction method of the application yields 41 logistic regression prediction models whose AUC values on the test set lie between 0.8 and 0.9. The models are based on data from 20,191 samples drawn from a complex population including a benign lesion population, a physical examination population, a hepatitis B population, a liver cirrhosis population, a metastatic liver cancer population, a benign liver tumor population, a cholangiocarcinoma population, a hepatocellular carcinoma population and the like. By comparison, the GALAD model marketed by Roche is applicable to patients with chronic liver disease, as is the aMAP Score. The models constructed by the present method have a wider range of application, can adapt to the characteristics of a broader population, and are also applicable to healthy populations such as the physical examination population. In the construction method, the independent variables are screened, and then screened further step by step during training of the regression model, so that the model obtains a good prediction result with fewer effective independent variables. This improves prediction efficiency and saves cost, and the resulting model, kit, system and electronic device can be applied to a wider range of scenarios and populations and have high social value.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the principles of the application. In the accompanying drawings:
FIG. 1 is an exemplary flow chart of a method of constructing a model for assessing a subject's likelihood of suffering from hepatocellular carcinoma in accordance with an embodiment of the present application;
FIG. 2 is a matrix diagram of correlation coefficients between various indices involved in an embodiment of the present application;
FIG. 3 is a ROC curve and corresponding AUC results for predictive model 1 over a training set in accordance with an embodiment of the present application;
FIG. 4 is a cross-validated ROC curve and corresponding AUC results for predictive model 1 of an embodiment of the application on a training set;
FIG. 5 is a ROC curve and corresponding AUC results for predictive model 1 of an embodiment of the application over a test set;
FIG. 6 is a block diagram of a system for assessing a subject's likelihood of suffering from hepatocellular carcinoma in accordance with one embodiment of the present application;
FIG. 7 is a system block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are used in the description of the embodiments will be briefly described below. It is apparent that the drawings in the following description are only some examples or embodiments of the present application, and it is obvious to those skilled in the art that the present application may be applied to other similar situations according to the drawings without inventive effort. Unless otherwise apparent from the context of the language or otherwise specified, like reference numerals in the figures refer to like structures or operations.
In addition, the terms "first", "second", etc. are used to define the components, and are merely for convenience of distinguishing the corresponding components, and unless otherwise stated, the terms have no special meaning, and thus should not be construed as limiting the scope of the present application. Furthermore, although terms used in the present application are selected from publicly known and commonly used terms, some terms mentioned in the specification of the present application may be selected by the applicant at his or her discretion, the detailed meanings of which are described in relevant parts of the description herein. Furthermore, it is required that the present application be understood, not simply by the actual terms used but by the meaning of each term lying within.
Flowcharts are used in this application to describe the operations performed by systems according to embodiments of the present application. It should be understood that the preceding or following operations are not necessarily performed exactly in order. Rather, the various steps may be processed in reverse order or simultaneously. Other operations may also be added to these processes, or one or more steps may be removed from them.
The term "cancer" as used herein refers to the presence of cells that have characteristics typical of oncogenic cells, such as uncontrolled proliferation, immortality, metastatic potential, rapid growth and proliferation rate, as well as certain characteristic morphological features known in the art. In one embodiment, the "cancer" may be liver cancer. In one embodiment, "cancer" may include premalignant cancer and malignant cancer.
In one embodiment, as will be appreciated by those skilled in the art, the method described herein does not involve steps performed by a physician or clinician. Thus, the results obtained by the methods described herein must be combined with clinical data and other clinical manifestations before a physician can provide a final diagnosis to the subject. The final diagnosis as to whether a subject has liver cancer falls within the physician's purview and is not considered part of the present disclosure. Thus, the terms "determining," "detecting," and "diagnosing" as used herein refer to identifying a subject as having a probability or likelihood of having a disease (e.g., liver cancer) at any stage of development, or determining a subject's susceptibility to developing the disease. In one embodiment, "diagnosing," "determining," or "detecting" is performed prior to the manifestation of symptoms. In one embodiment, "diagnosing," "determining," or "detecting" allows a clinician or physician (in combination with other clinical manifestations) to confirm liver cancer in a subject suspected of having liver cancer.
As used herein, the term "sample" means a sample collected from a subject for detecting the type and amount of liver cancer markers therein. The subject sample may or may not be from the circulatory system, i.e., from blood. The subject sample may be any sample comprising a suitable marker for detecting liver cancer, sources of which include whole blood, bone marrow, pleural fluid, peritoneal fluid, cerebrospinal fluid, milk, urine, tears, sweat, saliva, organ secretions, and lavage fluids of the bronchi, nasal cavity, throat and the like.
In one embodiment, the subject sample is blood, including, for example, whole blood or any portion or component thereof. Blood samples suitable for use in the present invention may be extracted from any known source including blood cells or components thereof, such as veins, arteries, peripheral, tissue, spinal cord and the like. For example, the obtained sample may be obtained and processed using well known and conventional clinical methods (e.g., procedures for drawing and processing whole blood). In one embodiment, the subject sample is serum. Methods for obtaining serum from blood are well known to those skilled in the art.
Some terms used in the embodiments of the present invention are enumerated below. Within the scope of the present description and claims, the relevant terms are defined as follows. Other terms not listed are defined as commonly used in the art, the meaning of which is well known to those skilled in the art.
Ct value: an important parameter in real-time fluorescent quantitative PCR detection. Ct is the number of PCR cycles required for the fluorescent signal to reach the set threshold. The main factors influencing the Ct value are: 1. the initial copy number of the target DNA: the higher the initial copy number, the smaller the Ct value; 2. the reaction efficiency: the higher the reaction efficiency, the smaller the Ct value; 3. the fluorescence signal threshold: the lower the threshold is set, the smaller the Ct value. The main applications of the Ct value are: 1. quantitatively determining the relative amount of target DNA; 2. evaluating the quality of the PCR reaction; 3. quantitatively analyzing relative changes in gene expression levels. The smaller the Ct value, the greater the amount of starting template of the target DNA. By analyzing the Ct value, experiments such as gene expression quantification, microorganism quantification and residual DNA detection can be performed.
miRNA7 model formula (trade name miRNA7™): Genes 1-7 correspond to hsa-miR-21, hsa-miR-26a, hsa-miR-27a, hsa-miR-122, hsa-miR-192, hsa-miR-223 and hsa-miR-801, respectively, and Gene 8 corresponds to hsa-miR-1228. The Ct value of Gene 8 is subtracted from the Ct value of each of Genes 1-7 to obtain the corresponding dCt, which is then substituted into the following formula to calculate a comprehensive evaluation value (Logit):
Logit(P) = -1.9449 + 0.10633 × dCt_Gene1 + 0.10219 × dCt_Gene2 - 0.012441 × dCt_Gene3
0.28902 × dCt_Gene4 - 0.32779 × dCt_Gene5 + 0.25855 × dCt_Gene6 - 0.029515 × dCt_Gene7
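The dCt normalization and the linear Logit combination described above can be sketched in a few lines. The Ct values, the two-gene panel, the intercept and the coefficients below are placeholders chosen for easy arithmetic, not the miRNA7 values; the rule that a Logit above -0.5 counts as positive is also an assumption, since the text gives the threshold but not the direction.

```python
def delta_ct(ct_targets, ct_reference):
    """dCt = Ct(target gene) - Ct(internal reference gene)."""
    return {gene: ct - ct_reference for gene, ct in ct_targets.items()}


def logit_score(dct, intercept, coefficients):
    """Intercept plus the coefficient-weighted sum of dCt values."""
    return intercept + sum(coefficients[g] * dct[g] for g in coefficients)


# Hypothetical Ct readings for two target genes and the reference gene.
ct_targets = {"hsa-miR-21": 24.0, "hsa-miR-122": 26.5}
ct_reference = 22.0  # e.g. the internal reference gene

# Placeholder model parameters (NOT the miRNA7 coefficients).
intercept = -1.0
coefficients = {"hsa-miR-21": 0.5, "hsa-miR-122": -0.25}

dct = delta_ct(ct_targets, ct_reference)
score = logit_score(dct, intercept, coefficients)
is_positive = score > -0.5  # assumed direction of the -0.5 threshold
print(dct, score, is_positive)
```

Passing the intercept and coefficients in as data keeps the calculation independent of any particular model, so the same two functions would serve for other linear models over dCt values.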
AFP: alpha-fetoprotein (Alpha-fetoprotein). In oncology, AFP is mainly used as a tumor marker. Elevated AFP levels are common in primary liver cancer, and AFP >400 ng/ml can be used as one of the diagnostic criteria for liver cancer. Besides liver cancer, some gastric, lung and germ cell tumors can also cause elevated AFP. An elevated AFP therefore cannot serve as a specific liver cancer marker on its own and must be interpreted together with imaging and other indices. A small proportion of liver cancers have normal AFP, so additional tumor markers are needed to aid diagnosis. In addition to tumor screening and diagnosis, monitoring AFP can also help assess tumor recurrence and prognosis. AFP is an important liver cancer marker used for screening, diagnosis and treatment monitoring of liver cancer. However, because AFP is not specific to liver cancer, the result should be interpreted carefully in light of the clinical situation.
PLT: the english abbreviation of Platelet Count (Platelet Count) is an important item of routine blood cell analysis and can provide important information for screening and diagnosis of many diseases.
TB: the abbreviation of Total Bilirubin. It reflects the level of bilirubin in the blood and is an important hematological index for assessing liver function and erythrocyte lifespan. Bilirubin is mainly metabolized and excreted in the liver, and elevated serum TB is common in liver diseases such as hepatitis and cirrhosis. Periodic monitoring of TB helps assess liver function and the progression of various diseases.
GGT: the abbreviation of gamma-glutamyl transferase (Gamma-Glutamyl Transferase). It is one of the routine hematological indices.
PT: the abbreviation of Prothrombin Time (Prothrombin Time) is an important indicator reflecting the state of the coagulation system, and its change suggests that there may be coagulation abnormality or liver function impairment, which is clinically important.
INR: the abbreviation of International Normalized Ratio, an index calculated from the PT test result. INR is a standardized expression of PT that allows PT results from different laboratories to be compared more accurately. The ISI value in the INR formula is the international sensitivity index of the thromboplastin reagent used. Liver cirrhosis, liver cancer, vitamin K deficiency and the like can lead to elevated INR. INR is more standardized than PT and is an important index of coagulation function.
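The INR formula mentioned above is not written out in the text; the standard definition is INR = (patient PT / mean normal PT)^ISI. A minimal sketch, with hypothetical function and parameter names and illustrative input values:

```python
def inr(pt_patient_s, pt_normal_s, isi):
    """International normalized ratio from prothrombin times in seconds.

    pt_normal_s is the laboratory's mean normal PT and isi is the
    international sensitivity index of the reagent.
    """
    if pt_patient_s <= 0 or pt_normal_s <= 0:
        raise ValueError("prothrombin times must be positive")
    return (pt_patient_s / pt_normal_s) ** isi


# Illustrative values only: a patient PT twice the normal mean, ISI of 1.0.
example_inr = inr(24.0, 12.0, 1.0)
print(example_inr)
```

Raising the PT ratio to the power ISI is what compensates for reagent sensitivity, which is why INR values are comparable between laboratories while raw PT values are not.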
AST: is an abbreviation of aspartate aminotransferase (Aspartate Aminotransferase) and is an important hematological index for assessing liver and other organ diseases.
ALT: the abbreviation for alanine aminotransferase (Alanine Aminotransferase) is an important indicator of impaired liver function.
Correlation analysis (Correlation Analysis): a statistical method for investigating whether a relationship exists between two or more random variables. Its primary purpose is to determine whether there is a statistical correlation or dependence between the variables and to quantify the degree and form of that correlation. The basic methods of correlation analysis include linear correlation, rank correlation, distance correlation and the like.
Correlation coefficient: a statistic used in correlation analysis to quantify how closely variables are related to one another, reflecting the strength of the linear correlation between two variables. The larger its absolute value, the stronger the linear correlation between the two variables; a value near 0 indicates weak correlation. Common examples are the Pearson correlation coefficient and the Spearman correlation coefficient. The correlation coefficient used in this study is the Pearson correlation coefficient.
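As a small illustration of the Pearson correlation coefficient, the sketch below computes it directly from its definition (covariance divided by the product of the standard deviations). The function name and the sample data are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical index values for five subjects.
index_a = [1.0, 2.0, 3.0, 4.0, 5.0]
index_b = [2.0, 1.0, 4.0, 3.0, 5.0]
r_ab = pearson_r(index_a, index_b)
print(round(r_ab, 3))
```

Applying the same function to every pair of indices produces a correlation coefficient matrix of the kind shown in FIG. 2.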
p-value: short for probability value. It is the probability, when the null hypothesis of a hypothesis test is true, of obtaining a result equal to or more extreme than the observed data. In general, if the p-value is very small, e.g., less than 0.01, the result is highly unlikely to be a chance event under the null hypothesis, so the null hypothesis is rejected, i.e., the result is statistically significant. If the p-value is large, e.g., greater than 0.05, the null hypothesis cannot be rejected, i.e., the result is not statistically significant. Common significance thresholds are 0.05 and 0.01. The p-value is therefore an important basis for judging whether a hypothesis test result is significant: the smaller the p-value, the more significant the result.
t test (t-test): a statistical method for testing whether the means of two samples differ significantly. The basic idea of the t test is to state a hypothesis, calculate the t value of the observed statistic, determine the p-value from the t distribution, and finally decide from the p-value whether to reject the null hypothesis.
Multicollinearity (Multicollinearity): the presence of a strong linear correlation among the independent variables. Its main problems are: 1. it degrades the accuracy of least-squares estimation and inflates the variance of the regression coefficients; 2. the marginal effect of an independent variable on the dependent variable cannot be estimated accurately; 3. the regression coefficients of some independent variables may become insignificant; 4. the predictive ability of the regression equation may be reduced. Methods for detecting multicollinearity include: 1. the correlation matrix method: inspecting the correlation coefficients between the independent variables; 2. the variance inflation factor (VIF) method: an excessively large VIF indicates multicollinearity; 3. the condition number method: an excessively large condition number indicates multicollinearity. Methods for handling multicollinearity include: 1. increasing the sample size; 2. removing collinear independent variables; 3. combining independent variables or using principal components; 4. regularization methods such as ridge regression. Multicollinearity can thus adversely affect regression results and needs to be identified and handled during modeling.
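The variance inflation factor mentioned above is VIF_j = 1 / (1 - R_j²), where R_j² is obtained by regressing independent variable j on the other independent variables. The sketch below covers only the simplest case of two independent variables, where R_j² equals the squared Pearson correlation between them; the function names and data are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in a two-predictor model."""
    r = pearson_r(x1, x2)
    if abs(r) >= 1.0:
        raise ValueError("perfectly collinear predictors: VIF is infinite")
    return 1.0 / (1.0 - r * r)


# Hypothetical values of two correlated indices (r = 0.8).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0]
vif = vif_two_predictors(x1, x2)
print(round(vif, 3))
```

With more than two independent variables, each R_j² has to come from a full multiple regression, so this shortcut no longer applies.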
Logistic regression (Logistic Regression): a commonly used classification model that establishes the relationship between independent variables and a categorical dependent variable. Its main characteristics are: 1. the predicted dependent variable is a binary or multi-class discrete variable; 2. the value of the linear combination of independent variables is converted to a probability between 0 and 1 by the logistic function; 3. both categorical and continuous variables can be handled; 4. parameters are typically estimated by maximum likelihood; 5. the factors influencing classification and the weights of the independent variables can be interpreted; 6. classification probabilities can be calculated and used for prediction. The main steps in establishing a logistic regression model are: (1) collecting data and handling missing values; (2) selecting input variables and encoding categorical variables; (3) establishing the logistic regression equation; (4) estimating the parameters by maximum likelihood; (5) evaluating the overall fit of the model; (6) performing statistical tests to evaluate the influence of each variable; (7) establishing classification rules from the probabilities; (8) predicting new data. Logistic regression can be used both to analyze the effect of variables quantitatively and to perform classification prediction, and is a very useful method of classification analysis.
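The logistic function and maximum likelihood estimation can be illustrated with a one-variable sketch. Plain gradient ascent on the log-likelihood is used here purely for illustration; it is not stated to be the fitting procedure of this application, and the data, learning rate and epoch count are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function: maps a linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit P(y=1|x) = sigmoid(b0 + b1*x) by gradient ascent on the
    average log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)  # gradient term per sample
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


# Hypothetical data: one independent variable, 0/1 dependent variable.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
b0, b1 = fit_logistic(xs, ys)
p_low = sigmoid(b0 + b1 * 0.5)
p_high = sigmoid(b0 + b1 * 4.0)
print(b0, b1, p_low, p_high)
```

The fitted slope is positive, so the predicted probability rises with the independent variable; comparing the probability with a cut-off then gives the classification rule of step (7).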
ROC curve and AUC value: criteria for measuring the quality of a classifier. The ROC (Receiver Operating Characteristic) curve plots the false positive rate on the horizontal axis against the true positive rate on the vertical axis. The AUC value is the area under the ROC curve; a useful classifier has a curve lying above the line y=x, i.e., an AUC greater than 0.5. The greater the AUC, the better the classification performance of the classifier (e.g., a logistic regression model).
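The AUC can be computed without drawing the curve, because it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one (ties counted as one half). The sketch below uses this pairwise definition; the function name, labels and scores are hypothetical.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = HCC, 0 = non-HCC) and predicted probabilities.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_score(labels, scores)
print(auc)  # 3 of the 4 positive/negative pairs are ranked correctly
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration; a rank-based computation would be preferable for a data set of tens of thousands of samples.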
The present application is further described below with reference to examples and figures.
Example 1 Method for constructing a model for assessing the likelihood that a subject suffers from hepatocellular carcinoma
Fig. 1 is an exemplary flowchart of a method of constructing a model for assessing a subject's likelihood of suffering from hepatocellular carcinoma in accordance with an embodiment of the present application. Referring to fig. 1, the construction method of this embodiment includes the steps of:
Step S10: acquiring a plurality of independent variables for the model and structuring the plurality of independent variables, the plurality of independent variables comprising: the expression levels in plasma of the microRNAs hsa-miR-122, hsa-miR-192, hsa-miR-21, hsa-miR-223, hsa-miR-26a, hsa-miR-27a and hsa-miR-801, and any one of the Age, the Gender, a tumor marker detection result and a hematology index detection result of the subject, wherein the tumor marker detection result comprises alpha-fetoprotein (AFP), and the hematology index detection result comprises any one of platelet count (PLT), total bilirubin (TB), gamma-glutamyl transferase (GGT), prothrombin time (PT), international normalized ratio (INR), aspartate aminotransferase (AST) and alanine aminotransferase (ALT);
Step S20: dividing the subjects into a hepatocellular carcinoma group and a non-hepatocellular carcinoma group, and encoding the two groups as 1 and 0 respectively to form the dependent variable for the model;
Step S30: screening the plurality of independent variables to obtain effective independent variables;
Step S40: performing correlation analysis on the plurality of independent variables to determine the relationships among the plurality of independent variables;
Step S50: comparing the correlation coefficient between each independent variable and the dependent variable before and after Log10 conversion, and determining whether to perform Log10 conversion on the independent variable;
Step S60: dividing the data, which comprise the plurality of independent variables and the dependent variable, into a training set and a test set, constructing a regression model containing the independent variables from the training set, and testing the performance of the regression model on the test set;
Step S70: further screening the effective independent variables according to any one of the p-value of each independent variable in the regression analysis result, the value of the correlation coefficient, and the AUC values of the regression model on the training set and the test set, to obtain screened effective independent variables, wherein an independent variable whose p-value is smaller than 0.05 and whose correlation coefficient value is larger than a first threshold is retained as a screened effective independent variable;
Step S80: repeating step S70 until the AUC value of the regression model on the training set decreases to approximately 0.8, while the number of screened effective independent variables remains greater than or equal to 2 and the screened effective independent variables include at least one microRNA, thereby obtaining a plurality of first models; wherein the number of effective independent variables obtained in step S30 is n1, the number of effective independent variables obtained in step S70 is n2, and n2 is not more than n1;
Step S90: according to the relationships among the plurality of independent variables obtained in step S40, replacing and combining the independent variables having a substitution relationship in the plurality of first models, and constructing regression models using the replaced and combined independent variables to obtain a plurality of second models, wherein the model comprises any one of the first models and the second models.
The order and numbering of steps S10 to S90 do not limit the order in which the steps are executed. For example, steps S10, S20, and S30 to S50 may be executed in any order.
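The iterative reduction of steps S70 and S80 can be sketched as a loop. This is only one possible reading of those steps, under stated assumptions: the helper callables `weakest_of` (which names the variable to drop, for example the one with the largest p-value), `train_auc` (which refits the regression model and returns its AUC on the training set) and `is_microrna` are hypothetical placeholders supplied by the caller, and the loop is assumed to stop as soon as the training AUC would fall below 0.8.

```python
def reduce_models(variables, weakest_of, train_auc, is_microrna, auc_floor=0.8):
    """Drop one variable per round and record each reduced variable set.

    All three callables are hypothetical helpers provided by the caller:
    weakest_of(vars) -> name of the variable to remove next,
    train_auc(vars)  -> AUC on the training set of a model refitted on vars,
    is_microrna(v)   -> True if variable v is a microRNA.
    """
    current = list(variables)
    first_models = []
    while len(current) > 2:  # keep at least 2 variables after a removal
        drop = weakest_of(current)
        candidate = [v for v in current if v != drop]
        if not any(is_microrna(v) for v in candidate):
            break  # at least one microRNA must remain
        if train_auc(candidate) < auc_floor:
            break  # assumed stopping rule: training AUC has reached the floor
        current = candidate
        first_models.append(tuple(current))
    return first_models
```

Each recorded variable set corresponds to one candidate "first model" in the sense of step S80; how ties or alternative removal orders are handled is left open here.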
The inventors of the present application studied the screening of existing liver cancer biomarkers and the simplification of models, and found that if the existing liver cancer biomarkers are structured into independent variables, and the samples are grouped and encoded according to clinical diagnosis to form the dependent variable, variable screening can be performed through correlation analysis between the independent variables and the dependent variable and through statistical tests. The ability of an independent variable (liver cancer biomarker) to distinguish different clinical groups is judged by the magnitude of its correlation coefficient with the dependent variable: the larger the correlation coefficient, the stronger the distinguishing ability, and the higher the priority the biomarker should receive when constructing a liver cancer prediction model. The distinguishing ability of an independent variable can also be judged by whether its numerical distribution differs significantly between the two clinical groups and by the magnitude of the p-value.
Meanwhile, the inventors also found that correlation analysis between the independent variables (liver cancer biomarkers) reveals the correlations among them: two independent variables with a strong correlation can replace each other in model construction or practical application, so only one of them needs to be selected when constructing the model, which simplifies the model. The inventors further found that, by analyzing the correlation between each independent variable and the dependent variable before and after Log10 conversion, squaring, square-root processing and the like, it can be determined from the magnitude of the resulting correlation coefficients whether to apply such processing to the variable in modeling or practical application. In addition, the inventors found that if a logistic regression model containing all the variables, including the effective variables, is constructed on the training set and its performance is tested on the test set, the variables can be further screened using the p-value and coefficient value of each variable in the logistic regression result and the overall AUC value of the model, in combination with the correlation analysis and statistical test results, thereby reducing the set of effective variables and achieving an excellent liver cancer marker screening effect and model simplification effect. Based on the above findings, the present application proposes a method of constructing a model for evaluating the likelihood of a subject suffering from hepatocellular carcinoma.
The data used in the construction method of the present application are the following information from 20000 patients admitted to Shanghai Zhongshan Hospital during the four years from 2018 to 2021:
1) microRNA expression level results: the expression level of the 7 microRNAs of hsa-miR-122, hsa-miR-192, hsa-miR-21, hsa-miR-223, hsa-miR-26a, hsa-miR-27a and hsa-miR-801 in blood plasma;
2) Tumor marker detection results: the AFP (alpha-fetoprotein) detection result and the miRNA7™ detection result. The detection result here refers to the measured numerical result, not the classification result determined according to the relevant threshold.
3) Hematology index detection results: the 7 hematological indices ALT (alanine aminotransferase), AST (aspartate aminotransferase), TB (total bilirubin), PT (prothrombin time), INR (international normalized ratio), PLT (platelet count) and GGT (gamma-glutamyl transferase);
4) Basic information of patient: sex, age;
5) Clinical diagnosis results.
After the data is obtained, the data is preprocessed, including:
1) Rejecting samples with incomplete information, repeated samples and the like;
2) Processing non-numeric records in the data, for example changing a record above the upper limit such as >65000 to 65000, and changing a record below the lower limit such as <0.1 to 0.1;
3) Variable structuring, that is, the structuring of the plurality of independent variables in step S10. For example, gender is encoded, with male encoded as 1 and female encoded as 0, to facilitate subsequent analysis and modeling; the microRNA expression level results, tumor marker detection results, hematology index detection results, basic patient information and the like are arranged in a fixed order to form the plurality of independent variables; the samples are then divided into a hepatocellular carcinoma (HCC) group and a non-hepatocellular carcinoma (non-HCC) group according to clinical diagnosis and encoded as 1 and 0 respectively to form the dependent variable, as in step S20; finally, the processed independent variables and dependent variable are arranged into a two-dimensional table or data matrix in which each column represents a variable and each row represents a sample, so that the prediction model can be conveniently built from this information.
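The preprocessing rules above can be sketched as follows. The field names, the raw string formats and the example record are hypothetical; only the rules themselves (clamping ">65000" to 65000 and "<0.1" to 0.1, encoding male as 1 and female as 0, and coding the HCC group as 1) come from the description.

```python
def parse_measurement(raw: str) -> float:
    """Convert a raw record to a number, clamping censored values to their limit."""
    text = raw.strip()
    if text.startswith((">", "<")):
        return float(text[1:])  # ">65000" becomes 65000.0, "<0.1" becomes 0.1
    return float(text)


def encode_gender(gender: str) -> int:
    """Encode male as 1 and female as 0."""
    return {"male": 1, "female": 0}[gender.strip().lower()]


# Hypothetical raw record for one sample (field names are assumptions).
record = {"Gender": "Male", "Age": "57", "AFP": ">65000", "TB": "<0.1"}

# One row of the data matrix: each key is a column (variable).
row = {
    "Gender": encode_gender(record["Gender"]),
    "Age": parse_measurement(record["Age"]),
    "AFP": parse_measurement(record["AFP"]),
    "TB": parse_measurement(record["TB"]),
    "CLASS.Y.": 1,  # dependent variable: HCC group = 1, non-HCC group = 0
}
```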
Steps S30 to S50 operate on all of the plurality of independent variables. The screening of the plurality of independent variables to obtain the effective independent variables in step S30 may use the following Method one and/or Method two:
Method one: calculating the correlation coefficient between each independent variable and the dependent variable, and taking the independent variables whose correlation coefficient is larger than a second threshold as effective independent variables;
Method two: performing a t-test on each independent variable between the hepatocellular carcinoma group and the non-hepatocellular carcinoma group, and taking the independent variables whose p-value is smaller than a third threshold as effective independent variables.
The second threshold, the third threshold, and the first threshold in step S70 are not limited herein, and may be determined according to actual data distribution conditions. For example, a first threshold=0.8, a second threshold=0.8, and a third threshold=0.05.
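Method one can be sketched as below. Two points are assumptions rather than statements of the source: the Pearson correlation coefficient is used, and the absolute value of the coefficient is compared with the threshold so that negatively correlated variables are not discarded. The function names and toy data are illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient (undefined for a constant column)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def screen_by_correlation(columns, y, threshold):
    """Keep variables whose |correlation| with the dependent variable exceeds threshold."""
    return [name for name, x in columns.items() if abs(pearson(x, y)) > threshold]


# Toy data: y is the 0/1 dependent variable; "a" tracks it, "b" does not.
columns = {"a": [1, 2, 3, 4], "b": [1, -1, 1, -1]}
y = [0, 0, 1, 1]
effective = screen_by_correlation(columns, y, threshold=0.8)
```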
In step S50, the inventors of the present application found that Log10 conversion of some of the independent variables helps improve the prediction effect. Specifically, the indices that can be Log10-converted are selected from all the independent variables. Gender and age are not suitable for Log10 conversion; the miRNA7™ result is calculated from a model formula and contains negative values, so it is not suitable for Log10 conversion; the results of the 7 microRNAs hsa-miR-122, hsa-miR-192, hsa-miR-21, hsa-miR-223, hsa-miR-26a, hsa-miR-27a and hsa-miR-801 are already Log2-converted and are not suitable for Log10 conversion; the markers AFP, PLT, ALT, AST, TB, GGT, PT and INR can be Log10-converted and are therefore converted. Thereafter, in step S40, correlation analysis is performed among all the independent variables of step S10, namely Gender, Age, miRNA7™, AFP, hsa-miR-122, hsa-miR-192, hsa-miR-21, hsa-miR-223, hsa-miR-26a, hsa-miR-27a, hsa-miR-801, PLT, ALT, AST, TB, GGT, PT and INR, and the Log10-converted variables Log10 (AFP), Log10 (PLT), Log10 (ALT), Log10 (AST), Log10 (TB), Log10 (GGT), Log10 (PT) and Log10 (INR). The results are shown in FIG. 2.
Fig. 2 is a matrix diagram of the correlation coefficients between the indices involved in the embodiments of the present application. The variables on the diagonal of the matrix, from the upper left corner to the lower right corner, are: Age, GGT, Log10 (GGT), Log10 (ALT), Log10 (AST), ALT, AST, INR, Log10 (INR), PT, Log10 (PT), TB, Log10 (TB), AFP, Log10 (AFP), Gender, CLASS.Y., miRNA7™ (shown as miRNA7.TM.), PLT, Log10 (PLT), hsa-miR-223 (shown as miR.233), hsa-miR-21 (shown as miR.21), hsa-miR-26a (shown as miR.26a), hsa-miR-27a (shown as miR.27a), hsa-miR-801 (shown as miR.801), hsa-miR-122 (shown as miR.122), hsa-miR-192 (shown as miR.192). CLASS.Y. represents the dependent variable, i.e. the clinical sample group, and miRNA7.TM. represents the miRNA7™ detection result. The value in each small square is the correlation coefficient between the two indices corresponding to its abscissa and ordinate; the closer the absolute value of the correlation coefficient is to 1, the higher the correlation, and the correlation coefficients on the diagonal are 1. For example, the number in the small square in the second row of the first column is 0.08, indicating that the correlation coefficient of Age and GGT is 0.08 and that the two are hardly correlated. The color bar on the right side of FIG. 2 gives the correlation coefficient corresponding to the color of each small square: white in the middle is 0, indicating no correlation; blue at the top is 1, indicating complete positive correlation; red at the bottom is -1, indicating complete negative correlation.
As shown in FIG. 2, the correlations between hsa-miR-21, hsa-miR-26a, hsa-miR-27a, hsa-miR-122, hsa-miR-192, hsa-miR-223, hsa-miR-801 and the clinical diagnosis group (CLASS.Y.), from high to low, are: hsa-miR-122 > hsa-miR-192 > hsa-miR-21 > hsa-miR-801 > hsa-miR-223 > hsa-miR-26a > hsa-miR-27a. The three microRNAs most relevant to the clinical diagnosis group are hsa-miR-122, hsa-miR-192 and hsa-miR-21, and the correlations of hsa-miR-122 and hsa-miR-192 are especially high, several times to tens of times those of the other microRNAs. The result suggests that the 3 microRNAs hsa-miR-122, hsa-miR-192 and hsa-miR-21 should be preferentially selected in subsequent modeling, and that hsa-miR-26a and hsa-miR-27a, which have the weakest correlation with the clinical diagnosis group, should be excluded first in subsequent marker reduction.
As shown in FIG. 2, the correlations between hsa-miR-21, hsa-miR-26a, hsa-miR-27a, hsa-miR-122, hsa-miR-192, hsa-miR-223, hsa-miR-801 and the miRNA7™ detection result, from high to low, are: hsa-miR-122 > hsa-miR-192 > hsa-miR-223 > hsa-miR-21 > hsa-miR-801 > hsa-miR-26a > hsa-miR-27a. The three microRNAs most relevant to the miRNA7™ detection result are hsa-miR-122, hsa-miR-192 and hsa-miR-223, and the correlations of hsa-miR-122 and hsa-miR-192 are especially high, several times to tens of times those of the other microRNAs. The result suggests that the 3 microRNAs hsa-miR-122, hsa-miR-192 and hsa-miR-223 should be preferentially selected in subsequent modeling, and that hsa-miR-26a and hsa-miR-27a, which have the weakest correlation with the miRNA7™ detection result, should be removed first in subsequent marker reduction.
FIG. 2 also shows the correlations between different microRNAs. The correlation coefficient of hsa-miR-122 and hsa-miR-192 is 0.84, so the two are strongly correlated and can replace each other; the correlation coefficient of hsa-miR-26a and hsa-miR-27a is 0.78, that of hsa-miR-21 and hsa-miR-27a is 0.72, and that of hsa-miR-21 and hsa-miR-26a is 0.70, so hsa-miR-21, hsa-miR-26a and hsa-miR-27a are strongly correlated with one another and are substitutable to a certain degree; the correlation coefficient between hsa-miR-223 and the other microRNAs is at most 0.60, so hsa-miR-223 is moderately correlated with the other microRNAs but has no substitution relationship with them; the correlation coefficient between hsa-miR-801 and the other microRNAs is at most 0.52, so hsa-miR-801 is likewise moderately correlated with the other microRNAs but has no substitution relationship with them.
From the correlation analysis described above, it can first be found which microRNAs are most effective or least effective for distinguishing between different clinical diagnosis groups; second, it can be found which microRNAs play a key role in the miRNA7™ logistic regression model and which play a weaker role, so that the logistic regression model can be purposefully simplified; and third, it can be learned whether a strong correlation or substitution relationship exists between the microRNAs that play a key role, so that only one of the microRNAs having such a relationship is selected during modeling, which further simplifies the model and achieves the maximum distinguishing effect with the fewest parameters.
Likewise, correlation analysis of the tumor marker detection result and the different hematology indices among the independent variables, together with their Log10-converted variables, against the clinical diagnosis group (CLASS.Y.), against the miRNA7™ detection result, and among the hematology indices themselves makes it possible, first, to find which hematology indices are most effective or least effective for distinguishing between different clinical diagnosis groups; second, to find which hematology indices have a degree of substitution or correlation with miRNA7™; and third, to learn whether a strong correlation or substitution relationship exists between the hematology indices that play a key role, so that only one of the hematology indices having such a relationship needs to be selected during modeling, which further simplifies the model and achieves the maximum distinguishing effect with the fewest parameters.
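The third point above, grouping variables that have a substitution relationship, can be sketched with the microRNA correlation coefficients reported for FIG. 2. The cutoff of 0.70 is an assumption made only for this sketch (the description treats coefficients of 0.70 and above as substitutable and 0.60 as not, but states no explicit cutoff), and the grouping by connected components is one possible way to collect such pairs.

```python
def substitution_groups(pair_corr, threshold):
    """Group variables linked by |r| >= threshold into substitutable sets."""
    parent = {}

    def find(v):
        # Union-find lookup with path halving.
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for (a, b), r in pair_corr.items():
        ra, rb = find(a), find(b)
        if abs(r) >= threshold and ra != rb:
            parent[ra] = rb  # merge the two groups

    groups = {}
    for v in list(parent):
        groups.setdefault(find(v), set()).add(v)
    multi = [sorted(g) for g in groups.values() if len(g) > 1]
    return sorted(multi, key=lambda g: g[0])


# Pairwise correlation coefficients as reported for FIG. 2.
pairs = {
    ("hsa-miR-122", "hsa-miR-192"): 0.84,
    ("hsa-miR-26a", "hsa-miR-27a"): 0.78,
    ("hsa-miR-21", "hsa-miR-27a"): 0.72,
    ("hsa-miR-21", "hsa-miR-26a"): 0.70,
}
groups = substitution_groups(pairs, threshold=0.70)  # assumed cutoff
```

Selecting one member from each group then yields a reduced variable set in which no two retained microRNAs are strongly correlated.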
As shown in fig. 2, the correlation between the different hematological indices and their Log10 converted variables and the clinical diagnostic group (class.y.) was analyzed as follows:
1) The correlations between AFP, Log10AFP, ALT, Log10ALT, AST, Log10AST, GGT, Log10GGT, PLT, Log10PLT, PT, Log10PT, INR, Log10INR, TB, Log10TB and the clinical diagnosis group, from high to low, are: Log10AFP > Log10PLT > PLT > Log10INR > Log10PT > Log10AST > INR > PT > AFP > Log10GGT > Log10ALT > AST > ALT > Log10TB > GGT > TB. The three indices most relevant to the clinical diagnosis group are AFP, PLT and INR, with AFP being particularly highly relevant. Therefore, the three indices AFP, PLT and INR are preferentially selected in the subsequent modeling. Notably, after Log10 conversion the correlation between the AFP detection result and the clinical diagnosis group rises from 0.20 to 0.50, 2.5 times the original value, so Log10 (AFP) is adopted for AFP in the subsequent modeling; after Log10 conversion the correlation between the ALT detection result and the clinical diagnosis group rises substantially, from 0.07 to 0.18, 2.4 times the original value, so Log10 (ALT) is adopted for ALT in the subsequent modeling; after Log10 conversion the correlation between the AST detection result and the clinical diagnosis group rises from 0.10 to 0.26, 2.6 times the original value, so Log10 (AST) is adopted for AST in the subsequent modeling; after Log10 conversion the correlation between the GGT detection result and the clinical diagnosis group rises from 0.06 to 0.20, 3.5 times the original value, so Log10 (GGT) is adopted for GGT in the subsequent modeling; after Log10 conversion the correlation between the INR detection result and the clinical diagnosis group rises only slightly, from 0.24 to 0.28, 1.2 times the original value, so whether INR is Log10-converted in subsequent modeling has little influence on the model effect; the PLT detection result is negatively correlated with the clinical diagnosis group, and after Log10 conversion the correlation rises only slightly, from 0.37 to 0.38, 1.02 times the original value, so whether PLT is Log10-converted in subsequent modeling has little influence on the model effect; after Log10 conversion the correlation between the PT detection result and the clinical diagnosis group rises only slightly, from 0.23 to 0.27, 1.16 times the original value, so whether PT is Log10-converted in subsequent modeling has little influence on the model effect; after Log10 conversion the correlation between the TB detection result and the clinical diagnosis group rises from 0.01 to 0.07, 5.2 times the original value, and changes from negative to positive, but the correlation between Log10TB and the clinical diagnosis group is still low even after conversion, so whether TB is Log10-converted in subsequent modeling has little influence on the model effect. Taken together, applying Log10 conversion to individual hematology indices can better exploit their ability to distinguish between different clinical diagnosis groups. The comparison of the correlations between the hematology indices and the clinical diagnosis group before and after Log10 conversion is shown in Table 1.
TABLE 1 correlation comparison of AFP and hematology indicators before and after Log10 conversion with clinical diagnostic groups
2) Correlation analysis between AFP and the different hematology indices and the miRNA7™ detection result: as shown in FIG. 2, the correlations with the miRNA7™ detection result, from high to low, are: Log10ALT > Log10AST > Log10GGT > CLASS.Y. > ALT > Log10AFP > GGT > Log10PLT > AST > PLT > Gender > Log10PT > Log10INR > AFP > PT > Log10TB > INR > Age > TB. The three hematology indices most relevant to the miRNA7™ detection result are ALT, AST and GGT. Therefore, in subsequent joint modeling of miRNA7™ with other indices, the improvement these indices can bring is small; they are considered in the subsequent modeling process, but not as priority indices. Notably, among these hematology indices only PLT is negatively correlated with the miRNA7™ detection result; it can be given priority in the subsequent modeling, which helps balance the model and improve its overall performance.
3) Correlation analysis between AFP and different hematology indices: as shown in FIG. 2 and Table 2, the correlation coefficient of AFP and Log10AFP is 0.61, that is, the two are only moderately correlated and are substitutable only to a certain degree, so the choice between them has a larger influence on subsequent modeling; combined with the above analysis results, Log10AFP is clearly more suitable for subsequent modeling than AFP. The correlation coefficient of PLT and Log10PLT is 0.94, a strong correlation; the two can replace each other, and whether PLT is Log10-converted in subsequent modeling has little influence on the final model effect. The correlation coefficient of ALT and Log10ALT is 0.73; the two are strongly correlated and substitutable to a certain degree, but combined with the above analysis results, Log10ALT is clearly more suitable for subsequent modeling than ALT. The correlation coefficient of AST and Log10AST is 0.65; the two are moderately correlated and substitutable to a certain degree, but combined with the above analysis results, Log10AST is clearly more suitable for subsequent modeling than AST. The correlation coefficient of TB and Log10TB is 0.73; the two are strongly correlated and substitutable to a certain degree, but combined with the above analysis results, Log10TB is clearly more suitable for subsequent modeling than TB. The correlation coefficient of GGT and Log10GGT is 0.76; the two are strongly correlated and substitutable to a certain degree, but combined with the above analysis results, Log10GGT is clearly more suitable for subsequent modeling than GGT. The correlation coefficient of INR and Log10INR is 0.98, a strong correlation; the two can replace each other, and whether INR is Log10-converted in subsequent modeling has little influence on the final model effect. The correlation coefficient of glutamic pyruvic transaminase (ALT) and glutamic oxaloacetic transaminase (AST) is 0.76, so the two are fairly strongly correlated and can replace each other to a certain extent; after Log10 conversion the correlation coefficient of Log10ALT and Log10AST is 0.68, still a moderate correlation. The correlation coefficient of prothrombin time (PT) and international normalized ratio (INR) is 0.98, so the two are strongly correlated and can replace each other; after Log10 conversion the correlation coefficient of Log10INR and Log10PT is still as high as 0.97, so they remain interchangeable, and INR is the better choice in modeling.
TABLE 2 correlation analysis results between AFP and different hematology indices
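The before-and-after comparison used in step S50 to decide on Log10 conversion can be sketched as follows. It assumes the Pearson correlation coefficient and compares absolute values, so that a negatively correlated index such as PLT is handled the same way; the function names and the toy data are illustrative and are not data from the application. All input values must be positive, which is consistent with the description excluding variables that contain negative values.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient (undefined for a constant column)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def compare_log10(x, y):
    """Return (use_log10, |r| before conversion, |r| after conversion)."""
    r_raw = abs(pearson(x, y))
    r_log = abs(pearson([math.log10(v) for v in x], y))
    return r_log > r_raw, r_raw, r_log


# Toy, strongly right-skewed marker values against a 0/1 group code.
marker = [1, 10, 100, 10000]
group = [0, 0, 1, 1]
use_log10, r_before, r_after = compare_log10(marker, group)
```

A fold change such as the reported rise for AFP from 0.20 to 0.50 corresponds to the ratio of the second and third returned values.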
In step S30, Method two is adopted: a t-test is performed on each independent variable between the hepatocellular carcinoma group and the non-hepatocellular carcinoma group, and the independent variables whose p-value is smaller than the third threshold are taken as effective independent variables. The results are as follows:
The t-test on the expression levels of the different microRNAs in the HCC group and the non-HCC group is performed with the geom_sign() function in the R language. The specific steps are: 1. input the two sets of expression level data of a given microRNA in the HCC group and the non-HCC group as the two samples of the t-test; 2. calculate the mean and standard deviation of the two samples; 3. assuming that the two samples are independent, calculate the degrees of freedom; 4. calculate the t statistic; 5. look up the corresponding p-value in the t distribution table; 6. if the p-value is less than the significance level (default 0.05), the difference is considered statistically significant, and the smaller the p-value, the greater the effect of the independent variable on sample classification. The t-test results of all microRNAs are shown in Table 3. Judging from the p-values, hsa-miR-26a and hsa-miR-27a are the two microRNAs with the weakest ability to distinguish the hepatocellular carcinoma group from the non-hepatocellular carcinoma group among the 7 microRNAs.
TABLE 3 t-test results of different microRNAs between liver cell carcinoma group (HCC) and non-liver cell carcinoma group (non-HCC)
No.  microRNA group                             p-value
1    hsa-miR-21_HCC vs. hsa-miR-21_non-HCC      p<2.22e-16
2    hsa-miR-26a_HCC vs. hsa-miR-26a_non-HCC    4.5e-06
3    hsa-miR-27a_HCC vs. hsa-miR-27a_non-HCC    0.0029
4    hsa-miR-122_HCC vs. hsa-miR-122_non-HCC    p<2.22e-16
5    hsa-miR-192_HCC vs. hsa-miR-192_non-HCC    p<2.22e-16
6    hsa-miR-223_HCC vs. hsa-miR-223_non-HCC    p<2.22e-16
7    hsa-miR-801_HCC vs. hsa-miR-801_non-HCC    p<2.22e-16
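The six t-test steps described above can be sketched in a few lines. Two assumptions are made here and are not taken from the source: the unequal-variance (Welch) form of the test is used, and the p-value is obtained from a normal approximation to the t distribution instead of a t table, which is reasonable only when the degrees of freedom are large, as with thousands of samples per group. The function name and toy data are illustrative.

```python
import math
from statistics import NormalDist, mean, variance


def welch_t_test(a, b):
    """Return (t statistic, degrees of freedom, approximate two-sided p-value)."""
    na, nb = len(a), len(b)
    # Squared standard errors of the two sample means.
    va, vb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    # Assumption: normal approximation, valid only for large df.
    p = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
    return t, df, p


# Toy expression values for one marker in two groups (not real data).
hcc = [5.1, 4.9, 5.3, 5.0, 5.2]
non_hcc = [4.0, 4.2, 3.9, 4.1, 4.3]
t_stat, dof, p_value = welch_t_test(hcc, non_hcc)
significant = p_value < 0.05  # 0.05 is the default significance level in the text
```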
In step S70, when modeling with multiple logistic regression (Multinomial Logistic Regression), multicollinearity between variables must be avoided, since it would otherwise affect the stability of the model. From the above analysis results and conclusions: AFP and each hematology index are highly correlated with their own Log10-converted values, so an index and its Log10-converted form are not suitable for modeling together; PT, INR, Log10PT and Log10INR are strongly correlated with one another and are therefore not suitable for modeling together, and since the overall effect of INR is better than that of PT, either INR or Log10INR is selected in subsequent modeling; the two microRNAs hsa-miR-122 and hsa-miR-192 are strongly correlated, so these two variables are also not suitable for modeling together. Other independent variables are also correlated to varying degrees, but these correlations are not considered here, so that the distinguishing effect of each independent variable on the dependent variable can be evaluated more comprehensively.
A multiple logistic regression model 0-1 is constructed with the variables Gender, Age, AFP, hsa-miR-122, hsa-miR-21, hsa-miR-223, hsa-miR-26a, hsa-miR-27a, hsa-miR-801, PLT, ALT, AST, TB, GGT and INR as independent variables, together with the dependent variable. The OR value, p-value and 95% confidence interval of the OR value obtained for each variable are collated in Table 4 below, sorted by p-value in ascending order; variables nearer the top should be prioritized in subsequent modeling. The first 10 variables in ascending order of p-value are PLT, hsa-miR-122, Gender, Age, AFP, INR, TB, hsa-miR-27a, ALT and GGT. Notably, the p-value of hsa-miR-21 is significantly higher than those of hsa-miR-26a and hsa-miR-27a, which clearly conflicts with the conclusions of the preceding correlation analysis and t-test; this is because the three variables hsa-miR-21, hsa-miR-26a and hsa-miR-27a are strongly correlated and collinear, which affects their evaluation.
TABLE 4 results of significance analysis of different variables (AFP and hematology index without Log10 conversion) in multiple logistic regression model
A multiple logistic regression model 0-2 is constructed with the variables Gender, Age, Log10AFP, hsa-miR-122, hsa-miR-21, hsa-miR-223, hsa-miR-26a, hsa-miR-27a, hsa-miR-801, Log10PLT, Log10ALT, Log10AST, Log10TB, Log10GGT and Log10INR as independent variables, together with the dependent variable. The OR value, p-value and 95% confidence interval of the OR value obtained for each variable are collated in Table 5 below, sorted by p-value in ascending order; variables nearer the top should be prioritized in subsequent modeling, and the significant independent variables ranked ahead of the intercept term can be given priority. In ascending order of p-value these are Log10AFP, Log10PLT, Gender, Age, hsa-miR-122, Log10INR and Log10TB. Notably, the p-value of hsa-miR-21 is significantly higher than those of hsa-miR-26a and hsa-miR-27a, which clearly conflicts with the conclusions of the preceding correlation analysis and t-test; this is because the three variables hsa-miR-21, hsa-miR-26a and hsa-miR-27a are strongly correlated and collinear, which affects their evaluation.
Table 5. Significance analysis of the variables in the multiple logistic regression model (AFP and hematology indices after Log10 transformation)
Combining the results of the two logistic regressions, the 7 most important variables in model 0-1 are PLT, hsa-miR-122, Gender, Age, AFP, INR and TB, and the 7 most important variables in model 0-2 are Log10AFP, Log10PLT, Gender, Age, hsa-miR-122, Log10INR and Log10TB. Although the variable names differ, the underlying indices are the same: gender (Gender), age (Age), alpha-fetoprotein (AFP), platelet count (PLT), international normalized ratio (INR), total bilirubin (TB) and hsa-miR-122. These indices are given priority in the subsequent construction of prediction models.
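The OR values and 95% confidence intervals reported in Tables 4 and 5 can be related to the fitted regression coefficients. The text does not state how the intervals were computed; the sketch below assumes the common convention OR = exp(β) with a Wald interval exp(β ± 1.96×SE). The function name and the numeric inputs are hypothetical and are not values from the tables.

```python
import math

def odds_ratio_with_ci(beta, se, z=1.96):
    """Return (OR, lower, upper) for a logistic regression coefficient.

    Assumes a Wald-type 95% interval; this is an assumption, not a
    method confirmed by the source text.
    """
    odds_ratio = math.exp(beta)
    lower = math.exp(beta - z * se)
    upper = math.exp(beta + z * se)
    return odds_ratio, lower, upper

# Hypothetical coefficient and standard error, for illustration only.
example_or, example_lower, example_upper = odds_ratio_with_ci(beta=0.5, se=0.1)
print(f"OR={example_or:.3f}, 95% CI=({example_lower:.3f}, {example_upper:.3f})")
```

An interval that excludes 1 corresponds to a coefficient that differs significantly from 0 at the 5% level under this convention.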
The study collected data for a total of 20,191 samples, which were divided into a training set and a test set by year of acquisition in step S60. Specifically, the sample data collected from 2018 to 2020 were used as the training set, containing 11,450 samples in total; the sample data collected in 2021 were used as an independent test set, containing 8,741 samples in total. Holding out the most recent year as the test set allows the generalization ability of the model to be verified more rigorously and guards against overfitting caused by the model relying too heavily on the data distribution of the training set. This division of training and test sets allows the performance of the model to be evaluated more comprehensively and fairly.
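A minimal sketch of the year-based split described above, assuming each sample is a record with a `year` field. The record layout, field names and toy data are hypothetical; only the year boundaries come from the text.

```python
TRAIN_YEARS = {2018, 2019, 2020}  # years used for training, per the text
TEST_YEARS = {2021}               # held-out year used as the independent test set

def split_by_year(samples):
    """Partition sample records into (train, test) by acquisition year."""
    train = [s for s in samples if s["year"] in TRAIN_YEARS]
    test = [s for s in samples if s["year"] in TEST_YEARS]
    return train, test

# Toy records, for illustration only.
toy_samples = [
    {"id": "a", "year": 2018},
    {"id": "b", "year": 2019},
    {"id": "c", "year": 2020},
    {"id": "d", "year": 2021},
]
toy_train, toy_test = split_by_year(toy_samples)
print(len(toy_train), len(toy_test))
```

Splitting by time rather than at random means the test set is never seen during training and reflects the most recently collected samples.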
According to the above-described step S80, 12 first models are obtained. On the basis of the relationships among the independent variables obtained in step S40, independent variables in the first models that have a substitution relationship are substituted and combined, and regression models are constructed using the substituted and combined independent variables, giving 29 second models. The present application therefore obtains 41 models in total, which are described one by one in Examples 2 to 42. It should be noted that all models use the same training set and the same test set. The receiver operating characteristic (ROC) curve and the area under the curve (AUC) were used as evaluation indices for each model, and were calculated on the training set, the validation sets and the test set. The ROC curve intuitively reflects the classification performance of a model by jointly assessing its ability to identify positive and negative samples; the AUC value quantifies the ROC curve and thereby summarizes the sensitivity and specificity of the model's predictions. ROC and AUC analysis together allow the performance of the constructed prediction models to be evaluated comprehensively and provide a basis for subsequent model optimization and clinical application.
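For reference, the AUC can be computed directly as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative sample, with ties counted as one half. The following is a minimal pure-Python sketch of that definition; the function name and the toy labels and scores are hypothetical, and the source does not state which software was used for its ROC analysis.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of positive and negative scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))

# Toy data, for illustration only.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
toy_auc = roc_auc(toy_labels, toy_scores)
print(round(toy_auc, 3))
```

The pairwise form is quadratic in the number of samples; for data sets of the size described here, a rank-based implementation from a statistics library would be the practical choice.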
Example 2 establishment and verification of logistic regression prediction model 1
On the training set, a logistic regression model containing the two independent variables log10(AFP) and hsa-miR-122 was constructed to predict the likelihood that a sample has liver cancer. First, prediction model 1 was obtained by the logistic regression algorithm from the AFP data, the hsa-miR-122 expression data and the sample labels. The mathematical expression of prediction model 1 is: Logit(P) = β1×log10(AFP) + β2×hsa-miR-122 + β0, wherein Logit(P) is the log-odds of classifying a sample as positive, AFP and hsa-miR-122 are the AFP test result and the expression level of hsa-miR-122 respectively, β1 and β2 are the regression coefficients of log10(AFP) and hsa-miR-122, and β0 is the intercept term. Through model training, the value range of β1 is [1.515, 1.68], the value range of β2 is [-0.325, -0.269], and the value range of β0 is [5.027, 6.354]. Preferably, β1 is 1.60, β2 is -0.30 and β0 is 5.69, which gives the optimal prediction result. The established logistic regression model can make predictions from the AFP test result and hsa-miR-122 expression level of a new sample to assess the likelihood that the sample has liver cancer.
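As an illustration of how prediction model 1 is applied to a new sample, the sketch below evaluates Logit(P) with the preferred coefficients and converts it to a probability with the logistic function P = 1 / (1 + exp(-Logit(P))). The function names and input values are hypothetical; hsa-miR-122 is assumed to be supplied on the same scale as the training data, which this section does not specify.

```python
import math

# Preferred coefficients of prediction model 1, as given in the text.
BETA_AFP = 1.60       # coefficient of log10(AFP)
BETA_MIR122 = -0.30   # coefficient of hsa-miR-122
BETA_0 = 5.69         # intercept

def model1_logit(afp, mir122):
    """Logit(P) = b1*log10(AFP) + b2*hsa-miR-122 + b0."""
    if afp <= 0:
        raise ValueError("AFP must be positive to take log10")
    return BETA_AFP * math.log10(afp) + BETA_MIR122 * mir122 + BETA_0

def model1_probability(afp, mir122):
    """Convert the log-odds to a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-model1_logit(afp, mir122)))

# Hypothetical inputs, chosen only to exercise the formula.
example_logit = model1_logit(afp=10.0, mir122=20.0)
example_p = model1_probability(afp=10.0, mir122=20.0)
print(f"logit={example_logit:.2f}, P={example_p:.3f}")
```

Turning the probability into a positive or negative call requires a decision threshold, which is not given in this section.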
The ROC curve and corresponding AUC of prediction model 1 on the training set are shown in Fig. 3. The ROC curve shows good overall performance, with an AUC of 0.81, indicating that the model discriminates well between liver cancer and non-liver cancer samples.
To comprehensively verify the stability and generalization ability of the model, internal validation was performed by 5-fold cross-validation. Specifically, the training set data are divided into 5 mutually exclusive subsets; in each round, 4 subsets are used for training and the remaining subset is used as the validation set, for 5 rounds of training and validation in total. 5-fold cross-validation thus measures the stability of the model across different subsets of the data. The ROC curves and corresponding AUC results of the 5-fold cross-validation on the training set are shown in Fig. 4. Across the 5 folds, the model achieves a mean AUC of 0.81 with a standard deviation of 0.01 on the validation sets, which meets the preset performance target, shows that the model is stable and effective, and supports its practical application.
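The partitioning step of 5-fold cross-validation can be sketched as follows. This is a minimal illustration of dividing sample indices into 5 mutually exclusive folds and rotating the validation fold; the function name, the shuffling seed and the toy sample count are assumptions, and the source does not say whether the folds were shuffled or stratified.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of k rounds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # assumed: shuffle before splitting
    folds = [indices[i::k] for i in range(k)]     # k mutually exclusive subsets
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

# Toy example with 10 samples, for illustration only.
toy_splits = list(k_fold_indices(10, k=5))
for round_no, (train_idx, val_idx) in enumerate(toy_splits, start=1):
    print(round_no, sorted(val_idx))
```

In each round a model would be fitted on `train_idx` and its AUC computed on `val_idx`; the mean and standard deviation of the 5 AUC values are the summary figures reported in the text.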
After satisfactory performance was obtained on the training set, prediction model 1 was applied to the independent test set, and the evaluation indices of the model on the test set were calculated. This step verifies how well the model generalizes from the training set to the test set and checks whether the model overfits when applied to unseen samples. The ROC curve and corresponding AUC of prediction model 1 on the test set are shown in Fig. 5. The ROC curve shows good overall performance on the test set, with an AUC of 0.81, indicating that prediction model 1 retains a good ability to distinguish liver cancer samples from non-liver cancer samples. The test-set AUC is close to the training-set AUC, which indicates that the model is not overfitted and generalizes well. Overall, prediction model 1 performs well on the test set and can proceed to clinical validation and application.
With prediction model 1, the likelihood that a subject has hepatocellular carcinoma can be predicted from only two indices of the subject, AFP and hsa-miR-122, at low cost and with high efficiency.
In Examples 3 to 42, the same method as in Example 2 was used to obtain the ROC curve and corresponding AUC on the training set, the ROC curves and corresponding AUC under 5-fold cross-validation, and the ROC curve and corresponding AUC on the test set, in order to evaluate the predictive performance of prediction models 2 to 41. Only the results are described below; the identical procedure and the associated ROC figures are not repeated.
Example 3 establishment and verification of logistic regression prediction model 2
Prediction model 2 replaces the variable hsa-miR-122 of prediction model 1 with hsa-miR-192. The mathematical expression of prediction model 2 is: Logit(P) = β1×log10(AFP) + β2×hsa-miR-192 + β0, wherein, through model training, the value range of β1 is [1.526, 1.690], the value range of β2 is [-0.327, -0.252], and the value range of β0 is [5.255, 7.23]. Preferably, β1 is 1.61, β2 is -0.29 and β0 is 6.24, which gives the optimal prediction result. The established logistic regression model can make predictions from the AFP test result and hsa-miR-192 expression level of a new sample to assess the likelihood that the sample has liver cancer.
On the training set, the ROC curve of prediction model 2 shows good overall performance, with an AUC of 0.79, indicating that the model discriminates well between liver cancer and non-liver cancer samples. In 5-fold cross-validation on the training set, the mean AUC is 0.79 with a standard deviation of 0.01, which meets the preset performance target and shows that the model is stable and effective. On the test set, the ROC curve again shows good overall performance, with an AUC of 0.80, so the model retains a good ability to distinguish liver cancer samples from non-liver cancer samples. The test-set AUC is close to the training-set AUC, which indicates that the model is not overfitted and generalizes well. Overall, the model performs well on the test set and can proceed to clinical validation and application.
Example 4 establishment and verification of logistic regression prediction model 3
Prediction model 3 adds the variable Gender to prediction model 1. The mathematical expression of prediction model 3 is: Logit(P) = β1×Gender + β2×log10(AFP) + β3×hsa-miR-122 + β0, wherein the value range of coefficient β1 is [0.957, 1.171]; the value range of coefficient β2 is [1.520, 1.687]; the value range of coefficient β3 is [-0.282, -0.225]; and the value range of coefficient β0 is [3.153, 4.549]. Preferably, β1 is 1.06, β2 is 1.60, β3 is -0.25 and β0 is 3.85. The established logistic regression model can make predictions from the Gender, AFP test result and hsa-miR-122 expression level of a new sample to assess the likelihood that the sample has liver cancer.
On the training set, the ROC curve of prediction model 3 shows good overall performance, with an AUC of 0.83, indicating that the model discriminates well between liver cancer and non-liver cancer samples. In 5-fold cross-validation on the training set, the mean AUC is 0.83 with a standard deviation of 0.01, which meets the preset performance target and shows that the model is stable and effective. On the test set, the model again shows good overall performance, with an AUC of 0.83, so it retains a good ability to distinguish liver cancer samples from non-liver cancer samples. The test-set AUC is close to the training-set AUC, which indicates that the model is not overfitted and generalizes well. Overall, the model performs well on the test set and can proceed to clinical validation and application.
Example 5 establishment and verification of logistic regression prediction model 4
Prediction model 4 replaces the variable hsa-miR-122 of prediction model 3 with hsa-miR-192. The mathematical expression of prediction model 4 is: Logit(P) = β1×Gender + β2×log10(AFP) + β3×hsa-miR-192 + β0, wherein the value range of coefficient β1 is [1.012, 1.225]; the value range of coefficient β2 is [1.538, 1.706]; the value range of coefficient β3 is [-0.265, -0.189]; and the value range of coefficient β0 is [2.723, 4.78]. Preferably, β1 is 1.12, β2 is 1.62, β3 is -0.23 and β0 is 3.75. The established logistic regression model can make predictions from the Gender, AFP test result and hsa-miR-192 expression level of a new sample to assess the likelihood that the sample has liver cancer.
On the training set, the ROC curve of prediction model 4 shows good overall performance, with an AUC of 0.82, indicating that the model discriminates well between liver cancer and non-liver cancer samples. In 5-fold cross-validation on the training set, the mean AUC is 0.82 with a standard deviation of 0.01, which meets the preset performance target and shows that the model is stable and effective. On the test set, the ROC curve again shows good overall performance, with an AUC of 0.83, so the model retains a good ability to distinguish liver cancer samples from non-liver cancer samples. The test-set AUC is close to the training-set AUC, which indicates that the model is not overfitted and generalizes well. Overall, the model performs well on the test set and can proceed to clinical validation and application.
Example 6 establishment and verification of logistic regression prediction model 5
Prediction model 5 replaces the variable Gender of prediction model 3 with the variable log10(PLT). The mathematical expression of prediction model 5 is: Logit(P) = β1×log10(AFP) + β2×hsa-miR-122 + β3×log10(PLT) + β0, wherein the value range of coefficient β1 is [1.446, 1.615]; the value range of coefficient β2 is [-0.297, -0.238]; the value range of coefficient β3 is [-4.515, -3.983]; and the value range of coefficient β0 is [13.611, 15.439]. Preferably, β1 is 1.53, β2 is -0.27, β3 is -4.25 and β0 is 14.52. The established logistic regression model can make predictions from the PLT and AFP test results and the hsa-miR-122 expression level of a new sample to assess the likelihood that the sample has liver cancer.
On the training set, the ROC curve of prediction model 5 shows good overall performance, with an AUC of 0.86, indicating that the model discriminates well between liver cancer and non-liver cancer samples. In 5-fold cross-validation on the training set, the mean AUC is 0.86 with a standard deviation of 0.00, which meets the preset performance target and shows that the model is stable and effective. On the test set, the ROC curve again shows good overall performance, with an AUC of 0.86, so the model retains a good ability to distinguish liver cancer samples from non-liver cancer samples. The test-set AUC is close to the training-set AUC, which indicates that the model is not overfitted and generalizes well. Overall, the model performs well on the test set and can proceed to clinical validation and application.
Example 7 establishment and verification of logistic regression prediction model 6
Prediction model 6 replaces the variable hsa-miR-122 of prediction model 5 with hsa-miR-192. The mathematical expression of prediction model 6 is: Logit(P) = β1×log10(AFP) + β2×hsa-miR-192 + β3×log10(PLT) + β0, wherein the value range of coefficient β1 is [1.457, 1.625]; the value range of coefficient β2 is [-0.302, -0.223]; the value range of coefficient β3 is [-4.628, -4.094]; and the value range of coefficient β0 is [14.111, 16.522]. Preferably, β1 is 1.54, β2 is -0.26, β3 is -4.36 and β0 is 15.32. The established logistic regression model can make predictions from the PLT and AFP test results and the hsa-miR-192 expression level of a new sample to assess the likelihood that the sample has liver cancer.
On the training set, the ROC curve of prediction model 6 shows good overall performance, with an AUC of 0.86, indicating that the model discriminates well between liver cancer and non-liver cancer samples. In 5-fold cross-validation on the training set, the mean AUC is 0.86 with a standard deviation of 0.01, which meets the preset performance target and shows that the model is stable and effective. On the test set, the ROC curve of prediction model 6 again shows good overall performance, with an AUC of 0.86, so the model retains a good ability to distinguish liver cancer samples from non-liver cancer samples. The test-set AUC is close to the training-set AUC, which indicates that the model is not overfitted and generalizes well. Overall, the model performs well on the test set and can proceed to clinical validation and application.
Example 8 establishment and verification of logistic regression prediction model 7
Prediction model 7 adds the independent variable Age to prediction model 3. The mathematical expression of prediction model 7 is: Logit(P) = β1×Gender + β2×Age + β3×log10(AFP) + β4×hsa-miR-122 + β0, wherein the value range of coefficient β1 is [0.975, 1.196]; the value range of coefficient β2 is [0.041, 0.050]; the value range of coefficient β3 is [1.559, 1.731]; the value range of coefficient β4 is [-0.335, -0.275]; and the value range of coefficient β0 is [1.709, 3.157]. Preferably, β1 is 1.09, β2 is 0.05, β3 is 1.64, β4 is -0.30 and β0 is 2.43. The established logistic regression model can make predictions from the Age, Gender, AFP test result and hsa-miR-122 expression level of a new sample to assess the likelihood that the sample has liver cancer.
On the training set, the ROC curve of prediction model 7 shows good overall performance, with an AUC of 0.85, indicating that the model discriminates well between liver cancer and non-liver cancer samples. In 5-fold cross-validation on the training set, the mean AUC is 0.85 with a standard deviation of 0.01, which meets the preset performance target and shows that the model is stable and effective. On the test set, the ROC curve again shows good overall performance, with an AUC of 0.86, so the model retains a good ability to distinguish liver cancer samples from non-liver cancer samples. The test-set AUC is close to the training-set AUC, which indicates that the model is not overfitted and generalizes well. Overall, the model performs well on the test set and can proceed to clinical validation and application.
Example 9 establishment and verification of logistic regression prediction model 8
Prediction model 8 replaces the variable hsa-miR-122 of prediction model 7 with hsa-miR-192. The mathematical expression of prediction model 8 is: Logit(P) = β1×Gender + β2×Age + β3×log10(AFP) + β4×hsa-miR-192 + β0, wherein the value range of coefficient β1 is [1.003, 1.222]; the value range of coefficient β2 is [0.044, 0.052]; the value range of coefficient β3 is [1.568, 1.739]; the value range of coefficient β4 is [-0.39, -0.308]; and the value range of coefficient β0 is [3.146, 5.262]. Preferably, β1 is 1.11, β2 is 0.05, β3 is 1.65, β4 is -0.35 and β0 is 4.20. The established logistic regression model can make predictions from the Age, Gender, AFP test result and hsa-miR-192 expression level of a new sample to assess the likelihood that the sample has liver cancer.
On the training set, the ROC curve of prediction model 8 shows good overall performance, with an AUC of 0.84, indicating that the model discriminates well between liver cancer and non-liver cancer samples. In 5-fold cross-validation on the training set, the mean AUC is 0.84 with a standard deviation of 0.00, which meets the preset performance target and shows that the model is stable and effective. On the test set, the ROC curve again shows good overall performance, with an AUC of 0.86, so the model retains a good ability to distinguish liver cancer samples from non-liver cancer samples. The test-set AUC is close to the training-set AUC, which indicates that the model is not overfitted and generalizes well. Overall, the model performs well on the test set and can proceed to clinical validation and application.
Example 10 establishment and verification of logistic regression prediction model 9
Prediction model 9 replaces the variable Age of prediction model 7 with log10(PLT). The mathematical expression of prediction model 9 is: Logit(P) = β1×log10(AFP) + β2×Gender + β3×hsa-miR-122 + β4×log10(PLT) + β0, wherein the value range of coefficient β1 is [0.975, 1.196]; the value range of coefficient β2 is [0.041, 0.050]; the value range of coefficient β3 is [1.559, 1.731]; the value range of coefficient β4 is [-0.335, -0.275]; and the value range of coefficient β0 is [1.709, 3.157]. Preferably, β1 is 1.53, β2 is 0.99, β3 is -0.23, β4 is -4.13 and β0 is 12.66. The established logistic regression model can make predictions from the PLT, Gender, AFP test result and hsa-miR-122 expression level of a new sample to assess the likelihood that the sample has liver cancer.
On the training set, the ROC curve of prediction model 9 shows good overall performance, with an AUC of 0.87, indicating that the model discriminates well between liver cancer and non-liver cancer samples. In 5-fold cross-validation on the training set, the mean AUC is 0.87 with a standard deviation of 0.01, which meets the preset performance target and shows that the model is stable and effective. On the test set, the ROC curve again shows good overall performance, with an AUC of 0.87, so the model retains a good ability to distinguish liver cancer samples from non-liver cancer samples. The test-set AUC is close to the training-set AUC, which indicates that the model is not overfitted and generalizes well. Overall, the model performs well on the test set and can proceed to clinical validation and application.
Example 11 establishment and verification of logistic regression prediction model 10
Prediction model 10 replaces the variable hsa-miR-122 of prediction model 9 with hsa-miR-192. The mathematical expression of prediction model 10 is: Logit(P) = β1×log10(AFP) + β2×Gender + β3×hsa-miR-192 + β4×log10(PLT) + β0, wherein the value range of coefficient β1 is [1.461, 1.631]; the value range of coefficient β2 is [0.914, 1.144]; the value range of coefficient β3 is [-0.250, -0.169]; the value range of coefficient β4 is [-4.490, -3.950]; and the value range of coefficient β0 is [11.584, 14.070]. Preferably, β1 is 1.55, β2 is 1.03, β3 is -0.21, β4 is -4.22 and β0 is 12.83. The established logistic regression model can make predictions from the PLT, Gender, AFP test result and hsa-miR-192 expression level of a new sample to assess the likelihood that the sample has liver cancer.
On the training set, the ROC curve of prediction model 10 shows good overall performance, with an AUC of 0.87, indicating that the model discriminates well between liver cancer and non-liver cancer samples. In 5-fold cross-validation on the training set, the mean AUC is 0.87 with a standard deviation of 0.01, which meets the preset performance target and shows that the model is stable and effective. On the test set, the ROC curve again shows good overall performance, with an AUC of 0.87, so the model retains a good ability to distinguish liver cancer samples from non-liver cancer samples. The test-set AUC is close to the training-set AUC, which indicates that the model is not overfitted and generalizes well. Overall, the model performs well on the test set and can proceed to clinical validation and application.
Example 12 establishment and verification of logistic regression prediction model 11
Prediction model 11 adds the variable Age to prediction model 9. The mathematical expression of prediction model 11 is: Logit(P) = β1×log10(AFP) + β2×Gender + β3×Age + β4×hsa-miR-122 + β5×log10(PLT) + β0, wherein the value range of coefficient β1 is [1.481, 1.654]; the value range of coefficient β2 is [0.893, 1.128]; the value range of coefficient β3 is [0.033, 0.041]; the value range of coefficient β4 is [-0.309, -0.247]; the value range of coefficient β5 is [-4.050, -3.512]; and the value range of coefficient β0 is [9.842, 11.781]. Preferably, β1 is 1.57, β2 is 1.01, β3 is 0.04, β4 is -0.28, β5 is -3.78 and β0 is 10.81. The established logistic regression model can make predictions from the Age, PLT, Gender, AFP test result and hsa-miR-122 expression level of a new sample to assess the likelihood that the sample has liver cancer.
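Because all of the prediction models share the same linear form, a single routine can evaluate any of them from a table of coefficients. The sketch below does this for prediction model 11 with its preferred coefficients. The dictionary keys, function names and subject values are hypothetical, and the coding of Gender as 1 or 0 is an assumption, since this section does not state how Gender is encoded.

```python
import math

# Preferred coefficients of prediction model 11, as given in the text.
MODEL_11_COEFS = {
    "log10_AFP": 1.57,
    "Gender": 1.01,       # assumed numeric coding (e.g. 1 or 0); not stated in the text
    "Age": 0.04,
    "hsa-miR-122": -0.28,
    "log10_PLT": -3.78,
}
MODEL_11_INTERCEPT = 10.81

def linear_predictor(coefs, intercept, features):
    """Return Logit(P) = intercept + sum(coef * feature value)."""
    missing = set(coefs) - set(features)
    if missing:
        raise KeyError(f"missing features: {sorted(missing)}")
    return intercept + sum(coefs[name] * features[name] for name in coefs)

def predict_probability(coefs, intercept, features):
    """Apply the logistic function to the linear predictor."""
    return 1.0 / (1.0 + math.exp(-linear_predictor(coefs, intercept, features)))

# Hypothetical subject, for illustration only.
subject = {
    "log10_AFP": math.log10(10.0),
    "Gender": 1,
    "Age": 50,
    "hsa-miR-122": 20.0,
    "log10_PLT": math.log10(100.0),
}
subject_logit = linear_predictor(MODEL_11_COEFS, MODEL_11_INTERCEPT, subject)
subject_p = predict_probability(MODEL_11_COEFS, MODEL_11_INTERCEPT, subject)
print(f"logit={subject_logit:.2f}, P={subject_p:.3f}")
```

Evaluating a different model only requires swapping in its coefficient table and intercept, for example replacing the `hsa-miR-122` entry with an `hsa-miR-192` entry for prediction model 12.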
On the training set, the ROC curve of prediction model 11 shows good overall performance, with an AUC of 0.88, indicating that the model discriminates well between liver cancer and non-liver cancer samples. In 5-fold cross-validation on the training set, the mean AUC is 0.88 with a standard deviation of 0.01, which meets the preset performance target and shows that the model is stable and effective. On the test set, the ROC curve of prediction model 11 again shows good overall performance, with an AUC of 0.89, so the model retains a good ability to distinguish liver cancer samples from non-liver cancer samples. The test-set AUC is close to the training-set AUC, which indicates that the model is not overfitted and generalizes well. Overall, the model performs well on the test set and can proceed to clinical validation and application.
Example 13 establishment and verification of logistic regression prediction model 12
Prediction model 12 replaces the variable hsa-miR-122 of prediction model 11 with hsa-miR-192. The mathematical expression of prediction model 12 is: Logit(P) = β1×log10(AFP) + β2×Gender + β3×Age + β4×hsa-miR-192 + β5×log10(PLT) + β0, wherein the value range of coefficient β1 is [1.489, 1.663]; the value range of coefficient β2 is [0.916, 1.150]; the value range of coefficient β3 is [0.034, 0.043]; the value range of coefficient β4 is [-0.354, -0.268]; the value range of coefficient β5 is [-4.109, -3.568]; and the value range of coefficient β0 is [11.149, 13.662]. Preferably, β1 is 1.58, β2 is 1.03, β3 is 0.04, β4 is -0.31, β5 is -3.84 and β0 is 12.41. The established logistic regression model can make predictions from the Age, PLT, Gender, AFP test result and hsa-miR-192 expression level of a new sample to assess the likelihood that the sample has liver cancer.
On the training set, the ROC curve of prediction model 12 shows good overall performance, with an AUC of 0.88, indicating that the model discriminates well between liver cancer and non-liver cancer samples. In 5-fold cross-validation on the training set, the mean AUC is 0.88 with a standard deviation of 0.01, which meets the preset performance target and shows that the model is stable and effective. On the test set, the ROC curve again shows good overall performance, with an AUC of 0.89, so the model retains a good ability to distinguish liver cancer samples from non-liver cancer samples. The test-set AUC is close to the training-set AUC, which indicates that the model is not overfitted and generalizes well. Overall, the model performs well on the test set and can proceed to clinical validation and application.
Example 14 establishment and verification of logistic regression prediction model 13
Prediction model 13 is based on prediction model 11, with the variables Age and PLT removed and the variables hsa-miR-21 and hsa-miR-223 added. The mathematical expression of prediction model 13 is: Logit(P) = β1×Gender + β2×log10(AFP) + β3×hsa-miR-21 + β4×hsa-miR-122 + β5×hsa-miR-223 + β0, wherein the value range of coefficient β1 is [0.925, 1.143]; the value range of coefficient β2 is [1.510, 1.679]; the value range of coefficient β3 is [-0.076, 0.034]; the value range of coefficient β4 is [-0.335, -0.269]; the value range of coefficient β5 is [0.234, 0.317]; and the value range of coefficient β0 is [-2.089, 0.526]. Preferably, β1 is 1.03, β2 is 1.59, β3 is -0.02, β4 is -0.30, β5 is 0.28 and β0 is -0.78. The established logistic regression model can make predictions from the Gender, AFP, hsa-miR-21, hsa-miR-122 and hsa-miR-223 data of a new sample to assess the likelihood that the sample has liver cancer.
On the training set, the ROC curve of prediction model 13 shows good overall performance, with an AUC of 0.84, indicating that the model discriminates well between liver cancer and non-liver cancer samples. In 5-fold cross-validation on the training set, the mean AUC is 0.84 with a standard deviation of 0.01, which meets the preset performance target and shows that the model is stable and effective, supporting its practical application. On the test set, the ROC curve likewise shows good overall performance, with an AUC of 0.84, so the model retains its ability to distinguish liver cancer from non-liver cancer samples; the test-set AUC is close to the training-set AUC, indicating that the model is not overfitted and generalizes well. Taken together, the model performs well on the test set and is suitable for clinical validation and application.
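For readers less familiar with the metric, the AUC reported above can be read as the probability that a randomly chosen liver cancer sample receives a higher model score than a randomly chosen non-liver cancer sample. The Python sketch below computes it that way by pairwise comparison; the function name `roc_auc` and the 1/0 label coding are illustrative, and this is not necessarily the procedure used by the applicants.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for a positive (liver cancer) sample, 0 for a negative sample.
    scores: model outputs, where a higher score means "more likely positive".
    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

The pairwise form is quadratic in the number of samples, which is acceptable for a small illustration; rank-based implementations are preferable for large data sets.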
Example 15 establishment and verification of logistic regression prediction model 14
Prediction model 14 is obtained from prediction model 13 by replacing the variable hsa-miR-122 with hsa-miR-192. The mathematical expression of prediction model 14 is: Logit(P) = β1×Gender + β2×log10(AFP) + β3×hsa-miR-21 + β4×hsa-miR-192 + β5×hsa-miR-223 + β0, wherein the value range of the coefficient β1 is [0.963, 1.178]; of β2, [1.516, 1.685]; of β3, [-0.101, 0.016]; of β4, [-0.372, -0.274]; of β5, [0.257, 0.341]; and of β0, [-0.760, 1.905]. Preferably, β1 is 1.07; β2 is 1.60; β3 is -0.04; β4 is -0.32; β5 is 0.30; and β0 is 0.57. The established logistic regression model predicts the likelihood that a new sample has liver cancer from the Gender, AFP, hsa-miR-21, hsa-miR-192 and hsa-miR-223 data of that sample.
On the training set, the ROC curve of prediction model 14 shows good overall performance, with an AUC of 0.83, indicating that the model discriminates well between liver cancer and non-liver cancer samples. In 5-fold cross-validation on the training set, the mean AUC is 0.83 with a standard deviation of 0.00, which meets the preset performance target and shows that the model is stable and effective, supporting its practical application. On the test set, the ROC curve likewise shows good overall performance, with an AUC of 0.84, so the model retains its ability to distinguish liver cancer from non-liver cancer samples; the test-set AUC is close to the training-set AUC, indicating that the model is not overfitted and generalizes well. Taken together, the model performs well on the test set and is suitable for clinical validation and application.
Example 16 establishment and verification of logistic regression prediction model 15
Prediction model 15 is obtained from prediction model 13 by adding the variable Age. The mathematical expression of prediction model 15 is: Logit(P) = β1×Gender + β2×Age + β3×log10(AFP) + β4×hsa-miR-21 + β5×hsa-miR-122 + β6×hsa-miR-223 + β0, wherein the value range of the coefficient β1 is [0.949, 1.172]; of β2, [0.039, 0.048]; of β3, [1.548, 1.720]; of β4, [-0.011, 0.102]; of β5, [-0.397, -0.327]; of β6, [0.172, 0.257]; and of β0, [-3.687, -1.011]. Preferably, β1 is 1.06; β2 is 0.04; β3 is 1.63; β4 is 0.05; β5 is -0.36; β6 is 0.21; and β0 is -2.35. The established logistic regression model predicts the likelihood that a new sample has liver cancer from the Age, Gender, AFP, hsa-miR-21, hsa-miR-122 and hsa-miR-223 data of that sample.
On the training set, the ROC curve of prediction model 15 shows good overall performance, with an AUC of 0.85, indicating that the model discriminates well between liver cancer and non-liver cancer samples. In 5-fold cross-validation of prediction model 15 on the training set, the mean AUC is 0.85 with a standard deviation of 0.01, which meets the preset performance target and shows that the model is stable and effective, supporting its practical application. On the test set, the ROC curve likewise shows good overall performance, with an AUC of 0.86, so the model retains its ability to distinguish liver cancer from non-liver cancer samples; the test-set AUC is close to the training-set AUC, indicating that the model is not overfitted and generalizes well. Taken together, the model performs well on the test set and is suitable for clinical validation and application.
Example 17 establishment and verification of logistic regression prediction model 16
Prediction model 16 is obtained from prediction model 15 by replacing the variable hsa-miR-122 with hsa-miR-192. The mathematical expression of prediction model 16 is: Logit(P) = β1×Gender + β2×Age + β3×log10(AFP) + β4×hsa-miR-21 + β5×hsa-miR-192 + β6×hsa-miR-223 + β0, wherein the value range of the coefficient β1 is [0.959, 1.182]; of β2, [0.044, 0.052]; of β3, [1.544, 1.716]; of β4, [0.031, 0.153]; of β5, [-0.549, -0.443]; of β6, [0.203, 0.290]; and of β0, [-1.567, 1.171]. Preferably, β1 is 1.07; β2 is 0.05; β3 is 1.63; β4 is 0.09; β5 is -0.50; β6 is 0.25; and β0 is -0.20. The established logistic regression model predicts the likelihood that a new sample has liver cancer from the Age, Gender, AFP, hsa-miR-21, hsa-miR-192 and hsa-miR-223 data of that sample.
On the training set, the ROC curve of prediction model 16 shows good overall performance, with an AUC of 0.85, indicating that the model discriminates well between liver cancer and non-liver cancer samples. In 5-fold cross-validation on the training set, the mean AUC is 0.85 with a standard deviation of 0.01, which meets the preset performance target and shows that the model is stable and effective, supporting its practical application. On the test set, the ROC curve likewise shows good overall performance, with an AUC of 0.87, so the model retains its ability to distinguish liver cancer from non-liver cancer samples; the test-set AUC is close to the training-set AUC, indicating that the model is not overfitted and generalizes well. Taken together, the model performs well on the test set and is suitable for clinical validation and application.
Example 18 establishment and verification of logistic regression prediction model 17
Prediction model 17 is obtained from prediction model 15 by removing the variables hsa-miR-21 and hsa-miR-223 and adding the variables log10(PLT) and INR. The mathematical expression of prediction model 17 is: Logit(P) = β1×log10(AFP) + β2×Gender + β3×Age + β4×hsa-miR-122 + β5×log10(PLT) + β6×INR + β0, wherein the value range of the coefficient β1 is [1.469, 1.642]; of β2, [0.88, 1.116]; of β3, [0.033, 0.041]; of β4, [-0.31, -0.247]; of β5, [-3.854, -3.284]; of β6, [0.548, 1.447]; and of β0, [8.136, 10.487]. Preferably, β1 is 1.56; β2 is 1.00; β3 is 0.04; β4 is -0.28; β5 is -3.57; β6 is 1.00; and β0 is 9.31. The established logistic regression model predicts the likelihood that a new sample has liver cancer from the Age, Gender, AFP, PLT, INR and hsa-miR-122 data of that sample.
On the training set, the ROC curve of prediction model 17 shows good overall performance, with an AUC of 0.88, indicating that the model discriminates well between liver cancer and non-liver cancer samples. In 5-fold cross-validation on the training set, the mean AUC is 0.88 with a standard deviation of 0.01, which meets the preset performance target and shows that the model is stable and effective, supporting its practical application. On the test set, the ROC curve likewise shows good overall performance, with an AUC of 0.89, so the model retains its ability to distinguish liver cancer from non-liver cancer samples; the test-set AUC is close to the training-set AUC, indicating that the model is not overfitted and generalizes well. Taken together, the model performs well on the test set and is suitable for clinical validation and application.
Example 19 establishment and verification of logistic regression prediction model 18
Prediction model 18 is obtained from prediction model 17 by replacing the variable hsa-miR-122 with hsa-miR-192. The mathematical expression of prediction model 18 is: Logit(P) = β1×log10(AFP) + β2×Gender + β3×Age + β4×hsa-miR-192 + β5×log10(PLT) + β6×INR + β0, wherein the value range of the coefficient β1 is [1.474, 1.648]; of β2, [0.902, 1.136]; of β3, [0.033, 0.041]; of β4, [0.034, 0.043]; of β5, [-3.896, -3.324]; of β6, [0.639, 1.557]; and of β0, [9.440, 12.261]. Preferably, β1 is 1.56; β2 is 1.02; β3 is 0.04; β4 is -0.32; β5 is -3.61; β6 is 1.10; and β0 is 10.85. The established logistic regression model predicts the likelihood that a new sample has liver cancer from the Age, Gender, AFP, PLT, INR and hsa-miR-192 data of that sample.
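Because each model is stated as a set of coefficient ranges plus preferred values, a simple consistency check is to confirm that every preferred value lies within its range. The Python sketch below does this for prediction model 18 using the figures exactly as printed above; the names `MODEL_18` and `out_of_range` are illustrative. With the values as printed, the check reports β4, whose preferred value of -0.32 lies outside the stated range [0.034, 0.043], while the remaining coefficients fall inside theirs.

```python
# (lower bound, upper bound, preferred value) for prediction model 18,
# copied from the text as printed.
MODEL_18 = {
    "β1 log10(AFP)": (1.474, 1.648, 1.56),
    "β2 Gender": (0.902, 1.136, 1.02),
    "β3 Age": (0.033, 0.041, 0.04),
    "β4 hsa-miR-192": (0.034, 0.043, -0.32),
    "β5 log10(PLT)": (-3.896, -3.324, -3.61),
    "β6 INR": (0.639, 1.557, 1.10),
    "β0 intercept": (9.440, 12.261, 10.85),
}


def out_of_range(spec):
    """Return the names whose preferred value is not inside [lower, upper]."""
    return [
        name
        for name, (lower, upper, preferred) in spec.items()
        if not lower <= preferred <= upper
    ]


flagged = out_of_range(MODEL_18)
```

The same check can be applied to any other model in this section by supplying its ranges and preferred values.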
On the training set, the ROC curve of prediction model 18 shows good overall performance, with an AUC of 0.88, indicating that the model discriminates well between liver cancer and non-liver cancer samples. In 5-fold cross-validation on the training set, the mean AUC is 0.88 with a standard deviation of 0.00, which meets the preset performance target and shows that the model is stable and effective, supporting its practical application. On the test set, the ROC curve likewise shows good overall performance, with an AUC of 0.89, so the model retains its ability to distinguish liver cancer from non-liver cancer samples; the test-set AUC is close to the training-set AUC, indicating that the model is not overfitted and generalizes well. Taken together, the model performs well on the test set and is suitable for clinical validation and application.
Example 20 establishment and verification of logistic regression prediction model 19
Prediction model 19 is obtained from prediction model 17 by adding the variable TB. The mathematical expression of prediction model 19 is: Logit(P) = β1×log10(AFP) + β2×Gender + β3×Age + β4×hsa-miR-122 + β5×log10(PLT) + β6×INR + β7×TB + β0, wherein the value range of the coefficient β1 is [1.471, 1.645]; of β2, [0.884, 1.121]; of β3, [0.034, 0.043]; of β4, [-0.328, -0.265]; of β5, [-3.769, -3.194]; of β6, [1.081, 2.037]; of β7, [-0.01, -0.007]; and of β0, [7.832, 10.218]. Preferably, β1 is 1.56; β2 is 1.00; β3 is 0.04; β4 is -0.30; β5 is -3.48; β6 is 1.56; β7 is -0.01; and β0 is 9.02. The established logistic regression model predicts the likelihood that a new sample has liver cancer from the Age, Gender, AFP, PLT, INR, TB and hsa-miR-122 data of that sample.
On the training set, the ROC curve of prediction model 19 shows good overall performance, with an AUC of 0.88, indicating that the model discriminates well between liver cancer and non-liver cancer samples. In 5-fold cross-validation on the training set, the mean AUC is 0.88 with a standard deviation of 0.01, which meets the preset performance target and shows that the model is stable and effective, supporting its practical application. On the test set, the ROC curve likewise shows good overall performance, with an AUC of 0.89, so the model retains its ability to distinguish liver cancer from non-liver cancer samples; the test-set AUC is close to the training-set AUC, indicating that the model is not overfitted and generalizes well. Taken together, the model performs well on the test set and is suitable for clinical validation and application.
Example 21 establishment and verification of logistic regression prediction model 20
Prediction model 20 is obtained from prediction model 19 by replacing the variable hsa-miR-122 with hsa-miR-192. The mathematical expression of prediction model 20 is: Logit(P) = β1×log10(AFP) + β2×Gender + β3×Age + β4×hsa-miR-192 + β5×log10(PLT) + β6×INR + β7×TB + β0, wherein the value range of the coefficient β1 is [1.475, 1.650]; of β2, [0.906, 1.142]; of β3, [0.036, 0.045]; of β4, [-0.383, -0.296]; of β5, [-3.812, -3.235]; of β6, [1.179, 2.158]; of β7, [-0.01, -0.007]; and of β0, [9.339, 12.190]. Preferably, β1 is 1.56; β2 is 1.02; β3 is 0.04; β4 is -0.34; β5 is -3.52; β6 is 1.67; β7 is -0.01; and β0 is 10.76. The established logistic regression model predicts the likelihood that a new sample has liver cancer from the Age, Gender, AFP, PLT, INR, TB and hsa-miR-192 data of that sample.
On the training set, the ROC curve of prediction model 20 shows good overall performance, with an AUC of 0.88, indicating that the model discriminates well between liver cancer and non-liver cancer samples. In 5-fold cross-validation on the training set, the mean AUC is 0.88 with a standard deviation of 0.00, which meets the preset performance target and shows that the model is stable and effective, supporting its practical application. On the test set, the ROC curve likewise shows good overall performance, with an AUC of 0.89, so the model retains its ability to distinguish liver cancer from non-liver cancer samples; the test-set AUC is close to the training-set AUC, indicating that the model is not overfitted and generalizes well. Taken together, the model performs well on the test set and is suitable for clinical validation and application.
Example 22 establishment and verification of logistic regression prediction model 21
Prediction model 21 is obtained from prediction model 19 by adding the variable hsa-miR-223. The mathematical expression of prediction model 21 is: Logit(P) = β1×log10(AFP) + β2×Gender + β3×Age + β4×hsa-miR-122 + β5×hsa-miR-223 + β6×log10(PLT) + β7×INR + β8×TB + β0, wherein the value range of the coefficient β1 is [1.467, 1.642]; of β2, [0.871, 1.109]; of β3, [0.033, 0.042]; of β4, [-0.355, -0.289]; of β5, [0.086, 0.168]; of β6, [-3.641, -3.061]; of β7, [1.039, 1.99]; of β8, [-0.01, -0.007]; and of β0, [5.060, 7.940]. Preferably, β1 is 1.55; β2 is 0.99; β3 is 0.04; β4 is -0.32; β5 is 0.13; β6 is -3.35; β7 is 1.51; β8 is -0.01; and β0 is 6.50. The established logistic regression model predicts the likelihood that a new sample has liver cancer from the Age, Gender, AFP, PLT, INR, TB, hsa-miR-223 and hsa-miR-122 data of that sample.
On the training set, the ROC curve of prediction model 21 shows good overall performance, with an AUC of 0.89, indicating that the model discriminates well between liver cancer and non-liver cancer samples. In 5-fold cross-validation on the training set, the mean AUC is 0.88 with a standard deviation of 0.01, which meets the preset performance target and shows that the model is stable and effective, supporting its practical application. On the test set, the ROC curve likewise shows good overall performance, with an AUC of 0.89, so the model retains its ability to distinguish liver cancer from non-liver cancer samples; the test-set AUC is close to the training-set AUC, indicating that the model is not overfitted and generalizes well. Taken together, the model performs well on the test set and is suitable for clinical validation and application.
Example 23 establishment and verification of logistic regression prediction model 22
Prediction model 22 is obtained from prediction model 21 by replacing the variable hsa-miR-122 with hsa-miR-192. The mathematical expression of prediction model 22 is: Logit(P) = β1×log10(AFP) + β2×Gender + β3×Age + β4×hsa-miR-192 + β5×hsa-miR-223 + β6×log10(PLT) + β7×INR + β8×TB + β0, wherein the value range of the coefficient β1 is [1.464, 1.639]; of β2, [0.883, 1.119]; of β3, [0.035, 0.044]; of β4, [-0.448, -0.353]; of β5, [0.113, 0.199]; of β6, [-3.647, -3.064]; of β7, [1.153, 2.127]; of β8, [-0.01, -0.007]; and of β0, [6.888, 9.999]. Preferably, β1 is 1.55; β2 is 1.00; β3 is 0.04; β4 is -0.40; β5 is 0.16; β6 is -3.36; β7 is 1.64; β8 is -0.01; and β0 is 8.44. The established logistic regression model predicts the likelihood that a new sample has liver cancer from the Age, Gender, AFP, PLT, INR, TB, hsa-miR-223 and hsa-miR-192 data of that sample.
On the training set, the ROC curve of prediction model 22 shows good overall performance, with an AUC of 0.88, indicating that the model discriminates well between liver cancer and non-liver cancer samples. In 5-fold cross-validation on the training set, the mean AUC is 0.88 with a standard deviation of 0.00, which meets the preset performance target and shows that the model is stable and effective, supporting its practical application. On the test set, the ROC curve likewise shows good overall performance, with an AUC of 0.89, so the model retains its ability to distinguish liver cancer from non-liver cancer samples; the test-set AUC is close to the training-set AUC, indicating that the model is not overfitted and generalizes well. Taken together, the model performs well on the test set and is suitable for clinical validation and application.
Example 24 establishment and verification of logistic regression prediction model 23
Prediction model 23 is obtained from prediction model 21 by adding the variable hsa-miR-21. The mathematical expression of prediction model 23 is: Logit(P) = β1×log10(AFP) + β2×Gender + β3×Age + β4×hsa-miR-21 + β5×hsa-miR-122 + β6×hsa-miR-223 + β7×log10(PLT) + β8×INR + β9×TB + β0, wherein the value range of the coefficient β1 is [1.469, 1.644]; of β2, [0.873, 1.111]; of β3, [0.033, 0.042]; of β4, [-0.03, 0.088]; of β5, [-0.366, -0.293]; of β6, [0.074, 0.163]; of β7, [-3.64, -3.059]; of β8, [1.043, 1.995]; of β9, [-0.01, -0.007]; and of β0, [4.317, 7.761]. Preferably, β1 is 1.56; β2 is 0.99; β3 is 0.04; β4 is 0.03; β5 is -0.33; β6 is 0.12; β7 is -3.35; β8 is 1.52; β9 is -0.01; and β0 is 6.04. The established logistic regression model predicts the likelihood that a new sample has liver cancer from the Age, Gender, AFP, PLT, INR, TB, hsa-miR-21, hsa-miR-122 and hsa-miR-223 data of that sample.
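Since the models in this section differ only in which terms they include, a table-driven evaluator is one convenient way to apply any of them. The Python sketch below encodes prediction model 23 with its preferred coefficients as a list of (variable, transform, coefficient) entries. The names `MODEL_23_TERMS`, `logit_score` and `probability` are illustrative, and the 0/1 coding of Gender and the measurement units of the inputs are assumptions not specified in this section.

```python
import math

# (input name, optional transform, preferred coefficient) for model 23.
MODEL_23_TERMS = [
    ("AFP", math.log10, 1.56),       # beta1 x log10(AFP)
    ("Gender", None, 0.99),          # beta2 (0/1 coding assumed)
    ("Age", None, 0.04),             # beta3
    ("hsa-miR-21", None, 0.03),      # beta4
    ("hsa-miR-122", None, -0.33),    # beta5
    ("hsa-miR-223", None, 0.12),     # beta6
    ("PLT", math.log10, -3.35),      # beta7 x log10(PLT)
    ("INR", None, 1.52),             # beta8
    ("TB", None, -0.01),             # beta9
]
MODEL_23_INTERCEPT = 6.04            # beta0


def logit_score(sample, terms, intercept):
    """Sum intercept + coefficient * (transformed) value over all terms."""
    total = intercept
    for name, transform, coefficient in terms:
        value = sample[name]
        if transform is not None:
            value = transform(value)
        total += coefficient * value
    return total


def probability(logit):
    """Inverse logit: convert Logit(P) to P."""
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical sample values, chosen only to make the arithmetic easy to follow.
example = {"AFP": 10.0, "Gender": 1, "Age": 50, "hsa-miR-21": 0.0,
           "hsa-miR-122": 0.0, "hsa-miR-223": 0.0, "PLT": 100.0,
           "INR": 1.0, "TB": 10.0}
example_logit = logit_score(example, MODEL_23_TERMS, MODEL_23_INTERCEPT)
```

Swapping the term list and intercept is enough to evaluate a different model, for example prediction model 24 with hsa-miR-192 in place of hsa-miR-122.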
On the training set, the ROC curve of prediction model 23 shows good overall performance, with an AUC of 0.89, indicating that the model discriminates well between liver cancer and non-liver cancer samples. In 5-fold cross-validation on the training set, the mean AUC is 0.89 with a standard deviation of 0.01, which meets the preset performance target and shows that the model is stable and effective, supporting its practical application. On the test set, the ROC curve likewise shows good overall performance, with an AUC of 0.89, so the model retains its ability to distinguish liver cancer from non-liver cancer samples; the test-set AUC is close to the training-set AUC, indicating that the model is not overfitted and generalizes well. Taken together, the model performs well on the test set and is suitable for clinical validation and application.
Example 25 establishment and verification of logistic regression prediction model 24
Prediction model 24 is obtained from prediction model 23 by replacing the variable hsa-miR-122 with hsa-miR-192. The mathematical expression of prediction model 24 is: Logit(P) = β1×log10(AFP) + β2×Gender + β3×Age + β4×hsa-miR-21 + β5×hsa-miR-192 + β6×hsa-miR-223 + β7×log10(PLT) + β8×INR + β9×TB + β0, wherein the value range of the coefficient β1 is [1.465, 1.640]; of β2, [0.885, 1.121]; of β3, [0.036, 0.045]; of β4, [0.001, 0.129]; of β5, [-0.487, -0.375]; of β6, [0.096, 0.187]; of β7, [-3.638, -3.054]; of β8, [1.174, 2.151]; of β9, [-0.01, -0.006]; and of β0, [5.921, 9.392]. Preferably, β1 is 1.55; β2 is 1.00; β3 is 0.04; β4 is 0.07; β5 is -0.43; β6 is 0.14; β7 is -3.35; β8 is 1.66; β9 is -0.01; and β0 is 7.66. The established logistic regression model predicts the likelihood that a new sample has liver cancer from the Age, Gender, AFP, PLT, INR, TB, hsa-miR-21, hsa-miR-192 and hsa-miR-223 data of that sample.
On the training set, the ROC curve of prediction model 24 shows good overall performance, with an AUC of 0.88, indicating that the model discriminates well between liver cancer and non-liver cancer samples. In 5-fold cross-validation on the training set, the mean AUC is 0.88 with a standard deviation of 0.00, which meets the preset performance target and shows that the model is stable and effective, supporting its practical application. On the test set, the ROC curve likewise shows good overall performance, with an AUC of 0.89, so the model retains its ability to distinguish liver cancer from non-liver cancer samples; the test-set AUC is close to the training-set AUC, indicating that the model is not overfitted and generalizes well. Taken together, the model performs well on the test set and is suitable for clinical validation and application.
Example 26 establishment and verification of logistic regression prediction model 25
Prediction model 25 is obtained from prediction model 23 by replacing the variables hsa-miR-122, hsa-miR-21 and hsa-miR-223 with miRNA7. The mathematical expression of prediction model 25 is: Logit(P) = β1×log10(AFP) + β2×Gender + β3×Age + β4×miRNA7 + β5×log10(PLT) + β6×INR + β7×TB + β0, wherein the value range of the coefficient β1 is [1.483, 1.657]; of β2, [0.882, 1.119]; of β3, [0.033, 0.042]; of β4, [0.605, 0.756]; of β5, [-3.542, -2.962]; of β6, [1.065, 2.019]; of β7, [-0.01, -0.006]; and of β0, [0.823, 2.815]. Preferably, β1 is 1.57; β2 is 1.00; β3 is 0.04; β4 is 0.68; β5 is -3.25; β6 is 1.54; β7 is -0.01; and β0 is 1.82. The established logistic regression model predicts the likelihood that a new sample has liver cancer from the Age, Gender, AFP, PLT, INR, TB and miRNA7 data of that sample.
On the training set, the ROC curve of prediction model 25 shows good overall performance, with an AUC of 0.88, indicating that the model discriminates well between liver cancer and non-liver cancer samples. In 5-fold cross-validation on the training set, the mean AUC is 0.88 with a standard deviation of 0.01, which meets the preset performance target and shows that the model is stable and effective, supporting its practical application. On the test set, the ROC curve likewise shows good overall performance, with an AUC of 0.89, so the model retains its ability to distinguish liver cancer from non-liver cancer samples; the test-set AUC is close to the training-set AUC, indicating that the model is not overfitted and generalizes well. Taken together, the model performs well on the test set and is suitable for clinical validation and application.
Example 27 establishment and verification of logistic regression prediction model 26
Prediction model 26 is obtained from prediction model 25 by removing the variable TB. The mathematical expression of prediction model 26 is: Logit(P) = β1×log10(AFP) + β2×Gender + β3×Age + β4×miRNA7 + β5×log10(PLT) + β6×INR + β0, wherein the value range of the coefficient β1 is [1.478, 1.652]; of β2, [0.876, 1.112]; of β3, [0.033, 0.042]; of β4, [0.582, 0.732]; of β5, [-3.618, -3.042]; of β6, [0.592, 1.493]; and of β0, [1.458, 3.394]. Preferably, β1 is 1.57; β2 is 0.99; β3 is 0.04; β4 is 0.66; β5 is -3.33; β6 is 1.04; and β0 is 2.43. The established logistic regression model predicts the likelihood that a new sample has liver cancer from the Age, Gender, AFP, PLT, INR and miRNA7 data of that sample.
On the training set, the ROC curve of prediction model 26 shows good overall performance, with an AUC of 0.88, indicating that the model distinguishes liver cancer samples from non-liver cancer samples well. In 5-fold cross-validation on the training set, the mean AUC is 0.88 with a standard deviation of 0.01, which meets the set performance target and shows that the model is stable and effective, supporting its practical application. On the test set, the ROC curve also shows good overall performance, with an AUC of 0.89, so the model retains its ability to distinguish liver cancer samples from non-liver cancer samples. The test-set AUC is close to the training-set AUC, which indicates that the model is not overfitted and generalizes well. Overall, the model performs well on the test set and can proceed to clinical validation and application.
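The AUC and the cross-validation summary reported in these examples can be illustrated with a short, dependency-free Python sketch. It computes the AUC as the fraction of (positive, negative) sample pairs that the model ranks correctly, and summarizes per-fold AUCs by their mean and standard deviation. The function names are hypothetical, the use of the population standard deviation is an assumption (the text does not say which estimator was used), and the small label and score lists in the demo are toy values, not patent data.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    A pair counts 1 when the positive sample has the higher score and 0.5
    when the two scores are tied. labels are 1 (cancer) or 0 (non-cancer).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_and_std(values):
    """Mean and population standard deviation of per-fold AUC values."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance ** 0.5


if __name__ == "__main__":
    # Toy labels and predicted probabilities, for illustration only.
    toy_labels = [1, 1, 0, 0, 1, 0]
    toy_scores = [0.9, 0.7, 0.4, 0.8, 0.6, 0.2]
    print(roc_auc(toy_labels, toy_scores))
    # Hypothetical per-fold AUCs from a 5-fold cross-validation run.
    print(mean_and_std([0.87, 0.88, 0.89, 0.88, 0.88]))
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a sketch; a rank-based formulation would scale better on large cohorts.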
Example 28 establishment and verification of logistic regression prediction model 27
Prediction model 27 is obtained from prediction model 26 by removing the variable INR. The mathematical expression of prediction model 27 is: Logit(P) = β1×log10(AFP) + β2×Gender + β3×Age + β4×miRNA7 + β5×log10(PLT) + β0, wherein the value range of the coefficient β1 is [1.492, 1.665]; the value range of the coefficient β2 is [0.89, 1.125]; the value range of the coefficient β3 is [0.033, 0.041]; the value range of the coefficient β4 is [0.579, 0.728]; the value range of the coefficient β5 is [-3.824, -3.28]; the value range of the coefficient β0 is [3.321, 4.705]. Preferably, the coefficient β1 is 1.58; the coefficient β2 is 1.01; the coefficient β3 is 0.04; the coefficient β4 is 0.65; the coefficient β5 is -3.35; the coefficient β0 is 4.01. The established logistic regression model can make a prediction from the Age, Gender, AFP, PLT and miRNA7 data of a new sample and thereby assess the likelihood that the new sample has liver cancer.
On the training set, the ROC curve of prediction model 27 shows good overall performance, with an AUC of 0.88, indicating that the model distinguishes liver cancer samples from non-liver cancer samples well. In 5-fold cross-validation on the training set, the mean AUC is 0.88 with a standard deviation of 0.01, which meets the set performance target and shows that the model is stable and effective, supporting its practical application. On the test set, the ROC curve also shows good overall performance, with an AUC of 0.89, so the model retains its ability to distinguish liver cancer samples from non-liver cancer samples. The test-set AUC is close to the training-set AUC, which indicates that the model is not overfitted and generalizes well. Overall, the model performs well on the test set and can proceed to clinical validation and application.
Example 29 establishment and verification of logistic regression prediction model 28
Prediction model 28 is obtained from prediction model 27 by removing the variable PLT. The mathematical expression of prediction model 28 is: Logit(P) = β1×log10(AFP) + β2×Gender + β3×Age + β4×miRNA7 + β0, wherein the value range of the coefficient β1 is [1.548, 1.72]; the value range of the coefficient β2 is [0.944, 1.167]; the value range of the coefficient β3 is [0.042, 0.051]; the value range of the coefficient β4 is [0.755, 0.899]; the value range of the coefficient β0 is [-4.699, -4.138]. Preferably, the coefficient β1 is 1.63; the coefficient β2 is 1.06; the coefficient β3 is 0.05; the coefficient β4 is 0.83; the coefficient β0 is -4.42. The established logistic regression model can make a prediction from the Age, Gender, AFP and miRNA7 data of a new sample and thereby assess the likelihood that the new sample has liver cancer.
On the training set, the ROC curve of prediction model 28 shows good overall performance, with an AUC of 0.85, indicating that the model distinguishes liver cancer samples from non-liver cancer samples well. In 5-fold cross-validation on the training set, the mean AUC is 0.85 with a standard deviation of 0.00, which meets the set performance target and shows that the model is stable and effective, supporting its practical application. On the test set, the ROC curve also shows good overall performance, with an AUC of 0.87, so the model retains its ability to distinguish liver cancer samples from non-liver cancer samples. The test-set AUC is close to the training-set AUC, which indicates that the model is not overfitted and generalizes well. Overall, the model performs well on the test set and can proceed to clinical validation and application.
Example 30 establishment and verification of logistic regression prediction model 29
Prediction model 29 is obtained from prediction model 28 by removing the variable Age. The mathematical expression of prediction model 29 is: Logit(P) = β1×log10(AFP) + β2×Gender + β3×miRNA7 + β0, wherein the value range of the coefficient β1 is [1.509, 1.676]; the value range of the coefficient β2 is [0.931, 1.147]; the value range of the coefficient β3 is [0.635, 0.773]; the value range of the coefficient β0 is [-1.915, -1.665]. Preferably, the coefficient β1 is 1.59; the coefficient β2 is 1.04; the coefficient β3 is 0.70; the coefficient β0 is -1.79. The established logistic regression model can make a prediction from the Gender, AFP and miRNA7 data of a new sample and thereby assess the likelihood that the new sample has liver cancer.
On the training set, the ROC curve of prediction model 29 shows good overall performance, with an AUC of 0.83, indicating that the model distinguishes liver cancer samples from non-liver cancer samples well. In 5-fold cross-validation on the training set, the mean AUC is 0.83 with a standard deviation of 0.01, which meets the set performance target and shows that the model is stable and effective, supporting its practical application. On the test set, the ROC curve also shows good overall performance, with an AUC of 0.84, so the model retains its ability to distinguish liver cancer samples from non-liver cancer samples. The test-set AUC is close to the training-set AUC, which indicates that the model is not overfitted and generalizes well. Overall, the model performs well on the test set and can proceed to clinical validation and application.
Example 31 establishment and verification of logistic regression prediction model 30
Prediction model 30 is obtained from prediction model 29 by removing the variable Gender. The mathematical expression of prediction model 30 is: Logit(P) = β1×log10(AFP) + β2×miRNA7 + β0, wherein the value range of the coefficient β1 is [1.504, 1.669]; the value range of the coefficient β2 is [0.74, 0.874]; the value range of the coefficient β0 is [-1.043, -0.876]. Preferably, the coefficient β1 is 1.59; the coefficient β2 is 0.81; the coefficient β0 is -0.96. The established logistic regression model can make a prediction from the AFP and miRNA7 data of a new sample and thereby assess the likelihood that the new sample has liver cancer.
On the training set, the ROC curve of prediction model 30 shows good overall performance, with an AUC of 0.81, indicating that the model distinguishes liver cancer samples from non-liver cancer samples well. In 5-fold cross-validation on the training set, the mean AUC is 0.81 with a standard deviation of 0.01, which meets the set performance target and shows that the model is stable and effective, supporting its practical application. On the test set, the ROC curve also shows good overall performance, with an AUC of 0.83, so the model retains its ability to distinguish liver cancer samples from non-liver cancer samples. The test-set AUC is close to the training-set AUC, which indicates that the model is not overfitted and generalizes well. Overall, the model performs well on the test set and can proceed to clinical validation and application.
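Because prediction model 30 has only two variables, its decision boundary can be written down directly. The sketch below, using the preferred coefficients of prediction model 30, solves Logit(P) = ln(c / (1 - c)) for the miRNA7 value at which the predicted probability equals a chosen cutoff c for a given AFP level. The cutoff parameter and the function names are hypothetical: this example does not specify a classification threshold, and 0.5 is used here purely for illustration.

```python
import math

# Preferred coefficients of prediction model 30, as listed in Example 31.
B_AFP, B_MIRNA7, B0 = 1.59, 0.81, -0.96


def model30_probability(afp, mirna7):
    """P from Logit(P) = B_AFP*log10(AFP) + B_MIRNA7*miRNA7 + B0."""
    logit = B_AFP * math.log10(afp) + B_MIRNA7 * mirna7 + B0
    return 1.0 / (1.0 + math.exp(-logit))


def mirna7_at_cutoff(afp, cutoff=0.5):
    """miRNA7 value at which model 30 returns exactly `cutoff` for this AFP.

    `cutoff` is an illustrative assumption; the text gives no threshold.
    """
    if not 0.0 < cutoff < 1.0:
        raise ValueError("cutoff must lie strictly between 0 and 1")
    target_logit = math.log(cutoff / (1.0 - cutoff))
    return (target_logit - B0 - B_AFP * math.log10(afp)) / B_MIRNA7


if __name__ == "__main__":
    # Illustrative AFP values only; not taken from the patent data.
    for afp in (1.0, 10.0, 100.0):
        print(afp, mirna7_at_cutoff(afp))
```

Since both β1 and β2 are positive, a higher AFP lowers the miRNA7 value needed to reach the same predicted probability.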
Example 32 establishment and verification of logistic regression prediction model 31
Prediction model 31 is obtained from prediction model 30 by adding the variable log10(PLT). The mathematical expression of prediction model 31 is: Logit(P) = β1×log10(AFP) + β2×miRNA7 + β3×log10(PLT) + β0, wherein the value range of the coefficient β1 is [1.455, 1.624]; the value range of the coefficient β2 is [0.555, 0.695]; the value range of the coefficient β3 is [-4.314, -3.776]; the value range of the coefficient β0 is [7.415, 8.614]. Preferably, the coefficient β1 is 1.54; the coefficient β2 is 0.63; the coefficient β3 is -4.05; the coefficient β0 is 8.01. The established logistic regression model can make a prediction from the AFP, PLT and miRNA7 data of a new sample and thereby assess the likelihood that the new sample has liver cancer.
On the training set, the ROC curve of prediction model 31 shows good overall performance, with an AUC of 0.86, indicating that the model distinguishes liver cancer samples from non-liver cancer samples well. In 5-fold cross-validation on the training set, the mean AUC is 0.86 with a standard deviation of 0.00, which meets the set performance target and shows that the model is stable and effective, supporting its practical application. On the test set, the ROC curve also shows good overall performance, with an AUC of 0.86, so the model retains its ability to distinguish liver cancer samples from non-liver cancer samples. The test-set AUC is close to the training-set AUC, which indicates that the model is not overfitted and generalizes well. Overall, the model performs well on the test set and can proceed to clinical validation and application.
Example 33 establishment and verification of logistic regression prediction model 32
Prediction model 32 is obtained from prediction model 19 by applying a log10 transformation to the variables INR and TB and then refitting the model. The mathematical expression of prediction model 32 is: Logit(P) = β1×log10(AFP) + β2×Gender + β3×Age + β4×hsa-miR-122 + β5×log10(PLT) + β6×log10(INR) + β7×log10(TB) + β0, wherein the value range of the coefficient β1 is [1.465, 1.640]; the value range of the coefficient β2 is [0.910, 1.149]; the value range of the coefficient β3 is [0.033, 0.042]; the value range of the coefficient β4 is [-0.333, -0.269]; the value range of the coefficient β5 is [-3.782, -3.204]; the value range of the coefficient β6 is [5.071, 7.767]; the value range of the coefficient β7 is [-1.262, -0.853]; the value range of the coefficient β0 is [10.656, 12.794]. Preferably, the coefficient β1 is 1.55; the coefficient β2 is 1.03; the coefficient β3 is 0.04; the coefficient β4 is -0.30; the coefficient β5 is -3.49; the coefficient β6 is 6.42; the coefficient β7 is -1.06; the coefficient β0 is 11.72. The established logistic regression model can make a prediction from the AFP, PLT, Gender, Age, hsa-miR-122, INR and TB data of a new sample and thereby assess the likelihood that the new sample has liver cancer.
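The following Python sketch shows one way the log10-transformed variables of prediction model 32 could be handled together with the stated coefficient ranges. Each term is stored with its preferred coefficient and its value range, a helper checks that every preferred coefficient lies inside its range, and the logit is computed after applying log10 to AFP, PLT, INR and TB. The dictionary layout, function names and the 0/1 coding of Gender are assumptions for illustration only.

```python
import math

# term -> (preferred coefficient, (lower bound, upper bound)), from Example 33.
MODEL_32 = {
    "log10_AFP": (1.55, (1.465, 1.640)),
    "Gender": (1.03, (0.910, 1.149)),
    "Age": (0.04, (0.033, 0.042)),
    "hsa-miR-122": (-0.30, (-0.333, -0.269)),
    "log10_PLT": (-3.49, (-3.782, -3.204)),
    "log10_INR": (6.42, (5.071, 7.767)),
    "log10_TB": (-1.06, (-1.262, -0.853)),
    "intercept": (11.72, (10.656, 12.794)),
}


def coefficients_in_range(spec):
    """True when every preferred coefficient lies inside its stated range."""
    return all(lo <= value <= hi for value, (lo, hi) in spec.values())


def model32_logit(afp, gender, age, mir122, plt, inr, tb):
    """Logit(P) for model 32; log10 is applied to AFP, PLT, INR and TB.

    Assumption (hypothetical): gender is coded 0/1 and all inputs use the
    same units and scaling as the training data.
    """
    features = {
        "log10_AFP": math.log10(afp),
        "Gender": gender,
        "Age": age,
        "hsa-miR-122": mir122,
        "log10_PLT": math.log10(plt),
        "log10_INR": math.log10(inr),
        "log10_TB": math.log10(tb),
        "intercept": 1.0,
    }
    return sum(MODEL_32[name][0] * x for name, x in features.items())


def model32_probability(*args, **kwargs):
    """Logistic transform of the model 32 logit."""
    return 1.0 / (1.0 + math.exp(-model32_logit(*args, **kwargs)))


if __name__ == "__main__":
    print(coefficients_in_range(MODEL_32))
    # Illustrative input values only; they are not taken from the patent data.
    print(model32_probability(afp=10.0, gender=1, age=50, mir122=0.0,
                              plt=100.0, inr=1.0, tb=10.0))
```

Keeping the ranges next to the coefficients makes it straightforward to substitute any other coefficient set within the stated ranges and re-run the same check.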
On the training set, the ROC curve of prediction model 32 shows good overall performance, with an AUC of 0.89, indicating that the model distinguishes liver cancer samples from non-liver cancer samples well. In 5-fold cross-validation on the training set, the mean AUC is 0.89 with a standard deviation of 0.00, which meets the set performance target and shows that the model is stable and effective, supporting its practical application. On the test set, the ROC curve also shows good overall performance, with an AUC of 0.89, so the model retains its ability to distinguish liver cancer samples from non-liver cancer samples. The test-set AUC is close to the training-set AUC, which indicates that the model is not overfitted and generalizes well. Overall, the model performs well on the test set and can proceed to clinical validation and application.
Example 34 establishment and verification of logistic regression prediction model 33
Prediction model 33 is obtained from prediction model 32 by replacing the variable hsa-miR-122 with the variable miRNA7. The mathematical expression of prediction model 33 is: Logit(P) = β1×log10(AFP) + β2×Gender + β3×Age + β4×miRNA7 + β5×log10(PLT) + β6×log10(INR) + β7×log10(TB) + β0, wherein the value range of the coefficient β1 is [1.477, 1.652]; the value range of the coefficient β2 is [0.905, 1.143]; the value range of the coefficient β3 is [0.033, 0.042]; the value range of the coefficient β4 is [0.609, 0.76]; the value range of the coefficient β5 is [-3.538, -2.954]; the value range of the coefficient β6 is [4.994, 7.692]; the value range of the coefficient β7 is [-1.144, -0.737]; the value range of the coefficient β0 is [3.455, 5.028]. Preferably, the coefficient β1 is 1.56; the coefficient β2 is 1.02; the coefficient β3 is 0.04; the coefficient β4 is 0.68; the coefficient β5 is -3.25; the coefficient β6 is 6.34; the coefficient β7 is -0.94; the coefficient β0 is 4.24. The established logistic regression model can make a prediction from the AFP, PLT, Gender, Age, INR, TB and miRNA7™ data of a new sample and thereby assess the likelihood that the new sample has liver cancer.
On the training set, the ROC curve of prediction model 33 shows good overall performance, with an AUC of 0.89, indicating that the model distinguishes liver cancer samples from non-liver cancer samples well. In 5-fold cross-validation on the training set, the mean AUC is 0.88 with a standard deviation of 0.00, which meets the set performance target and shows that the model is stable and effective, supporting its practical application. On the test set, the ROC curve also shows good overall performance, with an AUC of 0.89, so the model retains its ability to distinguish liver cancer samples from non-liver cancer samples. The test-set AUC is close to the training-set AUC, which indicates that the model is not overfitted and generalizes well. Overall, the model performs well on the test set and can proceed to clinical validation and application.
Example 35 establishment and verification of logistic regression prediction model 34
Prediction model 34 is obtained from prediction model 32 by adding the variable hsa-miR-223. The mathematical expression of prediction model 34 is: Logit(P) = β1×log10(AFP) + β2×Gender + β3×Age + β4×hsa-miR-122 + β5×hsa-miR-223 + β6×log10(PLT) + β7×log10(INR) + β8×log10(TB) + β0, wherein the value range of the coefficient β1 is [1.461, 1.636]; the value range of the coefficient β2 is [0.897, 1.136]; the value range of the coefficient β3 is [0.032, 0.041]; the value range of the coefficient β4 is [-0.357, -0.291]; the value range of the coefficient β5 is [0.074, 0.157]; the value range of the coefficient β6 is [-3.664, -3.081]; the value range of the coefficient β7 is [4.916, 7.611]; the value range of the coefficient β8 is [-1.236, -0.828]; the value range of the coefficient β0 is [7.993, 10.708]. Preferably, the coefficient β1 is 1.55; the coefficient β2 is 1.02; the coefficient β3 is 0.04; the coefficient β4 is -0.32; the coefficient β5 is 0.12; the coefficient β6 is -3.37; the coefficient β7 is 6.26; the coefficient β8 is -1.03; the coefficient β0 is 9.35. The established logistic regression model can make a prediction from the AFP, PLT, Gender, Age, INR, TB, hsa-miR-122 and hsa-miR-223 data of a new sample and thereby assess the likelihood that the new sample has liver cancer.
On the training set, the ROC curve of prediction model 34 shows good overall performance, with an AUC of 0.89, indicating that the model distinguishes liver cancer samples from non-liver cancer samples well. In 5-fold cross-validation on the training set, the mean AUC is 0.89 with a standard deviation of 0.01, which meets the set performance target and shows that the model is stable and effective, supporting its practical application. On the test set, the ROC curve also shows good overall performance, with an AUC of 0.89, so the model retains its ability to distinguish liver cancer samples from non-liver cancer samples. The test-set AUC is close to the training-set AUC, which indicates that the model is not overfitted and generalizes well. Overall, the model performs well on the test set and can proceed to clinical validation and application.
Example 36 establishment and verification of logistic regression prediction model 35
Prediction model 35 is obtained from prediction model 32 by replacing the variable hsa-miR-122 with the variable hsa-miR-192. The mathematical expression of prediction model 35 is: Logit(P) = β1×log10(AFP) + β2×Gender + β3×Age + β4×hsa-miR-192 + β5×log10(PLT) + β6×log10(INR) + β7×log10(TB) + β0, wherein the value range of the coefficient β1 is [1.465, 1.64]; the value range of the coefficient β2 is [0.931, 1.168]; the value range of the coefficient β3 is [0.036, 0.045]; the value range of the coefficient β4 is [-0.402, -0.314]; the value range of the coefficient β5 is [-3.811, -3.232]; the value range of the coefficient β6 is [5.548, 8.281]; the value range of the coefficient β7 is [-1.298, -0.882]; the value range of the coefficient β0 is [12.569, 15.297]. Preferably, the coefficient β1 is 1.55; the coefficient β2 is 1.05; the coefficient β3 is 0.04; the coefficient β4 is -0.36; the coefficient β5 is -3.52; the coefficient β6 is 6.91; the coefficient β7 is -1.09; the coefficient β0 is 13.93. The established logistic regression model can make a prediction from the AFP, PLT, Gender, Age, INR, TB and hsa-miR-192 data of a new sample and thereby assess the likelihood that the new sample has liver cancer.
On the training set, the ROC curve of prediction model 35 shows good overall performance, with an AUC of 0.88, indicating that the model distinguishes liver cancer samples from non-liver cancer samples well. In 5-fold cross-validation on the training set, the mean AUC is 0.88 with a standard deviation of 0.00, which meets the set performance target and shows that the model is stable and effective, supporting its practical application. On the test set, the ROC curve also shows good overall performance, with an AUC of 0.89, so the model retains its ability to distinguish liver cancer samples from non-liver cancer samples. The test-set AUC is close to the training-set AUC, which indicates that the model is not overfitted and generalizes well. Overall, the model performs well on the test set and can proceed to clinical validation and application.
Example 37 establishment and verification of logistic regression prediction model 36
Prediction model 36 is obtained from prediction model 35 by adding the variable hsa-miR-223. The mathematical expression of prediction model 36 is: Logit(P) = β1×log10(AFP) + β2×Gender + β3×Age + β4×hsa-miR-192 + β5×hsa-miR-223 + β6×log10(PLT) + β7×log10(INR) + β8×log10(TB) + β0, wherein the value range of the coefficient β1 is [1.454, 1.629]; the value range of the coefficient β2 is [0.908, 1.146]; the value range of the coefficient β3 is [0.035, 0.044]; the value range of the coefficient β4 is [-0.463, -0.368]; the value range of the coefficient β5 is [0.105, 0.192]; the value range of the coefficient β6 is [-3.655, -3.069]; the value range of the coefficient β7 is [5.441, 8.172]; the value range of the coefficient β8 is [-1.284, -0.869]; the value range of the coefficient β0 is [10.174, 13.187]. Preferably, the coefficient β1 is 1.54; the coefficient β2 is 1.03; the coefficient β3 is 0.04; the coefficient β4 is -0.42; the coefficient β5 is 0.15; the coefficient β6 is -3.36; the coefficient β7 is 6.81; the coefficient β8 is -1.08; the coefficient β0 is 11.68. The established logistic regression model can make a prediction from the AFP, PLT, Gender, Age, INR, TB, hsa-miR-223 and hsa-miR-192 data of a new sample and thereby assess the likelihood that the new sample has liver cancer.
On the training set, the ROC curve of prediction model 36 shows good overall performance, with an AUC of 0.89, indicating that the model distinguishes liver cancer samples from non-liver cancer samples well. In 5-fold cross-validation on the training set, the mean AUC is 0.89 with a standard deviation of 0.00, which meets the set performance target and shows that the model is stable and effective, supporting its practical application. On the test set, the ROC curve also shows good overall performance, with an AUC of 0.89, so the model retains its ability to distinguish liver cancer samples from non-liver cancer samples. The test-set AUC is close to the training-set AUC, which indicates that the model is not overfitted and generalizes well. Overall, the model performs well on the test set and can proceed to clinical validation and application.
Example 38 establishment and verification of logistic regression prediction model 37
Prediction model 37 is obtained from prediction model 36 by adding the variable hsa-miR-21. The mathematical expression of prediction model 37 is: Logit(P) = β1×log10(AFP) + β2×Gender + β3×Age + β4×hsa-miR-21 + β5×hsa-miR-192 + β6×hsa-miR-223 + β7×log10(PLT) + β8×log10(INR) + β9×log10(TB) + β0, wherein the value range of the coefficient β1 is [1.455, 1.63]; the value range of the coefficient β2 is [0.91, 1.148]; the value range of the coefficient β3 is [0.036, 0.045]; the value range of the coefficient β4 is [0.007, 0.134]; the value range of the coefficient β5 is [-0.505, -0.392]; the value range of the coefficient β6 is [0.087, 0.178]; the value range of the coefficient β7 is [-3.643, -3.056]; the value range of the coefficient β8 is [5.499, 8.233]; the value range of the coefficient β9 is [-1.271, -0.856]; the value range of the coefficient β0 is [9.145, 12.523]. Preferably, the coefficient β1 is 1.54; the coefficient β2 is 1.03; the coefficient β3 is 0.04; the coefficient β4 is 0.07; the coefficient β5 is -0.45; the coefficient β6 is 0.13; the coefficient β7 is -3.35; the coefficient β8 is 6.87; the coefficient β9 is -1.06; the coefficient β0 is 10.83. The established logistic regression model can make a prediction from the AFP, PLT, Gender, Age, INR, TB, hsa-miR-21, hsa-miR-223 and hsa-miR-192 data of a new sample and thereby assess the likelihood that the new sample has liver cancer.
On the training set, the ROC curve of prediction model 37 shows good overall performance, with an AUC of 0.89, indicating that the model distinguishes liver cancer samples from non-liver cancer samples well. In 5-fold cross-validation on the training set, the mean AUC is 0.88 with a standard deviation of 0.01, which meets the set performance target and shows that the model is stable and effective, supporting its practical application. On the test set, the ROC curve also shows good overall performance, with an AUC of 0.89, so the model retains its ability to distinguish liver cancer samples from non-liver cancer samples. The test-set AUC is close to the training-set AUC, which indicates that the model is not overfitted and generalizes well. Overall, the model performs well on the test set and can proceed to clinical validation and application.
Example 39 establishment and verification of logistic regression prediction model 38
Prediction model 38 is obtained from prediction model 37 by replacing the variable hsa-miR-192 with the variable hsa-miR-122. The mathematical expression of prediction model 38 is: Logit(P) = β1×log10(AFP) + β2×Gender + β3×Age + β4×hsa-miR-21 + β5×hsa-miR-122 + β6×hsa-miR-223 + β7×log10(PLT) + β8×log10(INR) + β9×log10(TB) + β0, wherein the value range of the coefficient β1 is [1.462, 1.638]; the value range of the coefficient β2 is [0.899, 1.138]; the value range of the coefficient β3 is [0.032, 0.042]; the value range of the coefficient β4 is [-0.033, 0.085]; the value range of the coefficient β5 is [-0.367, -0.294]; the value range of the coefficient β6 is [0.063, 0.153]; the value range of the coefficient β7 is [-3.662, -3.079]; the value range of the coefficient β8 is [4.92, 7.615]; the value range of the coefficient β9 is [-1.229, -0.819]; the value range of the coefficient β0 is [7.267, 10.596]. Preferably, the coefficient β1 is 1.55; the coefficient β2 is 1.02; the coefficient β3 is 0.04; the coefficient β4 is 0.03; the coefficient β5 is -0.33; the coefficient β6 is 0.11; the coefficient β7 is -3.37; the coefficient β8 is 6.27; the coefficient β9 is -1.02; the coefficient β0 is 8.93. The established logistic regression model can make a prediction from the AFP, PLT, Gender, Age, INR, TB, hsa-miR-21, hsa-miR-223 and hsa-miR-122 data of a new sample and thereby assess the likelihood that the new sample has liver cancer.
On the training set, the ROC curve of prediction model 38 shows good overall performance, with an AUC of 0.89, indicating that the model distinguishes liver cancer samples from non-liver cancer samples well. In 5-fold cross-validation on the training set, the mean AUC is 0.89 with a standard deviation of 0.01, which meets the set performance target and shows that the model is stable and effective, supporting its practical application. On the test set, the ROC curve also shows good overall performance, with an AUC of 0.89, so the model retains its ability to distinguish liver cancer samples from non-liver cancer samples. The test-set AUC is close to the training-set AUC, which indicates that the model is not overfitted and generalizes well. Overall, the model performs well on the test set and can proceed to clinical validation and application.
Example 40 establishment and verification of logistic regression prediction model 39
Prediction model 39 is obtained from prediction model 38 by adding the variables hsa-miR-27a, log10(ALT), log10(AST) and GGT. The mathematical expression of prediction model 39 is: Logit(P) = β1×log10(AFP) + β2×Gender + β3×Age + β4×hsa-miR-21 + β5×hsa-miR-27a + β6×hsa-miR-122 + β7×hsa-miR-223 + β8×log10(PLT) + β9×log10(INR) + β10×log10(TB) + β11×log10(ALT) + β12×log10(AST) + β13×GGT + β0, wherein the value range of the coefficient β1 is [1.461, 1.639]; the value range of the coefficient β2 is [0.96, 1.208]; the value range of the coefficient β3 is [0.034, 0.043]; the value range of the coefficient β4 is [0.047, 0.222]; the value range of the coefficient β5 is [-0.263, -0.074]; the value range of the coefficient β6 is [-0.447, -0.361]; the value range of the coefficient β7 is [0.119, 0.222]; the value range of the coefficient β8 is [-3.495, -2.905]; the value range of the coefficient β9 is [4.68, 7.448]; the value range of the coefficient β10 is [-0.997, -0.556]; the value range of the coefficient β11 is [-0.329, -0.134]; the value range of the coefficient β12 is [0.078, 0.325]; the value range of the coefficient β13 is [-0.002, -0.001]; the value range of the coefficient β0 is [8.546, 12.332]. Preferably, the coefficient β1 is 1.55; the coefficient β2 is 1.08; the coefficient β3 is 0.04; the coefficient β4 is 0.13; the coefficient β5 is -0.17; the coefficient β6 is -0.40; the coefficient β7 is 0.17; the coefficient β8 is -3.20; the coefficient β9 is 6.06; the coefficient β10 is -0.78; the coefficient β11 is -0.23; the coefficient β12 is 0.20; the coefficient β13 is -0.0013; the coefficient β0 is 10.44. The established logistic regression model can make a prediction from the AFP, PLT, Gender, Age, INR, TB, ALT, AST, GGT, hsa-miR-21, hsa-miR-27a, hsa-miR-223 and hsa-miR-122 data of a new sample and thereby assess the likelihood that the new sample has liver cancer.
On the training set, the ROC curve of prediction model 39 shows good overall performance, with an AUC of 0.89, indicating that the model discriminates well between liver cancer samples and non-liver cancer samples. In 5-fold cross-validation on the training set, the average AUC is 0.89 with a standard deviation of 0.01, which meets the set performance target and shows that the model is stable and effective, supporting its practical application. On the test set, the ROC curve also shows good overall performance, with an AUC of 0.89, indicating that the model retains its ability to distinguish liver cancer samples from non-liver cancer samples. The test-set AUC is essentially the same as the training-set AUC, which indicates that the model is not overfitted and has good generalization capability. Taken together, the model has good prediction performance on the test set and can proceed to clinical verification and application.
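As an illustration of how prediction model 39 could be applied to a new sample, the following minimal sketch evaluates logit (P) with the preferred coefficients listed above and converts it to a probability with the standard logistic function P = 1/(1 + e^(-logit)). The function names, the dictionary keys, the coding of Gender as 1 or 0, the units of each index, and the example values are assumptions made for illustration; this section does not specify them.

```python
import math

# Preferred coefficients of prediction model 39, copied from the text above.
COEFFICIENTS_39 = {
    "log10_AFP": 1.55,
    "Gender": 1.08,
    "Age": 0.04,
    "hsa-miR-21": 0.13,
    "hsa-miR-27a": -0.17,
    "hsa-miR-122": -0.40,
    "hsa-miR-223": 0.17,
    "log10_PLT": -3.20,
    "log10_INR": 6.06,
    "log10_TB": -0.78,
    "log10_ALT": -0.23,
    "log10_AST": 0.20,
    "GGT": -0.0013,
}
INTERCEPT_39 = 10.44  # beta0

# Indexes that enter the model after Log10 conversion.
LOG10_INDEXES = ("AFP", "PLT", "INR", "TB", "ALT", "AST")


def logit_model_39(sample):
    """Return logit(P) for one sample given as a dict of raw index values."""
    features = dict(sample)
    for name in LOG10_INDEXES:
        features["log10_" + name] = math.log10(sample[name])
    return INTERCEPT_39 + sum(
        beta * features[key] for key, beta in COEFFICIENTS_39.items()
    )


def probability(logit):
    """Convert a logit value to a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-logit))


# Invented example values, for illustration only (units are assumptions).
example = {
    "AFP": 20.0, "PLT": 150.0, "INR": 1.1, "TB": 15.0,
    "ALT": 30.0, "AST": 35.0, "GGT": 40.0,
    "Gender": 1, "Age": 55,
    "hsa-miR-21": 5.0, "hsa-miR-27a": 6.0,
    "hsa-miR-122": 4.0, "hsa-miR-223": 3.0,
}
print(round(probability(logit_model_39(example)), 4))
```

The same pattern applies to the other prediction models by swapping in their coefficient tables; any coefficient inside the stated value ranges could be used in place of the preferred values.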
Example 41 establishment and verification of logistic regression prediction model 40
Prediction model 40 is built on prediction model 39 by replacing the variable hsa-miR-122 with the variable hsa-miR-192. The mathematical expression of prediction model 40 is: logit (P) =β1×log10 (AFP) +β2×Gender+β3×Age+β4×hsa-miR-21+β5×hsa-miR-27a+β6×hsa-miR-192+β7×hsa-miR-223+β8×log10 (PLT) +β9×log10 (INR) +β10×log10 (TB) +β11×log10 (ALT) +β12×log10 (AST) +β13×GGT+β0, wherein the value range of the coefficient β1 is [1.453,1.63]; the value range of the coefficient β2 is [0.946,1.192]; the value range of the coefficient β3 is [0.037,0.047]; the value range of the coefficient β4 is [0.069,0.249]; the value range of the coefficient β5 is [-0.241,-0.047]; the value range of the coefficient β6 is [-0.538,-0.414]; the value range of the coefficient β7 is [0.126,0.233]; the value range of the coefficient β8 is [-3.532,-2.939]; the value range of the coefficient β9 is [5.291,8.103]; the value range of the coefficient β10 is [-1.151,-0.708]; the value range of the coefficient β11 is [-0.177,0.01]; the value range of the coefficient β12 is [0.043,0.291]; the value range of the coefficient β13 is [-0.002,-0.001]; the value range of the coefficient β0 is [9.347,13.233]. Preferably, the coefficient β1 is 1.54; the coefficient β2 is 1.07; the coefficient β3 is 0.04; the coefficient β4 is 0.16; the coefficient β5 is -0.14; the coefficient β6 is -0.48; the coefficient β7 is 0.18; the coefficient β8 is -3.24; the coefficient β9 is 6.70; the coefficient β10 is -0.93; the coefficient β11 is -0.08; the coefficient β12 is 0.17; the coefficient β13 is -0.0012; the coefficient β0 is 11.29. The established logistic regression model can make predictions from the AFP, PLT, Gender, Age, INR, TB, ALT, AST, GGT, hsa-miR-21, hsa-miR-27a, hsa-miR-223 and hsa-miR-192 data of a new sample, and thereby judge the possibility that the new sample suffers from liver cancer.
On the training set, the ROC curve of prediction model 40 shows good overall performance, with an AUC of 0.89, indicating that the model discriminates well between liver cancer samples and non-liver cancer samples. In 5-fold cross-validation on the training set, the average AUC is 0.89 with a standard deviation of 0.00, which meets the set performance target and shows that the model is stable and effective, supporting its practical application. On the test set, the ROC curve also shows good overall performance, with an AUC of 0.89, indicating that the model retains its ability to distinguish liver cancer samples from non-liver cancer samples. The test-set AUC is essentially the same as the training-set AUC, which indicates that the model is not overfitted and has good generalization capability. Taken together, the model has good prediction performance on the test set and can proceed to clinical verification and application.
Example 42 establishment and verification of logistic regression prediction model 41
Based on a prediction model 40, the prediction model 41 replaces four variables of hsa-miR-21, hsa-miR-27a, hsa-miR-192 and hsa-miR-223 with variable miRNA7, and the mathematical expression of the prediction model 41 is as follows: logit (P) =β1×log10 (AFP) +β2×Gender+β3×Age+β4×miRNA7+β5×log10 (PLT) +β6×log10 (INR) +β7×log10 (TB) +β8×log10 (ALT) +β9×log10 (AST) +β10×GGT+β0, wherein the coefficient β1 has a value in the range [1.472,1.649]; the value range of the coefficient beta 2 is [0.955,1.2]; the value range of the coefficient beta 3 is [0.034,0.043]; the value range of the coefficient beta 4 is [0.666,0.841]; the value range of the coefficient beta 5 is [ -3.384, -2.793]; the value range of the coefficient beta 6 is [4.526,7.28]; the value range of the coefficient beta 7 is [ -0.982, -0.543]; the value range of the coefficient beta 8 is [ -0.242, -0.054]; the value range of the coefficient beta 9 is [0.088,0.328]; the value range of the coefficient beta 10 is [ -0.002, -0.001]; the value range of the coefficient beta 0 is [2.595,4.325]; preferably, the coefficient β1 is 1.56; the coefficient β2 is 1.08; the coefficient β3 is 0.04; the coefficient β4 is 0.75; the coefficient beta 5 is-3.09; the coefficient β6 is 5.90; the coefficient beta 7 is-0.76; the coefficient beta 8 is-0.15; the coefficient β9 is 0.21; the coefficient beta 10 is-0.0013; the coefficient β0 is 3.46. The established logistic regression model can predict based on the AFP, PLT, gender, age, INR, TB, ALT, AST, GGT and miRNA7 data of the new sample, and judge the possibility of the new sample suffering from liver cancer.
On the training set, the ROC curve of prediction model 41 shows good overall performance, with an AUC of 0.89, indicating that the model discriminates well between liver cancer samples and non-liver cancer samples. In 5-fold cross-validation on the training set, the average AUC is 0.89 with a standard deviation of 0.01, which meets the set performance target and shows that the model is stable and effective, supporting its practical application. On the test set, the ROC curve also shows good overall performance, with an AUC of 0.90, indicating that the model retains its ability to distinguish liver cancer samples from non-liver cancer samples. The test-set AUC differs from the training-set AUC by only 0.01, which indicates that the model is not overfitted and has good generalization capability. Taken together, the model has good prediction performance on the test set and can proceed to clinical verification and application.
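The AUC values quoted for the training set, the 5-fold cross-validation and the test set can be computed from predicted scores and true group labels. The sketch below uses the rank-based (Mann-Whitney) formulation of AUC and a population standard deviation to summarise per-fold results. It is not necessarily the tooling used by the applicants; the function names, the choice of population standard deviation, and all data values are assumptions for illustration.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for the liver cancer group, 0 for the non-liver cancer group.
    scores: model outputs (logit or probability); ties count as one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one sample of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def mean_and_std(values):
    """Mean and population standard deviation of per-fold AUC values."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance ** 0.5


# Invented labels and scores, for illustration only.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(roc_auc(labels, scores))
```

Comparing the training-set AUC with the test-set AUC, as done in the examples above, is a simple check for overfitting: a test-set AUC far below the training-set AUC would suggest poor generalization.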
The 41 models involve different sets of indexes, so they can be applied to the various analysis groups obtained by classifying samples according to index type and number. Preferably, prediction models 1 and 2 obtain accurate prediction results with as few as 2 indexes, and are simple, convenient and low in cost.
In some embodiments, after step S90 shown in fig. 1, the method further includes adjusting a logic threshold, a risk score threshold, or a cutoff value of the model prediction result to ensure detection accuracy of the model in different people.
Fig. 6 is a block diagram of a system for assessing a subject's likelihood of suffering from hepatocellular carcinoma in accordance with one embodiment of the present application. Referring to FIG. 6, the system of this embodiment includes a data acquisition module 610, a data preprocessing module 620, a model recommendation module 630, a model selection module 640, and a risk assessment module 650.
The data acquisition module 610 is configured to acquire sample data of a subject, where the sample data includes the expression level in plasma of the microRNAs hsa-miR-122, hsa-miR-192, hsa-miR-21, hsa-miR-223, hsa-miR-26a, hsa-miR-27a and hsa-miR-801, and any one of the Age, the Gender, a tumor marker detection result and a hematology index detection result of the subject. The tumor marker detection result includes alpha-fetoprotein AFP, and the hematology index detection result includes any one of platelet count PLT, total bilirubin TB, gamma-glutamyl transferase GGT, prothrombin time PT, international normalized ratio INR, aspartate aminotransferase AST and alanine aminotransferase ALT. These sample data and the corresponding subjects may come from different departments or populations. The populations may include healthy people, hepatocellular carcinoma patients, patients with benign liver disease, metastatic liver cancer patients, hepatitis patients, cirrhosis patients, patients with benign liver tumors, cholangiocarcinoma patients, etc. In some embodiments, AFP is non-null, one of hsa-miR-122 and hsa-miR-192 is non-null, and the other indexes may be null; more preferably, the non-null indexes are Gender, Age, platelet count (PLT), total bilirubin (TB), alpha-fetoprotein (AFP), international normalized ratio (INR), hsa-miR-122, hsa-miR-21 and hsa-miR-223.
The data preprocessing module 620 is configured to preprocess the sample data, including removing duplicate samples, unifying units, and classifying the sample data into different analysis groups according to index type and number. Unifying units means converting values to a common unit of concentration or mass. In some embodiments, the data preprocessing module 620 is further configured to calculate the miRNA7™ detection result according to the miRNA7™ model formula, perform Log10 conversion on the AFP detection result and the PLT detection result, label samples that do not meet the analysis requirements, and the like.
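The preprocessing steps described for module 620 can be sketched roughly as follows. This is a minimal illustration only: identifying duplicates by a `sample_id` field, the field names, the flag text, and the representation of an analysis group as the sorted tuple of non-null indexes are all assumptions, and unit unification and the miRNA7™ calculation are omitted because their details are not given in this section.

```python
import math

# Index names used for grouping; taken from the indexes named in the text.
INDEX_NAMES = (
    "AFP", "PLT", "TB", "GGT", "PT", "INR", "AST", "ALT", "Age", "Gender",
    "hsa-miR-122", "hsa-miR-192", "hsa-miR-21", "hsa-miR-223",
    "hsa-miR-26a", "hsa-miR-27a", "hsa-miR-801",
)


def preprocess(samples):
    """Remove duplicates, Log10-convert AFP and PLT, and assign analysis groups."""
    seen_ids = set()
    cleaned = []
    for sample in samples:
        sample_id = sample["sample_id"]  # hypothetical identifier field
        if sample_id in seen_ids:
            continue  # drop duplicate samples
        seen_ids.add(sample_id)

        row = dict(sample)
        row["flag"] = None
        for name in ("AFP", "PLT"):
            value = row.get(name)
            if value is None:
                row["log10_" + name] = None
            elif value <= 0:
                # Log10 is undefined here, so label the sample instead.
                row["log10_" + name] = None
                row["flag"] = "non-positive " + name
            else:
                row["log10_" + name] = math.log10(value)

        # Samples with the same set of non-null indexes share a group.
        row["analysis_group"] = tuple(
            sorted(k for k in INDEX_NAMES if row.get(k) is not None)
        )
        cleaned.append(row)
    return cleaned


# Invented records, for illustration only.
records = [
    {"sample_id": "S1", "AFP": 100.0, "PLT": 1000.0, "hsa-miR-122": 4.2},
    {"sample_id": "S1", "AFP": 100.0, "PLT": 1000.0, "hsa-miR-122": 4.2},
    {"sample_id": "S2", "AFP": 0.0, "PLT": 150.0},
]
for row in preprocess(records):
    print(row["sample_id"], row["analysis_group"], row["flag"])
```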
The model recommendation module 630 is configured to recommend, for each analysis group, several prediction models with the largest AUC values from the 41 prediction models established by the aforementioned construction method, according to the types and number of indexes contained in the analysis group. The model formula with the largest AUC value is recommended first, and the recommended formulas are displayed on the user interface as a list sorted by model AUC value in descending order.
The model selection module 640 is used to provide a selection function to the user to output one or more predictive models for risk assessment and calculation from among the predictive models recommended by the model recommendation module 630. The user may not select, and the system may default to selecting the model with the greatest AUC value.
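The recommendation and selection logic of modules 630 and 640 can be sketched as a filter followed by a sort. In the sketch below, the catalog holds only prediction models 39 to 41, with the index lists and test-set AUC values reported in Examples 40 to 42; ranking by test-set AUC, the `top_n` parameter, and all function and field names are assumptions, since this section does not state which AUC is used for ranking or how many models are listed.

```python
BASE = {"AFP", "Gender", "Age", "PLT", "INR", "TB", "ALT", "AST", "GGT"}

# Index sets and test-set AUC values as reported in Examples 40 to 42.
CATALOG = [
    {
        "name": "prediction model 39",
        "auc": 0.89,
        "indexes": BASE | {"hsa-miR-21", "hsa-miR-27a", "hsa-miR-122", "hsa-miR-223"},
    },
    {
        "name": "prediction model 40",
        "auc": 0.89,
        "indexes": BASE | {"hsa-miR-21", "hsa-miR-27a", "hsa-miR-192", "hsa-miR-223"},
    },
    {
        "name": "prediction model 41",
        "auc": 0.90,
        "indexes": BASE | {"miRNA7"},
    },
]


def recommend_models(available_indexes, catalog=CATALOG, top_n=3):
    """List models whose indexes are all available, largest AUC first."""
    available = set(available_indexes)
    usable = [m for m in catalog if m["indexes"] <= available]
    return sorted(usable, key=lambda m: m["auc"], reverse=True)[:top_n]


def select_model(recommended, chosen_name=None):
    """Return the user's choice, or the top-ranked model if none is chosen."""
    if not recommended:
        return None
    if chosen_name is None:
        return recommended[0]  # default: the model with the greatest AUC
    for model in recommended:
        if model["name"] == chosen_name:
            return model
    raise ValueError("model was not in the recommended list: " + chosen_name)


# A sample that lacks hsa-miR-192 and miRNA7 can only use prediction model 39.
available = BASE | {"hsa-miR-21", "hsa-miR-27a", "hsa-miR-122", "hsa-miR-223"}
print([m["name"] for m in recommend_models(available)])
```

Filtering on available indexes before ranking reflects the point made in the text that some samples cannot supply every index, so the best usable model differs between analysis groups.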
The risk assessment module 650 is configured to calculate a corresponding prediction value according to the one or more prediction models output by the model selection module 640, to provide an appropriate logic threshold or risk score threshold according to the department or population information corresponding to the sample data, and to classify the risk level as any one of high likelihood, medium likelihood, and low likelihood according to the logic threshold or risk score threshold. In some embodiments, the logic threshold or risk score threshold comprises 1 threshold for two-level classification, i.e., dividing samples into high likelihood and low likelihood. In some embodiments, the logic threshold or risk score threshold comprises 2 thresholds for three-level classification, i.e., dividing samples into high likelihood, medium likelihood, and low likelihood. In other embodiments, finer subdivisions may be made, and this application is not limited in this respect.
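The threshold-based grading performed by module 650 can be sketched as below. The threshold values, the population names in the lookup table, and the convention that a value equal to a threshold falls into the higher level are hypothetical; the text only states that 1 threshold gives two levels and 2 thresholds give three levels, and that thresholds may depend on the department or population.

```python
# Hypothetical thresholds per population, for illustration only.
THRESHOLDS_BY_POPULATION = {
    "health check-up": [0.5],        # 1 threshold: high / low
    "hepatology clinic": [0.3, 0.7],  # 2 thresholds: high / medium / low
}


def risk_level(prediction_value, thresholds):
    """Map a prediction value to a likelihood level using 1 or 2 thresholds."""
    cuts = sorted(thresholds)
    if len(cuts) == 1:
        return "high likelihood" if prediction_value >= cuts[0] else "low likelihood"
    if len(cuts) == 2:
        low_cut, high_cut = cuts
        if prediction_value >= high_cut:
            return "high likelihood"
        if prediction_value >= low_cut:
            return "medium likelihood"
        return "low likelihood"
    raise ValueError("this sketch supports only 1 or 2 thresholds")


def assess(prediction_value, population):
    """Grade a sample using the thresholds assigned to its population."""
    return risk_level(prediction_value, THRESHOLDS_BY_POPULATION[population])


print(assess(0.62, "health check-up"))
print(assess(0.62, "hepatology clinic"))
```

Keeping the thresholds in a per-population table makes it straightforward to adjust cutoff values for different groups without touching the prediction model itself.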
The system 600 can quantitatively obtain biomarker information from a subject sample and input it into a model for liver cancer risk assessment, which avoids subjective assessment errors and allows the possibility of hepatocellular carcinoma in different populations to be assessed and predicted effectively. Unlike existing hepatocellular carcinoma screening systems, the system 600 can give the liver cancer risk level quantitatively, automatically and continuously, and can also improve the accuracy and efficiency of assessment. Because some samples in practical application cannot provide a complete set of indexes, the system 600 can provide the most suitable prediction model for the indexes available. In addition, the system 600 provides a model recommendation module 630 and a model selection module 640, which make the system more flexible and meet the diverse application requirements of clients. Furthermore, the system 600 can apply the most appropriate threshold for risk assessment based on the department or population from which the sample was derived.
The application also provides a kit for evaluating the possibility of a subject suffering from hepatocellular carcinoma, wherein the kit comprises a prediction model established by the above method for constructing the model. The prediction model may be integrated in an integrated circuit in the kit. Taking prediction model 1 as an example, the kit can receive AFP and hsa-miR-122 values and calculate a prediction result according to the built-in prediction model 1, thereby giving the possibility that the subject suffers from hepatocellular carcinoma. The kit can simultaneously comprise the 41 different prediction models, so that it is suitable for different populations and has very wide applicability.
The application also includes an electronic device including a memory and a processor. Wherein the memory is for storing instructions executable by the processor; the processor is configured to execute the instructions to implement the method of constructing the model for assessing the likelihood of a subject suffering from hepatocellular carcinoma as described above.
Fig. 7 is a system block diagram of an electronic device according to an embodiment of the present application. As shown in fig. 7, the electronic device 700 may include an internal communication bus 701, a processor 702, a Read Only Memory (ROM) 703, a Random Access Memory (RAM) 704, and a communication port 705. The electronic device 700 may also include a hard disk 706 when applied to a personal computer. Internal communication bus 701 may enable data communication between the components of the electronic device 700. The processor 702 may make the determination and issue a prompt. In some embodiments, the processor 702 may be comprised of one or more processors. The communication port 705 may enable the electronic device 700 to communicate data with the outside. In some embodiments, the electronic device 700 may send and receive information and data from a network through the communication port 705. The electronic device 700 may also include various forms of program storage elements and data storage elements such as hard disk 706, read Only Memory (ROM) 703 and Random Access Memory (RAM) 704 capable of storing various data files for computer processing and/or communication, as well as possible program instructions for execution by the processor 702. The processor executes these instructions to implement the main part of the method. The result processed by the processor is transmitted to the user equipment through the communication port and displayed on the user interface.
The above-described method of constructing a model may be implemented as a computer program, stored in the hard disk 706, and loaded into the processor 702 for execution, so as to implement the method of constructing a model of the present application.
The present application also includes a computer readable medium storing computer program code which, when executed by a processor, implements the method of constructing a model as described above.
When the method for constructing a model for evaluating the possibility of a subject suffering from hepatocellular carcinoma is implemented as a computer program, it may also be stored in a computer-readable storage medium as an article of manufacture. For example, computer-readable storage media may include, but are not limited to, magnetic storage devices (e.g., hard disk, floppy disk, magnetic strips), optical disks (e.g., compact disk (CD), digital versatile disk (DVD)), smart cards, and flash memory devices (e.g., erasable programmable read-only memory (EPROM), cards, sticks, key drives). Moreover, various storage media described herein can represent one or more devices and/or other machine-readable media for storing information. The term "machine-readable medium" can include, without being limited to, wireless channels and various other media (and/or storage media) capable of storing, containing, and/or carrying code and/or instructions and/or data.
It should be understood that the embodiments described above are illustrative only. The embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or any combination thereof. For a hardware implementation, the processors may be implemented within one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), processors, controllers, micro-controllers, microprocessors, and/or other electronic units designed to perform the functions described herein, or a combination thereof.
Some aspects of the present application may be performed entirely by hardware, entirely by software (including firmware, resident software, micro-code, etc.) or by a combination of hardware and software. The above hardware or software may be referred to as a "data block," "module," "engine," "unit," "component," or "system." The processor may be one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), processors, controllers, microcontrollers, microprocessors, or a combination thereof. Furthermore, aspects of the present application may take the form of a computer product, comprising computer-readable program code, embodied in one or more computer-readable media. For example, computer-readable media can include, but are not limited to, magnetic storage devices (e.g., hard disk, floppy disk, tape, etc.), optical disks (e.g., compact disk CD, digital versatile disk DVD, etc.), smart cards, and flash memory devices (e.g., card, stick, key drive, etc.).
The computer readable medium may comprise a propagated data signal with the computer program code embodied therein, for example, on a baseband or as part of a carrier wave. The propagated signal may take on a variety of forms, including electro-magnetic, optical, etc., or any suitable combination thereof. A computer readable medium can be any computer readable medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code located on a computer readable medium may be propagated through any suitable medium, including radio, cable, fiber optic cable, radio frequency signals, or the like, or a combination of any of the foregoing.
While the basic concepts have been described above, it will be apparent to those skilled in the art that the above disclosure is by way of example only and is not intended to be limiting. Although not explicitly described herein, various modifications, improvements, and adaptations of the present application may occur to one skilled in the art. Such modifications, improvements, and adaptations are suggested by this application, and are therefore within the spirit and scope of the exemplary embodiments of this application.
Meanwhile, the present application uses specific words to describe embodiments of the present application. Reference to "one embodiment," "an embodiment," and/or "some embodiments" means that a particular feature, structure, or characteristic is associated with at least one embodiment of the present application. Thus, it should be emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various positions in this specification are not necessarily referring to the same embodiment. Furthermore, certain features, structures, or characteristics of one or more embodiments of the present application may be combined as suitable.
In some embodiments, numbers describing quantities of components and attributes are used; it should be understood that such numbers used in the description of embodiments are modified in some examples by the modifier "about," "approximately," or "substantially." Unless otherwise indicated, "about," "approximately," or "substantially" indicates that the number allows for a 20% variation. Accordingly, in some embodiments, numerical parameters set forth in the specification and claims are approximations that may vary depending upon the desired properties sought to be obtained by the individual embodiments. In some embodiments, the numerical parameters should take into account the specified significant digits and employ a general method of retaining digits. Although the numerical ranges and parameters used in some embodiments to confirm the breadth of the range are approximations, in particular embodiments such numerical values are set as precisely as possible.

Claims (48)

1. A method of constructing a model for assessing a subject's likelihood of developing hepatocellular carcinoma, comprising:
step S10: obtaining a plurality of independent variables for a model, the plurality of independent variables comprising: the expression level in plasma of any one of the microRNAs hsa-miR-122, hsa-miR-192, hsa-miR-21, hsa-miR-223, hsa-miR-26a, hsa-miR-27a and hsa-miR-801, and the Age, Gender, tumor marker detection result and blood index detection result of a subject, wherein the tumor marker detection result comprises alpha-fetoprotein AFP, and the blood index detection result comprises any one of platelet count PLT, total bilirubin TB, gamma-glutamyl transferase GGT, prothrombin time PT, international normalized ratio INR, aspartate aminotransferase AST and alanine aminotransferase ALT; and performing structured processing on the plurality of independent variables;
step S20: dividing the subject into a liver cell liver cancer group and a non-liver cell liver cancer group, and respectively encoding to 1 and 0 to form dependent variables for a model;
step S30: screening the plurality of independent variables to obtain effective independent variables;
step S40: performing correlation analysis on the plurality of independent variables to determine the relationship among the plurality of independent variables;
Step S50: comparing the correlation coefficient between each independent variable before and after performing Log10 conversion and the dependent variable, and determining whether to perform Log10 conversion on the independent variable;
step S60: dividing data into a training set and a testing set, constructing a regression model containing the independent variables according to the training set, and testing the performance of the regression model according to the testing set, wherein the data comprises the plurality of independent variables and the dependent variables;
step S70: further screening the effective independent variables according to any one of p-value, correlation coefficient and AUC value of the regression model on a training set and a test set of each independent variable in regression analysis results to obtain screened effective independent variables, wherein when the p-value of the independent variable is smaller than 0.05, the independent variable with the correlation coefficient larger than a first threshold value is taken as the effective independent variable;
step S80: repeatedly executing the step S70 until the AUC value of the regression model on the training set is reduced to be near 0.8, wherein the number of the screened effective independent variables is more than or equal to 2, and the screened effective independent variables at least comprise one microRNA, so as to obtain a plurality of first models;
Step S90: and replacing and combining the independent variables with the replacement relation in the plurality of first models according to the relation among the plurality of independent variables obtained in the step S40, and constructing a regression model by adopting the replaced and combined independent variables to obtain a plurality of second models, wherein the models comprise any one of the first models and the second models.
2. The construction method according to claim 1, wherein the method of screening the plurality of independent variables to obtain the effective independent variable in the step S30 comprises the following method one and/or method two:
method one: calculating a correlation coefficient between each independent variable and the dependent variable, and taking each independent variable whose correlation coefficient is larger than a second threshold value as an effective independent variable;
method two: performing a t-test on each independent variable between the liver cell liver cancer group and the non-liver cell liver cancer group, and taking each independent variable whose p-value in the analysis result is smaller than a third threshold value as an effective independent variable.
3. The construction method according to claim 1, wherein the relationship between the plurality of independent variables determined according to the step S40 includes: hsa-miR-122 has a substitution relation with hsa-miR-192; hsa-miR-21, hsa-miR-26a and hsa-miR-27a have a substitution relation with one another; hsa-miR-223 has no substitution relation with other microRNAs; hsa-miR-801 has no substitution relation with other microRNAs; ALT has a substitution relation with AST; and PT has a substitution relation with INR.
4. The method of construction of claim 1, wherein the model has effective independent variables of log10 (AFP) and hsa-miR-122, and wherein the model has a predictive formula of: logit (P) =β1×log10 (AFP) +β2×hsa-miR-122+β0, wherein the value range of the coefficient β1 is [1.515,1.68]; the value range of the coefficient beta 2 is [ -0.325, -0.269]; the coefficient β0 has a value in the range of [5.027,6.354].
5. The method of construction of claim 1, wherein the effective independent variables of the model are log10 (AFP) and hsa-miR-192, and the model has a predictive formula of: logit (P) =β1×log10 (AFP) +β2×hsa-miR-192+β0, wherein the value range of the coefficient β1 is [1.526,1.690]; the value range of the coefficient beta 2 is [ -0.327, -0.252]; the coefficient β0 has a value in the range of [5.255,7.23].
6. The method of claim 1, wherein the model has effective independent variables of Gender, log10 (AFP) and hsa-miR-122, and wherein the model has a predictive formula of: logit (P) =β1×Gender+β2×log10 (AFP) +β3×hsa-miR-122+β0, wherein the value range of the coefficient β1 is [0.957,1.171]; the value range of the coefficient beta 2 is [1.520,1.687]; the value range of the coefficient beta 3 is [ -0.282, -0.225]; the coefficient β0 has a value in the range of [3.153,4.549].
7. The method of claim 1, wherein the model has effective independent variables of Gender, log10 (AFP) and hsa-miR-192, and wherein the model has a predictive formula of: logit (P) =β1×Gender+β2×log10 (AFP) +β3×hsa-miR-192+β0, wherein the value range of the coefficient β1 is [1.012,1.225]; the value range of the coefficient beta 2 is [1.538,1.706]; the value range of the coefficient beta 3 is [ -0.265, -0.189]; the coefficient β0 has a value in the range of [2.723,4.78].
8. The method of claim 1, wherein the model has effective independent variables of log10 (AFP), hsa-miR-122, and log10 (PLT), and wherein the model has a predictive formula of: logit (P) =β1×log10 (AFP) +β2×hsa-miR-122+β3×log10 (PLT) +β0, wherein the value range of the coefficient β1 is [1.446,1.615]; the value range of the coefficient beta 2 is [ -0.297, -0.238]; the value range of the coefficient beta 3 is [ -4.515, -3.983]; the coefficient β0 has a value in the range of [13.611,15.439].
9. The method of claim 1, wherein the model has effective independent variables of log10 (AFP), hsa-miR-192, and log10 (PLT), and wherein the model has a predictive formula of: logit (P) =β1×log10 (AFP) +β2×hsa-miR-192+β3×log10 (PLT) +β0, wherein the value range of the coefficient β1 is [1.457,1.625]; the value range of the coefficient beta 2 is [ -0.302, -0.223]; the value range of the coefficient beta 3 is [ -4.628, -4.094]; the coefficient β0 has a value in the range of [14.111,16.522].
10. The method of claim 1, wherein the model has effective independent variables of Gender, Age, log10 (AFP) and hsa-miR-122, and wherein the model has a predictive formula of: logit (P) =β1×Gender+β2×Age+β3×log10 (AFP) +β4×hsa-miR-122+β0, wherein the coefficient β1 has a value in the range of [0.975,1.196]; the value range of the coefficient beta 2 is [0.041,0.050]; the value range of the coefficient beta 3 is [1.559,1.731]; the value range of the coefficient beta 4 is [ -0.335, -0.275]; the coefficient β0 has a value in the range of [1.709,3.157].
11. The method of claim 1, wherein the model has effective independent variables of Gender, Age, log10 (AFP) and hsa-miR-192, and wherein the model has a predictive formula of: logit (P) =β1×Gender+β2×Age+β3×log10 (AFP) +β4×hsa-miR-192+β0, wherein the coefficient β1 has a value in the range of [1.003,1.222]; the value range of the coefficient beta 2 is [0.044,0.052]; the value range of the coefficient beta 3 is [1.568,1.739]; the value range of the coefficient beta 4 is [ -0.39, -0.308]; the coefficient β0 has a value in the range of [3.146,5.262].
12. The method of claim 1, wherein the model has effective independent variables of log10 (AFP), Gender, hsa-miR-122, and log10 (PLT), and wherein the model has a predictive formula of: logit (P) =β1×log10 (AFP) +β2×Gender+β3×hsa-miR-122+β4×log10 (PLT) +β0, wherein the coefficient β1 has a value in the range of [0.975,1.196]; the value range of the coefficient beta 2 is [0.041,0.050]; the value range of the coefficient beta 3 is [1.559,1.731]; the value range of the coefficient beta 4 is [ -0.335, -0.275]; the coefficient β0 has a value in the range of [1.709,3.157].
13. The method of claim 1, wherein the model has effective independent variables of log10 (AFP), Gender, hsa-miR-192, and log10 (PLT), and wherein the model has a predictive formula of: logit (P) =β1×log10 (AFP) +β2×Gender+β3×hsa-miR-192+β4×log10 (PLT) +β0, wherein the coefficient β1 has a value in the range of [1.461,1.631]; the value range of the coefficient beta 2 is [0.914,1.144]; the value range of the coefficient beta 3 is [ -0.250, -0.169]; the value range of the coefficient beta 4 is [ -4.490, -3.950]; the coefficient β0 has a value in the range of [11.584,14.070].
14. The method of claim 1, wherein the model has effective independent variables of log10 (AFP), Gender, Age, hsa-miR-122, and log10 (PLT), and wherein the model has a predictive formula of: logit (P) =β1×log10 (AFP) +β2×Gender+β3×Age+β4×hsa-miR-122+β5×log10 (PLT) +β0, wherein the coefficient β1 has a value in the range of [1.481,1.654]; the value range of the coefficient beta 2 is [0.893,1.128]; the value range of the coefficient beta 3 is [0.033,0.041]; the value range of the coefficient beta 4 is [ -0.309, -0.247]; the value range of the coefficient beta 5 is [ -4.050, -3.512]; the coefficient β0 has a value in the range of [9.842,11.781].
15. The method of claim 1, wherein the model has effective independent variables of log10 (AFP), Gender, Age, hsa-miR-192, and log10 (PLT), and wherein the model has a predictive formula of: logit (P) =β1×log10 (AFP) +β2×Gender+β3×Age+β4×hsa-miR-192+β5×log10 (PLT) +β0, wherein the coefficient β1 has a value in the range of [1.489,1.663]; the value range of the coefficient beta 2 is [0.916,1.150]; the value range of the coefficient beta 3 is [0.034,0.043]; the value range of the coefficient beta 4 is [ -0.354, -0.268]; the value range of the coefficient beta 5 is [ -4.109, -3.568]; the coefficient β0 has a value in the range of [11.149,13.662].
16. The method of construction of claim 1, wherein the model has effective independent variables of Gender, log10 (AFP), hsa-miR-21, hsa-miR-122, and hsa-miR-223, and wherein the model has a predictive formula of: logit (P) =β1×Gender+β2×log10 (AFP) +β3×hsa-miR-21+β4×hsa-miR-122+β5×hsa-miR-223+β0, wherein the value range of the coefficient β1 is [0.925,1.143]; the value range of the coefficient beta 2 is [1.510,1.679]; the value range of the coefficient beta 3 is [ -0.076,0.034]; the value range of the coefficient beta 4 is [ -0.335, -0.269]; the value range of the coefficient beta 5 is [0.234,0.317]; the value range of the coefficient beta 0 is [ -2.089,0.526].
17. The method of construction of claim 1, wherein the model has effective independent variables of Gender, log10 (AFP), hsa-miR-21, hsa-miR-192, and hsa-miR-223, and wherein the model has a predictive formula of: logit (P) =β1×Gender+β2×log10 (AFP) +β3×hsa-miR-21+β4×hsa-miR-192+β5×hsa-miR-223+β0, wherein the value range of the coefficient β1 is [0.963,1.178]; the value range of the coefficient beta 2 is [1.516,1.685]; the value range of the coefficient beta 3 is [ -0.101,0.016]; the value range of the coefficient beta 4 is [ -0.372, -0.274]; the value range of the coefficient beta 5 is [0.257,0.341]; the value range of the coefficient beta 0 is [ -0.760,1.905].
18. The method of construction of claim 1, wherein the model has effective independent variables of Gender, Age, log10 (AFP), hsa-miR-21, hsa-miR-122, and hsa-miR-223, and wherein the model has a predictive formula of: logit (P) =β1×Gender+β2×Age+β3×log10 (AFP) +β4×hsa-miR-21+β5×hsa-miR-122+β6×hsa-miR-223+β0, wherein the value range of the coefficient β1 is [0.949,1.172]; the value range of the coefficient beta 2 is [0.039,0.048]; the value range of the coefficient beta 3 is [1.548,1.720]; the value range of the coefficient beta 4 is [ -0.011,0.102]; the value range of the coefficient beta 5 is [ -0.397, -0.327]; the value range of the coefficient beta 6 is [0.172,0.257]; the value range of the coefficient beta 0 is [ -3.687, -1.011].
19. The method of construction of claim 1, wherein the model has effective arguments of Gender, age, log10 (AFP), hsa-miR-21, hsa-miR-192, and hsa-miR-223, and wherein the model has a predictive formula of: logit (P) =β1×Gender+β2×Age+β3×log10 (AFP) +β4×hsa-miR-21+β5×hsa-miR-192+β6×hsa-miR-223+β0, wherein the value range of the coefficient β1 is [0.959,1.182]; the value range of the coefficient beta 2 is [0.044,0.052]; the value range of the coefficient beta 3 is [1.544,1.716]; the value range of the coefficient beta 4 is [0.031,0.153]; the value range of the coefficient beta 5 is [ -0.549, -0.443]; the value range of the coefficient beta 6 is [0.203,0.290]; the value range of the coefficient beta 0 is [ -1.567,1.171].
20. The method of claim 1, wherein the model has effective arguments of log10 (AFP), gender, age, hsa-miR-122, log10 (PLT), and INR, and wherein the model has a predictive formula of: logit (P) =β1×log10 (AFP) +β2×Gender+β3×Age+β4×hsa-miR-122+β5×log10 (PLT) +β6×INR+β0, wherein the coefficient β1 has a value in the range of [1.469,1.642]; the value range of the coefficient beta 2 is [0.88,1.116]; the value range of the coefficient beta 3 is [0.033,0.041]; the value range of the coefficient beta 4 is [ -0.31, -0.247]; the value range of the coefficient beta 5 is [ -3.854, -3.284]; the value range of the coefficient beta 6 is [0.548,1.447]; the coefficient β0 has a value in the range of [8.136,10.487].
21. The method of claim 1, wherein the model has effective arguments of log10 (AFP), gender, age, hsa-miR-192, log10 (PLT), and INR, and wherein the model has a predictive formula of: logit (P) =β1×log10 (AFP) +β2×Gender+β3×Age+β4×hsa-miR-192+β5×log10 (PLT) +β6×INR+β0, wherein the coefficient β1 has a value in the range of [1.474,1.648]; the value range of the coefficient beta 2 is [0.902,1.136]; the value range of the coefficient beta 3 is [0.033,0.041]; the value range of the coefficient beta 4 is [0.034,0.043]; the value range of the coefficient beta 5 is [ -3.896, -3.324]; the value range of the coefficient beta 6 is [0.639,1.557]; the coefficient β0 has a value in the range of [9.440,12.261].
22. The method of claim 1, wherein the model has effective arguments of log10 (AFP), gender, age, hsa-miR-122, log10 (PLT), INR, and TB, and wherein the model has a predictive formula of: logit (P) =β1×log10 (AFP) +β2×Gender+β3×Age+β4×hsa-miR-122+β5×log10 (PLT) +β6×INR+β7×TB+β0, wherein the coefficient β1 has a value in the range of [1.471,1.645]; the value range of the coefficient beta 2 is [0.884,1.121]; the value range of the coefficient beta 3 is [0.034,0.043]; the value range of the coefficient beta 4 is [ -0.328, -0.265]; the value range of the coefficient beta 5 is [ -3.769, -3.194]; the value range of the coefficient beta 6 is [1.081,2.037]; the value range of the coefficient beta 7 is [ -0.01, -0.007]; the coefficient β0 has a value in the range of [7.832,10.218].
23. The method of claim 1, wherein the model has effective arguments of log10 (AFP), gender, age, hsa-miR-192, log10 (PLT), INR, and TB, and wherein the model has a predictive formula of: logit (P) =β1×log10 (AFP) +β2×Gender+β3×Age+β4×hsa-miR-192+β5×log10 (PLT) +β6×INR+β7×TB+β0, wherein the coefficient β1 has a value in the range of [1.475,1.650]; the value range of the coefficient beta 2 is [0.906,1.142]; the value range of the coefficient beta 3 is [0.036,0.045]; the value range of the coefficient beta 4 is [ -0.383, -0.296]; the value range of the coefficient beta 5 is [ -3.812, -3.235]; the value range of the coefficient beta 6 is [1.179,2.158]; the value range of the coefficient beta 7 is [ -0.01, -0.007]; the coefficient β0 has a value in the range of [9.339,12.190].
24. The method of claim 1, wherein the model has effective arguments of log10 (AFP), gender, age, hsa-miR-122, hsa-miR-223, log10 (PLT), INR, and TB, and wherein the model has a predictive formula of: logit (P) =β1×log10 (AFP) +β2×Gender+β3×Age+β4×hsa-miR-122+β5×hsa-miR-223+β6×log10 (PLT) +β7×INR+β8×TB+β0, where the coefficient β1 has a value in the range [1.467,1.642]; the value range of the coefficient beta 2 is [0.871,1.109]; the value range of the coefficient beta 3 is [0.033,0.042]; the value range of the coefficient beta 4 is [ -0.355, -0.289]; the value range of the coefficient beta 5 is [0.086,0.168]; the value range of the coefficient beta 6 is [ -3.641, -3.061]; the value range of the coefficient beta 7 is [1.039,1.99]; the value range of the coefficient beta 8 is [ -0.01, -0.007]; the coefficient β0 has a value in the range of [5.060,7.940].
25. The method of claim 1, wherein the model has effective arguments of log10 (AFP), gender, age, hsa-miR-192, hsa-miR-223, log10 (PLT), INR, and TB, and wherein the model has a predictive formula of: logit (P) =β1×log10 (AFP) +β2×Gender+β3×Age+β4×hsa-miR-192+β5×hsa-miR-223+β6×log10 (PLT) +β7×INR+β8×TB+β0, where the coefficient β1 has a value in the range [1.464,1.639]; the value range of the coefficient beta 2 is [0.883,1.119]; the value range of the coefficient beta 3 is [0.035,0.044]; the value range of the coefficient beta 4 is [ -0.448, -0.353]; the value range of the coefficient beta 5 is [0.113,0.199]; the value range of the coefficient beta 6 is [ -3.647, -3.064]; the value range of the coefficient beta 7 is [1.153,2.127]; the value range of the coefficient beta 8 is [ -0.01, -0.007]; the coefficient β0 has a value in the range of [6.888,9.999].
26. The method of construction of claim 1, wherein the model has effective arguments of log10 (AFP), gender, age, hsa-miR-21, hsa-miR-122, hsa-miR-223, log10 (PLT), INR, and TB, and wherein the model has a predictive formula of: logit (P) =β1×log10 (AFP) +β2×Gender+β3×Age+β4×hsa-miR-21+β5×hsa-miR-122+β6×hsa-miR-223+β7×log10 (PLT) +β8×INR+β9×TB+β0, wherein the coefficient β1 has a value in the range of [1.469,1.644]; the value range of the coefficient beta 2 is [0.873,1.111]; the value range of the coefficient beta 3 is [0.033,0.042]; the value range of the coefficient beta 4 is [ -0.03,0.088]; the value range of the coefficient beta 5 is [ -0.366, -0.293]; the value range of the coefficient beta 6 is [0.074,0.163]; the value range of the coefficient beta 7 is [ -3.64, -3.059]; the value range of the coefficient beta 8 is [1.043,1.995]; the value range of the coefficient beta 9 is [ -0.01, -0.007]; the coefficient β0 has a value in the range of [4.317,7.761].
27. The method of claim 1, wherein the model has effective arguments of log10 (AFP), gender, age, hsa-miR-21, hsa-miR-192, hsa-miR-223, log10 (PLT), INR, and TB, and wherein the model has a predictive formula of: logit (P) =β1×log10 (AFP) +β2×Gender+β3×Age+β4×hsa-miR-21+β5×hsa-miR-192+β6×hsa-miR-223+β7×log10 (PLT) +β8×INR+β9×TB+β0, wherein the coefficient β1 has a value in the range of [1.465,1.640]; the value range of the coefficient beta 2 is [0.885,1.121]; the value range of the coefficient beta 3 is [0.036,0.045]; the value range of the coefficient beta 4 is [0.001,0.129]; the value range of the coefficient beta 5 is [ -0.487, -0.375]; the value range of the coefficient beta 6 is [0.096,0.187]; the value range of the coefficient beta 7 is [ -3.638, -3.054]; the value range of the coefficient beta 8 is [1.174,2.151]; the value range of the coefficient beta 9 is [ -0.01, -0.006]; the coefficient β0 has a value in the range of [5.921,9.392].
28. The method of construction of claim 1, wherein the model has effective independent variables of log10 (AFP), gender, age, miRNA7, log10 (PLT), INR, and TB, wherein miRNA7 is a liver cancer diagnostic marker combined from nucleic acid molecules encoding hsa-miR-122, hsa-miR-192, hsa-miR-21, hsa-miR-223, hsa-miR-26a, hsa-miR-27a, hsa-miR-801, and hsa-miR-1228, and wherein the model has a predictive formula of: logit (P) =β1×log10 (AFP) +β2×Gender+β3×Age+β4×miRNA7+β5×log10 (PLT) +β6×INR+β7×TB+β0, wherein the coefficient β1 has a value in the range [1.483,1.657]; the value range of the coefficient beta 2 is [0.882,1.119]; the value range of the coefficient beta 3 is [0.033,0.042]; the value range of the coefficient beta 4 is [0.605,0.756]; the value range of the coefficient beta 5 is [ -3.542, -2.962]; the value range of the coefficient beta 6 is [1.065,2.019]; the value range of the coefficient beta 7 is [ -0.01, -0.006]; the coefficient β0 has a value in the range of [0.823,2.815].
29. The method of construction of claim 1, wherein the model has effective independent variables of log10 (AFP), gender, age, miRNA7, log10 (PLT), and INR, and wherein miRNA7 is a liver cancer diagnostic marker combined from nucleic acid molecules encoding hsa-miR-122, hsa-miR-192, hsa-miR-21, hsa-miR-223, hsa-miR-26a, hsa-miR-27a, hsa-miR-801, and hsa-miR-1228, and wherein the model has a predictive formula of: logit (P) =β1×log10 (AFP) +β2×Gender+β3×Age+β4×miRNA7+β5×log10 (PLT) +β6×INR+β0, wherein the coefficient β1 has a value in the range of [1.478,1.652]; the value range of the coefficient beta 2 is [0.876,1.112]; the value range of the coefficient beta 3 is [0.033,0.042]; the value range of the coefficient beta 4 is [0.582,0.732]; the value range of the coefficient beta 5 is [ -3.618, -3.042]; the value range of the coefficient beta 6 is [0.592,1.493]; the coefficient β0 has a value in the range of [1.458,3.394].
30. The method of construction of claim 1, wherein the model has effective independent variables of log10 (AFP), gender, age, miRNA7, and log10 (PLT), wherein miRNA7 is a liver cancer diagnostic marker composed of nucleic acid molecules encoding hsa-miR-122, hsa-miR-192, hsa-miR-21, hsa-miR-223, hsa-miR-26a, hsa-miR-27a, hsa-miR-801, and hsa-miR-1228, and wherein the model has a predictive formula of: logit (P) =β1×log10 (AFP) +β2×Gender+β3×Age+β4×miRNA7+β5×log10 (PLT) +β0, wherein the coefficient β1 has a value in the range of [1.492,1.665]; the value range of the coefficient beta 2 is [0.89,1.125]; the value range of the coefficient beta 3 is [0.033,0.041]; the value range of the coefficient beta 4 is [0.579,0.728]; the value range of the coefficient beta 5 is [ -3.824, -3.28]; the coefficient β0 has a value in the range of [3.321,4.705].
31. The method of construction of claim 1, wherein the effective independent variables of the model are log10(AFP), Gender, Age, and miRNA7, wherein miRNA7 is a liver cancer diagnostic marker combined from nucleic acid molecules encoding hsa-miR-122, hsa-miR-192, hsa-miR-21, hsa-miR-223, hsa-miR-26a, hsa-miR-27a, hsa-miR-801, and hsa-miR-1228, and wherein the model has a predictive formula of: logit(P) = β1×log10(AFP) + β2×Gender + β3×Age + β4×miRNA7 + β0, wherein the coefficient β1 has a value in the range of [1.548, 1.72]; the coefficient β2 has a value in the range of [0.944, 1.167]; the coefficient β3 has a value in the range of [0.042, 0.051]; the coefficient β4 has a value in the range of [0.755, 0.899]; and the coefficient β0 has a value in the range of [-4.699, -4.138].
32. The method of construction of claim 1, wherein the effective independent variables of the model are log10(AFP), Gender, and miRNA7, wherein miRNA7 is a liver cancer diagnostic marker combined from nucleic acid molecules encoding hsa-miR-122, hsa-miR-192, hsa-miR-21, hsa-miR-223, hsa-miR-26a, hsa-miR-27a, hsa-miR-801, and hsa-miR-1228, and wherein the model has a predictive formula of: logit(P) = β1×log10(AFP) + β2×Gender + β3×miRNA7 + β0, wherein the coefficient β1 has a value in the range of [1.509, 1.676]; the coefficient β2 has a value in the range of [0.931, 1.147]; the coefficient β3 has a value in the range of [0.635, 0.773]; and the coefficient β0 has a value in the range of [-1.915, -1.665].
33. The method of construction of claim 1, wherein the effective independent variables of the model are log10(AFP) and miRNA7, wherein miRNA7 is a liver cancer diagnostic marker combined from nucleic acid molecules encoding hsa-miR-122, hsa-miR-192, hsa-miR-21, hsa-miR-223, hsa-miR-26a, hsa-miR-27a, hsa-miR-801, and hsa-miR-1228, and wherein the model has a predictive formula of: logit(P) = β1×log10(AFP) + β2×miRNA7 + β0, wherein the coefficient β1 has a value in the range of [1.504, 1.669]; the coefficient β2 has a value in the range of [0.74, 0.874]; and the coefficient β0 has a value in the range of [-1.043, -0.876].
34. The method of construction of claim 1, wherein the effective independent variables of the model are log10 (AFP), miRNA7, and log10 (PLT), wherein miRNA7 is a liver cancer diagnostic marker combined from nucleic acid molecules encoding hsa-miR-122, hsa-miR-192, hsa-miR-21, hsa-miR-223, hsa-miR-26a, hsa-miR-27a, hsa-miR-801, and hsa-miR-1228, and wherein the model has a predictive formula: logit (P) =β1×log10 (AFP) +β2×miRNA7+β3×log10 (PLT) +β0, wherein the coefficient β1 has a value in the range of [1.455,1.624]; the value range of the coefficient beta 2 is [0.555,0.695]; the value range of the coefficient beta 3 is [ -4.314, -3.776]; the coefficient β0 has a value in the range of [7.415,8.614].
35. The method of claim 1, wherein the model has effective arguments of log10 (AFP), gender, age, hsa-miR-122, log10 (INR), log10 (PLT), and log10 (TB), and wherein the model has a predictive formula of: logit (P) =β1×log10 (AFP) +β2×Gender+β3×Age+β4×hsa-miR-122+β5×log10 (PLT) +β6×log10 (INR) +β7×log10 (TB) +β0, wherein the coefficient β1 has a value in the range [1.465,1.640]; the value range of the coefficient beta 2 is [0.910,1.149]; the value range of the coefficient beta 3 is [0.033,0.042]; the value range of the coefficient beta 4 is [ -0.333, -0.269]; the value range of the coefficient beta 5 is [ -3.782, -3.204]; the value range of the coefficient beta 6 is [5.071,7.767]; the value range of the coefficient beta 7 is [ -1.262, -0.853]; the coefficient β0 has a value in the range of [10.656,12.794].
36. The method of construction of claim 1, wherein the model has effective independent variables of log10 (AFP), gender, age, miRNA7, log10 (PLT), log10 (INR), and log10 (TB), wherein miRNA7 is a liver cancer diagnostic marker combined from nucleic acid molecules encoding hsa-miR-122, hsa-miR-192, hsa-miR-21, hsa-miR-223, hsa-miR-26a, hsa-miR-27a, hsa-miR-801, and hsa-miR-1228, and wherein the model has a predictive formula of: logit (P) =β1×log10 (AFP) +β2×Gender+β3×Age+β4×miRNA7+β5×log10 (PLT) +β6×log10 (INR) +β7×log10 (TB) +β0, wherein the coefficient β1 has a value in the range [1.477,1.652]; the value range of the coefficient beta 2 is [0.905,1.143]; the value range of the coefficient beta 3 is [0.033,0.042]; the value range of the coefficient beta 4 is [0.609,0.76]; the value range of the coefficient beta 5 is [ -3.538, -2.954]; the value range of the coefficient beta 6 is [4.994,7.692]; the value range of the coefficient beta 7 is [ -1.144, -0.737]; the coefficient β0 has a value in the range of [3.455,5.028].
37. The method of claim 1, wherein the model has effective arguments of log10 (AFP), gender, age, hsa-miR-122, hsa-miR-223, log10 (PLT), log10 (INR), and log10 (TB), and wherein the model has a predictive formula of: logit (P) =β1×log10 (AFP) +β2×Gender+β3×Age+β4×hsa-miR-122+β5×hsa-miR-223+β6×log10 (PLT) +β7×log10 (INR) +β8×log10 (TB) +β0, wherein the coefficient β1 has a value in the range [1.461,1.636]; the value range of the coefficient beta 2 is [0.897,1.136]; the value range of the coefficient beta 3 is [0.032,0.041]; the value range of the coefficient beta 4 is [ -0.357, -0.291]; the value range of the coefficient beta 5 is [0.074,0.157]; the value range of the coefficient beta 6 is [ -3.664, -3.081]; the value range of the coefficient beta 7 is [4.916,7.611]; the value range of the coefficient beta 8 is [ -1.236, -0.828]; the coefficient β0 has a value in the range of [7.993,10.708].
38. The method of claim 1, wherein the model has effective arguments of log10 (AFP), gender, age, hsa-miR-192, log10 (PLT), log10 (INR), and log10 (TB), and wherein the model has a predictive formula of: logit (P) =β1×log10 (AFP) +β2×Gender+β3×Age+β4×hsa-miR-192+β5×log10 (PLT) +β6×log10 (INR) +β7×log10 (TB) +β0, wherein the coefficient β1 has a value in the range [1.465,1.64]; the value range of the coefficient beta 2 is [0.931,1.168]; the value range of the coefficient beta 3 is [0.036,0.045]; the value range of the coefficient beta 4 is [ -0.402, -0.314]; the value range of the coefficient beta 5 is [ -3.811, -3.232]; the value range of the coefficient beta 6 is [5.548,8.281]; the value range of the coefficient beta 7 is [ -1.298, -0.882]; the coefficient β0 has a value in the range of [12.569,15.297].
39. The method of claim 1, wherein the model has effective arguments of log10 (AFP), gender, age, hsa-miR-192, hsa-miR-223, log10 (PLT), log10 (INR), and log10 (TB), and wherein the model has a predictive formula of: logit (P) =β1×log10 (AFP) +β2×Gender+β3×Age+β4×hsa-miR-192+β5×hsa-miR-223+β6×log10 (PLT) +β7×log10 (INR) +β8×log10 (TB) +β0, wherein the coefficient β1 has a value in the range [1.454,1.629]; the value range of the coefficient beta 2 is [0.908,1.146]; the value range of the coefficient beta 3 is [0.035,0.044]; the value range of the coefficient beta 4 is [ -0.463, -0.368]; the value range of the coefficient beta 5 is [0.105,0.192]; the value range of the coefficient beta 6 is [ -3.655, -3.069]; the value range of the coefficient beta 7 is [5.441,8.172]; the value range of the coefficient beta 8 is [ -1.284, -0.869]; the coefficient β0 has a value in the range of [10.174,13.187].
40. The method of claim 1, wherein the model has effective arguments of log10 (AFP), gender, age, hsa-miR-21, hsa-miR-192, hsa-miR-223, log10 (PLT), log10 (INR), and log10 (TB), and wherein the model has a predictive formula of: logit (P) =β1×log10 (AFP) +β2×Gender+β3×Age+β4×hsa-miR-21+β5×hsa-miR-192+β6×hsa-miR-223+β7×log10 (PLT) +β8×log10 (INR) +β9×log10 (TB) +β0, wherein the coefficient β1 has a value in the range [1.455,1.63]; the value range of the coefficient beta 2 is [0.91,1.148]; the value range of the coefficient beta 3 is [0.036,0.045]; the value range of the coefficient beta 4 is [0.007,0.134]; the value range of the coefficient beta 5 is [ -0.505, -0.392]; the value range of the coefficient beta 6 is [0.087,0.178]; the value range of the coefficient beta 7 is [ -3.643, -3.056]; the value range of the coefficient beta 8 is [5.499,8.233]; the value range of the coefficient beta 9 is [ -1.271, -0.856]; the coefficient β0 has a value in the range of [9.145,12.523].
41. The method of construction of claim 1, wherein the model has effective arguments of log10 (AFP), gender, age, hsa-miR-21, hsa-miR-122, hsa-miR-223, log10 (PLT), log10 (INR), and log10 (TB), and wherein the model has a predictive formula of: logit (P) =β1×log10 (AFP) +β2×Gender+β3×Age+β4×hsa-miR-21+β5×hsa-miR-122+β6×hsa-miR-223+β7×log10 (PLT) +β8×log10 (INR) +β9×log10 (TB) +β0, wherein the coefficient β1 has a value in the range [1.462,1.638]; the value range of the coefficient beta 2 is [0.899,1.138]; the value range of the coefficient beta 3 is [0.032,0.042]; the value range of the coefficient beta 4 is [ -0.033,0.085]; the value range of the coefficient beta 5 is [ -0.367, -0.294]; the value range of the coefficient beta 6 is [0.063,0.153]; the value range of the coefficient beta 7 is [ -3.662, -3.079]; the value range of the coefficient beta 8 is [4.92,7.615]; the value range of the coefficient beta 9 is [ -1.229, -0.819]; the coefficient β0 has a value in the range of [7.267,10.596].
42. The method of construction of claim 1, wherein the model has effective arguments of log10 (AFP), gender, age, hsa-miR-21, hsa-miR-27a, hsa-miR-122, hsa-miR-223, log10 (PLT), log10 (INR), log10 (TB), log10 (ALT), log10 (AST), and GGT, and wherein the model has a predictive formula of: logit (P) =β1×log10 (AFP) +β2×Gender+β3×Age+β4×hsa-miR-21+β5×hsa-miR-27a+β6×hsa-miR-122+β7×hsa-miR-223+β8×log10 (PLT) +β9×log10 (INR) +β10×log10 (TB) +β11×log10 (ALT) +β12×log10 (AST) +β13×GGT+β0, wherein the coefficient β1 has a value in the range of [1.461,1.639]; the value range of the coefficient beta 2 is [0.96,1.208]; the value range of the coefficient beta 3 is [0.034,0.043]; the value range of the coefficient beta 4 is [0.047,0.222]; the value range of the coefficient beta 5 is [ -0.263, -0.074]; the value range of the coefficient beta 6 is [ -0.447, -0.361]; the value range of the coefficient beta 7 is [0.119,0.222]; the value range of the coefficient beta 8 is [ -3.495, -2.905]; the value range of the coefficient beta 9 is [4.68,7.448]; the value range of the coefficient beta 10 is [ -0.997, -0.556]; the value range of the coefficient beta 11 is [ -0.329, -0.134]; the value range of the coefficient beta 12 is [0.078,0.325]; the value range of the coefficient beta 13 is [ -0.002, -0.001]; the coefficient β0 has a value in the range of [8.546,12.332].
43. The method of construction of claim 1, wherein the model has effective arguments of log10 (AFP), gender, age, hsa-miR-21, hsa-miR-27a, hsa-miR-192, hsa-miR-223, log10 (PLT), log10 (INR), log10 (TB), log10 (ALT), log10 (AST), and GGT, and wherein the model has a predictive formula of: logit (P) =β1×log10 (AFP) +β2×Gender+β3×Age+β4×hsa-miR-21+β5×hsa-miR-27a+β6×hsa-miR-192+β7×hsa-miR-223+β8×log10 (PLT) +β9×log10 (INR) +β10×log10 (TB) +β11×log10 (ALT) +β12×log10 (AST) +β13×GGT+β0, wherein the coefficient β1 has a value in the range of [1.453,1.63]; the value range of the coefficient beta 2 is [0.946,1.192]; the value range of the coefficient beta 3 is [0.037,0.047]; the value range of the coefficient beta 4 is [0.069,0.249]; the value range of the coefficient beta 5 is [ -0.241, -0.047]; the value range of the coefficient beta 6 is [ -0.538, -0.414]; the value range of the coefficient beta 7 is [0.126,0.233]; the value range of the coefficient beta 8 is [ -3.532, -2.939]; the value range of the coefficient beta 9 is [5.291,8.103]; the value range of the coefficient beta 10 is [ -1.151, -0.708]; the value range of the coefficient beta 11 is [ -0.177,0.01]; the value range of the coefficient beta 12 is [0.043,0.291]; the value range of the coefficient beta 13 is [ -0.002, -0.001]; the coefficient β0 has a value in the range of [9.347,13.233].
44. The method of construction of claim 1, wherein the effective independent variables of the model are log10(AFP), Gender, Age, miRNA7, log10(PLT), log10(INR), log10(TB), log10(ALT), log10(AST), and GGT, wherein miRNA7 is a liver cancer diagnostic marker combined from nucleic acid molecules encoding hsa-miR-122, hsa-miR-192, hsa-miR-21, hsa-miR-223, hsa-miR-26a, hsa-miR-27a, hsa-miR-801, and hsa-miR-1228, and wherein the model has a predictive formula of: logit(P) = β1×log10(AFP) + β2×Gender + β3×Age + β4×miRNA7 + β5×log10(PLT) + β6×log10(INR) + β7×log10(TB) + β8×log10(ALT) + β9×log10(AST) + β10×GGT + β0, wherein the coefficient β1 has a value in the range of [1.472, 1.649]; the coefficient β2 has a value in the range of [0.955, 1.2]; the coefficient β3 has a value in the range of [0.034, 0.043]; the coefficient β4 has a value in the range of [0.666, 0.841]; the coefficient β5 has a value in the range of [-3.384, -2.793]; the coefficient β6 has a value in the range of [4.526, 7.28]; the coefficient β7 has a value in the range of [-0.982, -0.543]; the coefficient β8 has a value in the range of [-0.242, -0.054]; the coefficient β9 has a value in the range of [0.088, 0.328]; the coefficient β10 has a value in the range of [-0.002, -0.001]; and the coefficient β0 has a value in the range of [2.595, 4.325].
45. A system for assessing the likelihood of a subject suffering from hepatocellular carcinoma, comprising:
a data acquisition module for acquiring sample data of a subject, wherein the sample data comprise the plasma expression level of any one of the microRNAs hsa-miR-122, hsa-miR-192, hsa-miR-21, hsa-miR-223, hsa-miR-26a, hsa-miR-27a and hsa-miR-801, together with the Age and Gender of the subject, tumor marker detection results and hematological index detection results, wherein the tumor marker detection results comprise alpha-fetoprotein (AFP), and the hematological index detection results comprise any one of platelet count (PLT), total bilirubin (TB), gamma-glutamyl transferase (GGT), prothrombin time (PT), international normalized ratio (INR), aspartate aminotransferase (AST) and alanine aminotransferase (ALT);
a data preprocessing module for preprocessing the sample data, including removing duplicate samples, unifying units, and dividing the sample data into different analysis groups according to the types and number of indexes;
a model recommendation module for recommending, for each analysis group and according to the types and number of indexes it contains, a plurality of prediction models having the larger AUC values, ranked by AUC value, from among the 41 prediction models established by the construction method according to any one of claims 1 to 44;
a model selection module for allowing a user to select, from the prediction models recommended by the model recommendation module, one or more prediction models to be output for risk assessment and calculation;
a risk assessment module for calculating corresponding predicted values according to the one or more prediction models output by the model selection module, providing a suitable Logit threshold or risk score threshold according to the department or population information corresponding to the sample data, and classifying the degree of risk as any one of high likelihood, medium likelihood and low likelihood according to the Logit threshold or the risk score threshold.
46. A kit for assessing the likelihood of a subject suffering from hepatocellular carcinoma comprising a predictive model constructed using the construction method of any one of claims 1-44.
47. An electronic device, comprising:
a memory for storing instructions executable by the processor;
a processor for executing the instructions to implement the construction method of any one of claims 1-44.
48. A computer readable medium storing computer program code which, when executed by a processor, implements the construction method of any one of claims 1-44.
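As an illustration of how one of the claimed predictive formulas and the three-level risk grading of claim 45 might be evaluated, the following is a minimal Python sketch using the coefficient ranges of claim 11. Several points are assumptions made only for this example and are not stated in the claims: each coefficient is set to the midpoint of its claimed range, Gender is encoded as 1 or 0, the hsa-miR-192 input is whatever normalized expression value the model was trained on, and the two cut-offs passed to `risk_level` are hypothetical placeholders rather than thresholds given in the application.

```python
import math

# Coefficient ranges recited in claim 11 (Gender, Age, log10(AFP), hsa-miR-192).
CLAIM_11_RANGES = {
    "Gender": (1.003, 1.222),
    "Age": (0.044, 0.052),
    "log10_AFP": (1.568, 1.739),
    "hsa_miR_192": (-0.39, -0.308),
    "intercept": (3.146, 5.262),  # beta0
}


def midpoint_coefficients(ranges):
    """Pick the midpoint of each claimed range (an illustrative assumption)."""
    return {name: (low + high) / 2.0 for name, (low, high) in ranges.items()}


def logit_score(features, coefficients):
    """Evaluate logit(P) = sum(beta_i * x_i) + beta0 for one subject."""
    score = coefficients["intercept"]
    for name, value in features.items():
        score += coefficients[name] * value
    return score


def probability(logit):
    """Invert the logit with a numerically stable logistic function."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    exp_logit = math.exp(logit)
    return exp_logit / (1.0 + exp_logit)


def risk_level(logit, low_cut, high_cut):
    """Map a logit to 'low', 'medium' or 'high' using two hypothetical cut-offs."""
    if low_cut > high_cut:
        raise ValueError("low_cut must not exceed high_cut")
    if logit >= high_cut:
        return "high"
    if logit >= low_cut:
        return "medium"
    return "low"


if __name__ == "__main__":
    coefficients = midpoint_coefficients(CLAIM_11_RANGES)
    # Hypothetical subject; the encodings and units are assumptions.
    subject = {
        "Gender": 1,
        "Age": 50,
        "log10_AFP": math.log10(100.0),
        "hsa_miR_192": 10.0,
    }
    score = logit_score(subject, coefficients)
    print(score, probability(score), risk_level(score, low_cut=0.0, high_cut=2.0))
```

Because the claims bound each coefficient rather than fixing it, any value inside the claimed range could be substituted for the midpoint; the same functions apply to the other claimed formulas once their variable names and ranges are entered.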
CN202311678114.7A 2023-12-07 2023-12-07 Construction method of model for evaluating possibility of suffering from hepatocellular carcinoma of subject Pending CN117672521A (en)

Publications (1)

Publication Number Publication Date
CN117672521A true CN117672521A (en) 2024-03-08

Family

ID=90084236

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination