CN116047074A - Marker for diagnosing and/or predicting lung cancer, diagnostic model and construction method thereof - Google Patents

Info

Publication number
CN116047074A
Authority
CN
China
Prior art keywords
model
lung cancer
data
marker
prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211389513.7A
Other languages
Chinese (zh)
Inventor
胡文滕
蔡谦谦
王鸣源
马敏杰
孟文勃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
First Hospital of Lanzhou University
Original Assignee
First Hospital of Lanzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by First Hospital of Lanzhou University
Priority to CN202211389513.7A
Publication of CN116047074A
Legal status: Pending

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00 Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/48 Biological material, e.g. blood, urine; Haemocytometers
    • G01N33/50 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
    • G01N33/53 Immunoassay; Biospecific binding assay; Materials therefor
    • G01N33/574 Immunoassay; Biospecific binding assay; Materials therefor for cancer
    • G01N33/57407 Specifically defined cancers
    • G01N33/57423 Specifically defined cancers of lung
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • C12Q1/6886 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00 Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/48 Biological material, e.g. blood, urine; Haemocytometers
    • G01N33/50 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
    • G01N33/53 Immunoassay; Biospecific binding assay; Materials therefor
    • G01N33/574 Immunoassay; Biospecific binding assay; Materials therefor for cancer
    • G01N33/57473 Immunoassay; Biospecific binding assay; Materials therefor for cancer involving carcinoembryonic antigen, i.e. CEA
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00 Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/48 Biological material, e.g. blood, urine; Haemocytometers
    • G01N33/50 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
    • G01N33/53 Immunoassay; Biospecific binding assay; Materials therefor
    • G01N33/574 Immunoassay; Biospecific binding assay; Materials therefor for cancer
    • G01N33/57484 Immunoassay; Biospecific binding assay; Materials therefor for cancer involving compounds serving as markers for tumor, cancer, neoplasia, e.g. cellular determinants, receptors, heat shock/stress proteins, A-protein, oligosaccharides, metabolites
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/0002 Inspection of images, e.g. flaw detection
    • G06T7/0012 Biomedical image inspection
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N2333/00 Assays involving biological materials from specific organisms or of a specific nature
    • G01N2333/435 Assays involving biological materials from specific organisms or of a specific nature from animals; from humans
    • G01N2333/46 Assays involving biological materials from specific organisms or of a specific nature from animals; from humans from vertebrates
    • G01N2333/47 Assays involving proteins of known structure or function as defined in the subgroups
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20076 Probabilistic image processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30004 Biomedical image processing
    • G06T2207/30096 Tumor; Lesion
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention relates to the technical field of artificial intelligence, and in particular to a marker for diagnosing and/or predicting lung cancer, comprising one or more of CEA, ProGRP, CYFRA 21-1, SCC, IBIL, APTT and age. A lung cancer diagnostic model based on random forest is also established. In the model group, the sensitivity, specificity and AUC of the model are 0.723, 0.786 and 0.840, respectively; the calibration curve shows good prediction accuracy, and the clinical decision curve and clinical impact curve indicate that the model's predictions are suitable for clinical use. In the external validation group, the sensitivity, specificity and AUC of the ROC curve are 0.961, 0.750 and 0.906, respectively; the calibration curve again shows good prediction accuracy, and the clinical decision curve and clinical impact curve confirm it.

Description

Marker for diagnosing and/or predicting lung cancer, diagnostic model and construction method thereof
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a marker for diagnosing and/or predicting lung cancer, a diagnostic model and a construction method thereof.
Background
Lung cancer is among the cancers with the highest mortality worldwide, with a 5-year survival rate of only 16.1%. Patients usually have a long disease course without typical symptoms or signs, and early-stage disease mainly presents as ground-glass nodules (GGN) in the lung. Lung cancer is mostly diagnosed by CT examination, but CT does not clearly distinguish tumor status and subtype, especially for small cell lung cancer, and most lung cancers grow invasively, so early diagnosis is extremely difficult.
Early screening is important for discovering lung cancer as early as possible and intervening with treatment. Early screening methods for lung cancer include chest X-ray, low-dose computed tomography (LDCT) and tumor marker assessment. X-ray examination has a low radiation dose and does little harm to the human body, and its results can be read by a doctor immediately, making it the most direct and rapid examination method; however, chest X-ray often misses tumors that are too small or are hidden behind bone or the heart. LDCT has a higher radiation dose (three to four times that of X-ray), but its image resolution (0.1 cm) makes it more sensitive: it can detect lung lesions 0.3 cm in diameter and leaves no blind spots, so a lesion can be found wherever it lies in the lung. At present, the main algorithms identify and predict lung cancer from CT image morphology, and their ability to predict the pathological diagnosis of lung cancer is limited. As knowledge of tumor markers continues to grow, tumor marker screening has become a major approach for lung cancer. Researchers have invested considerable manpower and resources in screening tumor markers. For example, invention patent CN112834748A discloses a biomarker combination, a kit containing it and its application, wherein the biomarker combination comprises PLG, APEX1, PARP1, PGP9.5, TP53 and MAGEA1. Invention patent CN106680511B discloses the use of a serum molecular marker combination as a marker for lung cancer diagnosis and treatment-response monitoring, in which the content of eleven serum proteins (OPN, SAA, CRP, CEA, CYFRA 21-1, MIF, AGP, HGF, E-selectin, GRO and NSE) is measured by Luminex protein chip technology.
Eight of these serum protein molecular markers are OPN, SAA, CRP, CYFRA 21-1, CEA, NSE, AGP and HGF. According to that patent, the eight markers are significantly associated with the occurrence of non-small cell lung cancer (NSCLC) and small cell lung cancer (SCLC), and a three-protein detection combination consisting of OPN, CEA and one further protein (CRP, SAA, CYFRA 21-1 or NSE) has excellent diagnostic potential for NSCLC.
Machine learning (ML) is a newer branch of artificial intelligence within modern information science. By simulating the way the human brain learns, it can learn and summarize conceptual relationships, by induction and inference, from large amounts of data. Its main methods are currently divided in principle into two major categories, supervised learning and unsupervised learning. Machine learning algorithms commonly used in medicine include support vector machines (SVM), random forest (RF) and artificial neural networks (ANN). The random forest algorithm is a statistical technique designed for building classification decision trees; it creates a diverse ensemble of classifiers by introducing randomness into classifier construction and combines them by voting. Machine learning has advantages in processing and analysing big data, in standardizing the computational learning process, and in the discrimination and accuracy of outcome prediction, and it has important applications in thoracic surgery for lung cancer, including diagnosis, staging, surgical planning and prognosis prediction. Artificial intelligence models are a step toward automatic nodule diagnosis because they typically do not require nodule measurements or manual data input. Current radiological prediction models include the Mayo model, the VA model, the Brock University model and the Peking University People's Hospital (PKUPH) model. However, these models focus mainly on the CT manifestations of lung nodules and do not combine routine blood test data and pathology data with a radiological model to predict lung cancer accurately.
To address these technical problems, the invention provides a marker for diagnosing and/or predicting lung cancer. The marker is derived from clinical parameters: 7 optimal parameters, namely CEA, serum ProGRP, CYFRA 21-1, SCC, IBIL, APTT and age, are screened out by a random forest model and used to predict the probability that a test sample is malignant, that is, the patient's lung cancer risk, providing a basis for clinicians' diagnosis and treatment. The nomogram prediction tool is simple to operate, supports quick analysis, and can rapidly predict samples and output results.
Disclosure of Invention
The primary object of the present invention is to provide a marker for diagnosing and/or predicting lung cancer, said marker comprising at least one or more of carcinoembryonic antigen (CEA), pro-gastrin-releasing peptide (ProGRP), cytokeratin 19 soluble fragment (CYFRA 21-1), squamous cell carcinoma antigen (SCC), indirect bilirubin (IBIL), activated partial thromboplastin time (APTT) and age.
Preferably, the marker comprises one or more of carcinoembryonic antigen (CEA), pro-gastrin-releasing peptide (ProGRP), cytokeratin 19 soluble fragment (CYFRA 21-1), squamous cell carcinoma antigen (SCC), indirect bilirubin (IBIL), activated partial thromboplastin time (APTT) and age.
Preferably, the marker comprises all of carcinoembryonic antigen (CEA), pro-gastrin-releasing peptide (ProGRP), cytokeratin 19 soluble fragment (CYFRA 21-1), squamous cell carcinoma antigen (SCC), indirect bilirubin (IBIL), activated partial thromboplastin time (APTT) and age.
A second object of the present invention is to provide the use of the marker for diagnosing and/or predicting lung cancer in the preparation of a reagent product, kit or database for diagnosing and/or predicting lung cancer.
A third object of the present invention is to provide a reagent product or kit comprising a standard of the marker for diagnosing and/or predicting lung cancer.
A fourth object of the invention is to provide a method for constructing a lung cancer diagnosis and/or prediction model based on random forest combined with Logistic regression, comprising the following steps:
(1) Acquiring a sample set: collecting clinical data of lung cancer patients to form a sample set;
(2) Random forest variable screening: identifying potentially relevant routine blood data by a machine learning method, with the selected variables and the patients' baseline characteristics used as candidate parameters for model development;
(3) Multivariable Logistic regression prediction modeling: exploring the correlation between each clinical variable and lung cancer with a Logistic regression model, entering the variables that are significant in univariate analysis into multivariate regression analysis, verifying the discrimination and accuracy of the model, and building a nomogram model with the "ggplot2" package in R.
Preferably, the machine learning method of step (2) is implemented in R (version 4.1.1) and comprises Lasso regression, which handles multicollinearity among the available features, and RF, which performs variable selection based on each variable's effect on outcome prediction. The RF parameters are optimized around their default values in logarithmic steps (500 trees, and random subspaces whose dimension is the rounded square root of the number of features), and the reliability of the model is verified by ten-fold cross-validation and validation on an external test set.
Preferably, the random forest variable screening in step (2) is carried out as follows:
S1, constructing an initial model: establishing an initial classification model with all features, optimizing it, and calculating the importance ranking of all features from the optimized model;
S2, feature selection and second model optimization: following the importance ranking obtained in S1, adding features to the classifier from most to least important, refitting the random forest each time, and obtaining the optimal parameter values of each model by ten-fold cross-validation;
S3, obtaining the optimal model: when adding a feature no longer significantly increases the model's AUC, stopping feature addition, taking the current number of features as the final number, refitting the random forest, optimizing its parameters by cross-validation, selecting the optimal parameters according to the AUC, and establishing the optimal model with those parameters;
S4, model verification: validating the optimal model obtained by cross-validation on the external test set.
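Steps S2 and S3 can be illustrated with a minimal Python sketch. It is not the patent's implementation: the callback `evaluate_auc`, the `min_gain` threshold for a "significant" AUC increase, the rule of stopping at the first non-improving feature, and the toy AUC values are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def forward_select(
    ranked_features: Sequence[str],
    evaluate_auc: Callable[[List[str]], float],
    min_gain: float = 0.005,
) -> Tuple[List[str], float]:
    """Add features in importance order until the AUC stops improving.

    ranked_features: names sorted from most to least important (from S1).
    evaluate_auc: hypothetical callback that refits the model on the given
        features and returns its mean cross-validated AUC (S2).
    min_gain: assumed threshold for a "significant" AUC increase (S3).
    """
    selected: List[str] = []
    best_auc = 0.5  # AUC of a chance-level classifier
    for name in ranked_features:
        auc = evaluate_auc(selected + [name])
        if auc - best_auc < min_gain:
            break  # the new feature no longer helps, so stop adding
        selected.append(name)
        best_auc = auc
    return selected, best_auc


# Toy stand-in for a cross-validated random forest fit (made-up numbers).
toy_auc = {1: 0.70, 2: 0.78, 3: 0.82, 4: 0.821, 5: 0.85}
chosen, final_auc = forward_select(
    ["CEA", "ProGRP", "CYFRA 21-1", "SCC", "age"],
    lambda feats: toy_auc[len(feats)],
)
print(chosen, final_auc)
```

In a real pipeline the callback would wrap a random forest fit with ten-fold cross-validation; the loop itself does not depend on which classifier is used.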
Preferably, the importance ranking of all features in S1 of step (2) is obtained as follows: the initial model determines the relevance of each feature to lung cancer from the regression test results of the decision trees, and the features are ranked by importance in order of relevance.
Preferably, the evaluation criterion for hyperparameter optimization of the initial model in S3 of step (2) is the mean AUC over ten-fold cross-validation, and the optimized parameters include the number of trees (n_optimizers), the number of features considered when searching for the best split (max_features), the maximum tree depth (max_depth) and the class weight (class_weight). The optimal parameters described in S3 are obtained as follows: the relevance of each feature to the outcome, lung cancer, is determined from the regression test results of the decision trees, and the features are ranked in order of relevance.
Preferably, the model described in step (2) is built by the "randomForest" package in R (version 4.1.1).
Preferably, in step (3), potentially relevant routine blood data are identified by machine learning methods, and the selected variables together with the patients' baseline characteristics are used as candidate parameters for model development; variables are selected by stepwise multivariate Logistic regression (backward, P < 0.05). Normally distributed measurement data are expressed as mean ± standard deviation (x̄ ± s) and compared using the independent-samples t-test; non-normally distributed measurement data are expressed as M (P25-P75) and compared using the Mann-Whitney U test; count data are expressed as number of cases and percentage, and compared between groups using the χ² test or Fisher's exact test.
Preferably, the test result of step (3) is considered statistically significant only if P < 0.05.
Preferably, the construction method further comprises validating the model on different histological subtypes to verify its accuracy in identifying different lung cancer subtypes.
The fifth object of the present invention is to provide a construction system of lung cancer diagnosis and/or prediction model based on random forest combined Logistic regression, which is applied to the construction method, comprising:
the data acquisition module is at least used for data acquisition and acquiring a sample data set;
a data processing module for extracting at least valid samples from the sample dataset that can be used to construct an assessment model;
the model construction module is at least used for randomly dividing the incomplete data set of the effective sample into a training set and a verification set, fitting the training set by using a random forest method, and recording optimal model parameters according to the out-of-bag error;
and the threshold calculating module is at least used for calculating a model classification threshold according to the ROC curve by using the verification set.
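One common way to derive a classification threshold from an ROC curve is to maximise the Youden index (sensitivity + specificity - 1). The patent does not state which criterion the threshold calculating module uses, so the sketch below is an assumption; the function names and the toy validation data are hypothetical.

```python
from typing import List, Sequence, Tuple


def roc_points(
    labels: Sequence[int], scores: Sequence[float]
) -> List[Tuple[float, float, float]]:
    """Return (threshold, sensitivity, specificity) for each distinct score.

    A sample is called positive when its score is >= the threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for thr in sorted(set(scores)):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= thr)
        tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < thr)
        points.append((thr, tp / n_pos, tn / n_neg))
    return points


def youden_threshold(
    labels: Sequence[int], scores: Sequence[float]
) -> Tuple[float, float, float]:
    """Pick the ROC point that maximises sensitivity + specificity - 1."""
    return max(roc_points(labels, scores), key=lambda p: p[1] + p[2] - 1)


# Toy validation set: 1 = lung cancer, 0 = benign (made-up values).
val_labels = [0, 0, 0, 1, 1, 0, 1, 1]
val_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9]
thr, sens, spec = youden_threshold(val_labels, val_scores)
print(thr, sens, spec)
```

Other criteria, such as fixing a minimum sensitivity, would select a different point on the same curve.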
A sixth object of the present invention is to provide a lung cancer diagnosis and/or prediction model system based on random forest combined Logistic regression, comprising:
the pre-input module is used for inputting at least data to be diagnosed for the evaluation model;
the lung cancer diagnosis model constructed by the method is at least used for evaluating the data to be evaluated;
and the display module is at least used for displaying the diagnosis result.
A seventh object of the present invention is to provide a computer program product stored on a computer readable medium, comprising a computer readable program which, when executed on an electronic device, provides a user input interface for applying the random-forest-based lung cancer diagnosis and/or prediction model system.
An eighth object of the present invention is to provide a computer-readable storage medium storing instructions that, when executed on a computer, cause the computer to apply the random-forest-based lung cancer diagnosis and/or prediction model system.
The beneficial effects of the invention are as follows: (1) Starting from 87 common clinical parameters, 30 parameters are screened out by a random forest model, and 7 optimal parameters are finally selected in combination with a Logistic regression model: CEA, serum ProGRP, CYFRA 21-1, SCC, IBIL, APTT and age. These are used to predict the probability that a test sample is malignant, that is, the patient's risk of a lung cancer diagnosis, providing a basis for clinicians' diagnosis and treatment. (2) A nomogram prediction tool is adopted, which is simple to operate, supports rapid analysis, and can quickly predict a sample and output the result. (3) The prediction model is built on a self-built database with a large sample size and complete information, so its predictive performance is accurate and reliable: the sensitivity, specificity and AUC of the model are 0.723, 0.786 and 0.840, respectively, and the clinical decision curve and clinical impact curve also show that the RF model predicts well. The model was additionally verified on 48 external validation patients, in whom the sensitivity, specificity and AUC of the ROC curve were 0.961, 0.750 and 0.906, respectively, and the clinical decision curve and clinical impact curve again confirmed the prediction accuracy of the RF model.
Drawings
FIG. 1 Case screening flow chart
FIG. 2 Random forest screening results
FIG. 3 Model construction flow
FIG. 4 Analysis of results for the model group
Note: a: ROC curve of the model group; b: calibration curve of the model group; c: clinical decision curve of the model group; d: clinical impact curve of the model group
FIG. 5 Analysis of results for the validation group
Note: a: ROC curve of the validation group; b: calibration curve of the validation group; c: clinical decision curve of the validation group; d: clinical impact curve of the validation group
FIG. 6 Nomogram for model application
Detailed Description
In order to make the present application solution better understood by those skilled in the art, the following description will clearly and completely describe the technical solution in the embodiments of the present application with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
The terms "first," "second," "third," "fourth" and the like in the description and in the claims of this application and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be capable of operation in sequences other than those illustrated or described herein, for example. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In the present invention, random forest refers to a classifier that trains on and predicts samples using multiple trees. In machine learning, a random forest is a classifier containing multiple decision trees whose output class is the mode of the classes output by the individual trees. Leo Breiman and Adele Cutler developed the random forest algorithm, and "Random Forests" is their trademark. The term derives from the random decision forests proposed in 1995 by Tin Kam Ho of Bell Laboratories. The method combines Breiman's bootstrap aggregating (bagging) idea with Ho's random subspace method to build a set of decision trees.
The advantages of random forest are:
1) For many kinds of data, it can produce a highly accurate classifier;
2) It can handle a large number of input variables;
3) It can evaluate the importance of each variable in determining the category;
4) While the forest is being built, it can internally produce an unbiased estimate of the generalization error;
5) It includes a good method for estimating missing data and maintains accuracy even if a significant portion of the data is missing;
6) It provides an experimental method for detecting variable interactions;
7) For a dataset with unbalanced classes, it can balance the errors;
8) It calculates the proximity between instances, which is very useful for data mining, detecting outliers and visualizing the data;
9) Building on the above, it can be extended to unlabeled data, typically for unsupervised clustering, and can also be used to detect outliers and view the data;
10) The learning process is very fast.
Ten-fold cross-validation (10-fold cross-validation) in the following examples is a common method for testing algorithm accuracy. The dataset is divided into ten parts; in turn, 9 parts are used as training data and 1 part as test data. Each run gives a corresponding accuracy (or error rate), and the average accuracy (or error rate) over the 10 runs is used as the estimate of the algorithm's accuracy. Generally, 10-fold cross-validation is itself repeated several times (e.g., 10 times 10-fold cross-validation) and the results averaged. The dataset is split into 10 parts because extensive experiments with many datasets and different learning techniques have shown that 10 folds is an appropriate choice for obtaining the best error estimate, and there is some theoretical support for this. The question is not settled, however, and 5-fold or 20-fold cross-validation appears to give results comparable to 10-fold.
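The splitting procedure described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the fixed random seed and the `score_fold` callback are assumptions, and the sample count of 861 is taken from Embodiment 1.

```python
import random
from typing import Callable, List, Tuple


def k_fold_indices(
    n_samples: int, k: int = 10, seed: int = 0
) -> List[Tuple[List[int], List[int]]]:
    """Shuffle sample indices and return k (train, test) index pairs."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal parts
    splits = []
    for i, test in enumerate(folds):
        # Training data is everything outside the current test fold.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


def cross_validate(
    n_samples: int,
    score_fold: Callable[[List[int], List[int]], float],
    k: int = 10,
) -> float:
    """Average a per-fold score (e.g. accuracy or AUC) over the k folds.

    score_fold is a hypothetical callback that trains on the first index
    list and returns the score measured on the second.
    """
    splits = k_fold_indices(n_samples, k)
    return sum(score_fold(train, test) for train, test in splits) / k


splits = k_fold_indices(861)
print([len(test) for _, test in splits])
```

Every sample lands in exactly one test fold, so each is used for testing once and for training nine times.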
In the following examples, clinical impact curve analysis shows the clinical predictive effect. The upper curve (number at high risk) represents the number of people the RF model classifies as positive (high risk) at each threshold probability; the lower curve (number at high risk with the outcome) is the number of true positives at each threshold probability. When the threshold probability is greater than 75%, the population the RF model classifies as high risk in the training set closely matches the population that actually has lung cancer, which indicates that the RF model has high clinical efficiency.
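The two curves amount to two counts per threshold, which a short Python sketch makes concrete. The function name and the toy labels and probabilities below are made up for illustration and are not data from the patent.

```python
from typing import List, Sequence, Tuple


def clinical_impact(
    labels: Sequence[int],
    probs: Sequence[float],
    thresholds: Sequence[float],
) -> List[Tuple[float, int, int]]:
    """For each threshold return (threshold, n_high_risk, n_true_positive).

    n_high_risk: samples whose predicted probability is >= the threshold.
    n_true_positive: those high-risk samples that truly have the outcome.
    """
    rows = []
    for thr in thresholds:
        flagged = [y for y, p in zip(labels, probs) if p >= thr]
        rows.append((thr, len(flagged), sum(flagged)))
    return rows


# Toy cohort: 1 = lung cancer, 0 = benign (made-up values).
toy_labels = [1, 1, 0, 1, 0, 0, 1, 0]
toy_probs = [0.95, 0.85, 0.80, 0.70, 0.55, 0.40, 0.30, 0.10]
impact = clinical_impact(toy_labels, toy_probs, [0.25, 0.5, 0.75])
for thr, n_high, n_tp in impact:
    print(thr, n_high, n_tp)
```

The closer the two counts are at a given threshold, the fewer false positives the model produces among the people it flags as high risk.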
Multivariate binary Logistic regression modeling: Logistic regression is a classical classification method in statistical learning. It is a log-linear model and is therefore also called log-odds regression. The main idea of classifying with Logistic regression is to fit a regression formula for the classification boundary from the existing data and to classify with it. Its conditions of application are as follows: the dependent variable is a binary categorical variable or the occurrence rate of an event, coded numerically; the dependent variable follows a binomial distribution, which is the distribution corresponding to a categorical outcome; the independent variables are linearly related to the logit of the probability; and the observations are independent of one another.
Advantages of Logistic regression:
1. It is simple to implement and widely applied to industrial problems;
2. Classification requires very little computation, is very fast, and uses few storage resources;
3. The probability score of each sample is easy to observe;
4. Multicollinearity is not a problem for Logistic regression, since it can be handled with L2 regularization;
5. The computational cost is low, and the method is easy to understand and implement.
The Logistic model is in fact a regression model that is used for classification problems, for estimating the probability that a certain event occurs, and for analyzing which factors influence a given outcome. It differs from the ordinary linear regression model in the following ways:
1) The dependent variable of the Logistic regression model is a categorical variable;
2) The dependent variable and the independent variables of the model are not linearly related;
3) An ordinary linear regression model requires assumptions such as independent and identically distributed errors and homogeneity of variance, whereas a Logistic regression model does not;
4) Logistic regression makes no assumptions about the distribution of the independent variables, which may be continuous, discrete or dummy variables;
5) Since the dependent and independent variables are not linearly related, the parameters (partial regression coefficients) are estimated by maximum likelihood.
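For reference, the standard textbook form of the binary logistic model and of the log-likelihood maximized in point 5) can be written as follows. The symbols $p_i$, $x_{ij}$, $\beta_j$, $n$ and $k$ are notation introduced here for illustration and do not appear in the original text.

```latex
\begin{aligned}
p_i &= P(y_i = 1 \mid x_i)
     = \frac{1}{1 + \exp\!\bigl(-(\beta_0 + \sum_{j=1}^{k} \beta_j x_{ij})\bigr)}, \\
\operatorname{logit}(p_i) &= \ln\frac{p_i}{1 - p_i}
     = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij}, \\
\ell(\beta) &= \sum_{i=1}^{n} \bigl[\, y_i \ln p_i + (1 - y_i)\ln(1 - p_i) \,\bigr], \\
\mathrm{OR}_j &= \exp(\beta_j).
\end{aligned}
```

The maximum likelihood estimate is the $\beta$ that maximizes $\ell(\beta)$. The probability itself is not linear in the predictors, but its logit is, and $\exp(\beta_j)$ is the odds ratio for a one-unit increase in $x_j$.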
Embodiment 1
1. Collecting a sample set: clinical parameters of 861 patients with lung nodules treated at the First Hospital of Lanzhou University from January 2018 to December 2021 were collected to form a sample set; the cases are shown in FIG. 1. There are 87 clinical parameters in total, including: (1) patient baseline data: gender, age, smoking history, drinking history, family history, tumor history, and hypertension;
(2) Pathological examination results: pathological typing, non-tumor, squamous carcinoma, adenocarcinoma, other tumors, benign or malignant nodule, and whether infiltration is present;
(3) Blood group: ABO blood type, Rh blood type;
(4) Coagulation indices: prothrombin time PT (s), prothrombin activity PTA (%), prothrombin time ratio PTR, international normalized ratio INR, fibrinogen content FIB (g/L), activated partial thromboplastin time APTT (s), thrombin time TT (s);
(5) Blood routine data: white blood cells WBC (10^9/L), red blood cells RBC (10^12/L), hemoglobin HGB (g/L), hematocrit HCT (%), mean corpuscular volume MCV (fL), mean corpuscular hemoglobin MCH (pg), mean corpuscular hemoglobin concentration MCHC (g/L), platelets PLT (10^9/L), lymphocyte percentage LYM (%), monocyte percentage MON (%), neutrophil percentage N (%), eosinophil percentage EO (%), basophil percentage BA (%), absolute lymphocyte count LYM (10^9/L), absolute monocyte count MONO (10^9/L), absolute neutrophil count NEUT (10^9/L), absolute eosinophil count EO (10^9/L), absolute basophil count BA (10^9/L), red cell distribution width RDW-CV (%), red cell distribution width RDW-SD (fL), platelet distribution width PDW (fL), mean platelet volume MPV (fL), plateletcrit PCT (%), platelet large cell ratio P-LCR (%), immature cells (blasts) (10^9/L), percentage of immature cells (blasts);
(6) Biochemical data: aspartate aminotransferase AST (U/L), alanine aminotransferase ALT (U/L), AST/ALT ratio, total protein TP (g/L), albumin ALB (g/L), globulin GLO (g/L), albumin/globulin ratio, total bilirubin TBIL (μmol/L), direct bilirubin DBIL (μmol/L), indirect bilirubin IBIL (μmol/L), alkaline phosphatase ALP (U/L), gamma-glutamyl transferase GGT (U/L), cholinesterase CHE (KU/L), alpha-L-fucosidase AFU (U/L), total bile acid TBA (μmol/L), carbon dioxide CO2 (mmol/L), urea Urea (mmol/L), creatinine Crea (μmol/L), urea/creatinine ratio U/C, uric acid UA (μmol/L), potassium K (mmol/L), sodium NA (mmol/L), chlorine CL (mmol/L), calcium CA (mmol/L), inorganic phosphorus P (mmol/L), magnesium MG (mmol/L), anion gap AG (mmol/L), osmolality OSM (mOsm/L), glucose GLU (mmol/L), total cholesterol TC (mmol/L), triglyceride TG (mmol/L), high density lipoprotein HDL (mmol/L), low density lipoprotein LDL (mmol/L), lactate dehydrogenase LDH (U/L), alpha-hydroxybutyrate dehydrogenase α-HBDH (U/L), creatine kinase CK (U/L), creatine kinase isoenzyme CK-MB (U/L), homocysteine HCY (μmol/L), amylase AMY (U/L);
(7) Lung cancer-related tumor markers: carcinoembryonic antigen (CEA (μg/L)), soluble fragment of cytokeratin 19 (CYFRA 21-1 (μg/L)), pro-gastrin-releasing peptide (ProGRP (pg/mL)), squamous cell carcinoma antigen (SCC (μg/L)), neuron-specific enolase (NSE (μg/L)).
Before modeling, the clinical baseline data of the cases were compared to determine the baseline differences between the cases; the results are shown in Table 1.
Table 1. Baseline clinical profile of all patients
Figure BDA0003931400720000081
2. Random forest feature preliminary screening
Potential blood routine data are identified by machine learning methods, and the selected variables together with the patient's baseline characteristic variables are used as candidate parameters for model development. The machine learning methods are implemented in R (version 4.1.1) and comprise Lasso regression, which handles multicollinearity among the available features, and RF, which performs variable selection based on each variable's effect on outcome prediction. The RF parameters are optimized around their default values in logarithmic steps (500 trees and random subspaces whose dimension equals the rounded square root of the number of features), and ten-fold cross-validation and external test set validation are used to verify the reliability of the model.
The method comprises the following specific steps:
S1, constructing an initial model: an initial classification model is established with all features included and is continuously optimized, and the importance ranking of all features is calculated from the optimized model;
S2, feature selection and secondary model optimization: according to the importance ranking obtained in step S1, features are added to the classifier from most to least important and random forest modeling is repeated; the optimal parameter values of each model are obtained by ten-fold cross-validation;
S3, obtaining an optimal model: feature addition ends when adding a feature no longer significantly increases the AUC of the initial model, and the number of features added at that point is the number of features finally selected; subsequent modeling uses these target features, the optimal parameters are selected according to the AUC results, and the optimal model is established with these parameters;
S4, model verification: the optimal model obtained through cross-validation is verified on the external test set.
In S1, the importance ranking of all features is obtained as follows: the initial model determines the correlation between each feature and the outcome index PEC from the regression test results of the decision trees, and the feature importance ranking follows the correlation ranking.
The criterion for optimizing the hyperparameters of the initial model described in S3 is the mean AUC of ten-fold cross-validation. The optimized parameters comprise the number of trees (n_estimators), the number of features considered when searching for the best split (max_features), the maximum depth of a tree (max_depth) and the class weight (class_weight). The optimal parameters described in S3 are obtained as follows: the correlation between the features and the outcome index PEC is determined from the regression test results of the decision trees, and the feature order follows the correlation ranking.
The method also includes out-of-group validation.
The parameters obtained by preliminary screening with the random forest model are shown in FIG. 2. A total of 30 parameters with higher correlation were obtained, and the model is constructed using these parameters.
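The feature-addition loop of steps S1 to S3 can be sketched as below. This is an illustration under stated assumptions rather than the procedure actually run for the patent: `forward_select`, the `cv_auc` callable, the `min_gain` cut-off of 0.005, the example feature order and the toy AUC values are all hypothetical. In practice `cv_auc` would refit the random forest on the given features and return the mean ten-fold cross-validated AUC.

```python
def forward_select(ranked_features, cv_auc, min_gain=0.005):
    """Add features from most to least important until the AUC stops improving.

    ranked_features : feature names sorted by decreasing importance (step S1)
    cv_auc          : hypothetical callable; takes a list of features and returns
                      a mean cross-validated AUC (stands in for step S2)
    min_gain        : assumed smallest AUC increase counted as an improvement
    """
    selected, best_auc = [], 0.0
    for feature in ranked_features:
        auc = cv_auc(selected + [feature])
        if auc - best_auc < min_gain:
            break  # step S3: the added feature no longer raises the AUC noticeably
        selected.append(feature)
        best_auc = auc
    return selected, best_auc


# Toy stand-in for cross-validation: AUC as a function of the number of features.
toy_auc = {1: 0.70, 2: 0.78, 3: 0.81, 4: 0.812, 5: 0.83}
ranking = ["CEA", "CYFRA21_1", "age", "SCC", "IBIL"]  # hypothetical order
selected, auc = forward_select(ranking, lambda feats: toy_auc[len(feats)])
print(selected, auc)
```

The sketch stops at the first feature that fails to improve the AUC; a more tolerant rule (for example, allowing a few non-improving additions before stopping) is an equally plausible reading of S3.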
3. Multivariable logistic regression predictive modeling
The relationship between each clinical variable and the PEP is explored with a Logistic regression analysis model: variables that are significant in univariate analysis are selected for univariate and multivariate regression analysis, the result is verified with an ROC curve, and a nomogram model is established using the 'ggplot2' package in the R language.
Potential blood routine data are identified by machine learning methods, the selected variables and the patient's baseline characteristic variables are used as candidate parameters for model development, and variable selection is determined by stepwise multivariate logistic regression (backward, p < 0.05). Normally distributed measurement data are expressed as mean ± standard deviation and compared using the independent-samples t-test; non-normally distributed measurement data are expressed as M (P25-P75) and compared using the Mann-Whitney U test; count data are expressed as number of cases and percentage, and comparisons between groups use the χ² test or Fisher's exact test.
The test results were considered statistically significant only when P < 0.05.
Preferably, the construction method further comprises an analysis of different histological subtypes to verify the accuracy of the model in identifying the different lung cancer subtypes.
For clinical applications, nomograms are drawn as comprehensive clinical prediction tools. The model building flow is shown in FIG. 3.
After the independent risk factors of lung cancer are determined by multi-factor Logistic regression, all risk factors are incorporated to estimate the patient's overall predicted probability of disease, a prediction model is built, and the ROC curve of the model prediction is drawn.
3.2. Analysis of results
After controlling for confounding factors, the following were high-risk predictors of lung cancer in patients with lung nodules in the integrated model (Table 2): CEA serum level higher than 2.3 μg/L (OR = 1.193; 95% CI, 1.019-1.396; sensitivity 60.6%, specificity 72.1%); serum ProGRP level higher than 40.2 pg/mL (OR = 1.014; 95% CI, 1.001-1.028; sensitivity 66.76%, specificity 64.60%); serum CYFRA 21-1 level higher than 2.5 μg/L (OR = 1.714; 95% CI, 1.356-2.167; sensitivity 56.98%, specificity 73.34%); SCC serum level higher than 0.8 μg/L (OR = 2.336; 95% CI, 1.240-4.402; sensitivity 55.03%, specificity 74.78%); serum IBIL level higher than 16.8 μmol/L (OR = 1.057; 95% CI, 1.009-1.107; sensitivity 27.6%, specificity 94.25%); APTT (activated partial thromboplastin time) shorter than 34 s (OR = 0.916; 95% CI, 0.862-0.974; sensitivity 60.6%, specificity 72.12%); and age > 52 years (OR = 1.045; 95% CI, 1.018-1.072; sensitivity 72.91%, specificity 58.41%).
Table 2. Univariate and multivariate logistic regression risk factors for lung cancer
Figure BDA0003931400720000102
Figure BDA0003931400720000111
Sensitivity, specificity and AUC of the nomogram for predicting lung cancer were 0.723, 0.786 and 0.840, respectively.
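The three figures of merit quoted here follow standard definitions, which the sketch below spells out. The function names and the toy data are hypothetical, and the AUC is computed in its rank form (the probability that a randomly chosen patient with lung cancer receives a higher score than a randomly chosen patient without it), which is one common way of obtaining it and not necessarily the routine used for the patent.

```python
def sensitivity_specificity(probs, labels, cutoff):
    """Sensitivity and specificity when scores >= cutoff are called positive."""
    pairs = list(zip(probs, labels))
    tp = sum(1 for p, y in pairs if p >= cutoff and y == 1)
    fn = sum(1 for p, y in pairs if p < cutoff and y == 1)
    tn = sum(1 for p, y in pairs if p < cutoff and y == 0)
    fp = sum(1 for p, y in pairs if p >= cutoff and y == 0)
    return tp / (tp + fn), tn / (tn + fp)


def auc(probs, labels):
    """Area under the ROC curve via pairwise comparison; ties count as 0.5."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, not taken from the patent.
probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
sens, spec = sensitivity_specificity(probs, labels, 0.5)
auc_value = auc(probs, labels)
print(sens, spec, auc_value)
```

Sensitivity and specificity depend on the chosen cutoff, whereas the AUC summarizes discrimination over all cutoffs.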
The ROC curve of the prediction model is shown in FIG. 4A. Model calibration was assessed with the Hosmer-Lemeshow test, which gave χ² = 5.526 and a P value of 0.700, showing that the model has good predictive performance. The prediction accuracy was further assessed by a calibration curve (FIG. 4B), a clinical decision curve (FIG. 4C) and a clinical impact curve (FIG. 4D).
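The Hosmer-Lemeshow figures can be checked with a short calculation. The sketch below assumes the conventional setup of ten risk groups and therefore 8 degrees of freedom; the text does not state the number of groups, so this is an assumption. `hosmer_lemeshow` and `chi2_sf_even_df` are hypothetical helper names, and the chi-square tail probability uses the closed form that holds for even degrees of freedom only.

```python
import math


def hosmer_lemeshow(observed, expected, sizes):
    """HL statistic from per-group observed events, expected events and group sizes."""
    stat = 0.0
    for o, e, n in zip(observed, expected, sizes):
        # Combines the event and non-event terms of the usual formula.
        stat += (o - e) ** 2 / (e * (1.0 - e / n))
    return stat


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable; valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i  # builds (x/2)^i / i!
        total += term
    return math.exp(-half) * total


# Reported statistic with the assumed 8 degrees of freedom (ten groups).
p_value = chi2_sf_even_df(5.526, 8)
print(round(p_value, 3))

# Toy two-group example of the statistic itself (hypothetical numbers).
toy_stat = hosmer_lemeshow([3, 7], [2.0, 8.0], [10, 10])
```

Under that assumption the reported χ² of 5.526 corresponds to a P value of about 0.700, in agreement with the value stated above. A large P value here means no evidence of miscalibration.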
5. Out-of-group verification of models
Another cohort of 48 patients treated at the First Hospital of Lanzhou University from January 2022 to May 2022, enrolled under the same inclusion and exclusion criteria, was used for independent external validation of the nomogram. ROC curve, DCA (decision curve analysis) and CIC (clinical impact curve) analyses were performed; the sensitivity, specificity and AUC of the ROC curve were 0.961, 0.750 and 0.906, respectively. The ROC, calibration, DCA and CIC curves indicate accuracy within the fitting range. The external cohort shows that our integrated model suits the clinical setting; the accuracy of the model was verified by estimating the differences between the nomogram results derived from the modeling cohort and the validation cohort.
The ROC curve of the out-of-group validation of the RF model is shown in FIG. 5A; the probability of onset was predicted by the RF model decision trees, with factors including CEA, serum ProGRP, CYFRA 21-1, SCC, IBIL, APTT and age. The sensitivity, specificity and AUC of the ROC curve were 0.961, 0.750 and 0.906, respectively. The out-of-group validation calibration plot (FIG. 5B) also shows that the prediction accuracy of the model is good. The out-of-group validation decision curve, which represents the predictive performance of the model, is shown in FIG. 5C. The curve shows that when the threshold probability of malignant lung cancer in the out-of-group validation is between 0 and 1.0, lung cancer can be clearly distinguished when clinical decisions are made with this prediction model, so the model may offer good predictive benefit. Clinical impact curve analysis shows the clinical predictive effect of the model in the out-of-group validation: when the threshold probability is greater than 80% of the predicted score probability value, the population the model classifies as high risk in the out-of-group validation set closely matches the actual lung cancer population, which demonstrates that the prediction model has very high clinical predictive efficiency.
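A decision curve plots net benefit against the threshold probability. The sketch below uses the usual definition, net benefit = TP/n - (FP/n) × pt/(1 - pt), together with the "treat all" reference line; the function names and toy data are hypothetical, and the patent does not state which software produced FIG. 5C.

```python
def net_benefit(probs, labels, threshold):
    """Net benefit of acting on everyone whose predicted risk is >= threshold."""
    n = len(labels)
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - fp / n * threshold / (1.0 - threshold)


def net_benefit_treat_all(labels, threshold):
    """Reference strategy: treat every patient regardless of the model."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Hypothetical toy data, not taken from the patent.
probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
nb_model = net_benefit(probs, labels, 0.5)
nb_all = net_benefit_treat_all(labels, 0.5)
print(nb_model, nb_all)
```

A model is considered useful at a given threshold when its net benefit exceeds both the "treat all" line and zero (the "treat none" line).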
6. Construction of the prediction model nomogram
To facilitate the application of our model, we have created an open nomogram prediction tool for lung cancer identification. The user can predict whether a lung nodule is benign or malignant from the 7 parameters in the chart. Each factor has a prediction reference value based on its OR, which shows the weight of each parameter, and the total score distinguishes healthy individuals from lung cancer patients, as shown in FIG. 6. For example, for patient A with the examination results CEA: 45 μg/L, serum ProGRP: 40 pg/mL, CYFRA 21-1: 20 μg/L, SCC: 8 μg/L, IBIL: 50 μmol/L, APTT: 30 s and age 50 years, the predicted total points are 75 and the probability of lung cancer is 55%.
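The way a nomogram turns regression coefficients into points can be sketched as follows. This is a generic illustration of the usual scaling (the predictor with the widest effect spans 0 to 100 points), not a reconstruction of FIG. 6: the coefficients, value ranges and function names below are hypothetical, so the output will not reproduce the 75 points or the 55% probability of patient A.

```python
import math


def nomogram_points(coefs, ranges, values):
    """Scale each predictor's contribution so the widest one spans 0-100 points."""
    spans = {k: abs(coefs[k]) * (ranges[k][1] - ranges[k][0]) for k in coefs}
    max_span = max(spans.values())
    points = {}
    for k in coefs:
        lo, hi = ranges[k]
        # Measure from the end of the range that carries the lowest risk.
        start = lo if coefs[k] >= 0 else hi
        points[k] = 100.0 * coefs[k] * (values[k] - start) / max_span
    return points


def probability(intercept, coefs, values):
    """Logistic probability from the linear predictor."""
    lp = intercept + sum(coefs[k] * values[k] for k in coefs)
    return 1.0 / (1.0 + math.exp(-lp))


# Hypothetical coefficients (log odds ratios per unit) and axis ranges.
coefs = {"age": 0.04, "APTT": -0.09}
ranges = {"age": (20, 90), "APTT": (20, 50)}
patient = {"age": 55, "APTT": 30}
points = nomogram_points(coefs, ranges, patient)
print(points, sum(points.values()))
```

A protective factor such as a longer APTT (OR below 1) has a negative coefficient, so its point scale runs in the opposite direction: lower values earn more points.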
7. Subgroup analysis application
Following model construction and evaluation, subgroups were analyzed according to histological subtype. For pathological diagnostic types of patients, including squamous cell carcinoma (SQCC), adenocarcinoma (AD), other tumors, such as non-small cell lung carcinoma (NSCLC) and neuroendocrine tumor patients, the predictions were evaluated using a comprehensive nomogram, and the differences between the models for the different subgroups were compared. The predicted performance was assessed by AUC, decision curve and clinical impact curve.
Nomograms for the different histological subtypes of lung cancer were analyzed. For SQCC patients, the integrated nomogram model (AUC 0.827; 95% confidence interval 0.794-0.857) showed better predictive performance than the AI model (AUC 0.668; 95% confidence interval 0.628-0.707). Accuracy was checked by DCA and CIC curve analysis, which indicates that the integrated model predicts more accurately. For AD patients, the predictive performance of the integrated nomogram model was slightly better, with an AUC of 0.799 (95% confidence interval 0.764-0.831). Accuracy was again checked by DCA and CIC curve analysis, which indicates that the integrated model predicts more accurately. For patients with other types of lung tumor, our integrated nomogram model showed poor predictive performance, with an AUC of 0.728 (95% confidence interval 0.690-0.764), and the accuracy check with DCA and CIC curves likewise showed no significant predictive difference for the integrated model.
In summary, CEA, serum ProGRP, CYFRA 21-1, SCC, IBIL, APTT and age were identified and validated as potential independent factors for lung cancer using RF combined with Logistic regression; the sensitivity, specificity and AUC were 0.723, 0.786 and 0.840, respectively, and the P value of the Hosmer-Lemeshow test was 0.700. Analysis of these parameters and of the DCA, CIC and calibration curves shows that our integrated model has excellent predictive power for lung cancer detection compared with conventional models.
Subgroup analysis of the different histological subtypes of lung cancer showed that the integrated nomogram predicts both SQCC and AD more accurately. External validation also shows that the integrated model has better predictive value for lung cancer. A convenient and accurate nomogram prediction tool has been established and can be used in clinical applications.

Claims (10)

1. A marker for diagnosing and/or prognosticating lung cancer, comprising at least one or more of carcinoembryonic antigen (CEA), gastrin releasing peptide (ProGRP), soluble fragment of cytokeratin 19 (CYFRA 21-1), squamous cell carcinoma antigen (SCC), indirect bilirubin (IBIL), activated partial thromboplastin time (APTT), and age.
2. The marker for diagnosing and/or prognosticating lung cancer of claim 1, wherein said marker comprises one or more of carcinoembryonic antigen (CEA), gastrin releasing peptide (ProGRP), soluble fragment of cytokeratin 19 (CYFRA 21-1), squamous cell carcinoma antigen (SCC), indirect bilirubin (IBIL), activated partial thromboplastin time (APTT), and age.
3. The marker for diagnosing and/or prognosticating lung cancer of claim 2, wherein said marker comprises carcinoembryonic antigen (CEA), gastrin releasing peptide (ProGRP), soluble fragment of cytokeratin 19 (CYFRA 21-1), squamous cell carcinoma antigen (SCC), indirect bilirubin (IBIL), activated partial thromboplastin time (APTT), and age.
4. Use of a marker for diagnosing and/or prognosticating lung cancer according to any of claims 1-3 for the manufacture of a reagent product, kit or database for diagnosing and/or prognosticating lung cancer.
5. A reagent product or kit comprising a standard for diagnosing and/or prognosticating a marker of lung cancer according to any one of claims 1-3.
6. The construction method of the lung cancer diagnosis and/or prediction model based on random forest combined Logistic regression is characterized by comprising the following steps:
(1) Acquiring a sample set: collecting clinical data of a lung cancer patient to form a sample set;
(2) Random forest variable screening: identifying potential blood routine data by a machine learning method, the selected variables and baseline characteristic variables of the patient being used as candidate parameters for model development;
(3) Multivariable logistic regression prediction modeling: the correlation between each clinical variable and lung cancer is explored with a Logistic regression analysis model, variables that are significant in univariate analysis are selected for univariate and multivariate regression analysis, model discrimination and accuracy are verified, and a nomogram model is established using the 'ggplot2' package in the R language.
7. A construction system of lung cancer diagnosis and/or prediction model based on random forest combined Logistic regression, which is applied to the construction method of claim 6, comprising:
the data acquisition module is at least used for data acquisition and acquiring a sample data set;
a data processing module for extracting at least valid samples from the sample dataset that can be used to construct an assessment model;
the model construction module is at least used for randomly dividing the incomplete data set of the effective sample into a training set and a verification set, fitting the training set by using a random forest method, and recording optimal model parameters according to the out-of-bag error;
and the threshold calculating module is at least used for calculating a model classification threshold according to the ROC curve by using the verification set.
8. A lung cancer diagnosis and/or prediction model diagnosis system based on random forest combined Logistic regression, comprising:
the pre-input module is used for inputting at least data to be diagnosed for the evaluation model;
a lung cancer diagnostic model constructed by the method of claim 5, at least for evaluating the data to be evaluated;
and the display module is at least used for displaying the diagnosis result.
9. A computer program product stored on a computer readable medium, comprising a computer readable program for, when executed on an electronic device, providing a user input interface to apply the random forest combined Logistic regression-based lung cancer diagnosis and/or prediction model diagnosis system of claim 8.
10. A computer readable storage medium storing instructions that when executed on a computer cause the computer to apply the random forest combined Logistic regression-based lung cancer diagnosis and/or prediction model diagnosis system of claim 8.
CN202211389513.7A 2022-11-08 2022-11-08 Marker for diagnosing and/or predicting lung cancer, diagnostic model and construction method thereof Pending CN116047074A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211389513.7A CN116047074A (en) 2022-11-08 2022-11-08 Marker for diagnosing and/or predicting lung cancer, diagnostic model and construction method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211389513.7A CN116047074A (en) 2022-11-08 2022-11-08 Marker for diagnosing and/or predicting lung cancer, diagnostic model and construction method thereof

Publications (1)

Publication Number Publication Date
CN116047074A true CN116047074A (en) 2023-05-02

Family

ID=86130122

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211389513.7A Pending CN116047074A (en) 2022-11-08 2022-11-08 Marker for diagnosing and/or predicting lung cancer, diagnostic model and construction method thereof

Country Status (1)

Country Link
CN (1) CN116047074A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117275578A (en) * 2023-11-16 2023-12-22 北京大学人民医院 Method for constructing lung cancer lymph node metastasis multi-mode prediction model by combining gene mutation characteristics and mIF image characteristics
CN117275578B (en) * 2023-11-16 2024-02-27 北京大学人民医院 Method for constructing multi-mode prediction model of lung cancer lymph node metastasis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination