CN115472292A

CN115472292A - Method for constructing lung cancer risk prediction model based on peripheral blood markers

Info

Publication number: CN115472292A
Application number: CN202211114692.3A
Authority: CN
Inventors: 陆松梅; 许林权
Original assignee: Chongqing University Cancer Hospital
Current assignee: Chongqing University Cancer Hospital
Priority date: 2022-09-14
Filing date: 2022-09-14
Publication date: 2022-12-13

Abstract

The invention discloses a method for constructing a lung cancer risk prediction model based on peripheral blood markers. The method comprises the following steps: 1) collecting 27 indexes in total: white blood cell count, absolute neutrophil count, hemoglobin, absolute lymphocyte count and platelet count from the routine peripheral blood test; albumin, prealbumin, globulin and alkaline phosphatase from the liver function test; sodium, chlorine and iron from the electrolyte test; fibrinogen and D-dimer from the coagulation test; the tumor markers CA125, CEA, Cyfra21-1, SCC, NSE and ProGRP; CD3+, CD4+, CD8+, B and NK lymphocyte counts from the immune function test; and age; 2) modeling the peripheral blood test indexes of lung cancer patients and benign lung disease patients (G1) based on machine learning. The model has high sensitivity and specificity; it integrates inflammation indexes, nutrition indexes, coagulation indexes, tumor markers and immune function, and improves on the sensitivity and specificity of single-index prediction models.

Description

Method for constructing lung cancer risk prediction model based on peripheral blood markers

Technical Field

The invention relates to a tumor early screening technology, in particular to a method for constructing a lung cancer risk prediction model based on peripheral blood markers.

Background

Lung cancer patients are mostly middle-aged and elderly people, who often have various underlying diseases such as hypertension, diabetes, chronic obstructive pulmonary disease, coronary heart disease, cerebral infarction and lumbar disc herniation, a series of aging-associated pathological changes that are concentrated in this age group. The control group of the model does not exclude patients with these underlying diseases.

In addition, traditional lung cancer risk screening recommends low-dose helical CT examination for populations with heavy smoking exposure. However, smoking is not a necessary factor in the onset of lung cancer. Under the combined influence of factors such as second-hand smoke and cooking oil fumes, the proportion of lung cancer among non-smoking women in China is far higher than in Western countries. At present, the incidence of adenocarcinoma is rising continuously, and most of these patients are non-smokers, so a large number of non-smoking lung cancer patients are missed by the traditional screening strategy.

Existing lung cancer prediction models based on peripheral blood markers rely on a single type of predictor. The markers used in clinical products for early screening of malignant tumors are mainly concentrated on cfDNA; however, cfDNA is present at low levels in peripheral blood, and the extraction process is easily contaminated by leukocyte DNA, which affects the prediction. In addition, the control groups in these studies are mostly healthy people and do not cover the underlying diseases common in the affected age group. Methods based solely on gene methylation markers may therefore face challenges in practical clinical application in populations with a large number of coexisting benign conditions.

Disclosure of Invention

Aiming at the defects in the prior art, the invention provides a method for constructing a lung cancer risk prediction model based on a peripheral blood marker.

In order to achieve the technical purpose, the technical scheme adopted by the invention is as follows:

A method for constructing a lung cancer risk prediction model based on peripheral blood markers comprises the following steps:

1) Collecting 27 indexes measured at the first visit, before anti-tumor treatment: white blood cell count, absolute neutrophil count, hemoglobin, absolute lymphocyte count and platelet count from the routine peripheral blood test; albumin, prealbumin, globulin and alkaline phosphatase from the liver function test; sodium, chlorine and iron from the electrolyte test; fibrinogen and D-dimer from the coagulation test; the tumor markers CA125, CEA, Cyfra21-1, SCC, NSE and ProGRP; CD3+, CD4+, CD8+, B and NK lymphocyte test values from the immune function test; total lymphocyte count; and age;

2) Modeling peripheral blood test indexes of lung cancer patients and lung benign disease patients based on machine learning

2.1) K Nearest Neighbors

Given a test point X, the algorithm finds the K points (samples) most similar to X according to distance; the class that occurs most often among these K points is the class assigned to X. Assume that the orange points are class 0, the green points are class 1, and the yellow point is X. When K = 3, the reference points are 2 orange points and 1 green point, so X is assigned to class 0; similarly, when K = 5, X is assigned to class 1. Through iterative experiments, the optimal value of K was found to be 6;

2.2) Logistic Regression

Ordinary regression models a linear relationship between the independent variables and a continuous dependent variable; logistic regression handles the case where the dependent variable is discrete. In practice it is implemented with a sigmoid function, as in equation (2): when x tends to negative infinity, the function value tends to 0; when x tends to positive infinity, the function value tends to 1;

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n \tag{1}$$

In equation (1), θ denotes the parameters of the model and x denotes the indexes of each sample (the 27 indexes in step 1). Equation (2) is the sigmoid function applied to this expression; its value lies between 0 and 1, and the closer the value is to 0, the lower the probability of lung cancer; the closer the value is to 1, the greater the probability of lung cancer;
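Equation (2) is not reproduced in this text. Assuming it is the standard logistic sigmoid applied to the linear score of equation (1), which matches the limits described above, it could be written as follows; the symbol g is introduced here only for illustration.

```latex
g(z) = \frac{1}{1 + e^{-z}}, \qquad
g\bigl(h_\theta(x)\bigr)
  = \frac{1}{1 + e^{-(\theta_0 + \theta_1 x_1 + \dots + \theta_n x_n)}}
```

Under this assumption, g tends to 0 as its argument tends to negative infinity and to 1 as it tends to positive infinity, and the output is read as the estimated probability of lung cancer.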

2.3) Random Forest

A decision tree is an algorithm that solves classification problems by layer-by-layer reasoning. During prediction, a test is evaluated at each internal node to decide which branch to follow, until a leaf node is reached and yields the classification result. Building a decision tree generally involves feature selection, tree generation and pruning. A random forest is composed of different decision trees that are independent of one another; to classify a sample, the sample is input into each decision tree, and the final classification is made from the decision results of all the trees;

2.4) SVM (support vector machine)

A support vector machine (SVM) is a binary classification model that maps instances to points in space and computes the line that best separates the two classes of points. SVMs are suitable for small and medium-sized data sets and for nonlinear and high-dimensional classification tasks;

2.5) Naive Bayes (NB)

Naive Bayes is a classification method based on Bayes' theorem and the assumption that the features are conditionally independent. For a given training set, the joint probability distribution of input and output is first learned under the conditional independence assumption; then, for an input x, the posterior probability is computed with Bayes' theorem, as in formula (3), and y is output;

P(c | x) is the conditional probability of belonging to a class (benign, lung cancer) given the input, i.e., the 27 indexes in step (1); P(c) is the prior probability that a sample belongs to a class; P(x | c) is the class-conditional probability of the input; P(x) is obtained from the law of total probability;
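Formula (3) is not reproduced in this text. Assuming it is the usual statement of Bayes' theorem combined with the conditional independence assumption, and using the symbols defined above, it could be written as follows; the index i over the 27 indexes and the summation variable c' are introduced here for illustration.

```latex
P(c \mid x) = \frac{P(x \mid c)\,P(c)}{P(x)}, \qquad
P(x \mid c) = \prod_{i=1}^{27} P(x_i \mid c), \qquad
P(x) = \sum_{c'} P(x \mid c')\,P(c')
```

The predicted class y is then the class c with the largest posterior P(c | x).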

2.6) Neural networks

The input layer of the neural network comprises 27 nodes, corresponding to the 27 peripheral blood indexes; the indexes are age, leukocyte, hemoglobin, neutrophil count, lymphocyte count, sodium, chloride, iron, albumin, prealbumin, globulin, alkaline phosphatase, total T lymphocyte count, CD4 lymphocyte count, CD8 lymphocyte count, NK cell count, B lymphocyte count, total lymphocyte count, fibrinogen, D-dimer, carcinoembryonic antigen, squamous cell carcinoma-associated antigen, neuron-specific enolase, gastrin-releasing peptide precursor, cytokeratin 19 fragment, and CA125. The second layer comprises 10 neuron nodes and extracts shallow features; it is followed by a ReLU activation function, which completes the mapping from the linear space to the nonlinear space. The third layer has 18 neurons for feature extraction and is likewise followed by a ReLU activation. The fourth layer has one neuron, and the last layer is a fully connected layer with 2 neurons, with softmax producing the prediction of whether lung cancer is present;

the number of parameters is 36 (i.e., 18 parameters at the second layer, 10 parameters at the third layer, and 8 parameters at the fourth layer); the learning goal of the model is to determine the specific values of these parameters; during prediction, the model computes the final classification result from the 27 inputs in combination with the 36 parameters;

2 Results

In the model calibration curve, the closer the curve is to the line y = x, the better the model performs.

Compared with the prior art, the invention has the following technical effects:

1. Compared with conventional research models based on big data platforms, the method for constructing the lung cancer risk prediction model based on peripheral blood markers integrates more, and more comprehensive, clinical examination data. Multi-index prediction reflects the actual hematological changes in the body of a tumor patient; the indexes compensate for one another's deficiencies and improve on the sensitivity and specificity of single-factor prediction.

2. The trained model can make predictions for the whole real-world population, including people with hypertension, diabetes, chronic obstructive pulmonary disease and the like, which improves the practical value of the model. The prediction accuracy is also related to the TNM stage of the tumor: the later the stage, the more metastatic sites and the larger the tumor burden, the lower the corresponding inflammation, coagulation, nutrition and immunity indexes, the higher the tumor markers, and the higher the sensitivity of the model.

3. The trained model does not include smoking, family history and the like, so it still predicts well for patients without a smoking history or a family history. The model can distinguish benign from malignant lung nodules, and for lung nodule patients this ability can be further improved by combining the peripheral blood lung cancer prediction model with image recognition. A large comprehensive prediction model combining peripheral blood markers, smoking history, family history and imaging data would certainly improve the accuracy of lung cancer prediction further, but it would also require multidimensional data, which would hinder the popularization of screening. Therefore, a well-performing model obtained only from comprehensive peripheral blood data is an ideal tool for early screening of lung cancer and provides a reliable tool for tumor prevention and treatment.

4. By integrating peripheral blood test indexes commonly used in clinical practice, and by using real-world patients with various benign diseases as the control group for lung cancer patients, the invention achieves lung cancer prediction with high sensitivity and high specificity by means of machine learning. The method is relatively inexpensive, provides information on the patient's physical condition, is suitable for dynamic observation, and is worthy of popularization.

Drawings

FIG. 1 is a schematic view of the KNN algorithm;

FIG. 2 is a diagram of a neural network architecture;

FIG. 3 shows the calibration curves of the lung cancer risk prediction models based on SVM, random forest and neural network, the three algorithms with the best prediction performance;

FIG. 4 shows the lung cancer prediction performance of the 6 algorithms;

FIG. 5 shows that, with benign patients (G1) as the control group, the AUC for predicting the risk of stage I-II lung cancer is 0.68, 95% CI (0.65-0.72).

Detailed Description

The invention is described in further detail below with reference to the figures and the detailed description.

1) For 7060 lung cancer patients diagnosed since 2013 and 2463 benign lung disease patients (pharyngitis, upper respiratory tract infection, community-acquired pneumonia, COPD, tuberculosis, etc.), 27 indexes measured at the first visit were collected: white blood cell count, absolute neutrophil count, hemoglobin, absolute lymphocyte count and platelet count from the routine peripheral blood test; albumin, prealbumin, globulin and alkaline phosphatase from the liver function test; sodium, chlorine and iron from the electrolyte test; fibrinogen and D-dimer from the coagulation test; the tumor markers CA125, CEA, Cyfra21-1, SCC, NSE and ProGRP; CD3+, CD4+, CD8+, B and NK lymphocyte test values; total T lymphocyte count; and age.

2) The peripheral blood test indexes of lung cancer patients and benign lung disease patients are modeled based on machine learning, with the following specific steps:

2.1) K Nearest Neighbors

As shown in FIG. 1, for a given test point X, the distance between X and each sample is first calculated from the 27 indexes in step (1), and the K points (samples) closest to X are found; the class that occurs most often among these K points is the class assigned to X. Assume that the orange points are class 0, the green points are class 1, and the yellow point is X. When K = 3, the 3 points closest to X (inside the orange circle in the figure) are 2 orange points and 1 green point, so X is assigned to class 0; similarly, when K = 5, X is assigned to class 1. Through iterative experiments, the optimal value of K was found to be 6.
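The voting rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the invention: it assumes Euclidean distance, uses made-up 2-D points in place of the 27 indexes, and the function name `knn_predict` is hypothetical. The toy data are arranged so that K = 3 yields class 0 and K = 5 yields class 1, mirroring the example in the text.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training sample (an assumption;
    # the text only says "distance").
    neighbours = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]

# Illustrative toy data only: class 0 ("orange") and class 1 ("green").
points = [(1.0, 0.0), (0.0, 1.2), (1.5, 0.0), (0.0, 1.7), (2.0, 0.0), (3.0, 3.0)]
labels = [0, 1, 0, 1, 1, 1]
query = (0.0, 0.0)  # the "yellow" test point X

pred_k3 = knn_predict(points, labels, query, k=3)  # neighbours: 0, 1, 0
pred_k5 = knn_predict(points, labels, query, k=5)  # neighbours: 0, 1, 0, 1, 1
```

With an even K such as 6, two classes can receive the same number of votes; this sketch does not treat ties specially, and a real implementation would need an explicit tie-breaking rule. In practice the 27 indexes would also need to be put on a comparable scale before distances are computed.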

2.2) Logistic Regression

Ordinary regression models a linear relationship between the independent variables and a continuous dependent variable; logistic regression handles the case where the dependent variable is discrete. In practice it is implemented with a sigmoid function, as in equation (2): when x tends to negative infinity, the function value tends to 0; when x tends to positive infinity, the function value tends to 1;

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n \tag{1}$$

In equation (1), θ denotes the parameters of the model and x denotes the indexes of each sample (the 27 indexes in step 1). Equation (2) is the sigmoid function applied to this expression. Its value lies between 0 and 1: the closer the value is to 0, the lower the probability of lung cancer, and the closer the value is to 1, the greater the probability of lung cancer.
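As a small illustration of how equation (1) and the sigmoid fit together, the sketch below computes the linear score and squashes it into the range 0 to 1. The parameter values and the two-feature input are hypothetical placeholders, not fitted values from the invention, and the function names are chosen here for illustration.

```python
import math

def linear_score(theta, x):
    """Equation (1): theta_0 + theta_1 * x_1 + ... + theta_n * x_n."""
    return theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))

def sigmoid(z):
    """Map a real-valued score to the interval (0, 1) without overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Hypothetical parameters and a two-feature sample (the model itself uses 27).
theta = [-1.0, 0.5, 2.0]
x = [2.0, 0.0]          # score = -1.0 + 0.5 * 2.0 + 2.0 * 0.0 = 0.0

probability = sigmoid(linear_score(theta, x))  # a score of 0 maps to 0.5
```

Large negative scores give values near 0 and large positive scores give values near 1, which is the limiting behaviour described in the text.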

2.3) Random Forest

A decision tree is an algorithm that solves classification problems by layer-by-layer reasoning. During prediction, a test is evaluated at each internal node to decide which branch to follow, until a leaf node is reached and yields the classification result. Building a decision tree generally involves feature selection, tree generation and pruning. A random forest is composed of different decision trees that are independent of one another; to classify a sample, the sample is input into each decision tree, and the final classification is made from the decision results of all the trees.
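The aggregation step can be sketched as follows. This shows only how per-tree decisions are combined, assuming a simple majority vote; it does not train trees (bootstrap sampling and random feature selection are omitted). The three hand-written trees, their feature names and all thresholds are hypothetical and carry no clinical meaning.

```python
from collections import Counter

# Hand-written stand-ins for trained decision trees. Feature names and
# thresholds are invented for illustration only.
def tree_a(sample):
    return 1 if sample["cea"] > 5.0 else 0

def tree_b(sample):
    if sample["age"] > 60:
        return 1 if sample["cyfra21_1"] > 3.3 else 0
    return 0

def tree_c(sample):
    return 1 if sample["d_dimer"] > 0.5 and sample["albumin"] < 40.0 else 0

def forest_predict(trees, sample):
    """Run the sample through every tree and return the majority class."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]

sample = {"age": 66, "cea": 7.2, "cyfra21_1": 2.1, "d_dimer": 0.8, "albumin": 36.0}
trees = [tree_a, tree_b, tree_c]

individual_votes = [tree(sample) for tree in trees]   # tree_b disagrees
prediction = forest_predict(trees, sample)            # majority decides
```

Using an odd number of trees avoids ties in a two-class vote; averaging per-tree class probabilities is a common alternative to hard voting.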

2.4) SVM (support vector machine)

A support vector machine (SVM) is a binary classification model that maps instances to points in space; its main idea is to find the line that best separates the two classes of points. SVMs are suitable for small and medium-sized data sets and for nonlinear and high-dimensional classification tasks.
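The text does not state the optimization problem behind "the line that best separates the two classes". Assuming the standard hard-margin formulation, with labels y_i in {-1, +1} and m training samples (symbols introduced here for illustration), it could be written as:

```latex
\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i\,(w^{\top} x_i + b) \ge 1, \quad i = 1, \dots, m,
\qquad
\hat{y}(x) = \operatorname{sign}\!\left(w^{\top} x + b\right)
```

In this formulation the separating boundary is the one with the largest margin to the nearest points of each class. Whether the invention uses a soft margin or a kernel for the nonlinear case is not specified in the text.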

2.5) Naive Bayes (NB)

Naive Bayes is a classification method based on Bayes' theorem and the assumption that the features are conditionally independent. For a given training set, the joint probability distribution of input and output is first learned under the conditional independence assumption; then, for an input x, Bayes' theorem, as in formula (3), is used to compute the posterior probability and output y;

P(c | x) is the conditional probability of belonging to a class (benign, lung cancer) given the input, i.e., the 27 indexes in step (1). P(c) is the prior probability that a sample belongs to a class. P(x | c) is the class-conditional probability of the input. P(x) is obtained from the law of total probability.
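The quantities P(c), P(x | c), P(x) and P(c | x) can be made concrete with a short sketch. It assumes each feature follows a Gaussian distribution within a class, which is one common choice and is not stated in the text; the two-feature toy data and the function names are hypothetical.

```python
import math
from statistics import fmean, pvariance

def fit_gaussian_nb(samples, labels):
    """Estimate the prior P(c) and a per-feature Gaussian for P(x_i | c)."""
    model = {}
    for c in sorted(set(labels)):
        rows = [x for x, y in zip(samples, labels) if y == c]
        columns = list(zip(*rows))
        model[c] = {
            "prior": len(rows) / len(samples),
            "means": [fmean(col) for col in columns],
            # Small constant guards against zero variance.
            "variances": [pvariance(col) + 1e-9 for col in columns],
        }
    return model

def posterior(model, x):
    """Return P(c | x) for every class, computed in log space for stability."""
    log_joint = {}
    for c, params in model.items():
        total = math.log(params["prior"])
        # Conditional independence: log P(x | c) is a sum over features.
        for xi, mean, var in zip(x, params["means"], params["variances"]):
            total += -0.5 * math.log(2 * math.pi * var) - (xi - mean) ** 2 / (2 * var)
        log_joint[c] = total
    top = max(log_joint.values())
    unnormalised = {c: math.exp(v - top) for c, v in log_joint.items()}
    evidence = sum(unnormalised.values())  # plays the role of P(x)
    return {c: v / evidence for c, v in unnormalised.items()}

# Toy data: class 0 (benign) and class 1 (lung cancer), two features each.
samples = [[1.0, 10.0], [2.0, 11.0], [3.0, 12.0], [7.0, 20.0], [8.0, 21.0], [9.0, 22.0]]
labels = [0, 0, 0, 1, 1, 1]

model = fit_gaussian_nb(samples, labels)
post = posterior(model, [2.5, 11.5])
predicted = max(post, key=post.get)
```

The posteriors over all classes sum to 1, and the predicted class is the one with the largest posterior.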

2.6) Neural networks

As shown in FIG. 2, the input layer of the neural network comprises 27 nodes, corresponding to the 27 peripheral blood indexes; the indexes are age, leukocyte, hemoglobin, neutrophil count, lymphocyte count, sodium, chloride, iron, albumin, prealbumin, globulin, alkaline phosphatase, total T lymphocyte count, CD4 lymphocyte count, CD8 lymphocyte count, NK cell count, B lymphocyte count, total lymphocyte count, fibrinogen, D-dimer, carcinoembryonic antigen, squamous cell carcinoma-associated antigen, neuron-specific enolase, gastrin-releasing peptide precursor, cytokeratin 19 fragment, and CA125. The second layer comprises 10 neuron nodes and extracts shallow features; it is followed by a ReLU activation function, which completes the mapping from the linear space to the nonlinear space. The third layer has 18 neurons for feature extraction and is likewise followed by a ReLU activation. The fourth layer has one neuron, and the last layer is a fully connected layer with 2 neurons, with softmax producing the prediction of whether lung cancer is present.

As shown in FIG. 2, the number of parameters is 36 (i.e., 18 parameters at the second layer, 10 parameters at the third layer, and 8 parameters at the fourth layer). The learning goal of the model is to determine the specific values of these parameters. During prediction, the model computes the final classification result from the 27 inputs in combination with the 36 parameters.
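A forward pass through a network of this shape can be sketched in plain Python. This is an illustration under several assumptions: the layer sizes 27, 10, 18, 1, 2 follow the description literally; the activation after the fourth layer is not stated, so it is left linear here; the weights are random placeholders rather than trained values; and the input is assumed to be already standardized. All function names are hypothetical.

```python
import math
import random

LAYER_SIZES = [27, 10, 18, 1, 2]  # input, second, third, fourth, output layer

def init_layers(sizes, seed=0):
    """Create placeholder (weights, biases) pairs; a real model would learn these."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers

def dense(weights, biases, inputs):
    """Fully connected layer: one weighted sum plus bias per output neuron."""
    return [sum(w * v for w, v in zip(row, inputs)) + b for row, b in zip(weights, biases)]

def relu(values):
    return [max(0.0, v) for v in values]

def softmax(values):
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def forward(layers, features):
    h = relu(dense(*layers[0], features))   # 27 -> 10, ReLU
    h = relu(dense(*layers[1], h))          # 10 -> 18, ReLU
    h = dense(*layers[2], h)                # 18 -> 1, activation not stated in the text
    return softmax(dense(*layers[3], h))    # 1 -> 2, class probabilities

layers = init_layers(LAYER_SIZES)
features = [0.5] * 27                       # stand-in for 27 standardized indexes
probabilities = forward(layers, features)   # [P(benign), P(lung cancer)] by assumption
```

The two outputs sum to 1; which output index corresponds to lung cancer is a labelling convention and is assumed here.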

2 Results

As shown in FIG. 3, the closer the model calibration curve is to the line y = x, the better the model performs.
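A calibration curve of this kind can be computed by binning the predicted probabilities and comparing, in each bin, the mean predicted probability with the observed fraction of positive cases. The sketch below assumes equal-width bins; the toy labels and probabilities are invented so that the result lies exactly on the line y = x.

```python
def calibration_curve(y_true, y_prob, n_bins=5):
    """Return (mean predicted probability, observed positive fraction) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for label, p in zip(y_true, y_prob):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[index].append((p, label))
    curve = []
    for members in bins:
        if members:
            mean_predicted = sum(p for p, _ in members) / len(members)
            fraction_positive = sum(label for _, label in members) / len(members)
            curve.append((mean_predicted, fraction_positive))
    return curve

# Toy example: of the samples scored 0.25, one in four is positive;
# of the samples scored 0.75, three in four are positive.
y_true = [1, 0, 0, 0, 1, 1, 1, 0]
y_prob = [0.25, 0.25, 0.25, 0.25, 0.75, 0.75, 0.75, 0.75]

curve = calibration_curve(y_true, y_prob, n_bins=2)
```

Points above the diagonal indicate that the model underestimates risk in that range, and points below indicate that it overestimates risk.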

The peripheral blood test indexes of lung cancer patients and benign lung disease patients are modeled based on machine learning, and the effectiveness of the model is further verified on an internal data set. The lung cancer risk prediction model based on peripheral blood markers has high sensitivity and specificity; it integrates inflammation indexes, nutrition indexes, coagulation indexes, tumor markers and immune function, and improves on the sensitivity and specificity of single-index prediction models.

FIG. 4 shows the ROC curves obtained on the data set of the invention using the KNN, logistic regression, random forest, support vector machine, naive Bayes and neural network algorithms. All AUC values exceed 0.65, which indicates that the models have an excellent classification effect. The random forest, support vector machine and neural network perform best, with AUC values above 0.81.
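For reference, the AUC summarizes an ROC curve as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it directly from that pairwise definition, counting ties as half; the four-sample data set is a toy example, not data from the invention.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, y_true) if y == 1]
    negatives = [s for s, y in zip(scores, y_true) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(y_true, scores)  # 3 of the 4 positive/negative pairs are ordered correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation. The pairwise loop is quadratic in the number of samples, so a rank-based formula would be preferable for a data set of several thousand patients.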

FIG. 5 shows the test results of the invention by lung cancer stage (I-IV). The test set used for this evaluation contains both benign and lung cancer data. As can be seen in FIG. 5, the AUC is 0.6798 for stage I-II tumors, 0.8659 for stage III tumors, and 0.9439 for stage IV tumors. In addition, the sensitivity of the model reaches 0.67 at a specificity of 95%, and the model has very high prediction capability for patients with stage I-II lung cancer.

The advantages of the method for constructing the lung cancer risk prediction model based on peripheral blood markers are as follows: the benign lung disease patients used as the control group are not screened to exclude underlying diseases such as hypertension, diabetes, coronary heart disease and liver cirrhosis, so the model can make predictions for the whole population, including people with such underlying diseases. This is closer to the actual situation than using a purely healthy population as the control group, improves the practical value of the model, and is low in cost.

Finally, the above embodiments are intended only to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from their spirit and scope, and all such modifications and substitutions are covered by the claims of the present invention.

Claims

1. A method for constructing a lung cancer risk prediction model based on peripheral blood markers, characterized by comprising the following steps:

1) Collecting 27 indexes of a lung cancer patient measured at the first visit, before anti-tumor treatment: white blood cell count, hemoglobin, absolute neutrophil count, absolute lymphocyte count and platelet count from the routine peripheral blood test; albumin, prealbumin, globulin and alkaline phosphatase from the liver function test; sodium, chlorine and iron from the electrolyte test; fibrinogen and D-dimer from the coagulation test; the tumor markers CA125, CEA, Cyfra21-1, SCC, NSE and ProGRP; CD3+, CD4+, CD8+, B and NK lymphocyte test values from the immune function test; total T lymphocyte count; and age;

2) Modeling of peripheral blood test indicators for lung cancer patients and lung benign disease patients based on machine learning

2.1) K Nearest Neighbors

Given a test point X, the distance between X and each sample is calculated from the 27 indexes in step (1), and the K points most similar to X are found; the class that occurs most often among these K points is the class assigned to X. Assume that the orange points are class 0, the green points are class 1, and the yellow point is X. When K = 3, the reference points are 2 orange points and 1 green point, so X is assigned to class 0; similarly, when K = 5, X is assigned to class 1. Through iterative experiments, the optimal value of K was found to be 6;

2.2) Logistic Regression

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n \tag{1}$$

In equation (1), θ denotes the parameters of the model and x denotes the indexes of each sample; equation (2) is the sigmoid function applied to this expression, and its value lies between 0 and 1; the closer the value is to 0, the lower the probability of lung cancer; the closer the value is to 1, the greater the probability of lung cancer;

2.3) Random Forest

A decision tree is an algorithm that solves classification problems by layer-by-layer reasoning. During prediction, a test is evaluated at each internal node to decide which branch to follow, until a leaf node is reached and yields the classification result. Building a decision tree generally involves feature selection, tree generation and pruning. A random forest is composed of different decision trees that are independent of one another; to classify a sample, the sample is input into each decision tree, and the final classification is made from the decision results of all the trees;

2.4) SVM (support vector machine)

2.5) Naive Bayes (NB)

Naive Bayes is a classification method based on Bayes' theorem and the assumption that the features are conditionally independent. For a given training set, the joint probability distribution of input and output is first learned under the conditional independence assumption; then, for an input x, the posterior probability is computed with Bayes' theorem, as in formula (3), and y is output;

P(c | x) is the conditional probability of belonging to a class given the input, i.e., the 27 indexes in step (1); P(c) is the prior probability that a sample belongs to a certain class; P(x | c) is the class-conditional probability of the input; P(x) is obtained from the law of total probability;

2.6) Neural networks

The input layer of the neural network comprises 27 nodes, corresponding to the 27 peripheral blood indexes; the indexes are leukocyte, hemoglobin, absolute neutrophil count, absolute lymphocyte count, platelet count, albumin, prealbumin, globulin, alkaline phosphatase, sodium, chloride, iron, fibrinogen, D-dimer, carcinoembryonic antigen, squamous cell carcinoma-associated antigen, neuron-specific enolase, gastrin-releasing peptide precursor, cytokeratin 19 fragment, CA125, total T lymphocyte count, CD4 lymphocyte count, CD8 lymphocyte count, NK cell count, B lymphocyte count, total lymphocyte count, and age. The second layer comprises 10 neuron nodes and extracts shallow features; it is followed by a ReLU activation function, which completes the mapping from the linear space to the nonlinear space. The third layer has 18 neurons for feature extraction and is likewise followed by a ReLU activation. The fourth layer has one neuron, and the last layer is a fully connected layer with 2 neurons, with softmax producing the prediction of whether lung cancer is present;

2 Results

In the model calibration curve, the closer the curve is to the line y = x, the better the model performs.