Quantitative analysis method for risk factors of senile lung cancer onset
Technical Field
The invention relates to a quantitative analysis method for risk factors of senile lung cancer, belonging to the technical field of medical biological information processing.
Background
Lung cancer has become the malignant tumor with the fastest-rising morbidity and mortality worldwide. It is particularly prominent among the elderly, severely affects their quality of life, and places heavy economic pressure on nations and families. In recent years, as population aging in China has become increasingly pronounced, active and effective research on lung cancer prevention and control for the elderly has become more urgent. However, the onset of lung cancer is a complex process involving the combined action of multiple risk factors. Identifying the quantitative association between these risk factors and lung cancer onset in a timely manner helps to clarify the pathogenesis of lung cancer in the elderly, supports effective and precise prevention, and provides technical support for the strategic goal of actively coping with population aging.
The invention uses a deep learning method to identify the risk factors for lung cancer in the elderly and to quantitatively analyze the degree to which each risk factor influences lung cancer onset.
Disclosure of Invention
The invention aims to provide a quantitative analysis method for the risk factors of senile lung cancer, addressing the problem that the quantitative association between the onset of senile lung cancer and its various risk factors is unclear.
The core idea of the invention is as follows: integrate subject-related data such as demographic data, smoking habits, disease history, radiation exposure, and behavioral risk data; because the actual number of lung cancer patients is far lower than the number of people without the disease, apply data imbalance processing, and then preprocess and stratify the data; finally, train a separate deep neural network model on each stratum of the elderly population, identify the risk factors of each stratum, and quantitatively analyze the risk factors for lung cancer onset in the elderly.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
the method for quantitatively analyzing the risk factors of the senile lung cancer comprises the following steps:
step 1, acquiring survey data of the elderly, and integrating the survey data with environmental data to form a cross-domain data source M;
step 2, preprocessing the cross-domain data source M obtained in step 1 to obtain preprocessed data, specifically comprising the following substeps:
step 2.1, balancing the data using the Synthetic Minority Over-sampling Technique (SMOTE) to obtain balanced data;
wherein the balancing addresses the data imbalance caused by the low prevalence of lung cancer, i.e., the actual number of people with the disease being far lower than the number without it;
step 2.2, performing missing-value imputation and noise smoothing on the balanced data to obtain smoothed data;
wherein the smoothing addresses the missing values and incomplete records present in the balanced data;
step 2.3, stratifying the smoothed data output by step 2.2 to obtain preprocessed data;
wherein the stratification is specifically: dividing the data by gender, and then dividing each gender group by age into those aged r or older and those younger than r, generating n population groups; the stratified data is called the preprocessed data;
step 3, training a separate deep neural network model on each stratum of the preprocessed data obtained in step 2 to obtain the risk factors of each stratified population, specifically comprising the following substeps:
step 3.1, converting the data format of each of the n population groups, and establishing a training set and a test set for each group;
wherein x% of the data of each group forms the training set and the remaining (100 - x)% forms the test set;
wherein x ranges from 50 to 80;
step 3.2, training on the training sets to generate n trained deep neural network models, specifically: taking the data in each training set as the input of a deep neural network model, computing the weights of the different risk factors through the hidden layers, and obtaining the weight value of each risk factor at the output layer;
wherein the hidden-layer weights constitute the trained deep neural network model;
step 3.3, inputting the test sets into the n trained deep neural network models to identify the risk factors of each stratified population, specifically: taking the data in each test set as the input of the corresponding trained model, whose hidden layers compute the weight values of the different risk factors and whose output layer yields the weight value of each risk factor;
step 4, normalizing, for each stratified population, the weights of its lung cancer onset risk factors to obtain the quantified risk factors of the n stratified populations;
at this point, steps 1 to 4 complete the quantitative analysis method for the risk factors of senile lung cancer.
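The data-handling parts of steps 2.3, 3.1 and 4 can be sketched roughly as follows. This is a minimal illustration, not the implementation of the invention: the record field names ("gender", "age"), the threshold r = 60, the default x = 70, and the choice to normalize by dividing absolute weights by their sum are all assumptions made for the example, since the text does not fix them.

```python
import random
from collections import defaultdict


def stratify(records, r=60):
    """Step 2.3 (sketch): group records by gender, then by age >= r versus age < r.

    The keys "gender" and "age" are hypothetical field names.
    """
    strata = defaultdict(list)
    for rec in records:
        age_band = f">={r}" if rec["age"] >= r else f"<{r}"
        strata[(rec["gender"], age_band)].append(rec)
    return dict(strata)


def split_train_test(rows, x=70, seed=0):
    """Step 3.1 (sketch): shuffle, then use x% for training and (100 - x)% for testing."""
    if not 50 <= x <= 80:
        raise ValueError("x must lie in the range 50 to 80")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * x // 100
    return shuffled[:cut], shuffled[cut:]


def normalize_weights(weights):
    """Step 4 (sketch): rescale weight magnitudes so that they sum to 1.

    Sum-to-one scaling is an assumed normalization, one of several possible.
    """
    total = sum(abs(w) for w in weights.values())
    if total == 0:
        raise ValueError("at least one weight must be non-zero")
    return {name: abs(w) / total for name, w in weights.items()}
```

With this layout, each stratum returned by `stratify` would be split by `split_train_test` and passed to its own model, and the per-factor weights of each model would then go through `normalize_weights` so that they are comparable across strata.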
Advantageous effects
Compared with existing statistical methods for risk factor analysis, such as linear regression and logistic regression, the quantitative analysis method for the risk factors of senile lung cancer has the following beneficial effects:
1. the method adopts a deep neural network, which offers high computational accuracy and speed and is suitable for fast computation on large-scale data;
2. the method quantifies the influence of each risk factor, is highly accurate, and is simple to operate.
Drawings
FIG. 1 is a schematic diagram of a quantitative analysis method of risk factors for the onset of senile lung cancer according to the present invention;
FIG. 2 is a schematic diagram of the construction of the risk factor identification model in the method for quantitatively analyzing risk factors of the onset of senile lung cancer.
Detailed Description
The quantitative analysis method for the risk factors of the senile lung cancer is further described in detail with reference to the accompanying drawings and examples.
Example 1
In this embodiment, the quantitative analysis method for risk factors of senile lung cancer of the present invention is implemented based on a deep neural network and, with reference to FIG. 1, comprises the following steps:
step 1, acquiring survey data of the elderly and integrating it into a multi-domain data source; in this implementation, survey data, meteorological data, and environmental data in 1996 and 2017 are integrated to obtain the cross-domain data source M;
in this embodiment, survey data of 235,000 adults in 1996 and 2017, of whom middle-aged and elderly people account for 35%, are used as part of the model input; the meteorological and environmental data are joined to the survey data by the corresponding survey dates to form the cross-domain data source M, which serves as the input of the identification model for senile lung cancer onset risk factors;
step 2, preprocessing the data source M obtained in step 1, the specific process being as follows:
step 2.1, balancing the data using the Synthetic Minority Over-sampling Technique (SMOTE) to address the data imbalance caused by the low prevalence of lung cancer, i.e., the actual number of people with the disease being far lower than the number without it;
step 2.2, applying missing-value imputation and noise smoothing to address the missing values and incomplete records in the data source;
step 2.3, stratifying the data obtained in step 2.2;
the specific method is: dividing the data by gender, and then dividing each gender group by age into those aged r or older and those younger than r, so as to generate n stratified population groups;
step 3, training a separate deep neural network model on each stratified population obtained in step 2, specifically comprising the following substeps:
step 3.1, converting the data format of each of the n population groups;
step 3.2, training n deep neural network models on the n population groups; the specific method is: the integrated data is used as the input of a deep neural network model, the weights of the different risk factors are computed through the hidden layers, and the weight value of each risk factor is obtained at the output layer of the model; the whole process follows the construction principle of the risk factor identification model shown in FIG. 2;
step 3.3, identifying the risk factors of each stratified population with its deep neural network model;
step 4, normalizing the weights of the lung cancer onset risk factors of each stratified population, thereby quantitatively analyzing the risk factors within each of the n stratified populations.
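The balancing in step 2.1 can be illustrated with a simplified SMOTE-style oversampler. The sketch below is a pure-Python approximation written for illustration only: it assumes numeric, complete feature vectors, picks the base sample at random for each synthetic point, and uses hypothetical parameter names (`n_new`, `k`, `seed`). It is not the configuration used in this embodiment.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating toward near neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbors of the base sample (Euclidean distance).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # A random point on the segment between the base sample and the neighbor.
        gap = rng.random()
        synthetic.append([b + gap * (q - b) for b, q in zip(base, neighbor)])
    return synthetic
```

For example, if a stratum held 900 non-cases and 100 cases, calling `smote(cases, n_new=800)` would bring the two classes to equal size; those counts are purely illustrative.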
For comparison, a multivariate logistic regression model is also trained on each stratum of the preprocessed data obtained in step 2 to analyze the risk factors of senile lung cancer, and its results are compared with the risk factors of the n stratified populations output by step 4, which completes the quantitative analysis of the risk factors of senile lung cancer. Specifically, a deep-neural-network-based identification model for senile lung cancer onset risk factors is trained for each stratified population and validated by ten-fold cross-validation; the performance of the models is shown in Table 1. Table 1 shows that the deep neural network model clearly outperforms the multivariate logistic regression model: model accuracy for people over 60, men over 60, women over 60, and the whole population improves by 9.32%, 7.98%, 8.53%, and 8.86%, respectively. The deep neural network model also trains quickly, which saves time.
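Ten-fold cross-validation of this kind can be sketched as follows. The helper names and the `fit`/`predict` callables are hypothetical placeholders for whichever model (deep neural network or logistic regression) is being evaluated; the sketch only shows how the folds are formed and how accuracy is averaged.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    if n_samples < k:
        raise ValueError("need at least k samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, folds[i]


def cross_validated_accuracy(X, y, fit, predict, k=10, seed=0):
    """Average test accuracy over k folds for a user-supplied fit/predict pair."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k, seed):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = predict(model, [X[i] for i in test_idx])
        correct = sum(p == y[i] for p, i in zip(preds, test_idx))
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)
```

Running `cross_validated_accuracy` once per model and per stratum on the same folds (same `seed`) gives accuracy figures that can be compared in the manner of Table 1.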
TABLE 1 Performance of the senile lung cancer onset risk factor identification models
The degree to which the risk factors influence lung cancer onset in the four stratified populations is shown in Table 2.
TABLE 2 Influence of risk factors on lung cancer onset in the stratified populations
Quantitative analysis of the lung cancer onset risk factors in the different stratified populations yields the following results:
1. Smoking cessation duration and smoking frequency are the major risk factors for lung cancer onset in elderly people over 60 years of age. This is more pronounced in men over 60. Elderly men who quit smoking only recently and who smoke many times a day are more likely to develop lung cancer.
2. Smoking frequency is the most important risk factor for lung cancer onset in men over 60. As shown in Table 2, in the group of men over 60 the weights of smoking frequency and smoking cessation duration are 0.20 and 0.15, respectively, so the weight of smoking frequency is 33.3% higher than that of smoking cessation duration. The top four risk factors in elderly men (smoking frequency, smoking cessation duration, e-cigarette use, and having smoked at least 5 packs of cigarettes) are all related to smoking. These smoking-related risk factors have a greater impact on lung cancer onset in elderly men, so actively quitting smoking is especially beneficial for reducing lung cancer in this group.
3. Smoking cessation duration and having smoked at least 5 packs of cigarettes are the most important risk factors for lung cancer onset in women over 60. As shown in Table 2, the weight of smoking cessation duration in women over 60 is 0.21, which is 16.7% higher than the weight of having smoked at least 5 packs of cigarettes. The top three risk factors in elderly women (smoking cessation duration, having smoked at least 5 packs of cigarettes, and smoking frequency) are all related to smoking. Thus, smoking-related risk factors have a greater impact on lung cancer onset in elderly women than other risk factors.
4. In all four stratified populations, cancer history ranks among the top risk factors for lung cancer. This suggests that cancer history may play a role in lung cancer onset.
5. Radiation exposure has a greater effect on lung cancer onset in women over 60 than in the other groups. In addition, physical activity also has some influence on lung cancer onset in the overall population.
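The percentage comparisons quoted in results 2 and 3 follow from a simple relative-difference calculation, sketched below. The weights 0.20, 0.15 and 0.21 and the figure 16.7% are taken from the text; the function name is illustrative, and the value printed for the second comparison is only what the stated percentage implies, not a figure read from Table 2.

```python
def relative_increase(a, b):
    """Return how much larger a is than b, as a percentage of b."""
    return (a - b) / b * 100


# Men over 60: smoking frequency (0.20) versus smoking cessation duration (0.15).
print(round(relative_increase(0.20, 0.15), 1))  # 33.3

# Women over 60: 0.21 is stated to be 16.7% higher than the second-ranked weight,
# so the implied second-ranked weight is 0.21 / 1.167 (inferred, not from Table 2).
implied = 0.21 / (1 + 16.7 / 100)
print(round(implied, 2))  # 0.18
```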
While the present invention has been described in detail with reference to a specific embodiment, it will be apparent to those skilled in the art that various changes and modifications can be made without departing from the spirit and scope of the invention.