CN113936795A - Quantitative analysis method for risk factors of senile lung cancer onset - Google Patents

Publication number
CN113936795A
Authority
CN
China
Prior art keywords
data
risk factors
lung cancer
steps
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111303117.3A
Other languages
Chinese (zh)
Inventor
陈松景
吴思竹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Medical Information CAMS
Original Assignee
Institute of Medical Information CAMS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Medical Information CAMS
Priority to CN202111303117.3A
Publication of CN113936795A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/20 Design optimisation, verification or simulation
    • G06F30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2119/00 Details relating to the type or aim of the analysis or the optimisation
    • G06F2119/02 Reliability analysis or reliability optimisation; Failure analysis, e.g. worst case scenario performance, failure mode and effects analysis [FMEA]

Landscapes

  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Biomedical Technology (AREA)
  • Primary Health Care (AREA)
  • Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention relates to a quantitative analysis method for risk factors of lung cancer onset in the elderly, belonging to the technical field of medical and biological information processing. The method integrates subject-related data such as demographic data, smoking habits, disease history, radiation exposure, and behavioral risk data. Because people with lung cancer are far fewer than people without it, the data are first rebalanced and then preprocessed and stratified. A deep neural network model is trained separately on each stratum of the elderly population to identify that stratum's risk factors, and the risk factors of lung cancer onset in the elderly are then analyzed quantitatively. The method offers quantitative analysis, high accuracy, fast computation suitable for large-scale data, and simple operation.

Description

Quantitative analysis method for risk factors of senile lung cancer onset
Technical Field
The invention relates to a quantitative analysis method for risk factors of lung cancer onset in the elderly, belonging to the technical field of medical and biological information processing.
Background
Lung cancer has become the malignant tumor with the fastest-growing morbidity and mortality worldwide. It is particularly prominent among the elderly, greatly affects their quality of life, and places a heavy economic burden on countries and families. In recent years, as population aging in China has become increasingly pronounced, active and effective research on lung cancer prevention and control for the elderly has become more urgent. The onset of lung cancer, however, is a complex process involving the combined action of many risk factors. Identifying the quantitative association between these risk factors and lung cancer onset in a timely manner helps clarify the pathogenesis of lung cancer in the elderly, supports precise prevention, and provides technical support for the strategic goal of actively responding to population aging.
The invention uses a deep learning method to identify the risk factors of lung cancer in the elderly and to quantify how strongly each risk factor influences lung cancer onset.
Disclosure of Invention
The invention aims to provide a quantitative analysis method for the risk factors of lung cancer onset in the elderly, addressing the problem that the quantitative association between lung cancer onset in the elderly and its various risk factors is unclear.
The core idea of the invention is as follows: integrate subject-related data such as demographic data, smoking habits, disease history, radiation exposure, and behavioral risk data; because people with lung cancer are far fewer than people without it, rebalance the data, then preprocess and stratify it; finally, train a deep neural network model separately on each stratum of the elderly population, identify the risk factors of each stratum, and quantitatively analyze the risk factors of lung cancer onset in the elderly.
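The rebalancing idea can be illustrated with a minimal sketch of synthetic minority over-sampling. This is not the implementation used by the invention: the function name `smote`, the neighbour count `k`, and the toy data are all assumptions made for illustration, and only the Python standard library is used.

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    randomly chosen sample and one of its k nearest minority neighbours.
    Requires at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position on the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Hypothetical toy data: 4 cases versus 12 non-cases, two features each.
cases = [[1.0, 2.0], [1.2, 1.9], [0.9, 2.2], [1.1, 2.1]]
non_cases = [[float(i), float(i % 3)] for i in range(12)]
balanced_cases = cases + smote(cases, len(non_cases) - len(cases))
print(len(balanced_cases), len(non_cases))
```

Each synthetic sample lies on a line segment between two real minority samples, so the minority class grows without simply duplicating records.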
To achieve this purpose, the invention adopts the following technical scheme:
the method for quantitatively analyzing the risk factors of lung cancer onset in the elderly comprises the following steps:
step 1, acquiring survey data of the elderly and integrating the survey data with environmental data to form a cross-domain data source M;
step 2, preprocessing the cross-domain data source M obtained in step 1 to obtain preprocessed data, specifically comprising the following substeps:
step 2.1, balancing the data with the Synthetic Minority Over-sampling Technique (SMOTE) to obtain balanced data;
wherein the balancing addresses the data imbalance caused by the low prevalence of lung cancer, since the number of people with the disease is far lower than the number without it;
step 2.2, performing missing-value imputation and noise smoothing on the balanced data to obtain smoothed data;
wherein, compared with the balanced data, the smoothed data no longer has missing values or incomplete records;
step 2.3, stratifying the smoothed data output by step 2.2 to obtain preprocessed data;
wherein the stratification is specifically: dividing the data by gender, and then by age (greater than or equal to r, and less than r), to generate n population groups; the stratified data is called the preprocessed data;
step 3, training a deep neural network model on each stratum of the preprocessed data obtained in step 2 to obtain the risk factors of each stratified population, specifically comprising the following substeps:
step 3.1, converting the data format of each of the n population strata and establishing a training set and a test set for each;
wherein x% of the data of each of the n population strata forms the training set and the remaining (100 - x)% forms the test set;
wherein x ranges from 50 to 80;
step 3.2, training on the training sets to generate n trained deep neural network models, specifically: taking the data in a training set as the input of the deep neural network model, calculating the weights of the different risk factors through the hidden layer, and obtaining the weight value of each risk factor at the output layer;
wherein the hidden-layer weights constitute the trained deep neural network model;
step 3.3, inputting the test sets into the n trained deep neural network models to identify the risk factors of each stratified population, specifically: the data in a test set are used as the input of the corresponding trained model, the hidden layer of the model calculates the weight values of the different risk factors, and the output layer yields the weight value of each risk factor;
step 4, normalizing, for each stratified population, the weights of its lung cancer onset risk factors to obtain the quantitative risk factors of the n stratified populations;
steps 1 to 4 complete the quantitative analysis method for the risk factors of lung cancer onset in the elderly.
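Steps 2.3 and 3.1 can be sketched as follows. The source leaves r, n, and x open, so the values `r=60` (suggested by the groups reported in the example) and `x=70` (inside the stated 50 to 80 range) are assumptions, as are the record layout and the function names `stratify` and `split`.

```python
import random

def stratify(records, r=60):
    """Group records by gender, then by age >= r versus age < r."""
    strata = {}
    for rec in records:
        key = (rec["gender"], "age>=r" if rec["age"] >= r else "age<r")
        strata.setdefault(key, []).append(rec)
    return strata

def split(stratum, x=70, seed=0):
    """Shuffle one stratum; return (training set, test set) with x% for training."""
    rows = list(stratum)
    random.Random(seed).shuffle(rows)
    cut = len(rows) * x // 100
    return rows[:cut], rows[cut:]

# Hypothetical records: 20 men and 20 women aged 50 to 69.
records = [{"gender": g, "age": a} for g in ("male", "female") for a in range(50, 70)]
strata = stratify(records, r=60)
for key, rows in sorted(strata.items()):
    train, test = split(rows, x=70)
    print(key, len(train), len(test))
```

Splitting inside each stratum keeps every model's test set drawn from the same population it was trained on.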
Advantageous effects
Compared with existing statistical risk-factor analysis methods such as linear regression and logistic regression, the quantitative analysis method for the risk factors of lung cancer onset in the elderly has the following beneficial effects:
1. the method uses a deep neural network, which provides high calculation accuracy and high calculation speed and can be applied to high-speed computation on large-scale data;
2. the method provides quantitative analysis, high accuracy, and simple operation.
Drawings
FIG. 1 is a schematic diagram of the quantitative analysis method for the risk factors of lung cancer onset in the elderly according to the invention;
FIG. 2 is a schematic diagram of the construction of the risk factor identification model used in the method.
Detailed Description
The quantitative analysis method for the risk factors of the senile lung cancer is further described in detail with reference to the accompanying drawings and examples.
Example 1
In this embodiment, the quantitative analysis method for the risk factors of lung cancer onset in the elderly is implemented with a deep neural network and, with reference to FIG. 1, comprises the following steps:
step 1, acquiring survey data of the elderly and integrating it into a multi-domain data source; in this implementation, survey data, meteorological data, and environmental data in 1996 and 2017 are integrated to obtain the cross-domain data source M;
specifically, survey data of 235,000 adults in 1996 and 2017, of whom middle-aged and elderly people account for 35%, are used as part of the model input; the meteorological data and environmental data are joined to the survey data by the corresponding survey dates to form the cross-domain data source M, which is the input of the identification model for lung cancer onset risk factors in the elderly;
step 2, preprocessing the data source M obtained in step 1, as follows:
step 2.1, balancing the data with the Synthetic Minority Over-sampling Technique (SMOTE) to address the data imbalance caused by the low prevalence of lung cancer, since the number of people with the disease is far lower than the number without it;
step 2.2, applying missing-value imputation and noise smoothing to address missing values and incomplete records in the data source;
step 2.3, stratifying the data obtained from step 2.2;
specifically: dividing the data by gender and, within each gender, by age (greater than or equal to r, and less than r) to generate n stratified population groups;
step 3, training a deep neural network model on each stratified population obtained in step 2, specifically:
step 3.1, converting the data format of each of the n population groups;
step 3.2, training n deep neural network models on the n population groups; specifically: the integrated data is used as the input of a deep neural network model, the weights of the different risk factors are calculated through the hidden layer, and the weight value of each risk factor is obtained at the output layer of the model; the whole process is shown in FIG. 2, the construction principle of the risk factor identification model;
step 3.3, identifying the risk factors of each stratified population with its deep neural network model;
step 4, normalizing the weights of the lung cancer onset risk factors of each stratified population, thereby quantitatively analyzing the risk factors within each of the n stratified populations.
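The normalization in step 4 can be sketched as below. The source does not say how a per-factor weight is read off the network or which normalization is applied, so both choices here are assumptions: `input_importance` sums absolute first-layer weights per input (one common heuristic, used purely as a placeholder), and `normalize_weights` rescales the result to sum to 1. The weight matrix and factor names are hypothetical.

```python
def input_importance(first_layer, names):
    """Sum absolute first-layer weights per input feature.
    first_layer[j][k] is the weight from input feature j to hidden unit k."""
    return {name: sum(abs(w) for w in row) for name, row in zip(names, first_layer)}

def normalize_weights(raw):
    """Rescale non-negative raw weights so they sum to 1 within one stratum."""
    total = sum(raw.values())
    if total == 0:
        raise ValueError("all raw weights are zero")
    return {name: w / total for name, w in raw.items()}

# Hypothetical first-layer weights for one stratum (3 inputs, 2 hidden units).
names = ["smoking frequency", "smoking cessation duration", "cancer history"]
first_layer = [[0.5, -1.5], [1.0, 0.5], [-0.25, 0.25]]

raw = input_importance(first_layer, names)
quantified = normalize_weights(raw)
for name, w in sorted(quantified.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {w:.3f}")
```

Normalizing within each stratum makes the weights comparable as shares of total influence, which is the form in which the example reports them.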
For comparison, a multivariate logistic regression model is also trained on each stratum of the preprocessed data obtained in step 2 to analyze the risk factors of lung cancer onset in the elderly, and its results are compared with the risk factors of the n stratified populations output by step 4, which completes the quantitative analysis. Specifically, a deep-neural-network-based model for identifying lung cancer onset risk factors in the elderly is trained on each stratified population and validated with ten-fold cross-validation; the model performance is shown in Table 1. Table 1 shows that the deep neural network model clearly outperforms the multivariate logistic regression model: model accuracy for people over 60 years old, men over 60, women over 60, and the whole population improves by 9.32%, 7.98%, 8.53%, and 8.86%, respectively. The deep neural network model also trains quickly, which saves time.
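Ten-fold cross-validation, as used for the comparison above, can be sketched in plain Python. The helper names are assumptions, and `fit_majority` is a deliberately trivial stand-in classifier: it is neither the deep neural network nor the logistic regression model, and it only shows where either model would plug in.

```python
def k_fold_indices(n, k=10):
    """Yield (train_idx, test_idx) for k contiguous folds over n samples."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size

def cross_val_accuracy(fit, X, y, k=10):
    """Mean test accuracy over k folds; fit(X_train, y_train) returns a predictor."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        hits = sum(predict(X[i]) == y[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)

def fit_majority(X_train, y_train):
    """Stand-in model: always predict the most frequent training label."""
    majority = max(set(y_train), key=y_train.count)
    return lambda row: majority

# Hypothetical data: 50 samples, one in five labelled as a case.
X = [[i] for i in range(50)]
y = [1 if i % 5 == 0 else 0 for i in range(50)]
print(round(cross_val_accuracy(fit_majority, X, y, k=10), 3))
```

In practice the rows would be shuffled before folding, and the same folds would be passed to both models so that their accuracies are compared on identical test data.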
TABLE 1 Performance of the risk factor identification model for lung cancer onset in the elderly
[Table 1 is an image in the original publication: Figure BDA0003339095300000061]
Table 2 shows how strongly each risk factor influences lung cancer onset in the four stratified populations.
TABLE 2 Influence of risk factors on lung cancer onset in the stratified populations
[Table 2 is an image in the original publication: Figure BDA0003339095300000062]
Quantitative analysis of the lung cancer onset risk factors in the different stratified populations yields the following results:
1. Smoking cessation duration and smoking frequency are major risk factors for lung cancer onset in people over 60 years of age, and the effect is more pronounced in men over 60. Elderly men who have quit smoking for only a short time and who smoke many times a day are more likely to develop lung cancer.
2. Smoking frequency is the most important risk factor for lung cancer onset in men over 60 years of age. As shown in Table 2, in this group the weights of smoking frequency and smoking cessation duration are 0.20 and 0.15, respectively, so the weight of smoking frequency is 33.3% higher than that of smoking cessation duration. The top four risk factors in elderly men (smoking frequency, smoking cessation duration, e-cigarette use, and having smoked at least 5 packs of cigarettes) are all related to smoking. These smoking-related risk factors have a greater impact on lung cancer onset in elderly men, so actively quitting smoking is particularly beneficial for reducing lung cancer in this group.
3. Smoking cessation duration and having smoked at least 5 packs of cigarettes are the most important risk factors for lung cancer onset in women over 60 years of age. As shown in Table 2, the weight of smoking cessation duration in women over 60 is 0.21, which is 16.7% higher than the weight of having smoked at least 5 packs of cigarettes. The top three risk factors in elderly women (smoking cessation duration, having smoked at least 5 packs of cigarettes, and smoking frequency) are all related to smoking. Thus, smoking-related risk factors have a greater impact on lung cancer onset in elderly women than other risk factors.
4. In all four stratified populations, cancer history ranks among the top risk factors for lung cancer, which suggests that cancer history plays a role in lung cancer onset.
5. Radiation exposure has a greater effect on lung cancer onset in women over 60 years of age than in the other groups. In addition, physical activity has some influence on lung cancer onset in the general population.
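The percentage comparisons in findings 2 and 3 follow from a simple relative-difference calculation, sketched below. The function name is an assumption. The value 0.18 for the women's "at least 5 packs" weight is not stated in the text (Table 2 is an image); it is back-calculated here from the reported 0.21 and 16.7%.

```python
def relative_increase(a, b):
    """Percentage by which weight a exceeds weight b."""
    return (a - b) / b * 100

# Men over 60: smoking frequency (0.20) versus smoking cessation duration (0.15).
print(f"{relative_increase(0.20, 0.15):.1f}%")  # prints 33.3%

# Women over 60: weight implied for "at least 5 packs" by 0.21 being 16.7% higher.
implied = 0.21 / (1 + 16.7 / 100)
print(f"{implied:.2f}")  # prints 0.18
```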
While the present invention has been described in detail with reference to a specific embodiment, it will be apparent to those skilled in the art that various changes and modifications can be made without departing from the spirit and scope of the invention.

Claims (7)

1. A quantitative analysis method for risk factors of lung cancer onset in the elderly, characterized by comprising the following steps:
step 1, acquiring survey data of the elderly and integrating the survey data with environmental data to form a cross-domain data source M;
step 2, preprocessing the cross-domain data source M obtained in step 1 to obtain preprocessed data;
step 3, training a deep neural network model on each stratum of the preprocessed data obtained in step 2 to obtain the risk factors of each stratified population, specifically comprising the following substeps:
step 3.1, converting the data format of each of the n population strata and establishing a training set and a test set for each;
step 3.2, training on the training sets to generate n trained deep neural network models, specifically: taking the data in a training set as the input of the deep neural network model, calculating the weights of the different risk factors through the hidden layer, and obtaining the weight value of each risk factor at the output layer;
step 3.3, inputting the test sets into the n trained deep neural network models to identify the risk factors of each stratified population, specifically: the data in a test set are used as the input of the corresponding trained model, the hidden layer of the model calculates the weight values of the different risk factors, and the output layer yields the weight value of each risk factor;
step 4, normalizing, for each stratified population, the weights of its lung cancer onset risk factors to obtain the quantitative risk factors of the n stratified populations.
2. The method for quantitatively analyzing the risk factors of lung cancer onset in the elderly according to claim 1, characterized in that step 2 specifically comprises the following substeps:
step 2.1, balancing the data with the Synthetic Minority Over-sampling Technique (SMOTE) to obtain balanced data;
step 2.2, performing missing-value imputation and noise smoothing on the balanced data to obtain smoothed data;
step 2.3, stratifying the smoothed data output by step 2.2 to obtain preprocessed data.
3. The method for quantitatively analyzing the risk factors of lung cancer onset in the elderly according to claim 2, characterized in that: in step 2.1, SMOTE is the Synthetic Minority Over-sampling Technique; the balancing addresses the data imbalance caused by the low prevalence of lung cancer, since the number of people with the disease is far lower than the number without it.
4. The method for quantitatively analyzing the risk factors of lung cancer onset in the elderly according to claim 2, characterized in that: in step 2.2, compared with the balanced data, the smoothed data no longer has missing values or incomplete records.
5. The method for quantitatively analyzing the risk factors of lung cancer onset in the elderly according to claim 2, characterized in that: in step 2.3, the stratification is specifically: dividing the data by gender, and then by age (greater than or equal to r, and less than r), to generate n population groups; the stratified data is called the preprocessed data.
6. The method for quantitatively analyzing the risk factors of lung cancer onset in the elderly according to claim 1, characterized in that: in step 3.1, x% of the data of each of the n population strata forms the training set and the remaining (100 - x)% forms the test set, wherein x ranges from 50 to 80.
7. The method for quantitatively analyzing the risk factors of lung cancer onset in the elderly according to claim 1, characterized in that: in step 3.2, the hidden-layer weights constitute the trained deep neural network model.
CN202111303117.3A 2021-11-05 2021-11-05 Quantitative analysis method for risk factors of senile lung cancer onset Pending CN113936795A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111303117.3A CN113936795A (en) 2021-11-05 2021-11-05 Quantitative analysis method for risk factors of senile lung cancer onset

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111303117.3A CN113936795A (en) 2021-11-05 2021-11-05 Quantitative analysis method for risk factors of senile lung cancer onset

Publications (1)

Publication Number Publication Date
CN113936795A (en) 2022-01-14

Family

ID=79285842

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111303117.3A Pending CN113936795A (en) 2021-11-05 2021-11-05 Quantitative analysis method for risk factors of senile lung cancer onset

Country Status (1)

Country Link
CN (1) CN113936795A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination