CN116779172A

CN116779172A - Lung cancer disease burden risk early warning method based on ensemble learning

Info

Publication number: CN116779172A
Application number: CN202310786560.3A
Authority: CN
Inventors: 马倩倩; 赵杰; 谭中科; 孙东旭; 高景宏; 卢耀恩; 石金铭; 陈保站; 陈昊天; 王振博
Original assignee: First Affiliated Hospital of Zhengzhou University
Current assignee: First Affiliated Hospital of Zhengzhou University
Priority date: 2023-06-30
Filing date: 2023-06-30
Publication date: 2023-09-19

Abstract

The invention discloses a lung cancer disease burden risk early warning method based on ensemble learning, belonging to the technical field of big data. The method comprises integrating and cleaning data, screening prediction indexes and reducing their dimensionality, measuring hysteresis effects, establishing a prediction model pool, validating and optimizing the models, evaluating their prediction performance, and combining several models by Stacking integration, thereby solving the technical problem of providing more accurate reference data for predicting lung cancer disease burden.

Description

Lung cancer disease burden risk early warning method based on ensemble learning

Technical Field

The invention belongs to the technical field of big data, and particularly relates to a lung cancer disease burden risk early warning method based on ensemble learning.

Background

Some studies suggest that most cancers are attributable to environmental rather than genetic factors, that is, they are diseases caused by prolonged exposure to low doses of environmental carcinogens.

Numerous studies have demonstrated a significant relationship between air pollution and tumors, but the pollutants studied are mostly limited to PM2.5, PM10, SO2, etc., while NH3, OC, BC, CO, NOx, NMVOC, etc. have received less attention. Meanwhile, a prediction model that integrates multidimensional features such as environment, air pollution, economy and weather is lacking.

Considering that environmental, economic and other factors act with different hysteresis effects, hysteresis analysis of the prediction indexes can greatly extend the external prediction window of the model, yet current models do not take the hysteresis effect into account.

No research in the prior art analyzes the relationship between air pollutants and lung cancer disease burden over a longer time series.

ARIMA is a traditional time series model. It places relatively high requirements on the data and needs a long continuous time series; if the series is too short the model is unreliable, and model identification and calculation are relatively complex. Current common methods fail to meet the growing demands of medical big data. Because different methods suit different data, a disease burden prediction method is proposed that suits various data distributions and integrates deep learning, machine learning, statistical regression and other models, so that time series data with high dimensionality and different temporal granularities can be processed and prediction precision is improved.

Disclosure of Invention

The invention aims to provide a lung cancer disease burden risk early warning method based on ensemble learning, which solves the technical problem of providing more accurate reference data for predicting lung cancer disease burden.

In order to achieve the above purpose, the invention adopts the following technical scheme:

a lung cancer disease burden risk early warning method based on ensemble learning comprises the following steps:

step 1: establishing a database server, wherein the database server acquires disease burden data, meteorological data, air pollution data, regional economic data and time characteristic data through the Internet, integrates and cleans the data to construct a lung cancer disease burden characteristic database, and visualizes the database data in charts to display the time-series characteristics of the disease and of the features;

step 2: establishing a model server, wherein the model server acquires the integrated and cleaned data from the database server, reduces and screens the prediction indexes through information entropy and principal component analysis, and measures the hysteresis effect of each prediction index on the lung cancer disease burden through gray correlation analysis;

constructing a prediction model pool on the training sequence, wherein the prediction model pool comprises a GAM model, an LSTM model, a GM (1, N) model, an ARIMA model, an XGBoost algorithm model, an RFR algorithm model, a BP neural network model and an AdaBoost algorithm model; validating each model in the prediction model pool, optimizing its parameters, iteratively updating it, evaluating its prediction performance on a test set, and ranking the models by prediction performance;

step 3: establishing an integrated model server, wherein the integrated model server selects the 4 models with the best prediction performance from the prediction model pool as the first-layer base learners of Stacking ensemble learning; each base learner is applied to the validation set and the prediction set respectively, and its predictions form a new training set and a new test set that serve as the input of the second-layer meta learner of Stacking; a linear regression model and a ridge regression model are taken as candidate meta learners, and the better one is selected through prediction performance evaluation to obtain the final integrated model; relevant reference data is provided for the prediction of the future s-step period based on the hysteresis effect indexes;

step 4: the integrated model server visually displays the result obtained in step 3.

Preferably, when the step 1 is executed, the data is integrated and cleaned, specifically, the abnormal data, the missing data, the repeated data and the inconsistent data are cleaned.

Preferably, when step 1 is executed, the missing data is filled by a mathematical statistical method such as the mean value method, the regression method or the multiple imputation method, variables with a missing proportion exceeding 10% are removed, and standard data is obtained after the data is integrated and cleaned through the steps of data analysis, definition of a cleaning strategy, data inspection, execution of data cleaning, data quality evaluation and clean data backflow.
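As an illustration only, the 10% removal rule and the mean value method might be sketched as follows. The table layout (a dict of columns with `None` marking a gap), the function name `clean_columns` and the toy values are assumptions made for this sketch, not part of the disclosed method; regression filling and multiple imputation are not shown.

```python
from statistics import mean

def clean_columns(table, max_missing=0.10):
    """Drop variables whose missing share exceeds max_missing, then fill
    the remaining gaps with the column mean (mean value method)."""
    cleaned = {}
    for name, values in table.items():
        observed = [v for v in values if v is not None]
        missing_share = (len(values) - len(observed)) / len(values)
        if missing_share > max_missing or not observed:
            continue  # variable removed: too many missing values
        fill = mean(observed)
        cleaned[name] = [fill if v is None else v for v in values]
    return cleaned

# Hypothetical toy data: 10 yearly records per variable.
raw = {
    "pm25": [35, 40, None, 38, 42, 41, 39, 37, 36, 43],  # 10% missing: kept
    "nh3": [None, None, 5, 6, None, 7, 8, 9, 10, 11],    # 30% missing: removed
}
clean = clean_columns(raw)
```

A variable with exactly 10% missing values is kept here, since the text removes only variables "exceeding" 10%.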

Preferably, when step 1 is executed, the visual display of database data through charts specifically comprises: collecting as much data as possible; after data mining and cleaning, arranging the data from different sources into primary indexes such as disease burden, weather, air pollution, economy and other environmental data to construct a lung cancer disease burden risk early warning primary database; carrying out descriptive statistical analysis of the regional environmental pollution, weather characteristic and economic characteristic distributions through the mean, standard deviation, extremum and quartile; and calculating the compound annual growth rate of the disease burden.

Preferably, when executing step 2, the screening of the prediction index specifically includes the following steps:

step 2-1: acquiring initial indexes based on importance screening through subjective expert interviews and literature theory collection;

step 2-2: screening initial indexes based on information entropy, calculating the comparison information entropy of different initial indexes and lung cancer disease burden, eliminating indexes with lower relevance to the disease burden from the initial indexes, and eliminating redundant indexes with higher relevance;

step 2-3: screening important indexes or extracting principal components as new indexes based on principal component analysis, specifically comprising the following steps:

step 2-3-1: constructing an index matrix $X = (x_{ij})_{n \times p}$, wherein $x_{np}$ represents the p-th index value of the n-th sample, and n and p respectively represent the number of rows and the number of columns of the matrix;

step 2-3-1: performing standardized transformation on the matrix X to obtain Z;

step 2-3-2: calculating the correlation coefficient matrix of the standardized matrix Z, $R = \frac{Z^{T}Z}{n-1}$, wherein n represents the number of samples, and T represents matrix transposition;

step 2-3-3: calculating the eigenvalues $\lambda_j$ of the correlation coefficient matrix R and the corresponding orthonormal eigenvectors $a_j$;

obtaining the principal component scores $F_i = a_{1i}x_1 + a_{2i}x_2 + \dots + a_{pi}x_p$, wherein i is the principal component number and p is the total number of indexes;

step 2-3-4: calculating the factor loadings, wherein the loading of index $x_j$ on principal component $F_i$ is $l(F_i, x_j) = \sqrt{\lambda_i}\,a_{ji}$, which reflects the degree of correlation between $F_i$ and $x_j$ and represents the importance of each variable in the principal component and its contribution to the result; important indexes are screened out by $|l(F_i, x_j)|$, wherein j is the index number and i is the principal component number;

step 2-3-5: when there are too many indexes, selecting k principal components as new indexes, the value of k being determined by the cumulative information contribution rate of the principal components reaching 80%.
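The principal component steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the eigenpairs are obtained by power iteration with deflation (a stand-in chosen to keep the sketch dependency-free; a numerical library would normally be used), the loading is taken as the square root of the eigenvalue times the eigenvector entry, and the function names and the toy matrix are hypothetical.

```python
import math

def standardize(X):
    n, p = len(X), len(X[0])
    mu = [sum(r[j] for r in X) / n for j in range(p)]
    sd = [math.sqrt(sum((r[j] - mu[j]) ** 2 for r in X) / (n - 1)) for j in range(p)]
    return [[(r[j] - mu[j]) / sd[j] for j in range(p)] for r in X]

def correlation_matrix(Z):
    # R = Z^T Z / (n - 1) for a standardized matrix Z
    n, p = len(Z), len(Z[0])
    return [[sum(z[a] * z[b] for z in Z) / (n - 1) for b in range(p)] for a in range(p)]

def eigenpairs(R, iters=500):
    """Power iteration with deflation; adequate only for small symmetric
    matrices with well separated eigenvalues."""
    p = len(R)
    A = [row[:] for row in R]
    pairs = []
    for _ in range(p):
        v = [float(j + 1) for j in range(p)]
        for _ in range(iters):
            w = [sum(A[a][b] * v[b] for b in range(p)) for a in range(p)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        size = math.sqrt(sum(x * x for x in v))
        v = [x / size for x in v]
        lam = sum(v[a] * A[a][b] * v[b] for a in range(p) for b in range(p))
        pairs.append((lam, v))
        # Remove the component just found before searching for the next one.
        A = [[A[a][b] - lam * v[a] * v[b] for b in range(p)] for a in range(p)]
    return pairs

def choose_k(eigenvalues, threshold=0.80):
    """Smallest k whose cumulative contribution rate reaches the threshold."""
    total, running = sum(eigenvalues), 0.0
    for k, lam in enumerate(eigenvalues, start=1):
        running += lam
        if running / total >= threshold:
            return k
    return len(eigenvalues)

def loadings(pairs, k):
    # loadings[i][j]: loading of index x_(j+1) on principal component F_(i+1)
    return [[math.sqrt(max(lam, 0.0)) * a for a in vec] for lam, vec in pairs[:k]]

# Hypothetical data: 5 samples, 3 indexes (the first two nearly collinear).
X = [[1, 2.0, 5], [2, 4.1, 3], [3, 6.2, 6], [4, 7.9, 2], [5, 10.1, 4]]
pairs = eigenpairs(correlation_matrix(standardize(X)))
eigs = [lam for lam, _ in pairs]
k = choose_k(eigs)
L = loadings(pairs, k)
```

In this toy case the first component explains roughly 71% of the variance, so two components are needed to reach the 80% threshold.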

Preferably, when step 2 is executed, the hysteresis effect of each prediction index on the lung cancer disease burden is measured through gray correlation analysis; specifically, gray correlation analysis quantitatively compares the geometric shapes of the study variable sequence and the related factor sequences to judge the degree of correlation between each related factor and the study variable, and Deng's correlation degree is used to analyze the degree of influence and the hysteresis effect of each prediction index on lung cancer morbidity and mortality.

Preferably, when executing step 3, the method specifically comprises the following steps:

step 3-1: forming the first layer of the Stacking integrated model, wherein a GAM model, an LSTM model, a GM (1, N) model, an ARIMA model, an XGBoost algorithm model, an RFR algorithm model, a BP neural network model and an AdaBoost algorithm model form a prediction model pool, and the 4 regression algorithm models with the best prediction performance are selected from the prediction model pool as the Stacking first layer;

step 3-2: applying each parameter-optimized predictor to the validation set and the prediction set respectively, combining the prediction results on the validation set to form a new training set, and forming a new test set from the prediction results on the test set by weighted averaging, the new sets serving as the input of the Stacking second layer;

step 3-3: introducing a meta learner into the second layer of the Stacking integrated model, performing regression training with the prediction results of the previous layer as the training set and the test set, taking a linear regression model and a ridge regression model as candidate meta learners, and selecting the final meta learner through prediction performance evaluation;

step 3-4: based on the hysteresis effect indexes, providing relevant reference data for the prediction of the future s-step period.
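A simplified sketch of the two-layer idea in steps 3-1 to 3-3 is given below. It assumes the base learners have already produced predictions, ranks them by RMSE on the validation targets (a simplification of the ranking described in the text), and fits a linear or ridge meta learner without intercept by solving the normal equations. The names `stack` and `fit_meta`, the model names and all numbers are hypothetical.

```python
import math

def rmse(y, yhat):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small linear system."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def fit_meta(P, y, alpha=0.0):
    """Meta learner weights: linear regression (alpha = 0) or ridge
    regression (alpha > 0), solving (P^T P + alpha I) w = P^T y."""
    m = len(P[0])
    G = [[sum(r[a] * r[b] for r in P) + (alpha if a == b else 0.0)
          for b in range(m)] for a in range(m)]
    h = [sum(r[a] * t for r, t in zip(P, y)) for a in range(m)]
    return solve(G, h)

def stack(val_preds, y_val, test_preds, top=4, alpha=0.0):
    # First layer: keep the `top` base learners with the lowest validation RMSE.
    ranked = sorted(val_preds, key=lambda name: rmse(y_val, val_preds[name]))[:top]
    # Second layer: base predictions on the validation set are the new training set.
    P_val = [list(row) for row in zip(*(val_preds[m] for m in ranked))]
    w = fit_meta(P_val, y_val, alpha)
    P_test = zip(*(test_preds[m] for m in ranked))
    return ranked, w, [sum(wi * p for wi, p in zip(w, row)) for row in P_test]

# Hypothetical predictions from three base learners.
y_val = [10, 12, 14, 13, 15]
val_preds = {"gam": [11, 13, 15, 14, 16], "lstm": [9, 11, 13, 12, 14], "arima": [0, 0, 0, 0, 0]}
test_preds = {"gam": [17], "lstm": [15], "arima": [0]}
ranked, weights, final = stack(val_preds, y_val, test_preds, top=2)
```

Calling `stack` once with `alpha=0.0` and once with a positive `alpha`, then comparing the errors of the two results, would mirror the choice between the linear and ridge candidates.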

The lung cancer disease burden risk early warning method based on ensemble learning solves the technical problem of providing more accurate reference data for predicting lung cancer disease burden. The method fuses multi-source data to provide more comprehensive information, makes full use of the information from multiple prediction models, and combines the results of several models into a strong predictor; this exploits the advantages of the different models, reduces the uncertainty and bias of any single model, improves prediction accuracy and stability, and provides more accurate prediction reference data than a single prediction model. The models of the invention have different features and capabilities in processing time series data, so combining them allows data with various characteristics to be processed and different data characteristics and trends to be considered more comprehensively, which improves accuracy. Analyzing the hysteresis effects of the different prediction indexes captures the time-sequence relationship between each index and the disease burden; taking these effects into account allows the prediction model to be established more accurately and allows prediction reference data to be provided over a longer time range. The method adopts a rolling window technique to realize cross validation of time series data for training both the single prediction models and the meta learning model, which helps model parameter estimation.

Drawings

FIG. 1 is a diagram of a data architecture of the present invention;

FIG. 2 is a schematic diagram of a data cleansing flow according to the present invention;

FIG. 3 is a flow chart of index screening of the present invention;

FIG. 4 is a schematic diagram of an LSTM network architecture of the present invention;

FIG. 5 is a schematic view of the Stacking structure of the present invention.

Detailed Description

The lung cancer disease burden risk early warning method based on ensemble learning shown in fig. 1-5 comprises the following steps:

Air pollution: primary particles (particulate matter PM10 and PM2.5, carbonaceous species (black carbon BC, organic carbon OC)), acidifying gases (nitrogen oxides NOx, sulfur dioxide SO2), ozone precursor gases (carbon monoxide CO, nitrogen oxides NOx, non-methane volatile organic compounds NMVOC), ammonia NH3, and the like.

Weather factors: average relative humidity, average air temperature, average rainfall, average barometric pressure.

Regional economic level: GDP, personal income.

Time characteristic data: season, holiday, week data.

Other environmental pollution: water pollution data such as wastewater discharge, chemical oxygen demand and total ammonia nitrogen discharge, and data on the amount of general industrial solid waste produced.

Disease burden: sex, number of lung cancer cases, morbidity, mortality, DALYs and DALY rate.

When step 1 is executed, the data are integrated and cleaned; specifically, abnormal data, missing data, repeated data and inconsistent data are cleaned. Missing data are filled by a mathematical statistical method such as the mean value method, the regression method or the multiple imputation method, and variables with a missing proportion exceeding 10% are removed. Standard data are obtained after the data are integrated and cleaned through the steps of data analysis, definition of a cleaning strategy, data inspection, execution of data cleaning, data quality evaluation and clean data backflow. The database data are then visualized in charts; specifically, as much data as possible is collected, and after data mining and cleaning, the data from different sources are arranged into primary indexes such as disease burden, weather, air pollution, economy and other environmental data to construct a lung cancer disease burden risk early warning primary database. Descriptive statistical analysis of the regional environmental pollution, weather characteristic and economic characteristic distributions is carried out through the mean, standard deviation, extremum and quartile, and the compound annual growth rate (CAGR) of the disease burden is calculated, the specific formula of which is as follows:

wherein y represents a disease burden value, and n represents the years of the disease burden sequence.
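For orientation, the usual definition of a compound annual growth rate can be sketched as below. The sketch assumes a yearly series $y_1, \dots, y_n$ with $n - 1$ compounding intervals between the first and last value; whether the formula of the invention uses exactly this exponent cannot be confirmed from the text, and the DALY values are made up.

```python
def cagr(series):
    """Compound annual growth rate of a yearly series y_1..y_n, assuming
    n - 1 compounding intervals between the first and last value."""
    n = len(series)
    if n < 2 or series[0] <= 0:
        raise ValueError("need at least two years and a positive first value")
    return (series[-1] / series[0]) ** (1 / (n - 1)) - 1

# Hypothetical DALY values for three consecutive years.
dalys = [100.0, 110.0, 121.0]
rate = cagr(dalys)  # about 0.10, i.e. roughly 10% growth per year
```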

The dataset typically contains some unimportant or redundant indexes, which seriously degrade the prediction performance of the model. In addition, redundant indexes tend to be strongly correlated with one another, which causes multicollinearity problems in regression models. It is therefore desirable to select indexes that are highly correlated with the lung cancer disease burden but not with each other. Indexes that are irrelevant to, or redundant for, the prediction of lung cancer disease burden are removed; removing them loses no information, while shortening model training time and reducing overfitting, thereby establishing a true and effective prediction index system and improving model accuracy.

Forming an initial index system set on the basis of disease burden risk factor analysis, and then forming a final prediction index system by adopting a method combining subjective analysis and objective analysis, wherein the screening of the prediction index specifically comprises the following steps:

The information gain $g(x, y) = H(x) - H(x \mid y)$ is calculated, wherein $H(x)$ is the information entropy of index x, and $H(x \mid y)$ is the conditional entropy of x given y.

The comparison information entropy $IR(x, y)$ is calculated, which reflects the degree of correlation between indexes or between an index and the lung cancer disease burden.

According to the above formula, the degree of correlation between each index and the lung cancer disease burden is calculated; if $IR(x_i, y) \le \eta_1$, the index is considered to have low correlation with the lung cancer disease burden and is eliminated, wherein $\eta_1$ represents the set information entropy threshold.

The degree of correlation between the remaining indexes is then calculated in the same way; if $IR(x_i, x_j) \ge \eta_2$, the two indexes are considered redundant, and the one with the lower degree of correlation with the lung cancer disease burden is eliminated, wherein $\eta_2$ represents the set information entropy threshold.
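The two-threshold screening can be illustrated with the sketch below. Because the exact expression of the comparison information entropy is not recoverable from the text, the sketch assumes $IR(x, y) = g(x, y) / H(x)$, i.e. the information gain normalized by the entropy of x; this normalization, the discretized toy levels and the threshold values are all assumptions for illustration.

```python
import math
from collections import Counter

def entropy(xs):
    n = len(xs)
    return -sum(c / n * math.log2(c / n) for c in Counter(xs).values())

def conditional_entropy(xs, ys):
    """H(x | y) for two equally long discrete sequences."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return sum(len(g) / len(xs) * entropy(g) for g in groups.values())

def info_gain(xs, ys):
    return entropy(xs) - conditional_entropy(xs, ys)

def ir(xs, ys):
    # ASSUMPTION: gain normalized by H(x); the source formula is not available.
    hx = entropy(xs)
    return info_gain(xs, ys) / hx if hx > 0 else 0.0

def screen(indexes, burden, eta1, eta2):
    """Drop indexes with IR <= eta1 against the burden, then drop the less
    relevant index of every pair whose mutual IR is >= eta2."""
    relevance = {k: ir(v, burden) for k, v in indexes.items()}
    kept = sorted((k for k in indexes if relevance[k] > eta1),
                  key=lambda k: relevance[k], reverse=True)
    selected = []
    for k in kept:
        if all(ir(indexes[k], indexes[s]) < eta2 for s in selected):
            selected.append(k)
    return selected

# Hypothetical discretized levels for six periods.
burden = ["hi", "hi", "lo", "lo", "hi", "lo"]
indexes = {
    "pm25": ["hi", "hi", "lo", "lo", "hi", "lo"],
    "pm10": ["hi", "hi", "lo", "lo", "hi", "lo"],  # redundant with pm25
    "noise": ["a", "b", "a", "b", "b", "a"],      # weakly related to burden
}
selected = screen(indexes, burden, eta1=0.2, eta2=0.9)
```

Continuous indexes would need to be discretized before entropies are computed; the binning scheme is a further choice left open here.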

step 2-3-1: performing standardized transformation on the matrix X to obtain Z;

step 2-3-2: calculating the correlation coefficient matrix of the standardized matrix Z, $R = \frac{Z^{T}Z}{n-1}$, wherein n represents the number of samples, and T represents matrix transposition;

obtaining the principal component scores $F_i = a_{1i}x_1 + a_{2i}x_2 + \dots + a_{pi}x_p$, wherein i is the principal component number and p is the total number of indexes;

step 2-3-4: calculating the factor loadings, wherein the loading of index $x_j$ on principal component $F_i$ is $l(F_i, x_j) = \sqrt{\lambda_i}\,a_{ji}$, which reflects the degree of correlation between $F_i$ and $x_j$ and represents the importance of each variable in the principal component and its contribution to the result; important indexes are screened out by $|l(F_i, x_j)|$, wherein j is the index number and i is the principal component number;

The hysteresis effect of each prediction index on the lung cancer disease burden is measured through gray correlation analysis; specifically, gray correlation analysis quantitatively compares the degree of geometric similarity or dissimilarity between the study variable sequence and the related factor sequences to judge the degree of correlation between each related factor and the study variable, and Deng's correlation degree is adopted to analyze the degree of influence and the hysteresis effect of each prediction index on lung cancer morbidity and mortality.

In this embodiment, air and other environmental pollution, weather factors and economic indexes influence disease with a hysteresis effect, which is calculated through the following steps:

step S1: taking the disease burden as the reference sequence $X_0 = (x_0(1), \dots, x_0(k), \dots, x_0(n))$;

step S2: taking the environmental pollution, weather and other indexes at different lag periods as comparison sequences $X_i = (x_i(1), \dots, x_i(k), \dots, x_i(n))$;

step S3: calculating the correlation coefficient and correlation degree between each current-period index and the lung cancer disease burden, wherein the correlation coefficient of the i-th comparison sequence $X_i$ with respect to the disease burden reference sequence $X_0$ at point k is

$$\xi_i(k) = \frac{\min_i \min_k |x_0(k) - x_i(k)| + \varphi \max_i \max_k |x_0(k) - x_i(k)|}{|x_0(k) - x_i(k)| + \varphi \max_i \max_k |x_0(k) - x_i(k)|},$$

with the resolution coefficient $\varphi$ taken as 0.5;

the Deng's gray correlation degree of the i-th comparison sequence $X_i$ with respect to the disease burden reference sequence $X_0$ is

$$\gamma_i = \frac{1}{n} \sum_{k=1}^{n} \xi_i(k);$$

step S4: calculating the correlation degree $\gamma_i(-t)$ between the sequences at different lags t and the lung cancer disease burden;

step S5: if $\gamma_i(-t)$ is largest at a lag of T years, the hysteresis effect of index $X_i$ is T;

step S6: repeating until the hysteresis effects of all indexes are obtained.
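Steps S1 to S5 can be sketched for a single index as follows. The sketch uses the standard Deng coefficient with a resolution coefficient of 0.5 and treats the lagged copies of one index as the set of comparison sequences; it assumes the sequences are already on comparable scales (no initial-value or mean normalization is shown), and the function names and toy series are hypothetical.

```python
def grey_relational_degree(reference, comparisons, rho=0.5):
    """Deng's grey relational degree of each comparison sequence with the
    reference sequence; rho is the resolution coefficient."""
    diffs = {name: [abs(r - c) for r, c in zip(reference, seq)]
             for name, seq in comparisons.items()}
    d_min = min(min(d) for d in diffs.values())
    d_max = max(max(d) for d in diffs.values())
    if d_max == 0:
        return {name: 1.0 for name in diffs}
    return {name: sum((d_min + rho * d_max) / (d + rho * d_max) for d in ds) / len(ds)
            for name, ds in diffs.items()}

def best_lag(burden, index, max_lag):
    """Pair burden[t] with index[t - lag] for lag = 0..max_lag on a common
    window and return the lag with the largest relational degree."""
    n = len(burden) - max_lag
    reference = burden[max_lag:]
    lagged = {lag: index[max_lag - lag: max_lag - lag + n]
              for lag in range(max_lag + 1)}
    degrees = grey_relational_degree(reference, lagged)
    return max(degrees, key=degrees.get), degrees

# Hypothetical series in which the burden follows the index two years later.
index = [1, 3, 2, 5, 4, 6, 8, 7, 9, 10]
burden = [0, 0, 1, 3, 2, 5, 4, 6, 8, 7]
lag, degrees = best_lag(burden, index, max_lag=3)
```

Repeating `best_lag` for every prediction index corresponds to the loop in step S6.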

the method specifically comprises the following steps:

In this embodiment, a sliding window is used to divide the data into a training sequence, a validation sequence and a test sequence. GAM, LSTM, GM (1, N), ARIMA and other models are constructed separately on the training sequence; after validation, parameter optimization and iterative updating, these models form the first layer of the Stacking integrated model.

Generalized Additive Model (GAM)

GAM is an extension of the generalized linear model, originally proposed by Hastie and Tibshirani. It can evaluate both linear and nonlinear associations of environmental factors, time, etc. with health effects, and can control confounding caused by time-dependent variables (e.g., seasonal and long-term trends). GAM places few requirements on the sample and is widely applicable. Its expression is as follows:

$Y = g(u) + \varepsilon$;

$g(u_i) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_i(x_i) + \dots + f_m(x_m)$;

wherein $f_i(x_i)$ is a smooth function of the prediction index $x_i$, and $g(u_i)$ is the link function. Because cancer morbidity and mortality follow a Poisson distribution, a Poisson regression model is adopted to establish the lung cancer disease burden and risk prediction model.

Long short-term memory model (LSTM)

The long short-term memory model (LSTM) is an improved recurrent neural network (RNN). It handles long-term dependencies robustly and alleviates the vanishing-gradient and exploding-gradient problems, so it predicts more accurately than an ordinary RNN on longer sequences.

Each unit has components such as an input gate, a forget gate and an output gate.

Parameter definitions: $x_t$ is the information input at time t; $c_{t-1}$ is the network memory state at time t-1; $h_{t-1}$ is the information output at time t-1, which is also an information input at time t. $i_t$, $f_t$ and $o_t$ are the input gate, forget gate and output gate unit variables, respectively. $\sigma$ represents the Sigmoid activation function; tanh represents the tanh activation function; $\tilde{c}_t$ is the cell state update value at time t; $W_i$ represents the input weight; $U_i$ represents the output weight; $b_i$ represents the bias.

The forget gate decides, from the previous output and the new input, which information to remove from the features stored in the hidden layer.

$f_t = \sigma(W_f \times (h_{t-1}, x_t) + b_f)$

The input gate determines the new information to be stored in the unit's feature information, which is used to update the cell state.

$i_t = \sigma(W_i \times (h_{t-1}, x_t) + b_i)$

The output gate outputs the state information at the current time and decides the value of the next hidden state.

$h_t = \sigma(W_o \times (h_{t-1}, x_t) + b_o) \times \tanh(c_t)$

Model training: the LSTM model is trained using the training data. During training, the input sequence is provided to the LSTM model so that the model can learn the patterns and characteristics of the sequence, and the weights and biases of the model are adjusted using a loss function and an optimization algorithm to minimize the difference between the predicted output and the actual output.
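To make the gate equations concrete, one forward step of a single-unit LSTM can be sketched as below. The candidate state and the cell update $c_t = f_t c_{t-1} + i_t \tilde{c}_t$ are not written out in the text and follow the standard LSTM formulation here, which is an assumption; the scalar parameter layout `params[g] = (w_h, w_x, b)` is likewise hypothetical.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One forward step of a single-unit LSTM cell. params[g] holds the
    scalar weights (w_h, w_x, b) for the gates 'f', 'i', 'o' and for the
    candidate state 'c'."""
    def pre(g):
        w_h, w_x, b = params[g]
        return w_h * h_prev + w_x * x_t + b

    f_t = sigmoid(pre("f"))          # forget gate
    i_t = sigmoid(pre("i"))          # input gate
    o_t = sigmoid(pre("o"))          # output gate
    c_tilde = math.tanh(pre("c"))    # candidate cell state update value
    c_t = f_t * c_prev + i_t * c_tilde
    h_t = o_t * math.tanh(c_t)       # same as sigma(...) * tanh(c_t) above
    return h_t, c_t

# With all weights zero every gate is 0.5 and the candidate state is 0.
zero = {g: (0.0, 0.0, 0.0) for g in "fioc"}
h_t, c_t = lstm_step(x_t=1.0, h_prev=0.0, c_prev=2.0, params=zero)
```

Training would wrap this step in a loop over the sequence and adjust the weights by gradient descent on a loss function, which is not shown.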

GM (1, N) model

The gray prediction model is suitable for problems with small sample sizes and poor information. The GM (1, N) model is the basic model of multivariable gray system modeling; it can analyze multiple factors as a whole and dynamically, and reflects the dynamic relationship between the study variable sequence and the related factor sequences. The model contains one study variable and N-1 influencing factor variables. The time response function and the inverse accumulated (subtractive) reduction of GM (1, N) are, respectively,

$$\hat{x}_1^{(1)}(k+1) = \Big[x_1^{(0)}(1) - \frac{1}{a}\sum_{i=2}^{N} b_i x_i^{(1)}(k+1)\Big] e^{-ak} + \frac{1}{a}\sum_{i=2}^{N} b_i x_i^{(1)}(k+1),$$

$$\hat{x}_1^{(0)}(k+1) = \hat{x}_1^{(1)}(k+1) - \hat{x}_1^{(1)}(k),$$

wherein $x_1^{(0)}$ is the original study variable sequence, $x_1^{(1)}$ is the new data series generated by first-order accumulation of the original sequence, a is the system development coefficient, and $b_i$ is the driving coefficient of each related factor.

ARIMA model

The stationarity of the sequence is judged from the time series plot and a stationarity test of the original data.

If the sequence is non-stationary, it is made stationary by differencing or data transformation, and the stationarity of the differenced sequence is confirmed by a stationarity test.

The model type is preliminarily identified from the autocorrelation function (ACF) plot and the partial autocorrelation function (PACF) plot, and the model orders are determined.

Depending on whether the original data sequence has a seasonal trend, the model can be a seasonal ARIMA (p, d, q)(P, D, Q)S or a non-seasonal ARIMA (p, d, q), where (p, d, q) and (P, D, Q) are the orders of the non-seasonal and seasonal autoregression (AR), differencing and moving average (MA), respectively, and S represents the seasonal period.

The optimal model is selected according to the Akaike information criterion (AIC) and the Bayesian information criterion (BIC).
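The differencing and ACF steps of this workflow can be sketched in plain Python as follows; the stationarity test, the PACF and the AIC/BIC comparison are not shown and would normally come from a statistics library. The helper names and the toy series are assumptions for illustration.

```python
def difference(series, d=1):
    """Apply d rounds of first-order differencing to remove trend."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

def acf(series, max_lag):
    """Sample autocorrelation function for lags 0..max_lag."""
    n = len(series)
    mu = sum(series) / n
    c0 = sum((x - mu) ** 2 for x in series)
    return [sum((series[t] - mu) * (series[t + k] - mu) for t in range(n - k)) / c0
            for k in range(max_lag + 1)]

# Hypothetical yearly counts with a quadratic trend.
counts = [1, 3, 6, 10, 15, 21]
once = difference(counts, d=1)   # still trending
twice = difference(counts, d=2)  # constant, so d = 2 removes the trend
rho = acf(once, max_lag=2)
```

A slowly decaying ACF on the differenced series suggests that further differencing is needed, which is the kind of evidence used to choose d.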

Other models

This embodiment also builds prediction models based on the XGBoost algorithm, the RFR algorithm, the BP neural network, AdaBoost and other algorithms.

In this embodiment, model parameters are tuned by a hyperparameter optimization algorithm combining grid search with cross-validation evaluation based on a rolling prediction origin; the rolling window technique ensures that enough base predictions are generated for model training. The hyperparameter intervals to be tested are combined into a multidimensional space, which is divided into a grid according to the search step of each interval, each grid point corresponding to one set of parameter values. A model test is then run for each grid point to obtain the evaluation index of the corresponding hyperparameter combination, and the hyperparameters with the best evaluation index are selected as the optimized hyperparameters of the prediction model, thereby improving prediction performance.
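A minimal sketch of grid search scored by rolling-origin evaluation is given below. The expanding training window, the mean absolute error used as the evaluation index, and the moving-average forecaster standing in for a real model are all assumptions chosen to keep the example small.

```python
from itertools import product

def rolling_origin_splits(n, initial, horizon=1):
    """Yield (train, test) index ranges with a forecast origin that rolls forward."""
    origin = initial
    while origin + horizon <= n:
        yield range(0, origin), range(origin, origin + horizon)
        origin += horizon

def grid_search(series, grid, fit_predict, initial, horizon=1):
    """Evaluate every hyperparameter combination on all rolling splits and
    return (best mean absolute error, best parameters)."""
    best = None
    for combo in product(*grid.values()):
        params = dict(zip(grid.keys(), combo))
        errors = []
        for train_idx, test_idx in rolling_origin_splits(len(series), initial, horizon):
            train = [series[i] for i in train_idx]
            preds = fit_predict(train, horizon, **params)
            errors += [abs(series[i] - p) for i, p in zip(test_idx, preds)]
        score = sum(errors) / len(errors)
        if best is None or score < best[0]:
            best = (score, params)
    return best

def moving_average(train, horizon, window):
    # Stand-in model: forecast the mean of the last `window` observations.
    level = sum(train[-window:]) / window
    return [level] * horizon

series = list(range(1, 11))  # hypothetical upward-trending series
best_score, best_params = grid_search(series, {"window": [1, 2, 3]},
                                      moving_average, initial=4)
```

On a steadily rising series the shortest window lags the trend least, so it is selected.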

Temporal granularity optimization targets models with poor prediction performance: time series at different time scales are selected for prediction, including fine-grained prediction and coarse-grained prediction.

New-data learning supplements the historical time series with newly available data, dynamically adding new real data for model update learning.

In this embodiment, the evaluation of the prediction performance of each model specifically includes: testing each prediction model on the test set, and evaluating its performance with indexes such as MER, MAPE, MAE and RMSE; the smaller the index value, the higher the model precision.

Mean error rate (MER), the mean absolute error divided by the mean actual value:

$$\mathrm{MER} = \frac{\frac{1}{n}\sum_{i=1}^{n} |\hat{y}_i - y_i|}{\frac{1}{n}\sum_{i=1}^{n} y_i}$$

Mean absolute percentage error (MAPE); when MAPE is lower than 10%-15%, the prediction accuracy is good:

$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n} \left|\frac{\hat{y}_i - y_i}{y_i}\right|$$

Mean absolute error (MAE):

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |\hat{y}_i - y_i|$$

Root mean square error (RMSE), the square root of the mean of the squared errors between the true and predicted values:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}$$

wherein $\hat{y}_i$ and $y_i$ represent the fitted value and the actual value, respectively, and n is the number of samples.
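The four evaluation indexes can be computed together as in the sketch below, which follows the usual definitions of MAE, MAPE and RMSE and takes MER as the mean absolute error divided by the mean actual value. The function name `evaluate` and the sample values are hypothetical.

```python
import math

def evaluate(actual, fitted):
    """Return MER, MAPE (in percent), MAE and RMSE for paired sequences."""
    n = len(actual)
    abs_err = [abs(a - f) for a, f in zip(actual, fitted)]
    mae = sum(abs_err) / n
    return {
        "MER": mae / (sum(actual) / n),
        "MAPE": 100 * sum(e / abs(a) for e, a in zip(abs_err, actual)) / n,
        "MAE": mae,
        "RMSE": math.sqrt(sum(e * e for e in abs_err) / n),
    }

# Hypothetical actual and fitted disease burden values.
scores = evaluate(actual=[100, 200], fitted=[110, 180])
```

MAPE is undefined when an actual value is zero, so series containing zeros need a different index or an offset.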

Claims

1. A lung cancer disease burden risk early warning method based on ensemble learning is characterized in that: the method comprises the following steps:

2. The lung cancer disease burden risk early warning method based on ensemble learning according to claim 1, wherein: when the step 1 is executed, the data are integrated and cleaned, specifically, the abnormal data, the missing data, the repeated data and the inconsistent data are cleaned.

3. The lung cancer disease burden risk early warning method based on ensemble learning according to claim 2, wherein: when step 1 is executed, the missing data is filled by a mathematical statistical method such as the mean value method, the regression method or the multiple imputation method, variables with a missing proportion exceeding 10% are removed, and standard data is obtained after the data is integrated and cleaned through the steps of data analysis, definition of a cleaning strategy, data inspection, execution of data cleaning, data quality evaluation and clean data backflow.

4. The lung cancer disease burden risk early warning method based on ensemble learning according to claim 2, wherein: when step 1 is executed, the visual display of database data through charts specifically comprises: collecting as much data as possible; after data mining and cleaning, arranging the data from different sources into primary indexes such as disease burden, weather, air pollution, economy and other environmental data to construct a lung cancer disease burden risk early warning primary database; carrying out descriptive statistical analysis of the regional environmental pollution, weather characteristic and economic characteristic distributions through the mean, standard deviation, extremum and quartile; and calculating the compound annual growth rate of the disease burden.

5. The lung cancer disease burden risk early warning method based on ensemble learning according to claim 1, wherein: when executing the step 2, the screening of the prediction index specifically includes the following steps:

step 2-3-1: performing standardized transformation on the matrix X to obtain Z;

6. The lung cancer disease burden risk early warning method based on ensemble learning according to claim 1, wherein: when step 2 is executed, the hysteresis effect of each prediction index on the lung cancer disease burden is measured through gray correlation analysis; specifically, gray correlation analysis quantitatively compares the geometric shapes of the study variable sequence and the related factor sequences to judge the degree of correlation between each related factor and the study variable, and Deng's correlation degree is used to analyze the degree of influence and the hysteresis effect of each prediction index on lung cancer morbidity and mortality.

7. The lung cancer disease burden risk early warning method based on ensemble learning according to claim 1, wherein: when executing the step 3, the method specifically comprises the following steps: