CN116779172A - Lung cancer disease burden risk early warning method based on ensemble learning - Google Patents

Publication number
CN116779172A
Authority
CN
China
Prior art keywords
model
data
prediction
lung cancer
disease burden
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310786560.3A
Other languages
Chinese (zh)
Inventor
马倩倩
赵杰
谭中科
孙东旭
高景宏
卢耀恩
石金铭
陈保站
陈昊天
王振博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
First Affiliated Hospital of Zhengzhou University
Original Assignee
First Affiliated Hospital of Zhengzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by First Affiliated Hospital of Zhengzhou University filed Critical First Affiliated Hospital of Zhengzhou University
Priority to CN202310786560.3A priority Critical patent/CN116779172A/en
Publication of CN116779172A publication Critical patent/CN116779172A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/50 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for simulation or modelling of medical disorders
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Abstract

The invention discloses a lung cancer disease burden risk early warning method based on ensemble learning, belonging to the technical field of big data. The method comprises integrating and cleaning data, screening and dimension-reducing the prediction indexes, measuring hysteresis effects, establishing a prediction model pool, validating and optimizing the models, evaluating their prediction performance, and combining several models by Stacking integration. It solves the technical problem of providing more accurate reference data for predicting lung cancer disease burden.

Description

Lung cancer disease burden risk early warning method based on ensemble learning
Technical Field
The invention belongs to the technical field of big data, and particularly relates to a lung cancer disease burden risk early warning method based on ensemble learning.
Background
Some studies suggest that most cancers are attributable to environmental rather than genetic factors, that is, they are diseases caused by prolonged exposure to low doses of environmental carcinogens.
Numerous studies have demonstrated a significant relationship between air pollution and tumors, but the pollutants studied are mostly limited to PM2.5, PM10, SO2 and the like, while NH3, OC, BC, CO, NOx, NMVOC and similar pollutants are rarely covered. Meanwhile, a prediction model that integrates multidimensional characteristics such as environment, air pollution, economy and weather is lacking.
Environmental, economic and other factors act with different hysteresis (lag) effects, and lag analysis of the prediction indexes can greatly extend the out-of-sample prediction window of a model, yet current models do not take the hysteresis effect into account.
The prior art also lacks research that analyzes the relationship between air pollutants and lung cancer disease burden over longer time series.
ARIMA is a traditional time series model. It places relatively high demands on the data and needs a long continuous time series; if the series is too short the model is unreliable, and model identification and calculation are relatively complex. Current common methods cannot meet the growing demands of medical big data. Because different methods suit different data, a disease burden prediction method is proposed that suits various data distributions and integrates deep learning, machine learning, statistical regression and other models, so that high-dimensional time series data of different temporal granularity can be processed and prediction accuracy improved.
Disclosure of Invention
The invention aims to provide a lung cancer disease burden risk early warning method based on ensemble learning, which solves the technical problem of providing more accurate reference data for predicting lung cancer disease burden.
In order to achieve the above purpose, the invention adopts the following technical scheme:
a lung cancer disease burden risk early warning method based on ensemble learning comprises the following steps:
step 1: establishing a database server, wherein the database server acquires disease burden data, meteorological data, air pollution data, regional economic data and time characteristic data through the Internet, integrates and cleans the data to construct a lung cancer disease burden characteristic database, and visualizes the database data in charts to display the time series characteristics of the disease and of the features;
step 2: establishing a model server, wherein the model server acquires the integrated and cleaned data from the database server, reduces and screens the prediction indexes through information entropy and principal component analysis, and measures the hysteresis effect of each prediction index on lung cancer disease burden through gray correlation analysis;
constructing a prediction model pool on the training sequence, wherein the pool comprises a GAM model, an LSTM model, a GM (1, N) model, an ARIMA model, an XGBoost algorithm model, an RFR algorithm model, a BP neural network model and an AdaBoost algorithm model; validating each model in the pool, optimizing its parameters and iteratively updating it; evaluating the prediction performance of each model on the test set; and ranking the models by prediction performance;
step 3: establishing an integrated model server, wherein the integrated model server selects from the prediction model pool the 4 models whose prediction performance ranks in the top 4 as the first-layer base learners of Stacking ensemble learning; each base learner predicts on the validation set and the test set respectively, and the predictions form a new training set and a new test set that serve as input to the second-layer meta learner of Stacking; a linear regression model and a ridge regression model are taken as candidate meta learners, and the final integrated model is selected through prediction performance evaluation; relevant reference data are provided for s-step-ahead prediction based on the hysteresis effect indexes;
step 4: the integrated model server visualizes the result obtained in step 3.
Preferably, when step 1 is executed, integrating and cleaning the data specifically means cleaning abnormal data, missing data, duplicate data and inconsistent data.
Preferably, when step 1 is executed, missing data are filled by a statistical method such as mean imputation, regression imputation or multiple imputation, variables whose missing proportion exceeds 10% are removed, and standard data are obtained after the data are integrated and cleaned through the steps of data analysis, definition of a cleaning strategy, data inspection, execution of data cleaning, data quality evaluation and write-back of the clean data.
Preferably, when step 1 is executed, visualizing the database data in charts specifically comprises: collecting as much data as possible; after data mining and cleaning, arranging the data from different sources under primary indexes such as disease burden, weather, air pollution, economy and other environmental data to construct a lung cancer disease burden risk early warning primary database; carrying out descriptive statistical analysis of the regional environmental pollution, weather and economic characteristic distributions through the mean, standard deviation, extrema and quartiles; and calculating the compound annual growth rate of the disease burden.
Preferably, when executing step 2, the screening of the prediction index specifically includes the following steps:
step 2-1: obtaining initial indexes screened by importance through expert interviews (subjective analysis) and a review of the literature and theory;
step 2-2: screening the initial indexes based on information entropy: calculating the comparison information entropy between each initial index and the lung cancer disease burden, eliminating indexes with low relevance to the disease burden, and eliminating redundant indexes that are highly correlated with one another;
step 2-3: screening important indexes, or extracting principal components as new indexes, based on principal component analysis, specifically comprising the following steps:
step 2-3-1: constructing the index matrix
$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}$$
wherein $x_{np}$ represents the value of the p-th index for the n-th sample, and n and p respectively represent the number of rows and the number of columns of the matrix;
step 2-3-1: standardizing the matrix X to obtain Z;
step 2-3-2: calculating the correlation coefficient matrix of the standardized matrix Z, $R = \frac{Z^{T} Z}{n-1}$, wherein n represents the number of samples and T represents matrix transposition;
step 2-3-3: calculating the eigenvalues $\lambda_j$ of the correlation coefficient matrix R and the corresponding orthonormal eigenvectors $a_j$, and obtaining the principal component scores $F_i = a_{1i} x_1 + a_{2i} x_2 + \cdots + a_{pi} x_p$, wherein i is the principal component number and p is the total number of indexes;
step 2-3-4: calculating the factor loadings: the loading of index $x_j$ on principal component $F_i$ is $l(F_i, x_j) = \sqrt{\lambda_i}\, a_{ji}$, which reflects the degree of correlation between $F_i$ and $x_j$ and represents the importance of each variable in the principal component and its contribution to the result; important indexes are screened out by $|l(F_i, x_j)|$, wherein j is the index number and i is the principal component number;
step 2-3-5: when there are too many indexes, selecting k principal components as new indexes, the value of k being determined by the cumulative information contribution rate of the principal components reaching 80%, i.e. $\sum_{i=1}^{k} \lambda_i / \sum_{j=1}^{p} \lambda_j \ge 80\%$.
Preferably, when step 2 is executed, the hysteresis effect of each prediction index on lung cancer disease burden is measured through gray correlation analysis; specifically, gray correlation analysis quantitatively compares the geometric shapes of the study variable sequence and the related factor sequences to judge the degree of correlation between each related factor and the study variable, and Deng's correlation degree is used to analyze the degree of influence and the hysteresis effect of each prediction index on lung cancer morbidity and mortality.
Preferably, when executing step 3, the method specifically comprises the following steps:
step 3-1: the first layer of the Stacking integrated model: a GAM model, an LSTM model, a GM (1, N) model, an ARIMA model, an XGBoost algorithm model, an RFR algorithm model, a BP neural network model and an AdaBoost algorithm model form the prediction model pool, and the 4 regression algorithm models whose prediction performance ranks in the top 4 are selected from the pool as the first layer of Stacking;
step 3-2: each parameter-optimized base learner predicts on the validation set and the test set respectively; the predictions on the validation set are combined to form a new training set, and the predictions on the test set are combined by weighted averaging to form a new test set; both serve as input to the second layer of Stacking;
step 3-3: a meta learner is introduced in the second layer of the Stacking integrated model and is trained by regression with the first-layer predictions as training set and test set; a linear regression model and a ridge regression model are taken as candidate meta learners, and the final meta learner is selected through prediction performance evaluation;
step 3-4: based on the hysteresis effect index, relevant reference data is provided for the prediction of the s-step future period.
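The second-layer selection in steps 3-2 and 3-3 can be illustrated with a minimal sketch. It assumes the first-layer predictions are already available as matrices with one column per base learner; the function names (`fit_meta`, `choose_meta_learner`), the ridge penalty value and the use of RMSE as the performance measure are illustrative assumptions, not details taken from the method itself, and the weighted averaging of test-set predictions is omitted for brevity.

```python
import numpy as np

def fit_meta(P, y, ridge=0.0):
    """Fit a linear (ridge=0) or ridge meta learner with an unpenalised intercept."""
    A = np.column_stack([np.ones(len(P)), P])
    if ridge == 0.0:
        # Plain least squares; lstsq tolerates collinear base predictions.
        return np.linalg.lstsq(A, y, rcond=None)[0]
    penalty = ridge * np.eye(A.shape[1])
    penalty[0, 0] = 0.0  # do not shrink the intercept
    return np.linalg.solve(A.T @ A + penalty, A.T @ y)

def predict_meta(w, P):
    return w[0] + P @ w[1:]

def rmse(y, y_hat):
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def choose_meta_learner(P_val, y_val, P_test, y_test, ridge=1.0):
    """Train both candidates on validation-set predictions, compare on the test set."""
    results = {}
    for name, lam in {"linear": 0.0, "ridge": ridge}.items():
        w = fit_meta(P_val, y_val, lam)
        results[name] = rmse(y_test, predict_meta(w, P_test))
    best = min(results, key=results.get)
    return best, results
```

Here `P_val` and `P_test` would each hold the predictions of the 4 selected base learners, and the candidate with the lower test error is kept as the final meta learner.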
The lung cancer disease burden risk early warning method based on ensemble learning of the invention solves the technical problem of providing more accurate reference data for predicting lung cancer disease burden. The method fuses multi-source data to provide more comprehensive information and makes full use of the information in multiple prediction models: combining the results of several models yields a strong predictor that exploits the advantages of the different models, reduces the uncertainty and bias of any single model, improves prediction accuracy and stability, and can provide more accurate prediction reference data than a single prediction model. The models of the invention have different characteristics and capabilities in processing time series data; by combining them, data with various characteristics can be processed and different data characteristics and trends are considered more comprehensively, which improves accuracy. Analyzing the hysteresis effect of the different prediction indexes captures the temporal relationship between each index and the disease burden, so that the prediction model can be built more accurately, prediction accuracy is improved, and prediction reference data can be provided over a longer time range. The method adopts a rolling window technique to cross-validate the time series data when training the single prediction models and the meta learning model, which aids model parameter estimation.
Drawings
FIG. 1 is a diagram of a data architecture of the present invention;
FIG. 2 is a schematic diagram of a data cleansing flow according to the present invention;
FIG. 3 is a flow chart of index screening of the present invention;
FIG. 4 is a schematic diagram of an LSTM network architecture of the present invention;
FIG. 5 is a schematic diagram of the Stacking structure of the present invention.
Detailed Description
The lung cancer disease burden risk early warning method based on ensemble learning shown in fig. 1-5 comprises the following steps:
Step 1: establishing a database server, wherein the database server acquires disease burden data, meteorological data, air pollution data, regional economic data and time characteristic data through the Internet, integrates and cleans the data to construct a lung cancer disease burden characteristic database, and visualizes the database data in charts to display the time series characteristics of the disease and of the features;
Air pollution: primary particles (particulate matter PM10 and PM2.5, and carbonaceous species: black carbon BC and organic carbon OC), acidifying gases (nitrogen oxides NOx, sulfur dioxide SO2), ozone precursor gases (carbon monoxide CO, nitrogen oxides NOx, non-methane volatile organic compounds NMVOC), ammonia NH3, and the like.
Weather factors: average relative humidity, average air temperature, average rainfall, average barometric pressure.
Regional economic level: GDP, per capita income.
Time characteristic data: season, holiday, week data.
Other environmental pollution: water pollution data such as wastewater discharge, chemical oxygen demand and total ammonia nitrogen discharge, and data on the amount of general industrial solid waste generated.
Disease burden: including sex, number of lung cancer cases, morbidity, mortality, DALYs, and DALY rate.
When step 1 is executed, integrating and cleaning the data specifically means cleaning abnormal data, missing data, duplicate data and inconsistent data. Missing data are filled by a statistical method such as mean imputation, regression imputation or multiple imputation, and variables whose missing proportion exceeds 10% are removed. Standard data are obtained after the data are integrated and cleaned through the steps of data analysis, definition of a cleaning strategy, data inspection, execution of data cleaning, data quality evaluation and write-back of the clean data.
The data are then visualized in charts. Specifically, as much data as possible is collected; after data mining and cleaning, the data from different sources are arranged under primary indexes such as disease burden, weather, air pollution, economy and other environmental data, and a lung cancer disease burden risk early warning primary database is constructed. Descriptive statistical analysis of the regional environmental pollution, weather and economic characteristic distributions is carried out through the mean, standard deviation, extrema and quartiles, and the compound annual growth rate (CAGR) of the disease burden is calculated by the following formula:
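$$\mathrm{CAGR} = \left(\frac{y_n}{y_1}\right)^{\frac{1}{n-1}} - 1$$
wherein $y_1$ and $y_n$ represent the disease burden values in the first and last years of the sequence, and n represents the number of years in the disease burden sequence.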
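The cleaning rule and the growth rate described in step 1 can be sketched as follows. This is only an illustration under stated assumptions: the table layout (a dictionary of columns with `None` marking a missing value), the function names, and the use of mean imputation rather than regression or multiple imputation are all choices made for the example, and the growth rate uses the common form in which a series of n yearly values spans n-1 growth periods.

```python
from statistics import mean

def clean_columns(table, max_missing=0.10):
    """Drop columns whose missing share exceeds max_missing; mean-fill the rest."""
    cleaned = {}
    for name, values in table.items():
        missing = sum(v is None for v in values) / len(values)
        if missing > max_missing:
            continue  # variable removed: too many gaps to impute
        observed = [v for v in values if v is not None]
        fill = mean(observed)
        cleaned[name] = [fill if v is None else v for v in values]
    return cleaned

def cagr(series):
    """Compound annual growth rate of a yearly series (assumed n-1 periods)."""
    n = len(series)
    return (series[-1] / series[0]) ** (1 / (n - 1)) - 1
```

For example, a burden series of 100, 110 and 121 over three consecutive years grows by 10% per year under this definition.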
Step 2: establishing a model server, wherein the model server acquires the integrated and cleaned data from the database server, reduces and screens the prediction indexes through information entropy and principal component analysis, and measures the hysteresis effect of each prediction index on lung cancer disease burden through gray correlation analysis;
constructing a prediction model pool on the training sequence, wherein the pool comprises a GAM model, an LSTM model, a GM (1, N) model, an ARIMA model, an XGBoost algorithm model, an RFR algorithm model, a BP neural network model and an AdaBoost algorithm model; validating each model in the pool, optimizing its parameters and iteratively updating it; evaluating the prediction performance of each model on the test set; and ranking the models by prediction performance;
A dataset typically contains some unimportant or redundant indexes, which severely impair the predictive performance of a model. In addition, redundant indexes tend to be strongly correlated with one another, which causes multicollinearity problems in regression models. It is therefore desirable to select indexes that are highly correlated with lung cancer disease burden but not with each other. Removing indexes that are irrelevant to, or redundant for, lung cancer disease burden prediction does not cause information loss, but shortens model training time and reduces overfitting, thereby establishing a sound and effective prediction index system and improving model accuracy.
An initial index system is formed on the basis of disease burden risk factor analysis, and the final prediction index system is then formed by combining subjective and objective analysis. The screening of the prediction indexes specifically comprises the following steps:
step 2-1: obtaining initial indexes screened by importance through expert interviews (subjective analysis) and a review of the literature and theory;
step 2-2: screening the initial indexes based on information entropy: calculating the comparison information entropy between each initial index and the lung cancer disease burden, eliminating indexes with low relevance to the disease burden, and eliminating redundant indexes that are highly correlated with one another;
The information gain is calculated as $g(x, y) = H(x) - H(x \mid y)$, wherein $H(x)$ is the information entropy of index x and $H(x \mid y)$ is the conditional entropy of x given y.
The comparison information entropy $IR(x, y)$ is then calculated; it reflects the degree of correlation between two indexes, or between an index and the lung cancer disease burden.
The degree of correlation between each index and the lung cancer disease burden is calculated accordingly. If $IR(x_i, y) \le \eta_1$, the index is considered to have low correlation with the lung cancer disease burden and is eliminated, wherein $\eta_1$ represents the set information entropy threshold.
The degree of correlation between the remaining indexes is then calculated in the same way. If $IR(x_i, x_j) \ge \eta_2$, the two indexes are considered redundant, and the one less correlated with the lung cancer disease burden is eliminated, wherein $\eta_2$ represents the set information entropy threshold.
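The two-threshold screening above can be sketched in a few lines. The exact form of the comparison information entropy IR is not reproduced in this text, so the sketch assumes one common normalisation, $IR(x, y) = 2\,g(x, y) / (H(x) + H(y))$ (symmetric uncertainty); this choice, the function names and the requirement that all variables are already discretized are assumptions made for illustration only.

```python
from collections import Counter
from math import log2

def entropy(xs):
    """Shannon entropy H(x) of a discrete sequence."""
    n = len(xs)
    return -sum(c / n * log2(c / n) for c in Counter(xs).values())

def information_gain(xs, ys):
    """g(x, y) = H(x) - H(x|y), with H(x|y) = H(x, y) - H(y)."""
    return entropy(xs) - (entropy(list(zip(xs, ys))) - entropy(ys))

def entropy_ratio(xs, ys):
    """Assumed IR: information gain normalised by the two entropies."""
    denom = entropy(xs) + entropy(ys)
    return 0.0 if denom == 0 else 2 * information_gain(xs, ys) / denom

def screen(indexes, target, eta1, eta2):
    """Keep indexes with IR(x, y) > eta1, then drop the less relevant of redundant pairs."""
    relevance = {k: entropy_ratio(v, target) for k, v in indexes.items()}
    ranked = sorted(relevance, key=relevance.get, reverse=True)
    selected = []
    for k in ranked:
        if relevance[k] <= eta1:
            continue  # low correlation with the disease burden
        if all(entropy_ratio(indexes[k], indexes[s]) < eta2 for s in selected):
            selected.append(k)  # not redundant with a more relevant index
    return selected
```

Walking the indexes from most to least relevant ensures that, of any redundant pair, the index less correlated with the disease burden is the one removed.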
Step 2-3: screening important indexes, or extracting principal components as new indexes, based on principal component analysis, specifically comprising the following steps:
step 2-3-1: constructing the index matrix
$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}$$
wherein $x_{np}$ represents the value of the p-th index for the n-th sample, and n and p respectively represent the number of rows and the number of columns of the matrix;
step 2-3-1: standardizing the matrix X to obtain Z;
step 2-3-2: calculating the correlation coefficient matrix of the standardized matrix Z, $R = \frac{Z^{T} Z}{n-1}$, wherein n represents the number of samples and T represents matrix transposition;
step 2-3-3: calculating the eigenvalues $\lambda_j$ of the correlation coefficient matrix R and the corresponding orthonormal eigenvectors $a_j$, and obtaining the principal component scores $F_i = a_{1i} x_1 + a_{2i} x_2 + \cdots + a_{pi} x_p$, wherein i is the principal component number and p is the total number of indexes;
step 2-3-4: calculating the factor loadings: the loading of index $x_j$ on principal component $F_i$ is $l(F_i, x_j) = \sqrt{\lambda_i}\, a_{ji}$, which reflects the degree of correlation between $F_i$ and $x_j$ and indicates the importance of each variable in the principal component and its contribution to the result; important indexes are screened out by $|l(F_i, x_j)|$, wherein j is the index number and i is the principal component number;
step 2-3-5: when there are too many indexes, selecting k principal components as new indexes, the value of k being determined by the cumulative information contribution rate of the principal components reaching 80%, i.e. $\sum_{i=1}^{k} \lambda_i / \sum_{j=1}^{p} \lambda_j \ge 80\%$.
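The principal component steps above can be sketched with NumPy as follows. The sketch assumes the scores are computed on the standardized indexes, that the correlation matrix is taken as $Z^{T}Z/(n-1)$, and that the loading is $\sqrt{\lambda_i}\,a_{ji}$; the function name `pca_screen` and the `threshold` parameter are hypothetical names introduced for the example.

```python
import numpy as np

def pca_screen(X, threshold=0.80):
    """Return k, the first k principal component scores, and the factor loadings."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Standardize each index (column) to zero mean and unit sample variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Correlation coefficient matrix of the standardized data.
    R = Z.T @ Z / (n - 1)
    eigvals, eigvecs = np.linalg.eigh(R)          # ascending order
    order = np.argsort(eigvals)[::-1]             # sort descending
    eigvals = np.clip(eigvals[order], 0.0, None)  # guard tiny negative values
    eigvecs = eigvecs[:, order]
    # Smallest k whose cumulative contribution rate reaches the threshold.
    contribution = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(contribution, threshold) + 1)
    scores = Z @ eigvecs[:, :k]                       # F_1 ... F_k
    loadings = eigvecs[:, :k] * np.sqrt(eigvals[:k])  # l(F_i, x_j)
    return k, scores, loadings
```

Indexes whose absolute loading on the retained components is large can then be kept as important indexes, or the score columns can be used directly as new indexes.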
The hysteresis effect of each prediction index on the lung cancer disease burden is measured through gray correlation analysis. Specifically, gray correlation analysis quantitatively compares the degree of geometric similarity or dissimilarity between the study variable sequence and each related factor sequence to judge the degree of correlation between the related factor and the study variable, and Deng's correlation degree is used to analyze the degree of influence and the hysteresis effect of each prediction index on lung cancer morbidity and mortality.
In this embodiment, the hysteresis (lag) effect of air and other environmental pollution, weather factors and economic indexes on the disease is calculated through the following steps:
Step S1: taking the disease burden as the reference sequence $X_0 = (x_0(1), \ldots, x_0(k), \ldots, x_0(n))$;
Step S2: taking the environmental pollution, weather and other indexes at different lag periods as the comparison sequences $X_i = (x_i(1), \ldots, x_i(k), \ldots, x_i(n))$;
Step S3: calculating the correlation coefficient and the correlation degree between each index in the current period and the lung cancer disease burden. The correlation coefficient of the i-th comparison sequence $X_i$ with the disease burden reference sequence $X_0$ at point k is
$$\xi_i(k) = \frac{\min_i \min_k |x_0(k) - x_i(k)| + \varphi \max_i \max_k |x_0(k) - x_i(k)|}{|x_0(k) - x_i(k)| + \varphi \max_i \max_k |x_0(k) - x_i(k)|}$$
wherein the resolution coefficient $\varphi$ is 0.5;
Deng's gray correlation degree of the i-th comparison sequence $X_i$ with the disease burden reference sequence $X_0$ is
$$\gamma_i = \frac{1}{n} \sum_{k=1}^{n} \xi_i(k);$$
Step S4: calculating the correlation degree $\gamma_i(-t)$ between the sequence at each lag t and the lung cancer disease burden;
Step S5: if $\gamma_i(-t)$ is largest at a lag of T years, the hysteresis effect of index $X_i$ is T;
Step S6: repeating until the hysteresis effects of all indexes are obtained.
Step 3: establishing an integrated model server, wherein the integrated model server selects from the prediction model pool the 4 models whose prediction performance ranks in the top 4 as the first-layer base learners of Stacking ensemble learning; each base learner predicts on the validation set and the test set respectively, and the predictions form a new training set and a new test set that serve as input to the second-layer meta learner of Stacking; a linear regression model and a ridge regression model are taken as candidate meta learners, and the final integrated model is selected through prediction performance evaluation; relevant reference data are provided for s-step-ahead prediction based on the hysteresis effect indexes;
the method specifically comprises the following steps:
step 3-1: the first layer of the Stacking integrated model: a GAM model, an LSTM model, a GM (1, N) model, an ARIMA model, an XGBoost algorithm model, an RFR algorithm model, a BP neural network model and an AdaBoost algorithm model form the prediction model pool, and the 4 regression algorithm models whose prediction performance ranks in the top 4 are selected from the pool as the first layer of Stacking;
In this embodiment, a sliding window is used to divide the data into a training sequence, a validation sequence and a test sequence. GAM, LSTM, GM (1, N), ARIMA and other models are constructed on the training sequence; after validation, parameter optimization and iterative updating, they form the first layer of the Stacking integrated model.
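The sliding-window division can be sketched as a generator of index ranges. The window lengths, the step size and the function name `rolling_windows` are assumptions for illustration; the embodiment does not state the lengths it uses.

```python
def rolling_windows(n_obs, train_len, val_len, test_len, step=1):
    """Yield (train, validation, test) index ranges that slide forward in time."""
    width = train_len + val_len + test_len
    for start in range(0, n_obs - width + 1, step):
        val_start = start + train_len
        test_start = val_start + val_len
        yield (
            range(start, val_start),
            range(val_start, test_start),
            range(test_start, test_start + test_len),
        )
```

Because every validation and test range lies strictly after its training range, no future observation leaks into model fitting, which is the reason a rolling window is preferred over random splitting for time series.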
Generalized Additive Model (GAM)
GAM is an extension of the generalized linear model, originally proposed by Hastie and Tibshirani. It can evaluate both linear and nonlinear associations of environmental factors, time and other variables with health effects, and can control confounding caused by time-dependent variables (e.g., seasonal and long-term trends). GAM places few requirements on the sample and is widely applicable. Its expression is as follows:
$$Y = g(u) + \varepsilon;$$
$$g(u_i) = \beta_0 + f_1(x_1) + f_2(x_2) + \cdots + f_i(x_i) + \cdots + f_m(x_m);$$
wherein $f_i(x_i)$ is a smooth function of the prediction index $x_i$ and $g(u_i)$ is the link function. Because cancer morbidity and mortality follow a Poisson distribution, a Poisson regression model is adopted to establish the lung cancer disease burden and risk prediction model.
Long Short-Term Memory model (LSTM)
The long short-term memory model (LSTM) is an improved recurrent neural network (RNN). It handles long-term dependencies robustly and alleviates the vanishing-gradient and exploding-gradient problems, so that it predicts more accurately on longer sequences than an ordinary RNN.
Each unit has components such as an input gate, a forget gate and an output gate.
Parameter definitions: $x_t$ is the information input at time t; $c_{t-1}$ is the network memory state at time t-1; $h_{t-1}$ is the information output at time t-1, which is also an input at time t. $i_t$, $f_t$ and $o_t$ are the input gate, forget gate and output gate variables, respectively. $\sigma$ denotes the Sigmoid activation function and tanh denotes the tanh activation function; $\tilde{c}_t$ is the cell state update value at time t; $W_i$ represents an input weight; $U_i$ represents an output weight; $b_i$ represents a bias.
The forget gate decides, from the previous output and the new input, which information to remove from the features stored in the hidden layer.
$$f_t = \sigma(W_f \times (h_{t-1}, x_t) + b_f)$$
The input gate determines which new information to store in the unit and is used to update the cell state.
$$i_t = \sigma(W_i \times (h_{t-1}, x_t) + b_i)$$
The output gate outputs the state information at the current time and decides the value of the next hidden state.
$$h_t = \sigma(W_o \times (h_{t-1}, x_t) + b_o) \times \tanh(c_t)$$
Model training: the LSTM model is trained using the training data. During training, the input sequences are fed to the LSTM model so that it can learn the patterns and characteristics of the sequences, and the weights and biases of the model are adjusted using a loss function and an optimization algorithm to minimize the difference between the predicted output and the actual output.
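A single forward step of an LSTM unit can be sketched with NumPy to make the gate equations concrete. The sketch follows the standard LSTM formulation, in which the candidate state is $\tilde{c}_t = \tanh(W_c (h_{t-1}, x_t) + b_c)$ and the cell state is updated as $c_t = f_t c_{t-1} + i_t \tilde{c}_t$; these two equations, the dictionary layout of the weights and the function name `lstm_step` are assumptions added for the example, and training (loss and optimizer) is not shown.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step; each W[g] has shape (hidden, hidden + inputs)."""
    z = np.concatenate([h_prev, x_t])        # (h_{t-1}, x_t)
    f_t = sigmoid(W["f"] @ z + b["f"])       # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])       # input gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])   # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde       # updated cell state
    o_t = sigmoid(W["o"] @ z + b["o"])       # output gate
    h_t = o_t * np.tanh(c_t)                 # new hidden state
    return h_t, c_t
```

Applying `lstm_step` repeatedly along a sequence, carrying `h_t` and `c_t` forward, gives the hidden states from which a final regression layer would predict the disease burden.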
GM (1, N) model
The grey prediction model is suited to problems with small sample sizes and poor information. The GM(1, N) model is the basic model of the multi-variable grey system modeling method; it can analyse multiple factors as a whole and dynamically, and reflects the dynamic relationship between the study variable sequence and the related factor sequences. The model contains one study variable and N-1 influencing factor variables. The GM(1, N) time response function and its restoration by successive subtraction (inverse accumulation) are defined in terms of the following quantities:
$x_1^{(0)}$ is the original study variable sequence, $x_1^{(1)}$ is the new data series generated by first-order accumulation of the original sequence, $a$ is the system development coefficient, and $b_i$ is the driving coefficient of each related factor.
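The formula images for this model did not survive extraction. For orientation only, the form usually given for GM(1, N) in the grey-systems literature is reproduced below; the notation ($x_i^{(1)}$ for the accumulated sequence of the i-th related factor, $k$ for the time index) is an assumption and may differ from the patent's original formulas.

```latex
% Approximate time response function of GM(1, N) (standard textbook form, assumed)
\hat{x}_1^{(1)}(k+1) =
  \left[ x_1^{(0)}(1) - \frac{1}{a}\sum_{i=2}^{N} b_i\, x_i^{(1)}(k+1) \right] e^{-ak}
  + \frac{1}{a}\sum_{i=2}^{N} b_i\, x_i^{(1)}(k+1)

% Restoration by successive subtraction (inverse accumulation)
\hat{x}_1^{(0)}(k+1) = \hat{x}_1^{(1)}(k+1) - \hat{x}_1^{(1)}(k)
```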
ARIMA model
The stationarity of the sequence is judged from the time series plot and a stationarity test of the original data.
If the sequence is non-stationary, it is made stationary by differencing or data transformation, and the stationarity of the differenced sequence is confirmed by a stationarity test.
The model type is preliminarily identified from the autocorrelation function (ACF) plot and the partial autocorrelation function (PACF) plot, and the model orders are determined.
Depending on whether the original data sequence has a seasonal trend, the model is either seasonal ARIMA(p, d, q)(P, D, Q)S or non-seasonal ARIMA(p, d, q), where (p, d, q) and (P, D, Q) are the orders of the non-seasonal and seasonal autoregression (AR), differencing and moving average (MA) terms, respectively, and S is the seasonal period.
The optimal model is selected according to the Akaike information criterion (AIC) and the Bayesian information criterion (BIC).
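The differencing and ACF steps above can be illustrated with a short standard-library sketch. It shows only d-th order differencing and the sample autocorrelation function; the stationarity test, the PACF and AIC/BIC selection are left to a statistics package. The function names `difference` and `acf` are hypothetical.

```python
def difference(series, d=1, lag=1):
    """Apply d rounds of differencing at the given lag (lag=S gives seasonal differencing)."""
    out = list(series)
    for _ in range(d):
        out = [out[i] - out[i - lag] for i in range(lag, len(out))]
    return out


def acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(v * v for v in dev)
    if denom == 0:
        raise ValueError("a constant series has undefined autocorrelation")
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        for k in range(max_lag + 1)
    ]
```

For example, a quadratic trend such as 1, 4, 9, 16, 25 becomes constant after two rounds of differencing, which is the usual hint that d = 2 is sufficient.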
Other models
This embodiment also builds prediction models based on the XGBoost algorithm, the RFR algorithm, the BP neural network, AdaBoost and other algorithms.
In this embodiment, model parameters are tuned with a hyper-parameter optimization algorithm that combines grid search with cross-validation based on a rolling prediction origin; the rolling window technique ensures that enough base predictions are generated for model training. The hyper-parameter intervals to be tested are combined into a multi-dimensional space, which is divided into a grid according to the search step of each interval, so that each grid point corresponds to one set of parameter values. The model is tested once at each grid point to obtain the evaluation index for that hyper-parameter combination, and the combination with the best evaluation index is selected as the optimized hyper-parameters of the prediction model, thereby improving predictive performance.
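A minimal sketch of grid search with rolling-origin evaluation follows. It is not the embodiment's implementation: the forecaster is a simple exponential smoother chosen only to keep the example self-contained, mean absolute error is assumed as the evaluation index, and all function names are hypothetical.

```python
import itertools


def rolling_origin_splits(n, initial, horizon=1):
    """Yield (train_end, test_end): train on [0, train_end), test on [train_end, test_end)."""
    for train_end in range(initial, n - horizon + 1):
        yield train_end, train_end + horizon


def ses_forecast(train, horizon, alpha):
    """Simple exponential smoothing, used here only as a stand-in forecaster."""
    level = train[0]
    for y in train[1:]:
        level = alpha * y + (1 - alpha) * level
    return [level] * horizon


def grid_search(series, forecaster, grid, initial, horizon=1):
    """Score every grid point by mean absolute error over all rolling origins."""
    names = sorted(grid)
    best_params, best_score = None, float("inf")
    for values in itertools.product(*(grid[k] for k in names)):
        params = dict(zip(names, values))
        errors = []
        for train_end, test_end in rolling_origin_splits(len(series), initial, horizon):
            predicted = forecaster(series[:train_end], horizon, **params)
            actual = series[train_end:test_end]
            errors.extend(abs(p - a) for p, a in zip(predicted, actual))
        score = sum(errors) / len(errors)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Because each split trains only on observations that precede the test window, no future information leaks into the evaluation, which is the reason a rolling origin is preferred over shuffled folds for time series.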
Time granularity optimization targets models with poor prediction performance: time series at different time scales are selected for prediction, including fine-grained and coarse-grained prediction.
New data learning covers both the historical time series and supplementary newly available data: new real data are added dynamically so that the model is updated as it learns.
In this embodiment, the prediction performance of each model is evaluated as follows: each prediction model is tested on the test set, and its performance is evaluated with indexes such as MER, MAPE, MAE and RMSE; the smaller the index value, the higher the model accuracy.
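Mean error rate (MER), the mean absolute error divided by the mean actual value:
$\mathrm{MER} = \dfrac{\frac{1}{n}\sum_{i=1}^{n} |\hat{y}_i - y_i|}{\frac{1}{n}\sum_{i=1}^{n} y_i}$
Mean absolute percentage error (MAPE); a MAPE below 10%-15% indicates good prediction accuracy:
$\mathrm{MAPE} = \dfrac{100\%}{n}\sum_{i=1}^{n} \left| \dfrac{\hat{y}_i - y_i}{y_i} \right|$
Mean absolute error (MAE):
$\mathrm{MAE} = \dfrac{1}{n}\sum_{i=1}^{n} |\hat{y}_i - y_i|$
Root mean squared error (RMSE), the square root of the mean squared error between the actual and predicted values:
$\mathrm{RMSE} = \sqrt{\dfrac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}$
where $\hat{y}_i$ and $y_i$ denote the fitted value and the actual value, respectively, and $n$ is the number of samples.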
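The four evaluation indexes can be computed as in the sketch below. MAPE, MAE and RMSE follow their usual definitions; MER is read here as mean absolute error divided by the mean actual value, which is an interpretation of the one-line definition in the text. The function names are hypothetical.

```python
import math


def mae(actual, fitted):
    """Mean absolute error."""
    return sum(abs(f - a) for a, f in zip(actual, fitted)) / len(actual)


def mer(actual, fitted):
    """Mean error rate, assumed to be MAE divided by the mean actual value."""
    return mae(actual, fitted) / (sum(actual) / len(actual))


def mape(actual, fitted):
    """Mean absolute percentage error, in percent; actual values must be non-zero."""
    return 100.0 * sum(abs((f - a) / a) for a, f in zip(actual, fitted)) / len(actual)


def rmse(actual, fitted):
    """Root mean squared error."""
    return math.sqrt(sum((f - a) ** 2 for a, f in zip(actual, fitted)) / len(actual))
```

With actual values 100 and 200 and fitted values 110 and 180, the absolute errors are 10 and 20, so MAE is 15, MER is 15/150 = 0.1 and MAPE is 10%.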
Step 3-2: each parameter-optimized predictor is fitted on the validation set and on the prediction (test) set; the validation-set predictions are combined to form a new training set, and the test-set predictions are combined by weighted average to form a new test set; together these serve as the input to the second Stacking layer;
Step 3-3: a meta-learner is introduced in the second layer of the Stacking ensemble model; regression training is carried out with the predictions of the previous layer as the training set and test set, a linear regression model and a ridge regression model are taken as candidate meta-learners, and the better one is selected as the final meta-learner by evaluating prediction performance;
Step 3-4: based on the lag (hysteresis) effect indexes, relevant reference data are provided for prediction s steps into the future.
Step 4: the integrated model server visually displays the results obtained in Step 3.
The lung cancer disease burden risk early warning method based on ensemble learning of the present invention solves the technical problem of providing more accurate reference data for predicting lung cancer disease burden. The method fuses multi-source data to provide more comprehensive information and combines the results of several prediction models into one strong predictor. This exploits the strengths of the different models, reduces the uncertainty and bias of any single model, improves prediction accuracy and stability, and provides more accurate prediction reference data than a single prediction model. The models used in the invention have different characteristics and capabilities in processing time series data; by combining them, data with varied characteristics can be processed and different data features and trends are considered more comprehensively, which improves accuracy. By analysing the lag (hysteresis) effect of the different prediction indexes, the temporal relationship between each index and the disease burden can be captured; taking this lag effect into account allows the prediction model to be built more accurately, improves prediction accuracy, and allows prediction reference data to be provided over a longer time horizon. The method also uses a rolling window technique to cross-validate time series data during training of the individual prediction models and the meta-learning model, which helps model parameter estimation.

Claims (7)

1. A lung cancer disease burden risk early warning method based on ensemble learning, characterized in that the method comprises the following steps:
Step 1: establishing a database server, wherein the database server acquires disease burden data, meteorological data, air pollution data, regional economic data and time characteristic data through the Internet, integrates and cleans the data to construct a lung cancer disease burden characteristic database, and visualizes the database data in charts to show the time-series characteristics of the disease and of each feature;
Step 2: establishing a model server, wherein the model server acquires the integrated and cleaned data from the database server, reduces and screens the prediction indexes using information entropy and principal component analysis, and measures the lag (hysteresis) effect of each prediction index on the lung cancer disease burden through grey relational analysis;
a prediction model pool is constructed on the training sequence, the pool comprising a GAM model, an LSTM model, a GM(1, N) model, an ARIMA model, an XGBoost algorithm model, an RFR algorithm model, a BP neural network model and an AdaBoost algorithm model; each model in the pool is validated, its parameters are optimized, and the model is updated iteratively; the predictive performance of each model is evaluated on the test set, and the models are ranked by predictive performance;
Step 3: establishing an integrated model server, wherein the integrated model server selects the 4 models with the best predictive performance from the prediction model pool as the first-layer base learners of Stacking ensemble learning; each predictor is fitted on the validation set and the prediction set respectively to form a new training set and a new test set, which serve as the input to the meta-learner of the second Stacking layer; a linear regression model and a ridge regression model are taken as candidate meta-learners, and the final integrated model is selected by evaluating predictive performance; based on the lag (hysteresis) effect indexes, relevant reference data are provided for prediction s steps into the future;
Step 4: the integrated model server visually displays the results obtained in Step 3.
2. The lung cancer disease burden risk early warning method based on ensemble learning according to claim 1, wherein: when Step 1 is executed, integrating and cleaning the data specifically comprises cleaning abnormal data, missing data, duplicate data and inconsistent data.
3. The lung cancer disease burden risk early warning method based on ensemble learning according to claim 2, wherein: when Step 1 is executed, missing data are filled using a mathematical-statistical method such as the mean method, the regression method or multiple imputation; variables with a missing proportion exceeding 10% are removed; and the data are integrated and cleaned through the steps of data analysis, definition of a cleaning strategy, data inspection, execution of data cleaning, data quality evaluation and clean data backflow to obtain standard data.
4. The lung cancer disease burden risk early warning method based on ensemble learning according to claim 2, wherein: when Step 1 is executed, visualizing the database data in charts specifically comprises: collecting as much data as possible; after data mining and cleaning, organizing the data from different sources into primary indexes such as disease burden, weather, air pollution, economy and other environmental data to construct a primary lung cancer disease burden risk early warning database; carrying out descriptive statistical analysis of the regional distribution of environmental pollution, weather characteristics and economic characteristics using means, standard deviations, extreme values and quartiles; and calculating the compound annual growth rate of the disease burden.
5. The lung cancer disease burden risk early warning method based on ensemble learning according to claim 1, wherein: when Step 2 is executed, the screening of the prediction indexes specifically comprises the following steps:
Step 2-1: obtaining initial indexes screened by importance through subjective expert interviews and a review of the literature and theory;
Step 2-2: screening the initial indexes based on information entropy: calculating the information entropy of each initial index relative to the lung cancer disease burden, eliminating indexes with low relevance to the disease burden, and eliminating redundant indexes that are highly correlated with one another;
Step 2-3: screening important indexes, or extracting principal components as new indexes, based on principal component analysis, specifically comprising the following steps:
step 2-3-1: constructing the index matrix $X = (x_{ij})_{n \times p}$, where $x_{np}$ denotes the p-th index value of the n-th sample, and n and p denote the row number and column number of the index in the matrix;
step 2-3-1: applying a standardizing transformation to the matrix X to obtain Z;
step 2-3-2: calculating the correlation coefficient matrix R of the standardized matrix Z, where m denotes the number of samples and T denotes matrix transposition;
step 2-3-3: calculating the eigenvalues $\lambda_j$ of the correlation coefficient matrix R and the corresponding orthonormal unit eigenvectors $a_j$;
obtaining the principal component scores $F_i = a_{1i}x_1 + a_{2i}x_2 + \dots + a_{pi}x_p$, where i is the principal component number and p is the total number of indexes;
step 2-3-4: calculating the factor loadings: the loading of index $x_j$ on principal component $F_i$ is $l(F_i, x_j) = \sqrt{\lambda_i}\,a_{ji}$, which reflects the degree of correlation between $F_i$ and $x_j$ and indicates the importance of each variable in the principal component and its contribution to the result; important indexes are screened by the magnitude $|l(F_i, x_j)|$, where j is the index number and i is the principal component number;
step 2-3-5: when there are too many indexes, selecting k principal components as new indexes, where k is the number of components at which the cumulative information contribution rate of the principal components reaches 80%.
6. The lung cancer disease burden risk early warning method based on ensemble learning according to claim 1, wherein: when Step 2 is executed, the lag (hysteresis) effect of each prediction index on the lung cancer disease burden is measured through grey relational analysis; the grey relational analysis specifically comprises quantitatively comparing the geometric shapes of the study variable sequence and the related factor sequences to judge the degree of association between each related factor and the study variable, and analysing the degree of influence and the lag effect of each prediction index on lung cancer incidence and mortality through Deng's grey relational degree.
7. The lung cancer disease burden risk early warning method based on ensemble learning according to claim 1, wherein: when Step 3 is executed, the method specifically comprises the following steps:
Step 3-1: the first layer of the Stacking ensemble model draws on a prediction model pool formed by a GAM model, an LSTM model, a GM(1, N) model, an ARIMA model, an XGBoost algorithm model, an RFR algorithm model, a BP neural network model and an AdaBoost algorithm model, and the 4 regression algorithm models with the best predictive performance are selected from the pool as the first Stacking layer;
Step 3-2: each parameter-optimized predictor is fitted on the validation set and on the prediction (test) set; the validation-set predictions are combined to form a new training set, and the test-set predictions are combined by weighted average to form a new test set; together these serve as the input to the second Stacking layer;
Step 3-3: a meta-learner is introduced in the second layer of the Stacking ensemble model; regression training is carried out with the predictions of the previous layer as the training set and test set, a linear regression model and a ridge regression model are taken as candidate meta-learners, and the better one is selected as the final meta-learner by evaluating prediction performance;
Step 3-4: based on the lag (hysteresis) effect indexes, relevant reference data are provided for prediction s steps into the future.
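The grey relational lag analysis of claim 6 can be illustrated with the sketch below, which computes Deng's grey relational degree between the study sequence and a factor sequence shifted by each candidate lag. The mean-value normalisation, the distinguishing coefficient rho = 0.5 and the pooling of differences across lags are common conventions assumed here, not details taken from the claims; the function names are hypothetical, and the sequences are assumed to be positive.

```python
def mean_normalise(seq):
    """Divide by the mean so sequences in different units become comparable."""
    m = sum(seq) / len(seq)
    return [v / m for v in seq]


def lagged_relational_degrees(study, factor, max_lag, rho=0.5):
    """Deng's grey relational degree between study[t] and factor[t - lag], per lag."""
    deltas = {}
    for lag in range(max_lag + 1):
        y = mean_normalise(study[lag:])
        x = mean_normalise(factor[:len(factor) - lag])
        deltas[lag] = [abs(a - b) for a, b in zip(y, x)]
    pooled = [d for ds in deltas.values() for d in ds]
    d_min, d_max = min(pooled), max(pooled)
    if d_max == 0:
        return {lag: 1.0 for lag in deltas}
    return {
        lag: sum((d_min + rho * d_max) / (d + rho * d_max) for d in ds) / len(ds)
        for lag, ds in deltas.items()
    }
```

The lag with the highest degree is the one at which the factor's shape most closely tracks the study variable, for example `max(degrees, key=degrees.get)` on the returned dictionary.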
CN202310786560.3A 2023-06-30 2023-06-30 Lung cancer disease burden risk early warning method based on ensemble learning Pending CN116779172A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310786560.3A CN116779172A (en) 2023-06-30 2023-06-30 Lung cancer disease burden risk early warning method based on ensemble learning

Publications (1)

Publication Number Publication Date
CN116779172A true CN116779172A (en) 2023-09-19

Family

ID=88007871

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310786560.3A Pending CN116779172A (en) 2023-06-30 2023-06-30 Lung cancer disease burden risk early warning method based on ensemble learning

Country Status (1)

Country Link
CN (1) CN116779172A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117094184A (en) * 2023-10-19 2023-11-21 上海数字治理研究院有限公司 Modeling method, system and medium of risk prediction model based on intranet platform
CN117094184B (en) * 2023-10-19 2024-01-26 上海数字治理研究院有限公司 Modeling method, system and medium of risk prediction model based on intranet platform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination