CN117334334B - Health risk prediction method, device, equipment and medium - Google Patents

Info

Publication number
CN117334334B
Authority
CN
China
Prior art keywords
model
prediction
data
predicted
lstm
Prior art date
Legal status
Active
Application number
CN202311272180.4A
Other languages
Chinese (zh)
Other versions
CN117334334A (en)
Inventor
冯思玲
汤乐
黄梦醒
王冠军
冯文龙
毋媛媛
Current Assignee
Hainan University
Original Assignee
Hainan University
Priority date
Filing date
Publication date
Application filed by Hainan University
Priority to CN202311272180.4A
Publication of CN117334334A
Application granted
Publication of CN117334334B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217 Validation; Performance evaluation; Active pattern learning techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/27 Regression, e.g. linear or logistic regression
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Primary Health Care (AREA)
  • Epidemiology (AREA)
  • Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a health risk prediction method, device, equipment and medium. The method comprises: acquiring current environmental data; inputting the current environmental data into a prediction model, the prediction model comprising a VARLST hybrid model and an AdaBoost model, wherein the VARLST model comprises a VARMA model and a BI-LSTM model, and the prediction result of the VARLST model and the prediction result of the AdaBoost model are fused by weighting to obtain a final predicted value; calculating a relative risk value of the predicted health risk based on the final predicted value under the current environmental data and the numbers of diseased and non-diseased persons under the risk reference value in a past time period; and issuing a health risk early warning based on the relative risk value. With the invention, when an environmental quality index becomes abnormal, the relative risk value of the health risk can be predicted and calculated quickly, so that preventive measures can be prepared in advance.

Description

Health risk prediction method, device, equipment and medium
Technical Field
The present invention relates to the field of health risk prediction technologies, and in particular, to a health risk prediction method, apparatus, device, and medium.
Background
In recent years, as urban areas have continued to develop, urban population density has kept rising, and densely populated urban areas have become high-risk areas for pollutant spread and mass infection. When pollutant levels near a city become abnormal, the absence of effective predictive warning, together with public neglect of the effect of environmental factors on physical health, can cause irreversible harm to the health of local residents. Smart cities are now increasingly widespread, and many existing technologies can integrate a city's service resources to provide more accurate and more convenient services. A model is therefore needed that combines the analysis of environmental data with medical data to predict the effect of environmental changes on physical health.
Algorithms that predict health risk from multiple time-series parameters generally fall into three types: statistical models built on mathematical modeling principles for environmental time-series prediction, such as autoregressive moving average, vector autoregressive and ARIMA models; machine learning models such as AdaBoost and random forests; and deep learning models suited to sequence data, such as long short-term memory (LSTM) networks. All of these can handle multiple time-series predictions to some extent. However, in multi-parameter health risk prediction the existing techniques suffer from low generalization ability, poor fitting of nonlinear relationships and high computational complexity, so complex environmental data is difficult to model and many algorithms are difficult to put into practical use. Prediction algorithms based on deep learning have clear advantages in time-series prediction tasks because of their network structure, but they place high demands on data: when applied to multivariate data the network needs a large number of samples, otherwise prediction accuracy drops quickly; sufficient training data and computing resources are required; and the prediction results show an obvious lag.
Disclosure of Invention
To solve the above technical problems, the invention provides a health risk prediction method, device, equipment and medium that, when an environmental quality index becomes abnormal, quickly predict and calculate the relative risk value of the health risk so that preventive measures can be prepared in advance.
In order to achieve the above purpose, the technical scheme of the invention is as follows:
a health risk prediction method comprising the steps of:
Acquiring current environment data;
Inputting the current environmental data into a prediction model, wherein the prediction model comprises a VARLST mixed model and an AdaBoost model, the VARLST model comprises a VARMA model and a BI-LSTM model, the VARMA model is used for multi-dimensional data prediction to obtain a prediction result of the VARMA prediction model, comparing the prediction result with an actual value to obtain a fitting residual, taking the fitting residual as an input of the BI-LSTM model to obtain a prediction result of the BI-LSTM model, and superposing the prediction result of the VARMA prediction model and the prediction result of the BI-LSTM model to obtain a prediction result of the VARLST model; carrying out weighted fusion on the predicted result of the VARLST model and the predicted result of the AdaBoost model to obtain a final predicted value;
calculating a relative risk value for predicting health risk based on the final predicted value predicted under the current environmental data and the disease number and the non-disease number under the risk reference standard value in the past time period; and carrying out health risk early warning based on the relative danger value.
Preferably, the training process of the VARLST model is as follows:
acquiring historical environmental data and medical visit statistics, preprocessing them, and dividing the preprocessed data set into a training set and a test set according to a preset proportion;
making the data set stationary by differencing, analyzing the autocorrelation function and partial autocorrelation function of the differenced data, and estimating the parameters of the VARMA prediction model and the order of differencing based on the analysis results, wherein the VARMA prediction model is as follows:
$$Y_t = A\theta_t + B_1 Y_{t-1} + \cdots + B_p Y_{t-p} + C x_t + \varepsilon_t, \quad t = 1, 2, \ldots, T$$
wherein $Y_t$ collects the values $y_{n,t}$, the number of patient visits to the n-th department on day t; $y_{n,t-p}$ is the p-th order lag of that number; $\theta_{n,t}$ is the time-series component structure of the number of patient visits to the n-th department on day t; $A$ is the coefficient of the component structure; $B_p$ is the coefficient of the p-th order lag; $C$ is the coefficient of the exogenous variables; $x_{m,t}$ is the m-th environmental index on day t; and $\varepsilon_t$ is the residual term;
inputting the training set into the VARLST model for training, wherein the VARLST model comprises a VARMA model and a BI-LSTM model, the VARMA model is used for multidimensional data prediction, and the prediction result of the VARMA prediction model is compared with the actual values to obtain the fitting residual;
inputting the fitting residual into the BI-LSTM model for training until the preset requirement or the number of iterations is reached, and outputting the trained BI-LSTM model, wherein in each iteration the BI-LSTM model is back-propagated, the error of the output layer at each time step and the parameter derivatives of the forward LSTM and the backward LSTM are calculated from the back-propagated output, and the network parameters of the BI-LSTM model are updated once the parameter derivatives are obtained;
inputting the test set into the trained BI-LSTM model to obtain the prediction result of the BI-LSTM model;
and superimposing the prediction result of the VARMA prediction model and the prediction result of the BI-LSTM model to obtain the prediction result of the VARLST model.
Preferably, the training process of the AdaBoost model is as follows:
training a base learner on the data set under the initialized weight distribution, and optimizing the base learner;
calculating the prediction error, the proportional error and the connection weight of each training sample in the t-th iteration;
adjusting the weight distribution of the training samples for the (t+1)-th iteration, and continuing training until the maximum number of iterations is reached, so as to obtain K models;
and adding the prediction results of the K models to obtain the prediction result of the AdaBoost model.
Preferably, the historical environmental data includes air quality data, urban drinking water quality data, and total emission data for waste gas, waste water and other discharges.
Preferably, the historical environmental data and the medical treatment statistical data are acquired and preprocessed, and the method comprises the following steps:
Data cleaning is carried out on the environmental data and the medical treatment statistical data, and the data cleaning comprises the steps of processing missing values, abnormal values and repeated values;
Carrying out correlation analysis on the cleaned data, and calculating correlation coefficients among all the features through the Pearson correlation coefficients to finally obtain a correlation coefficient matrix C;
and according to the correlation coefficient matrix C, selecting an environmental index with higher correlation with medical treatment statistical data as an input variable of the model.
Preferably, MAE, MAPE and RMSE are selected as evaluation indices to evaluate the performance of the prediction model.
Preferably, the relative risk value RR of the predicted health risk is calculated as follows:
$$RR = \frac{IE / (IE + IN)}{CE / (CE + CN)}$$
wherein IE is the predicted number of diseased persons, IN is the predicted number of non-diseased persons, CE is the number of diseased persons under the risk reference value, and CN is the number of non-diseased persons under the risk reference value.
Based on the above, the invention also discloses a health risk prediction device, comprising: a data acquisition module, a mixed multivariate time series prediction module and a health risk prediction module, wherein,
The data acquisition module is used for acquiring current environment data;
The mixed multi-element time sequence prediction module is used for inputting the current environment data into a prediction model, the prediction model comprises a VARLST mixed model and an AdaBoost model, wherein the VARLST model comprises a VARMA model and a BI-LSTM model, the VARMA model is used for multi-dimensional data prediction, a prediction result of the VARMA prediction model is obtained, the prediction result is compared with an actual value to obtain a fitting residual, the fitting residual is used as input of the BI-LSTM model, the prediction result of the BI-LSTM model is obtained, the prediction result of the VARMA prediction model is overlapped with the prediction result of the BI-LSTM model, and the prediction result of the VARLST model is obtained; carrying out weighted fusion on the predicted result of the VARLST model and the predicted result of the AdaBoost model to obtain a final predicted value;
the health risk prediction module is used for calculating a relative risk value of predicted health risk based on a final predicted value predicted by current environmental data and the number of sick people and non-sick people of the disease under a risk reference standard value in a past time period; and carrying out health risk early warning based on the relative danger value.
Based on the above, the present invention also discloses a computer device, including: a memory for storing a computer program; a processor for implementing a method as claimed in any one of the preceding claims when executing the computer program.
Based on the foregoing, the present invention also discloses a readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method as described in any of the above.
Based on the above technical scheme, the beneficial effects of the invention are as follows. The invention first acquires data for each air quality index, urban drinking water quality data, total emission data for waste gas, waste water and other discharges, and statistics on the number of patient visits, and performs data cleaning, correlation analysis and feature selection so that the data is more complete and accurate. The selected feature data is then used to train the VARLST model and the AdaBoost model separately. On one side, after ACF and PACF analysis of the preprocessed data, the parameters and differencing order of the VARMA prediction model are estimated and a multidimensional data prediction network is built; its predictions are compared with the actual values, the resulting fitting residuals are used to train a BI-LSTM network that outputs predicted residuals, and the two results are combined to give the corrected prediction of the VARLST model. On the other side, the preprocessed data goes through initial weight assignment, iterative training of weak regressors, adjustment of the weak regressor weights and combined prediction to give the final prediction of the AdaBoost model. The predictions of the two models are fused by averaging to obtain the final prediction. Finally, the predicted numbers of diseased and non-diseased persons are compared with the risk reference value to obtain the relative risk value, from which the severity of the health risk is judged, achieving the aim of predicting health risk in advance.
By analyzing data correlation with the Pearson correlation coefficient, the invention addresses the problem that prediction requires a large amount of data, improves prediction efficiency, and strengthens the link between environmental factors and the number of patient visits. The VARLST model combines the advantages of deep learning with those of traditional statistical prediction algorithms, giving the prediction model higher execution efficiency and generalization ability and improving the accuracy and practicality of the health risk prediction model. Fusing the AdaBoost model brings in its ensemble-learning strengths, automatically selecting important features and improving the accuracy and stability of the prediction.
Drawings
FIG. 1 is a diagram of an application environment for a health risk prediction method in one embodiment;
FIG. 2 is a flow diagram of a method of health risk prediction in one embodiment;
FIG. 3 is a schematic diagram of a hybrid VARLST-AdaBoost model architecture in a health risk prediction method according to one embodiment;
FIG. 4 is a schematic diagram of a health risk prediction device in one embodiment;
FIG. 5 is an internal block diagram of a computer device in one embodiment.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
The health risk prediction method provided by the embodiment of the application can be applied to the application environment shown in FIG. 1. As shown in FIG. 1, the application environment includes a computer device 110. The computer device 110 may obtain current environmental data and input it into a prediction model (the VARLST-AdaBoost hybrid model), which comprises a VARLST hybrid model and an AdaBoost model. The VARLST model comprises a VARMA model and a BI-LSTM model: the VARMA model performs multidimensional data prediction, its prediction result is compared with the actual values to obtain the fitting residual, the fitting residual is fed to the BI-LSTM model to obtain the prediction result of the BI-LSTM model, and the prediction results of the VARMA model and the BI-LSTM model are superimposed to give the prediction result of the VARLST model. The prediction result of the VARLST model and the prediction result of the AdaBoost model are then fused by weighting to obtain the final predicted value. The computer device 110 may calculate the relative risk value of the predicted health risk based on the final predicted value under the current environmental data and the numbers of diseased and non-diseased persons under the risk reference value over a past period of time, and issue a health risk early warning based on the relative risk value. The computer device 110 may be, but is not limited to, a personal computer, notebook computer, robot, tablet computer or the like.
In one embodiment, as shown in fig. 2 and 3, a health risk prediction method is provided, which includes the following steps:
Step 1: Acquire historical data, including air quality data such as AQI, PM2.5, PM10, CO, NO2, SO2 and O3, urban drinking water quality data, total waste gas and waste water emission data, and patient visit statistics.
Step 2: Preprocess all the acquired historical data. This specifically comprises the following steps:
Step 2-1: Clean all the acquired historical data, including handling missing values, outliers and duplicate values, to ensure the integrity and accuracy of the data;
Step 2-2: Perform correlation analysis on the cleaned data, computing the Pearson correlation coefficient between every pair of features to obtain a correlation coefficient matrix C. A coefficient close to 1 or -1 indicates a strong linear correlation between the two features, and a coefficient close to 0 indicates no linear correlation;
Step 2-3: According to the correlation coefficient matrix C, select the environmental indices that correlate most strongly with the patient visit statistics as input variables of the model. The magnitude and significance of the correlation coefficients determine which environmental indices can serve as input variables for a specific disease; choosing index dimensions that are highly and significantly correlated with the disease improves the predictive performance of the model more effectively.
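The feature selection in Steps 2-2 and 2-3 can be illustrated with a minimal sketch. The cut-off `threshold=0.6`, the function names and the toy numbers are assumptions made for illustration; the text only says that indices with "higher correlation" are kept, and the significance test it mentions is not shown here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def select_features(features, target, threshold=0.6):
    """Keep indicators whose |r| with the target reaches the (assumed) threshold."""
    scores = {name: pearson(column, target) for name, column in features.items()}
    return {name: r for name, r in scores.items() if abs(r) >= threshold}


# Made-up daily values, for illustration only.
visits = [120, 135, 150, 170, 160, 190]
indicators = {
    "PM2.5": [35, 42, 55, 70, 64, 88],
    "O3": [80, 60, 95, 70, 85, 75],
}
selected = select_features(indicators, visits)
print(sorted(selected))
```

In this toy data PM2.5 tracks the visit counts closely while O3 does not, so only PM2.5 survives the filter.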
Step 3: Taking the more highly correlated environmental index data and the number of patient visits as input data, determine the autocorrelation and partial autocorrelation of the time series with the autocorrelation function (ACF) and the partial autocorrelation function (PACF), and build the VARLST hybrid model on the basis of that analysis to obtain a prediction result. This specifically comprises the following steps:
Step 3-1: The multi-parameter time series of environmental indices and patient visit counts are non-stationary, so the series must first be made stationary by differencing to satisfy the conditions for applying the prediction model; the predictions are afterwards restored to the original scale by inverting the differencing;
Step 3-2: Analyze the autocorrelation function (ACF) and partial autocorrelation function (PACF) of the differenced data to assess the autocorrelation and partial autocorrelation of the time series and to better understand the characteristics and trend of the data;
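As a rough illustration of Steps 3-1 and 3-2, the sketch below differences a series, inverts the differencing, and computes a sample autocorrelation function. The helper names and the toy series are assumptions; the partial autocorrelation function is omitted for brevity, and a statistics library would normally be used for both.

```python
def difference(series, order=1):
    """Apply first differencing `order` times."""
    out = list(series)
    for _ in range(order):
        out = [later - earlier for earlier, later in zip(out, out[1:])]
    return out


def undifference(diffed, first_value):
    """Invert one round of differencing, given the first original value."""
    out = [first_value]
    for step in diffed:
        out.append(out[-1] + step)
    return out


def acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((v - mean) ** 2 for v in series)
    return [
        sum((series[t] - mean) * (series[t - k] - mean) for t in range(k, n)) / denom
        for k in range(max_lag + 1)
    ]


# Made-up daily visit counts with an upward trend.
visits = [100, 104, 109, 115, 118, 125, 131, 134]
d1 = difference(visits)                  # trend removed
restored = undifference(d1, visits[0])   # back to the original scale
print(d1, [round(r, 3) for r in acf(d1, 2)])
```

Keeping the first original value is what makes the restoration possible; for higher differencing orders one such value is needed per round.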
Step 3-3: Based on the autocorrelation and partial autocorrelation functions of the multivariate environmental time series, estimate the parameters of the vector autoregressive moving average (VARMA) prediction model and the order of differencing, ensuring the practicality of the prediction model and the stationarity of the time series. The VARMA model takes the form of a system of multiple equations; it is generally used to predict systems of interrelated time series and to analyze the dynamic effect of random disturbances on the variable system. Assume that the optimal order is p, that the sample data set covers T days in total, that the number of patients for each disease is an endogenous variable and that the environmental index data is an exogenous variable. The vector autoregressive model is then constructed as follows:
$$\begin{bmatrix} y_{1,t} \\ \vdots \\ y_{n,t} \end{bmatrix} = A \begin{bmatrix} \theta_{1,t} \\ \vdots \\ \theta_{n,t} \end{bmatrix} + B_1 \begin{bmatrix} y_{1,t-1} \\ \vdots \\ y_{n,t-1} \end{bmatrix} + \cdots + B_p \begin{bmatrix} y_{1,t-p} \\ \vdots \\ y_{n,t-p} \end{bmatrix} + C \begin{bmatrix} x_{1,t} \\ \vdots \\ x_{m,t} \end{bmatrix} + \begin{bmatrix} \varepsilon_{1,t} \\ \vdots \\ \varepsilon_{n,t} \end{bmatrix} \quad (1)$$
where $y_{n,t}$ is the number of patient visits to the n-th department on day t; $y_{n,t-p}$ is the p-th order lag of that number; $\theta_{n,t}$ is the time-series component structure of the number of patient visits to the n-th department on day t; $A$ is the coefficient of the component structure; $B_p$ is the coefficient of the p-th order lag; $C$ is the coefficient of the exogenous variables; $x_{m,t}$ is the m-th environmental index on day t; and $\varepsilon_{n,t}$ is the residual term;
the above equation can be simplified as:
$$Y_t = A\theta_t + B_1 Y_{t-1} + \cdots + B_p Y_{t-p} + C x_t + \varepsilon_t, \quad t = 1, 2, \ldots, T \quad (2)$$
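To make the compact form (2) concrete, the following sketch evaluates one forecast step, reading the last terms of (2) as an exogenous term C·x_t plus a noise term that is dropped when forecasting. All coefficient values are hypothetical, and estimating them from data (the actual fitting step) is not shown.

```python
def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def predict_step(A, theta_t, B, lags, C, x_t):
    """Return A*theta_t + sum_k B[k]*lags[k] + C*x_t (noise term omitted).

    B is [B_1, ..., B_p]; lags is [Y_{t-1}, ..., Y_{t-p}].
    """
    terms = [matvec(A, theta_t), matvec(C, x_t)]
    terms += [matvec(B_k, y_lag) for B_k, y_lag in zip(B, lags)]
    return [sum(parts) for parts in zip(*terms)]


# Hypothetical setup: n = 2 departments, m = 2 environmental indices, p = 2 lags.
A = [[1.0, 0.0], [0.0, 1.0]]
B = [[[0.5, 0.0], [0.0, 0.5]], [[0.2, 0.0], [0.0, 0.1]]]
C = [[0.1, 0.0], [0.0, 0.2]]
y_hat = predict_step(
    A, [10.0, 20.0],                      # theta_t
    B, [[100.0, 50.0], [90.0, 40.0]],     # Y_{t-1}, Y_{t-2}
    C, [50.0, 30.0],                      # x_t
)
print(y_hat)
```

With these numbers the forecast for the first department is 10 + 50 + 18 + 5 and for the second 20 + 25 + 4 + 6, up to floating-point rounding.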
Step 3-4: Compare the fitted values predicted by the model with the true values to obtain the fitting residual, calculated as:
$$E_t = Z_t - X_t \quad (3)$$
where $X_t$ is the fitted output, $Z_t$ is the real data, and $E_t$ is the fitting residual;
Step 3-5: Build a BI-LSTM neural network model that selectively forgets and retains the fitting residual data, train it on the residual data and test it, so as to reduce the error that the VARMA model cannot capture accurately. This specifically comprises the following steps:
Step 3-5-1: Construct the forget gate, which lets information pass selectively through a sigmoid layer and point-wise multiplication:
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \quad (4)$$
Step 3-5-2: Construct the input gate structure:
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \quad (5)$$
$$c_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \quad (6)$$
$$y_t = f_t * y_{t-1} + i_t * c_t \quad (7)$$
where $W_C$ is the weight coefficient of $c_t$, $W_i$ is the input weight coefficient, and tanh is the activation function;
Step 3-5-3: Construct the output gate structure:
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \quad (8)$$
$$h_t = o_t * \tanh(y_t) \quad (9)$$
After steps 3-5-1 and 3-5-2 the cell information $y_t$ has been updated. It is passed through the tanh activation function, which outputs a value in the interval [-1, 1], and that value is multiplied by the output of the sigmoid gate to give the final output;
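Equations (4) to (9) describe a single LSTM time step. The sketch below transcribes them directly, keeping the text's notation in which `y` is the cell information and `c` the candidate value. The parameter dictionary layout and the all-zero toy weights are assumptions chosen so the result is easy to check by hand; this is not a trained network.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(W, b, h_prev, x_t):
    """Compute W . [h_{t-1}, x_t] + b for a list-of-rows weight matrix."""
    concat = h_prev + x_t
    return [sum(w * v for w, v in zip(row, concat)) + bias for row, bias in zip(W, b)]


def lstm_step(params, h_prev, y_prev, x_t):
    """One LSTM step following equations (4)-(9); returns (h_t, y_t)."""
    f = [sigmoid(v) for v in affine(params["Wf"], params["bf"], h_prev, x_t)]    # (4)
    i = [sigmoid(v) for v in affine(params["Wi"], params["bi"], h_prev, x_t)]    # (5)
    c = [math.tanh(v) for v in affine(params["WC"], params["bC"], h_prev, x_t)]  # (6)
    y = [ft * yp + it * ct for ft, yp, it, ct in zip(f, y_prev, i, c)]           # (7)
    o = [sigmoid(v) for v in affine(params["Wo"], params["bo"], h_prev, x_t)]    # (8)
    h = [ot * math.tanh(yt) for ot, yt in zip(o, y)]                             # (9)
    return h, y


# Toy parameters: hidden size 1, input size 1, all weights and biases zero,
# so every gate evaluates to sigmoid(0) = 0.5 and the candidate to tanh(0) = 0.
zero = {name: [[0.0, 0.0]] for name in ("Wf", "Wi", "WC", "Wo")}
zero.update({name: [0.0] for name in ("bf", "bi", "bC", "bo")})
h_t, y_t = lstm_step(zero, h_prev=[0.0], y_prev=[2.0], x_t=[1.0])
print(h_t, y_t)
```

With zero weights the forget gate halves the previous cell information (2.0 becomes 1.0) and the output is half of tanh(1.0), which shows how the gates scale what is kept and what is emitted.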
Step 3-5-4: Calculate the total output of the BI-LSTM structure. The total output at time t is the sum of the forward LSTM output and the backward LSTM output:
$$h_t^{\mathrm{BI}} = \overrightarrow{h_t} \oplus \overleftarrow{h_t}$$
where $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ are the forward and backward LSTM outputs at time t, and $\oplus$ is the vector sum operation;
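The combination described in Step 3-5-4 can be sketched independently of the cell itself: one pass reads the sequence forward, another reads it in reverse, and the two outputs at each time step are added. The function names and the simple recurrent `toy_cell` used here are stand-ins chosen for illustration, not LSTM cells.

```python
def run_direction(cell, xs, h0=0.0):
    """Run a recurrent cell over xs and collect the state after each step."""
    h, outputs = h0, []
    for x in xs:
        h = cell(h, x)
        outputs.append(h)
    return outputs


def bidirectional_sum(cell_fwd, cell_bwd, xs):
    """Sum the forward output and the backward output at every time step."""
    forward = run_direction(cell_fwd, xs)
    backward = run_direction(cell_bwd, xs[::-1])[::-1]  # realign to forward time
    return [f + b for f, b in zip(forward, backward)]


def toy_cell(h, x):
    """Stand-in recurrent update (not an LSTM): decay the state, add the input."""
    return 0.5 * h + x


total = bidirectional_sum(toy_cell, toy_cell, [1.0, 2.0, 3.0])
print(total)  # [3.75, 6.0, 7.25]
```

The backward outputs are reversed before summation so that position t in both lists refers to the same time step.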
Step 3-5-5: Iteratively train the parameters of the BI-LSTM network. In each iteration, the BI-LSTM network is first back-propagated, and the back-propagated output is then used to calculate the error of the output layer at each time step and the parameter derivatives of the forward LSTM and the backward LSTM;
once the parameter derivatives are obtained, the network parameters of the BI-LSTM are updated.
Step 3-6: Use the trained BI-LSTM model to predict the residuals of the test set, and denote the predicted residual as E. Finally, calculate the corrected test-set prediction, denoted $Y_{VARLST}$:
$$Y_{VARLST} = X_t + E \quad (13)$$
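The residual-correction idea of equations (3) and (13) can be shown end to end with a placeholder in place of the BI-LSTM. In this sketch `predict_residuals` simply repeats the mean of the most recent residuals; that stand-in, the window size and the numbers are all assumptions, used only to show how the pieces connect.

```python
def fitting_residuals(actual, fitted):
    """E_t = Z_t - X_t for each time step, as in equation (3)."""
    return [z - x for z, x in zip(actual, fitted)]


def predict_residuals(residuals, horizon, window=3):
    """Placeholder for the BI-LSTM: repeat the mean of the last `window` residuals."""
    recent = residuals[-window:]
    return [sum(recent) / len(recent)] * horizon


def corrected_forecast(varma_forecast, predicted_residuals):
    """Y_VARLST = X_t + E, as in equation (13)."""
    return [x + e for x, e in zip(varma_forecast, predicted_residuals)]


# Made-up training-period values.
actual = [105.0, 98.0, 110.0]
fitted = [100.0, 100.0, 104.0]
residuals = fitting_residuals(actual, fitted)        # [5.0, -2.0, 6.0]
e_hat = predict_residuals(residuals, horizon=2)      # [3.0, 3.0]
y_varlst = corrected_forecast([102.0, 107.0], e_hat)
print(y_varlst)  # [105.0, 110.0]
```

The statistical model supplies the baseline forecast and the residual model only has to learn what the baseline misses, which is the division of labour the hybrid relies on.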
Step 4: and (3) taking the environmental index data with higher correlation obtained in the step (2) and the doctor-seeking people counting data as input data, and finally combining all weak regressors according to weights through iterative training of an AdaBoost model to obtain a final prediction result. The method specifically comprises the following steps:
Step 4-1: Let the sample set be $U = \{(x_i, y_i) \mid i = 1, 2, \dots, N\}$, and let $D_t(i)$, $t = 1, 2, \dots, K$, denote the weight distribution of the samples at the t-th iteration. At the first iteration (t = 1), the weight distribution is initialized.
Step 4-2: Calculate the maximum error on the training set, $E_t = \max |y_i - G_k(x_i)|$, $i = 1, 2, \dots, N$; calculate the predicted output $f_t(x_i)$ according to the weight distribution; and calculate the prediction error.
Step 4-3: Calculate the proportion error.
Step 4-4: Calculate the connection weights.
Step 4-5: Adjust the weight distribution of the samples.
Step 4-6: After K iterations, the final prediction result is obtained by combining the K weak regressors according to their connection weights.
Step 4-7: The prediction result $Y_{VARLST}$ obtained from the VARLST model and the prediction result $Y_{AdaBoost}$ obtained from the AdaBoost model are fused according to their respective weights to obtain the final prediction result Y. The calculation formula is as follows:
$Y = W_1 Y_{VARLST} + W_2 Y_{AdaBoost}$ (17)
To reduce the risk of overfitting, reduce the variance of the model, and improve its stability and robustness, average fusion is used in this example, so $W_1 = W_2 = 0.5$.
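Formulas (13) and (17) amount to two element-wise operations on the prediction series. A minimal sketch follows; the function names and all numbers are hypothetical and serve only to make the arithmetic concrete.

```python
def correct_with_residual(varma_pred, residual_pred):
    """Formula (13): add the predicted residual E to the VARMA prediction X_t."""
    return [x + e for x, e in zip(varma_pred, residual_pred)]

def fuse(y_varlst, y_adaboost, w1=0.5, w2=0.5):
    """Formula (17): weighted fusion; w1 = w2 = 0.5 is the average fusion used here."""
    return [w1 * a + w2 * b for a, b in zip(y_varlst, y_adaboost)]

# Hypothetical monthly visit counts, for illustration only.
varma_pred = [100.0, 120.0, 90.0]
residual_pred = [4.0, -6.0, 2.0]
y_varlst = correct_with_residual(varma_pred, residual_pred)  # [104.0, 114.0, 92.0]
y_adaboost = [110.0, 110.0, 96.0]
y_final = fuse(y_varlst, y_adaboost)                         # [107.0, 112.0, 94.0]
```

With equal weights the fused value is simply the midpoint of the two model outputs, which is why this choice tends to reduce variance when the two models err in different directions.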
Step 5: The fitting performance of the nonlinear model on the time series data is checked using the root mean square error (RMSE), the mean absolute percentage error (MAPE) and the mean absolute error (MAE). These model evaluation indices for assessing the prediction results are computed from the following quantities:
$P_i$ is the predicted data value, $X_i$ is the true data value, $\bar{X}$ is the mean of the true data, and n is the number of data points.
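The three indices can be computed as in the sketch below, which assumes their standard textbook definitions (the source formulas are not reproduced here). The argument order, with predictions `p` first and true values `x` second, mirrors the symbols $P_i$ and $X_i$; the sample numbers are hypothetical.

```python
import math

def rmse(p, x):
    """Root mean square error."""
    return math.sqrt(sum((pi - xi) ** 2 for pi, xi in zip(p, x)) / len(x))

def mape(p, x):
    """Mean absolute percentage error, in percent; true values must be nonzero."""
    return 100.0 * sum(abs((pi - xi) / xi) for pi, xi in zip(p, x)) / len(x)

def mae(p, x):
    """Mean absolute error."""
    return sum(abs(pi - xi) for pi, xi in zip(p, x)) / len(x)

# Hypothetical predictions and true values, for illustration only.
predicted = [110.0, 90.0, 100.0]
actual = [100.0, 100.0, 100.0]
scores = {"RMSE": rmse(predicted, actual),
          "MAPE": mape(predicted, actual),
          "MAE": mae(predicted, actual)}
```

Lower values are better for all three. RMSE penalizes large individual errors more heavily than MAE, while MAPE expresses the error relative to the true value.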
Step 6: The relative risk value of the health risk is calculated from the number of sick people and the number of non-sick people predicted from the current environmental element data, using the median of the sick rate and non-sick rate over a past period as the risk reference value. The calculation formula is as follows:
$RR = \dfrac{IE / (IE + IN)}{CE / (CE + CN)}$
where RR is the relative risk value of the predicted health risk under the multi-parameter environmental indices, IE is the predicted number of sick people, IN is the predicted number of non-sick people, CE is the number of sick people under the risk reference value, and CN is the number of non-sick people under the risk reference value.
Health risk early-warning levels are assigned according to the relative risk value of the predicted health risk under the multi-parameter environmental indices: a relative risk value below 1.4 indicates low risk, a value from 1.4 to 2.9 indicates medium risk, and a value above 2.9 indicates high risk. The risk level defined in this way achieves the aim of predicting health risks in advance.
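The risk calculation and grading of step 6 can be sketched as follows. The ratio used is the standard epidemiological relative risk, which is an assumption about the exact formula; the function names and the counts are hypothetical.

```python
def relative_risk(ie, i_n, ce, cn):
    """Ratio of the predicted sick rate to the reference sick rate (assumed form)."""
    predicted_rate = ie / (ie + i_n)   # IE / (IE + IN)
    reference_rate = ce / (ce + cn)    # CE / (CE + CN)
    return predicted_rate / reference_rate

def risk_level(rr):
    """Map a relative risk value to the three early-warning levels."""
    if rr < 1.4:
        return "low"
    if rr <= 2.9:
        return "medium"
    return "high"

# Hypothetical counts: 20 predicted sick out of 1000, against a reference of 10 out of 1000.
rr = relative_risk(ie=20, i_n=980, ce=10, cn=990)
level = risk_level(rr)
```

Here the predicted sick rate is twice the reference rate, so the value falls in the medium band. The parameter is named `i_n` rather than `in` because `in` is a reserved word in Python.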
It should be understood that, although the steps in the above-described flowcharts are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the order of execution of the steps is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described above may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; the sub-steps or stages are not necessarily performed sequentially, but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Simulation calculation example:
To verify the effectiveness of the algorithm provided by this embodiment, monthly air quality index data, monthly data on the total emissions of waste gas, waste water and other pollutants, and monthly data on the number of patient visits for the study area from 2014 to 2023 are selected. The VARMA model, the AdaBoost model and the VARLST model are used for prediction and compared with the prediction model of the application (the VARLST-AdaBoost mixed model), and the fitting performance of the models is finally compared according to the model evaluation indices.
In this embodiment, the root mean square error (RMSE), the mean absolute percentage error (MAPE) and the mean absolute error (MAE) are used to test the fitting performance of the nonlinear model on the time series data, and the resulting indices are shown in the following table.
Table 1 model comparison experiment results
It can be seen from Table 1 that when the total amount of data is small, the conventional VARMA and AdaBoost prediction models do not predict well, while the VARLST model improves the prediction accuracy to some extent but carries a risk of overfitting and instability. The VARLST-AdaBoost mixed model provided by the invention improves on the original models: the sample weights are adjusted according to the classification error rate, which reduces overfitting, gives more accurate prediction results and improves the prediction performance of the model.
As shown in fig. 4, in one embodiment there is provided a health risk prediction apparatus 400 comprising: a data acquisition module 410, a mixed multivariate time series prediction module 420, and a health risk prediction module 430, wherein,
A data acquisition module 410, configured to acquire current environmental data;
The mixed multivariate time series prediction module 420 is configured to input the current environmental data into a prediction model, where the prediction model includes a VARLST mixed model and an AdaBoost model, where the VARLST model includes a VARMA model and a BI-LSTM model, and the VARMA model is used for multi-dimensional data prediction to obtain a prediction result of the VARMA prediction model and compare the prediction result with an actual value to obtain a fitting residual, and the fitting residual is used as an input of the BI-LSTM model to obtain a prediction result of the BI-LSTM model, and superimpose the prediction result of the VARMA prediction model with the prediction result of the BI-LSTM model to obtain a prediction result of the VARLST model; carrying out weighted fusion on the predicted result of the VARLST model and the predicted result of the AdaBoost model to obtain a final predicted value;
The health risk prediction module 430 is configured to calculate a relative risk value of the predicted health risk based on the final predicted value predicted by the current environmental data and the disease population and the non-disease population at the risk reference value in the past period; and carrying out health risk early warning based on the relative danger value.
In one embodiment, a computer device is provided, which may be a terminal, and the internal structure of which may be as shown in fig. 5. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a health risk prediction method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those skilled in the art that the structure shown in FIG. 5 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In one embodiment, a computer device is provided comprising a memory and a processor, the memory having stored therein a computer program, the processor when executing the computer program performing the steps of:
Acquiring current environment data;
Inputting the current environmental data into a prediction model, wherein the prediction model comprises a VARLST mixed model and an AdaBoost model, the VARLST model comprises a VARMA model and a BI-LSTM model, the VARMA model is used for multi-dimensional data prediction to obtain a prediction result of the VARMA prediction model, comparing the prediction result with an actual value to obtain a fitting residual, taking the fitting residual as an input of the BI-LSTM model to obtain a prediction result of the BI-LSTM model, and superposing the prediction result of the VARMA prediction model and the prediction result of the BI-LSTM model to obtain a prediction result of the VARLST model; carrying out weighted fusion on the predicted result of the VARLST model and the predicted result of the AdaBoost model to obtain a final predicted value;
calculating a relative risk value for predicting health risk based on the final predicted value predicted under the current environmental data and the disease number and the non-disease number under the risk reference standard value in the past time period; and carrying out health risk early warning based on the relative danger value.
In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, performs the steps of:
Acquiring current environment data;
Inputting the current environmental data into a prediction model, wherein the prediction model comprises a VARLST mixed model and an AdaBoost model, the VARLST model comprises a VARMA model and a BI-LSTM model, the VARMA model is used for multi-dimensional data prediction to obtain a prediction result of the VARMA prediction model, comparing the prediction result with an actual value to obtain a fitting residual, taking the fitting residual as an input of the BI-LSTM model to obtain a prediction result of the BI-LSTM model, and superposing the prediction result of the VARMA prediction model and the prediction result of the BI-LSTM model to obtain a prediction result of the VARLST model; carrying out weighted fusion on the predicted result of the VARLST model and the predicted result of the AdaBoost model to obtain a final predicted value;
calculating a relative risk value for predicting health risk based on the final predicted value predicted under the current environmental data and the disease number and the non-disease number under the risk reference standard value in the past time period; and carrying out health risk early warning based on the relative danger value.
Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program stored on a non-transitory computer-readable storage medium, and that the program, when executed, may comprise the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above examples illustrate only a few embodiments of the application, which are described in detail and are not to be construed as limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims (7)

1. A method of predicting health risk, comprising the steps of:
Acquiring current environment data;
Inputting the current environmental data into a prediction model, wherein the prediction model comprises a VARLST mixed model and an AdaBoost model, the VARLST mixed model comprises a VARMA model and a BI-LSTM model, the VARMA model is used for multi-dimensional data prediction to obtain a prediction result of the VARMA prediction model, comparing the prediction result with an actual value to obtain a fitting residual, taking the fitting residual as an input of the BI-LSTM model to obtain a prediction result of the BI-LSTM model, and superposing the prediction result of the VARMA prediction model and the prediction result of the BI-LSTM model to obtain a prediction result of the VARLST mixed model; carrying out weighted fusion on the predicted result of the VARLST mixed model and the predicted result of the AdaBoost model to obtain a final predicted value;
Calculating a relative risk value for predicting health risk based on the final predicted value predicted under the current environmental data and the disease number and the non-disease number under the risk reference standard value in the past time period; and carrying out health risk early warning based on the relative risk value, wherein the relative risk value RR for predicting the health risk is calculated according to the following formula:
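$RR = \dfrac{IE / (IE + IN)}{CE / (CE + CN)}$
wherein IE is the predicted number of sick people, IN is the predicted number of non-sick people, CE is the number of sick people under the risk reference benchmark, and CN is the number of non-sick people under the risk reference benchmark, wherein,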
The training process of VARLST mixed models specifically comprises the following steps: acquiring historical environment data and medical statistics data, preprocessing the historical environment data and the medical statistics data, and dividing a preprocessed data set into a training set and a testing set according to a preset proportion; carrying out differential smoothing on the data set, carrying out autocorrelation function and partial autocorrelation function analysis on the data subjected to differential smoothing, and estimating VARMA prediction model parameters and the orders of differences based on analysis results, wherein the VARMA prediction model is as follows:
wherein $y_{n,t}$ is the number of patient visits to the n-th department on day t; $y_{n,t-p}$ is the p-th order lag of the number of patient visits to the n-th department on day t; $\theta_{n,t}$ is the time series composition structure of the number of patient visits to the n-th department on day t; a is the coefficient corresponding to the composition structure; $b_p$ is the coefficient corresponding to the p-th order lag; c is the coefficient of the exogenous variable; $x_{m,t}$ is the m-th environmental index data item on day t; and $\varepsilon_n$ is the residual term; inputting the training set into the VARLST mixed model for training, wherein the VARLST mixed model comprises a VARMA model and a BI-LSTM model, the VARMA model is used for multi-dimensional data prediction, and the obtained prediction result of the VARMA prediction model is compared with the actual value to obtain a fitting residual; inputting the fitting residual into the BI-LSTM model for training until a preset requirement or number of iterations is reached, and outputting the trained BI-LSTM model, wherein in each iteration a backward pass is performed on the BI-LSTM model, the error of the output layer at each time step and the parameter derivatives of the forward LSTM and the backward LSTM are calculated from the output of the backward pass, and the network parameters of the BI-LSTM model are updated after the parameter derivatives are obtained; inputting the test set into the trained BI-LSTM model to obtain the prediction result of the BI-LSTM model; and superimposing the prediction result of the VARMA prediction model and the prediction result of the BI-LSTM model to obtain the prediction result of the VARLST mixed model;
the training process of the AdaBoost model specifically comprises the following steps: training a base learner based on the data set and the initialized weight distribution, and optimizing the base learner; calculating the prediction error, the proportion error and the connection weight of each training sample in the t-th iteration; adjusting the weight distribution of each training sample for the (t+1)-th iteration, and continuing training until the maximum number of iterations is reached, so as to obtain K models; and adding the prediction results of the K models to obtain the prediction result of the AdaBoost model.
2. The method of claim 1, wherein the historical environmental data includes air quality data, city drinking water quality data, waste gas and waste water and other total emissions data.
3. The method for predicting health risk according to claim 1, wherein the steps of acquiring and preprocessing historical environmental data and medical statistics include the steps of:
Data cleaning is carried out on the environmental data and the medical treatment statistical data, and the data cleaning comprises the steps of processing missing values, abnormal values and repeated values;
Carrying out correlation analysis on the cleaned data, and calculating correlation coefficients among all the features through the Pearson correlation coefficients to finally obtain a correlation coefficient matrix C;
and according to the correlation coefficient matrix C, selecting an environmental index with higher correlation with medical treatment statistical data as an input variable of the model.
4. The method of claim 1, wherein MAE, MAPE and RMSE are selected as evaluation indices to evaluate the performance of the prediction model.
5. A health risk prediction apparatus, comprising: a data acquisition module, a mixed multivariate time series prediction module and a health risk prediction module, wherein,
The data acquisition module is used for acquiring current environment data;
The mixed multi-element time sequence prediction module is used for inputting the current environment data into a prediction model, wherein the prediction model comprises a VARLST mixed model and an AdaBoost model, the VARLST mixed model comprises a VARMA model and a BI-LSTM model, the VARMA model is used for multi-dimensional data prediction, a prediction result of the VARMA prediction model is obtained, the prediction result is compared with an actual value to obtain a fitting residual, the fitting residual is used as input of the BI-LSTM model, the prediction result of the BI-LSTM model is obtained, and the prediction result of the VARMA prediction model is overlapped with the prediction result of the BI-LSTM model to obtain a prediction result of the VARLST mixed model; and carrying out weighted fusion on the predicted result of the VARLST mixed model and the predicted result of the AdaBoost model to obtain a final predicted value, wherein the training process of the VARLST mixed model specifically comprises the following steps: acquiring historical environment data and medical statistics data, preprocessing the historical environment data and the medical statistics data, and dividing a preprocessed data set into a training set and a testing set according to a preset proportion; carrying out differential smoothing on the data set, carrying out autocorrelation function and partial autocorrelation function analysis on the data subjected to differential smoothing, and estimating VARMA prediction model parameters and the orders of differences based on analysis results, wherein the VARMA prediction model is as follows:
wherein $y_{n,t}$ is the number of patient visits to the n-th department on day t; $y_{n,t-p}$ is the p-th order lag of the number of patient visits to the n-th department on day t; $\theta_{n,t}$ is the time series composition structure of the number of patient visits to the n-th department on day t; a is the coefficient corresponding to the composition structure; $b_p$ is the coefficient corresponding to the p-th order lag; c is the coefficient of the exogenous variable; $x_{m,t}$ is the m-th environmental index data item on day t; and $\varepsilon_n$ is the residual term; inputting the training set into the VARLST mixed model for training, wherein the VARLST mixed model comprises a VARMA model and a BI-LSTM model, the VARMA model is used for multi-dimensional data prediction, and the obtained prediction result of the VARMA prediction model is compared with the actual value to obtain a fitting residual; inputting the fitting residual into the BI-LSTM model for training until a preset requirement or number of iterations is reached, and outputting the trained BI-LSTM model, wherein in each iteration a backward pass is performed on the BI-LSTM model, the error of the output layer at each time step and the parameter derivatives of the forward LSTM and the backward LSTM are calculated from the output of the backward pass, and the network parameters of the BI-LSTM model are updated after the parameter derivatives are obtained; inputting the test set into the trained BI-LSTM model to obtain the prediction result of the BI-LSTM model; superimposing the prediction result of the VARMA prediction model and the prediction result of the BI-LSTM model to obtain the prediction result of the VARLST mixed model; the training process of the AdaBoost model specifically comprises the following steps: training a base learner based on the data set and the initialized weight distribution, and optimizing the base learner; calculating the prediction error, the proportion error and the connection weight of each training sample in the t-th iteration; adjusting the weight distribution of each training sample for the (t+1)-th iteration, and continuing training until the maximum number of iterations is reached, so as to obtain K models; and adding the prediction results of the K models to obtain the prediction result of the AdaBoost model;
the health risk prediction module is used for calculating a relative risk value of predicted health risk based on a final predicted value predicted by current environmental data and the number of sick people and non-sick people of the disease under a risk reference standard value in a past time period; and carrying out health risk early warning based on the relative risk value, wherein the relative risk value RR for predicting the health risk is calculated according to the following formula:
$RR = \dfrac{IE / (IE + IN)}{CE / (CE + CN)}$
wherein IE is the predicted number of sick people, IN is the predicted number of non-sick people, CE is the number of sick people under the risk reference benchmark, and CN is the number of non-sick people under the risk reference benchmark.
6. A computer device, comprising: a memory for storing a computer program; a processor for implementing the method according to any one of claims 1 to 4 when executing the computer program.
7. A readable storage medium, characterized in that it has stored thereon a computer program which, when executed by a processor, implements the method according to any of claims 1 to 4.
CN202311272180.4A 2023-09-28 2023-09-28 Health risk prediction method, device, equipment and medium Active CN117334334B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311272180.4A CN117334334B (en) 2023-09-28 2023-09-28 Health risk prediction method, device, equipment and medium


Publications (2)

Publication Number Publication Date
CN117334334A (en) 2024-01-02
CN117334334B (en) 2024-05-03

Family

ID=89282406




Kang et al. Flexible Bayesian Modeling for Longitudinal Binary and Ordinal Responses

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant