WO2019227716A1 - Method and device for generating an influenza prediction model, and computer-readable storage medium - Google Patents

Method and device for generating an influenza prediction model, and computer-readable storage medium

Info

Publication number
WO2019227716A1
WO2019227716A1 PCT/CN2018/102221 CN2018102221W WO2019227716A1 WO 2019227716 A1 WO2019227716 A1 WO 2019227716A1 CN 2018102221 W CN2018102221 W CN 2018102221W WO 2019227716 A1 WO2019227716 A1 WO 2019227716A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
candidate
features
prediction model
prediction
Prior art date
Application number
PCT/CN2018/102221
Other languages
English (en)
French (fr)
Inventor
李弦
徐亮
阮晓雯
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Priority to JP2019556833A (patent JP6815708B2, ja)
Publication of WO2019227716A1 (zh)

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/80 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics, e.g. flu

Definitions

  • The present application relates to the field of computer technology, and in particular to a method, a device, and a computer-readable storage medium for generating an influenza prediction model.
  • Influenza prediction generally uses a time series model based on the autocorrelation of the series, builds a regression model on exogenous features, or combines different models to make predictions.
  • Combining models exploits the strengths of each model's algorithm.
  • Capturing the series' own pattern of change while using external features to correct the time series model improves the generalization ability of the model.
  • The model combination method in common use today is averaging: the prediction results of the different models are averaged, and the average is used as the prediction result of the combined model.
  • This combination method cannot assess the prediction ability of each model.
  • The weight of each model cannot be adjusted dynamically, which lowers the prediction accuracy of the combined model.
  • The present application provides a method, a device, and a computer-readable storage medium for generating an influenza prediction model, whose main purpose is to improve the prediction accuracy of the influenza prediction model.
  • The present application also provides a method for generating an influenza prediction model, the method including:
  • The present application further provides a device for generating an influenza prediction model.
  • The device includes a memory and a processor.
  • The memory stores a model generation program that can run on the processor.
  • When the model generation program is executed by the processor, the following steps are implemented:
  • The present application also provides a computer-readable storage medium that stores a model generation program, and the model generation program can be executed by one or more processors to implement the steps of the method for generating an influenza prediction model described above.
  • The method, device, and computer-readable storage medium proposed in this application obtain data on the percentage of influenza-like cases in multiple consecutive time units and establish an autoregressive integrated moving average (ARIMA) model; obtain public opinion keywords, obtain public opinion data sequences in the multiple time units according to those keywords, use the public opinion data in the sequences as prediction features, and train an xgboost prediction model built on the xgboost algorithm to determine the model parameters; and construct an influenza prediction model based on the Kalman filter algorithm from the ARIMA model and the xgboost prediction model. When the influenza prediction model is used for prediction, the first predicted value of the ARIMA model for the target time unit is used as the measured value of the state variable, and the second predicted value of the xgboost prediction model for the target time unit is used as the prior estimate of the state variable, to calculate the Kalman gain of the current influenza prediction model.
  • The weights of the two models in the influenza prediction model are updated according to the calculated Kalman gain, and the influenza prediction model with updated weights is used to predict the percentage of influenza-like cases in the next time unit. In this way the weights of the two models are updated dynamically, so that the combined prediction model tends toward the output of the model that currently performs better, which improves the accuracy of the prediction model.
  • FIG. 1 is a schematic flowchart of a method for generating an influenza prediction model according to an embodiment of the present application;
  • FIG. 2 is a schematic diagram of an internal structure of a device for generating an influenza prediction model according to an embodiment of the present application;
  • FIG. 3 is a schematic block diagram of a model generation program in an influenza prediction model generation device provided by an embodiment of the present application.
  • FIG. 1 is a schematic flowchart of a method for generating an influenza prediction model according to an embodiment of the present application. The method may be performed by a device, which may be implemented in software and/or hardware.
  • A method for generating an influenza prediction model includes:
  • Step S10: Acquire data on the percentage of influenza-like cases in multiple consecutive time units and establish an autoregressive integrated moving average (ARIMA) model.
  • Step S20: Obtain public opinion keywords, obtain public opinion data sequences in the multiple time units according to the public opinion keywords, use the public opinion data in the sequences as prediction features, and train an xgboost prediction model built on the xgboost algorithm to determine the model parameters.
  • The public opinion keywords related to influenza mainly include influenza virus, high fever, cough, nasal congestion, crack, Tylenol, upper respiratory tract infection, influenza A, and the like. Public opinion data for the target area to be predicted are obtained from preset channels according to these keywords.
  • The preset channels include Baidu Search and Weibo.
  • The public opinion data mainly include the Baidu search index of the above keywords and the number of Weibo posts for them. If a certain area is the object of analysis, that area is taken as the target area, and the Baidu search index and the number of Weibo posts for the public opinion keywords in that area are obtained.
  • With the week as the time unit, the Baidu search index and the number of Weibo posts for each week of the past 5 years are obtained as public opinion data.
  • The public opinion data for one keyword on one preset channel then form a sequence of 260 data points; each data point in the sequence is a candidate feature, and all candidate features constitute the candidate feature set. The features in this set are used to train an xgboost prediction model based on the xgboost (eXtreme Gradient Boosting) algorithm to determine the model parameters.
  • Determine the public opinion keywords, obtain public opinion data sequences in multiple consecutive time units according to the keywords, and use the public opinion data in the sequences as candidate features to construct a candidate feature set;
  • Apply wavelet denoising and detrending to the candidate features; determine a preset number of features, and select that number of candidate features from the denoised and detrended candidate feature set to form a prediction feature set; train the xgboost prediction model built on the xgboost algorithm using the prediction feature set and the actual observed percentages of influenza-like cases in the multiple consecutive time units to determine the model parameters.
  • The wavelet denoising and detrending are implemented as follows: determine a wavelet basis function, perform wavelet decomposition on the sequence formed by each feature in the candidate feature set according to that function, and determine the number of decomposition levels; determine the wavelet denoising threshold, and adjust the coefficients at each level of the decomposed features according to that threshold; perform inverse-transform reconstruction on the adjusted wavelet coefficients to obtain the denoised candidate features;
  • For the candidate feature corresponding to each time unit in the denoised candidate feature set, obtain the data of multiple consecutive time units before that time unit, perform linear regression to build a trend prediction model, and obtain the baseline predicted value for that time unit from the trend prediction model;
  • Subtract the baseline predicted value from the actual value of the candidate feature in that time unit to obtain the detrended candidate feature.
  • A wavelet basis function is determined, the sequence formed by each feature in the candidate feature set is decomposed according to that function, and the number of decomposition levels is determined. For example, wavelet decomposition is performed on the weekly Baidu index series for the public opinion keyword "high fever". Following the principle of choosing a wavelet whose shape is close to the waveform of the measured signal, db4 is selected as the wavelet basis function for decomposing the public opinion data. To choose the decomposition scale, different scales within a certain range are tested given the length of the public opinion data, and the number of decomposition levels that gives better denoising with lower signal distortion is selected.
  • A soft-threshold algorithm sets the smaller wavelet coefficients to zero and shrinks the larger ones toward zero, thereby adjusting the coefficients at each level of the decomposed candidate feature.
  • The specific formula is as follows, where w is the coefficient before adjustment and d is the adjusted coefficient:
  • Inverse transform reconstruction is performed on the adjusted wavelet coefficients to obtain candidate features after denoising.
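  • The soft-threshold adjustment of the wavelet coefficients can be sketched as follows. This is a minimal illustration that assumes the standard soft-threshold rule, not necessarily the application's own formula; the function names and the `threshold` parameter are hypothetical, and the coefficient values are toy numbers. Here `w` is a coefficient before adjustment and the return value is the adjusted coefficient `d`.

```python
def soft_threshold(w, threshold):
    """Return the adjusted coefficient d for one wavelet coefficient w.

    Assumption: the standard soft-threshold rule. Coefficients whose
    magnitude does not exceed the threshold become zero; larger ones
    are shrunk toward zero by the threshold.
    """
    if abs(w) <= threshold:
        return 0.0
    sign = 1.0 if w > 0 else -1.0
    return sign * (abs(w) - threshold)


def soft_threshold_all(coeffs, threshold):
    """Apply the rule to every coefficient of one decomposition level."""
    return [soft_threshold(w, threshold) for w in coeffs]


# Toy detail coefficients (illustrative values, not real public opinion data).
detail = [0.2, -0.5, 3.0, -4.0]
adjusted = soft_threshold_all(detail, threshold=1.0)
```

  • With a threshold of 1.0, the two small coefficients become zero and the two large ones move one unit toward zero while keeping their sign.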
  • For the candidate feature at each time unit, linear regression is performed on the data of multiple consecutive time units before that time unit to build a trend prediction model, and the baseline predicted value for that time unit is obtained from the model;
  • the baseline predicted value is subtracted from the actual value of the candidate feature in that time unit to obtain the detrended candidate feature.
  • For example, the data of the preceding 52 weeks are used for linear regression to build the trend prediction model. If a data point has less than 52 weeks of history, all of its historical data are used instead. The baseline predicted value of the current data point is obtained from the trend prediction model and subtracted from the actual value of the feature at that point to obtain the detrended feature.
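  • The detrending step can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the trend model is an ordinary least-squares line over the week index, the function names `fit_line` and `detrend` are hypothetical, and the handling of the very first point (which has no history) is a choice made here, not something the text specifies.

```python
def fit_line(ys):
    """Least-squares fit of y = intercept + slope * x for x = 0..len(ys)-1."""
    n = len(ys)
    if n == 1:
        return ys[0], 0.0  # a single point gives a flat line
    mean_x = (n - 1) / 2.0
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def detrend(series, window=52):
    """Subtract a rolling linear-trend baseline from each point.

    For point t, the line is fitted on up to `window` preceding points
    (all available history if fewer exist) and extrapolated one step.
    """
    out = []
    for t, actual in enumerate(series):
        history = series[max(0, t - window):t]
        if not history:
            out.append(0.0)  # assumption: no history, so no trend to remove
            continue
        intercept, slope = fit_line(history)
        baseline = intercept + slope * len(history)
        out.append(actual - baseline)
    return out


# Toy series with a pure linear trend (illustrative values only).
detrended = detrend([1.0, 2.0, 3.0, 4.0, 5.0])
```

  • On a purely linear series the baseline matches the actual value once at least two historical points exist, so the detrended values from the third point onward are approximately zero.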
  • Different numbers of selected features can be tried, the prediction result obtained for each, and a suitable number chosen according to the accuracy of the prediction results; alternatively, in other embodiments, the number of selected features can be determined as follows:
  • The prediction features in the prediction feature set are used to train the xgboost prediction model. Specifically, the actual observed percentages of influenza-like cases in the multiple consecutive time units are obtained, and the prediction features of one week are paired with the percentage of influenza-like cases of the following week to form a training sample.
  • Data from multiple consecutive weeks before the current prediction week, which reflect the latest trend in influenza activity, are selected; for example, the data of the 52 weeks preceding the current prediction week are used as the training set for rolling prediction.
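  • The pairing of one week's features with the following week's percentage, over a rolling window, can be sketched as follows. This is a minimal illustration: the function name `rolling_training_set`, the list-based data layout, and the toy values are assumptions, and the resulting samples would then be passed to whatever xgboost training routine is used.

```python
def rolling_training_set(features, ili_percent, predict_week, window=52):
    """Build (X, y) for a rolling prediction of `predict_week`.

    features[t]    : feature vector observed in week t
    ili_percent[t] : observed percentage of influenza-like cases in week t
    Each sample pairs the features of week t with the percentage of
    week t + 1, using only target weeks before `predict_week` and at
    most `window` of them.
    """
    X, y = [], []
    first_target = max(1, predict_week - window)
    for target in range(first_target, predict_week):
        X.append(features[target - 1])  # features from the week before
        y.append(ili_percent[target])   # percentage observed in the target week
    return X, y


# Toy data: one feature per week, ten weeks (illustrative values only).
toy_features = [[i] for i in range(10)]
toy_ili = [i * 10 for i in range(10)]
X_train, y_train = rolling_training_set(toy_features, toy_ili, predict_week=6, window=3)
```

  • To predict week 6 in this toy setup, the trained model would be given the features of week 5; the window then slides forward by one week for the next prediction.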
  • With gbtree (glossed in the source as "general balanced trees") as the tree model,
  • a forward stagewise algorithm is used to construct each new regression tree to fit the residuals of the current model; the regularization term is optimized to suppress overfitting, and processing is parallelized to improve the performance of the algorithm.
  • Step S30: Construct an influenza prediction model based on the Kalman filter algorithm from the ARIMA model and the xgboost prediction model.
  • Step S40: Use the first predicted value of the ARIMA model for the target time unit as the measured value of the state variable, and use the second predicted value of the xgboost prediction model for the target time unit as the prior estimate of the state variable, to calculate the Kalman gain of the current influenza prediction model.
  • Step S50: Update the weights of the ARIMA model and the xgboost prediction model in the influenza prediction model according to the calculated Kalman gain, and use the influenza prediction model with updated weights to predict the percentage of influenza-like cases in the time unit following the target time unit.
  • The first predicted value y_A output by the ARIMA model for the target time unit k is used as the measured value of the state variable, obtained through the measurement equation of the discrete-time process, and the second predicted value y_x output by the xgboost prediction model for the target time unit k is taken as the prior estimate of the state variable.
  • The current Kalman gain is then calculated, and the weights in the combined influenza prediction model are determined according to the Kalman gain.
  • The predicted value of the influenza prediction model, that is, the posterior estimate of the state variable in the Kalman filter, can then be obtained.
  • The expression is:
  • K_k is the Kalman gain, a single number (scalar) in this embodiment, which determines the weights of the ARIMA model and the xgboost prediction model in the combined prediction model.
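  • Under the standard scalar Kalman update, and assuming a measurement matrix of 1 (an assumption, not stated in the text), the roles assigned to the ARIMA output (measured value) and the xgboost output (prior estimate) would give a posterior estimate of the following form, where \hat{y}_k is a symbol introduced here for the combined prediction, y_A is the ARIMA prediction, and y_x is the xgboost prediction:

```latex
\hat{y}_k = y_x + K_k\,(y_A - y_x) = (1 - K_k)\,y_x + K_k\,y_A
```

  • Read this way, K_k acts as the weight on the ARIMA output and 1 - K_k as the weight on the xgboost output.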
  • The covariance of the prior estimation error at time k can be calculated from the covariance of the posterior estimation error at time k-1, where A is the matrix that linearly maps the state at the previous time k-1 to the state at the current time k.
  • A may change with time; here it is assumed to be constant and is set to 1 in this embodiment.
  • The observation noise covariance R is taken as the covariance of the historical prediction errors of the xgboost prediction model.
  • The process excitation noise covariance Q is taken as the covariance of the historical prediction errors of the ARIMA model.
  • k denotes the time index of the current prediction,
  • and k-1 denotes the time step before k.
  • In the influenza prediction process, they denote the current week and the previous week, respectively.
  • After the posterior covariance P_{k-1} of the state at time k-1 is updated, the prior covariance at time k is calculated forward. The updated Kalman gain K_k, that is, the weight of the model combination, is then obtained from the iterative formula for K_k in the Kalman filter. In other words, once the two models have produced their predicted values for time k-1 (the week before the current week), the Kalman gain is calculated, which updates the weights of the influenza prediction model once, and the updated influenza prediction model is then used for prediction.
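  • The iteration described above can be sketched as a scalar Kalman step. This is a minimal illustration using the textbook recursion with a measurement matrix of 1 (an assumption); the function name `kalman_fuse`, the initial covariance, and the toy numbers are hypothetical. The text takes R from the historical prediction errors of the xgboost model and Q from those of the ARIMA model; the sketch simply accepts both as inputs.

```python
def kalman_fuse(y_arima, y_xgb, p_post_prev, q, r, a=1.0):
    """One scalar Kalman step combining the two model outputs.

    y_arima     : ARIMA prediction, used as the measured value
    y_xgb       : xgboost prediction, used as the prior estimate
    p_post_prev : posterior error covariance from the previous step
    q, r        : process and observation noise covariances
    a           : state transition factor (set to 1 in the text)
    Returns (combined prediction, Kalman gain, new posterior covariance).
    """
    p_prior = a * p_post_prev * a + q          # prior covariance at time k
    gain = p_prior / (p_prior + r)             # Kalman gain with H = 1 (assumed)
    fused = y_xgb + gain * (y_arima - y_xgb)   # posterior estimate
    p_post = (1.0 - gain) * p_prior            # posterior covariance at time k
    return fused, gain, p_post


# Toy weekly predictions as (y_arima, y_xgb); illustrative values only.
weekly_predictions = [(4.0, 2.0), (3.5, 3.0)]
p_post = 1.0  # assumed initial posterior covariance
history = []
for y_a, y_x in weekly_predictions:
    fused, gain, p_post = kalman_fuse(y_a, y_x, p_post, q=1.0, r=2.0)
    history.append((fused, gain))
```

  • Because the gain is recomputed at every step from the running covariance, the relative weight of the two models changes from week to week instead of staying at a fixed average.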
  • The method for generating the influenza prediction model proposed in this embodiment obtains data on the percentage of influenza-like cases in multiple consecutive time units and establishes an autoregressive integrated moving average (ARIMA) model; obtains public opinion keywords and, according to those keywords, public opinion data sequences in the multiple time units;
  • uses the public opinion data in the sequences as prediction features and trains an xgboost prediction model built on the xgboost algorithm to determine the model parameters; and constructs an influenza prediction model based on the Kalman filter algorithm from the ARIMA model and the xgboost prediction model.
  • When the influenza prediction model is used for prediction, the first predicted value of the ARIMA model for the target time unit is used as the measured value of the state variable, and the second predicted value of the xgboost prediction model for the target time unit is used as the prior estimate of the state variable.
  • The Kalman gain of the current influenza prediction model is calculated from these values; the weights of the two models in the influenza prediction model are updated according to the calculated Kalman gain, and the influenza prediction model with updated weights is used to predict the percentage of influenza-like cases in the next time unit. In this way the weights of the two models in the influenza prediction model are updated dynamically.
  • Model fusion based on Kalman filtering takes into account the series' own pattern of change and uses public opinion data to correct for disturbances to the series, which makes the model's predictions more accurate. By adjusting the model weights dynamically in real time, the combined prediction model tends toward the output of the model that currently performs better, which improves the accuracy of the prediction model.
  • the application also provides a device for generating an influenza prediction model.
  • Referring to FIG. 2, a schematic diagram of the internal structure of a device for generating an influenza prediction model according to an embodiment of the present application is shown.
  • the device 1 for generating the influenza prediction model may be a PC (Personal Computer) or a terminal device such as a smart phone, a tablet computer, or a portable computer.
  • the apparatus 1 for generating the influenza prediction model includes at least a memory 11, a processor 12, a network interface 13, and a communication bus 14.
  • the memory 11 includes at least one type of readable storage medium.
  • the readable storage medium includes a flash memory, a hard disk, a multimedia card, a card-type memory (for example, SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, and the like.
  • the memory 11 may be an internal storage unit of the influenza prediction model generating device 1 in some embodiments, such as a hard disk of the influenza prediction model generating device 1.
  • In other embodiments, the memory 11 may also be an external storage device of the influenza prediction model generating device 1, for example a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the influenza prediction model generating device 1.
  • the memory 11 may include both an internal storage unit and an external storage device of the influenza prediction model generating device 1.
  • the memory 11 can be used not only to store application software and various types of data installed in the influenza prediction model generation device 1, such as the code of the model generation program 01, but also to temporarily store data that has been or will be output.
  • In some embodiments, the processor 12 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip, and is configured to run program code or process data stored in the memory 11, for example to execute the model generation program 01.
  • the network interface 13 may optionally include a standard wired interface, a wireless interface (such as a WI-FI interface), and is generally used to establish a communication connection between the device 1 and other electronic devices.
  • the communication bus 14 is used to implement connection communication between these components.
  • the device 1 may further include a user interface.
  • The user interface may include a display and an input unit such as a keyboard; optionally, the user interface may further include a standard wired interface and a wireless interface.
  • The display may be an LED display, a liquid crystal display, a touch-type liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like.
  • the display may also be appropriately referred to as a display screen or a display unit for displaying information processed in the influenza prediction model generating device 1 and for displaying a visualized user interface.
  • FIG. 2 shows only an influenza prediction model generating device 1 having components 11-14 and a model generating program 01. Those skilled in the art will understand that the structure shown in FIG. 2 does not limit the influenza prediction model generating device 1,
  • which may include fewer or more components than shown, combine some components, or use a different arrangement of components.
  • a model generating program 01 is stored in the memory 11; when the processor 12 executes the model generating program 01 stored in the memory 11, the following steps are implemented:
  • Step S10: Acquire data on the percentage of influenza-like cases in multiple consecutive time units and establish an autoregressive integrated moving average (ARIMA) model.
  • Step S20: Obtain public opinion keywords, obtain public opinion data sequences in the multiple time units according to the public opinion keywords, use the public opinion data in the sequences as prediction features, and train an xgboost prediction model built on the xgboost algorithm to determine the model parameters.
  • The public opinion keywords related to influenza mainly include influenza virus, high fever, cough, nasal congestion, crack, Tylenol, upper respiratory tract infection, influenza A, and the like. Public opinion data for the target area to be predicted are obtained from preset channels according to these keywords.
  • The preset channels include Baidu Search and Weibo.
  • The public opinion data mainly include the Baidu search index of the above keywords and the number of Weibo posts for them. If a certain area is the object of analysis, that area is taken as the target area, and the Baidu search index and the number of Weibo posts for the public opinion keywords in that area are obtained.
  • With the week as the time unit, the Baidu search index and the number of Weibo posts for each week of the past 5 years are obtained as public opinion data.
  • The public opinion data for one keyword on one preset channel then form a sequence of 260 data points; each data point in the sequence is a candidate feature, and all candidate features constitute the candidate feature set. The features in this set are used to train an xgboost prediction model built on the xgboost algorithm to determine the model parameters.
  • step S20 may include the following detailed steps:
  • Determine the public opinion keywords, obtain public opinion data sequences in multiple consecutive time units according to the keywords, and use the public opinion data in the sequences as candidate features to construct a candidate feature set;
  • Apply wavelet denoising and detrending to the candidate features; determine a preset number of features, and select that number of candidate features from the denoised and detrended candidate feature set to form a prediction feature set; train the xgboost prediction model built on the xgboost algorithm using the prediction feature set and the actual observed percentages of influenza-like cases in the multiple consecutive time units to determine the model parameters.
  • The wavelet denoising and detrending are implemented as follows:
  • Determine a wavelet basis function, perform wavelet decomposition on the sequence formed by each feature in the candidate feature set according to that function, and determine the number of decomposition levels; determine the wavelet denoising threshold, and adjust the coefficients at each level of the decomposed features according to that threshold;
  • perform inverse-transform reconstruction on the adjusted wavelet coefficients to obtain the denoised candidate features; the candidate feature corresponding to each time unit in the denoised candidate feature set is then detrended.
  • A wavelet basis function is determined, the sequence formed by each feature in the candidate feature set is decomposed according to that function, and the number of decomposition levels is determined. For example, wavelet decomposition is performed on the weekly Baidu index series for the public opinion keyword "high fever". Following the principle of choosing a wavelet whose shape is close to the waveform of the measured signal, db4 is selected as the wavelet basis function for decomposing the public opinion data. To choose the decomposition scale, different scales within a certain range are tested given the length of the public opinion data, and the number of decomposition levels that gives better denoising with lower signal distortion is selected.
  • A soft-threshold algorithm sets the smaller wavelet coefficients to zero and shrinks the larger ones toward zero, thereby adjusting the coefficients at each level of the decomposed candidate feature.
  • The specific formula is as follows, where w is the coefficient before adjustment and d is the adjusted coefficient:
  • Inverse transform reconstruction is performed on the adjusted wavelet coefficients to obtain candidate features after denoising.
  • For the candidate feature at each time unit, linear regression is performed on the data of multiple consecutive time units before that time unit to build a trend prediction model, and the baseline predicted value for that time unit is obtained from the model;
  • the baseline predicted value is subtracted from the actual value of the candidate feature in that time unit to obtain the detrended candidate feature.
  • For example, the data of the preceding 52 weeks are used for linear regression to build the trend prediction model. If a data point has less than 52 weeks of history, all of its historical data are used instead. The baseline predicted value of the current data point is obtained from the trend prediction model and subtracted from the actual value of the feature at that point to obtain the detrended feature.
  • Different numbers of selected features can be tried, the prediction result obtained for each, and a suitable number chosen according to the accuracy of the prediction results; alternatively, in other embodiments, the number of selected features can be determined as follows:
  • The prediction features in the prediction feature set are used to train the xgboost prediction model. Specifically, the actual observed percentages of influenza-like cases in the multiple consecutive time units are obtained, and the prediction features of one week are paired with the percentage of influenza-like cases of the following week to form a training sample.
  • Data from multiple consecutive weeks before the current prediction week, which reflect the latest trend in influenza activity, are selected; for example, the data of the 52 weeks preceding the current prediction week are used as the training set for rolling prediction.
  • With gbtree (glossed in the source as "general balanced trees") as the tree model,
  • a forward stagewise algorithm is used to construct each new regression tree to fit the residuals of the current model; the regularization term is optimized to suppress overfitting, and processing is parallelized to improve the performance of the algorithm.
  • According to the ARIMA model and the xgboost prediction model, an influenza prediction model based on the Kalman filter algorithm is constructed.
  • The first predicted value y_A output by the ARIMA model for the target time unit k is used as the measured value of the state variable, obtained through the measurement equation of the discrete-time process, and the second predicted value y_x output by the xgboost prediction model for the target time unit k is taken as the prior estimate of the state variable.
  • The current Kalman gain is then calculated, and the weights in the combined influenza prediction model are determined according to the Kalman gain.
  • The predicted value of the influenza prediction model, that is, the posterior estimate of the state variable in the Kalman filter, can then be obtained.
  • The expression is:
  • K_k is the Kalman gain, a single number (scalar) in this embodiment, which determines the weights of the ARIMA model and the xgboost prediction model in the combined prediction model.
  • The covariance of the prior estimation error at time k can be calculated from the covariance of the posterior estimation error at time k-1, where A is the n × n matrix that linearly maps the state at the previous time k-1 to the state at the current time k.
  • A may change with time; here it is assumed to be constant and is set to 1 in this embodiment.
  • The observation noise covariance R is taken as the covariance of the historical prediction errors of the xgboost prediction model.
  • The process excitation noise covariance Q is taken as the covariance of the historical prediction errors of the ARIMA model.
  • k denotes the time index of the current prediction,
  • and k-1 denotes the time step before k.
  • In the influenza prediction process, they denote the current week and the previous week, respectively.
  • After the posterior covariance P_{k-1} of the state at time k-1 is updated, the prior covariance at time k is calculated forward. The updated Kalman gain K_k, that is, the weight of the model combination, is then obtained from the iterative formula for K_k in the Kalman filter. In other words, once the two models have produced their predicted values for time k-1 (the week before the current week), the Kalman gain is calculated, which updates the weights of the influenza prediction model once, and the updated influenza prediction model is then used for prediction.
  • the apparatus for generating an influenza prediction model proposed in this embodiment obtains percentage data of influenza-like cases in multiple consecutive time units and establishes an autoregressive integrated moving average (ARIMA) model; it obtains public opinion keywords and, according to them, public opinion data sequences in the multiple time units.
  • the public opinion data in the sequences is used as prediction features to train an xgboost prediction model built on the xgboost algorithm and determine the model parameters; an influenza prediction model based on the Kalman filter algorithm is then constructed from the ARIMA model and the xgboost prediction model.
  • while the influenza prediction model is used for influenza prediction, the first predicted value of the ARIMA model for the target time unit is taken as the measured value of the state variable, and the second predicted value of the xgboost prediction model for the target time unit is taken as the prior estimate of the state variable.
  • the Kalman gain of the current influenza prediction model is calculated; the weights of the two models in the influenza prediction model are updated according to the calculated Kalman gain, and the influenza prediction model with updated weights is used to predict the percentage of influenza-like cases in the next time unit. In this way the weights of the two models are updated dynamically.
  • model fusion based on Kalman filtering takes into account the variation pattern of the time series itself and incorporates public opinion data to correct disturbances to the series, which makes the prediction more accurate; by adjusting the model weights dynamically in real time, the combined prediction model leans toward the output of the model that currently performs better, which improves the accuracy of the prediction model.
  • the model generation program may be further divided into one or more modules, which are stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to carry out this application.
  • a module, as referred to in this application, is a series of computer program instruction segments capable of performing a specific function, and is used to describe the execution of the model generation program in the apparatus for generating an influenza prediction model.
  • FIG. 3 is a schematic diagram of the program modules of the model generation program in an embodiment of the apparatus for generating an influenza prediction model of the present application.
  • the model generation program may be divided into a first prediction module 10, a second prediction module 20, a model combination module 30, a gain calculation module 40, and a model update module 50, for example:
  • the first prediction module 10 is configured to: obtain data on the percentage of influenza-like cases in multiple consecutive time units, and establish an autoregressive integrated moving average ARIMA model;
  • the second prediction module 20 is configured to: obtain public opinion keywords, obtain public opinion data sequences in the multiple time units according to the public opinion keywords, use the public opinion data in the public opinion data sequences as prediction features, and train an xgboost prediction model built on the xgboost algorithm to determine the model parameters;
  • the model combination module 30 is configured to construct an influenza prediction model based on the Kalman filter algorithm according to the ARIMA model and the xgboost prediction model;
  • the gain calculation module 40 is configured to: use the first predicted value of the ARIMA model for the target time unit as the measured value of the state variable, use the second predicted value of the xgboost prediction model for the target time unit as the prior estimate of the state variable, and calculate the Kalman gain of the current influenza prediction model;
  • the model update module 50 is configured to: update the weights of the ARIMA model and the xgboost prediction model in the influenza prediction model according to the calculated Kalman gain, the influenza prediction model with updated weights being used to predict the percentage of influenza-like cases in the time unit following the target time unit.
  • an embodiment of the present application further provides a computer-readable storage medium.
  • the computer-readable storage medium stores a model generation program, and the model generation program may be executed by one or more processors to implement the following operations:

Landscapes

  • Public Health (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Pathology (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Complex Calculations (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

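The present application discloses a method for generating an influenza prediction model. The method includes: obtaining percentage data of influenza-like cases in multiple consecutive time units and establishing an autoregressive integrated moving average (ARIMA) model; obtaining public opinion data sequences in the multiple time units according to public opinion keywords, using the public opinion data in the sequences as prediction features, and training an xgboost prediction model built on the xgboost algorithm to determine the model parameters; constructing an influenza prediction model based on the Kalman filter algorithm from the ARIMA model and the xgboost prediction model; and, while the influenza prediction model is used for influenza prediction, taking the first predicted value of the ARIMA model as the measured value of the state variable and the second predicted value of the xgboost prediction model as the prior estimate of the state variable, and dynamically updating the Kalman gain of the influenza prediction model. The present application also provides a device for generating an influenza prediction model and a computer-readable storage medium. The present application improves the prediction accuracy of the influenza prediction model.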

Description

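Method and device for generating influenza prediction model, and computer-readable storage medium
Under the Paris Convention, this application claims priority to Chinese patent application No. 201810543749.9, filed on May 31, 2018 and entitled "Method and device for generating influenza prediction model, and computer-readable storage medium", the entire content of which is incorporated herein by reference.
Technical Field
The present application relates to the field of computer technology, and in particular to a method and a device for generating an influenza prediction model, and a computer-readable storage medium.
Background
At present, influenza prediction generally uses a time series model based on the autocorrelation of the time series, or a regression model built on exogenous features, or combines different models for prediction. Combining models exploits the strengths of each algorithm: it fits the variation pattern of the series itself while letting exogenous features correct the time series model, which improves the generalization ability of the model.
However, the commonly used combination method is averaging: the mean of the prediction results of the different models is taken as the prediction result of the combined model. This approach cannot judge the predictive ability of each model and therefore cannot dynamically adjust the model weights, so the prediction accuracy of the combined model is low.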
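Summary
The present application provides a method and a device for generating an influenza prediction model, and a computer-readable storage medium, with the main purpose of improving the prediction accuracy of the influenza prediction model.
To achieve the above purpose, the present application provides a method for generating an influenza prediction model, the method including:
obtaining percentage data of influenza-like cases in multiple consecutive time units, and establishing an autoregressive integrated moving average (ARIMA) model;
obtaining public opinion keywords, obtaining public opinion data sequences in the multiple time units according to the public opinion keywords, using the public opinion data in the public opinion data sequences as prediction features, and training an xgboost prediction model built on the xgboost algorithm to determine the model parameters;
constructing an influenza prediction model based on the Kalman filter algorithm from the ARIMA model and the xgboost prediction model;
taking the first predicted value of the ARIMA model for a target time unit as the measured value of the state variable and the second predicted value of the xgboost prediction model for the target time unit as the prior estimate of the state variable, and calculating the Kalman gain of the current influenza prediction model;
updating the weights of the ARIMA model and the xgboost prediction model in the influenza prediction model according to the calculated Kalman gain, the influenza prediction model with updated weights being used to predict the percentage of influenza-like cases in the time unit following the target time unit.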
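In addition, to achieve the above purpose, the present application also provides a device for generating an influenza prediction model. The device includes a memory and a processor, the memory stores a model generation program that can run on the processor, and the model generation program implements the following steps when executed by the processor:
obtaining percentage data of influenza-like cases in multiple consecutive time units, and establishing an autoregressive integrated moving average (ARIMA) model;
obtaining public opinion keywords, obtaining public opinion data sequences in the multiple time units according to the public opinion keywords, using the public opinion data in the public opinion data sequences as prediction features, and training an xgboost prediction model built on the xgboost algorithm to determine the model parameters;
constructing an influenza prediction model based on the Kalman filter algorithm from the ARIMA model and the xgboost prediction model;
taking the first predicted value of the ARIMA model for a target time unit as the measured value of the state variable and the second predicted value of the xgboost prediction model for the target time unit as the prior estimate of the state variable, and calculating the Kalman gain of the current influenza prediction model;
updating the weights of the ARIMA model and the xgboost prediction model in the influenza prediction model according to the calculated Kalman gain, the influenza prediction model with updated weights being used to predict the percentage of influenza-like cases in the time unit following the target time unit.
In addition, to achieve the above purpose, the present application also provides a computer-readable storage medium storing a model generation program, which can be executed by one or more processors to implement the steps of the method for generating an influenza prediction model described above.
The method and device for generating an influenza prediction model and the computer-readable storage medium proposed in the present application obtain percentage data of influenza-like cases in multiple consecutive time units and establish an autoregressive integrated moving average (ARIMA) model; obtain public opinion keywords, obtain public opinion data sequences in multiple time units according to the keywords, use the public opinion data in the sequences as prediction features, and train an xgboost prediction model built on the xgboost algorithm to determine the model parameters; and construct an influenza prediction model based on the Kalman filter algorithm from the ARIMA model and the xgboost prediction model. While the influenza prediction model is used for influenza prediction, the first predicted value of the ARIMA model for the target time unit is taken as the measured value of the state variable, the second predicted value of the xgboost prediction model for the target time unit is taken as the prior estimate of the state variable, and the Kalman gain of the current influenza prediction model is calculated. The weights of the two models in the influenza prediction model are updated according to this Kalman gain, and the influenza prediction model with updated weights is used to predict the percentage of influenza-like cases in the next time unit. In this way the weights of the two models are updated dynamically, so that the combined prediction model leans toward the output of the model that currently performs better, which improves the accuracy of the prediction model.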
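Brief Description of the Drawings
Figure 1 is a schematic flowchart of the method for generating an influenza prediction model provided by an embodiment of the present application;
Figure 2 is a schematic diagram of the internal structure of the device for generating an influenza prediction model provided by an embodiment of the present application;
Figure 3 is a schematic diagram of the modules of the model generation program in the device for generating an influenza prediction model provided by an embodiment of the present application.
The realization of the purpose, the functional features, and the advantages of the present application will be further described with reference to the embodiments and the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described here are only intended to explain the present application and not to limit it.
The present application provides a method for generating an influenza prediction model. Figure 1 is a schematic flowchart of the method provided by an embodiment of the present application. The method may be executed by a device, and the device may be implemented in software and/or hardware.
In this embodiment, the method for generating an influenza prediction model includes: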
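Step S10: obtain percentage data of influenza-like cases in multiple consecutive time units, and establish an autoregressive integrated moving average (ARIMA) model.
Percentage data of influenza-like cases in multiple time units is obtained, and an ARIMA (Autoregressive Integrated Moving Average) model is established based on the autocorrelation of the time series itself. For example, to predict the percentage of influenza-like cases in a target time unit, the historical percentage data of influenza-like cases in multiple consecutive time units preceding that time unit is obtained to establish the ARIMA model. In this embodiment, influenza is predicted with the week as the time unit.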
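To make the idea of a one-step-ahead forecast from the weekly series concrete, the following is a minimal sketch only. The source does not state the ARIMA order or any fitting procedure, so the order (1,1,0), the least-squares fit, and the names `fit_arima_110` and `forecast_next` are all assumptions made for illustration; a real implementation would more likely rely on a statistics library.

```python
def fit_arima_110(series):
    """Fit an ARIMA(1,1,0) model by ordinary least squares (illustrative assumption).

    The series is differenced once (d=1) and an AR(1) model with intercept is
    fitted to the differences: diff[t] ~= c + phi * diff[t-1].
    Returns (c, phi).
    """
    diffs = [b - a for a, b in zip(series, series[1:])]
    x, y = diffs[:-1], diffs[1:]
    n = len(x)
    if n < 2:
        raise ValueError("need at least 4 observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    phi = sxy / sxx if sxx else 0.0
    c = mean_y - phi * mean_x
    return c, phi


def forecast_next(series, c, phi):
    """Forecast the next value: last value plus the predicted next difference."""
    last_diff = series[-1] - series[-2]
    return series[-1] + c + phi * last_diff
```

In this sketch the forecast for the target week would play the role of the first predicted value of step S40.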
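Step S20: obtain public opinion keywords, obtain public opinion data sequences in the multiple time units according to the public opinion keywords, use the public opinion data in the public opinion data sequences as prediction features, and train an xgboost prediction model built on the xgboost algorithm to determine the model parameters.
In the embodiments of the present application, the influenza-related public opinion keywords mainly include influenza virus, high fever, cough, nasal congestion, Kuaike (快克), Tylenol (泰诺), upper respiratory tract infection, cough relief, influenza A, and other keywords. Public opinion data for the target region to be predicted is obtained from preset channels according to these keywords. The preset channels include Baidu search and social networks such as Weibo, and the public opinion data mainly includes the Baidu search index of each keyword on Baidu and the number of posts containing the keyword on Weibo. If a particular region is the object of analysis, that region is taken as the target region, and the Baidu search index and the number of Weibo posts for the keywords in that region are obtained.
In addition, in this embodiment the week is the time unit. For each week in the past 5 years, the Baidu search index of each keyword on Baidu and the number of posts on Weibo are obtained as public opinion data. For each public opinion keyword, its data on one preset channel forms a sequence of 260 data points (5 years of 52 weeks each); each data point in the sequence is a candidate feature, and all candidate features form the candidate feature set. The features in this set are used to train an xgboost prediction model built on the xgboost (eXtreme Gradient Boosting) algorithm to determine the model parameters.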
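Further, in some embodiments, to improve the relevance of the features, the features in the candidate feature set are preprocessed and then screened, and the screened features are used to train the xgboost prediction model. Specifically, this may include the following detailed steps:
Public opinion keywords are determined, public opinion data sequences in multiple consecutive time units are obtained according to the keywords, and the public opinion data in the sequences is used as candidate features to construct a candidate feature set; wavelet denoising and detrending are performed on the candidate features in the candidate feature set; a preset number of features is determined, and the preset number of candidate features is screened out of the denoised and detrended candidate feature set to form a prediction feature set; and an xgboost prediction model built on the xgboost algorithm is trained with the prediction feature set and the actual observed percentages of influenza-like cases in the multiple consecutive time units, to determine the model parameters.
Wavelet denoising and detrending are implemented as follows: determine a wavelet basis function, perform wavelet decomposition on the sequence formed by each feature in the candidate feature set according to the wavelet basis function, and determine the number of decomposition levels; determine a wavelet denoising threshold, and adjust the coefficients at each level of the decomposed prediction features according to the threshold; reconstruct the adjusted wavelet coefficients by inverse transform to obtain the denoised candidate features; for the candidate feature corresponding to each time unit in the denoised candidate feature set, perform linear regression on the data of multiple consecutive time units preceding that time unit to build a trend prediction model, and obtain the baseline predicted value for that time unit from the trend prediction model; subtract the baseline predicted value from the actual value of the candidate feature for that time unit to obtain the detrended candidate feature.
A wavelet basis function is determined, wavelet decomposition is performed on the sequence formed by each feature in the candidate feature set according to the wavelet basis function, and the number of decomposition levels is determined. For example, to decompose the sequence formed by the weekly Baidu index of the keyword "high fever", db4 is selected as the wavelet basis function for the public opinion data, on the principle that the basis should resemble the waveform of the measured signal. The number of decomposition levels is chosen by testing different levels within a range determined by the length of the public opinion data and selecting the level that denoises well with low signal distortion. A wavelet denoising threshold is then determined, and the coefficients at each level of the decomposed candidate features are adjusted according to the threshold. Specifically, the threshold thr is determined from the length N of each feature sequence; assuming the historical data of the past 52 weeks is used, the length of each feature sequence is N = 52: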
Figure PCTCN2018102221-appb-000001
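A soft thresholding algorithm is used: smaller wavelet coefficients are set to zero and larger wavelet coefficients are shrunk toward zero, which adjusts the coefficients at each level of the decomposed candidate features. The formula is as follows, where w is the coefficient before adjustment and d is the coefficient after adjustment:
$d = \begin{cases} \operatorname{sign}(w)\,(|w| - thr), & |w| \ge thr \\ 0, & |w| < thr \end{cases}$
The adjusted wavelet coefficients are reconstructed by inverse transform to obtain the denoised candidate features.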
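The thresholding step described above can be sketched as follows. This covers only the coefficient adjustment, not the db4 decomposition or reconstruction. The threshold formula in the source is an image that is not reproduced in the text, so the universal threshold sqrt(2 ln N) used in `universal_threshold` is an assumption, as are both function names.

```python
import math


def universal_threshold(n):
    """Threshold that depends only on the sequence length N.

    Assumption: the common universal threshold sqrt(2 * ln N); the source
    only says that thr is determined from N.
    """
    return math.sqrt(2.0 * math.log(n))


def soft_threshold(coeffs, thr):
    """Soft thresholding: zero small coefficients, shrink large ones toward zero."""
    out = []
    for w in coeffs:
        if abs(w) <= thr:
            out.append(0.0)
        else:
            # Keep the sign of w and reduce its magnitude by thr.
            out.append(math.copysign(abs(w) - thr, w))
    return out
```

For a 52-week sequence this assumed threshold is roughly 2.81, so a detail coefficient of 3.0 would shrink to about 0.19 and one of 1.5 would be set to zero.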
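For the candidate feature corresponding to each time unit in the denoised candidate feature set, linear regression is performed on the data of multiple consecutive time units preceding that time unit to build a trend prediction model, and the baseline predicted value for that time unit is obtained from the trend prediction model. The baseline predicted value is subtracted from the actual value of the candidate feature for that time unit to obtain the detrended candidate feature.
For example, for each data point of a denoised candidate feature (that is, the candidate feature corresponding to one time unit), linear regression is performed on the data of the preceding 52 weeks to build the trend prediction model. If fewer than 52 weeks of historical data are available for a data point, all of the available historical data is used for the regression. The trend prediction model gives the baseline predicted value for the current data point, and subtracting this baseline from the actual value of the prediction feature at the current point gives the detrended prediction feature.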
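The rolling detrending described above might look like the sketch below. It assumes that the regressor is simply the week index within the window and that a point with no history is given a detrended value of zero; neither detail is stated in the source, and the names `linear_fit` and `detrend` are hypothetical.

```python
def linear_fit(ys):
    """Least-squares line through (0, ys[0]), (1, ys[1]), ...; returns (slope, intercept)."""
    n = len(ys)
    mean_x = (n - 1) / 2.0
    mean_y = sum(ys) / n
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    if sxx == 0:
        # A single point: flat line at that value.
        return 0.0, mean_y
    sxy = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def detrend(series, window=52):
    """Subtract from each point the baseline extrapolated from up to `window` prior points."""
    out = []
    for t, value in enumerate(series):
        history = series[max(0, t - window):t]
        if not history:
            out.append(0.0)  # assumption: no history means no trend estimate
            continue
        slope, intercept = linear_fit(history)
        baseline = intercept + slope * len(history)  # one step past the window
        out.append(value - baseline)
    return out
```

A point that lies on the line through its history comes out as zero, while a spike above the trend keeps only its excess over the baseline.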
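Optionally, in some embodiments, different numbers of screened features may be tried, the prediction results obtained, and a suitable number selected according to prediction accuracy. Alternatively, in other embodiments, the number of features to screen may be determined as follows:
A model built on the xgboost algorithm is used as the learner, the candidate features in the candidate feature set are input to the learner, and recursive feature elimination with cross-validation is used to select, as the preset number, the number of features at which the model performance meets a preset condition.
After the preset number is determined, a model built on the xgboost algorithm is used as the learner, the candidate features in the candidate feature set are input to the learner, and iterations are run according to the recursive feature elimination algorithm: the model coefficients returned by the learner are obtained, and the importance of each candidate feature in the current candidate feature set is determined from them; the K least important candidate features are removed from the current candidate feature set; and these steps are repeated until the number of remaining candidate features reaches the preset number. The preset number of candidate features form the prediction feature set.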
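The elimination loop above can be sketched independently of the learner. In this sketch `importance_fn` is a hypothetical stand-in for retraining the xgboost learner on the remaining features and reading back its importances; the fixed `TOY_SCORES` table and all names here are illustrative assumptions, not part of the source.

```python
def recursive_feature_elimination(features, importance_fn, n_keep, step=1):
    """Repeatedly drop the `step` least important features until n_keep remain.

    importance_fn(selected) must return a dict mapping each selected feature
    name to an importance score; in a real pipeline it would refit the learner.
    """
    selected = list(features)
    while len(selected) > n_keep:
        scores = importance_fn(selected)
        n_drop = min(step, len(selected) - n_keep)  # never drop below n_keep
        ranked = sorted(selected, key=lambda name: scores[name])
        dropped = set(ranked[:n_drop])
        selected = [name for name in selected if name not in dropped]
    return selected


# Toy importances standing in for a retrained learner (illustrative values only).
TOY_SCORES = {"high_fever": 0.5, "cough": 0.3, "tylenol": 0.15, "noise": 0.05}


def toy_importance(selected):
    return {name: TOY_SCORES[name] for name in selected}
```

Here `step` corresponds to the K of the text, and `n_keep` to the preset number of features.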
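The prediction features in the prediction feature set are used to train the xgboost prediction model. Specifically, the actual observed percentages of influenza-like cases in the multiple consecutive time units are obtained, and the prediction features of one week together with the percentage of influenza-like cases of the following week form one training sample. The data of the consecutive weeks immediately preceding the current prediction week, which reflect the latest influenza trend (for example, the 52 weeks before the current prediction week), are used as the training set for rolling prediction. The prediction model is built on the xgboost algorithm with gbtree (the tree-based booster, that is, gradient boosted trees) as the booster, and is trained with a squared error loss function; minimizing the loss determines the model parameters and yields the final xgboost prediction model. In addition, a forward stagewise algorithm is used: each new regression tree is built to fit the residual, or an approximation of the residual, of the current model; overfitting is suppressed by optimizing the regularization term, and parallel processing improves the performance of the algorithm.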
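The pairing of one week's features with the following week's percentage, restricted to a rolling window, can be sketched as below. The function name, the use of list indices as week numbers, and the exact window boundaries are assumptions made for illustration; the model fitting itself is left out.

```python
def build_training_set(features_by_week, ili_by_week, predict_week, window=52):
    """Build (features of week t, ILI% of week t+1) samples for rolling prediction.

    Only targets strictly before `predict_week` are used, and at most the
    `window` most recent samples are kept.
    """
    samples = []
    first = max(0, predict_week - 1 - window)
    for t in range(first, predict_week - 1):
        samples.append((features_by_week[t], ili_by_week[t + 1]))
    return samples
```

When the prediction week advances by one, the window slides forward by one sample, which is what makes the prediction rolling; the feature row for week `predict_week - 1` would then be the input for the actual forecast.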
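Step S30: construct an influenza prediction model based on the Kalman filter algorithm from the ARIMA model and the xgboost prediction model.
Step S40: take the first predicted value of the ARIMA model for the target time unit as the measured value of the state variable and the second predicted value of the xgboost prediction model for the target time unit as the prior estimate of the state variable, and calculate the Kalman gain of the current influenza prediction model.
Step S50: update the weights of the ARIMA model and the xgboost prediction model in the influenza prediction model according to the calculated Kalman gain; the influenza prediction model with updated weights is used to predict the percentage of influenza-like cases in the time unit following the target time unit.
The first predicted value $y_A$ output by the ARIMA model for the target time unit K is taken as the measured value of the state variable obtained through the measurement equation of the discrete-time process, and the second predicted value $y_x$ output by the xgboost prediction model for the target time unit K is taken as the prior estimate of the state variable obtained through the state transition equation of the discrete-time process. The Kalman gain of the current prediction is calculated, and the weights of the combined influenza prediction model are determined from the Kalman gain.
According to the Kalman filter algorithm, the predicted value of the influenza prediction model, that is, the posterior estimate of the state variable in the Kalman filter, is given by:
$y = y_x + K_k (y_A - H y_x)$
In this formula, the measurement gain of the measurement equation is H = 1, and $K_k$ is the Kalman gain, which is a scalar constant in this embodiment and determines the weights of the ARIMA model and the xgboost prediction model in the combined prediction model. With H = 1 the expression reduces to $y = (1 - K_k)\, y_x + K_k\, y_A$, so $K_k$ is the weight of the ARIMA prediction and $1 - K_k$ is the weight of the xgboost prediction.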
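Under the objective of minimizing the covariance of the posterior estimation error, the iterative formula for $K_k$ in the Kalman filter is:
$K_k = P_k^- H^T (H P_k^- H^T + R)^{-1}$
where the covariance of the prior estimation error is
$P_k^- = A P_{k-1} A^T + Q$
$P_k^-$ is the covariance of the prior estimation error. As the formula shows, the covariance of the prior estimation error at time k can be calculated from the covariance of the posterior estimation error at time k-1. A is the n×n gain matrix that linearly maps the state at the previous time k-1 to the state at the current time k. In practice A may change over time; here it is assumed to be constant and is set to 1 in this embodiment. The observation noise covariance R is taken as the covariance of the historical prediction errors of the xgboost prediction model, and the process excitation noise covariance Q is taken as the covariance of the historical prediction errors of the ARIMA model. In the formulas, k is the time index of the current prediction and k-1 is the time immediately before k; in the influenza prediction process they denote the current week and the previous week.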
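After the predicted values of the ARIMA model and the xgboost prediction model at time k-1 are obtained, the posterior covariance $P_{k-1}$ of the state at time k-1 is updated, and the prior covariance $P_k^-$ at time k is then projected forward. The updated Kalman gain $K_k$, that is, the weight of the model combination, is then calculated from the iterative formula for $K_k$ in the Kalman filter. In other words, after the two models have each produced their predicted values at time k-1 (the week before the current week), the Kalman gain is calculated, which updates the weights of the influenza prediction model once, and the updated influenza prediction model is used to predict the percentage of influenza-like cases at time k (the current week). The output of the combined prediction model is calculated from the formula $y = y_x + K_k (y_A - H y_x)$ and taken as the final prediction result.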
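For the scalar case used in this embodiment (A = 1, H = 1), one fusion step can be sketched as follows. The function names are hypothetical, the posterior covariance update `(1 - K*H) * P_prior` is the standard Kalman form rather than something quoted from the source, and the use of the population variance for Q and R is an assumption.

```python
def variance(errors):
    """Population variance of historical prediction errors (assumed estimator for Q and R)."""
    n = len(errors)
    mean = sum(errors) / n
    return sum((e - mean) ** 2 for e in errors) / n


def kalman_fuse(y_x, y_a, p_prev, q, r, a=1.0, h=1.0):
    """One scalar Kalman step combining the two model outputs.

    y_x: xgboost prediction, used as the prior estimate
    y_a: ARIMA prediction, used as the measurement
    p_prev: posterior error covariance from the previous week
    q, r: process and observation noise covariances
    Returns (fused prediction, Kalman gain, new posterior covariance).
    """
    p_prior = a * p_prev * a + q                 # prior covariance at time k
    k = p_prior * h / (h * p_prior * h + r)      # Kalman gain
    y = y_x + k * (y_a - h * y_x)                # posterior estimate
    p_post = (1.0 - k * h) * p_prior             # posterior covariance for the next step
    return y, k, p_post
```

Calling `kalman_fuse` once per week and feeding the returned covariance back in as `p_prev` gives the weekly weight update described in the text, with the gain acting as the weight on the ARIMA prediction.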
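The method for generating an influenza prediction model proposed in this embodiment obtains percentage data of influenza-like cases in multiple consecutive time units and establishes an autoregressive integrated moving average (ARIMA) model; obtains public opinion keywords, obtains public opinion data sequences in multiple time units according to the keywords, uses the public opinion data in the sequences as prediction features, and trains an xgboost prediction model built on the xgboost algorithm to determine the model parameters; and constructs an influenza prediction model based on the Kalman filter algorithm from the ARIMA model and the xgboost prediction model. While the influenza prediction model is used for influenza prediction, the first predicted value of the ARIMA model for the target time unit is taken as the measured value of the state variable, the second predicted value of the xgboost prediction model for the target time unit is taken as the prior estimate of the state variable, and the Kalman gain of the current influenza prediction model is calculated. The weights of the two models in the influenza prediction model are updated according to this Kalman gain, and the influenza prediction model with updated weights is used to predict the percentage of influenza-like cases in the next time unit. In this way the weights of the two models are updated dynamically. Model fusion based on Kalman filtering both takes into account the variation pattern of the time series itself and incorporates public opinion data to correct disturbances to the series, which makes the prediction more accurate; and by adjusting the model weights dynamically in real time, the combined prediction model leans toward the output of the model that currently performs better, which improves the accuracy of the prediction model.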
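The present application also provides a device for generating an influenza prediction model. Figure 2 is a schematic diagram of the internal structure of the device provided by an embodiment of the present application.
In this embodiment, the device 1 for generating an influenza prediction model may be a PC (Personal Computer), or a terminal device such as a smartphone, tablet computer, or portable computer. The device 1 includes at least a memory 11, a processor 12, a network interface 13, and a communication bus 14.
The memory 11 includes at least one type of readable storage medium, such as flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), magnetic memory, a magnetic disk, or an optical disc. In some embodiments the memory 11 may be an internal storage unit of the device 1, for example its hard disk. In other embodiments the memory 11 may be an external storage device of the device 1, for example a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the device 1. Further, the memory 11 may include both an internal storage unit and an external storage device of the device 1. The memory 11 can be used not only to store the application software installed on the device 1 and various kinds of data, such as the code of the model generation program 01, but also to temporarily store data that has been output or is to be output.
In some embodiments the processor 12 may be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip, used to run the program code stored in the memory 11 or to process data, for example to execute the model generation program 01.
The network interface 13 may optionally include a standard wired interface or a wireless interface (such as a WI-FI interface), and is typically used to establish a communication connection between the device 1 and other electronic devices.
The communication bus 14 implements the connections and communication between these components.
Optionally, the device 1 may further include a user interface. The user interface may include a display and an input unit such as a keyboard, and may optionally also include a standard wired interface or wireless interface. Optionally, in some embodiments the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be called a display screen or display unit, shows the information processed in the device 1 and a visual user interface.
Figure 2 shows only the device 1 with components 11-14 and the model generation program 01. Those skilled in the art will understand that the structure shown in Figure 2 does not limit the device 1, which may include fewer or more components than shown, combine certain components, or arrange the components differently.
In the embodiment of the device 1 shown in Figure 2, the memory 11 stores the model generation program 01, and the processor 12 implements the following steps when executing the model generation program 01 stored in the memory 11: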
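Step S10: obtain percentage data of influenza-like cases in multiple consecutive time units, and establish an autoregressive integrated moving average (ARIMA) model.
Percentage data of influenza-like cases in multiple time units is obtained, and an ARIMA (Autoregressive Integrated Moving Average) model is established based on the autocorrelation of the time series itself. For example, to predict the percentage of influenza-like cases in a target time unit, the historical percentage data of influenza-like cases in multiple consecutive time units preceding that time unit is obtained to establish the ARIMA model. In this embodiment, influenza is predicted with the week as the time unit.
Step S20: obtain public opinion keywords, obtain public opinion data sequences in the multiple time units according to the public opinion keywords, use the public opinion data in the public opinion data sequences as prediction features, and train an xgboost prediction model built on the xgboost algorithm to determine the model parameters.
In the embodiments of the present application, the influenza-related public opinion keywords mainly include influenza virus, high fever, cough, nasal congestion, Kuaike (快克), Tylenol (泰诺), upper respiratory tract infection, cough relief, influenza A, and other keywords. Public opinion data for the target region to be predicted is obtained from preset channels according to these keywords. The preset channels include Baidu search and social networks such as Weibo, and the public opinion data mainly includes the Baidu search index of each keyword on Baidu and the number of posts containing the keyword on Weibo. If a particular region is the object of analysis, that region is taken as the target region, and the Baidu search index and the number of Weibo posts for the keywords in that region are obtained.
In addition, in this embodiment the week is the time unit. For each week in the past 5 years, the Baidu search index of each keyword on Baidu and the number of posts on Weibo are obtained as public opinion data. For each public opinion keyword, its data on one preset channel forms a sequence of 260 data points (5 years of 52 weeks each); each data point in the sequence is a candidate feature, and all candidate features form the candidate feature set. The features in this set are used to train an xgboost prediction model built on the xgboost algorithm to determine the model parameters.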
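Further, in some embodiments, to improve the relevance of the features, the features in the candidate feature set are preprocessed and then screened, and the screened features are used to train the xgboost prediction model. Specifically, step S20 may include the following detailed steps:
Public opinion keywords are determined, public opinion data sequences in multiple consecutive time units are obtained according to the keywords, and the public opinion data in the sequences is used as candidate features to construct a candidate feature set; wavelet denoising and detrending are performed on the candidate features in the candidate feature set; a preset number of features is determined, and the preset number of candidate features is screened out of the denoised and detrended candidate feature set to form a prediction feature set; and an xgboost prediction model built on the xgboost algorithm is trained with the prediction feature set and the actual observed percentages of influenza-like cases in the multiple consecutive time units, to determine the model parameters.
Wavelet denoising and detrending are implemented as follows:
Determine a wavelet basis function, perform wavelet decomposition on the sequence formed by each feature in the candidate feature set according to the wavelet basis function, and determine the number of decomposition levels; determine a wavelet denoising threshold, and adjust the coefficients at each level of the decomposed prediction features according to the threshold; reconstruct the adjusted wavelet coefficients by inverse transform to obtain the denoised candidate features; for the candidate feature corresponding to each time unit in the denoised candidate feature set, perform linear regression on the data of multiple consecutive time units preceding that time unit to build a trend prediction model, and obtain the baseline predicted value for that time unit from the trend prediction model; subtract the baseline predicted value from the actual value of the candidate feature for that time unit to obtain the detrended candidate feature.
A wavelet basis function is determined, wavelet decomposition is performed on the sequence formed by each feature in the candidate feature set according to the wavelet basis function, and the number of decomposition levels is determined. For example, to decompose the sequence formed by the weekly Baidu index of the keyword "high fever", db4 is selected as the wavelet basis function for the public opinion data, on the principle that the basis should resemble the waveform of the measured signal. The number of decomposition levels is chosen by testing different levels within a range determined by the length of the public opinion data and selecting the level that denoises well with low signal distortion. A wavelet denoising threshold is then determined, and the coefficients at each level of the decomposed candidate features are adjusted according to the threshold. Specifically, the threshold thr is determined from the length N of each feature sequence; assuming the historical data of the past 52 weeks is used, the length of each feature sequence is N = 52: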
Figure PCTCN2018102221-appb-000007
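A soft thresholding algorithm is used: smaller wavelet coefficients are set to zero and larger wavelet coefficients are shrunk toward zero, which adjusts the coefficients at each level of the decomposed candidate features. The formula is as follows, where w is the coefficient before adjustment and d is the coefficient after adjustment:
$d = \begin{cases} \operatorname{sign}(w)\,(|w| - thr), & |w| \ge thr \\ 0, & |w| < thr \end{cases}$
The adjusted wavelet coefficients are reconstructed by inverse transform to obtain the denoised candidate features.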
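For the candidate feature corresponding to each time unit in the denoised candidate feature set, linear regression is performed on the data of multiple consecutive time units preceding that time unit to build a trend prediction model, and the baseline predicted value for that time unit is obtained from the trend prediction model. The baseline predicted value is subtracted from the actual value of the candidate feature for that time unit to obtain the detrended candidate feature.
For example, for each data point of a denoised candidate feature (that is, the candidate feature corresponding to one time unit), linear regression is performed on the data of the preceding 52 weeks to build the trend prediction model. If fewer than 52 weeks of historical data are available for a data point, all of the available historical data is used for the regression. The trend prediction model gives the baseline predicted value for the current data point, and subtracting this baseline from the actual value of the prediction feature at the current point gives the detrended prediction feature.
Optionally, in some embodiments, different numbers of screened features may be tried, the prediction results obtained, and a suitable number selected according to prediction accuracy. Alternatively, in other embodiments, the number of features to screen may be determined as follows:
A model built on the xgboost algorithm is used as the learner, the candidate features in the candidate feature set are input to the learner, and recursive feature elimination with cross-validation is used to select, as the preset number, the number of features at which the model performance meets a preset condition.
After the preset number is determined, a model built on the xgboost algorithm is used as the learner, the candidate features in the candidate feature set are input to the learner, and iterations are run according to the recursive feature elimination algorithm: the model coefficients returned by the learner are obtained, and the importance of each candidate feature in the current candidate feature set is determined from them; the K least important candidate features are removed from the current candidate feature set; and these steps are repeated until the number of remaining candidate features reaches the preset number. The preset number of candidate features form the prediction feature set.
The prediction features in the prediction feature set are used to train the xgboost prediction model. Specifically, the actual observed percentages of influenza-like cases in the multiple consecutive time units are obtained, and the prediction features of one week together with the percentage of influenza-like cases of the following week form one training sample. The data of the consecutive weeks immediately preceding the current prediction week, which reflect the latest influenza trend (for example, the 52 weeks before the current prediction week), are used as the training set for rolling prediction. The prediction model is built on the xgboost algorithm with gbtree (the tree-based booster, that is, gradient boosted trees) as the booster, and is trained with a squared error loss function; minimizing the loss determines the model parameters and yields the final xgboost prediction model. In addition, a forward stagewise algorithm is used: each new regression tree is built to fit the residual, or an approximation of the residual, of the current model; overfitting is suppressed by optimizing the regularization term, and parallel processing improves the performance of the algorithm.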
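An influenza prediction model based on the Kalman filter algorithm is constructed from the ARIMA model and the xgboost prediction model.
The first predicted value of the ARIMA model for the target time unit is taken as the measured value of the state variable, the second predicted value of the xgboost prediction model for the target time unit is taken as the prior estimate of the state variable, and the Kalman gain of the current influenza prediction model is calculated.
The weights of the ARIMA model and the xgboost prediction model in the influenza prediction model are updated according to the calculated Kalman gain; the influenza prediction model with updated weights is used to predict the percentage of influenza-like cases in the time unit following the target time unit.
The first predicted value $y_A$ output by the ARIMA model for the target time unit K is taken as the measured value of the state variable obtained through the measurement equation of the discrete-time process, and the second predicted value $y_x$ output by the xgboost prediction model for the target time unit K is taken as the prior estimate of the state variable obtained through the state transition equation of the discrete-time process. The Kalman gain of the current prediction is calculated, and the weights of the combined influenza prediction model are determined from the Kalman gain.
According to the Kalman filter algorithm, the predicted value of the influenza prediction model, that is, the posterior estimate of the state variable in the Kalman filter, is given by:
$y = y_x + K_k (y_A - H y_x)$
In this formula, the measurement gain of the measurement equation is H = 1, and $K_k$ is the Kalman gain, which is a scalar constant in this embodiment and determines the weights of the ARIMA model and the xgboost prediction model in the combined prediction model.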
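Under the objective of minimizing the covariance of the posterior estimation error, the iterative formula for $K_k$ in the Kalman filter is:
$K_k = P_k^- H^T (H P_k^- H^T + R)^{-1}$
where the covariance of the prior estimation error is
$P_k^- = A P_{k-1} A^T + Q$
$P_k^-$ is the covariance of the prior estimation error. As the formula shows, the covariance of the prior estimation error at time k can be calculated from the covariance of the posterior estimation error at time k-1. A is the n×n gain matrix that linearly maps the state at the previous time k-1 to the state at the current time k. In practice A may change over time; here it is assumed to be constant and is set to 1 in this embodiment. The observation noise covariance R is taken as the covariance of the historical prediction errors of the xgboost prediction model, and the process excitation noise covariance Q is taken as the covariance of the historical prediction errors of the ARIMA model. In the formulas, k is the time index of the current prediction and k-1 is the time immediately before k; in the influenza prediction process they denote the current week and the previous week.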
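After the predicted values of the ARIMA model and the xgboost prediction model at time k-1 are obtained, the posterior covariance $P_{k-1}$ of the state at time k-1 is updated, and the prior covariance $P_k^-$ at time k is then projected forward. The updated Kalman gain $K_k$, that is, the weight of the model combination, is then calculated from the iterative formula for $K_k$ in the Kalman filter. In other words, after the two models have each produced their predicted values at time k-1 (the week before the current week), the Kalman gain is calculated, which updates the weights of the influenza prediction model once, and the updated influenza prediction model is used to predict the percentage of influenza-like cases at time k (the current week). The output of the combined prediction model is calculated from the formula $y = y_x + K_k (y_A - H y_x)$ and taken as the final prediction result.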
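The device for generating an influenza prediction model proposed in this embodiment obtains percentage data of influenza-like cases in multiple consecutive time units and establishes an autoregressive integrated moving average (ARIMA) model; obtains public opinion keywords, obtains public opinion data sequences in multiple time units according to the keywords, uses the public opinion data in the sequences as prediction features, and trains an xgboost prediction model built on the xgboost algorithm to determine the model parameters; and constructs an influenza prediction model based on the Kalman filter algorithm from the ARIMA model and the xgboost prediction model. While the influenza prediction model is used for influenza prediction, the first predicted value of the ARIMA model for the target time unit is taken as the measured value of the state variable, the second predicted value of the xgboost prediction model for the target time unit is taken as the prior estimate of the state variable, and the Kalman gain of the current influenza prediction model is calculated. The weights of the two models in the influenza prediction model are updated according to this Kalman gain, and the influenza prediction model with updated weights is used to predict the percentage of influenza-like cases in the next time unit. In this way the weights of the two models are updated dynamically. Model fusion based on Kalman filtering both takes into account the variation pattern of the time series itself and incorporates public opinion data to correct disturbances to the series, which makes the prediction more accurate; and by adjusting the model weights dynamically in real time, the combined prediction model leans toward the output of the model that currently performs better, which improves the accuracy of the prediction model.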
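Optionally, in other embodiments, the model generation program may be divided into one or more modules, which are stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to carry out the present application. A module, as referred to in the present application, is a series of computer program instruction segments capable of performing a specific function, and is used to describe the execution of the model generation program in the device for generating an influenza prediction model.
For example, Figure 3 is a schematic diagram of the program modules of the model generation program in an embodiment of the device for generating an influenza prediction model of the present application. In this embodiment, the model generation program may be divided into a first prediction module 10, a second prediction module 20, a model combination module 30, a gain calculation module 40, and a model update module 50. By way of example:
The first prediction module 10 is configured to: obtain percentage data of influenza-like cases in multiple consecutive time units, and establish an autoregressive integrated moving average (ARIMA) model;
The second prediction module 20 is configured to: obtain public opinion keywords, obtain public opinion data sequences in the multiple time units according to the public opinion keywords, use the public opinion data in the public opinion data sequences as prediction features, and train an xgboost prediction model built on the xgboost algorithm to determine the model parameters;
The model combination module 30 is configured to: construct an influenza prediction model based on the Kalman filter algorithm from the ARIMA model and the xgboost prediction model;
The gain calculation module 40 is configured to: take the first predicted value of the ARIMA model for the target time unit as the measured value of the state variable and the second predicted value of the xgboost prediction model for the target time unit as the prior estimate of the state variable, and calculate the Kalman gain of the current influenza prediction model;
The model update module 50 is configured to: update the weights of the ARIMA model and the xgboost prediction model in the influenza prediction model according to the calculated Kalman gain, the influenza prediction model with updated weights being used to predict the percentage of influenza-like cases in the time unit following the target time unit.
The functions or operation steps implemented when the first prediction module 10, the second prediction module 20, the model combination module 30, the gain calculation module 40, the model update module 50, and other program modules are executed are substantially the same as those of the embodiments above and are not repeated here.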
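In addition, an embodiment of the present application further provides a computer-readable storage medium storing a model generation program, which can be executed by one or more processors to implement the following operations:
obtaining percentage data of influenza-like cases in multiple consecutive time units, and establishing an autoregressive integrated moving average (ARIMA) model;
obtaining public opinion keywords, obtaining public opinion data sequences in the multiple time units according to the public opinion keywords, using the public opinion data in the public opinion data sequences as prediction features, and training an xgboost prediction model built on the xgboost algorithm to determine the model parameters;
constructing an influenza prediction model based on the Kalman filter algorithm from the ARIMA model and the xgboost prediction model;
taking the first predicted value of the ARIMA model for a target time unit as the measured value of the state variable and the second predicted value of the xgboost prediction model for the target time unit as the prior estimate of the state variable, and calculating the Kalman gain of the current influenza prediction model;
updating the weights of the ARIMA model and the xgboost prediction model in the influenza prediction model according to the calculated Kalman gain, the influenza prediction model with updated weights being used to predict the percentage of influenza-like cases in the time unit following the target time unit.
The specific implementation of the computer-readable storage medium of the present application is essentially the same as the embodiments of the device and method for generating an influenza prediction model described above, and is not repeated here.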
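It should be noted that the numbering of the embodiments above is for description only and does not indicate that any embodiment is better or worse. The terms "include", "comprise", and any variants of them are intended to cover non-exclusive inclusion, so that a process, device, article, or method that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to the process, device, article, or method. Without further limitation, an element introduced by "including a ..." does not exclude the presence of additional identical elements in the process, device, article, or method that includes it.
From the description of the embodiments above, those skilled in the art will clearly understand that the methods of the embodiments can be implemented by software plus the necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. On this understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, may be embodied as a software product. The computer software product is stored in a storage medium as described above (such as ROM/RAM, a magnetic disk, or an optical disc) and includes instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present application.
The above are only preferred embodiments of the present application and do not limit its patent scope. Any equivalent structure or equivalent process transformation made using the content of the specification and drawings of the present application, or any direct or indirect use in other related technical fields, is likewise included in the scope of patent protection of the present application.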

Claims (20)

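  1. A method for generating an influenza prediction model, characterized in that the method comprises:
    obtaining percentage data of influenza-like cases in multiple consecutive time units, and establishing an autoregressive integrated moving average (ARIMA) model;
    obtaining public opinion keywords, obtaining public opinion data sequences in the multiple time units according to the public opinion keywords, using the public opinion data in the public opinion data sequences as prediction features, and training an xgboost prediction model built on the xgboost algorithm to determine the model parameters;
    constructing an influenza prediction model based on the Kalman filter algorithm from the ARIMA model and the xgboost prediction model;
    taking the first predicted value of the ARIMA model for a target time unit as the measured value of the state variable and the second predicted value of the xgboost prediction model for the target time unit as the prior estimate of the state variable, and calculating the Kalman gain of the current influenza prediction model;
    updating the weights of the ARIMA model and the xgboost prediction model in the influenza prediction model according to the calculated Kalman gain, the influenza prediction model with updated weights being used to predict the percentage of influenza-like cases in the time unit following the target time unit.
  2. The method for generating an influenza prediction model according to claim 1, characterized in that the step of determining public opinion keywords, obtaining public opinion data sequences in the multiple time units according to the public opinion keywords, using the public opinion data in the public opinion data sequences as prediction features, and training an xgboost prediction model built on the xgboost algorithm to determine the model parameters comprises:
    determining public opinion keywords, obtaining public opinion data sequences in multiple consecutive time units according to the public opinion keywords, and using the public opinion data in the public opinion data sequences as candidate features to construct a candidate feature set;
    performing wavelet denoising and detrending on the candidate features in the candidate feature set;
    determining a preset number of features, and screening the preset number of candidate features out of the denoised and detrended candidate feature set to form a prediction feature set;
    training an xgboost prediction model built on the xgboost algorithm with the prediction feature set and the actual observed percentages of influenza-like cases in the multiple consecutive time units, to determine the model parameters.
  3. The method for generating an influenza prediction model according to claim 2, characterized in that the step of performing wavelet denoising and detrending on the candidate features in the candidate feature set comprises:
    determining a wavelet basis function, performing wavelet decomposition on the sequence formed by each feature in the candidate feature set according to the wavelet basis function, and determining the number of decomposition levels;
    determining a wavelet denoising threshold, and adjusting the coefficients at each level of the decomposed prediction features according to the determined threshold;
    reconstructing the adjusted wavelet coefficients by inverse transform to obtain the denoised candidate features;
    for the candidate feature corresponding to each time unit in the denoised candidate feature set, performing linear regression on the data of multiple consecutive time units preceding that time unit to build a trend prediction model, and obtaining the baseline predicted value for that time unit from the trend prediction model;
    subtracting the baseline predicted value from the actual value of the candidate feature for that time unit to obtain the detrended candidate feature.
  4. The method for generating an influenza prediction model according to claim 2, characterized in that the step of determining a preset number of features comprises:
    using a model built on the xgboost algorithm as a learner, inputting the candidate features in the candidate feature set to the learner, and using recursive feature elimination with cross-validation to select, as the preset number, the number of features at which the model performance meets a preset condition.
  5. The method for generating an influenza prediction model according to claim 3, characterized in that the step of determining a preset number of features comprises:
    using a model built on the xgboost algorithm as a learner, inputting the candidate features in the candidate feature set to the learner, and using recursive feature elimination with cross-validation to select, as the preset number, the number of features at which the model performance meets a preset condition.
  6. The method for generating an influenza prediction model according to claim 2, characterized in that the step of screening the preset number of candidate features out of the denoised and detrended candidate feature set to form a prediction feature set comprises:
    using a model built on the xgboost algorithm as a learner, inputting the candidate features in the candidate feature set to the learner, and running iterations according to the recursive feature elimination algorithm;
    obtaining the model coefficients returned by the learner, and determining the importance of each candidate feature in each candidate feature set from the model coefficients;
    removing the K least important candidate features from the current candidate feature set according to the importance of each candidate feature;
    repeating the above steps until the number of screened candidate features reaches the preset number;
    the preset number of candidate features forming the prediction feature set.
  7. The method for generating an influenza prediction model according to claim 3, characterized in that the step of screening the preset number of candidate features out of the denoised and detrended candidate feature set to form a prediction feature set comprises:
    using a model built on the xgboost algorithm as a learner, inputting the candidate features in the candidate feature set to the learner, and running iterations according to the recursive feature elimination algorithm;
    obtaining the model coefficients returned by the learner, and determining the importance of each candidate feature in each candidate feature set from the model coefficients;
    removing the K least important candidate features from the current candidate feature set according to the importance of each candidate feature;
    repeating the above steps until the number of screened candidate features reaches the preset number;
    the preset number of candidate features forming the prediction feature set.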
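  8. A device for generating an influenza prediction model, characterized in that the device comprises a memory and a processor, the memory stores a model generation program that can run on the processor, and the model generation program implements the following steps when executed by the processor:
    obtaining percentage data of influenza-like cases in multiple consecutive time units, and establishing an autoregressive integrated moving average (ARIMA) model;
    obtaining public opinion keywords, obtaining public opinion data sequences in the multiple time units according to the public opinion keywords, using the public opinion data in the public opinion data sequences as prediction features, and training an xgboost prediction model built on the xgboost algorithm to determine the model parameters;
    constructing an influenza prediction model based on the Kalman filter algorithm from the ARIMA model and the xgboost prediction model;
    taking the first predicted value of the ARIMA model for a target time unit as the measured value of the state variable and the second predicted value of the xgboost prediction model for the target time unit as the prior estimate of the state variable, and calculating the Kalman gain of the current influenza prediction model;
    updating the weights of the ARIMA model and the xgboost prediction model in the influenza prediction model according to the calculated Kalman gain, the influenza prediction model with updated weights being used to predict the percentage of influenza-like cases in the time unit following the target time unit.
  9. The device for generating an influenza prediction model according to claim 8, characterized in that the step of determining public opinion keywords, obtaining public opinion data sequences in the multiple time units according to the public opinion keywords, using the public opinion data in the public opinion data sequences as prediction features, and training an xgboost prediction model built on the xgboost algorithm to determine the model parameters comprises:
    determining public opinion keywords, obtaining public opinion data sequences in multiple consecutive time units according to the public opinion keywords, and using the public opinion data in the public opinion data sequences as candidate features to construct a candidate feature set;
    performing wavelet denoising and detrending on the candidate features in the candidate feature set;
    determining a preset number of features, and screening the preset number of candidate features out of the denoised and detrended candidate feature set to form a prediction feature set;
    training an xgboost prediction model built on the xgboost algorithm with the prediction feature set and the actual observed percentages of influenza-like cases in the multiple consecutive time units, to determine the model parameters.
  10. The device for generating an influenza prediction model according to claim 9, characterized in that the step of performing wavelet denoising and detrending on the candidate features in the candidate feature set comprises:
    determining a wavelet basis function, performing wavelet decomposition on the sequence formed by each feature in the candidate feature set according to the wavelet basis function, and determining the number of decomposition levels;
    determining a wavelet denoising threshold, and adjusting the coefficients at each level of the decomposed prediction features according to the determined threshold;
    reconstructing the adjusted wavelet coefficients by inverse transform to obtain the denoised candidate features;
    for the candidate feature corresponding to each time unit in the denoised candidate feature set, performing linear regression on the data of multiple consecutive time units preceding that time unit to build a trend prediction model, and obtaining the baseline predicted value for that time unit from the trend prediction model;
    subtracting the baseline predicted value from the actual value of the candidate feature for that time unit to obtain the detrended candidate feature.
  11. The device for generating an influenza prediction model according to claim 9, characterized in that the step of determining a preset number of features comprises:
    using a model built on the xgboost algorithm as a learner, inputting the candidate features in the candidate feature set to the learner, and using recursive feature elimination with cross-validation to select, as the preset number, the number of features at which the model performance meets a preset condition.
  12. The device for generating an influenza prediction model according to claim 10, characterized in that the step of determining a preset number of features comprises:
    using a model built on the xgboost algorithm as a learner, inputting the candidate features in the candidate feature set to the learner, and using recursive feature elimination with cross-validation to select, as the preset number, the number of features at which the model performance meets a preset condition.
  13. The device for generating an influenza prediction model according to claim 9, characterized in that the step of screening the preset number of candidate features out of the denoised and detrended candidate feature set to form a prediction feature set comprises:
    using a model built on the xgboost algorithm as a learner, inputting the candidate features in the candidate feature set to the learner, and running iterations according to the recursive feature elimination algorithm;
    obtaining the model coefficients returned by the learner, and determining the importance of each candidate feature in each candidate feature set from the model coefficients;
    removing the K least important candidate features from the current candidate feature set according to the importance of each candidate feature;
    repeating the above steps until the number of screened candidate features reaches the preset number;
    the preset number of candidate features forming the prediction feature set.
  14. The device for generating an influenza prediction model according to claim 10, characterized in that the step of screening the preset number of candidate features out of the denoised and detrended candidate feature set to form a prediction feature set comprises:
    using a model built on the xgboost algorithm as a learner, inputting the candidate features in the candidate feature set to the learner, and running iterations according to the recursive feature elimination algorithm;
    obtaining the model coefficients returned by the learner, and determining the importance of each candidate feature in each candidate feature set from the model coefficients;
    removing the K least important candidate features from the current candidate feature set according to the importance of each candidate feature;
    repeating the above steps until the number of screened candidate features reaches the preset number;
    the preset number of candidate features forming the prediction feature set.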
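  15. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a model generation program, and the model generation program can be executed by one or more processors to implement the following steps:
    obtaining percentage data of influenza-like cases in multiple consecutive time units, and establishing an autoregressive integrated moving average (ARIMA) model;
    obtaining public opinion keywords, obtaining public opinion data sequences in the multiple time units according to the public opinion keywords, using the public opinion data in the public opinion data sequences as prediction features, and training an xgboost prediction model built on the xgboost algorithm to determine the model parameters;
    constructing an influenza prediction model based on the Kalman filter algorithm from the ARIMA model and the xgboost prediction model;
    taking the first predicted value of the ARIMA model for a target time unit as the measured value of the state variable and the second predicted value of the xgboost prediction model for the target time unit as the prior estimate of the state variable, and calculating the Kalman gain of the current influenza prediction model;
    updating the weights of the ARIMA model and the xgboost prediction model in the influenza prediction model according to the calculated Kalman gain, the influenza prediction model with updated weights being used to predict the percentage of influenza-like cases in the time unit following the target time unit.
  16. The computer-readable storage medium according to claim 15, characterized in that the step of determining public opinion keywords, obtaining public opinion data sequences in the multiple time units according to the public opinion keywords, using the public opinion data in the public opinion data sequences as prediction features, and training an xgboost prediction model built on the xgboost algorithm to determine the model parameters comprises:
    determining public opinion keywords, obtaining public opinion data sequences in multiple consecutive time units according to the public opinion keywords, and using the public opinion data in the public opinion data sequences as candidate features to construct a candidate feature set;
    performing wavelet denoising and detrending on the candidate features in the candidate feature set;
    determining a preset number of features, and screening the preset number of candidate features out of the denoised and detrended candidate feature set to form a prediction feature set;
    training an xgboost prediction model built on the xgboost algorithm with the prediction feature set and the actual observed percentages of influenza-like cases in the multiple consecutive time units, to determine the model parameters.
  17. The computer-readable storage medium according to claim 16, characterized in that the step of performing wavelet denoising and detrending on the candidate features in the candidate feature set comprises:
    determining a wavelet basis function, performing wavelet decomposition on the sequence formed by each feature in the candidate feature set according to the wavelet basis function, and determining the number of decomposition levels;
    determining a wavelet denoising threshold, and adjusting the coefficients at each level of the decomposed prediction features according to the determined threshold;
    reconstructing the adjusted wavelet coefficients by inverse transform to obtain the denoised candidate features;
    for the candidate feature corresponding to each time unit in the denoised candidate feature set, performing linear regression on the data of multiple consecutive time units preceding that time unit to build a trend prediction model, and obtaining the baseline predicted value for that time unit from the trend prediction model;
    subtracting the baseline predicted value from the actual value of the candidate feature for that time unit to obtain the detrended candidate feature.
  18. The computer-readable storage medium according to claim 16, characterized in that the step of determining a preset number of features comprises:
    using a model built on the xgboost algorithm as a learner, inputting the candidate features in the candidate feature set to the learner, and using recursive feature elimination with cross-validation to select, as the preset number, the number of features at which the model performance meets a preset condition.
  19. The computer-readable storage medium according to claim 17, characterized in that the step of determining a preset number of features comprises:
    using a model built on the xgboost algorithm as a learner, inputting the candidate features in the candidate feature set to the learner, and using recursive feature elimination with cross-validation to select, as the preset number, the number of features at which the model performance meets a preset condition.
  20. The computer-readable storage medium according to claim 16, characterized in that the step of screening the preset number of candidate features out of the denoised and detrended candidate feature set to form a prediction feature set comprises:
    using a model built on the xgboost algorithm as a learner, inputting the candidate features in the candidate feature set to the learner, and running iterations according to the recursive feature elimination algorithm;
    obtaining the model coefficients returned by the learner, and determining the importance of each candidate feature in each candidate feature set from the model coefficients;
    removing the K least important candidate features from the current candidate feature set according to the importance of each candidate feature;
    repeating the above steps until the number of screened candidate features reaches the preset number;
    the preset number of candidate features forming the prediction feature set.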
PCT/CN2018/102221 2018-05-31 2018-08-24 Method and device for generating influenza prediction model, and computer-readable storage medium WO2019227716A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2019556833A JP6815708B2 (ja) 2018-05-31 2018-08-24 Method and device for generating influenza prediction model, and computer-readable storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810543749.9 2018-05-31
CN201810543749.9A CN108766585A (zh) 2018-05-31 2018-05-31 Method and device for generating influenza prediction model, and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2019227716A1 true WO2019227716A1 (zh) 2019-12-05

Family

ID=64004677

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/102221 WO2019227716A1 (zh) 2018-05-31 2018-08-24 Method and device for generating influenza prediction model, and computer-readable storage medium

Country Status (3)

Country Link
JP (1) JP6815708B2 (zh)
CN (1) CN108766585A (zh)
WO (1) WO2019227716A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
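CN111931848A (zh) * 2020-08-10 2020-11-13 中国平安人寿保险股份有限公司 Data feature extraction method and device, computer equipment, and storage medium
CN112163723A (zh) * 2020-11-02 2021-01-01 西安热工研究院有限公司 Medium- and long-term runoff prediction method for hydropower stations based on scenario division, medium, and equipment
CN112700885A (zh) * 2021-01-13 2021-04-23 大连海事大学 Method for identifying parameters of a novel coronavirus transmission model based on Kalman filtering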

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
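CN110111902B (zh) * 2019-04-04 2022-05-27 平安科技(深圳)有限公司 Method and device for predicting the onset cycle of acute infectious diseases, and storage medium
CN111242347B (zh) * 2019-12-28 2021-01-01 浙江大学 Decision support system for bridge management and maintenance based on historical weight updating
CN112015778A (zh) * 2020-08-19 2020-12-01 上海满盛信息技术有限公司 Water fingerprint prediction algorithm
CN112951440A (zh) * 2021-02-04 2021-06-11 汕头大学医学院 Method for predicting dengue transmission risk and method for determining the size of the affected population
CN113436751A (zh) * 2021-06-29 2021-09-24 山东健康医疗大数据有限公司 System and method for predicting the trend of the weekly ILI proportion
CN114360739B (zh) * 2022-01-05 2023-07-21 中国科学院地理科学与资源研究所 Dengue risk prediction method based on remote sensing cloud computing and deep learning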

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
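CN101826090A (zh) * 2009-09-15 2010-09-08 电子科技大学 Web public opinion trend prediction method based on an optimal model
CN101847179A (zh) * 2010-04-13 2010-09-29 中国疾病预防控制中心病毒病预防控制所 Method for predicting influenza antigens by model, and application thereof
CN105678080A (zh) * 2016-01-11 2016-06-15 浪潮集团有限公司 Method for predicting the likelihood of an influenza outbreak through big data search and analysis
CN107688872A (zh) * 2017-08-20 2018-02-13 平安科技(深圳)有限公司 Prediction model establishing device and method, and computer-readable storage medium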

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104517159A (zh) * 2014-12-18 2015-04-15 上海交通大学 Method for predicting short-term bus passenger flow
WO2017120579A1 (en) * 2016-01-10 2017-07-13 Presenso, Ltd. System and method for validating unsupervised machine learning models
CN105824897A (zh) * 2016-03-14 2016-08-03 湖南大学 Hybrid recommendation system and method based on Kalman filtering

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
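CN111931848A (zh) * 2020-08-10 2020-11-13 中国平安人寿保险股份有限公司 Data feature extraction method and device, computer equipment, and storage medium
CN112163723A (zh) * 2020-11-02 2021-01-01 西安热工研究院有限公司 Medium- and long-term runoff prediction method for hydropower stations based on scenario division, medium, and equipment
CN112163723B (zh) * 2020-11-02 2023-09-12 西安热工研究院有限公司 Medium- and long-term runoff prediction method for hydropower stations based on scenario division, medium, and equipment
CN112700885A (zh) * 2021-01-13 2021-04-23 大连海事大学 Method for identifying parameters of a novel coronavirus transmission model based on Kalman filtering
CN112700885B (zh) * 2021-01-13 2023-12-15 大连海事大学 Method for identifying parameters of a novel coronavirus transmission model based on Kalman filtering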

Also Published As

Publication number Publication date
JP6815708B2 (ja) 2021-01-20
JP2020525872A (ja) 2020-08-27
CN108766585A (zh) 2018-11-06

Similar Documents

Publication Publication Date Title
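WO2019227716A1 (zh) Method and device for generating influenza prediction model, and computer-readable storage medium
WO2019227711A1 (zh) Method and device for generating influenza prediction model, and computer-readable storage medium
CN110033018B (zh) Graphic similarity determination method and device, and computer-readable storage medium
CN113361578B (zh) Training method and device for image processing model, electronic equipment, and storage medium
CN108197592B (zh) Information acquisition method and device
US20170103337A1 (en) System and method to discover meaningful paths from linked open data
CN110309251B (zh) Text data processing method and device, and computer-readable storage medium
WO2020010710A1 (zh) Method and device for generating prediction model, and computer-readable storage medium
CN110597965B (zh) Sentiment polarity analysis method and device for articles, electronic equipment, and storage medium
WO2023029507A1 (zh) Service distribution method, device, and equipment based on data analysis, and storage medium
CN108985501B (zh) Stock index prediction method based on index feature extraction, server, and storage medium
CN114547267A (zh) Method and device for generating intelligent question answering model, computing equipment, and storage medium
JP2007323315A (ja) Collaborative filtering method, collaborative filtering device, collaborative filtering program, and recording medium recording the program
CN112949433B (zh) Method, device, and equipment for generating video classification model, and storage medium
CN110968802A (zh) User feature analysis method, analysis device, and readable storage medium
CN114220536A (zh) Disease analysis method, device, and equipment based on machine learning, and storage medium
CN110348581B (zh) Method, device, medium, and electronic equipment for optimizing user features in a user feature group
CN115186738B (zh) Model training method and device, and storage medium
CN113704256B (zh) Data identification method and device, electronic equipment, and storage medium
JP2020139914A (ja) Material structure analysis device, method, and program
CN116091276A (zh) Long time series prediction method, device, equipment, and medium based on deep learning
CN112783949B (zh) Human body data prediction method and device, electronic equipment, and storage medium
CN114970732A (zh) Posterior calibration method and device for classification model, computer equipment, and medium
CN116821327A (zh) Text data processing method, device, and equipment, readable storage medium, and product
CN113591570A (zh) Video processing method and device, electronic equipment, and storage medium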

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2019556833

Country of ref document: JP

Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18921258

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 23/03/2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18921258

Country of ref document: EP

Kind code of ref document: A1