WO2019196280A1 - Disease prediction method and device, computer device and readable storage medium - Google Patents

Info

Publication number
WO2019196280A1
WO2019196280A1 PCT/CN2018/099612 CN2018099612W WO2019196280A1 WO 2019196280 A1 WO2019196280 A1 WO 2019196280A1 CN 2018099612 W CN2018099612 W CN 2018099612W WO 2019196280 A1 WO2019196280 A1 WO 2019196280A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
weather
disease
layer
public opinion
Prior art date
Application number
PCT/CN2018/099612
Other languages
French (fr)
Chinese (zh)
Inventor
阮晓雯
徐亮
肖京
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司
Publication of WO2019196280A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/80 ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics, e.g. flu

Definitions

  • The present application relates to the field of prediction technologies, and in particular to a disease prediction method and apparatus, a computer apparatus, and a non-volatile readable storage medium.
  • An important task in the early warning of public health emergencies is disease prediction, which predicts future disease surveillance data based on historical disease surveillance data (i.e., patient data).
  • With the development of machine learning technology, more and more machine learning methods are applied to disease prediction.
  • However, traditional machine learning applied to disease prediction often requires manually defining feature sets and then searching them for the best feature combinations. The results are often not good enough, which limits the accuracy of disease prediction.
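  • A first aspect of the present application provides a disease prediction method, the method comprising:
  • acquiring disease monitoring data, where the disease monitoring data is time series data;
  • acquiring weather data related to the disease monitoring data, where the weather data is time series data corresponding to the disease monitoring data;
  • acquiring public opinion data related to the disease monitoring data, where the public opinion data is time series data corresponding to the disease monitoring data;
  • pre-processing the disease monitoring data, weather data, and public opinion data;
  • constructing a multi-layer gated recurrent unit neural network model, that is, a multi-layer GRU model;
  • obtaining training data and verification data from the pre-processed disease monitoring data, weather data, and public opinion data, and using the training data and the verification data to train the multi-layer GRU model and verify its performance, to obtain an optimized multi-layer GRU model; and
  • obtaining, from the pre-processed disease monitoring data, weather data, and public opinion data, the data before a predicted time point, and inputting that data into the optimized multi-layer GRU model to obtain a disease prediction result at the predicted time point.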
  • A second aspect of the present application provides a disease prediction apparatus, the apparatus comprising:
  • a first acquiring unit configured to acquire disease monitoring data, where the disease monitoring data is time series data;
  • a second acquiring unit configured to acquire weather data related to the disease monitoring data, where the weather data is time series data corresponding to the disease monitoring data;
  • a third acquiring unit configured to acquire public opinion data related to the disease monitoring data, where the public opinion data is time series data corresponding to the disease monitoring data;
  • a pre-processing unit configured to pre-process the disease monitoring data, weather data, and public opinion data;
  • a building unit configured to construct a multi-layer gated recurrent unit neural network model, that is, a multi-layer GRU model;
  • an optimization unit configured to obtain training data and verification data from the pre-processed disease monitoring data, weather data, and public opinion data, and to use the training data and the verification data to train the multi-layer GRU model and verify its performance, obtaining an optimized multi-layer GRU model; and
  • a prediction unit configured to obtain, from the pre-processed disease monitoring data, weather data, and public opinion data, the data before the predicted time point, and to input that data into the optimized multi-layer GRU model to obtain the disease prediction result at the predicted time point.
  • A third aspect of the present application provides a computer apparatus comprising a memory and a processor, the memory storing at least one computer readable instruction, and the processor executing the at least one computer readable instruction to implement the following steps:
  • acquiring disease monitoring data, where the disease monitoring data is time series data;
  • acquiring weather data and public opinion data related to the disease monitoring data, where the weather data and the public opinion data are time series data corresponding to the disease monitoring data;
  • pre-processing the disease monitoring data, weather data, and public opinion data;
  • constructing a multi-layer GRU model, and training and verifying it with training data and verification data obtained from the pre-processed data, to obtain an optimized multi-layer GRU model; and
  • inputting the pre-processed disease monitoring data, weather data, and public opinion data before a predicted time point into the optimized multi-layer GRU model to obtain a disease prediction result at the predicted time point.
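  • A fourth aspect of the present application provides a non-volatile readable storage medium storing at least one computer readable instruction that, when executed by a processor, implements the following steps:
  • acquiring disease monitoring data, where the disease monitoring data is time series data;
  • acquiring weather data and public opinion data related to the disease monitoring data, where the weather data and the public opinion data are time series data corresponding to the disease monitoring data;
  • pre-processing the disease monitoring data, weather data, and public opinion data;
  • constructing a multi-layer GRU model, and training and verifying it with training data and verification data obtained from the pre-processed data, to obtain an optimized multi-layer GRU model; and
  • inputting the pre-processed disease monitoring data, weather data, and public opinion data before a predicted time point into the optimized multi-layer GRU model to obtain a disease prediction result at the predicted time point.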
  • In summary, the present application acquires disease monitoring data, which is time series data; acquires weather data and public opinion data related to the disease monitoring data, both being time series data corresponding to the disease monitoring data; pre-processes the disease monitoring data, weather data, and public opinion data; constructs a multi-layer gated recurrent unit neural network model, that is, a multi-layer GRU model; obtains training data and verification data from the pre-processed data and uses them to train the multi-layer GRU model and verify its performance, obtaining an optimized multi-layer GRU model; and inputs the pre-processed disease monitoring data, weather data, and public opinion data before the predicted time point into the optimized multi-layer GRU model to obtain the disease prediction result at the predicted time point.
  • The present application predicts disease data through a multi-layer GRU model.
  • The GRU model can extract knowledge directly from the data and construct feature vectors that are favorable for prediction, which improves the prediction accuracy.
  • The present application adds weather data and public opinion data to the disease prediction as influencing factors, which further improves the accuracy of disease prediction.
  • The GRU model used in this application has a simple structure and can be optimized quickly, which speeds up the entire disease prediction process. The present application therefore achieves fast and highly accurate disease prediction.
  • FIG. 1 is a flowchart of a disease prediction method according to Embodiment 1 of the present application.
  • FIG. 2 is a detailed flowchart of acquiring weather data related to disease monitoring data in the disease prediction method provided in the second embodiment of the present application.
  • FIG. 3 is a structural diagram of a disease prediction apparatus according to Embodiment 3 of the present application.
  • FIG. 4 is a detailed structural diagram of a second acquisition unit in the disease prediction apparatus provided in Embodiment 4 of the present application.
  • FIG. 5 is a schematic diagram of a computer device according to Embodiment 5 of the present application.
  • the disease prediction method of the present application is applied to one or more computer devices.
  • The computer device is a device capable of automatically performing numerical calculation and/or information processing according to instructions that are set or stored in advance. Its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), embedded devices, and the like.
  • the computer device may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • the computer device can perform human-computer interaction with the user through a keyboard, a mouse, a remote controller, a touch panel, or a voice control device.
  • FIG. 1 is a flowchart of a disease prediction method according to Embodiment 1 of the present application.
  • the disease prediction method is applied to a computer device.
  • The disease prediction method predicts disease monitoring data by using a gated recurrent unit (GRU) neural network model to obtain a high-accuracy disease prediction result.
  • the disease prediction method specifically includes the following steps:
  • Step 101 Acquire disease monitoring data, where the disease monitoring data is time series data.
  • the disease monitoring data may include disease data for diseases such as influenza, hand, foot and mouth disease, measles, and mumps.
  • a disease monitoring network composed of a plurality of monitoring points may be established in a preset area (for example, a province, a city, a region), and disease monitoring data is acquired from the monitoring points, and the disease monitoring data constitutes time series data of disease monitoring.
  • Medical institutions, schools, child care institutions, pharmacies, etc. can be selected as monitoring points to conduct disease monitoring and data collection for the corresponding target population.
  • a place that meets the preset conditions can be selected as the monitoring point.
  • The preset condition may include a number of people, a scale, and the like. For example, schools and child care institutions with at least a predetermined number of people can be selected as monitoring points. As another example, a pharmacy that has reached a preset size (for example, by daily turnover) can be selected as a monitoring point. As a further example, a hospital that has reached a preset size (for example, by daily number of patient visits) can be selected as a monitoring point.
  • Disease monitoring data at different times constitute time series data for disease surveillance.
  • disease monitoring data collected on a daily basis can be used to form time series data for disease surveillance.
  • the disease monitoring data collected on a weekly basis may constitute time series data for disease monitoring.
  • Medical institutions (mainly including hospitals) are the best place to capture early warning signs of disease and are the first choice for disease surveillance.
  • Disease surveillance data can be obtained based on patient visits.
  • the disease monitoring data can be obtained according to the drug sales of the pharmacy.
  • the medical institution, the school, the child care institution, and the pharmacy are mainly selected for the collection of disease monitoring data.
  • the above selection of data sources does not limit the addition or replacement of other focused populations or sites in other implementations as a source of data for monitoring.
  • hotels can be included in the disease surveillance area to obtain disease surveillance data for hotel residents.
  • the disease monitoring data collected by any type of monitoring point can constitute time series data of disease monitoring.
  • the disease monitoring data collected by the hospital can be taken to constitute time series data of disease monitoring.
  • the disease monitoring data collected by the plurality of types of monitoring points can be combined to form time series data of disease monitoring.
  • For example, the disease monitoring data collected by hospitals can serve as the main source, supplemented by disease monitoring data from pharmacies, to constitute time series data of disease monitoring.
  • The disease monitoring data may include disease data such as the number of visits for the disease, the visit rate, the number of cases, and the incidence rate.
  • For example, the number of daily visits for a disease (e.g., influenza) can be obtained from a medical institution (e.g., a hospital), and the number of daily visits can be used as the disease monitoring data.
  • As another example, the daily incidence of a disease (e.g., influenza) among students can be obtained from a school, and the daily incidence can be used as the disease monitoring data.
  • Step 102 Acquire weather data related to the disease monitoring data, where the weather data is time series data corresponding to the disease monitoring data.
  • Weather data related to the disease monitoring data refers to weather data that affects the disease monitoring data (i.e., the disease data).
  • The influence of different weather data on the disease monitoring data may be analyzed in advance, and the weather data that influences the disease monitoring data may be determined from the analysis result.
  • the weather data may include humidity, temperature, air pressure, precipitation, water vapor pressure, wind speed, wind direction, and sunshine hours.
  • the weather data may include daily average temperature, average air pressure, maximum temperature, minimum temperature, average relative humidity, minimum relative humidity, precipitation, average wind speed, sunshine hours, and average water vapor pressure.
  • The weather data covers the same time period as the disease monitoring data and uses the same statistical period (e.g., daily or weekly).
  • the disease monitoring data is the number of daily visits from January to February 2018, and the weather data is daily weather data for January-February 2018.
  • the disease monitoring data is the number of weekly visits from January to December 2017, and the weather data is weekly weather data (eg, weekly average temperature) from January to December 2017.
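  • The alignment of statistical periods described above can be illustrated with a minimal sketch that averages daily weather values into weekly values. The function name `weekly_mean`, the use of ISO weeks, and the sample temperatures are assumptions made for illustration; the application does not specify how weekly values are computed.

```python
from collections import defaultdict
from datetime import date, timedelta

def weekly_mean(daily):
    """Average daily values per ISO week; `daily` maps date -> value."""
    buckets = defaultdict(list)
    for day, value in daily.items():
        iso = day.isocalendar()
        buckets[(iso[0], iso[1])].append(value)  # key: (ISO year, ISO week)
    return {week: sum(v) / len(v) for week, v in sorted(buckets.items())}

# Hypothetical daily mean temperatures for 14 days starting Monday 2018-01-01.
start = date(2018, 1, 1)
daily_temp = {start + timedelta(days=i): float(i) for i in range(14)}
weekly_temp = weekly_mean(daily_temp)
```

  • The same aggregation would be applied to the disease monitoring data so that both series share one value per week.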
  • The weather data can be captured from weather information websites (such as China Weather Network, Sina Weather, and Sohu Weather) to improve the reliability of the weather data. It can be understood that the weather data may also be captured from any other web page.
  • Weather data for a predetermined area can be captured.
  • The predetermined area may include a province, a city, a region, and the like. For example, weather data for Shenzhen can be captured.
  • Weather data for a predetermined time can also be captured. The predetermined time may include a year, a month, a day, and the like. For example, daily weather data for January-February 2018 can be captured.
  • the weather data can be captured by a web crawler.
  • a web crawler is an application that automatically extracts the content of web page data. Web crawlers usually start with a URL (also called a seed URL) of one or several initial web pages, obtain the URL of the initial web page, and fetch the web page according to specific algorithms and strategies (such as depth-first search strategy). In the process, the new URL is continuously extracted from the current web page and placed in the corresponding queue until the stop condition is satisfied.
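  • The crawl loop described above can be sketched as a depth-first traversal over a frontier of URLs. This is a minimal illustration, not the crawler used in the application: the in-memory `LINKS` graph stands in for real web pages, and `get_links` and `max_pages` are hypothetical names.

```python
def crawl(seed_urls, get_links, max_pages=100):
    """Depth-first traversal from the seed URLs; returns URLs in visit order."""
    stack = list(reversed(seed_urls))  # LIFO frontier gives depth-first order
    seen = set(seed_urls)
    visited = []
    while stack and len(visited) < max_pages:  # stop condition
        url = stack.pop()
        visited.append(url)
        # Push newly discovered links; reversed so the first link is explored first.
        for link in reversed(get_links(url)):
            if link not in seen:
                seen.add(link)
                stack.append(link)
    return visited

# Hypothetical link graph standing in for real pages.
LINKS = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
order = crawl(["a"], lambda u: LINKS.get(u, []))
```

  • Replacing the stack with a FIFO queue would turn the same loop into a breadth-first crawl.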
  • URL is an abbreviation of Uniform Resource Locator, the address that identifies a resource such as a web page.
  • The weather data can be captured by using an open API of the weather information website (for example, an API opened by the China Weather Network).
  • API is an abbreviation of Application Programming Interface; computer software can communicate with other software through an API.
  • the open API interface of the weather information website can return data in JSON format or XML format.
  • the weather data can be captured by a web crawler using an open API interface of the weather information website. See Figure 2 for the specific process of crawling the weather data through the web crawler using the open API interface of the weather information website.
  • Step 103 Acquire public opinion data related to the disease monitoring data, where the public opinion data is time series data corresponding to the disease monitoring data.
  • The public opinion data related to the disease monitoring data refers to public opinion data that reflects the disease monitoring data.
  • For example, when a disease (such as influenza) breaks out, many people go online to search for disease-related words (such as flu, Tamiflu, and high fever), so the search volume for these words increases sharply.
  • Likewise, the amount of disease-related content (such as illness information and treatment information) on public opinion websites such as news sites, forums, blogs, and post bars increases. Therefore, public opinion data can be used to assist in disease prediction.
  • The public opinion data may include the number of searches for a specific word.
  • The number of searches for a specific word on a predetermined search engine can be counted (e.g., the daily number of searches for the specific word on a predetermined search engine in a specific region).
  • The public opinion data may also include the number of items on a specific public opinion website (e.g., news sites, forums, blogs, post bars) that contain a specific word.
  • The specific word is a word related to the predicted disease, for example, a word related to the symptoms of the disease. When the predicted disease is influenza, the specific words may include: sudden onset, high fever, chills, headache, weakness, inflammation of the throat, muscle soreness, dry cough, and the like.
  • When the predicted disease is hand, foot and mouth disease, the specific words may include: mouth pain, anorexia, low-grade fever, herpes on the hands, small mouth ulcers, and the like.
  • The public opinion data covers the same time period as the disease monitoring data and uses the same statistical period (e.g., daily or weekly).
  • For example, if the disease monitoring data is the number of daily visits from January to February 2018, the public opinion data is daily public opinion data for January-February 2018 (for example, the daily number of searches for a specific word).
  • As another example, if the disease monitoring data is the number of weekly visits from January to December 2017, the public opinion data is weekly public opinion data for January-December 2017 (for example, the weekly number of searches for a specific word).
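  • A daily search-count series of the kind described above could be built roughly as follows. This is a sketch under assumptions: the log format of (day, query) pairs, the word list `FLU_WORDS`, and the function name are all hypothetical, since the application does not describe how search counts are obtained from a search engine.

```python
from collections import Counter

FLU_WORDS = ("flu", "high fever", "dry cough")  # hypothetical specific words

def daily_keyword_counts(search_log, words=FLU_WORDS):
    """Count, per day, the queries that contain at least one specific word."""
    counts = Counter()
    for day, query in search_log:
        text = query.lower()
        if any(w in text for w in words):
            counts[day] += 1
    return dict(counts)

# Hypothetical search log entries: (day, query text).
log = [
    ("2018-01-01", "flu symptoms"),
    ("2018-01-01", "weather today"),
    ("2018-01-02", "High fever in children"),
    ("2018-01-02", "dry cough remedy"),
]
series = daily_keyword_counts(log)
```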
  • steps 101-103 may be performed in any order or in parallel.
  • Step 104 Pre-process the disease monitoring data, the weather data, and the public opinion data.
  • Pre-processing of disease monitoring data, weather data, and public opinion data may include anomalous data processing.
  • Abnormal data processing corrects abnormal data in the disease monitoring data, weather data, and public opinion data, which improves the reliability and accuracy of disease prediction.
  • the abnormal data processing can include filling missing values in the disease monitoring data, weather data, and public opinion data.
  • the missing values can be filled by the mean or median of the data before and after the missing values, or the missing values can be filled by regression fitting.
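  • As a minimal sketch of the mean-based option, the function below fills each missing value with the mean of the nearest known values before and after it. Representing missing values as `None` and the name `fill_missing` are assumptions for illustration; the median or regression variants mentioned above are not shown.

```python
def fill_missing(series):
    """Replace None with the mean of the nearest known values before and after."""
    filled = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        before = next((series[j] for j in range(i - 1, -1, -1)
                       if series[j] is not None), None)
        after = next((series[j] for j in range(i + 1, len(series))
                      if series[j] is not None), None)
        known = [v for v in (before, after) if v is not None]
        # Leave the gap in place if the whole series is missing.
        filled[i] = sum(known) / len(known) if known else None
    return filled

visits = [10, None, 14, None, None, 20]  # hypothetical daily visit counts
filled_visits = fill_missing(visits)
```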
  • the abnormal data processing may further include correcting abnormal values in the disease monitoring data, weather data, and public opinion data.
  • the outlier is a value that deviates significantly from other data. The outlier can be corrected by interpolation.
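  • The interpolation-based correction can be sketched as follows. The application does not say how outliers are detected, so the rule used here (distance from the median greater than `k` times the median absolute deviation) is an assumption, as are the function name and threshold.

```python
from statistics import median

def correct_outliers(series, k=5.0):
    """Flag points far from the median (assumed MAD rule) and interpolate them linearly."""
    med = median(series)
    mad = median(abs(v - med) for v in series) or 1.0  # avoid a zero threshold
    good = [i for i, v in enumerate(series) if abs(v - med) <= k * mad]
    fixed = list(series)
    for i in range(len(series)):
        if i in good:
            continue
        left = max((g for g in good if g < i), default=None)
        right = min((g for g in good if g > i), default=None)
        if left is None or right is None:
            # At the edges, copy the nearest normal value.
            fixed[i] = series[left if left is not None else right]
        else:
            t = (i - left) / (right - left)
            fixed[i] = series[left] + t * (series[right] - series[left])
    return fixed

temps = [20.0, 21.0, 95.0, 23.0, 24.0]  # hypothetical daily temperatures
corrected = correct_outliers(temps)
```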
  • Pre-processing of disease monitoring data, weather data, and public opinion data may also include data format conversion of the disease monitoring data, weather data, and public opinion data.
  • The disease monitoring data, weather data, and public opinion data are standardized so that they share a consistent format suitable for use as input data of the GRU model.
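  • One common way to bring series with different units onto a comparable scale is z-score scaling, sketched below. This is an assumed interpretation of the standardization step, not a method stated in the application, which only requires a consistent format.

```python
from statistics import mean, pstdev

def standardize(values):
    """Scale a series to zero mean and unit variance (z-score)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant series carries no signal
    return [(v - mu) / sigma for v in values]

visits = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical counts: mean 5, population std 2
z = standardize(visits)
```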
  • Step 105 Construct a multi-layer Gated Recurrent Unit (GRU) neural network model, that is, a multi-layer GRU model.
  • The multi-layer GRU model includes two GRU unit layers and one fully connected layer. The first GRU unit layer constructs features from the input data (e.g., input data composed of disease monitoring data, weather data, and public opinion data) to obtain a first hidden layer unit; the second GRU unit layer combines the first hidden layer units to obtain a second hidden layer unit; and the fully connected layer obtains a prediction result (e.g., a disease prediction result) from the second hidden layer unit. Each GRU unit layer includes a reset gate and an update gate that control the memory state of the GRU unit layer.
  • The GRU model is a recurrent neural network model. Compared with the traditional Recurrent Neural Network (RNN) model, the GRU model stores information by constructing gates at the GRU unit layer, so the gradient does not vanish quickly during model training.
  • In the multi-layer GRU model used in the method, with its two GRU unit layers and one fully connected layer, the first hidden layer unit is a local feature and the second hidden layer unit is a global feature.
  • That is, the first GRU unit layer extracts local features from the input data, the second GRU unit layer combines the local features to obtain global features, and the fully connected layer obtains the prediction result (e.g., the disease prediction result) from the global features.
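  • The two-layer structure can be sketched in plain Python as two stacked GRU layers followed by a fully connected output. This is an illustrative forward pass only, with small random weights and no training. The layer sizes, the class and function names, and the interpolation convention used for the hidden state (new state = (1 - z) * previous + z * candidate) are assumptions; references differ on that convention.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

class GRULayer:
    """One GRU unit layer with an update gate z and a reset gate r."""
    def __init__(self, n_in, n_hidden, rng):
        self.n_hidden = n_hidden
        # One (W, U, b) triple each for z, r, and the candidate state c.
        self.p = {k: (rand_matrix(n_hidden, n_in, rng),
                      rand_matrix(n_hidden, n_hidden, rng),
                      [0.0] * n_hidden) for k in ("z", "r", "c")}

    def _affine(self, key, x, h):
        W, U, b = self.p[key]
        return [wx + uh + bi for wx, uh, bi in zip(matvec(W, x), matvec(U, h), b)]

    def step(self, x, h_prev):
        z = [sigmoid(a) for a in self._affine("z", x, h_prev)]
        r = [sigmoid(a) for a in self._affine("r", x, h_prev)]
        gated = [ri * hi for ri, hi in zip(r, h_prev)]  # reset gate applied
        cand = [math.tanh(a) for a in self._affine("c", x, gated)]
        # Assumed convention: h_t = (1 - z) * h_prev + z * candidate.
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, cand)]

def predict(sequence, layer1, layer2, w_out, b_out):
    """Run two stacked GRU layers over the sequence, then a fully connected output."""
    h1 = [0.0] * layer1.n_hidden
    h2 = [0.0] * layer2.n_hidden
    for x in sequence:
        h1 = layer1.step(x, h1)   # first hidden layer unit (local features)
        h2 = layer2.step(h1, h2)  # second hidden layer unit (global features)
    return sum(w * h for w, h in zip(w_out, h2)) + b_out

rng = random.Random(0)
layer1, layer2 = GRULayer(3, 4, rng), GRULayer(4, 2, rng)
# Each time step: [visits, temperature, search count], already scaled (hypothetical).
seq = [[0.1, 0.5, 0.2], [0.3, 0.4, 0.6], [0.2, 0.1, 0.9]]
y = predict(seq, layer1, layer2, w_out=[0.5, -0.5], b_out=0.0)
```

  • In practice a deep learning framework would supply the GRU layers and the training loop; the sketch only makes the data flow between the layers explicit.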
  • The GRU unit layer includes an update gate $z_t$ and a reset gate $r_t$.
  • The update gate $z_t$ is a logic gate that controls how the hidden layer unit $h_t$ is updated.
  • The reset gate $r_t$ decides, when the candidate hidden layer unit $\tilde{h}_t$ is computed, whether to discard the previous hidden layer unit $h_{t-1}$.
  • The update gate $z_t$, the reset gate $r_t$, the candidate hidden layer unit $\tilde{h}_t$, and the hidden layer unit $h_t$ of the GRU unit layer are calculated as follows:
  • $r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$.
  • Here $\sigma$ is the Sigmoid activation function and $\tanh$ is the Tanh activation function; $W_z$, $U_z$, $b_z$ are the parameters of the update gate $z_t$; $W_r$, $U_r$, $b_r$ are the parameters of the reset gate $r_t$; and $W$, $U$, $b$ are the parameters of the candidate hidden layer unit $\tilde{h}_t$.
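  • For reference, a standard GRU formulation that is consistent with the parameters named above can be written as follows, where $\odot$ denotes element-wise multiplication. Only the reset gate equation is given explicitly in this text; the other three lines are the textbook form and are an assumption here. In particular, some references swap the roles of $z_t$ and $1 - z_t$ in the last line.

```latex
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\
\tilde{h}_t &= \tanh\bigl(W x_t + U (r_t \odot h_{t-1}) + b\bigr) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```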
  • Step 106 Obtain training data and verification data from the pre-processed disease monitoring data, weather data, and public opinion data, and use the training data and the verification data to train the multi-layer GRU model and verify its performance, to obtain the optimized multi-layer GRU model.
  • the time series data may be intercepted from the disease monitoring data, the weather data, and the public opinion data after the pre-processing to constitute the training data and the verification data.
  • the input data of the multi-layer GRU model is a vector of a preset dimension (for example, 1000 dimensions).
  • From the intercepted time series data, the pre-processed disease monitoring data, weather data, and public opinion data corresponding to each time point may be constructed into a vector of the preset dimension, and the vectors corresponding to the respective time points are input into the multi-layer GRU model in chronological order to train or verify the multi-layer GRU model.
  • For example, first time series data for training the multi-layer GRU model is intercepted from the pre-processed disease monitoring data, weather data, and public opinion data. From the first time series data, the pre-processed data corresponding to each time point is constructed into a first vector of the preset dimension, and the first vectors are input into the multi-layer GRU model in chronological order to train the model.
  • Similarly, second time series data for verifying the multi-layer GRU model is intercepted from the pre-processed disease monitoring data, weather data, and public opinion data. From the second time series data, the pre-processed data corresponding to each time point is constructed into a second vector of the preset dimension, and the second vectors are input into the multi-layer GRU model in chronological order to verify the model.
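  • Building one vector per time point and intercepting separate training and verification segments could look like the sketch below. The 80/20 split, the function names, and the sample values are assumptions; the application does not state how the segments are chosen, only that they are intercepted from the time series.

```python
def build_vectors(disease, weather, opinion):
    """Concatenate the three aligned series into one feature vector per time point."""
    if not (len(disease) == len(weather) == len(opinion)):
        raise ValueError("series must cover the same time points")
    return [[d, *w, o] for d, w, o in zip(disease, weather, opinion)]

def chronological_split(vectors, train_fraction=0.8):
    """Earlier time points train the model; later ones verify it (no shuffling)."""
    cut = int(len(vectors) * train_fraction)
    return vectors[:cut], vectors[cut:]

# Hypothetical pre-processed values for 5 days.
disease = [12, 15, 11, 18, 20]                                      # daily visits
weather = [(3.0, 60), (2.5, 65), (4.0, 55), (1.0, 70), (0.5, 72)]   # (temp, humidity)
opinion = [40, 52, 38, 70, 85]                                      # daily searches
train, valid = chronological_split(build_vectors(disease, weather, opinion))
```

  • Keeping the split chronological avoids training on time points that come after the ones used for verification.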
  • the loss function of the multi-layer GRU model may be defined as a mean square error, and the parameters of the multi-layer GRU model are adjusted such that the mean square error takes a minimum value.
  • the training process can use the RMSprop algorithm.
  • RMSprop is an improved stochastic gradient descent algorithm.
  • the mean square error and RMSprop algorithm are prior art and will not be described here.
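  • To make the loss and optimizer concrete, the sketch below minimizes a mean square error with RMSprop updates on a one-variable linear model instead of the GRU model, so that the gradients can be written by hand. The hyperparameters (`lr`, `decay`, `eps`, `steps`) and the toy data are assumptions.

```python
import math

def mse(preds, targets):
    """Mean square error between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)

def rmsprop_fit(xs, ys, lr=0.01, decay=0.9, eps=1e-8, steps=2000):
    """Fit y ~ w*x + b by minimizing MSE with RMSprop updates."""
    w = b = 0.0
    sw = sb = 0.0  # running averages of squared gradients
    n = len(xs)
    for _ in range(steps):
        errs = [w * x + b - y for x, y in zip(xs, ys)]
        gw = 2 * sum(e * x for e, x in zip(errs, xs)) / n  # dMSE/dw
        gb = 2 * sum(errs) / n                             # dMSE/db
        sw = decay * sw + (1 - decay) * gw * gw
        sb = decay * sb + (1 - decay) * gb * gb
        # Each step is scaled by the root of the running squared gradient.
        w -= lr * gw / (math.sqrt(sw) + eps)
        b -= lr * gb / (math.sqrt(sb) + eps)
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1
w, b = rmsprop_fit(xs, ys)
loss = mse([w * x + b for x in xs], ys)
```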
  • Step 107 Obtain the disease monitoring data, weather data, and public opinion data before the predicted time point from the pre-processed disease monitoring data, weather data, and public opinion data, and input them into the optimized multi-layer GRU model to obtain a disease prediction result at the predicted time point.
  • The disease monitoring data, weather data, and public opinion data obtained before the predicted time point are time series data.
  • From the data obtained before the predicted time point, the pre-processed disease monitoring data, weather data, and public opinion data corresponding to each time point are constructed into a third vector of the preset dimension, and the third vectors are input into the optimized multi-layer GRU model in chronological order to perform disease prediction for the predicted time point.
  • The optimized multi-layer GRU model obtains the hidden layer unit of the current time point from the input data of the current time point and the hidden layer unit of the previous time point, and obtains the predicted value of the current time point from the hidden layer unit of the current time point. It proceeds recursively in chronological order, obtaining the hidden layer unit and the predicted value of each next time point, until the predicted value of the given time point is obtained.
  • Embodiment 1 predicts disease data through a multi-layer GRU model.
  • The GRU model can extract knowledge directly from the data and construct feature vectors that are favorable for prediction, which improves the prediction accuracy.
  • The weather data and the public opinion data are included in the disease prediction as influencing factors, which further improves the accuracy of the disease prediction.
  • Compared with a disease prediction method based on the LSTM (Long Short-Term Memory) model, the GRU model has a simpler structure and can be optimized quickly, which speeds up the entire disease prediction process. Embodiment 1 therefore achieves fast and highly accurate disease prediction.
  • FIG. 2 is a detailed flowchart of obtaining weather data related to disease monitoring data (ie, step 102 in FIG. 1) in the disease prediction method provided in the second embodiment of the present application.
  • the weather data can be captured by a web crawler using an open API interface of the weather information website. Referring to FIG. 2, the following steps may be specifically included:
  • Step 201 Generate a seed URL for the API interface of the weather information website and a subsequent URL.
  • The seed URL is the starting point and prerequisite for everything the web crawler does.
  • the seed URL can be one or more.
  • the structural characteristics of the URL of the weather information website can be analyzed, and the subsequent URLs are obtained according to the structural characteristics of the URL.
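  • If the API URLs differ only in their query parameters, the seed URL and the subsequent URLs could be generated as in the sketch below. The endpoint `weather.example.com` and the parameter names `city`, `date`, and `format` are hypothetical; the URL structure of a real weather information website has to be analyzed first, as described above.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

# Hypothetical endpoint and parameter names; a real weather API will differ.
BASE_URL = "https://weather.example.com/api/daily"

def build_urls(city, start, days):
    """Seed URL for the first day plus subsequent URLs that follow the same pattern."""
    urls = []
    for i in range(days):
        day = start + timedelta(days=i)
        query = urlencode({"city": city, "date": day.isoformat(), "format": "json"})
        urls.append(f"{BASE_URL}?{query}")
    return urls

urls = build_urls("Shenzhen", date(2018, 1, 1), 3)
seed_url, subsequent_urls = urls[0], urls[1:]
```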
  • Step 202 Send an HTTP request to an API interface of the weather information website, requesting access to the API interface.
  • the HTTP request can be sent to the API interface of the weather information website in GET mode.
  • If the request succeeds, an HTTP response is returned to indicate that the weather data can be acquired.
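  • A GET request of this kind can be built with the Python standard library as sketched below. The URL is a hypothetical placeholder, and the `fetch` helper is defined but not called here, because calling it would require a real endpoint.

```python
import urllib.request

def make_get_request(url):
    """Build (but do not send) an HTTP GET request for the API URL."""
    return urllib.request.Request(
        url, method="GET", headers={"Accept": "application/json"}
    )

def fetch(url, timeout=10):
    """Send the request; a 200 status means the weather data can be read."""
    with urllib.request.urlopen(make_get_request(url), timeout=timeout) as resp:
        if resp.status != 200:
            raise RuntimeError(f"unexpected HTTP status {resp.status}")
        return resp.read().decode("utf-8")

# Hypothetical URL; no network traffic happens until fetch() is called.
req = make_get_request("https://weather.example.com/api/daily?city=Shenzhen")
```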
  • Step 203 Analyze and identify the data content provided by the weather information website to view the data content.
  • the weather information website provides data content in a specific format, and needs to analyze and identify the data content in a specific format provided by the weather information website to view the data content.
  • the data format provided by the API interface of the weather information website is in JSON format.
  • JSON is a lightweight data interchange format that uses conventions similar to those of the C family of languages.
  • the data content of the JSON format is analyzed and identified to view the data content.
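  • Decoding a JSON response can be sketched as follows. The response body and its field names (`city`, `date`, `temp_avg`, `humidity_avg`) are assumptions for illustration and do not reflect the schema of any real weather API.

```python
import json

# Hypothetical response body; field names are assumptions, not a real API schema.
body = '{"city": "Shenzhen", "date": "2018-01-01", "temp_avg": 15.2, "humidity_avg": 68}'

def parse_weather(text):
    """Decode a JSON response into a dict, or return None if it is not a JSON object."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    return data if isinstance(data, dict) else None

record = parse_weather(body)
```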
  • Step 204 Determine whether the data content is a predetermined information content.
  • If the data content is not the predetermined information content, the data content is discarded; otherwise, the next step is performed.
  • Step 205 If the data content is a predetermined information content, the data content is captured.
  • a depth-first search strategy may be used for the state space search when the data content is captured.
  • Step 206 Save the captured data content as the weather data to the local.
  • a database can be created on the computing device to save the weather data to the database.
  • the traditional web crawler first sets one or more portal URLs.
  • a new URL is extracted from the current webpage into the queue, so as to obtain the webpage content corresponding to the URL. , save the content of the webpage to the local, and then extract the effective address as the next entry URL until the crawl is completed.
  • traditional web crawlers download a large number of irrelevant web pages.
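  • Steps 201 to 206 can be sketched as follows. This is a minimal illustration rather than the applicant's implementation: the endpoint `weather.example.com`, the field names `date`, `avg_temp`, and `avg_humidity`, and the SQLite table layout are all hypothetical, and the depth-first search strategy is not modeled because the sketch requests a fixed list of API URLs.

```python
import json
import sqlite3
from urllib.request import urlopen

# Hypothetical endpoint and field names; a real weather API will differ.
SEED_URL = "https://weather.example.com/api/daily?city={city}&date={date}"
REQUIRED_FIELDS = {"date", "avg_temp", "avg_humidity"}


def generate_urls(city, dates):
    """Step 201: derive the subsequent URLs from the seed URL's structure."""
    return [SEED_URL.format(city=city, date=d) for d in dates]


def http_get(url):
    """Step 202: send an HTTP GET request and return the response body."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8")


def crawl(urls, fetch=http_get, db_path=":memory:"):
    """Steps 203-206: parse, filter, capture, and store weather records."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS weather "
        "(date TEXT PRIMARY KEY, avg_temp REAL, avg_humidity REAL)"
    )
    for url in urls:
        try:
            record = json.loads(fetch(url))  # Step 203: parse the JSON body
        except (OSError, ValueError):
            continue  # request failed or the body was not valid JSON
        # Step 204: discard anything that is not the predetermined content.
        if not isinstance(record, dict) or not REQUIRED_FIELDS.issubset(record):
            continue
        # Steps 205-206: capture the record and save it to the local database.
        conn.execute(
            "INSERT OR REPLACE INTO weather VALUES (?, ?, ?)",
            (record["date"], record["avg_temp"], record["avg_humidity"]),
        )
    conn.commit()
    return conn
```

  • Passing the `fetch` function as a parameter is a design choice made for this sketch: it lets the parsing and filtering logic be exercised with canned responses, without network access.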
  • FIG. 3 is a structural diagram of a disease prediction apparatus according to Embodiment 3 of the present application.
  • The disease prediction apparatus 10 may include: a first obtaining unit 301, a second obtaining unit 302, a third obtaining unit 303, a pre-processing unit 304, a building unit 305, an optimization unit 306, and a predicting unit 307.
  • the first obtaining unit 301 is configured to acquire disease monitoring data, where the disease monitoring data is time series data.
  • the disease monitoring data may include disease data for diseases such as influenza, hand, foot and mouth disease, measles, and mumps.
  • a disease monitoring network composed of a plurality of monitoring points may be established in a preset area (for example, a province, a city, a region), and disease monitoring data is acquired from the monitoring points, and the disease monitoring data constitutes time series data of disease monitoring.
  • Medical institutions, schools, child care institutions, pharmacies, etc. can be selected as monitoring points to conduct disease monitoring and data collection for the corresponding target population.
  • a place that meets the preset conditions can be selected as the monitoring point.
  • The preset condition may include the number of people, the scale, and the like. For example, schools and child care institutions whose enrollment reaches a preset number can be selected as monitoring points. As another example, a pharmacy that has reached a preset size (for example, by daily turnover) can be selected as a monitoring point. As a further example, a hospital that has reached a preset size (for example, by daily number of patient visits) can be selected as a monitoring point.
  • Disease monitoring data at different times constitute time series data for disease surveillance.
  • disease monitoring data collected on a daily basis can be used to form time series data for disease surveillance.
  • the disease monitoring data collected on a weekly basis may constitute time series data for disease monitoring.
  • Medical institutions (mainly including hospitals) are the best place to capture early warning signs of disease and are the first choice for disease surveillance.
  • Disease surveillance data can be obtained based on patient visits.
  • the disease monitoring data can be obtained according to the drug sales of the pharmacy.
  • the medical institution, the school, the child care institution, and the pharmacy are mainly selected for the collection of disease monitoring data.
  • the above selection of data sources does not limit the addition or replacement of other focused populations or sites in other embodiments as a source of data for monitoring.
  • hotels can be included in the disease surveillance area to obtain disease surveillance data for hotel residents.
  • the disease monitoring data collected by any type of monitoring point can constitute time series data of disease monitoring.
  • the disease monitoring data collected by the hospital can be taken to constitute time series data of disease monitoring.
  • the disease monitoring data collected by the plurality of types of monitoring points can be combined to form time series data of disease monitoring.
  • the disease monitoring data collected by the hospital can be mainly used, supplemented by the disease monitoring data participated by the pharmacy, and constitute time series data of disease monitoring.
  • The disease monitoring data may include disease data such as the number of visits for the disease, the visit rate, the number of cases, and the incidence rate.
  • For example, the number of daily visits for a disease (e.g., flu) can be obtained from a medical institution (e.g., a hospital) and used as disease monitoring data.
  • As another example, the daily incidence of a disease (e.g., influenza) among students can be obtained from a school and used as disease monitoring data.
  • the second obtaining unit 302 is configured to acquire weather data related to the disease monitoring data, where the weather data is time series data corresponding to the disease monitoring data.
  • Weather data related to the disease monitoring data refers to weather data that affects the disease monitoring data (i.e., the disease data).
  • The influence of different weather data on the disease monitoring data may be analyzed in advance, and the weather data that influences the disease monitoring data may be determined according to the analysis result.
  • the weather data may include humidity, temperature, air pressure, precipitation, water vapor pressure, wind speed, wind direction, and sunshine hours.
  • the weather data may include daily average temperature, average air pressure, maximum temperature, minimum temperature, average relative humidity, minimum relative humidity, precipitation, average wind speed, sunshine hours, and average water vapor pressure.
  • The weather data covers the same time period as the disease monitoring data and uses the same statistical period (e.g., daily, weekly).
  • For example, if the disease monitoring data is the number of daily visits from January to February 2018, the weather data is the daily weather data for January to February 2018.
  • As another example, if the disease monitoring data is the number of weekly visits from January to December 2017, the weather data is the weekly weather data (e.g., weekly average temperature) for January to December 2017.
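  • When the disease monitoring data is weekly but the captured weather data is daily, the two statistical periods have to be brought into line. The following minimal sketch assumes weeks are ISO calendar weeks and that the weekly value is a plain average of the daily values; neither choice is specified in the text, and the function name is illustrative.

```python
from collections import defaultdict
from datetime import date


def weekly_mean(daily):
    """Average daily values per ISO (year, week), in chronological order."""
    buckets = defaultdict(list)
    for day, value in daily.items():
        iso_year, iso_week, _ = day.isocalendar()
        buckets[(iso_year, iso_week)].append(value)
    return {week: sum(v) / len(v) for week, v in sorted(buckets.items())}


# Daily average temperatures (illustrative values only).
daily_temperature = {
    date(2017, 1, 2): 10.0,
    date(2017, 1, 3): 12.0,
    date(2017, 1, 9): 20.0,
}
weekly_temperature = weekly_mean(daily_temperature)
```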
  • The weather data can be captured from weather information websites (such as China Weather Network, Sina Weather, and Sohu Weather) to improve the reliability of the weather data. It can be understood that the weather data can also be captured from any other webpage.
  • Weather data for a predetermined area can be captured.
  • The predetermined area may include a province, a city, a region, and the like. For example, weather data for Shenzhen can be captured.
  • Weather data for a predetermined time can also be captured. The predetermined time may include a year, a month, a day, and the like. For example, daily weather data for January to February 2018 can be captured.
  • the weather data can be captured by a web crawler.
  • A web crawler is an application that automatically extracts the content of web pages. A web crawler usually starts from the URLs of one or several initial web pages (also called seed URLs) and fetches web pages according to a specific algorithm and strategy (such as a depth-first search strategy). During crawling, it continuously extracts new URLs from the current web page and places them in a queue until a stop condition is satisfied.
  • URL is an abbreviation of Uniform Resource Locator.
  • The weather data can be captured by using an open API interface of the weather information website (for example, an API interface opened by the China Weather Network).
  • API is an abbreviation of Application Programming Interface; computer software can communicate with other software through an API interface.
  • the open API interface of the weather information website can return data in JSON format or XML format.
  • the weather data can be captured by a web crawler using an open API interface of the weather information website. See Figure 2 for the specific process of crawling the weather data through the web crawler using the open API interface of the weather information website.
  • the third obtaining unit 303 is configured to acquire public opinion data related to the disease monitoring data, where the public opinion data is time series data corresponding to the disease monitoring data.
  • The public opinion data related to the disease monitoring data refers to public opinion data that reflects the disease monitoring data.
  • For example, when a disease (such as the flu) breaks out, many people go online to search for disease-related words (such as flu, Tamiflu, and high fever), so the search volume for these words increases sharply.
  • Likewise, the amount of disease-related content (such as illness information and treatment information) on public opinion websites such as news sites, forums, blogs, and post bars increases. Therefore, public opinion data can be used to assist in disease prediction.
  • The public opinion data may include the number of searches for a specific word.
  • The number of searches for a specific word on a predetermined search engine can be counted (e.g., the number of daily searches for the specific word on a preset search engine in a specific region).
  • The public opinion data may also include the number of public opinion items containing a specific word on a specific public opinion website (e.g., news sites, forums, blogs, post bars).
  • The specific word is a word related to the predicted disease, for example a word related to the symptoms of the disease. When the predicted disease is influenza, the specific words may include: sudden onset, high fever, chills, headache, weakness, inflammation of the throat, muscle soreness, dry cough, and the like.
  • When the predicted disease is hand, foot and mouth disease, the specific words may include: mouth pain, anorexia, low-grade fever, herpes on the hands, small mouth ulcers, and the like.
  • The public opinion data covers the same time period as the disease monitoring data and uses the same statistical period (e.g., daily, weekly).
  • For example, if the disease monitoring data is the number of daily visits from January to February 2018, the public opinion data is the daily public opinion data for January to February 2018 (for example, the number of daily searches for a specific word).
  • As another example, if the disease monitoring data is the number of weekly visits from January to December 2017, the public opinion data is the weekly public opinion data for January to December 2017 (for example, the number of weekly searches for a specific word).
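  • Counting public opinion items that contain a specific word can be sketched as below. The input layout (a list of `(day, text)` pairs already collected from a public opinion website) and the case-insensitive substring match are assumptions of this sketch; the text does not say how items are collected or matched.

```python
from collections import Counter


def daily_mention_counts(posts, specific_words):
    """Count, per day, the posts that contain at least one specific word."""
    counts = Counter()
    for day, text in posts:
        lowered = text.lower()
        if any(word in lowered for word in specific_words):
            counts[day] += 1
    return counts


# Illustrative posts and influenza-related words.
posts = [
    ("2018-01-01", "High fever and chills since this morning"),
    ("2018-01-01", "Nice weather today"),
    ("2018-01-02", "Dry cough and headache all week"),
]
counts = daily_mention_counts(posts, ["high fever", "dry cough"])
```

  • The resulting per-day counts form a time series with the same statistical period as daily disease monitoring data.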
  • the pre-processing unit 304 is configured to pre-process the disease monitoring data, the weather data, and the public opinion data.
  • Pre-processing of the disease monitoring data, weather data, and public opinion data may include abnormal data processing.
  • Abnormal data processing corrects abnormal data in the disease monitoring data, weather data, and public opinion data, which improves the reliability and accuracy of disease prediction.
  • the abnormal data processing can include filling missing values in the disease monitoring data, weather data, and public opinion data.
  • the missing values can be filled by the mean or median of the data before and after the missing values, or the missing values can be filled by regression fitting.
  • The abnormal data processing may further include correcting abnormal values (outliers) in the disease monitoring data, weather data, and public opinion data.
  • An outlier is a value that deviates significantly from the other data. An outlier can be corrected by interpolation.
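  • The two abnormal data processing steps can be sketched as follows. Missing values are filled with the mean of the nearest known values before and after the gap, as described above. The outlier rule (distance from the median measured in median absolute deviations, with a threshold of 5) and the midpoint interpolation are assumptions chosen for illustration, since the text does not fix a detection rule. The sketch also assumes the series contains at least one known value.

```python
from statistics import median


def fill_missing(series):
    """Replace None with the mean of the nearest known values on each side."""
    filled = list(series)
    for i, value in enumerate(series):
        if value is None:
            before = next((v for v in reversed(series[:i]) if v is not None), None)
            after = next((v for v in series[i + 1:] if v is not None), None)
            known = [v for v in (before, after) if v is not None]
            filled[i] = sum(known) / len(known)
    return filled


def correct_outliers(series, threshold=5.0):
    """Replace outliers with the midpoint of the nearest non-outlier neighbours."""
    med = median(series)
    # Median absolute deviation; fall back to 1.0 for a constant series.
    mad = median(abs(v - med) for v in series) or 1.0
    is_outlier = [abs(v - med) / mad > threshold for v in series]
    corrected = list(series)
    for i, flagged in enumerate(is_outlier):
        if flagged:
            left = next((series[j] for j in range(i - 1, -1, -1)
                         if not is_outlier[j]), None)
            right = next((series[j] for j in range(i + 1, len(series))
                          if not is_outlier[j]), None)
            known = [v for v in (left, right) if v is not None]
            corrected[i] = sum(known) / len(known)
    return corrected
```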
  • Pre-processing of disease monitoring data, weather data, and public opinion data may also include data format conversion of the disease monitoring data, weather data, and public opinion data.
  • The disease monitoring data, weather data, and public opinion data are standardized so that they share a consistent standard format and are suitable as input data for the GRU model.
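  • One common way to bring series with different units (visit counts, temperatures, search counts) onto a comparable scale is z-score standardization. The text does not name a specific standardization method, so the following is only a sketch under that assumption; the function names are illustrative.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Return the mean and population standard deviation of a series."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # avoid division by zero for constant series
    return mu, sigma


def standardize(values, mu, sigma):
    """Map each value to its z-score under the given mean and deviation."""
    return [(v - mu) / sigma for v in values]


daily_visits = [2.0, 4.0, 6.0]  # illustrative values only
mu, sigma = fit_scaler(daily_visits)
scaled_visits = standardize(daily_visits, mu, sigma)
```

  • In practice the mean and deviation would typically be fitted on the training portion only and then reused for the verification and prediction data, so that no information leaks from later time points.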
  • The building unit 305 is configured to construct a multi-layer Gated Recurrent Unit (GRU) recurrent neural network model, that is, a multi-layer GRU model.
  • The multi-layer GRU model includes two GRU unit layers and one fully connected layer. The first GRU unit layer is used to construct features from the input data (e.g., input data composed of disease monitoring data, weather data, and public opinion data) to obtain first hidden layer units; the second GRU unit layer is used to combine the first hidden layer units to obtain second hidden layer units; and the fully connected layer is used to obtain a prediction result (e.g., a disease prediction result) according to the second hidden layer units. Each GRU unit layer includes a reset gate and an update gate that control the memory state of the GRU unit layer.
  • The GRU model is a time-recurrent neural network model. Compared with the traditional Recurrent Neural Network (RNN) model, the GRU model retains information through gates constructed in the GRU unit layer, so the gradient does not vanish quickly during model training.
  • The multi-layer GRU model used in this embodiment includes two GRU unit layers and one fully connected layer. The first GRU unit layer is used to construct features from the input data (such as input data composed of disease monitoring data, weather data, and public opinion data) to obtain first hidden layer units, and the second GRU unit layer is used to combine the first hidden layer units to obtain second hidden layer units.
  • The fully connected layer obtains a predicted value according to the second hidden layer units.
  • The first hidden layer units are local features, and the second hidden layer units are global features. That is, the first GRU unit layer extracts local information, the second GRU unit layer combines the local features to obtain global features, and the fully connected layer obtains the prediction result (e.g., a disease prediction result) from the global features.
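  • The GRU unit layer includes an update gate $z_t$ and a reset gate $r_t$.
  • The update gate $z_t$ is a logic gate that controls how the hidden layer unit $h_t$ is updated.
  • The reset gate $r_t$ decides how much of the previous hidden layer unit $h_{t-1}$ to discard when computing the candidate hidden layer unit $\tilde{h}_t$.
  • In the standard GRU formulation, the update gate $z_t$, the reset gate $r_t$, the candidate hidden layer unit $\tilde{h}_t$, and the hidden layer unit $h_t$ of the GRU unit layer are calculated as follows:
  • $z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$,
  • $r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$,
  • $\tilde{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1}) + b)$,
  • $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$.
  • Here $x_t$ is the input at time point $t$, $\odot$ denotes element-wise multiplication, $\sigma$ is the Sigmoid activation function, $\tanh$ is the Tanh activation function, $W_z$, $U_z$, $b_z$ are the parameters of the update gate $z_t$, $W_r$, $U_r$, $b_r$ are the parameters of the reset gate $r_t$, and $W$, $U$, $b$ are the parameters of the candidate hidden layer unit $\tilde{h}_t$.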
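  • To make the gate computations and the two-layer structure concrete, the following pure-Python sketch runs a forward pass through two GRU unit layers and a fully connected layer. It is an illustration under stated assumptions, not the applicant's implementation: the weights are random and untrained, the layer sizes are arbitrary, the class and function names are hypothetical, and the hidden state update uses the convention $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$ (some formulations swap the roles of $z_t$ and $1 - z_t$).

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, x, U, h, b):
    """Compute W x + U h + b element by element."""
    return [wx + uh + bi for wx, uh, bi in zip(matvec(W, x), matvec(U, h), b)]


class GRULayer:
    """One GRU unit layer with an update gate z_t and a reset gate r_t."""

    def __init__(self, n_in, n_hidden, rng):
        def mat(cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(n_hidden)]
        self.n_hidden = n_hidden
        self.Wz, self.Uz, self.bz = mat(n_in), mat(n_hidden), [0.0] * n_hidden
        self.Wr, self.Ur, self.br = mat(n_in), mat(n_hidden), [0.0] * n_hidden
        self.W, self.U, self.b = mat(n_in), mat(n_hidden), [0.0] * n_hidden

    def step(self, x, h_prev):
        """Compute h_t from the input x_t and the previous state h_{t-1}."""
        z = [sigmoid(a) for a in affine(self.Wz, x, self.Uz, h_prev, self.bz)]
        r = [sigmoid(a) for a in affine(self.Wr, x, self.Ur, h_prev, self.br)]
        gated = [ri * hi for ri, hi in zip(r, h_prev)]  # r_t * h_{t-1}
        h_cand = [math.tanh(a) for a in affine(self.W, x, self.U, gated, self.b)]
        return [(1 - zi) * hp + zi * hc for zi, hp, hc in zip(z, h_prev, h_cand)]


def predict(sequence, layer1, layer2, dense_w, dense_b):
    """Feed vectors in time order through both GRU layers, then the dense layer."""
    h1 = [0.0] * layer1.n_hidden
    h2 = [0.0] * layer2.n_hidden
    for x in sequence:
        h1 = layer1.step(x, h1)   # first hidden layer units (local features)
        h2 = layer2.step(h1, h2)  # second hidden layer units (global features)
    return sum(w * h for w, h in zip(dense_w, h2)) + dense_b
```

  • For example, with `rng = random.Random(0)`, `GRULayer(3, 4, rng)` and `GRULayer(4, 2, rng)` can be chained by calling `predict(sequence, layer1, layer2, [1.0, 1.0], 0.0)` on a list of three-dimensional input vectors.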
  • The optimization unit 306 is configured to obtain training data and verification data from the pre-processed disease monitoring data, weather data, and public opinion data, and to use the training data and the verification data to train the multi-layer GRU model and verify its performance, thereby obtaining an optimized multi-layer GRU model.
  • the time series data may be intercepted from the disease monitoring data, the weather data, and the public opinion data after the pre-processing to constitute the training data and the verification data.
  • the input data of the multi-layer GRU model is a vector of a preset dimension (for example, 1000 dimensions).
  • From the intercepted time series data, the pre-processed disease monitoring data, weather data, and public opinion data corresponding to each time point may be constructed into a vector of the preset dimension, and the vectors corresponding to the respective time points are input into the multi-layer GRU model in chronological order to train or verify the multi-layer GRU model.
  • For example, first time series data for training the multi-layer GRU model is intercepted from the pre-processed disease monitoring data, weather data, and public opinion data; from the intercepted first time series data, the pre-processed disease monitoring data, weather data, and public opinion data corresponding to each time point are constructed into a first vector of the preset dimension, and the first vectors corresponding to the respective time points are input into the multi-layer GRU model in chronological order to train the multi-layer GRU model.
  • Similarly, second time series data for verifying the multi-layer GRU model is intercepted from the pre-processed disease monitoring data, weather data, and public opinion data; from the intercepted second time series data, the pre-processed disease monitoring data, weather data, and public opinion data corresponding to each time point are constructed into a second vector of the preset dimension, and the second vectors corresponding to the respective time points are input into the multi-layer GRU model in chronological order to verify the multi-layer GRU model.
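  • Building one vector per time point and intercepting training and verification segments can be sketched as below. The vector layout (disease value, then weather features, then public opinion value), the 80/20 chronological split, and the function names are assumptions of this sketch; the dimension here is far smaller than the 1000 dimensions mentioned above.

```python
def build_vectors(disease, weather, opinion):
    """Concatenate the three aligned series into one vector per time point."""
    if not (len(disease) == len(weather) == len(opinion)):
        raise ValueError("series must cover the same time points")
    return [[d, *w, o] for d, w, o in zip(disease, weather, opinion)]


def chronological_split(vectors, train_fraction=0.8):
    """Intercept the earlier segment for training and the later for verification."""
    cut = int(len(vectors) * train_fraction)
    return vectors[:cut], vectors[cut:]


# Illustrative values: daily visits, (avg temperature, avg humidity), searches.
disease = [5, 7, 9, 8, 6]
weather = [[15.0, 60], [14.0, 65], [13.0, 70], [12.0, 72], [16.0, 58]]
opinion = [100, 120, 150, 140, 110]

vectors = build_vectors(disease, weather, opinion)
train_vectors, verify_vectors = chronological_split(vectors)
```

  • Splitting by position rather than at random keeps the time order intact, so the verification data always lies after the training data.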
  • the loss function of the multi-layer GRU model may be defined as a mean square error, and the parameters of the multi-layer GRU model are adjusted such that the mean square error takes a minimum value.
  • the training process can use the RMSprop algorithm.
  • RMSprop is an improved stochastic gradient descent algorithm.
  • the mean square error and RMSprop algorithm are prior art and will not be described here.
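  • For readers unfamiliar with the two techniques, the sketch below shows the mean square error and a single RMSprop update, applied to a toy straight-line fit instead of the GRU model so that the gradients can be written by hand. The learning rate, decay factor, step count, and function names are illustrative assumptions.

```python
import math


def mse(predictions, targets):
    """Mean square error between predicted and observed values."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def rmsprop_step(params, grads, cache, lr=0.01, decay=0.9, eps=1e-8):
    """Update params in place, scaling each step by a running RMS of its gradient."""
    for i, g in enumerate(grads):
        cache[i] = decay * cache[i] + (1 - decay) * g * g
        params[i] -= lr * g / (math.sqrt(cache[i]) + eps)


def fit_line(xs, ys, steps=2000):
    """Fit y = w * x + b by minimizing the MSE with RMSprop."""
    params, cache = [0.0, 0.0], [0.0, 0.0]
    for _ in range(steps):
        errors = [params[0] * x + params[1] - y for x, y in zip(xs, ys)]
        grads = [
            2 * sum(e * x for e, x in zip(errors, xs)) / len(xs),  # dMSE/dw
            2 * sum(errors) / len(xs),                             # dMSE/db
        ]
        rmsprop_step(params, grads, cache)
    return params


xs, ys = [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]  # generated by y = 2x + 1
w, b = fit_line(xs, ys)
final_loss = mse([w * x + b for x in xs], ys)
```

  • Because RMSprop divides each gradient by a running estimate of its magnitude, the step size stays roughly constant per parameter even when gradient magnitudes differ widely, which is one reason it is commonly used for recurrent networks.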
  • The predicting unit 307 is configured to obtain the disease monitoring data, weather data, and public opinion data before the predicted time point from the pre-processed disease monitoring data, weather data, and public opinion data, and to input them into the optimized multi-layer GRU model to obtain a disease prediction result at the predicted time point.
  • The disease monitoring data, weather data, and public opinion data obtained before the predicted time point are time series data.
  • From the disease monitoring data, weather data, and public opinion data before the predicted time point, the pre-processed disease monitoring data, weather data, and public opinion data corresponding to each time point are constructed into a third vector of the preset dimension. The third vectors corresponding to the respective time points are input into the multi-layer GRU model in chronological order to perform disease prediction for the predicted time point.
  • The optimized multi-layer GRU model obtains the hidden layer units of the current time point from the input data of the current time point and the hidden layer units of the previous time point, and obtains the predicted value of the current time point from the hidden layer units of the current time point. It proceeds recursively in chronological order, obtaining the hidden layer units and predicted value of each subsequent time point, until the predicted value of the given time point is obtained.
  • Embodiment 3 predicts disease data with a multi-layer GRU model.
  • The GRU model can extract knowledge directly from the data and construct feature vectors that are favorable for prediction, which improves prediction accuracy.
  • The weather data and the public opinion data are included as influencing factors in the disease prediction, which further improves its accuracy.
  • Compared with a disease prediction method based on the LSTM (Long Short-Term Memory) model, the GRU model has a simple structure and can be optimized quickly, which speeds up the entire disease prediction process. Therefore, Embodiment 3 achieves fast, high-accuracy disease prediction.
  • FIG. 4 is a detailed structural diagram of the second obtaining unit (i.e., 302 in FIG. 3) in the disease prediction apparatus provided in Embodiment 4 of the present application.
  • the second obtaining unit 302 can capture the weather data through a web crawler by using an API interface opened by the weather information website.
  • The second obtaining unit 302 may include: a generating subunit 3021, a requesting subunit 3022, an analyzing subunit 3023, a determining subunit 3024, a capture subunit 3025, and a storage subunit 3026.
  • a generating subunit 3021 is configured to generate a seed URL for the API interface of the weather information website and a subsequent URL.
  • the seed URL is the basis and premise for the web crawler to do everything.
  • the seed URL can be one or more.
  • the structural characteristics of the URL of the weather information website can be analyzed, and the subsequent URLs are obtained according to the structural characteristics of the URL.
  • the requesting subunit 3022 is configured to send an HTTP request to the API interface of the weather information website to request access to the API interface.
  • the HTTP request can be sent to the API interface of the weather information website in GET mode.
  • an HTTP response is returned to inform that the weather data can be acquired.
  • the analyzing subunit 3023 is configured to analyze and identify the data content provided by the weather information website to view the data content.
  • the weather information website provides data content in a specific format, and needs to analyze and identify the data content in a specific format provided by the weather information website to view the data content.
  • the data format provided by the API interface of the weather information website is in JSON format.
  • JSON is a data exchange format that uses a grammar convention similar to C.
  • the data content of the JSON format is analyzed and identified to view the data content.
  • The determining subunit 3024 is configured to determine whether the data content is the predetermined information content.
  • If the data content is not the predetermined information content, it is discarded; otherwise, the data content is passed to the capture subunit 3025.
  • The capture subunit 3025 is configured to capture the data content if the data content is the predetermined information content.
  • a depth-first search strategy may be used for the state space search when the data content is captured.
  • The storage subunit 3026 is configured to save the captured data content locally as the weather data.
  • a database can be created on the computing device to save the weather data to the database.
  • A traditional web crawler first sets one or more entry URLs.
  • While crawling, it extracts new URLs from the current webpage into a queue, fetches the webpage content corresponding to each URL, saves that content locally, and then extracts valid addresses as the next entry URLs until the crawl is complete.
  • the second obtaining unit 302 uses the API interface opened by the weather information website to capture the weather data through the web crawler, thereby avoiding downloading irrelevant web pages and efficiently acquiring weather data, thereby improving the efficiency of disease prediction.
  • FIG. 5 is a schematic diagram of a computer apparatus according to Embodiment 5 of the present application.
  • the computer device 1 includes a memory 20, a processor 30, and a computer program 40, such as a disease prediction program, stored in the memory 20 and executable on the processor 30.
  • the processor 30 executes the computer program 40 to implement the steps in the above-described disease prediction method embodiment, such as steps 101-107 shown in FIG.
  • the processor 30, when executing the computer program 40, implements the functions of the various modules/units in the above-described apparatus embodiments, such as units 301-307 in FIG.
  • The computer program 40 can be partitioned into one or more modules/units that are stored in the memory 20 and executed by the processor 30 to implement the present application.
  • the one or more modules/units may be a series of computer program instruction segments capable of performing a particular function for describing the execution of the computer program 40 in the computer device 1.
  • the computer program 40 may be divided into a first obtaining unit 301, a second obtaining unit 302, a third obtaining unit 303, a pre-processing unit 304, a building unit 305, an optimizing unit 306, and a predicting unit 307 in FIG.
  • For the specific functions of each unit, refer to Embodiment 3.
  • the processor 30 may be a central processing unit (CPU), or may be other general-purpose processors, a digital signal processor (DSP), an application specific integrated circuit (ASIC), Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, etc.
  • The general-purpose processor may be a microprocessor, or the processor 30 may be any conventional processor or the like. The processor 30 is the control center of the computer device 1 and connects the various parts of the entire computer device 1 by using various interfaces and lines.
  • The memory 20 can be used to store the computer program 40 and/or the modules/units. The processor 30 implements the various functions of the computer device 1 by running or executing the computer programs and/or modules/units stored in the memory 20 and by calling the data stored in the memory 20.
  • The memory 20 may mainly include a storage program area and a storage data area. The storage program area may store an operating system, an application required for at least one function (such as a sound playing function or an image playing function), and the like; the storage data area may store data (such as audio data and a phone book) created according to the use of the computer device 1.
  • In addition, the memory 20 may include a high-speed random access memory, and may also include a non-volatile memory such as a hard disk, a memory, a plug-in hard disk, a smart media card (SMC), a Secure Digital (SD) card, a flash card, at least one disk storage device, a flash memory device, or other volatile solid-state storage device.
  • If the modules/units integrated in the computer device 1 are implemented in the form of software functional units and sold or used as stand-alone products, they can be stored in a computer-readable storage medium. Based on such understanding, all or part of the processes in the foregoing method embodiments of the present application may also be completed by a computer program instructing related hardware.
  • the computer program may be stored in a non-volatile readable storage medium.
  • the computer program when executed by the processor, implements the steps of the various method embodiments described above.
  • the computer program comprises computer program code, which may be in the form of source code, object code form, executable file or some intermediate form.
  • The non-volatile readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), electrical carrier signals, telecommunications signals, and software distribution media.
  • It should be noted that the content of the non-volatile readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the non-volatile readable medium does not include electrical carrier signals and telecommunication signals.

Landscapes

  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Public Health (AREA)
  • Medical Informatics (AREA)
  • Primary Health Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Epidemiology (AREA)
  • Biomedical Technology (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Treatment And Welfare Office Work (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A disease prediction method comprises: acquiring disease monitoring data (101), weather data (102), and public opinion data (103); preprocessing the disease monitoring data, weather data, and public opinion data (104); constructing a multi-layer GRU model (105); training the multi-layer GRU model and verifying its performance to obtain an optimized multi-layer GRU model (106); and using the optimized multi-layer GRU model to perform prediction for a prediction time point, so as to obtain a disease prediction result at the prediction time point (107). The present application further provides a disease prediction device, a computer device, and a readable storage medium, and can achieve rapid disease prediction with high accuracy.

Description

Disease prediction method and device, computer device and readable storage medium
This application claims priority to Chinese Patent Application No. 201810322431.8, entitled "Disease Prediction Method and Apparatus, Computer Apparatus, and Readable Storage Medium", filed with the Chinese Patent Office on April 11, 2018, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of prediction technologies, and in particular to a disease prediction method and apparatus, a computer apparatus, and a non-volatile readable storage medium.
Background
As global economic integration accelerates, economic and exchange activities increase and population movements become more frequent. This provides a favorable environment for the spread and outbreak of diseases, and public health problems are becoming increasingly serious. At the same time, the social and natural environments are changing, and the growing number of events that affect public health, such as environmental pollution and natural disasters, also raises the likelihood of public health emergencies.
How to identify disease-related public health emergencies early, issue warnings promptly, and take appropriate control measures as soon as possible so as to minimize the losses they cause has long been a focus of the public health field and is an important part of health emergency work. Early warning of public health emergencies involves collecting, collating, analyzing, and integrating relevant data, and using modern technologies such as computers, networks, and communications to monitor, identify, diagnose, and evaluate the signs of an event. Alarms are raised in time so that the relevant departments and the public can respond and prepare, and effective prevention and control measures are taken promptly to prevent or slow the occurrence of emergencies, or to reduce the harm they cause, as far as possible.
An important task in the early warning of public health emergencies is disease prediction, that is, predicting future disease monitoring data from historical disease monitoring data (i.e., patient data). With the development of machine learning technology, more and more machine learning methods are being applied to disease prediction. However, traditional machine learning applied to disease prediction usually requires a feature set to be defined manually, after which the best feature combination is searched for within the defined feature set. The results are often not good enough, which affects the accuracy of disease prediction.
Summary of the Invention
In view of the above, it is necessary to provide a disease prediction method and apparatus, a computer apparatus, and a non-volatile readable storage medium that enable fast, high-accuracy disease prediction.
A first aspect of the present application provides a disease prediction method, the method comprising:
acquiring disease monitoring data, the disease monitoring data being time series data;
acquiring weather data related to the disease monitoring data, the weather data being time series data corresponding to the disease monitoring data;
acquiring public opinion data related to the disease monitoring data, the public opinion data being time series data corresponding to the disease monitoring data;
preprocessing the disease monitoring data, the weather data, and the public opinion data;
constructing a multi-layer gated recurrent unit neural network model, that is, a multi-layer GRU model;
obtaining training data and validation data from the preprocessed disease monitoring data, weather data, and public opinion data, and training the multi-layer GRU model and verifying its performance with the training data and the validation data to obtain an optimized multi-layer GRU model; and
obtaining, from the preprocessed disease monitoring data, weather data, and public opinion data, the disease monitoring data, weather data, and public opinion data preceding a prediction time point, and inputting them into the optimized multi-layer GRU model to obtain a disease prediction result for the prediction time point.
A second aspect of the present application provides a disease prediction apparatus, the apparatus comprising:
a first acquiring unit, configured to acquire disease monitoring data, the disease monitoring data being time series data;
a second acquiring unit, configured to acquire weather data related to the disease monitoring data, the weather data being time series data corresponding to the disease monitoring data;
a third acquiring unit, configured to acquire public opinion data related to the disease monitoring data, the public opinion data being time series data corresponding to the disease monitoring data;
a preprocessing unit, configured to preprocess the disease monitoring data, the weather data, and the public opinion data;
a construction unit, configured to construct a multi-layer gated recurrent unit neural network model, that is, a multi-layer GRU model;
an optimization unit, configured to obtain training data and validation data from the preprocessed disease monitoring data, weather data, and public opinion data, and to train the multi-layer GRU model and verify its performance with the training data and the validation data to obtain an optimized multi-layer GRU model; and
a prediction unit, configured to obtain, from the preprocessed disease monitoring data, weather data, and public opinion data, the disease monitoring data, weather data, and public opinion data preceding a prediction time point, and to input them into the optimized multi-layer GRU model to obtain a disease prediction result for the prediction time point.
A third aspect of the present application provides a computer apparatus comprising a memory and a processor, the memory being configured to store at least one computer-readable instruction, and the processor being configured to execute the at least one computer-readable instruction to implement the following steps:
acquiring disease monitoring data, the disease monitoring data being time series data;
acquiring weather data related to the disease monitoring data, the weather data being time series data corresponding to the disease monitoring data;
acquiring public opinion data related to the disease monitoring data, the public opinion data being time series data corresponding to the disease monitoring data;
preprocessing the disease monitoring data, the weather data, and the public opinion data;
constructing a multi-layer gated recurrent unit neural network model, that is, a multi-layer GRU model;
obtaining training data and validation data from the preprocessed disease monitoring data, weather data, and public opinion data, and training the multi-layer GRU model and verifying its performance with the training data and the validation data to obtain an optimized multi-layer GRU model; and
obtaining, from the preprocessed disease monitoring data, weather data, and public opinion data, the disease monitoring data, weather data, and public opinion data preceding a prediction time point, and inputting them into the optimized multi-layer GRU model to obtain a disease prediction result for the prediction time point.
A fourth aspect of the present application provides a non-volatile readable storage medium storing at least one computer-readable instruction which, when executed by a processor, implements the following steps:
acquiring disease monitoring data, the disease monitoring data being time series data;
acquiring weather data related to the disease monitoring data, the weather data being time series data corresponding to the disease monitoring data;
acquiring public opinion data related to the disease monitoring data, the public opinion data being time series data corresponding to the disease monitoring data;
preprocessing the disease monitoring data, the weather data, and the public opinion data;
constructing a multi-layer gated recurrent unit neural network model, that is, a multi-layer GRU model;
obtaining training data and validation data from the preprocessed disease monitoring data, weather data, and public opinion data, and training the multi-layer GRU model and verifying its performance with the training data and the validation data to obtain an optimized multi-layer GRU model; and
obtaining, from the preprocessed disease monitoring data, weather data, and public opinion data, the disease monitoring data, weather data, and public opinion data preceding a prediction time point, and inputting them into the optimized multi-layer GRU model to obtain a disease prediction result for the prediction time point.
The present application acquires disease monitoring data, the disease monitoring data being time series data; acquires weather data related to the disease monitoring data, the weather data being time series data corresponding to the disease monitoring data; acquires public opinion data related to the disease monitoring data, the public opinion data being time series data corresponding to the disease monitoring data; preprocesses the disease monitoring data, the weather data, and the public opinion data; constructs a multi-layer gated recurrent unit neural network model, that is, a multi-layer GRU model; obtains training data and validation data from the preprocessed disease monitoring data, weather data, and public opinion data, and trains the multi-layer GRU model and verifies its performance with the training data and the validation data to obtain an optimized multi-layer GRU model; and obtains, from the preprocessed data, the disease monitoring data, weather data, and public opinion data preceding a prediction time point and inputs them into the optimized multi-layer GRU model to obtain a disease prediction result for the prediction time point.
The present application predicts disease data with a multi-layer GRU model. The GRU model extracts knowledge directly from the data and constructs feature vectors that are useful for prediction, which improves prediction accuracy. In addition, the application incorporates weather data and public opinion data into disease prediction as influencing factors, which further improves accuracy. Furthermore, compared with disease prediction methods based on the LSTM (Long Short-Term Memory) model, the GRU model used here has a simpler structure and can be optimized quickly, which speeds up the overall prediction process. The application therefore achieves fast, high-accuracy disease prediction.
Brief Description of the Drawings
FIG. 1 is a flowchart of the disease prediction method provided in Embodiment 1 of the present application.
FIG. 2 is a detailed flowchart of acquiring weather data related to the disease monitoring data in the disease prediction method provided in Embodiment 2 of the present application.
FIG. 3 is a structural diagram of the disease prediction apparatus provided in Embodiment 3 of the present application.
FIG. 4 is a detailed structural diagram of the second acquiring unit in the disease prediction apparatus provided in Embodiment 4 of the present application.
FIG. 5 is a schematic diagram of the computer apparatus provided in Embodiment 5 of the present application.
Detailed Description
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by a person skilled in the technical field of the present application. The terms used in this specification are for the purpose of describing specific embodiments only and are not intended to limit the present application.
Preferably, the disease prediction method of the present application is applied in one or more computer apparatuses. A computer apparatus is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions. Its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), and an embedded device.
The computer apparatus may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. The computer apparatus can interact with a user through a keyboard, a mouse, a remote control, a touchpad, a voice control device, or the like.
Embodiment 1
FIG. 1 is a flowchart of the disease prediction method provided in Embodiment 1 of the present application. The disease prediction method is applied to a computer apparatus. The method uses a gated recurrent unit neural network model to predict disease monitoring data and obtain a high-accuracy disease prediction result.
As shown in FIG. 1, the disease prediction method specifically includes the following steps:
Step 101: acquire disease monitoring data, the disease monitoring data being time series data.
The disease monitoring data may include case data for diseases such as influenza, hand, foot and mouth disease, measles, and mumps.
A disease monitoring network consisting of multiple monitoring points may be established in a preset area (for example, a province, a city, or a region). Disease monitoring data are acquired from the monitoring points and form the time series data for disease monitoring. Medical institutions, schools and childcare institutions, pharmacies, and the like may be selected as monitoring points, each performing disease monitoring and data collection for its target population. Sites that satisfy preset conditions may be selected as monitoring points. The preset conditions may relate to headcount, scale, and so on. For example, schools and childcare institutions whose enrollment reaches a preset number are selected as monitoring points. As another example, pharmacies whose scale (measured, for example, by daily turnover) reaches a preset level are selected as monitoring points. As a further example, hospitals whose scale (measured, for example, by the daily number of patient visits) reaches a preset level are selected as monitoring points.
Disease monitoring data collected at different times form the time series data for disease monitoring. For example, disease monitoring data collected daily may form the time series. Alternatively, disease monitoring data collected weekly may form the time series.
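As an illustration of the daily and weekly statistical periods described above, the following minimal sketch rolls daily counts up into weekly totals. The function name `weekly_totals`, the choice of Monday as the start of the week, and the sample counts are assumptions made for this example and are not taken from the application.

```python
from datetime import date, timedelta


def weekly_totals(daily_counts):
    """Sum (day, count) pairs into weekly totals keyed by the Monday of each week.

    Assumption: weeks start on Monday; the application does not fix a convention.
    """
    totals = {}
    for day, count in daily_counts:
        monday = day - timedelta(days=day.weekday())
        totals[monday] = totals.get(monday, 0) + count
    return sorted(totals.items())


# Hypothetical daily visit counts for two weeks starting Monday 2018-01-01.
daily = [(date(2018, 1, 1) + timedelta(days=i), 10 + i) for i in range(14)]
weekly = weekly_totals(daily)
```

Either series (daily or weekly) can serve as the disease monitoring time series, provided the weather and public opinion series use the same period.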
Medical institutions (mainly hospitals) are the places best able to capture early signs of a disease outbreak and are the first choice for disease monitoring. Disease monitoring data can be obtained from patient visit records.
Some patients buy medicine at a pharmacy on their own to relieve early symptoms. Disease monitoring data can therefore be obtained from the drug sales records of pharmacies.
Children and adolescents are a high-risk group and an important link in disease transmission, so monitoring of this group should also be strengthened. Schools and childcare institutions are well suited to monitoring the incidence of disease among children and adolescents. Disease monitoring data can be obtained from the sick-leave records of children and adolescents at schools and childcare institutions.
Accordingly, the present application mainly collects disease monitoring data from three types of sites: medical institutions, schools and childcare institutions, and pharmacies. This choice of data sources does not prevent other embodiments from adding or substituting other populations or sites of particular concern as monitoring data sources. For example, hotels may be brought within the scope of disease monitoring, and disease monitoring data for hotel guests may be acquired.
As required, the disease monitoring data collected by any one type of monitoring point (for example, medical institutions) may form the time series data for disease monitoring. For example, the data collected by hospitals may form the time series. Alternatively, disease monitoring data collected by several types of monitoring points may be combined to form the time series. For example, data collected by hospitals may serve as the primary source, supplemented by data collected by pharmacies.
The disease monitoring data may include case data such as the number of visits, the visit rate, the number of cases, and the incidence rate of a disease. For example, the daily number of visits for a disease (such as influenza) may be obtained from a medical institution (such as a hospital) and used as the disease monitoring data. As another example, the daily number of new cases of a disease (such as influenza) among students may be obtained from a school and used as the disease monitoring data.
Step 102: acquire weather data related to the disease monitoring data, the weather data being time series data corresponding to the disease monitoring data.
Weather data related to the disease monitoring data are weather data that influence the disease monitoring data (that is, the case data for the disease). The influence of different weather data on the disease monitoring data may be analyzed in advance, and the weather data that have an influence, or a relatively large influence, on the disease monitoring data are determined from the analysis results.
The weather data may include humidity, air temperature, air pressure, precipitation, water vapor pressure, wind speed, wind direction, and sunshine hours. In a specific embodiment, the weather data may include the daily average temperature, average air pressure, maximum temperature, minimum temperature, average relative humidity, minimum relative humidity, precipitation, average wind speed, sunshine hours, and average water vapor pressure.
The weather data cover the same time span as the disease monitoring data and use the same statistical period (for example, daily or weekly). For example, if the disease monitoring data are the daily numbers of visits from January to February 2018, the weather data are the daily weather data for January to February 2018. As another example, if the disease monitoring data are the weekly numbers of visits from January to December 2017, the weather data are the weekly weather data (for example, the weekly average temperature) for January to December 2017.
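The requirement that the two series share a time span and statistical period can be sketched as a date-keyed alignment. This is only an illustrative sketch: the function `align_by_date`, the use of dictionaries keyed by `datetime.date`, and the sample values are assumptions introduced here.

```python
from datetime import date


def align_by_date(disease, weather):
    """Return (day, visits, weather_value) rows for days present in both series.

    Both arguments are dicts keyed by date; days missing from either series
    are dropped so that the two series cover exactly the same time points.
    """
    common_days = sorted(set(disease) & set(weather))
    return [(day, disease[day], weather[day]) for day in common_days]


# Hypothetical daily visit counts and daily average temperatures.
visits = {date(2018, 1, 1): 120, date(2018, 1, 2): 135, date(2018, 1, 3): 150}
avg_temp = {date(2018, 1, 2): 14.8, date(2018, 1, 3): 13.1, date(2018, 1, 4): 12.0}
aligned = align_by_date(visits, avg_temp)
```

Dropping unmatched days is one possible policy; an implementation could instead keep them and treat the gaps as missing values to be filled during preprocessing.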
The weather data may be crawled from weather information websites (for example, China Weather Network, Sina Weather, or Sohu Weather) to improve the reliability of the weather data. It will be understood that the weather data may also be crawled from any web page.
Weather data for a predetermined area may be crawled. The predetermined area may be a province, a city, a region, or the like. For example, weather data for Shenzhen are crawled.
Weather data for a predetermined time may be crawled. The predetermined time may be specified by year, month, day, or the like. For example, the daily weather data for January to February 2018 are crawled.
The weather data may be crawled by a web crawler. A web crawler is an application that automatically extracts information from web pages. A web crawler usually starts from the URLs of one or several initial web pages (also called seed URLs), obtains the URLs on those pages, and, following a specific algorithm and strategy (for example, a depth-first search strategy), keeps extracting new URLs from the current page and placing them in the corresponding queue as it crawls, until a stop condition is met. URL is the abbreviation of Uniform Resource Locator.
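The crawl loop described above (seed URLs, a frontier of newly discovered URLs, a depth-first strategy, and a stop condition) can be sketched as follows. To keep the example runnable offline, pages are replaced by a hypothetical in-memory link graph `LINKS`; the function name `crawl`, the `get_links` callback, and the `max_pages` stop condition are all assumptions of this sketch, not details from the application.

```python
def crawl(seed_urls, get_links, max_pages):
    """Depth-first crawl starting from seed URLs.

    get_links(url) returns the URLs found on a page. The crawl stops when
    the frontier is empty or max_pages pages have been visited.
    """
    frontier = list(reversed(seed_urls))  # used as a stack for depth-first order
    seen = set(seed_urls)
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.pop()
        visited.append(url)
        # Push new links in reverse so the first link on the page is explored first.
        for link in reversed(get_links(url)):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# Hypothetical link graph standing in for real web pages.
LINKS = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": [],
    "c": ["seed"],
}
order = crawl(["seed"], lambda url: LINKS.get(url, []), max_pages=10)
```

A real crawler would replace `get_links` with a function that downloads the page and extracts its links, and would typically also respect the site's crawling policy.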
The weather data may be obtained through an open API provided by a weather information website (for example, the open API of China Weather Network). API stands for application programming interface; through an API, computer programs can communicate with one another. The open API of a weather information website may return data in JSON or XML format.
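As a sketch of handling a JSON response from such an API, the following example parses a sample payload into per-day records. The payload layout and the field names (`city`, `daily`, `date`, `avg_temp`, `avg_humidity`) are hypothetical and do not describe the response format of any real weather service.

```python
import json

# Hypothetical JSON payload; real weather APIs define their own field names.
SAMPLE_RESPONSE = """
{
  "city": "Shenzhen",
  "daily": [
    {"date": "2018-01-01", "avg_temp": 15.2, "avg_humidity": 61},
    {"date": "2018-01-02", "avg_temp": 14.8, "avg_humidity": 66}
  ]
}
"""


def parse_weather(payload):
    """Turn a JSON payload into (date, avg_temp, avg_humidity) tuples."""
    data = json.loads(payload)
    return [(d["date"], d["avg_temp"], d["avg_humidity"]) for d in data["daily"]]


records = parse_weather(SAMPLE_RESPONSE)
```

An XML response could be handled in the same way with an XML parser in place of `json.loads`.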
In a specific embodiment, the weather data may be crawled by a web crawler through the open API of a weather information website. The specific process is shown in FIG. 2.
Step 103: acquire public opinion data related to the disease monitoring data, the public opinion data being time series data corresponding to the disease monitoring data.
Public opinion data related to the disease monitoring data are public opinion data that reflect the disease monitoring data. For example, when a disease (such as influenza) enters its epidemic period and the number of patients rises, many people search online for disease-related terms (specific words such as influenza, Tamiflu, and high fever), and the search volume for these terms increases sharply. Likewise, as the number of patients rises, more disease-related content (such as case information and treatment information) is posted on public opinion websites such as news sites, forums, blogs, and post bars. Public opinion data related to the disease monitoring data can therefore be used to assist disease prediction.
The public opinion data may include the number of searches for specific words. For example, the number of searches for the specific words on a preset search engine may be counted (for example, the daily number of searches for the specific words on a preset search engine in a specific region).
The public opinion data may also include the number of items of public opinion information containing the specific words on specific public opinion websites (for example, news sites, forums, blogs, and post bars).
The specific words are words related to the disease to be predicted, for example words describing symptoms of the disease. When the disease to be predicted is influenza, the specific words may include sudden onset, high fever, chills, headache, weakness, sore throat, muscle aches, dry cough, and the like. When the disease to be predicted is hand, foot and mouth disease, the specific words may include mouth pain, loss of appetite, low-grade fever, small blisters on the hands, small mouth ulcers, and the like.
The public opinion data cover the same time span as the disease monitoring data and use the same statistical period (for example, daily or weekly). For example, if the disease monitoring data are the daily numbers of visits from January to February 2018, the public opinion data are the daily public opinion data for January to February 2018 (for example, the daily number of searches for the specific words). As another example, if the disease monitoring data are the weekly numbers of visits from January to December 2017, the public opinion data are the weekly public opinion data for January to December 2017 (for example, the weekly number of searches for the specific words).
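Counting the items of public opinion information that contain specific words can be sketched as below. The function `daily_keyword_counts`, the simple substring matching, and the sample posts are assumptions of this example; the application does not specify how matching is performed.

```python
from collections import Counter


def daily_keyword_counts(posts, keywords):
    """Count, per day, the posts whose text contains at least one keyword.

    posts is an iterable of (day, text) pairs. Matching is a plain
    case-insensitive substring test, which is an assumption of this sketch.
    """
    counts = Counter()
    for day, text in posts:
        lowered = text.lower()
        if any(keyword in lowered for keyword in keywords):
            counts[day] += 1
    return dict(counts)


# Hypothetical posts and influenza-related keywords.
posts = [
    ("2018-01-01", "High fever and chills since last night"),
    ("2018-01-01", "Great weather today"),
    ("2018-01-02", "Dry cough and headache"),
    ("2018-01-02", "My child has a high fever"),
]
keywords = ["high fever", "dry cough", "headache"]
counts = daily_keyword_counts(posts, keywords)
```

The resulting per-day counts form a time series with the same daily period as the disease monitoring data.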
It will be understood that steps 101 to 103 may be performed in any order or in parallel.
Step 104: preprocess the disease monitoring data, the weather data, and the public opinion data.
Preprocessing of the disease monitoring data, weather data, and public opinion data may include abnormal data processing. Its purpose is to correct abnormal data in the three data sets and so improve the reliability and accuracy of disease prediction.
The abnormal data processing may include filling in missing values in the disease monitoring data, weather data, and public opinion data. A missing value may be filled with the mean or median of the data before and after it, or it may be filled by regression fitting.
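Filling a missing value with the mean of the data before and after it can be sketched as follows. Representing missing values as `None` and using only the nearest known value on each side are assumptions of this example.

```python
def fill_missing(series):
    """Replace None entries with the mean of the nearest known neighbours.

    If a known value exists on only one side, that value is used; if the
    whole series is missing, the entries are left as None.
    """
    filled = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        before = next((series[j] for j in range(i - 1, -1, -1) if series[j] is not None), None)
        after = next((series[j] for j in range(i + 1, len(series)) if series[j] is not None), None)
        neighbours = [v for v in (before, after) if v is not None]
        if neighbours:
            filled[i] = sum(neighbours) / len(neighbours)
    return filled


# Hypothetical daily visit counts with three gaps.
visits = fill_missing([10, None, 14, None, None, 20])
```

The median-based and regression-based fills mentioned above would replace the averaging step with a different estimator.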
The abnormal data processing may also include correcting outliers in the disease monitoring data, weather data, and public opinion data. An outlier is a value that deviates markedly from the other data. Outliers may be corrected by interpolation.
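A minimal sketch of outlier correction by interpolation is given below. The application only states that an outlier deviates markedly from the other data; the detection rule used here (distance from the median greater than `k` times the median absolute deviation) and the default `k=5.0` are assumptions chosen for illustration.

```python
from statistics import median


def correct_outliers(series, k=5.0):
    """Replace outliers by linear interpolation between the nearest normal values.

    Assumed rule: a value is an outlier when it lies more than k median
    absolute deviations (MAD) away from the median of the series.
    """
    med = median(series)
    mad = median(abs(v - med) for v in series)
    is_outlier = [mad > 0 and abs(v - med) > k * mad for v in series]
    normal = [i for i, flag in enumerate(is_outlier) if not flag]
    corrected = list(series)
    for i, flag in enumerate(is_outlier):
        if not flag:
            continue
        left = max((j for j in normal if j < i), default=None)
        right = min((j for j in normal if j > i), default=None)
        if left is not None and right is not None:
            weight = (i - left) / (right - left)
            corrected[i] = series[left] + weight * (series[right] - series[left])
        elif left is not None:
            corrected[i] = series[left]
        elif right is not None:
            corrected[i] = series[right]
    return corrected


# Hypothetical daily counts with one implausible spike.
cleaned = correct_outliers([10, 12, 11, 300, 13, 12])
```

Other detection rules (for example, a fixed threshold per variable) fit the same structure; only the `is_outlier` line changes.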
Preprocessing of the disease monitoring data, weather data, and public opinion data may also include data format conversion. For example, the disease monitoring data, weather data, and public opinion data are standardized so that they share a consistent standard format suitable for use as input to the GRU model.
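One possible reading of this standardization step is to scale each variable to a common range and then assemble one feature vector per time point. The sketch below uses min-max scaling to [0, 1]; the scaling method, the function names, and the three sample variables are assumptions, since the application does not name a specific standardization.

```python
def min_max_scale(values):
    """Scale a list of numbers linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant series carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def build_inputs(visits, temperature, searches):
    """Scale each variable and return one feature vector per time point."""
    columns = [min_max_scale(column) for column in (visits, temperature, searches)]
    return [list(row) for row in zip(*columns)]


# Hypothetical aligned series: visits, average temperature, keyword searches.
inputs = build_inputs([10, 20, 30], [5, 5, 5], [0, 50, 100])
```

Each row of `inputs` then corresponds to one time step of the model input.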
Step 105: construct a multi-layer gated recurrent unit neural network (Gated Recurrent Unit Neural Network) model, that is, a multi-layer GRU model. The multi-layer GRU model includes two GRU unit layers and one fully connected layer. The first GRU unit layer constructs features from the input data (for example, input data formed from the disease monitoring data, weather data, and public opinion data) to obtain first hidden layer units. The second GRU unit layer combines the first hidden layer units to obtain second hidden layer units. The fully connected layer produces the prediction result (for example, the disease prediction result) from the second hidden layer units. Each GRU unit layer includes a reset gate and an update gate, which control the memory state of the GRU unit layer.
The GRU model is a recurrent neural network model that operates over time. Compared with the traditional recurrent neural network (RNN) model, the GRU model stores information by building gates into the GRU unit layer, so the gradient does not vanish quickly during model training.
The multi-layer GRU model used in this method includes two GRU unit layers and one fully connected layer. The first GRU unit layer constructs features from the input data (for example, input data formed from the disease monitoring data, weather data, and public opinion data) to obtain first hidden layer units, and the second GRU unit layer combines the first hidden layer units to obtain second hidden layer units. The fully connected layer obtains the predicted value from the second hidden layer units. The first hidden layer units are local features and the second hidden layer units are global features. In other words, the first GRU unit layer extracts local information, the second GRU unit layer combines the local features into global features, and the fully connected layer produces the prediction result (for example, the disease prediction result) from the global features.
The GRU unit layer includes an update gate $z_t$ and a reset gate $r_t$. The update gate $z_t$ is the logic gate that updates the hidden layer unit $h_t$. The reset gate $r_t$ decides whether the previous hidden layer unit $h_{t-1}$ is discarded when the candidate hidden layer unit $\tilde{h}_t$ is formed.
In an embodiment, the update gate $z_t$, the reset gate $r_t$, the candidate hidden layer unit $\tilde{h}_t$, and the hidden layer unit $h_t$ of the GRU unit layer are calculated as follows:
$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z);$$
$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r).$$
After the update gate $z_t$ and the reset gate $r_t$ are obtained, the outputs (the candidate hidden layer unit $\tilde{h}_t$ and the hidden layer unit $h_t$) are obtained:
Figure PCTCN2018099612-appb-000004
where $\sigma$ is the sigmoid activation function, $\tanh$ is the hyperbolic tangent activation function, $W_z$, $U_z$, $b_z$ are the parameters of the update gate $z_t$, $W_r$, $U_r$, $b_r$ are the parameters of the reset gate $r_t$, and $W$, $U$, $b$ are the parameters of the candidate hidden layer unit $\tilde{h}_t$.
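As an illustration of how one GRU unit layer could advance by a single time point, the following minimal Python sketch applies the two gate equations given above. The expressions for the candidate hidden layer unit and the new hidden layer unit survive only as a formula image in this text, so the forms used here (candidate = tanh(W x + U (r * h_prev) + b), new state = (1 - z) * h_prev + z * candidate) follow a common textbook GRU convention and are an assumption, not necessarily the formula of the application. The names `sigmoid`, `affine`, `gru_step`, and the parameter dictionary keys are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(W, x, U, h, b):
    """Return W.x + U.h + b for list-of-lists matrices and list vectors."""
    return [
        sum(w * xi for w, xi in zip(W_row, x))
        + sum(u * hi for u, hi in zip(U_row, h))
        + b_i
        for W_row, U_row, b_i in zip(W, U, b)
    ]

def gru_step(x, h_prev, p):
    """One GRU time step; p holds 'Wz','Uz','bz','Wr','Ur','br','W','U','b'."""
    # Update gate and reset gate, as in the two equations above.
    z = [sigmoid(v) for v in affine(p["Wz"], x, p["Uz"], h_prev, p["bz"])]
    r = [sigmoid(v) for v in affine(p["Wr"], x, p["Ur"], h_prev, p["br"])]
    # Assumed convention: the reset gate scales the previous state element-wise.
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    h_cand = [math.tanh(v) for v in affine(p["W"], x, p["U"], gated, p["b"])]
    # Assumed convention: the update gate interpolates old state and candidate.
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, h_cand)]
```

With all parameters set to zero, both gates evaluate to 0.5 and the candidate to 0, so the new state is half of the previous state, which is a quick sanity check of the wiring.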
Step 106: obtain training data and verification data from the preprocessed disease monitoring data, weather data, and public opinion data, and use the training data and the verification data to train the multi-layer GRU model and verify its performance, obtaining an optimized multi-layer GRU model.
Time series data may be extracted from the preprocessed disease monitoring data, weather data, and public opinion data to form the training data and the verification data.
The input data of the multi-layer GRU model is a vector of a preset dimension (for example, 1000 dimensions). From the extracted time series data, the preprocessed disease monitoring data, weather data, and public opinion data corresponding to each time point may be assembled into one vector of the preset dimension, and the vectors for the successive time points are input into the multi-layer GRU model in time order to train or verify the model.
For example, first time series data for training the multi-layer GRU model is extracted from the preprocessed disease monitoring data, weather data, and public opinion data; from the first time series data, the preprocessed disease monitoring data, weather data, and public opinion data corresponding to each time point are assembled into a first vector of the preset dimension, and the first vectors for the successive time points are input into the multi-layer GRU model in time order to train the model. Second time series data for verifying the multi-layer GRU model is likewise extracted from the preprocessed disease monitoring data, weather data, and public opinion data; from the second time series data, the preprocessed data corresponding to each time point are assembled into a second vector of the preset dimension, and the second vectors for the successive time points are input into the multi-layer GRU model in time order to verify the model.
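The construction of per-time-point vectors and the separation into training and verification series can be sketched as follows. This is only an illustration under stated assumptions: the three series are taken to be already aligned by time point, the vector is formed by simple concatenation, and the 80/20 chronological split is an arbitrary choice; none of these details, nor the function names, come from the application.

```python
def build_vectors(disease, weather, opinion):
    """Concatenate three aligned series into one feature vector per time point."""
    if not (len(disease) == len(weather) == len(opinion)):
        raise ValueError("series must cover the same time points")
    return [list(d) + list(w) + list(o)
            for d, w, o in zip(disease, weather, opinion)]

def chronological_split(vectors, train_fraction=0.8):
    """Keep time order: the earlier part trains, the later part verifies."""
    cut = int(len(vectors) * train_fraction)
    return vectors[:cut], vectors[cut:]
```

Splitting by position rather than at random keeps every verification time point later than every training time point, which matches the time-ordered input described above.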
When the multi-layer GRU model is trained, its loss function may be defined as the mean squared error, and the parameters of the multi-layer GRU model are adjusted so that the mean squared error is minimized. The RMSprop algorithm may be used for training. RMSprop is an improved stochastic gradient descent algorithm. The mean squared error and the RMSprop algorithm are prior art and are not described further here.
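For readers unfamiliar with these two prior-art building blocks, a minimal sketch is given below. The learning rate, decay factor, and epsilon are commonly used defaults chosen here as assumptions; the application does not specify them, and the function names are hypothetical.

```python
import math

def mean_squared_error(predicted, actual):
    """Average of squared differences between predictions and observations."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

def rmsprop_update(params, grads, cache, lr=0.001, decay=0.9, eps=1e-8):
    """Return (new_params, new_cache) after one RMSprop step.

    cache is the running average of squared gradients, one entry per parameter.
    """
    new_cache = [decay * c + (1 - decay) * g * g for c, g in zip(cache, grads)]
    new_params = [p - lr * g / (math.sqrt(c) + eps)
                  for p, g, c in zip(params, grads, new_cache)]
    return new_params, new_cache
```

Dividing each gradient by the root of its running squared average gives every parameter its own effective step size, which is the sense in which RMSprop refines plain stochastic gradient descent.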
Step 107: obtain, from the preprocessed disease monitoring data, weather data, and public opinion data, the disease monitoring data, weather data, and public opinion data preceding a prediction time point, and input them into the optimized multi-layer GRU model to obtain a disease prediction result for the prediction time point.
The disease monitoring data, weather data, and public opinion data obtained for the period preceding the prediction time point are time series data. From these data, the preprocessed disease monitoring data, weather data, and public opinion data corresponding to each time point may be assembled into a third vector of the preset dimension, and the third vectors for the successive time points are input into the multi-layer GRU model in time order to perform disease prediction for the prediction time point.
When disease prediction is performed, starting from the initial time point, the optimized multi-layer GRU model combines, layer by layer, the input data of the current time point with the hidden layer units of the previous time point to obtain the hidden layer units of the current time point, and obtains the predicted value for the current time point from those hidden layer units. Following the time order, it recursively obtains the hidden layer units and the predicted value for each subsequent time point until the predicted value for the given prediction time point is obtained.
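The recursion described above can be sketched as a loop that carries two hidden states forward in time. In this illustration the two layer functions and the fully connected layer are passed in as plain callables; they stand in for trained GRU unit layers and are hypothetical placeholders, as is the function name `predict`.

```python
def predict(sequence, layer1_step, layer2_step, dense, h1, h2):
    """Feed vectors in time order; return the prediction after the last one."""
    prediction = None
    for x in sequence:
        h1 = layer1_step(x, h1)   # first layer: current input + its previous state
        h2 = layer2_step(h1, h2)  # second layer: first-layer output + its previous state
        prediction = dense(h2)    # fully connected layer maps the state to a value
    return prediction
```

Only the states `h1` and `h2` persist between iterations, so the prediction for the final time point depends on the whole preceding series through them.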
Embodiment 1 predicts disease data with a multi-layer GRU model. The GRU model can extract knowledge directly from the data and construct feature vectors that favor prediction, which improves prediction accuracy. In addition, Embodiment 1 incorporates weather data and public opinion data into the disease prediction as influencing factors, which improves the accuracy of the disease prediction. Furthermore, compared with a disease prediction method based on a long short-term memory (LSTM) model, the GRU model has a simpler structure and can be optimized quickly, which speeds up the whole disease prediction process. Embodiment 1 therefore achieves fast and highly accurate disease prediction.
Embodiment 2
FIG. 2 is a detailed flowchart of acquiring the weather data related to the disease monitoring data (that is, step 102 in FIG. 1) in the disease prediction method provided in Embodiment 2 of the present application.
The weather data may be captured by a web crawler through an API opened by a weather information website. Referring to FIG. 2, the process may specifically include the following steps:
Step 201: generate a seed URL and subsequent URLs for the API of the weather information website.
The seed URL is the basis and prerequisite for all of the web crawler's work. There may be one seed URL or several.
The structure of the weather information website's URLs may be analyzed, and the subsequent URLs derived from that structure.
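Step 201 can be illustrated with a short sketch that derives one URL per day from a base address. The endpoint `https://weather.example/api/daily` and the query parameter names `city` and `date` are invented for illustration and do not describe any real weather service; an actual API would dictate its own URL structure.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

def build_urls(base_url, city, start, end):
    """Return one request URL per day from start to end, inclusive."""
    urls = []
    day = start
    while day <= end:
        # Hypothetical query layout: ?city=<name>&date=<YYYY-MM-DD>
        query = urlencode({"city": city, "date": day.isoformat()})
        urls.append(f"{base_url}?{query}")
        day += timedelta(days=1)
    return urls
```

The first URL in the returned list plays the role of the seed URL and the remainder are the subsequent URLs.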
Step 202: send an HTTP request to the API of the weather information website to request access to the API.
The HTTP request may be sent to the API of the weather information website using the GET method. When the weather information website agrees to provide its weather data, it returns an HTTP response indicating that the weather data may be acquired.
Step 203: analyze and identify the data content provided by the weather information website in order to read the data content.
The weather information website provides data content in a specific format, which must be analyzed and identified before the data content can be read. For example, the API of the weather information website provides data in JSON format. JSON is a data interchange format that uses conventions similar to those of the C family of languages. The JSON data content is analyzed and identified in order to read it.
Step 204: determine whether the data content is predetermined information content.
To obtain specific weather data, it is necessary to determine whether the data content is the predetermined information content. If it is not, the data content is discarded; otherwise, the next step is performed.
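Steps 203 and 204 might be combined as in the sketch below, which decodes a JSON response body and keeps it only if it carries the expected fields. The field names in `REQUIRED_FIELDS` are hypothetical; the application does not say how the predetermined information content is recognized, so a required-field check is an assumption.

```python
import json

REQUIRED_FIELDS = {"date", "avg_temp", "humidity"}  # hypothetical field names

def parse_if_expected(body):
    """Return the decoded record if it is a JSON object with the required fields, else None."""
    try:
        record = json.loads(body)
    except json.JSONDecodeError:
        return None  # not valid JSON: discard
    if isinstance(record, dict) and REQUIRED_FIELDS.issubset(record):
        return record
    return None  # valid JSON but not the predetermined content: discard
```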
Step 205: if the data content is the predetermined information content, capture the data content.
The ultimate purpose of data capture is to bring network data content to the local device. For data content in JSON format, a depth-first search strategy may be used to search the state space when capturing the data content.
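One way to read "depth-first search" for JSON content is a recursive walk that descends into each nested object or array before moving to its sibling. The sketch below is such a walk over an already decoded JSON structure; the function name `walk` and the path representation are illustrative choices.

```python
def walk(node, path=()):
    """Yield (path, value) for every leaf of a decoded JSON structure, depth first."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from walk(child, path + (key,))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from walk(child, path + (index,))
    else:
        yield path, node
```

Each yielded path records the keys and indexes leading to a value, so a nested weather record can be flattened into named fields before it is stored.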
Step 206: save the captured data content locally as the weather data.
A database may be created on the computing device, and the weather data saved in that database.
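As a sketch of step 206, the captured records could be stored with Python's built-in `sqlite3` module. The choice of SQLite, the table name `weather`, and its three columns are assumptions made for illustration only.

```python
import sqlite3

def save_weather(conn, rows):
    """Store (day, avg_temp, humidity) tuples; a repeated day overwrites the old row."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS weather ("
        "day TEXT PRIMARY KEY, avg_temp REAL, humidity REAL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO weather (day, avg_temp, humidity) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
```

Using the day as the primary key means that capturing the same date twice leaves a single, up-to-date row.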
A conventional web crawler first sets one or more entry URLs. While crawling, it extracts new URLs from the current web page according to its crawling strategy and places them in a queue, fetches the web page content corresponding to each URL, saves that content locally, and then extracts valid addresses as the next entry URLs until crawling is complete. As the number of web pages grows rapidly, a conventional web crawler downloads a large number of irrelevant pages. Capturing the weather data with a web crawler through the API opened by the weather information website avoids downloading irrelevant pages and acquires the weather data efficiently, which improves the efficiency of disease prediction.
Embodiment 3
FIG. 3 is a structural diagram of a disease prediction apparatus provided in Embodiment 3 of the present application. As shown in FIG. 3, the disease prediction apparatus 10 may include a first acquisition unit 301, a second acquisition unit 302, a third acquisition unit 303, a preprocessing unit 304, a construction unit 305, an optimization unit 306, and a prediction unit 307.
The first acquisition unit 301 is configured to acquire disease monitoring data, where the disease monitoring data is time series data.
The disease monitoring data may include case data for diseases such as influenza, hand, foot and mouth disease, measles, and mumps.
A disease monitoring network composed of multiple monitoring points may be established in a preset area (for example, a province, city, or district). Disease monitoring data is acquired from the monitoring points and forms the time series data for disease monitoring. Medical institutions, schools and child care institutions, pharmacies, and similar sites may be selected as monitoring points, each carrying out disease monitoring and data collection for its target population. Sites that satisfy preset conditions may be selected as monitoring points. The preset conditions may include headcount, scale, and the like. For example, schools and child care institutions whose number of students reaches a preset number are selected as monitoring points. As another example, pharmacies whose scale (for example, measured by daily turnover) reaches a preset scale are selected as monitoring points. As a further example, hospitals whose scale (for example, measured by the daily number of patients seeking treatment) reaches a preset scale are selected as monitoring points.
Disease monitoring data from different times forms the time series data for disease monitoring. For example, disease monitoring data collected daily may form the time series data. Alternatively, disease monitoring data collected weekly may form the time series data.
Medical institutions (mainly hospitals) are the sites best able to capture early signs of a disease outbreak and are the first choice for disease monitoring. Disease monitoring data may be obtained from patient visits.
Some patients buy medicine at pharmacies on their own to relieve early symptoms, so disease monitoring data may be obtained from pharmacy drug sales.
Children and adolescents are a high-risk population for disease and an important link in disease transmission, so monitoring of this population should also be strengthened. Schools and child care institutions are good sites for monitoring the incidence of disease among children and adolescents. Disease monitoring data may be obtained from the sick-leave records of children and adolescents at schools and child care institutions.
Accordingly, the present application mainly selects three types of sites for collecting disease monitoring data: medical institutions, schools and child care institutions, and pharmacies. This choice of data sources does not prevent other embodiments from adding or substituting other populations or sites of particular concern as monitoring data sources. For example, hotels may be included in the scope of disease monitoring so that disease monitoring data is obtained for hotel guests.
As needed, the disease monitoring data collected by any one type of monitoring point (for example, medical institutions) may form the time series data for disease monitoring. For example, the disease monitoring data collected by hospitals may be used. Alternatively, the disease monitoring data collected by several types of monitoring points may be combined. For example, the disease monitoring data collected by hospitals may be used as the main source and supplemented by the disease monitoring data collected by pharmacies.
The disease monitoring data may include case data such as the number of visits, visit rate, number of cases, and incidence rate for a disease. For example, the daily number of visits for a disease (for example, influenza) may be obtained from a medical institution (for example, a hospital) and used as the disease monitoring data. As another example, the daily number of cases of a disease (for example, influenza) among students may be obtained from a school and used as the disease monitoring data.
The second acquisition unit 302 is configured to acquire weather data related to the disease monitoring data, where the weather data is time series data corresponding to the disease monitoring data.
Weather data related to the disease monitoring data is weather data that influences the disease monitoring data (that is, the case data for the disease). The influence of different weather data on the disease monitoring data may be analyzed in advance, and the weather data that influences, or strongly influences, the disease monitoring data determined from the analysis results.
The weather data may include humidity, air temperature, air pressure, precipitation, water vapor pressure, wind speed, wind direction, and sunshine hours. In a specific embodiment, the weather data may include the daily mean temperature, mean air pressure, maximum temperature, minimum temperature, mean relative humidity, minimum relative humidity, precipitation, mean wind speed, sunshine hours, and mean water vapor pressure.
The weather data covers the same time period as the disease monitoring data and has the same statistical period (for example, daily or weekly). For example, if the disease monitoring data is the daily number of visits for January to February 2018, the weather data is the daily weather data for January to February 2018. As another example, if the disease monitoring data is the weekly number of visits for January to December 2017, the weather data is the weekly weather data (for example, the weekly mean temperature) for January to December 2017.
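When the disease monitoring data is weekly but the weather source is daily, the weather series has to be brought to the same statistical period. The sketch below averages daily values per ISO calendar week; grouping by ISO week and using the mean are assumptions, since the application gives only "weekly mean temperature" as an example, and the function name is hypothetical.

```python
from collections import defaultdict
from datetime import date

def weekly_mean(daily):
    """Map (ISO year, ISO week) -> mean of the daily values falling in that week."""
    buckets = defaultdict(list)
    for day, value in daily.items():
        iso = day.isocalendar()
        buckets[(iso[0], iso[1])].append(value)
    return {week: sum(vals) / len(vals) for week, vals in buckets.items()}
```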
The weather data may be captured from weather information websites (for example, China Weather Network, Sina Weather, or Sohu Weather) to improve the reliability of the weather data. It will be understood that the weather data may also be captured from any web page.
Weather data for a predetermined area may be captured. The predetermined area may be a province, city, district, or the like. For example, the weather data for Shenzhen is captured.
Weather data for a predetermined time may be captured. The predetermined time may be a year, month, day, or the like. For example, the daily weather data for January to February 2018 is captured.
The weather data may be captured by a web crawler. A web crawler is an application that automatically extracts information from web pages. A web crawler usually starts from the URLs of one or more initial web pages (also called seed URLs) and, following a specific algorithm and strategy (for example, a depth-first search strategy), continually extracts new URLs from the current web page while crawling and places them in the corresponding queue until a stop condition is met. URL stands for Uniform Resource Locator.
The weather data may be captured through an API opened by a weather information website (for example, the API opened by China Weather Network). API stands for application programming interface; computer programs can communicate with one another through an API. The API opened by a weather information website may return data in JSON or XML format.
In a specific embodiment, the weather data may be captured by a web crawler through the API opened by the weather information website. See FIG. 2 for the specific process.
The third acquisition unit 303 is configured to acquire public opinion data related to the disease monitoring data, where the public opinion data is time series data corresponding to the disease monitoring data.
Public opinion data related to the disease monitoring data is public opinion data that reflects the disease monitoring data. For example, when a disease (for example, influenza) enters its epidemic period and the number of patients rises, many people search online for disease-related terms (for example, specific words such as influenza, Tamiflu, or high fever), and the search volume for these terms increases sharply. Likewise, when a disease enters its epidemic period and the number of patients rises, more disease-related content (for example, information on cases or treatment) is posted on public opinion websites such as news sites, forums, blogs, and message boards. Public opinion data related to the disease monitoring data can therefore be used to assist disease prediction.
The public opinion data may include the number of searches for specific words. For example, the number of searches for the specific words on a preset search engine may be counted (for example, the daily number of searches for the specific words on the preset search engine in a specific region).
The public opinion data may also include the number of items of public opinion information containing the specific words on specific public opinion websites (for example, news sites, forums, blogs, and message boards).
The specific words are words related to the disease being predicted, for example, words related to the symptoms of the disease. When the disease being predicted is influenza, the specific words may include sudden onset, high fever, chills, headache, weakness, sore throat, muscle aches, dry cough, and the like. When the disease being predicted is hand, foot and mouth disease, the specific words may include mouth pain, loss of appetite, low-grade fever, small blisters on the hands, small mouth ulcers, and the like.
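Counting public opinion items that contain the specific words could look like the sketch below. It assumes each item is available as a (day, text) pair and uses plain substring matching; both the input shape and the matching rule are simplifications chosen for illustration, and real text (particularly Chinese text) would likely need proper tokenization.

```python
from collections import Counter

def daily_mentions(posts, keywords):
    """Return Counter mapping day -> number of posts containing any keyword."""
    counts = Counter()
    for day, text in posts:
        if any(k in text for k in keywords):
            counts[day] += 1
    return counts
```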
The public opinion data covers the same time period as the disease monitoring data and has the same statistical period (for example, daily or weekly). For example, if the disease monitoring data is the daily number of visits for January to February 2018, the public opinion data is the daily public opinion data for January to February 2018 (for example, the daily number of searches for the specific words). As another example, if the disease monitoring data is the weekly number of visits for January to December 2017, the public opinion data is the weekly public opinion data for January to December 2017 (for example, the weekly number of searches for the specific words).
The preprocessing unit 304 is configured to preprocess the disease monitoring data, weather data, and public opinion data.
Preprocessing of the disease monitoring data, weather data, and public opinion data may include abnormal data processing. Abnormal data processing corrects abnormal data in the disease monitoring data, weather data, and public opinion data, improving the reliability and accuracy of disease prediction.
The abnormal data processing may include filling missing values in the disease monitoring data, weather data, and public opinion data. A missing value may be filled with the mean or median of the data before and after it, or by regression fitting.
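The first filling option, using the mean of the data before and after a gap, can be sketched as follows. Here a missing value is represented by `None`, and "before and after" is read as the nearest known value on each side; both are assumptions, and the function name is hypothetical.

```python
def fill_missing(series):
    """Replace None entries with the mean of the nearest known values on each side."""
    filled = list(series)
    for i, value in enumerate(series):
        if value is None:
            before = next((v for v in reversed(series[:i]) if v is not None), None)
            after = next((v for v in series[i + 1:] if v is not None), None)
            known = [v for v in (before, after) if v is not None]
            if known:  # leave the gap if the whole series is missing
                filled[i] = sum(known) / len(known)
    return filled
```

A gap at either end of the series has only one known neighbour, so it is filled with that neighbour's value.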
The abnormal data processing may also include correcting outliers in the disease monitoring data, weather data, and public opinion data. An outlier is a value that deviates markedly from the other data. Outliers may be corrected by interpolation.
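The application does not state how a "marked" deviation is detected, so the sketch below assumes one possible rule: a value is treated as an outlier when it lies more than `k` median absolute deviations from the median. Each interior outlier is then replaced by the linear interpolation (midpoint) of its two neighbours. The threshold `k = 3.0` and the function name are likewise assumptions.

```python
import statistics

def correct_outliers(series, k=3.0):
    """Replace interior outliers with the midpoint of their immediate neighbours."""
    median = statistics.median(series)
    mad = statistics.median(abs(v - median) for v in series)
    fixed = list(series)
    for i in range(1, len(series) - 1):
        # Assumed rule: outlier if farther than k * MAD from the median.
        if mad > 0 and abs(series[i] - median) > k * mad:
            fixed[i] = (series[i - 1] + series[i + 1]) / 2
    return fixed
```

The first and last values are left unchanged because they have only one neighbour to interpolate from.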
Preprocessing of the disease monitoring data, weather data, and public opinion data may also include data format conversion. For example, the disease monitoring data, weather data, and public opinion data are standardized so that they share a consistent standard format suitable for use as input data of the GRU model.
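The application does not specify the standardization used. One common choice, shown here purely as an assumption, is z-score standardization, which rescales each series to zero mean and unit standard deviation so that visit counts, temperatures, and search counts become comparable in magnitude.

```python
import statistics

def standardize(series):
    """Rescale a series to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)
    if std == 0:
        return [0.0 for _ in series]  # constant series carries no variation
    return [(v - mean) / std for v in series]
```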
The construction unit 305 is configured to construct a multi-layer gated recurrent unit (GRU) neural network model, that is, a multi-layer GRU model. The multi-layer GRU model includes two GRU unit layers and one fully connected layer. The first GRU unit layer constructs features from the input data (for example, input data composed of the disease monitoring data, weather data, and public opinion data) to obtain first hidden layer units; the second GRU unit layer combines the first hidden layer units to obtain second hidden layer units; and the fully connected layer obtains a prediction result (for example, a disease prediction result) from the second hidden layer units. Each GRU unit layer includes a reset gate and an update gate, which control the memory state of the GRU unit layer.
The GRU model is a recurrent neural network model. Compared with a conventional recurrent neural network (RNN) model, the GRU model stores information through gates built into the GRU unit layer, so the gradient does not vanish quickly during model training.
The multi-layer GRU model used in this method includes two GRU unit layers and one fully connected layer. The first GRU unit layer constructs features from the input data (for example, input data composed of disease monitoring data, weather data, and public opinion data) to obtain first hidden layer units, and the second GRU unit layer combines the first hidden layer units to obtain second hidden layer units. The fully connected layer obtains a predicted value from the second hidden layer units. The first hidden layer units are local features, and the second hidden layer units are global features. In other words, the first GRU unit layer extracts local information, the second GRU unit layer combines the local features to obtain global features, and the fully connected layer obtains a prediction result (for example, a disease prediction result) from the global features.
The GRU unit layer includes an update gate $z_t$ and a reset gate $r_t$. The update gate $z_t$ is the logic gate that updates the hidden layer unit $h_t$. The reset gate $r_t$ decides whether the previous hidden layer unit $h_{t-1}$ is discarded when the candidate hidden layer unit $\tilde{h}_t$ is formed.
In an embodiment, the update gate $z_t$, the reset gate $r_t$, the candidate hidden layer unit $\tilde{h}_t$, and the hidden layer unit $h_t$ of the GRU unit layer are calculated as follows:
$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z);$$
$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r).$$
After the update gate $z_t$ and the reset gate $r_t$ are obtained, the outputs (the candidate hidden layer unit $\tilde{h}_t$ and the hidden layer unit $h_t$) are obtained:
Figure PCTCN2018099612-appb-000009
where $\sigma$ is the sigmoid activation function, $\tanh$ is the hyperbolic tangent activation function, $W_z$, $U_z$, $b_z$ are the parameters of the update gate $z_t$, $W_r$, $U_r$, $b_r$ are the parameters of the reset gate $r_t$, and $W$, $U$, $b$ are the parameters of the candidate hidden layer unit $\tilde{h}_t$.
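As a non-limiting illustration, one time step of a GRU unit layer might be sketched as follows. The two gate formulas follow the text above. The source gives the formulas for the candidate hidden layer unit and the hidden layer unit only as an image, so the sketch assumes the commonly used form $\tilde{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1}) + b)$ and $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$; the opposite convention for $z_t$ is equally common. The function names and the parameter dictionary layout are hypothetical.

```python
import math

def sigmoid(v):
    """Element-wise logistic function."""
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def affine(W, x, U, h, b):
    """Compute W x + U h + b for list-of-lists matrices and list vectors."""
    return [
        sum(wi * xi for wi, xi in zip(W_row, x))
        + sum(ui * hi for ui, hi in zip(U_row, h))
        + bi
        for W_row, U_row, bi in zip(W, U, b)
    ]

def gru_step(x_t, h_prev, p):
    """One GRU time step; p maps names to parameters (hypothetical layout)."""
    z = sigmoid(affine(p["Wz"], x_t, p["Uz"], h_prev, p["bz"]))  # update gate
    r = sigmoid(affine(p["Wr"], x_t, p["Ur"], h_prev, p["br"]))  # reset gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)]               # r_t * h_{t-1}
    # Candidate state: assumed standard form (the source formula is an image).
    h_cand = [math.tanh(a) for a in affine(p["W"], x_t, p["U"], gated, p["b"])]
    # Assumed convention: z_t weights the candidate, (1 - z_t) the old state.
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, h_cand)]
```

With this convention, a reset gate near zero makes the candidate ignore the previous hidden layer unit, and an update gate near zero carries the previous hidden layer unit forward unchanged.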
The optimization unit 306 is configured to obtain training data and validation data from the preprocessed disease monitoring data, weather data, and public opinion data, and to train the multi-layer GRU model and verify its performance using the training data and the validation data, to obtain an optimized multi-layer GRU model.
Time series data may be extracted from the preprocessed disease monitoring data, weather data, and public opinion data to constitute the training data and the validation data.
The input data of the multi-layer GRU model is a vector of a preset dimension (for example, 1000 dimensions). For each time point in the extracted time series data, the corresponding preprocessed disease monitoring data, weather data, and public opinion data may be assembled into one vector of the preset dimension, and the vectors for the successive time points are input into the multi-layer GRU model in chronological order to train or validate the model.
For example, first time series data for training the multi-layer GRU model is extracted from the preprocessed disease monitoring data, weather data, and public opinion data; for each time point in the first time series data, the corresponding preprocessed disease monitoring data, weather data, and public opinion data are assembled into a first vector of the preset dimension, and the first vectors for the successive time points are input into the multi-layer GRU model in chronological order to train the model. Second time series data for validating the multi-layer GRU model is extracted from the preprocessed disease monitoring data, weather data, and public opinion data; for each time point in the second time series data, the corresponding preprocessed disease monitoring data, weather data, and public opinion data are assembled into a second vector of the preset dimension, and the second vectors for the successive time points are input into the multi-layer GRU model in chronological order to validate the model.
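As a non-limiting illustration, the construction of one vector per time point and the extraction of the first and second time series might be sketched as follows. The sketch assumes that the vector is a plain concatenation of the three feature groups and that the earlier part of the series is used for training and the later part for validation; neither choice is specified in the text, and the function names and the 80/20 split are hypothetical.

```python
def build_vectors(disease, weather, opinion):
    """Concatenate the three aligned series into one vector per time point."""
    if not (len(disease) == len(weather) == len(opinion)):
        raise ValueError("series must cover the same time points")
    return [list(d) + list(w) + list(o)
            for d, w, o in zip(disease, weather, opinion)]

def chronological_split(vectors, train_fraction=0.8):
    """Earlier time points go to training, later ones to validation."""
    cut = int(len(vectors) * train_fraction)
    return vectors[:cut], vectors[cut:]
```

Splitting by time rather than at random keeps every validation point later than every training point, which matches how the model is used for prediction.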
When the multi-layer GRU model is trained, its loss function may be defined as the mean squared error, and the parameters of the model are adjusted so that the mean squared error is minimized. Training may use the RMSprop algorithm, an improved stochastic gradient descent algorithm. The mean squared error and the RMSprop algorithm are prior art and are not described further here.
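For readers unfamiliar with these two techniques, a minimal sketch follows. It shows the mean squared error and a single RMSprop update, applied to a one-parameter toy model rather than to the multi-layer GRU model. The learning rate, decay rate, and toy data are illustrative assumptions.

```python
import math

def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def rmsprop_step(params, grads, cache, lr=0.01, rho=0.9, eps=1e-8):
    """Apply one RMSprop update in place and return the parameters."""
    for i, g in enumerate(grads):
        cache[i] = rho * cache[i] + (1 - rho) * g * g  # running mean of g^2
        params[i] -= lr * g / (math.sqrt(cache[i]) + eps)
    return params

# Toy use: fit y = w * x to data generated with w = 2.
xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
w, cache = [0.0], [0.0]
for _ in range(1000):
    # Gradient of the mean squared error with respect to w.
    grad = sum(2 * (w[0] * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    rmsprop_step(w, [grad], cache)
final_loss = mse([w[0] * x for x in xs], ys)
```

RMSprop divides each gradient by a running root-mean-square of recent gradients, so the effective step size adapts per parameter.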
The prediction unit 307 is configured to obtain, from the preprocessed disease monitoring data, weather data, and public opinion data, the disease monitoring data, weather data, and public opinion data preceding a prediction time point, and to input them into the optimized multi-layer GRU model to obtain a disease prediction result for the prediction time point.
The disease monitoring data, weather data, and public opinion data obtained for the period preceding the prediction time point are time series data. For each time point in these data, the corresponding preprocessed disease monitoring data, weather data, and public opinion data may be assembled into a third vector of the preset dimension, and the third vectors for the successive time points are input into the multi-layer GRU model in chronological order to perform disease prediction for the prediction time point.
During disease prediction, starting from the initial time point, the optimized multi-layer GRU model combines, layer by layer, the input data of the current time point with the hidden layer units of the previous time point to obtain the hidden layer units of the current time point, and obtains the predicted value of the current time point from those hidden layer units. Proceeding in chronological order, it recursively obtains the hidden layer units and the predicted value of each next time point until the predicted value for the given time point is obtained.
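As a non-limiting illustration, the recursion described above might be sketched as follows. The step functions `layer1_step` and `layer2_step` and the function `dense` are hypothetical placeholders for the two GRU unit layers and the fully connected layer; any callables with these signatures can be supplied.

```python
def predict(sequence, layer1_step, layer2_step, dense, h1, h2):
    """Run the stacked recurrence over a sequence; return the last prediction."""
    y = None
    for x_t in sequence:           # time points in chronological order
        h1 = layer1_step(x_t, h1)  # local features: input + previous h1
        h2 = layer2_step(h1, h2)   # global features: h1 + previous h2
        y = dense(h2)              # predicted value at the current time point
    return y
```

Only the hidden layer units are carried from one time point to the next, so the prediction for the final time point depends on the whole preceding sequence.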
Embodiment 3 predicts disease data with a multi-layer GRU model. The GRU model can extract knowledge directly from the data and construct feature vectors that favor prediction, which improves prediction accuracy. In addition, Embodiment 3 incorporates weather data and public opinion data into the disease prediction as influencing factors, which improves the accuracy of the disease prediction. Furthermore, compared with disease prediction methods based on the LSTM (Long Short-Term Memory) model, the GRU model has a simpler structure and can be optimized quickly, which speeds up the whole disease prediction process. Embodiment 3 therefore achieves fast, high-accuracy disease prediction.
Embodiment 4
FIG. 4 is a detailed structural diagram of the second acquisition unit (that is, 302 in FIG. 3) in the disease prediction apparatus provided in Embodiment 4 of the present application.
The second acquisition unit 302 may use the API interface opened by a weather information website to capture the weather data with a web crawler. Referring to FIG. 4, the second acquisition unit 302 may include a generating subunit 3021, a requesting subunit 3022, an analyzing subunit 3023, a determining subunit 3024, a capturing subunit 3025, and a storage subunit 3026.
The generating subunit 3021 is configured to generate a seed URL and subsequent URLs targeting the API interface of the weather information website.
The seed URL is the basis and precondition for all of the web crawler's work. There may be one seed URL or several.
The structural characteristics of the weather information website's URLs may be analyzed, and the subsequent URLs are derived from those structural characteristics.
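As a non-limiting illustration, deriving subsequent URLs from a fixed URL structure might be sketched as follows. The endpoint `https://weather.example.com/api/v1/daily` and the query parameter names `city` and `date` are hypothetical; a real weather information website defines its own URL structure.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and parameter names; a real weather API will differ.
BASE_URL = "https://weather.example.com/api/v1/daily"

def build_urls(city_codes, dates):
    """Yield one API URL per (city, date) pair, following a fixed URL pattern."""
    for city in city_codes:
        for day in dates:
            yield f"{BASE_URL}?{urlencode({'city': city, 'date': day})}"
```

The first URL yielded can serve as the seed URL, and the remaining ones as the subsequent URLs.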
The requesting subunit 3022 is configured to send an HTTP request to the API interface of the weather information website to request access to the API interface.
The HTTP request may be sent to the API interface of the weather information website using the GET method. When the weather information website agrees to provide its weather data, it returns an HTTP response indicating that the weather data may be acquired.
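As a non-limiting illustration, the GET request might be issued with the Python standard library as sketched below. The `Accept` header and the timeout value are assumptions, and the function names are hypothetical.

```python
import urllib.request

def make_get_request(url):
    """Build (but do not send) an HTTP GET request for the given API URL."""
    return urllib.request.Request(
        url, method="GET", headers={"Accept": "application/json"}
    )

def fetch(url, timeout=10):
    """Send the GET request and return the body as text if the status is 200."""
    with urllib.request.urlopen(make_get_request(url), timeout=timeout) as resp:
        if resp.status != 200:
            return None
        return resp.read().decode("utf-8")
```

By default `urlopen` raises `urllib.error.HTTPError` for error statuses, so a caller would normally wrap `fetch` in a `try` block.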
The analyzing subunit 3023 is configured to analyze and identify the data content provided by the weather information website in order to view the data content.
The weather information website provides data content in a specific format, which must be analyzed and identified before the data content can be viewed. For example, the API interface of the weather information website provides data in JSON format. JSON is a data interchange format that uses conventions similar to those of the C language. The data content in JSON format is analyzed and identified in order to view it.
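As a non-limiting illustration, analyzing a JSON response body might be sketched as follows. The function name `parse_payload` is hypothetical.

```python
import json

def parse_payload(text):
    """Parse a JSON response body; return None if it is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None
```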
The determining subunit 3024 is configured to determine whether the data content is predetermined information content.
To obtain specific weather data, it is necessary to determine whether the data content is the predetermined information content. If the data content is not the predetermined information content, it is discarded; otherwise the next step is performed.
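As a non-limiting illustration, the determination might be sketched as a check for required fields. The text does not say how the predetermined information content is recognized, so the field names in `REQUIRED_FIELDS` and the function names are hypothetical.

```python
# Hypothetical set of fields that marks a record as the wanted weather data.
REQUIRED_FIELDS = {"date", "temperature", "humidity"}

def is_target_content(record):
    """Return True if the record is a dict carrying every required field."""
    return isinstance(record, dict) and REQUIRED_FIELDS.issubset(record)

def filter_records(records):
    """Discard records that are not the predetermined information content."""
    return [r for r in records if is_target_content(r)]
```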
The capturing subunit 3025 is configured to capture the data content if the data content is the predetermined information content.
The ultimate purpose of data capture is to bring network data content onto the local machine. For data content in JSON format, a depth-first search strategy may be used to search the state space when capturing the data content.
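As a non-limiting illustration, a depth-first traversal of parsed JSON content might be sketched as follows. The function name `dfs_leaves` and the path representation are hypothetical.

```python
def dfs_leaves(node, path=()):
    """Depth-first walk over parsed JSON, yielding (path, leaf value) pairs."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from dfs_leaves(child, path + (key,))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from dfs_leaves(child, path + (index,))
    else:
        yield path, node
```

Each nested object or array is explored completely before its next sibling, so every leaf value is reported together with the full path that leads to it.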
The storage subunit 3026 is configured to save the captured data content locally as the weather data.
A database may be created on the computing device, and the weather data saved to that database.
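As a non-limiting illustration, saving the weather data to a local database might be sketched with SQLite as follows. The text does not name a database product, and the table name and columns are hypothetical.

```python
import sqlite3

def save_weather(conn, rows):
    """Create the table if needed and insert (date, temperature, humidity) rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS weather ("
        "date TEXT PRIMARY KEY, temperature REAL, humidity REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO weather VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # use a file path for persistent storage
save_weather(conn, [("2018-04-01", 21.5, 0.62), ("2018-04-02", 23.0, 0.58)])
```

Using the date as the primary key with `INSERT OR REPLACE` keeps one row per day when the same day is captured again.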
A traditional web crawler first sets one or more entry URLs. While crawling web pages, it follows its crawling strategy to extract new URLs from the current page and put them in a queue, fetches the page content for each URL, saves the page content locally, and then extracts valid addresses as the next entry URLs until the crawl is complete. As the number of web pages grows sharply, a traditional web crawler downloads a large number of irrelevant pages. The second acquisition unit 302 instead uses the API interface opened by the weather information website to capture the weather data with a web crawler, which avoids downloading irrelevant pages and acquires the weather data efficiently, thereby improving the efficiency of disease prediction.
Embodiment 5
FIG. 5 is a schematic diagram of a computer device provided in Embodiment 5 of the present application. The computer device 1 includes a memory 20, a processor 30, and a computer program 40, such as a disease prediction program, that is stored in the memory 20 and can run on the processor 30. When the processor 30 executes the computer program 40, it implements the steps of the disease prediction method embodiments above, for example steps 101-107 shown in FIG. 1. Alternatively, when the processor 30 executes the computer program 40, it implements the functions of the modules/units in the apparatus embodiments above, for example units 301-307 in FIG. 3.
Illustratively, the computer program 40 may be divided into one or more modules/units, which are stored in the memory 20 and executed by the processor 30 to carry out the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, the instruction segments describing the execution of the computer program 40 in the computer device 1. For example, the computer program 40 may be divided into the first acquisition unit 301, the second acquisition unit 302, the third acquisition unit 303, the preprocessing unit 304, the construction unit 305, the optimization unit 306, and the prediction unit 307 in FIG. 3; for the specific functions of each unit, see Embodiment 3.
The processor 30 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor 30 may be any conventional processor. The processor 30 is the control center of the computer device 1 and connects the parts of the entire computer device 1 through various interfaces and lines.
The memory 20 may be used to store the computer program 40 and/or the modules/units. The processor 30 implements the various functions of the computer device 1 by running or executing the computer program and/or modules/units stored in the memory 20 and by invoking data stored in the memory 20. The memory 20 may mainly include a program storage area and a data storage area. The program storage area may store an operating system and the application programs required for at least one function (such as a sound playback function or an image playback function); the data storage area may store data created through use of the computer device 1 (such as audio data or a phone book). In addition, the memory 20 may include high-speed random access memory and may also include non-volatile memory, such as a hard disk, internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.
If the modules/units integrated in the computer device 1 are implemented as software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. On this understanding, all or part of the processes in the method embodiments above may also be carried out by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile readable storage medium and, when executed by a processor, implements the steps of each of the method embodiments above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The non-volatile readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like.
It should be noted that the content covered by the non-volatile readable medium may be expanded or restricted as required by legislation and patent practice in a given jurisdiction. For example, in some jurisdictions, according to legislation and patent practice, the non-volatile readable medium does not include electrical carrier signals and telecommunication signals.
In the several embodiments provided in the present application, it should be understood that the disclosed computer device and method may be implemented in other ways. For example, the computer device embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation.
In addition, the word "comprising" clearly does not exclude other units or steps, and the singular does not exclude the plural. A plurality of units or computer devices recited in a computer device claim may also be implemented by one and the same unit or computer device, in software or in hardware. Words such as "first" and "second" are used to denote names and do not denote any particular order.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present application and not to limit them. Although the present application has been described in detail with reference to preferred embodiments, a person of ordinary skill in the art should understand that the technical solutions of the present application may be modified or replaced with equivalents without departing from their spirit and scope.

Claims (20)

  1. A disease prediction method, characterized in that the method comprises:
    acquiring disease monitoring data, the disease monitoring data being time series data;
    acquiring weather data related to the disease monitoring data, the weather data being time series data corresponding to the disease monitoring data;
    acquiring public opinion data related to the disease monitoring data, the public opinion data being time series data corresponding to the disease monitoring data;
    preprocessing the disease monitoring data, the weather data, and the public opinion data;
    constructing a multi-layer gated recurrent unit neural network model, that is, a multi-layer GRU model;
    obtaining training data and validation data from the preprocessed disease monitoring data, weather data, and public opinion data, and training the multi-layer GRU model and verifying its performance using the training data and the validation data, to obtain an optimized multi-layer GRU model;
    obtaining, from the preprocessed disease monitoring data, weather data, and public opinion data, the disease monitoring data, weather data, and public opinion data preceding a prediction time point, and inputting the disease monitoring data, weather data, and public opinion data preceding the prediction time point into the optimized multi-layer GRU model, to obtain a disease prediction result for the prediction time point.
  2. The method according to claim 1, characterized in that capturing the weather data from web pages comprises:
    generating a seed URL and subsequent URLs targeting the API interface of a weather information website;
    sending an HTTP request to the API interface of the weather information website to request access to the API interface;
    analyzing and identifying the data content provided by the weather information website in order to view the data content;
    determining whether the data content is predetermined information content;
    if the data content is the predetermined information content, capturing the data content;
    saving the captured data content locally as the weather data.
  3. The method according to claim 1, characterized in that the public opinion data comprises:
    the number of searches for a specific word; or
    the number of items of public opinion information containing the specific word on a specific public opinion website.
  4. The method according to claim 1, characterized in that preprocessing the disease monitoring data, the weather data, and the public opinion data comprises:
    filling in missing values in the disease monitoring data, the weather data, and the public opinion data;
    correcting outliers in the disease monitoring data, the weather data, and the public opinion data;
    performing data format conversion on the disease monitoring data, the weather data, and the public opinion data.
  5. The method according to any one of claims 1 to 4, characterized in that the weather data comprises humidity, air temperature, air pressure, precipitation, water vapor pressure, wind speed, wind direction, and sunshine hours.
  6. The method according to any one of claims 1 to 4, characterized in that the multi-layer GRU model comprises two GRU unit layers and one fully connected layer, the first GRU unit layer being used to construct features from the input data to obtain first hidden layer units, the second GRU unit layer being used to combine the first hidden layer units to obtain second hidden layer units, and the fully connected layer being used to obtain a prediction result from the second hidden layer units, wherein each GRU unit layer comprises a reset gate and an update gate, and the reset gate and the update gate control the memory state of the GRU unit layer.
  7. The method according to any one of claims 1 to 4, characterized in that the loss function used in training the multi-layer GRU model is the mean squared error, and the algorithm used is the RMSprop algorithm.
  8. A disease prediction apparatus, characterized in that the apparatus comprises:
    a first acquisition unit, configured to acquire disease monitoring data, the disease monitoring data being time series data;
    a second acquisition unit, configured to acquire weather data related to the disease monitoring data, the weather data being time series data corresponding to the disease monitoring data;
    a third acquisition unit, configured to acquire public opinion data related to the disease monitoring data, the public opinion data being time series data corresponding to the disease monitoring data;
    a preprocessing unit, configured to preprocess the disease monitoring data, the weather data, and the public opinion data;
    a construction unit, configured to construct a multi-layer gated recurrent unit neural network model, that is, a multi-layer GRU model;
    an optimization unit, configured to obtain training data and validation data from the preprocessed disease monitoring data, weather data, and public opinion data, and to train the multi-layer GRU model and verify its performance using the training data and the validation data, to obtain an optimized multi-layer GRU model;
    a prediction unit, configured to obtain, from the preprocessed disease monitoring data, weather data, and public opinion data, the disease monitoring data, weather data, and public opinion data preceding a prediction time point, and to input the disease monitoring data, weather data, and public opinion data preceding the prediction time point into the optimized multi-layer GRU model, to obtain a disease prediction result for the prediction time point.
  9. A computer device, characterized in that the computer device comprises a memory and a processor, the memory being configured to store at least one computer-readable instruction, and the processor being configured to execute the at least one computer-readable instruction to implement the following steps:
    acquiring disease monitoring data, the disease monitoring data being time series data;
    acquiring weather data related to the disease monitoring data, the weather data being time series data corresponding to the disease monitoring data;
    acquiring public opinion data related to the disease monitoring data, the public opinion data being time series data corresponding to the disease monitoring data;
    preprocessing the disease monitoring data, the weather data, and the public opinion data;
    constructing a multi-layer gated recurrent unit neural network model, that is, a multi-layer GRU model;
    obtaining training data and validation data from the preprocessed disease monitoring data, weather data, and public opinion data, and training the multi-layer GRU model and verifying its performance using the training data and the validation data, to obtain an optimized multi-layer GRU model;
    obtaining, from the preprocessed disease monitoring data, weather data, and public opinion data, the disease monitoring data, weather data, and public opinion data preceding a prediction time point, and inputting the disease monitoring data, weather data, and public opinion data preceding the prediction time point into the optimized multi-layer GRU model, to obtain a disease prediction result for the prediction time point.
  10. The computer device according to claim 9, characterized in that capturing the weather data from web pages comprises:
    generating a seed URL and subsequent URLs targeting the API interface of a weather information website;
    sending an HTTP request to the API interface of the weather information website to request access to the API interface;
    analyzing and identifying the data content provided by the weather information website in order to view the data content;
    determining whether the data content is predetermined information content;
    if the data content is the predetermined information content, capturing the data content;
    saving the captured data content locally as the weather data.
  11. The computer device according to claim 9, characterized in that the public opinion data comprises:
    the number of searches for a specific word; or
    the number of items of public opinion information containing the specific word on a specific public opinion website.
  12. The computer device according to claim 9, characterized in that preprocessing the disease monitoring data, the weather data, and the public opinion data comprises:
    filling in missing values in the disease monitoring data, the weather data, and the public opinion data;
    correcting outliers in the disease monitoring data, the weather data, and the public opinion data;
    performing data format conversion on the disease monitoring data, the weather data, and the public opinion data.
  13. The computer device according to any one of claims 9 to 12, characterized in that the weather data comprises humidity, air temperature, air pressure, precipitation, water vapor pressure, wind speed, wind direction, and sunshine hours.
  14. The computer device according to any one of claims 9 to 12, characterized in that the multi-layer GRU model comprises two GRU unit layers and one fully connected layer, the first GRU unit layer being used to construct features from the input data to obtain first hidden layer units, the second GRU unit layer being used to combine the first hidden layer units to obtain second hidden layer units, and the fully connected layer being used to obtain a prediction result from the second hidden layer units, wherein each GRU unit layer comprises a reset gate and an update gate, and the reset gate and the update gate control the memory state of the GRU unit layer.
  15. A non-volatile readable storage medium, wherein the non-volatile readable storage medium stores at least one computer-readable instruction which, when executed by a processor, implements the following steps:
    acquiring disease monitoring data, the disease monitoring data being time series data;
    acquiring weather data related to the disease monitoring data, the weather data being time series data corresponding to the disease monitoring data;
    acquiring public opinion data related to the disease monitoring data, the public opinion data being time series data corresponding to the disease monitoring data;
    preprocessing the disease monitoring data, weather data, and public opinion data;
    constructing a multi-layer gated recurrent unit neural network model, that is, a multi-layer GRU model;
    obtaining training data and validation data from the preprocessed disease monitoring data, weather data, and public opinion data, and using the training data and the validation data to train the multi-layer GRU model and validate its performance, to obtain an optimized multi-layer GRU model;
    obtaining, from the preprocessed disease monitoring data, weather data, and public opinion data, the disease monitoring data, weather data, and public opinion data preceding a prediction time point, and inputting them into the optimized multi-layer GRU model to obtain a disease prediction result for the prediction time point.
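The last two steps of this claim (deriving training and validation data, then feeding the data preceding the prediction time point into the model) might be prepared along the following lines. This is a minimal sketch under stated assumptions: the claim does not specify a window length or split ratio, so `lookback`, `train_fraction`, and all function names are hypothetical, and each row is assumed to hold the disease value first, followed by weather and public opinion values for the same time step.

```python
from typing import List, Sequence, Tuple

# One aligned time step: (disease value, weather values..., public opinion values...).
Row = Tuple[float, ...]
Sample = Tuple[List[Row], float]


def make_windows(rows: Sequence[Row], lookback: int) -> List[Sample]:
    """Pair each run of `lookback` consecutive rows with the disease value that follows it."""
    samples = []
    for t in range(lookback, len(rows)):
        history = list(rows[t - lookback:t])
        target = rows[t][0]  # disease value at the time step being predicted
        samples.append((history, target))
    return samples


def chronological_split(samples: List[Sample], train_fraction: float = 0.8):
    """Split without shuffling so validation samples are later in time than training samples."""
    cut = int(len(samples) * train_fraction)
    return samples[:cut], samples[cut:]


def prediction_input(rows: Sequence[Row], lookback: int) -> List[Row]:
    """Return the most recent `lookback` rows, i.e. the data preceding the prediction time point."""
    return list(rows[-lookback:])
```

A chronological split is used here, rather than a random one, so that validation never sees data older than the training set; the source does not say which approach the applicant used.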
  16. The storage medium according to claim 15, wherein crawling the weather data from web pages comprises:
    generating a seed URL and subsequent URLs directed at the API interface of a weather information website;
    sending an HTTP request to the API interface of the weather information website to request access to the API interface;
    analyzing and identifying the data content provided by the weather information website in order to inspect the data content;
    determining whether the data content is predetermined information content;
    if the data content is the predetermined information content, crawling the data content;
    saving the crawled data content locally as the weather data.
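The crawl steps listed in this claim could be sketched as follows. Everything concrete here is an assumption for illustration: the endpoint `https://weather.example/api/daily`, the one-URL-per-day scheme, the JSON response format, and the required field names are hypothetical and do not come from the source. The HTTP call is passed in as a parameter so the flow can be exercised without network access.

```python
import json
from datetime import date, timedelta
from typing import Callable, Iterator, Optional
from urllib.request import urlopen

BASE_URL = "https://weather.example/api/daily"  # hypothetical API endpoint
REQUIRED_FIELDS = {"date", "temperature", "humidity"}  # assumed "predetermined content"


def generate_urls(start: date, days: int) -> Iterator[str]:
    """Yield the seed URL for `start`, then one subsequent URL per following day."""
    for offset in range(days):
        day = start + timedelta(days=offset)
        yield f"{BASE_URL}?date={day.isoformat()}"


def http_get(url: str) -> str:
    """Send an HTTP GET request and return the response body as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8")


def parse_expected(body: str) -> Optional[dict]:
    """Return the record if it is the content we are looking for, otherwise None."""
    try:
        record = json.loads(body)
    except json.JSONDecodeError:
        return None
    if isinstance(record, dict) and REQUIRED_FIELDS.issubset(record):
        return record
    return None


def crawl(start: date, days: int, out_path: str,
          fetch: Callable[[str], str] = http_get) -> int:
    """Fetch each URL, keep only expected records, and save them locally as JSON lines."""
    saved = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for url in generate_urls(start, days):
            record = parse_expected(fetch(url))
            if record is not None:
                out.write(json.dumps(record) + "\n")
                saved += 1
    return saved
```

A real crawler would also need error handling for failed requests and rate limiting, which the claim does not address.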
  17. The storage medium according to claim 15, wherein the public opinion data comprises:
    the number of searches for a specific word; or
    the number of public opinion items containing a specific word on a specific public opinion website.
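The second alternative in this claim, counting items that contain a specific word, yields a time series once the counts are grouped by date. A minimal sketch follows, assuming the items have already been collected as (date, text) pairs; that input shape and the function name `daily_mentions` are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def daily_mentions(items: Iterable[Tuple[str, str]], keyword: str) -> Dict[str, int]:
    """Count, per date, the items whose text contains `keyword` (case-insensitive)."""
    needle = keyword.lower()
    counts: Counter = Counter()
    for day, text in items:
        if needle in text.lower():
            counts[day] += 1
    return dict(counts)
```

Dates with no matching items are absent from the result, so they would need to be filled with zero before alignment with the disease monitoring series.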
  18. The storage medium according to claim 15, wherein preprocessing the disease monitoring data, weather data, and public opinion data comprises:
    filling in missing values in the disease monitoring data, weather data, and public opinion data;
    correcting outliers in the disease monitoring data, weather data, and public opinion data;
    converting the data format of the disease monitoring data, weather data, and public opinion data.
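The three preprocessing steps above could look like the following for a single series. The claim does not say how missing values are filled or how outliers are detected, so the choices here are assumptions: raw values arrive as strings, gaps are filled with the previous known value, and a value is treated as an outlier when it lies more than `k` median absolute deviations from the median.

```python
from statistics import median
from typing import List, Optional, Sequence


def to_floats(raw: Sequence[Optional[str]]) -> List[Optional[float]]:
    """Format conversion: parse text into floats, marking unparseable entries as None."""
    parsed: List[Optional[float]] = []
    for item in raw:
        try:
            parsed.append(float(item))
        except (TypeError, ValueError):
            parsed.append(None)
    return parsed


def fill_missing(values: Sequence[Optional[float]]) -> List[float]:
    """Replace each None with the previous known value (leading gaps use the first known value)."""
    known = [v for v in values if v is not None]
    if not known:
        raise ValueError("series contains no known values")
    last = known[0]
    filled = []
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def correct_outliers(values: Sequence[float], k: float = 5.0) -> List[float]:
    """Replace values more than k median absolute deviations from the median with the median."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return list(values)
    return [med if abs(v - med) > k * mad else v for v in values]
```

For example, `correct_outliers(fill_missing(to_floats(["12", "", "14", "900", "13", "n/a", "15"])))` fills the two gaps and replaces 900 with the median.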
  19. The storage medium according to any one of claims 15 to 18, wherein the weather data comprises humidity, air temperature, air pressure, precipitation, water vapor pressure, wind speed, wind direction, and sunshine hours.
  20. The storage medium according to any one of claims 15 to 18, wherein the multi-layer GRU model comprises two GRU unit layers and one fully connected layer; the first GRU unit layer is configured to construct features from the input data to obtain first hidden layer units; the second GRU unit layer is configured to combine the first hidden layer units to obtain second hidden layer units; the fully connected layer is configured to obtain a prediction result from the second hidden layer units; and each GRU unit layer comprises a reset gate and an update gate, the reset gate and the update gate controlling the memory state of the GRU unit layer.
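To make the layer arrangement in this claim concrete, here is a forward pass through two stacked GRU layers and a fully connected output, written in plain Python. It is a sketch only: the weights are random and untrained, the hidden size and class names are hypothetical, and the gate equations follow the common textbook GRU form, which the claims do not spell out.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def affine(w: Matrix, x: Vector, u: Matrix, h: Vector, b: Vector) -> Vector:
    """Return W x + U h + b, computed row by row."""
    return [
        sum(wi * xi for wi, xi in zip(w_row, x))
        + sum(ui * hi for ui, hi in zip(u_row, h))
        + bi
        for w_row, u_row, bi in zip(w, u, b)
    ]


class GRULayer:
    def __init__(self, input_size: int, hidden_size: int, rng: random.Random):
        def mat(rows: int, cols: int) -> Matrix:
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

        self.hidden_size = hidden_size
        # One (W, U, b) triple per gate: update "z", reset "r", candidate "c".
        self.params = {
            gate: (mat(hidden_size, input_size), mat(hidden_size, hidden_size),
                   [0.0] * hidden_size)
            for gate in "zrc"
        }

    def step(self, x: Vector, h: Vector) -> Vector:
        wz, uz, bz = self.params["z"]
        wr, ur, br = self.params["r"]
        wc, uc, bc = self.params["c"]
        z = [sigmoid(v) for v in affine(wz, x, uz, h, bz)]  # update gate
        r = [sigmoid(v) for v in affine(wr, x, ur, h, br)]  # reset gate
        gated = [ri * hi for ri, hi in zip(r, h)]  # reset gate scales the old state
        cand = [math.tanh(v) for v in affine(wc, x, uc, gated, bc)]
        # Update gate blends the old state with the candidate state.
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

    def run(self, sequence: List[Vector]) -> List[Vector]:
        """Return the hidden state after every time step, starting from zeros."""
        h = [0.0] * self.hidden_size
        states = []
        for x in sequence:
            h = self.step(x, h)
            states.append(h)
        return states


class TwoLayerGRU:
    def __init__(self, input_size: int, hidden_size: int, seed: int = 0):
        rng = random.Random(seed)
        self.layer1 = GRULayer(input_size, hidden_size, rng)
        self.layer2 = GRULayer(hidden_size, hidden_size, rng)
        self.fc_weights = [rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
        self.fc_bias = 0.0

    def predict(self, sequence: List[Vector]) -> float:
        first_hidden = self.layer1.run(sequence)       # features built from the input
        second_hidden = self.layer2.run(first_hidden)  # combination of first-layer states
        last = second_hidden[-1]
        return sum(w * v for w, v in zip(self.fc_weights, last)) + self.fc_bias
```

In practice a deep learning framework would supply the GRU layers and the training loop; the pure-Python version is used here only so the data flow between the two GRU layers and the fully connected layer is visible.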
PCT/CN2018/099612 2018-04-11 2018-08-09 Disease prediction method and device, computer device and readable storage medium WO2019196280A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810322431.8A CN108288502A (en) 2018-04-11 2018-04-11 Disease prediction method and device, computer device and readable storage medium
CN201810322431.8 2018-04-11

Publications (1)

Publication Number Publication Date
WO2019196280A1 (en) 2019-10-17

Family

ID=62834473

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/099612 WO2019196280A1 (en) 2018-04-11 2018-08-09 Disease prediction method and device, computer device and readable storage medium

Country Status (2)

Country Link
CN (1) CN108288502A (en)
WO (1) WO2019196280A1 (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108648829A (en) * 2018-04-11 2018-10-12 平安科技(深圳)有限公司 Disease prediction method and device, computer device and readable storage medium
CN108288502A (en) * 2018-04-11 2018-07-17 平安科技(深圳)有限公司 Disease prediction method and device, computer device and readable storage medium
CN109119159B (en) * 2018-08-20 2022-04-15 北京理工大学 Deep learning medical diagnosis system based on rapid weight mechanism
CN110895926A (en) * 2018-09-12 2020-03-20 普天信息技术有限公司 Voice recognition method and device
CN109558128A (en) * 2018-10-25 2019-04-02 平安科技(深圳)有限公司 Json data analysis method, device and computer readable storage medium
CN109545386B (en) * 2018-11-02 2021-07-20 深圳先进技术研究院 Influenza spatiotemporal prediction method and device based on deep learning
CN109473180A (en) * 2018-11-20 2019-03-15 河南省疾病预防控制中心 Disease control agency information system based on B/S architecture
CN110162398B (en) * 2019-04-11 2024-05-03 平安科技(深圳)有限公司 Scheduling method and device of disease analysis model and terminal equipment
CN110211690A (en) * 2019-04-19 2019-09-06 平安科技(深圳)有限公司 Disease risks prediction technique, device, computer equipment and computer storage medium
CN110379522B (en) * 2019-07-23 2022-08-12 四川骏逸富顿科技有限公司 Disease prevalence trend prediction system and method
CN110610767B (en) * 2019-08-01 2023-06-02 平安科技(深圳)有限公司 Morbidity monitoring method, device, equipment and storage medium
CN110675959B (en) * 2019-08-19 2023-07-07 平安科技(深圳)有限公司 Intelligent data analysis method and device, computer equipment and storage medium
CN110767279A (en) * 2019-10-21 2020-02-07 山东师范大学 Electronic health record missing data completion method and system based on LSTM
CN110993114B (en) * 2019-11-26 2023-06-27 泰康保险集团股份有限公司 Medical data analysis method and device, storage device and electronic equipment
CN111430040A (en) * 2020-03-03 2020-07-17 广东省公共卫生研究院 Hand-foot-and-mouth disease epidemic situation prediction method based on case, weather and pathogen monitoring data
CN111696674B (en) * 2020-06-12 2023-09-08 电子科技大学 Deep learning method and system for electronic medical records

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090157422A1 (en) * 2007-12-17 2009-06-18 Boris Gorbis Internet-based balascopy-information exchange system
CN106327240A (en) * 2016-08-11 2017-01-11 中国船舶重工集团公司第七0九研究所 Recommendation method and recommendation system based on GRU neural network
CN106897404A (en) * 2017-02-14 2017-06-27 中国船舶重工集团公司第七0九研究所 Recommendation method and system based on a multi-layer GRU neural network
CN107563122A (en) * 2017-09-20 2018-01-09 长沙学院 Crime prediction method based on an interleaved time series locally connected recurrent neural network
CN107590168A (en) * 2016-07-08 2018-01-16 百度(美国)有限责任公司 System and method for relation inference
CN108288502A (en) * 2018-04-11 2018-07-17 平安科技(深圳)有限公司 Disease prediction method and device, computer device and readable storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005063456A (en) * 2004-10-07 2005-03-10 Mitsui Sumitomo Insurance Co Ltd Disease symptoms prediction server, disease symptoms predicting system, disease symptoms prediction method, and program
JP2013092929A (en) * 2011-10-26 2013-05-16 Seiko Epson Corp Disease prediction system and disease prediction method
CN105678080A (en) * 2016-01-11 2016-06-15 浪潮集团有限公司 Method for predicting influenza outbreak possibility through big data search and analysis
CN105812463A (en) * 2016-03-10 2016-07-27 深圳市前海安测信息技术有限公司 Disease early warning system and method based on medical big data
KR101846951B1 (en) * 2017-02-22 2018-04-09 주식회사 씨씨앤아이리서치 An application for predicting an acute exacerbation of chronic obstructive pulmonary disease

Also Published As

Publication number Publication date
CN108288502A (en) 2018-07-17

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18914128

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 19/01/2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18914128

Country of ref document: EP

Kind code of ref document: A1