WO2019200786A1 - Method for forecasting public sentiment data, device, terminal, and storage medium - Google Patents

Method for forecasting public sentiment data, device, terminal, and storage medium

Info

Publication number
WO2019200786A1
Authority
WO
WIPO (PCT)
Prior art keywords
disease
data
factor
keyword
data source
Prior art date
Application number
PCT/CN2018/100229
Other languages
French (fr)
Chinese (zh)
Inventor
阮晓雯
徐亮
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2019200786A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/80 ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics, e.g. flu

Definitions

  • the present application relates to the field of data prediction technologies, and in particular, to a method, device, terminal, and storage medium for predicting public opinion data.
  • Web crawling technology is used to crawl public opinion data about a disease, but the crawling method is relatively simplistic. Secondly, the crawled data is not checked effectively or in a timely manner. In addition, the same data cleaning and filling method is applied to data with different distributions, so the data processing results are poor.
  • a first aspect of the present application provides a method for predicting public opinion data, the method comprising:
  • Derived variables of the public opinion factors of the disease are calculated based on the new disease data, and the disease is predicted based on the derived variables.
  • a second aspect of the present application provides a public opinion data prediction apparatus, the apparatus comprising:
  • a receiving module configured to receive at least one keyword of a disease input by the user
  • a crawling module configured to determine a data source related to the keyword in the Internet, and crawling the disease data related to the keyword from the data source by using a crawler program;
  • a parsing module configured to parse the disease data to obtain a public opinion factor of the disease;
  • a cleaning module configured to perform data cleaning and abnormal value processing on the public opinion factor of the disease;
  • a prediction module configured to calculate a derived variable of the public opinion factor of the disease according to the new disease data, and to predict the disease according to the derived variable.
  • a third aspect of the present application provides a terminal, the terminal including a processor and a memory, the processor implementing the method for predicting public opinion data when the computer readable instructions stored in the memory are executed.
  • a fourth aspect of the present application provides a non-volatile readable storage medium having stored thereon computer readable instructions that, when executed by a processor, implement the method for predicting public opinion data.
  • With the method, device, terminal and storage medium for predicting public opinion data, different crawler programs are set for different types of data sources, and a multi-threaded crawler program crawls the disease data related to the input keywords from the corresponding data sources.
  • Parallel crawling speeds up crawling, the data format of the crawled disease data is relatively uniform, and crawling difficulties or failures to parse the crawled data caused by the storage formats or other properties of different data sources can be avoided.
  • The public opinion factors of the disease are collated, analyzed in depth and calculated; after the crawled disease data is refined in this way, it is presented as charts or tables, so the results are clearer and problems are easier to analyze intuitively.
  • A number of variables are derived from the public opinion factors of the disease, which adds data indicators and provides a reference for disease prediction, so that prediction no longer relies blindly on experience and the prediction results are more accurate.
  • FIG. 1 is a flowchart of a method for predicting public opinion data provided in Embodiment 1 of the present application.
  • FIG. 2 is a flowchart of a method for predicting public opinion data provided in Embodiment 2 of the present application.
  • FIG. 3 is a structural diagram of a public opinion data prediction apparatus according to Embodiment 3 of the present application.
  • FIG. 4 is a structural diagram of a public opinion data prediction apparatus provided in Embodiment 4 of the present application.
  • FIG. 5 is a structural diagram of a terminal provided in Embodiment 5 of the present application.
  • the public opinion data prediction method of the embodiment of the present application is applied to one or more terminals.
  • the method of predicting data can also be applied to a hardware environment composed of a terminal and a server connected to the terminal through a network.
  • Networks include, but are not limited to, wide area networks, metropolitan area networks, or local area networks.
  • the public opinion data prediction method in the embodiment of the present application may be executed by a server or may be performed by a terminal; or may be performed by a server and a terminal together.
  • the public opinion data prediction function provided by the method of the present application may be directly integrated on the terminal, or the client for implementing the method of the present application may be installed.
  • The method provided by the present application may also run on a server or similar device in the form of a software development kit (SDK), providing an interface for the public opinion data prediction function in the form of the SDK, terminal or other
  • FIG. 1 is a flowchart of a method for predicting public opinion data provided in Embodiment 1 of the present application.
  • the order of execution in the flowchart can be changed according to different requirements, and some steps can be omitted.
  • Step 11: Receive at least one keyword of a disease input by the user.
  • the keyword is a word related to the symptoms of the disease, for example, when the disease is a cold, the keywords may include: sneezing, runny nose, stuffy nose, headache, dizziness, cough, innocence, sore throat, and the like.
  • When the disease is hand, foot and mouth disease, the keywords may include: mouth pain, loss of appetite, low-grade fever, blisters on the hands, small mouth ulcers, and the like.
  • the keyword may be a symptom of a disease obtained by the user according to his or her own experience, or may be a symptom of a disease collected from a disease expert.
  • the terminal presets a function for the user to input a keyword of the disease.
  • the terminal provides a text input box through which the user can input at least one keyword.
  • the terminal provides a function of a voice assistant, and the user can input at least one keyword through the voice assistant.
  • Step 12: Determine a data source related to the keyword in the Internet, and use a crawler program to crawl disease data related to the keyword from the data source.
  • The data sources related to the keywords in the Internet may include, but are not limited to, Baidu, Google, Tencent, Weibo, Hot Search, and any website that supports user search access.
  • The disease data related to the keyword that the crawler program crawls from the various data sources may include: Baidu Index, Google Trends, Tencent Analysis, news information, advertisement data, channel data, Weibo popularity, forum public opinion information, and the like.
  • The user determines a Uniform Resource Locator (URL) of the data source in the Internet, and the crawler crawls the disease data related to the keyword according to the URL.
  • Step 13: Parse the disease data to obtain a public opinion factor of the disease.
  • The parsing work is a public opinion analysis of the disease data, including text processing, text analysis, word frequency statistics, correlation analysis, and the like, which yields the public opinion factors of the disease.
  • The public opinion factor of the disease may include a plurality of public opinion sub-factors, for example, a first sub-factor, a second sub-factor, a third sub-factor, a fourth sub-factor, and the like.
  • The first sub-factor may be a headache, the second sub-factor may be a runny nose, the third sub-factor may be a fever, and the fourth sub-factor may be a cough.
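  • The word frequency statistics mentioned above could be sketched roughly as follows. This is only an illustration: the function name `count_keyword_mentions`, the sample posts, and the plain substring matching are assumptions, not details given in the application.

```python
from collections import Counter

def count_keyword_mentions(posts, keywords):
    """Count how often each symptom keyword occurs across crawled posts.

    Hypothetical helper: simple case-insensitive substring counting stands in
    for whatever text processing the real system performs.
    """
    counts = Counter({kw: 0 for kw in keywords})
    for post in posts:
        text = post.lower()
        for kw in keywords:
            counts[kw] += text.count(kw.lower())
    return dict(counts)

# Made-up sample posts, used only to exercise the function.
posts = [
    "Woke up with a headache and a runny nose",
    "Cough all night, headache again",
    "Fever since Monday",
]
factors = count_keyword_mentions(posts, ["headache", "runny nose", "fever", "cough"])
print(factors)
```

  • Each count here plays the role of one public opinion sub-factor (headache, runny nose, fever, cough) in the later steps.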
  • Step 14: Perform data cleaning and abnormal value processing on the public opinion factor of the disease.
  • Data cleaning and abnormal value processing remove redundant data from the public opinion factors of the disease and yield disease data in a consistent, standard format, so that the cleaned and processed public opinion factors are usable and better suited to subsequent analysis.
  • The data cleaning of the public opinion factor of the disease comprises: performing data cleaning on the public opinion factor according to its type.
  • The types of public opinion factors of the disease include, but are not limited to: factors containing noise, non-compliant factors, factors containing repeated information, factors with unbalanced data, inconsistent factors, incomplete factors, and the like.
  • For factors containing noise, the data is cleaned by removing extremely large values and negative points; for non-compliant factors, by removing abnormal values; for factors containing repeated information, by deleting duplicates; for factors with unbalanced data, by data denoising; for inconsistent factors, by classifying them according to data type; and for incomplete factors, by establishing reference values from the relevant standards.
  • The abnormal value processing of the public opinion factor of the disease comprises: replacing missing values of the public opinion factor according to its distribution.
  • The distribution of the public opinion factors of the disease includes, but is not limited to, a stable type and a volatile type.
  • A stably distributed public opinion factor is one whose trend is relatively steady, for example, 50, 53, 52, 49, 51, and the like.
  • A volatile public opinion factor is one whose trend changes sharply and over a wide range, for example, 50, 100, 43, 89, 4, and the like.
  • The K-nearest-neighbor method can be used: the K samples nearest to the sample with the missing public opinion factor are determined according to Euclidean distance or correlation analysis, and the weighted average of the public opinion factor values of those K samples is used to estimate the missing data of the sample. For a stably distributed public opinion factor, a predictive model can also be used to predict each missing value; if the missing public opinion factor is numeric, the mean can be used to fill it, and if it is non-numeric, the mode can be used to fill it.
  • A mean method can be used to replace the missing public opinion factor.
  • Replacing missing public opinion factors with the mean assumes that values are missing completely at random, and it makes the variance and standard deviation of the public opinion factor smaller. The method may therefore further include:
  • multiplying the public opinion factor obtained by mean substitution by a preset expansion coefficient, and taking the resulting new public opinion factor as the final public opinion factor of the disease.
  • The expansion coefficient is set in advance and is greater than 1.
  • The abnormal value processing of the public opinion factor of the disease may further comprise: directly discarding abnormal public opinion factors. Discarding them ensures that the crawled public opinion factors are clean and avoids interference when the public opinion factors of the disease are analyzed.
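  • As a rough illustration of the mean and mode filling described in this step, the sketch below replaces missing entries (represented here as `None`, which is an assumption) with the mean for numeric data and the mode for non-numeric data. The function name `fill_missing` is hypothetical, and the K-nearest-neighbor and expansion-coefficient variants are deliberately left out.

```python
from statistics import mean, mode

def fill_missing(values):
    """Fill None entries with the mean (numeric data) or the mode (other data).

    Hypothetical helper; None marks a missing value in this sketch.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        # Nothing to estimate from, so return the data unchanged.
        return list(values)
    if all(isinstance(v, (int, float)) for v in observed):
        fill = mean(observed)   # numeric factor: use the mean
    else:
        fill = mode(observed)   # non-numeric factor: use the most common value
    return [fill if v is None else v for v in values]

numeric_filled = fill_missing([50, 53, None, 49, 51])
label_filled = fill_missing(["high", None, "low", "high"])
print(numeric_filled)
print(label_filled)
```

  • The numeric sample reuses the stable series from the text (50, 53, 52, 49, 51) with one value removed.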
  • Step 15: Standardize the public opinion factors of the disease after the data cleaning and abnormal value processing to obtain new disease data.
  • Standardization converts the cleaned and processed public opinion factors of the disease into dimensionless pure values, so that indicators with different units or magnitudes can be compared and weighted.
  • The method for data standardization includes, but is not limited to, sum standardization, standard deviation standardization, maximum value standardization, range standardization, and the like. Range standardization is preferred: after it, the maximum of the new data is 1, the minimum is 0, and the remaining values lie between 0 and 1.
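  • Range standardization is commonly computed as (x - min) / (max - min). The sketch below assumes that formula; the function name `range_normalize` and the handling of a constant series are illustrative choices, not taken from the application.

```python
def range_normalize(values):
    """Scale values to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant series: the range is zero, so map everything to 0.0 (an assumption).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# The volatile example series from the text.
normalized = range_normalize([50, 100, 43, 89, 4])
print(normalized)
```

  • The largest input maps to 1 and the smallest to 0, which matches the property stated above.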
  • Step 16: Calculate a derived variable of the public opinion factor of the disease according to the new disease data, and predict the disease according to the derived variable.
  • The derived variables include: maximum, minimum, mean, variance, standard deviation, covariance, range (maximum minus minimum), median, mode, and quartiles.
  • The mean, median, mode and quartiles describe the degree of concentration of the public opinion factors of the disease; the greater the concentration, the more severe the disease is predicted to be.
  • The variance and standard deviation characterize the degree of dispersion of the public opinion factors of the disease; the smaller the dispersion, the more severe the disease is predicted to be.
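  • Most of the derived variables listed above can be computed with Python's standard `statistics` module, as in the sketch below. The function name `derived_variables` is hypothetical, the population forms of variance and standard deviation are an assumption (the text does not say which form is meant), and covariance is omitted because it needs a second series.

```python
import statistics

def derived_variables(values):
    """Compute summary statistics for one standardized public opinion factor."""
    q1, q2, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    return {
        "max": max(values),
        "min": min(values),
        "mean": statistics.mean(values),
        "variance": statistics.pvariance(values),   # population variance (assumed)
        "std_dev": statistics.pstdev(values),       # population standard deviation
        "range": max(values) - min(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "quartiles": (q1, q2, q3),
    }

# Made-up standardized values, used only to exercise the function.
summary = derived_variables([0.0, 0.25, 0.5, 0.5, 1.0])
print(summary)
```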
  • The public opinion data prediction method receives at least one keyword of a disease input by a user, determines a data source related to the keyword in the Internet, and crawls the disease data from the data source by using a crawler program. The disease data is parsed to obtain the public opinion factor of the disease, data cleaning and abnormal value processing are then performed on the public opinion factors, and the cleaned and processed public opinion factors are standardized.
  • The crawler program crawls the disease data related to the input keywords, so that comparatively comprehensive public opinion factors related to the disease are obtained. The public opinion factors are then collated, analyzed in depth and calculated; refining the crawled disease data in this way moves from basic data display to decision-oriented data display and provides a reference for disease prediction, and the prediction result is accurate.
  • FIG. 2 is a flowchart of a method for predicting public opinion data provided in Embodiment 2 of the present application.
  • the order of execution in the flowchart can be changed according to different requirements, and some steps can be omitted.
  • Step 21: Receive at least one keyword of a disease input by a user.
  • Step 21 in this embodiment is the same as step 11 in the first embodiment, and details are not described herein again.
  • Step 22: Determine a data source related to the keyword in the Internet, and classify the data source according to the type of the data source.
  • The data sources related to the keyword may be classified into two categories according to type: the first is index-type data sources, and the second is public-opinion-type data sources.
  • The index-type data sources include, but are not limited to, Baidu, Google, 360, and the like.
  • The public-opinion-type data sources include, but are not limited to, Weibo, forums, WeChat, hot search, and the like.
  • Step 23: According to the number of categories obtained by classifying the data sources, set a multi-threaded crawler program whose number of threads equals the number of categories.
  • Setting a different crawler program for each type of data source makes it easier to crawl data from sources of that category, and avoids crawling difficulties, or failures to parse the crawled data, caused by differing storage formats or other properties of the data sources.
  • For the two categories above, a corresponding dual-thread crawler is set.
  • Baidu and Weibo are two different types of data sources, each with its own text storage format; the first crawler program is set to crawl the disease data related to the keyword in Baidu, and the second crawler program is set to crawl the disease data related to the keyword in Weibo.
  • the data source related to the keyword in the Internet may be subdivided into a plurality of categories according to actual needs, and corresponding crawling programs are respectively set for each category of data sources.
  • Step 24: Crawl the disease data related to the keyword from the corresponding data sources by using the multi-threaded crawler program.
  • the URL of the data source corresponding to the crawler program is placed in the crawl queue, and the multi-threaded crawler crawls the disease data related to the keyword from the data source in parallel.
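  • A minimal sketch of the crawl queue and parallel crawling might look like the following. The crawler functions are placeholders that only return a record (a real crawler would fetch and parse the page), and the URLs, the names `crawl_index_source`, `crawl_opinion_source`, `CRAWLERS` and `crawl_all`, and the use of `ThreadPoolExecutor` are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def crawl_index_source(url, keyword):
    """Placeholder crawler for an index-type source (no network access here)."""
    return {"source_type": "index", "url": url, "keyword": keyword}

def crawl_opinion_source(url, keyword):
    """Placeholder crawler for a public-opinion-type source (no network access here)."""
    return {"source_type": "public_opinion", "url": url, "keyword": keyword}

# One crawler program per data source category, as in Step 23.
CRAWLERS = {"index": crawl_index_source, "public_opinion": crawl_opinion_source}

def crawl_all(crawl_queue, keyword):
    """Run the category-specific crawlers in parallel, one thread per category."""
    with ThreadPoolExecutor(max_workers=len(CRAWLERS)) as pool:
        futures = [pool.submit(CRAWLERS[kind], url, keyword)
                   for kind, url in crawl_queue]
        # Collect results in queue order.
        return [f.result() for f in futures]

# Hypothetical queue entries: (category, URL).
crawl_queue = [
    ("index", "https://index.example.com"),
    ("public_opinion", "https://forum.example.com"),
]
results = crawl_all(crawl_queue, "cough")
print(results)
```

  • Keeping a separate function per category reflects the idea that each type of source has its own storage format and therefore its own parsing logic.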
  • Step 25: Parse the disease data to obtain a public opinion factor of the disease.
  • Step 26: Perform data cleaning and abnormal value processing on the public opinion factor of the disease.
  • Step 27: Standardize the public opinion factors of the disease after the data cleaning and abnormal value processing to obtain new disease data.
  • Steps 25 to 27 in this embodiment correspond to steps 13 to 15 in the first embodiment, respectively, and details are not described herein again.
  • Step 28: Calculate a derived variable of the public opinion factor of the disease according to the new disease data, and create a chart from the calculated derived variable for visual display.
  • Step 24 may further include: classifying and storing the crawled disease data.
  • The disease data is stored in a local database, in a storage server, or in the cloud.
  • The disease data crawled from Baidu is stored in a first storage location in the local database, and the disease data crawled from Weibo is stored in a second storage location in the local database.
  • The first storage location and the second storage location may be located in the same root directory of the local database, or in different root directories.
  • The first storage location and the second storage location may also be distinguished by different names.
  • the data collected from different data sources is classified and stored, which is convenient for analyzing data of the same data source.
  • The method may further include: crawling the disease data related to the keyword from the data source by using the crawler program during a preset crawl time period.
  • The crawl time period is set in advance.
  • For example, the preset crawl time period is from 24:00 to 03:00 each night, when the servers of the data sources generally receive few visits. Crawling then does not put heavy access pressure on those servers, which helps them run smoothly and can improve crawling efficiency.
  • The method further comprises: quantifying each public opinion sub-factor of the disease to obtain its weight, and determining the sub-factors whose weight is greater than a preset weight threshold as the public opinion factors of the disease.
  • Specifically, the weight of each sub-factor is obtained by calculating the sum of the quantities of all sub-factors of the disease and then calculating the percentage of that sum accounted for by each sub-factor; this percentage is the weight of the corresponding sub-factor.
  • The weight threshold is set in advance.
  • Determining only the sub-factors whose weight exceeds the threshold as public opinion factors of the disease effectively filters out the sub-factors with smaller weights. This reduces the amount of computation and shortens the disease prediction time, and the sub-factors with smaller weights do not affect the outcome of disease prediction.
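  • The weighting and filtering just described can be sketched as follows. The function name `select_factors`, the sample counts, and the threshold of 0.10 are illustrative assumptions; weights are expressed as fractions of the total rather than percentages.

```python
def select_factors(counts, weight_threshold):
    """Weight each sub-factor by its share of the total and keep the heavy ones."""
    total = sum(counts.values())
    if total == 0:
        # No observations at all: nothing to weight or keep.
        return {}, {}
    weights = {name: n / total for name, n in counts.items()}
    kept = {name: w for name, w in weights.items() if w > weight_threshold}
    return weights, kept

# Made-up sub-factor quantities.
counts = {"headache": 40, "runny nose": 30, "fever": 25, "rash": 5}
weights, kept = select_factors(counts, 0.10)
print(weights)
print(kept)
```

  • With these sample numbers, the sub-factor that accounts for only 5% of the total falls below the threshold and is filtered out.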
  • The method for predicting public opinion data receives at least one keyword of a disease input by a user, determines the data sources related to the keyword in the Internet, and classifies the data sources according to their type. According to the number of categories obtained, a multi-threaded crawler program with the same number of threads is set, and the disease data related to the keyword is crawled from the corresponding data sources by the multi-threaded crawler program. Data cleaning and abnormal value processing are then performed on the public opinion factors of the disease, the cleaned and processed factors are standardized to obtain new disease data, the derived variables of the public opinion factors are calculated from the new disease data, and the calculated derived variables are displayed in charts to visualize the disease.
  • The multi-threaded crawler crawls the disease data related to the input keywords from the corresponding data sources, and parallel crawling speeds up crawling.
  • The data format of the crawled disease data is relatively uniform, and crawling difficulties or failures to parse the crawled data caused by the storage formats or other properties of different data sources can be avoided.
  • The public opinion factors of the disease are collated, analyzed in depth and calculated. After the crawled disease data is refined, it is presented as charts or tables, so the results are clearer and problems are easier to analyze intuitively; this provides a reference for disease prediction, and the prediction results are accurate.
  • FIG. 3 is a functional block diagram of a public opinion data prediction apparatus according to Embodiment 3 of the present application.
  • the public opinion data prediction device 30 operates in a terminal.
  • the public opinion data predicting device 30 can include a plurality of functional modules consisting of program code segments.
  • The program code for each of the program segments in the public opinion data prediction device 30 can be stored in a memory and executed by at least one processor to perform the prediction of public opinion data (see FIG. 1 and its associated description).
  • The public opinion data prediction device 30 of the terminal may be divided into a plurality of functional modules according to the functions it performs.
  • The functional modules may include: a receiving module 301, a crawling module 302, a parsing module 303, a cleaning module 304, an expansion module 305, a standardization module 306, and a prediction module 307.
  • a module as referred to in this application refers to a series of computer readable instruction segments that are executable by at least one processor and capable of performing a fixed function, which are stored in the memory. In some embodiments, the functionality of each module will be detailed in subsequent embodiments.
  • the receiving module 301 is configured to receive at least one keyword of a disease input by the user.
  • the keyword is a word related to the symptoms of the disease, for example, when the disease is a cold, the keywords may include: sneezing, runny nose, stuffy nose, headache, dizziness, cough, innocence, sore throat, and the like.
  • When the disease is hand, foot and mouth disease, the keywords may include: mouth pain, loss of appetite, low-grade fever, blisters on the hands, small mouth ulcers, and the like.
  • the keyword may be a symptom of a disease obtained by the user according to his or her own experience, or may be a symptom of a disease collected from a disease expert.
  • the terminal presets a function for the user to input a keyword of the disease.
  • the terminal provides a text input box through which the user can input at least one keyword.
  • the terminal provides a function of a voice assistant, and the user can input at least one keyword through the voice assistant.
  • the crawling module 302 is configured to determine a data source related to the keyword in the Internet, and crawl the disease data related to the keyword from the data source by using a crawler program.
  • the data sources related to the keywords in the Internet may include, but are not limited to, Baidu, Google, Tencent, Weibo, Hot Search, and any website that supports user search access.
  • The disease data related to the keyword that the crawler program crawls from the various data sources may include: Baidu Index, Google Trends, Tencent Analysis, news information, advertisement data, channel data, Weibo popularity, forum public opinion information, and the like.
  • The user determines a Uniform Resource Locator (URL) of the data source in the Internet, and the crawler crawls the disease data related to the keyword according to the URL.
  • The parsing module 303 is configured to parse the disease data to obtain a public opinion factor of the disease.
  • The parsing work is a public opinion analysis of the disease data, including text processing, text analysis, word frequency statistics, correlation analysis, and the like, which yields the public opinion factors of the disease.
  • The public opinion factor of the disease may include a plurality of public opinion sub-factors, for example, a first sub-factor, a second sub-factor, a third sub-factor, a fourth sub-factor, and the like.
  • The first sub-factor may be a headache, the second sub-factor may be a runny nose, the third sub-factor may be a fever, and the fourth sub-factor may be a cough.
  • The cleaning module 304 is configured to perform data cleaning and abnormal value processing on the public opinion factor of the disease.
  • Data cleaning and abnormal value processing remove redundant data from the public opinion factors of the disease and yield disease data in a consistent, standard format, so that the cleaned and processed public opinion factors are usable and better suited to subsequent analysis.
  • The cleaning module 304 is further configured to perform data cleaning on the public opinion factor of the disease according to the type of the public opinion factor.
  • The types of public opinion factors of the disease include, but are not limited to: factors containing noise, non-compliant factors, factors containing repeated information, factors with unbalanced data, inconsistent factors, incomplete factors, and the like.
  • For factors containing noise, the data is cleaned by removing extremely large values and negative points; for non-compliant factors, by removing abnormal values; for factors containing repeated information, by deleting duplicates; for factors with unbalanced data, by data denoising; for inconsistent factors, by classifying them according to data type; and for incomplete factors, by establishing reference values from the relevant standards.
  • The cleaning module 304 is further configured to replace missing values of the public opinion factor of the disease according to the distribution of the public opinion factor.
  • The distribution of the public opinion factors of the disease includes, but is not limited to, a stable type and a volatile type.
  • A stably distributed public opinion factor is one whose trend is relatively steady, for example, 50, 53, 52, 49, 51, and the like.
  • A volatile public opinion factor is one whose trend changes sharply and over a wide range, for example, 50, 100, 43, 89, 4, and the like.
  • The K-nearest-neighbor method can be used: the K samples nearest to the sample with the missing public opinion factor are determined according to Euclidean distance or correlation analysis, and the weighted average of the public opinion factor values of those K samples is used to estimate the missing data of the sample. For a stably distributed public opinion factor, a predictive model can also be used to predict each missing value; if the missing public opinion factor is numeric, the mean can be used to fill it, and if it is non-numeric, the mode can be used to fill it.
  • A mean method can be used to replace the missing public opinion factor.
  • The cleaning module 304 is also configured to directly discard abnormal public opinion factors. Discarding them ensures that the crawled public opinion factors are clean and avoids interference when the public opinion factors of the disease are analyzed.
  • the expansion module 305 is configured to integrate the sensation factor of the disease obtained by the mean replacement with the preset expansion coefficient to obtain a sensation factor of the new disease as a sensation factor of the final disease.
  • the method of using the averaging method to replace the lyric factor of the missing disease is based on the assumption of completely random deletion, which causes the variance and standard deviation of the disease's estrous factor to become smaller.
  • the preset expansion coefficient is a preset expansion coefficient, and the expansion coefficient is greater than 1.
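The shrinking variance can be checked numerically. The sketch below first fills gaps with the plain mean, then with the mean scaled by an expansion coefficient. Multiplying the filled value by the coefficient is only one possible reading of how the coefficient is applied, and is an assumption of this example, as are the name `mean_impute` and the value 1.2.

```python
from statistics import mean, pvariance


def mean_impute(values, expansion=1.0):
    """Replace None with the mean of the observed values times `expansion`.

    Scaling the filled value is an assumed interpretation of the expansion
    coefficient; the source only states that the coefficient is greater than 1.
    """
    present = [v for v in values if v is not None]
    filler = mean(present) * expansion
    return [filler if v is None else v for v in values]


observed = [50, 100, 43, 89, 4]
with_gaps = [50, 100, None, 43, 89, None, 4]

plain = mean_impute(with_gaps)                   # plain mean replacement
expanded = mean_impute(with_gaps, expansion=1.2)  # hypothetical coefficient

# Plain mean replacement lowers the variance relative to the observed data;
# the expanded fill recovers part of that lost spread.
print(pvariance(observed), pvariance(plain), pvariance(expanded))
```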
  • the normalization module 306 is configured to standardize the public opinion factors of the disease after data cleaning and outlier processing to obtain new disease data.
  • standardizing the public opinion factors of the disease after data cleaning and outlier processing converts them into dimensionless pure values, so that indicators with different units or magnitudes can be compared and weighted.
  • data standardization methods include, but are not limited to, sum standardization, standard deviation standardization, maximum value standardization, and range standardization. Range standardization is preferred: after it, the maximum of the new data is 1, the minimum is 0, and all other values lie between 0 and 1.
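Range standardization is commonly written as (x - min) / (max - min). A minimal sketch, assuming a plain list of numbers as input; the function name `range_normalize` and the handling of a constant series are choices made for this example only.

```python
def range_normalize(values):
    """Scale values to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant series has no range; returning zeros is a choice of this sketch.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


normalized = range_normalize([50, 100, 43, 89, 4])
```

After the call, the largest input maps to 1, the smallest to 0, and everything else falls in between, which matches the property stated above.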
  • the prediction module 307 is configured to calculate derived variables of the public opinion factor of the disease from the new disease data, and to predict the disease according to the derived variables.
  • the derived variables include: maximum, minimum, mean, variance, standard deviation, covariance, range (maximum minus minimum), median, mode, and quartiles.
  • the mean, median, mode, and quartiles describe the central tendency of the public opinion factors of the disease; the higher the degree of concentration, the more severe the disease is predicted to be;
  • the variance and standard deviation characterize the dispersion of the public opinion factors of the disease; the smaller the dispersion, the more severe the disease is predicted to be.
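Most of the derived variables listed above are available in Python's standard `statistics` module. The sketch below computes them for one standardized series; the function name `derived_variables` and the dictionary keys are illustrative. Covariance is left out because it needs a second series, which the text does not identify.

```python
import statistics


def derived_variables(values):
    """Compute summary statistics for one series of public opinion factor values."""
    return {
        "max": max(values),
        "min": min(values),
        "mean": statistics.mean(values),
        "variance": statistics.pvariance(values),
        "std_dev": statistics.pstdev(values),
        "range": max(values) - min(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        # Three cut points that split the data into four groups.
        "quartiles": statistics.quantiles(values, n=4),
    }


summary = derived_variables([0.0, 0.25, 0.25, 0.5, 1.0])
```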
  • the public opinion data prediction device 30 receives at least one keyword of the disease input by the user through the receiving module 301; the crawling module 302 determines a data source related to the keyword in the Internet and uses a crawler program to crawl disease data related to the keyword from the data source; the parsing module 303 parses the disease data to obtain the public opinion factor of the disease; the cleaning module 304 performs data cleaning and outlier processing on the public opinion factor; the normalization module 306 standardizes the public opinion factor after data cleaning and outlier processing to obtain new disease data; and the prediction module 307 calculates derived variables of the public opinion factor from the new disease data and predicts the disease accordingly.
  • crawling disease data related to the input keywords with a crawler program yields relatively comprehensive public opinion factors of the disease; organizing, analyzing in depth, and computing on these factors refines the crawled disease data, moves from displaying basic data to displaying decision-making data, and provides a reference for disease prediction, which makes the prediction results more accurate.
  • FIG. 4 is a functional block diagram of a public opinion data prediction apparatus according to Embodiment 4 of the present application.
  • the public opinion data prediction device 40 operates in a terminal.
  • the public opinion data predicting device 40 may include a plurality of functional modules composed of program code segments.
  • the program code for each program segment in the public opinion data predicting device 40 may be stored in a memory and executed by at least one processor to predict the public opinion data (see FIG. 2 and its associated description).
  • the public opinion data prediction device 40 of the terminal may be divided into a plurality of functional modules according to the functions performed by the terminal.
  • the function module may include: a receiving module 401, a classification module 402, a setting module 403, a crawling module 404, a parsing module 405, a cleaning module 406, a standardization module 407, a visualization module 408, a storage module 409, and a quantization module 410.
  • a module as referred to in this application refers to a series of computer readable instruction segments that are executable by at least one processor and capable of performing a fixed function, which are stored in the memory. In some embodiments, the functionality of each module will be detailed in subsequent embodiments.
  • the receiving module 401 is configured to receive at least one keyword of a disease input by the user.
  • the classification module 402 is configured to determine a data source related to the keyword in the Internet, and classify the data source according to the type of the data source.
  • the data sources related to the keyword may be classified into two categories according to their type: the first is index-type data sources, and the second is public-opinion-type data sources.
  • index-type data sources include, but are not limited to, Baidu, Google, and 360.
  • public-opinion-type data sources include, but are not limited to, Weibo, forums, WeChat, and hot searches.
  • the setting module 403 is configured to set up a multi-threaded crawler program whose number of crawler threads equals the number of categories obtained by classifying the data sources.
  • setting different crawler programs for different categories of data sources makes it easier to crawl each category's data smoothly, and avoids crawling difficulties, or failure to parse the crawled data, caused by differing storage formats or other problems.
  • for example, when the data sources fall into two categories, a corresponding dual-thread crawler is set.
  • Baidu and Weibo are two different types of data sources, each with its own text storage format; the first crawler program is set up to crawl disease data related to the keyword in Baidu, and the second crawler program is set up to crawl disease data related to the keyword in Weibo.
  • the data sources related to the keyword in the Internet may also be subdivided into more categories according to actual needs, with a corresponding crawler program set for each category of data source.
  • the crawling module 404 is configured to use the multi-threaded crawler to respectively crawl disease data related to the keyword from the corresponding data source.
  • the URL of the data source corresponding to the crawler program is placed in the crawl queue, and the multi-threaded crawler crawls the disease data related to the keyword from the data source in parallel.
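The queue-and-threads arrangement described above might look like the following sketch. It makes no network calls: the `fetch` callable, the category names, and the example URLs are all hypothetical stand-ins supplied by the caller, and one worker thread is started per data source category.

```python
import queue
import threading


def crawl_in_parallel(sources, keyword, fetch):
    """Crawl each data source in its own thread.

    sources: dict mapping a category name to a URL (hypothetical values).
    fetch:   caller-supplied function fetch(url, keyword) -> data.
    """
    crawl_queue = queue.Queue()
    for category, url in sources.items():
        crawl_queue.put((category, url))  # place each source URL in the crawl queue

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                category, url = crawl_queue.get_nowait()
            except queue.Empty:
                return  # queue drained, thread exits
            data = fetch(url, keyword)
            with lock:  # protect the shared results dictionary
                results[category] = data

    threads = [threading.Thread(target=worker) for _ in sources]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def fake_fetch(url, keyword):
    """Stand-in for a real crawler; returns a string instead of page content."""
    return f"{url}?q={keyword}"


data = crawl_in_parallel(
    {"index": "https://index.example", "opinion": "https://opinion.example"},
    "cough",
    fake_fetch,
)
```

A real crawler would plug a per-category fetch and parse routine into `fetch`, since the text notes that each data source type has its own storage format.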
  • the parsing module 405 is configured to parse the disease data to obtain the public opinion factor of the disease.
  • the cleaning module 406 is configured to perform data cleaning and outlier processing on the public opinion factor of the disease.
  • the standardization module 407 is configured to standardize the public opinion factors of the disease after data cleaning and outlier processing to obtain new disease data.
  • the visualization module 408 is configured to calculate derived variables of the public opinion factor of the disease from the new disease data, and to display the calculated derived variables visually.
  • the storage module 409 is configured to classify and store the disease data obtained by the crawl.
  • the disease data is stored in a local database or stored in a storage server or stored in the cloud.
  • the disease data crawled from Baidu is stored in a first storage location in the local database, and the disease data crawled from Weibo is stored in a second storage location in the local database.
  • the first storage location and the second storage location may be under the same root directory in the local database, or under different root directories.
  • the first storage location and the second storage location may also be displayed under different names.
  • the data collected from different data sources is classified and stored, which is convenient for analyzing data of the same data source.
  • the disease data needs to be updated periodically, so the crawling module 404 is further configured to use the crawler program to crawl the disease data related to the keyword from the data source during a preset crawl time period.
  • the crawl time period is set in advance.
  • the preset crawl time period is from 24:00 to 3:00 every night. The data source's server generally receives few visits at that time, so crawling then does not put much access pressure on the server, which helps the server run smoothly and can improve crawling efficiency.
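A scheduler only needs a small check to decide whether the current time falls inside the crawl window. The sketch below is one way to write it; the function name `in_crawl_window` is hypothetical, and the wrap-around branch is included so that a window such as 23:00 to 3:00 would also work.

```python
from datetime import time


def in_crawl_window(now, start=time(0, 0), end=time(3, 0)):
    """Return True if `now` lies in [start, end), allowing windows that cross midnight."""
    if start <= end:
        return start <= now < end
    # Window wraps past midnight, e.g. 23:00 to 03:00.
    return now >= start or now < end


should_crawl = in_crawl_window(time(1, 30))
```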
  • the public opinion data prediction device 40 may further include a quantization module 410 for quantifying each sub public opinion factor of the disease to obtain its weight, and for determining the sub public opinion factors whose weight is greater than a preset weight threshold as the public opinion factors of the disease.
  • the weights are obtained as follows: calculate the sum of the quantities of all sub public opinion factors of the disease, then calculate each sub public opinion factor's percentage of that sum; this percentage is the weight of the corresponding sub public opinion factor.
  • the weight threshold is set in advance.
  • determining only the sub public opinion factors whose weight exceeds the threshold as public opinion factors of the disease effectively filters out the sub public opinion factors with small weights, which reduces the amount of calculation and shortens the disease prediction time; sub public opinion factors with small weights do not affect the outcome of disease prediction.
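The weight calculation and threshold filter can be sketched in a few lines. The factor names, quantities, and the threshold of 0.2 below are invented for illustration, as is the function name `select_factors`; weights are expressed as fractions of the total rather than percentages.

```python
def select_factors(quantities, threshold):
    """Keep sub factors whose share of the total quantity exceeds `threshold`.

    quantities: dict mapping a sub factor name to its quantity.
    Returns a dict of the retained sub factors and their weights.
    """
    total = sum(quantities.values())
    if total == 0:
        return {}  # no data, so no weights can be computed
    weights = {name: q / total for name, q in quantities.items()}
    return {name: w for name, w in weights.items() if w > threshold}


selected = select_factors(
    {"baidu_index": 60, "weibo_heat": 30, "forum_posts": 10},  # hypothetical data
    threshold=0.2,
)
```

With these example numbers, `forum_posts` carries a weight of 0.1 and is filtered out, while the other two sub factors are kept.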
  • the public opinion data prediction device 40 receives at least one keyword of the disease input by the user through the receiving module 401; the classification module 402 determines the data sources related to the keyword in the Internet and classifies them according to their type; the setting module 403 sets up a multi-threaded crawler program with as many crawler threads as there are categories; the crawling module 404 uses the multi-threaded crawler program to crawl disease data related to the keyword from the corresponding data sources; the parsing module 405 parses the disease data to obtain the public opinion factor of the disease; the cleaning module 406 performs data cleaning and outlier processing on the public opinion factor; the standardization module 407 standardizes the public opinion factor after data cleaning and outlier processing to obtain new disease data; and the visualization module 408 calculates derived variables of the public opinion factor from the new disease data and displays them graphically, thereby supporting disease prediction.
  • using the multi-threaded crawler program to crawl disease data related to the input keywords from the corresponding data sources in parallel speeds up crawling; the data format of the crawled disease data is relatively uniform, and crawling difficulties, or failure to parse the crawled data, caused by differing storage formats or other problems can be avoided; the public opinion factors of the disease are organized, analyzed in depth, and computed on, and after this refinement the crawled disease data is made into graphs or tables, so the results are displayed more clearly and problems can be analyzed intuitively, providing a reference for disease prediction and making the prediction results more accurate.
  • the above-described integrated unit implemented in the form of a software function module can be stored in a non-volatile readable storage medium.
  • the software function modules described above are stored in a storage medium and include instructions for causing a computer device (which may be a personal computer, a dual screen device, or a network device, etc.) or a processor to perform portions of the methods described in the embodiments of the present application.
  • FIG. 5 is a schematic diagram of a terminal according to Embodiment 5 of the present application.
  • the terminal 5 comprises a memory 51, at least one processor 52, computer readable instructions 53 stored in the memory 51 and operable on the at least one processor 52, and at least one communication bus 54.
  • the at least one processor 52 implements the steps of the public opinion data prediction method embodiments when executing the computer readable instructions 53, or implements the functions of the modules/units in the apparatus embodiments when executing the computer readable instructions 53.
  • the computer readable instructions 53 may be partitioned into one or more modules/units, which are stored in the memory 51 and executed by the at least one processor 52 to carry out the present application.
  • the one or more modules/units may be a series of computer readable instruction segments capable of performing a particular function, the instruction segments being used to describe the execution of the computer readable instructions 53 in the terminal 5.
  • the terminal 5 can be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. Those skilled in the art will understand that FIG. 5 is only an example of the terminal 5 and does not limit it; the terminal 5 may include more or fewer components than illustrated, combine some components, or use different components.
  • the terminal 5 may further include an input/output device, a network access device, a bus, and the like.
  • the at least one processor 52 may be a central processing unit, or another general purpose processor, a digital signal processor, an application specific integrated circuit, a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the processor 52 may be a microprocessor or the processor 52 may be any conventional processor or the like.
  • the processor 52 is the control center of the terminal 5 and connects the various parts of the entire terminal 5 using various interfaces and lines.
  • the memory 51 can be used to store the computer readable instructions 53 and/or modules/units; the processor 52 implements the various functions of the terminal 5 by running or executing the computer readable instructions and/or modules/units stored in the memory 51 and by calling the data stored in the memory 51.
  • the memory 51 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and the applications required for at least one function (such as a sound playing function, an image playing function, etc.), and the data storage area may store data (such as audio data, a phone book, etc.) created according to the use of the terminal 5.
  • the memory 51 may include a high speed random access memory, and may also include a nonvolatile memory such as a hard disk, a memory, a plug-in hard disk, a smart memory card, a secure digital card, a flash memory card, at least one disk storage device, a flash memory device, or other volatile solid-state storage devices.
  • all or part of the processes in the foregoing method embodiments of the present application may also be implemented by computer readable instructions, which may be stored in a non-volatile readable storage medium.
  • the computer readable instructions when executed by a processor, implement the steps of the various method embodiments described above.
  • the computer readable instructions comprise computer readable instruction code, which may be in the form of source code, an object code form, an executable file or some intermediate form or the like.
  • the non-volatile readable medium may include any entity or device capable of carrying the computer readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read only memory, a random access memory, electrical carrier signals, telecommunication signals, and software distribution media. It should be noted that the contents of the non-volatile readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, non-volatile readable media do not include electrical carrier signals and telecommunication signals.
  • the disclosed terminal and method may be implemented in other manners.
  • the terminal embodiment described above is only illustrative.
  • the division of the unit is only a logical function division, and the actual implementation may have another division manner.
  • each functional unit in each embodiment of the present application may be integrated in the same processing unit, or each unit may exist physically separately, or two or more units may be integrated in the same unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of hardware plus software function modules.

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Public Health (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Pathology (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A method for forecasting public sentiment data, the method comprising: receiving at least one keyword of a disease input by a user; determining a data source related to the keyword on the Internet, and crawling the data source by means of a crawler program to obtain disease data related to the keyword; parsing the disease data to obtain a public sentiment factor with regards to the disease; performing data cleaning and outlier processing on the public sentiment factor of the disease; performing data normalization on the public sentiment factor of the disease that has undergone data cleaning and outlier processing, and obtaining new disease data; and calculating, according to the new disease data, a derivative variable of the public sentiment factor of the disease, and obtaining a disease forecast according to the derivative variable. The present application further provides a device for forecasting public sentiment data, a terminal, and a storage medium. The present application crawls comprehensive disease data, performs data organization, in-depth analysis and computation on the disease data, and achieves the goal of displaying data from basic data to decision-making data, thereby providing a useful reference in disease forecasting.

Description

Public opinion data prediction method, device, terminal and storage medium
This application claims priority to Chinese Patent Application No. 201810351128.0, entitled "Public opinion data prediction method, device, terminal and storage medium", filed with the Chinese Patent Office on April 18, 2018, the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the field of data prediction technologies, and in particular to a public opinion data prediction method, device, terminal, and storage medium.
Background
With the rapid development of the Internet, computer technology has made people's lives more convenient in all walks of life, and the medical field is no exception. A large amount of professional disease data and user consultation records is hidden on the network, but this data is neither systematic nor complete. When an epidemic breaks out rapidly, website information often cannot be updated in time, so information entry lags behind and users cannot learn the latest information promptly, take preventive measures in time, or guard against the disease before it spreads.
At present, web crawler technology is used to crawl public opinion data about diseases, but the crawling method is rather limited: only a simple crawler is used. Second, the crawled data is not checked effectively and promptly. In addition, the same data cleaning and filling method is applied to data with different distributions, so the data processing results are poor.
Summary of the invention
In view of the above, it is necessary to propose a public opinion data prediction method, device, terminal, and storage medium that can crawl disease data from different data sources and apply different data checking, cleaning, and outlier processing methods.
A first aspect of the present application provides a public opinion data prediction method, the method comprising:
receiving at least one keyword of a disease input by a user;
determining a data source related to the keyword in the Internet, and crawling disease data related to the keyword from the data source by using a crawler program;
parsing the disease data to obtain a public opinion factor of the disease;
performing data cleaning and outlier processing on the public opinion factor of the disease;
standardizing the public opinion factor of the disease after data cleaning and outlier processing to obtain new disease data; and
calculating derived variables of the public opinion factor of the disease according to the new disease data, and predicting the disease according to the derived variables.
A second aspect of the present application provides a public opinion data prediction device, the device comprising:
a receiving module, configured to receive at least one keyword of a disease input by a user;
a crawling module, configured to determine a data source related to the keyword in the Internet, and to crawl disease data related to the keyword from the data source by using a crawler program;
a parsing module, configured to parse the disease data to obtain a public opinion factor of the disease;
a cleaning module, configured to perform data cleaning and outlier processing on the public opinion factor of the disease;
a standardization module, configured to standardize the public opinion factor of the disease after data cleaning and outlier processing to obtain new disease data; and
a prediction module, configured to calculate derived variables of the public opinion factor of the disease according to the new disease data, and to predict the disease according to the derived variables.
A third aspect of the present application provides a terminal, the terminal including a processor and a memory, the processor implementing the public opinion data prediction method when executing computer readable instructions stored in the memory.
A fourth aspect of the present application provides a non-volatile readable storage medium storing computer readable instructions that, when executed by a processor, implement the public opinion data prediction method.
In the public opinion data prediction method, device, terminal, and storage medium described in the present application, different crawler programs are set for different categories of data sources, and a multi-threaded crawler program crawls disease data related to the input keywords from the corresponding data sources. Crawling in parallel speeds up crawling, the data format of the crawled disease data is relatively uniform, and crawling difficulties, or failure to parse the crawled data, caused by differing storage formats or other problems can be avoided. The public opinion factors of the disease are organized, analyzed in depth, and computed on; after this refinement, the crawled disease data is made into graphs or tables, so the results are displayed more clearly and problems can be analyzed intuitively. In addition, multiple variables are derived from the public opinion factors of the disease, which adds data indicators and provides a reference for disease prediction, so that disease prediction no longer relies blindly on experience and the prediction results are more accurate.
Description of the drawings
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of the public opinion data prediction method provided in Embodiment 1 of the present application.
FIG. 2 is a flowchart of the public opinion data prediction method provided in Embodiment 2 of the present application.
FIG. 3 is a structural diagram of the public opinion data prediction device provided in Embodiment 3 of the present application.
FIG. 4 is a structural diagram of the public opinion data prediction device provided in Embodiment 4 of the present application.
FIG. 5 is a structural diagram of the terminal provided in Embodiment 5 of the present application.
The following detailed description further explains the present application with reference to the above drawings.
Detailed description
To make the above objects, features, and advantages of the present application easier to understand, the application is described in detail below with reference to the drawings and specific embodiments. It should be noted that, where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with each other.
Many specific details are set forth in the following description to facilitate a full understanding of the present application. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field of the present application. The terms used in the specification of the present application are only for the purpose of describing specific embodiments and are not intended to limit the present application.
The public opinion data prediction method of the embodiments of the present application is applied in one or more terminals. The method can also be applied in a hardware environment composed of a terminal and a server connected to the terminal through a network. The network includes, but is not limited to, a wide area network, a metropolitan area network, or a local area network. The public opinion data prediction method of the embodiments of the present application may be executed by the server, by the terminal, or jointly by the server and the terminal.
A terminal that needs to perform public opinion data prediction may directly integrate the public opinion data prediction function provided by the method of the present application, or install a client that implements the method. Alternatively, the method provided by the present application may run on a device such as a server in the form of a software development kit (SDK), which provides an interface to the public opinion data prediction function; a terminal or other device can then predict public opinion data through the provided interface.
Embodiment 1
FIG. 1 is a flowchart of the public opinion data prediction method provided in Embodiment 1 of the present application. Depending on requirements, the order of execution in the flowchart may be changed and some steps may be omitted.
Step 11. Receive at least one keyword of a disease input by a user.
A keyword is a word related to a symptom of the disease. For example, when the disease is the common cold, the keywords may include sneezing, runny nose, nasal congestion, headache and dizziness, dry cough, sore throat, and the like. As another example, when the disease is hand, foot and mouth disease, the keywords may include mouth pain, loss of appetite, low-grade fever, small blisters on the hands, small mouth ulcers, and the like.
So that more disease-related data can be crawled later, the user may input multiple keywords for the disease. The keywords may be symptoms the user knows from experience or symptoms collected from disease experts.
In this embodiment, the terminal provides in advance a function for the user to input keywords of the disease. For example, the terminal provides a text input box through which the user can input at least one keyword. Alternatively, the terminal provides a voice assistant through which the user can input at least one keyword.
Step 12. Determine data sources on the Internet that are related to the keyword, and use a crawler program to crawl disease data related to the keyword from the data sources.
Data sources on the Internet related to the keyword may include, but are not limited to, Baidu, Google, Tencent, Weibo, hot search lists, Zhihu, and any website that supports user search and access. The disease data crawled from these data sources may include the Baidu Index, Google Trends, Tencent Analysis, news, advertising data, channel data, Weibo popularity, forum public opinion information, and the like.
In this embodiment, the user determines the Uniform Resource Locator (URL) of each data source on the Internet, and the crawler program crawls the disease data related to the keyword according to the URL.
Step 13. Parse the disease data to obtain the public opinion factors of the disease.
The disease data undergoes specific analysis, including public opinion analysis, which comprises text processing, text analysis, word frequency statistics, correlation analysis, and similar processing, to obtain the public opinion factors of the disease.
In this embodiment, the public opinion factor of the disease may include multiple public opinion sub-factors, for example a first sub-factor, a second sub-factor, a third sub-factor, a fourth sub-factor, and so on.
For example, the first sub-factor may be headache, the second runny nose, the third fever, and the fourth cough.
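The word frequency statistics mentioned above can be sketched as follows. This is a minimal illustration, not the implementation of the present application: the function name `count_sub_factors`, the symptom vocabulary, and the sample posts are all hypothetical, and plain substring counting stands in for real text segmentation and analysis.

```python
from collections import Counter


def count_sub_factors(documents, symptom_terms):
    """Count how often each symptom term occurs across the crawled texts."""
    counts = Counter()
    for text in documents:
        lowered = text.lower()
        for term in symptom_terms:
            # str.count gives the number of non-overlapping occurrences.
            counts[term] += lowered.count(term)
    return counts


# Hypothetical crawled posts, used only to exercise the function.
posts = [
    "Headache and runny nose since Monday",
    "My cough is worse, and the headache will not go away",
    "Fever, cough, cough again",
]
factors = count_sub_factors(posts, ["headache", "runny nose", "fever", "cough"])
print(factors.most_common())
```

Each term's count can then serve as the raw value of the corresponding public opinion sub-factor.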
Step 14. Perform data cleaning and outlier processing on the public opinion factors of the disease.
Data cleaning and outlier processing eliminate redundant data in the public opinion factors and yield disease data in a consistent standard format, so that the processed public opinion factors are usable and better suited to subsequent analysis.
In this embodiment, performing data cleaning on the public opinion factors of the disease includes cleaning them according to their type.
The types of public opinion factor include, but are not limited to: factors containing noise, factors that defy common sense, factors containing duplicate information, factors with imbalanced data, inconsistent factors, incomplete factors, and the like.
Factors containing noise are cleaned by removing extremely large values and negative values; factors that defy common sense are cleaned by removing outliers; factors containing duplicate information are cleaned by deleting duplicates; imbalanced factors are cleaned by data denoising; inconsistent factors are cleaned by classifying them by data type; and incomplete factors are cleaned by establishing relevant standard reference values.
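Two of the cleaning rules above (removing extremely large and negative values, and deleting duplicates) might look like the following sketch. The record layout `(day, count)`, the function name `clean_factor_series`, and the `upper_limit` cutoff are assumptions made for illustration; the text does not specify how "extremely large" is defined.

```python
def clean_factor_series(records, upper_limit):
    """Drop negative values, values above upper_limit, and exact duplicates."""
    seen = set()
    cleaned = []
    for day, count in records:
        if count < 0 or count > upper_limit:
            continue  # noise: negative or extremely large point
        if (day, count) in seen:
            continue  # repeated record
        seen.add((day, count))
        cleaned.append((day, count))
    return cleaned


# Hypothetical daily counts for one sub-factor.
raw = [("d1", 50), ("d1", 50), ("d2", -3), ("d3", 9000), ("d4", 52)]
cleaned = clean_factor_series(raw, upper_limit=1000)
print(cleaned)
```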
In this embodiment, performing outlier processing on the public opinion factors of the disease includes replacing missing values according to the distribution of the public opinion factors.
In this embodiment, the distribution of the public opinion factors includes, but is not limited to, a stable type and a volatile type. A stable distribution means the factor changes relatively smoothly, for example 50, 53, 52, 49, 51. A volatile distribution means the factor changes sharply and over a wide range, for example 50, 100, 43, 89, 4.
For public opinion factors with a stable distribution, the K-nearest-neighbor method may be used: the K samples closest to the sample with the missing factor are determined by Euclidean distance or correlation analysis, and the weighted average of their K factor values is used to estimate the missing data. A prediction model may also be used to predict each missing factor: if the missing factor is numerical, it may be filled with the mean; if it is non-numerical, it may be filled with the mode.
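A minimal sketch of the K-nearest-neighbor estimate follows. The text only says "weighted average"; the inverse-distance weighting used here, the `(features, value)` sample layout, and the name `knn_impute` are assumptions chosen for illustration.

```python
import math


def knn_impute(target_features, complete_samples, k=3):
    """Estimate a missing value from the k nearest complete samples.

    complete_samples is a list of (features, value) pairs. Neighbours are
    ranked by Euclidean distance and weighted by inverse distance.
    """
    ranked = sorted(
        (math.dist(target_features, features), value)
        for features, value in complete_samples
    )[:k]
    # A small epsilon avoids division by zero for an exact match.
    weights = [1.0 / (distance + 1e-9) for distance, _ in ranked]
    weighted_sum = sum(w * value for w, (_, value) in zip(weights, ranked))
    return weighted_sum / sum(weights)


# Hypothetical samples: the single feature is a day index.
samples = [((1.0,), 50.0), ((3.0,), 52.0), ((10.0,), 90.0)]
estimate = knn_impute((2.0,), samples, k=2)
print(estimate)
```

With `k=2` the two neighbours at days 1 and 3 are equally distant from day 2, so the estimate is their plain average.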
For public opinion factors with a volatile distribution, the missing factor may be replaced by the mean.
Preferably, because mean substitution rests on the assumption that data are missing completely at random, it reduces the variance and standard deviation of the public opinion factors. The method may therefore further include multiplying the public opinion factor obtained by mean substitution by a preset expansion coefficient, and taking the product as the final public opinion factor of the disease.
The preset expansion coefficient is set in advance and is greater than 1.
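Mean substitution followed by the expansion coefficient can be sketched as below. The text does not state whether the coefficient applies to the whole series or only to the substituted values; this sketch assumes the latter. The name `impute_with_expansion`, the use of `None` for missing entries, and the sample values are hypothetical.

```python
def impute_with_expansion(series, expansion=1.2):
    """Replace None entries with the observed mean times `expansion`."""
    if expansion <= 1:
        raise ValueError("expansion coefficient must be greater than 1")
    observed = [value for value in series if value is not None]
    fill = sum(observed) / len(observed) * expansion
    return [fill if value is None else value for value in series]


# Observed mean is (50 + 100 + 90 + 0) / 4 = 60, so the fill is 60 * 1.5.
filled = impute_with_expansion([50, 100, None, 90, 0], expansion=1.5)
print(filled)
```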
In other embodiments, performing outlier processing on the public opinion factors of the disease further includes directly discarding abnormal factors. Discarding them keeps the crawled public opinion factors clean and avoids interference with their analysis.
Step 15. Standardize the public opinion factors that have undergone data cleaning and outlier processing to obtain new disease data.
Standardization converts the public opinion factors into dimensionless pure numbers so that indicators with different units or magnitudes can be compared and weighted.
In this embodiment, the standardization methods include, but are not limited to, sum standardization, standard deviation standardization, maximum value standardization, and range standardization. Range standardization is preferred: after it is applied, the maximum of the new data is 1, the minimum is 0, and all other values lie between 0 and 1.
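Range standardization maps each value v to (v - min) / (max - min). A minimal sketch, assuming a plain list of numbers; the name `range_normalize` and the choice to return zeros for a constant series are assumptions, since the text does not cover that case.

```python
def range_normalize(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # Constant series: no spread to scale, so return zeros by convention.
        return [0.0 for _ in values]
    return [(value - lowest) / (highest - lowest) for value in values]


normalized = range_normalize([50, 100, 43, 89, 4])
print(normalized)
```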
Step 16. Calculate derived variables of the public opinion factors from the new disease data, and predict the disease according to the derived variables.
In this embodiment, the derived variables include the maximum, minimum, mean, variance, standard deviation, covariance, range (maximum minus minimum), median, mode, and quartiles. The mean, median, mode, and quartiles describe the degree of concentration of the public opinion factors: the greater the concentration, the more serious the predicted disease. The range, variance, and standard deviation describe the degree of dispersion: the smaller the dispersion, the more serious the predicted disease.
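Most of the derived variables listed above are available in Python's standard `statistics` module. The sketch below assumes population (not sample) variance and standard deviation, which the text does not specify, and omits covariance because it needs a second series. The function name and input values are hypothetical.

```python
import statistics


def derived_variables(values):
    """Compute summary statistics for one standardized factor series."""
    return {
        "max": max(values),
        "min": min(values),
        "range": max(values) - min(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "variance": statistics.pvariance(values),
        "std_dev": statistics.pstdev(values),
        # Three cut points dividing the data into four groups.
        "quartiles": statistics.quantiles(values, n=4),
    }


summary = derived_variables([0.0, 0.25, 0.25, 0.5, 1.0])
print(summary)
```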
In this public opinion data prediction method, at least one keyword of a disease input by a user is received; data sources on the Internet related to the keyword are determined; a crawler program crawls disease data related to the keyword from the data sources; the disease data is parsed to obtain the public opinion factors of the disease; data cleaning and outlier processing are performed on the factors; the processed factors are standardized to obtain new disease data; and derived variables of the factors are calculated from the new disease data, so that the disease is predicted according to the derived variables. From keywords that the user enters roughly, the crawler program obtains disease data related to those keywords, yielding a fairly comprehensive set of public opinion factors for the disease. Organizing, analyzing in depth, and computing on these factors refines the crawled disease data from a basic data display into a decision-oriented data display, which provides a reference basis for disease prediction, and the prediction results are accurate.
Embodiment 2
FIG. 2 is a flowchart of the public opinion data prediction method provided in Embodiment 2 of the present application. Depending on requirements, the order of execution in the flowchart may be changed and some steps may be omitted.
Step 21. Receive at least one keyword of a disease input by a user.
Step 21 in this embodiment is the same as Step 11 in Embodiment 1 and is not described again here.
Step 22. Determine data sources on the Internet that are related to the keyword, and classify the data sources according to their type.
In this embodiment, the data sources related to the keyword may be divided into two categories according to type: index-type data sources and public-opinion-volume data sources. Index-type data sources include, but are not limited to, Baidu, Google, and 360. Public-opinion-volume data sources include, but are not limited to, Weibo, forums, WeChat, and hot search lists.
Step 23. According to the number of categories obtained by classifying the data sources, set up a multi-threaded crawler program whose number of crawlers equals the number of categories.
Assigning a different crawler program to each category of data source makes it easier to crawl that category's data smoothly, and avoids crawling difficulties or failure to parse the crawled data caused by differences in storage format or other issues between data sources.
In this embodiment, if the data sources are divided into two categories, a two-thread crawler program is set up accordingly. For example, Baidu and Weibo are data sources of two different types, each with its own text storage format; a first crawler program is dedicated to crawling disease data related to the keyword from Baidu, and a second crawler program is dedicated to crawling disease data related to the keyword from Weibo.
In other embodiments, the data sources related to the keyword may be subdivided into more categories as needed, with a corresponding crawler program set up for each category.
Step 24. Use the multi-threaded crawler program to crawl disease data related to the keyword from the corresponding data sources.
In this embodiment, the URLs of the data sources assigned to each crawler program are placed in a crawl queue, and the multi-threaded crawler program crawls the disease data related to the keyword from the data sources in parallel.
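The one-crawler-per-category arrangement can be sketched with a thread pool. This is an illustration under stated assumptions, not the crawler of the present application: `crawl_by_category` is a hypothetical name, the URLs use the reserved `.example` domain, and stub functions stand in for real HTTP requests and site-specific parsing so the sketch runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_by_category(urls_by_category, fetchers):
    """Run one crawler per data-source category, all categories in parallel."""

    def run(category):
        fetch = fetchers[category]  # crawler dedicated to this category
        return category, [fetch(url) for url in urls_by_category[category]]

    # One worker thread per category, matching the number of categories.
    with ThreadPoolExecutor(max_workers=len(urls_by_category)) as pool:
        return dict(pool.map(run, urls_by_category))


# Stub fetchers: placeholders for real requests to each kind of source.
fetchers = {
    "index": lambda url: f"index data from {url}",
    "volume": lambda url: f"post volume from {url}",
}
urls = {
    "index": ["https://index.example/search?q=cough"],
    "volume": ["https://forum.example/search?q=cough"],
}
results = crawl_by_category(urls, fetchers)
print(results)
```

Keeping the fetch function separate per category is what lets each crawler handle its own source format, as described in Step 23.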
Step 25. Parse the disease data to obtain the public opinion factors of the disease.
Step 26. Perform data cleaning and outlier processing on the public opinion factors of the disease.
Step 27. Standardize the public opinion factors that have undergone data cleaning and outlier processing to obtain new disease data.
Steps 25 to 27 in this embodiment correspond to Steps 13 to 15 in Embodiment 1, respectively, and are not described again here.
Step 28. Calculate derived variables of the public opinion factors from the new disease data, and present the calculated derived variables visually as charts.
Preferably, Step 24 may further include storing the crawled disease data by category.
The disease data is stored in a local database, on a storage server, or in the cloud. For example, disease data crawled from Baidu is stored at a first storage location in the local database, and disease data crawled from Weibo is stored at a second storage location in the local database. The first and second storage locations may be under the same root directory in the local database or under different root directories, and may be distinguished by different names. Storing data crawled from different data sources separately makes it easier to analyze the data from a single source.
Preferably, to keep the crawled disease data up to date, the disease data needs to be updated regularly. The method may further include using the crawler program to crawl disease data related to the keyword from the data sources within a preset crawl period.
The preset crawl period is set in advance, for example from 24:00 to 03:00 every night. Few people access the data sources' servers during this period, so crawling does not put heavy access pressure on those servers, helps them run smoothly, and improves crawling efficiency.
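A check for the preset crawl period might look like the sketch below. The function name `in_crawl_window` and its defaults (midnight to 03:00, end exclusive) are assumptions for illustration; the second branch handles a window that wraps past midnight, such as 22:00 to 02:00.

```python
from datetime import datetime, time


def in_crawl_window(now, start=time(0, 0), end=time(3, 0)):
    """Return True when `now` falls inside the preset crawl period."""
    current = now.time()
    if start <= end:
        return start <= current < end
    # The window wraps past midnight (for example 22:00 to 02:00).
    return current >= start or current < end


print(in_crawl_window(datetime(2018, 8, 13, 1, 30)))
print(in_crawl_window(datetime(2018, 8, 13, 12, 0)))
```

A scheduler would call this before starting each crawl and skip the run when it returns `False`.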
Preferably, after the crawler program crawls disease data related to the keyword from the data sources within the preset crawl period and the disease data is parsed to obtain the public opinion factors of the disease, the method may further include quantifying each public opinion sub-factor of the disease to obtain its weight, and taking the sub-factors whose weight is greater than a preset weight threshold as the public opinion factors of the disease.
The sub-factors are quantified as follows: calculate the total count of all public opinion sub-factors of the disease, then calculate each sub-factor's percentage of that total; the percentage is the weight of the corresponding sub-factor.
The preset weight threshold is set in advance. Taking only sub-factors whose weight exceeds the threshold as public opinion factors filters out low-weight sub-factors, which reduces the amount of computation and shortens the prediction time, while the low-weight sub-factors do not affect the prediction result.
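The weighting and threshold filter described above reduce to a few lines. The counts, the 10% threshold, and the name `select_factors` are hypothetical values chosen so the arithmetic is easy to follow; weights are expressed as fractions of 1 rather than percentages.

```python
def select_factors(counts, weight_threshold):
    """Weight each sub-factor by its share of the total and keep the heavy ones."""
    total = sum(counts.values())
    weights = {name: count / total for name, count in counts.items()}
    kept = {name: w for name, w in weights.items() if w > weight_threshold}
    return weights, kept


# Hypothetical sub-factor counts summing to 100.
counts = {"headache": 40, "runny nose": 30, "fever": 25, "cough": 5}
weights, kept = select_factors(counts, weight_threshold=0.10)
print(weights)
print(sorted(kept))
```

Here cough has a weight of 0.05, below the threshold, so it is dropped and the other three sub-factors are kept.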
In summary, in this public opinion data prediction method, at least one keyword of a disease input by a user is received; data sources on the Internet related to the keyword are determined and classified according to their type; according to the number of categories obtained, a multi-threaded crawler program with the same number of crawlers is set up; the multi-threaded crawler program crawls disease data related to the keyword from the corresponding data sources; data cleaning and outlier processing are performed on the public opinion factors of the disease; the processed factors are standardized to obtain new disease data; derived variables of the factors are calculated from the new disease data; and the calculated derived variables are presented visually as charts, so that the disease is predicted. Because a different crawler program is assigned to each category of data source and the multi-threaded crawler program crawls the disease data related to the input keywords from the corresponding data sources, crawling in parallel is more efficient, the format of the crawled disease data is relatively uniform, and crawling difficulties or failure to parse the crawled data caused by differences in storage format or other issues between data sources are avoided. The public opinion factors of the disease are organized, analyzed in depth, and computed on; after this refinement, the crawled disease data is presented as graphs or tables, so the results are clearer and easier to analyze intuitively, which provides a reference basis for disease prediction, and the prediction results are accurate.
The above are only specific embodiments of the present application, but the scope of protection of the present application is not limited thereto. A person of ordinary skill in the art may make improvements without departing from the inventive concept of the present application, and all such improvements fall within the scope of protection of the present application.
The functional modules and hardware structure of a terminal that implements the above public opinion data prediction method are described below with reference to FIGS. 3 to 5.
Embodiment 3
FIG. 3 is a functional module diagram of the public opinion data prediction apparatus provided in Embodiment 3 of the present application.
In some embodiments, the public opinion data prediction apparatus 30 runs in a terminal. The apparatus 30 may include multiple functional modules composed of program code segments. The program code of each segment may be stored in a memory and executed by at least one processor to perform the prediction of public opinion data (see FIG. 1 and its description).
In this embodiment, the public opinion data prediction apparatus 30 of the terminal may be divided into multiple functional modules according to the functions it performs. The functional modules may include a receiving module 301, a crawling module 302, a parsing module 303, a cleaning module 304, an expansion module 305, a standardization module 306, and a prediction module 307. A module, as used in the present application, is a series of computer-readable instruction segments that can be executed by at least one processor to perform a fixed function and that are stored in the memory. In some embodiments, the functions of the modules are detailed in subsequent embodiments.
The receiving module 301 is configured to receive at least one keyword of a disease input by a user.
A keyword is a word related to a symptom of the disease. For example, when the disease is the common cold, the keywords may include sneezing, runny nose, nasal congestion, headache and dizziness, dry cough, sore throat, and the like. As another example, when the disease is hand, foot and mouth disease, the keywords may include mouth pain, loss of appetite, low-grade fever, small blisters on the hands, small mouth ulcers, and the like.
So that more disease-related data can be crawled later, the user may input multiple keywords for the disease. The keywords may be symptoms the user knows from experience or symptoms collected from disease experts.
In this embodiment, the terminal provides in advance a function for the user to input keywords of the disease. For example, the terminal provides a text input box through which the user can input at least one keyword. Alternatively, the terminal provides a voice assistant through which the user can input at least one keyword.
The crawling module 302 is configured to determine data sources on the Internet that are related to the keyword, and to use a crawler program to crawl disease data related to the keyword from the data sources.
Data sources on the Internet related to the keyword may include, but are not limited to, Baidu, Google, Tencent, Weibo, hot search lists, Zhihu, and any website that supports user search and access. The disease data crawled from these data sources may include the Baidu Index, Google Trends, Tencent Analysis, news, advertising data, channel data, Weibo popularity, forum public opinion information, and the like.
In this embodiment, the user determines the Uniform Resource Locator (URL) of each data source on the Internet, and the crawler program crawls the disease data related to the keyword according to the URL.
The parsing module 303 is configured to parse the disease data to obtain the public opinion factors of the disease.
The disease data undergoes specific analysis, including public opinion analysis, which comprises text processing, text analysis, word frequency statistics, correlation analysis, and similar processing, to obtain the public opinion factors of the disease.
In this embodiment, the public opinion factor of the disease may include multiple public opinion sub-factors, for example a first sub-factor, a second sub-factor, a third sub-factor, a fourth sub-factor, and so on.
For example, the first sub-factor may be headache, the second runny nose, the third fever, and the fourth cough.
The cleaning module 304 is configured to perform data cleaning and outlier processing on the public opinion factors of the disease.
Data cleaning and outlier processing eliminate redundant data in the public opinion factors and yield disease data in a consistent standard format, so that the processed public opinion factors are usable and better suited to subsequent analysis.
The cleaning module 304 is further configured to clean the public opinion factors of the disease according to their type.
The types of public opinion factor include, but are not limited to: factors containing noise, factors that defy common sense, factors containing duplicate information, factors with imbalanced data, inconsistent factors, incomplete factors, and the like.
Factors containing noise are cleaned by removing extremely large values and negative values; factors that defy common sense are cleaned by removing outliers; factors containing duplicate information are cleaned by deleting duplicates; imbalanced factors are cleaned by data denoising; inconsistent factors are cleaned by classifying them by data type; and incomplete factors are cleaned by establishing relevant standard reference values.
The cleaning module 304 is further configured to replace missing values of the public opinion factors of the disease according to the distribution of those factors.
In this embodiment, the distributions of the public opinion factors include, but are not limited to, a stable type and a volatile type. A stable distribution means that the factor values change smoothly, for example 50, 53, 52, 49, 51. A volatile distribution means that the factor values change sharply and with large amplitude, for example 50, 100, 43, 89, 4.
For stably distributed factors, the K-nearest-neighbor method may be used: the K samples closest to the sample with the missing factor are determined by Euclidean distance or correlation analysis, and the weighted average of their K factor values is used to estimate the missing value. Alternatively, for stably distributed factors a predictive model may be used to predict each missing factor: if the missing factor is numeric, it may be filled with the mean; if it is non-numeric, it may be filled with the mode.
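The K-nearest-neighbor estimate described above can be sketched as follows. This is an illustrative sketch only: the use of inverse-distance weights, the feature vectors (here simply a day index), and the name `knn_impute` are assumptions, since the application does not say how the weights are chosen. `math.dist` requires Python 3.8 or later.

```python
import math
from typing import List, Optional, Sequence


def knn_impute(
    features: Sequence[Sequence[float]],
    values: Sequence[Optional[float]],
    k: int = 2,
) -> List[float]:
    """Fill each missing value with a weighted average of its k nearest known samples."""
    known = [(f, v) for f, v in zip(features, values) if v is not None]
    if not known:
        raise ValueError("at least one known value is required")
    filled: List[float] = []
    for f, v in zip(features, values):
        if v is not None:
            filled.append(v)
            continue
        # Rank known samples by Euclidean distance to the incomplete sample.
        nearest = sorted(known, key=lambda fv: math.dist(f, fv[0]))[:k]
        # Assumed weighting: inverse distance (epsilon avoids division by zero).
        weights = [1.0 / (math.dist(f, nf) + 1e-9) for nf, _ in nearest]
        estimate = sum(w * nv for w, (_, nv) in zip(weights, nearest)) / sum(weights)
        filled.append(estimate)
    return filled


days = [[0], [1], [2], [3], [4]]        # hypothetical feature: day index
series = [50, 53, None, 49, 51]         # stable series with one gap
print(knn_impute(days, series, k=2))
```

In this example the two nearest neighbors of the gap are equally distant, so the estimate is simply the average of 53 and 49.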
For factors with a volatile distribution, a missing factor may be replaced by the mean.
The cleaning module 304 is further configured to directly discard abnormal public opinion factors. Discarding abnormal factors keeps the crawled factors clean and avoids interference when the factors are analyzed.
The expansion module 305 is configured to multiply the public opinion factor obtained by mean replacement by a preset expansion coefficient, and to use the product as the final public opinion factor. Mean replacement rests on the assumption that values are missing completely at random, and it reduces the variance and standard deviation of the factors; the expansion coefficient is applied for this reason. The preset expansion coefficient is set in advance and is greater than 1.
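A minimal sketch of mean replacement followed by expansion is given below. It reads the text as applying the expansion coefficient only to the values that were filled in by the mean; that reading, the coefficient value 1.1, and the function name are assumptions of this example, not details stated in the application.

```python
from statistics import mean
from typing import List, Optional, Sequence


def mean_impute_and_expand(
    values: Sequence[Optional[float]], expansion: float = 1.1
) -> List[float]:
    """Replace missing values with the observed mean multiplied by an expansion coefficient."""
    if expansion <= 1:
        raise ValueError("the expansion coefficient must be greater than 1")
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("at least one observed value is required")
    # Assumed reading: only the mean-substituted entries are scaled.
    fill = mean(observed) * expansion
    return [fill if v is None else v for v in values]


volatile = [50, 100, None, 89, 4]   # volatile series with one gap
print(mean_impute_and_expand(volatile, expansion=1.1))
```

Here the observed mean is 60.75, so the gap is filled with approximately 66.825.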
The standardization module 306 is configured to standardize the public opinion factors of the disease after data cleaning and outlier handling, to obtain new disease data.
The cleaned factors are standardized in order to convert them into dimensionless pure numbers, so that indicators with different units or orders of magnitude can be compared and weighted.
In this embodiment, data standardization methods include, but are not limited to: sum standardization, standard-deviation standardization, maximum-value standardization, and range standardization. Range standardization is preferred: after it is applied, the maximum of the new data is 1, the minimum is 0, and all other values lie between 0 and 1.
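Range standardization as described above corresponds to the usual min-max formula, x' = (x - min) / (max - min). The sketch below is illustrative; the function name and the choice to map a constant series to zeros are assumptions of this example.

```python
from typing import List, Sequence


def range_normalize(values: Sequence[float]) -> List[float]:
    """Scale values so that the minimum maps to 0 and the maximum maps to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumed convention: a constant series has no spread, so return zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


factors = [50, 100, 43, 89, 4]
normalized = range_normalize(factors)
print(normalized)
```

For the sample series, 100 maps to 1.0 and 4 maps to 0.0, with the remaining values in between.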
The prediction module 307 is configured to calculate derived variables of the public opinion factors of the disease from the new disease data, and to predict the disease according to the derived variables.
In this embodiment, the derived variables include: maximum, minimum, mean, variance, standard deviation, covariance, range (maximum minus minimum), median, mode, and quartiles. The mean, median, mode, and quartiles describe the degree of concentration of the public opinion factors; the greater the concentration, the more severe the predicted disease. The range, variance, and standard deviation describe the degree of dispersion of the public opinion factors; the smaller the dispersion, the more severe the predicted disease.
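Most of the derived variables listed above can be computed with Python's standard `statistics` module, as in the sketch below. The function name and dictionary keys are assumptions of this example; covariance is omitted because it needs a second series, and the population variance is used here as one possible choice. `statistics.quantiles` and the first-mode behavior of `statistics.mode` require Python 3.8 or later.

```python
import statistics
from typing import Dict, Sequence


def derive_variables(values: Sequence[float]) -> Dict[str, object]:
    """Compute summary statistics for one series of normalized or raw factor values."""
    return {
        "max": max(values),
        "min": min(values),
        "mean": statistics.mean(values),
        "variance": statistics.pvariance(values),   # population variance (assumed choice)
        "std": statistics.pstdev(values),
        "range": max(values) - min(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),            # first mode if several values tie
        "quartiles": statistics.quantiles(values, n=4),
    }


stable = [50, 53, 52, 49, 51]
derived = derive_variables(stable)
for name, value in derived.items():
    print(name, value)
```

How these variables are combined into a severity prediction is not specified in this section, so the sketch stops at computing them.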
In the public opinion data prediction apparatus 30, the receiving module 301 receives at least one keyword of a disease input by a user; the crawling module 302 determines the data sources on the Internet related to the keyword and uses a crawler program to crawl disease data related to the keyword from those data sources; the parsing module 303 parses the disease data to obtain public opinion factors of the disease; the cleaning module 304 performs data cleaning and outlier handling on the factors; the standardization module 306 standardizes the cleaned factors to obtain new disease data; and the prediction module 307 calculates derived variables of the factors from the new disease data and predicts the disease according to the derived variables. The user only needs to enter rough disease-related keywords; the crawler then crawls the disease data related to those keywords, yielding a fairly comprehensive set of public opinion factors for the disease. The factors are then organized, analyzed in depth, and used in calculation. This fine-grained processing of the crawled disease data takes the result from a display of basic data to a display of decision-oriented data, provides a reference basis for disease prediction, and yields accurate predictions.
Embodiment 4
FIG. 4 is a functional block diagram of the public opinion data prediction apparatus provided in Embodiment 4 of the present application.
In some embodiments, the public opinion data prediction apparatus 40 runs in a terminal. The apparatus 40 may include a plurality of functional modules composed of program code segments. The program code of each segment may be stored in a memory and executed by at least one processor to perform the prediction of public opinion data (see FIG. 2 and its related description).
In this embodiment, the public opinion data prediction apparatus 40 of the terminal may be divided into a plurality of functional modules according to the functions it performs. The functional modules may include: a receiving module 401, a classification module 402, a setting module 403, a crawling module 404, a parsing module 405, a cleaning module 406, a standardization module 407, a visualization module 408, a storage module 409, and a quantization module 410. A module in the present application refers to a series of computer-readable instruction segments that can be executed by at least one processor, that perform a fixed function, and that are stored in the memory. In some embodiments, the functions of the modules are described in detail in subsequent embodiments.
The receiving module 401 is configured to receive at least one keyword of a disease input by a user.
The classification module 402 is configured to determine the data sources on the Internet related to the keyword, and to classify the data sources according to their type.
In this embodiment, the data sources related to the keyword may be divided into two broad categories according to type: index-type data sources and public-opinion-volume data sources. Index-type data sources include, but are not limited to, Baidu, Google, and 360. Public-opinion-volume data sources include, but are not limited to, Weibo, forums, WeChat, and trending-search lists.
The setting module 403 is configured to set up multi-threaded crawler programs equal in number to the categories obtained by classifying the data sources.
Assigning a different crawler program to each category of data source makes it easier to crawl the data of that category smoothly, and avoids crawling difficulties or unparseable crawled data caused by differing storage formats or other differences between data sources.
In this embodiment, if the data sources are divided into two categories, a two-thread crawler is set up accordingly. For example, Baidu and Weibo are two different types of data source, each with its own text storage format; a first crawler program is therefore dedicated to crawling the disease data related to the keyword from Baidu, and a second crawler program is dedicated to crawling the disease data related to the keyword from Weibo.
In other embodiments, the data sources on the Internet related to the keyword may be subdivided into more categories as actually needed, with a corresponding crawler program set up for each category.
The crawling module 404 is configured to use the multi-threaded crawler programs to crawl disease data related to the keyword from their corresponding data sources.
In this embodiment, the URLs of the data sources assigned to each crawler program are placed in a crawl queue, and the multi-threaded crawler programs crawl the disease data related to the keyword from the data sources in parallel.
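The per-category queues and parallel crawling described above might be organized as in the sketch below, which uses one worker thread per category via `concurrent.futures.ThreadPoolExecutor`. The function names, the `example.invalid` URLs, and the injected `fetch` callable are hypothetical; a stand-in fetcher is used so that the example runs without network access, and a real crawler would supply its own source-specific fetch and parse logic.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple


def crawl_by_category(
    url_queues: Dict[str, List[str]], fetch: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Crawl each category's URL queue in its own thread and collect the pages."""

    def crawl_one(category: str) -> Tuple[str, List[str]]:
        # Each worker drains the queue of exactly one data-source category.
        return category, [fetch(url) for url in url_queues[category]]

    workers = max(1, len(url_queues))  # one thread per category
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(crawl_one, url_queues))


def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP request (assumption for this example)."""
    return f"<page for {url}>"


queues = {
    "index": ["https://index.example.invalid/q?kw=flu"],
    "volume": [
        "https://posts.example.invalid/search?kw=flu",
        "https://forum.example.invalid/search?kw=flu",
    ],
}
pages = crawl_by_category(queues, fake_fetch)
print(pages)
```

Keeping the fetcher as a parameter lets each category use a parser suited to its own storage format, which is the motivation the text gives for separate crawlers.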
The parsing module 405 is configured to parse the disease data to obtain public opinion factors of the disease.
The cleaning module 406 is configured to perform data cleaning and outlier handling on the public opinion factors of the disease.
The standardization module 407 is configured to standardize the public opinion factors of the disease after data cleaning and outlier handling, to obtain new disease data.
The visualization module 408 is configured to calculate derived variables of the public opinion factors of the disease from the new disease data, and to present the calculated derived variables visually as charts.
The storage module 409 is configured to store the crawled disease data by category.
The disease data may be stored in a local database, on a storage server, or in the cloud. For example, the disease data crawled from Baidu is stored at a first storage location in the local database, and the disease data crawled from Weibo is stored at a second storage location in the local database. The first and second storage locations may both be under the same root directory of the local database, or under different root directories. The two storage locations may also be distinguished by different names. Storing the data crawled from different data sources by category makes it easier to analyze the data of a single data source.
Preferably, to ensure that the crawled disease data is up to date, the disease data is updated periodically; the crawling module 404 is further configured to use the crawler program to crawl disease data related to the keyword from the data sources within a preset crawl period.
The preset crawl period is a crawl period set in advance, for example from midnight (24:00) to 3:00 each night. Relatively few people access the data source's servers at that time, so crawling does not put heavy access load on those servers, which helps them run smoothly and can improve crawling efficiency.
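A crawler could check whether the current time falls inside the preset crawl period with a small helper such as the one below. The function name and the handling of windows that wrap past midnight are assumptions of this sketch; the default window matches the midnight-to-3:00 example in the text.

```python
from datetime import time


def in_crawl_window(now: time, start: time = time(0, 0), end: time = time(3, 0)) -> bool:
    """Return True if `now` lies in the half-open window [start, end)."""
    if start <= end:
        return start <= now < end
    # The window wraps past midnight, e.g. 23:00 to 03:00.
    return now >= start or now < end


print(in_crawl_window(time(1, 30)))   # inside the default window
print(in_crawl_window(time(12, 0)))   # outside the default window
```

A scheduler would call this check before starting each crawl and skip the run when it returns False.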
Preferably, after the crawler program has crawled the disease data related to the keyword from the data sources within the preset crawl period and the disease data has been parsed to obtain public opinion factors of the disease, the public opinion data prediction apparatus 40 may further include a quantization module 410, configured to quantify each sub-factor of the disease to obtain its weight, and to determine the sub-factors whose weight exceeds a preset weight threshold as the public opinion factors of the disease.
The sub-factors are quantified and weighted as follows: the total count of all sub-factors of the disease is calculated, and the percentage of that total accounted for by each sub-factor is calculated; that percentage is the weight of the corresponding sub-factor.
The preset weight threshold is a weight threshold set in advance. A sub-factor whose weight exceeds the preset weight threshold is determined to be a public opinion factor of the disease. This effectively filters out sub-factors with small weights, which reduces the amount of computation and shortens the prediction time, while the sub-factors with small weights do not affect the prediction result.
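The weighting and filtering step described above can be sketched as follows. The sub-factor names, their counts, the threshold of 0.05, and the function name are hypothetical values chosen for illustration; weights are expressed as fractions of the total rather than percentages.

```python
from typing import Dict


def select_factors(counts: Dict[str, int], threshold: float) -> Dict[str, float]:
    """Weight each sub-factor by its share of the total and keep those above the threshold."""
    total = sum(counts.values())
    if total == 0:
        return {}
    weights = {name: count / total for name, count in counts.items()}
    return {name: w for name, w in weights.items() if w > threshold}


sub_factor_counts = {        # hypothetical counts per sub-factor
    "search index": 600,
    "weibo posts": 300,
    "forum posts": 90,
    "other": 10,
}
selected = select_factors(sub_factor_counts, threshold=0.05)
print(selected)
```

With these counts the sub-factor "other" has a weight of 0.01 and is filtered out, while the other three are kept.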
In summary, in the public opinion data prediction apparatus 40, the receiving module 401 receives at least one keyword of a disease input by a user; the classification module 402 determines the data sources on the Internet related to the keyword and classifies them according to type; the setting module 403 sets up multi-threaded crawler programs equal in number to the categories obtained by the classification; the crawling module 404 uses the multi-threaded crawler programs to crawl disease data related to the keyword from their corresponding data sources; the parsing module 405 parses the disease data to obtain public opinion factors of the disease; the cleaning module 406 performs data cleaning and outlier handling on the factors; the standardization module 407 standardizes the cleaned factors to obtain new disease data; and the visualization module 408 calculates derived variables of the factors from the new disease data and presents them visually as charts, so that the disease can be predicted.
Because a different crawler program is assigned to each category of data source and the multi-threaded crawlers crawl the disease data related to the input keywords from their corresponding data sources, crawling proceeds in parallel and is faster, the crawled disease data has a more uniform format, and crawling difficulties or unparseable crawled data caused by differing storage formats or other differences between data sources are avoided. The public opinion factors are organized, analyzed in depth, and used in calculation; after this fine-grained processing, the crawled disease data is rendered as graphs or tables, so the results are displayed more clearly and can be analyzed intuitively, which provides a reference basis for disease prediction and yields accurate predictions.
The integrated unit described above, when implemented in the form of software functional modules, may be stored in a non-volatile readable storage medium. The software functional modules are stored in a storage medium and include several instructions that cause a computer device (which may be a personal computer, a dual-screen device, a network device, or the like) or a processor to perform part of the methods described in the embodiments of the present application.
Embodiment 5
FIG. 5 is a schematic diagram of the terminal provided in Embodiment 5 of the present application.
The terminal 5 includes: a memory 51, at least one processor 52, computer-readable instructions 53 stored in the memory 51 and executable on the at least one processor 52, and at least one communication bus 54.
When executing the computer-readable instructions 53, the at least one processor 52 implements the steps of the public opinion data prediction method embodiments described above, or implements the functions of the modules/units of the apparatus embodiments described above.
By way of example, the computer-readable instructions 53 may be divided into one or more modules/units, which are stored in the memory 51 and executed by the at least one processor 52 to carry out the present application. The one or more modules/units may be a series of computer-readable instruction segments capable of performing specific functions; the instruction segments describe the execution of the computer-readable instructions 53 in the terminal 5.
The terminal 5 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. Those skilled in the art will understand that FIG. 5 is merely an example of the terminal 5 and does not limit it; the terminal 5 may include more or fewer components than shown, combine certain components, or use different components. For example, the terminal 5 may further include input/output devices, network access devices, buses, and the like.
The at least one processor 52 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The processor 52 may be a microprocessor or any conventional processor. The processor 52 is the control center of the terminal 5 and connects the parts of the entire terminal 5 through various interfaces and lines.
The memory 51 may be used to store the computer-readable instructions 53 and/or the modules/units. The processor 52 implements the various functions of the terminal 5 by running or executing the computer-readable instructions and/or modules/units stored in the memory 51 and by calling data stored in the memory 51. The memory 51 may mainly include a program storage area and a data storage area: the program storage area may store an operating system and the application programs required by at least one function (such as a sound playback function or an image playback function); the data storage area may store data created through use of the terminal 5 (such as audio data or a phone book). In addition, the memory 51 may include high-speed random access memory and may also include non-volatile memory, such as a hard disk, internal memory, a plug-in hard disk, a smart media card, a secure digital card, a flash card, at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.
If the modules/units integrated in the terminal 5 are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. On this understanding, all or part of the processes of the method embodiments described above may also be carried out by computer-readable instructions that direct the relevant hardware. The computer-readable instructions may be stored in a non-volatile readable storage medium and, when executed by a processor, implement the steps of the method embodiments described above. The computer-readable instructions comprise computer-readable instruction code, which may be in source code form, object code form, an executable file, or some intermediate form. The non-volatile readable medium may include: any entity or device capable of carrying the computer-readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like.
It should be noted that the content covered by the non-volatile readable medium may be expanded or restricted as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, a non-volatile readable medium does not include electrical carrier signals and telecommunication signals.
In the embodiments provided in the present application, it should be understood that the disclosed terminal and method may be implemented in other ways. For example, the terminal embodiment described above is merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation.
In addition, the functional units in the embodiments of the present application may be integrated in a single processing unit, each unit may exist physically on its own, or two or more units may be integrated in a single unit. The integrated unit may be implemented in hardware, or in hardware combined with software functional modules.
It will be apparent to those skilled in the art that the present application is not limited to the details of the exemplary embodiments described above, and that the present application can be implemented in other specific forms without departing from its spirit or essential characteristics. The embodiments should therefore be regarded in every respect as exemplary and non-limiting. The scope of the present application is defined by the appended claims rather than by the foregoing description, and all changes that fall within the meaning and range of equivalents of the claims are intended to be embraced by the present application. No reference sign in the claims should be construed as limiting the claim concerned. Furthermore, the word "comprising" does not exclude other units, and the singular does not exclude the plural. A plurality of units or apparatuses recited in the system claims may also be implemented by a single unit or apparatus through software or hardware. Words such as "first" and "second" are used as names and do not indicate any particular order.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present application and not to limit them. Although the present application has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of the present application may be modified or replaced with equivalents without departing from the spirit and scope of those technical solutions.

Claims (20)

  1. A public opinion data prediction method, characterized in that the method comprises:
    receiving at least one keyword of a disease input by a user;
    determining data sources on the Internet related to the keyword, and using a crawler program to crawl disease data related to the keyword from the data sources;
    parsing the disease data to obtain public opinion factors of the disease;
    performing data cleaning and outlier handling on the public opinion factors of the disease;
    standardizing the public opinion factors of the disease after data cleaning and outlier handling, to obtain new disease data; and
    calculating derived variables of the public opinion factors of the disease from the new disease data, and predicting the disease according to the derived variables.
  2. The method according to claim 1, characterized in that determining data sources on the Internet related to the keyword and using a crawler program to crawl disease data related to the keyword from the data sources comprises:
    determining the data sources on the Internet related to the keyword, and classifying the data sources according to their type;
    setting up multi-threaded crawler programs equal in number to the categories obtained by classifying the data sources; and
    using the multi-threaded crawler programs to crawl disease data related to the keyword from their corresponding data sources.
  3. The method according to claim 1, characterized in that the method further comprises:
    presenting the calculated derived variables visually as charts, the derived variables comprising: maximum, minimum, mean, variance, standard deviation, covariance, range, median, mode, and quartiles.
  4. The method according to claim 1, characterized in that the data standardization comprises one or a combination of the following:
    sum standardization, standard-deviation standardization, maximum-value standardization, or range standardization.
  5. 如权利要求1所述的方法,其特征在于,所述利用爬虫程序从所述数据源中爬取与所述关键词相关的疾病数据包括:The method of claim 1 wherein said crawling said disease data associated with said keyword from said data source using a crawler program comprises:
    利用爬虫程序在预设爬虫时间段内从所述数据源中爬取与所述关键词相关的疾病数据。The disease data associated with the keyword is crawled from the data source during a preset crawler period using a crawler program.
  6. The method of claim 1, wherein parsing the disease data to obtain public sentiment factors of the disease comprises:
    calculating the total count of all public sentiment sub-factors of the disease, calculating the percentage of that total accounted for by each sub-factor, the percentage being the weight of the corresponding sub-factor, and determining the sub-factors whose weight is greater than a preset weight threshold to be the public sentiment factors of the disease.
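The weighting rule in this claim is simple arithmetic: a sub-factor's weight is its count divided by the total count, and only sub-factors whose weight exceeds the threshold are kept. A minimal sketch, in which the function name, the dictionary input, and the example sub-factor names are hypothetical and weights are expressed as fractions rather than percentages:

```python
def select_factors(sub_factor_counts, weight_threshold):
    """Return {name: weight} for sub-factors whose share of the total exceeds the threshold."""
    total = sum(sub_factor_counts.values())
    if total == 0:
        return {}
    # Weight of a sub-factor = its count as a fraction of the total count.
    weights = {name: count / total for name, count in sub_factor_counts.items()}
    # Strictly greater than the threshold, as the claim states.
    return {name: w for name, w in weights.items() if w > weight_threshold}
```

For counts of 50, 30, 15, and 5 and a threshold of 0.2, the weights are 0.5, 0.3, 0.15, and 0.05, so only the first two sub-factors are kept.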
  7. The method of claim 1, wherein performing data cleaning and outlier processing on the public sentiment factors of the disease comprises:
    cleaning the public sentiment factors of the disease according to the type of the public sentiment factors;
    replacing missing values of the public sentiment factors of the disease according to the distribution of the public sentiment factors; or
    directly discarding the public sentiment factors of the disease that are abnormal.
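The claim leaves open how missing values are replaced "according to the distribution" and how an abnormal value is recognized. The sketch below fills in both with common choices that are assumptions of this example only: missing entries (`None`) are replaced by the median or the mean of the observed values, and a value counts as abnormal when it lies more than `k` population standard deviations from the mean. Function names are hypothetical.

```python
import statistics


def fill_missing(values, use_median=True):
    """Replace None entries with the median (or mean) of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to derive a replacement from")
    # The median is the usual choice for skewed data, the mean for symmetric data.
    fill = statistics.median(observed) if use_median else statistics.mean(observed)
    return [fill if v is None else v for v in values]


def drop_outliers(values, k=3.0):
    """Discard values more than k population standard deviations from the mean."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        return list(values)
    return [v for v in values if abs(v - mu) <= k * sigma]
```

The two functions correspond to the two alternatives in the claim: repair the series by replacement, or discard the abnormal entries outright.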
  8. A public sentiment data prediction apparatus, comprising:
    a receiving module, configured to receive at least one keyword of a disease input by a user;
    a crawling module, configured to determine a data source on the Internet related to the keyword, and to crawl disease data related to the keyword from the data source using a crawler program;
    a parsing module, configured to parse the disease data to obtain public sentiment factors of the disease;
    a cleaning module, configured to perform data cleaning and outlier processing on the public sentiment factors of the disease;
    a standardization module, configured to standardize the public sentiment factors of the disease after data cleaning and outlier processing, to obtain new disease data; and
    a prediction module, configured to calculate derived variables of the public sentiment factors of the disease from the new disease data, and to predict the disease according to the derived variables.
  9. A terminal, comprising a processor and a memory, wherein the processor, when executing computer-readable instructions stored in the memory, implements the following steps:
    receiving at least one keyword of a disease input by a user;
    determining a data source on the Internet related to the keyword, and crawling disease data related to the keyword from the data source using a crawler program;
    parsing the disease data to obtain public sentiment factors of the disease;
    performing data cleaning and outlier processing on the public sentiment factors of the disease;
    standardizing the public sentiment factors of the disease after data cleaning and outlier processing, to obtain new disease data; and
    calculating derived variables of the public sentiment factors of the disease from the new disease data, and predicting the disease according to the derived variables.
  10. The terminal of claim 9, wherein determining a data source on the Internet related to the keyword, and crawling disease data related to the keyword from the data source using a crawler program, comprises:
    determining data sources on the Internet related to the keyword, and classifying the data sources according to their type;
    setting up multi-threaded crawler programs equal in number to the number of categories obtained by classifying the data sources;
    crawling, with the multi-threaded crawler programs, the disease data related to the keyword from the respectively corresponding data sources.
  11. The terminal of claim 9, wherein the processor, when executing the computer-readable instructions, further implements the following step:
    plotting the calculated derived variables as charts for visual display, the derived variables including: maximum, minimum, mean, variance, standard deviation, covariance, range, median, mode, and quartiles.
  12. The terminal of claim 9, wherein crawling disease data related to the keyword from the data source using a crawler program comprises:
    crawling, using the crawler program, the disease data related to the keyword from the data source within a preset crawl time period.
  13. The terminal of claim 9, wherein parsing the disease data to obtain public sentiment factors of the disease comprises:
    calculating the total count of all public sentiment sub-factors of the disease, calculating the percentage of that total accounted for by each sub-factor, the percentage being the weight of the corresponding sub-factor, and determining the sub-factors whose weight is greater than a preset weight threshold to be the public sentiment factors of the disease.
  14. The terminal of claim 9, wherein performing data cleaning and outlier processing on the public sentiment factors of the disease comprises:
    cleaning the public sentiment factors of the disease according to the type of the public sentiment factors;
    replacing missing values of the public sentiment factors of the disease according to the distribution of the public sentiment factors; or
    directly discarding the public sentiment factors of the disease that are abnormal.
  15. A non-volatile readable storage medium storing computer-readable instructions, wherein the computer-readable instructions, when executed by a processor, implement the following steps:
    receiving at least one keyword of a disease input by a user;
    determining a data source on the Internet related to the keyword, and crawling disease data related to the keyword from the data source using a crawler program;
    parsing the disease data to obtain public sentiment factors of the disease;
    performing data cleaning and outlier processing on the public sentiment factors of the disease;
    standardizing the public sentiment factors of the disease after data cleaning and outlier processing, to obtain new disease data; and
    calculating derived variables of the public sentiment factors of the disease from the new disease data, and predicting the disease according to the derived variables.
  16. The storage medium of claim 15, wherein determining a data source on the Internet related to the keyword, and crawling disease data related to the keyword from the data source using a crawler program, comprises:
    determining data sources on the Internet related to the keyword, and classifying the data sources according to their type;
    setting up multi-threaded crawler programs equal in number to the number of categories obtained by classifying the data sources;
    crawling, with the multi-threaded crawler programs, the disease data related to the keyword from the respectively corresponding data sources.
  17. The storage medium of claim 15, wherein the computer-readable instructions, when executed by the processor, further implement the following step:
    plotting the calculated derived variables as charts for visual display, the derived variables including: maximum, minimum, mean, variance, standard deviation, covariance, range, median, mode, and quartiles.
  18. The storage medium of claim 15, wherein crawling disease data related to the keyword from the data source using a crawler program comprises:
    crawling, using the crawler program, the disease data related to the keyword from the data source within a preset crawl time period.
  19. The storage medium of claim 15, wherein parsing the disease data to obtain public sentiment factors of the disease comprises:
    calculating the total count of all public sentiment sub-factors of the disease, calculating the percentage of that total accounted for by each sub-factor, the percentage being the weight of the corresponding sub-factor, and determining the sub-factors whose weight is greater than a preset weight threshold to be the public sentiment factors of the disease.
  20. The storage medium of claim 15, wherein performing data cleaning and outlier processing on the public sentiment factors of the disease comprises:
    cleaning the public sentiment factors of the disease according to the type of the public sentiment factors;
    replacing missing values of the public sentiment factors of the disease according to the distribution of the public sentiment factors; or
    directly discarding the public sentiment factors of the disease that are abnormal.
PCT/CN2018/100229 2018-04-18 2018-08-13 Method for forecasting public sentiment data, device, terminal, and storage medium WO2019200786A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810351128.0 2018-04-18
CN201810351128.0A CN108647249B (en) 2018-04-18 2018-04-18 Public opinion data prediction method, device, terminal and storage medium

Publications (1)

Publication Number Publication Date
WO2019200786A1 true WO2019200786A1 (en) 2019-10-24

Family

ID=63746630

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/100229 WO2019200786A1 (en) 2018-04-18 2018-08-13 Method for forecasting public sentiment data, device, terminal, and storage medium

Country Status (2)

Country Link
CN (1) CN108647249B (en)
WO (1) WO2019200786A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112749341A (en) * 2021-01-22 2021-05-04 南京莱斯网信技术研究院有限公司 Key public opinion recommendation method, readable storage medium and data processing device
CN113590914A (en) * 2021-06-23 2021-11-02 北京百度网讯科技有限公司 Information processing method, device, electronic equipment and storage medium
CN116629913A (en) * 2023-07-24 2023-08-22 山东青上化工有限公司 Data extraction system and processing method for compound fertilizer production process

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110299208A (en) * 2019-05-22 2019-10-01 平安科技(深圳)有限公司 Disease surveillance data exception detection method, system, equipment and storage medium
CN110321342A (en) * 2019-05-27 2019-10-11 平安科技(深圳)有限公司 Business valuation studies method, apparatus and storage medium based on intelligent characteristic selection
CN110675959B (en) * 2019-08-19 2023-07-07 平安科技(深圳)有限公司 Intelligent data analysis method and device, computer equipment and storage medium
CN110569298B (en) * 2019-09-12 2023-03-24 成都中科大旗软件股份有限公司 Data docking and visualization method and system
CN111968753A (en) * 2020-08-06 2020-11-20 平安科技(深圳)有限公司 Epidemic situation monitoring method and device, computer equipment and storage medium
CN111986763B (en) * 2020-09-03 2024-05-14 深圳平安智慧医健科技有限公司 Disease data analysis method, device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107239892A (en) * 2017-05-26 2017-10-10 山东省科学院情报研究所 Region talent's equilibrium of supply and demand quantitative analysis method based on big data
US20170316080A1 (en) * 2016-04-29 2017-11-02 Quest Software Inc. Automatically generated employee profiles
CN107330613A (en) * 2017-06-29 2017-11-07 平安万家医疗投资管理有限责任公司 A kind of public sentiment monitoring method, equipment and computer-readable recording medium

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2335801A1 (en) * 1998-04-29 2002-05-14 Justin Winfield A system and method for text mining
US20120296974A1 (en) * 1999-04-27 2012-11-22 Joseph Akwo Tabe Social network for media topics of information relating to the science of positivism
US7685091B2 (en) * 2006-02-14 2010-03-23 Accenture Global Services Gmbh System and method for online information analysis
CN102043893A (en) * 2009-10-13 2011-05-04 北京大学 Disease pre-warning method and system
GB201103673D0 (en) * 2011-03-03 2011-04-20 Zillian S A Method of generating statistical opinion data
CN103577557B (en) * 2013-10-21 2017-04-05 北京奇虎科技有限公司 A kind of apparatus and method of the crawl frequency for determining network resource point
CN105653527A (en) * 2014-11-11 2016-06-08 江苏威盾网络科技有限公司 Public sentiment treatment and information deploying method based on web crawler technology
CN105740228B (en) * 2016-01-25 2019-06-04 云南大学 A kind of internet public feelings analysis method and system
CN106096056B (en) * 2016-06-30 2019-11-26 西南石油大学 One kind being based on distributed public sentiment data real-time collecting method and system
CN106599553B (en) * 2016-11-29 2019-08-16 中国科学院深圳先进技术研究院 Disease Warning Mechanism device
CN106649270A (en) * 2016-12-19 2017-05-10 四川长虹电器股份有限公司 Public opinion monitoring and analyzing method
CN106951698A (en) * 2017-03-13 2017-07-14 成都育芽科技有限公司 A kind of disease risks forecasting system based on network big data platform
CN107220297B (en) * 2017-05-02 2020-11-20 北京大学 Multi-source heterogeneous data automatic collection method and system for software project

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112749341A (en) * 2021-01-22 2021-05-04 南京莱斯网信技术研究院有限公司 Key public opinion recommendation method, readable storage medium and data processing device
CN112749341B (en) * 2021-01-22 2024-03-29 南京莱斯网信技术研究院有限公司 Important public opinion recommendation method, readable storage medium and data processing device
CN113590914A (en) * 2021-06-23 2021-11-02 北京百度网讯科技有限公司 Information processing method, device, electronic equipment and storage medium
CN113590914B (en) * 2021-06-23 2024-02-20 北京百度网讯科技有限公司 Information processing method, apparatus, electronic device and storage medium
CN116629913A (en) * 2023-07-24 2023-08-22 山东青上化工有限公司 Data extraction system and processing method for compound fertilizer production process
CN116629913B (en) * 2023-07-24 2023-10-03 山东青上化工有限公司 Data extraction system and processing method for compound fertilizer production process

Also Published As

Publication number Publication date
CN108647249A (en) 2018-10-12
CN108647249B (en) 2022-08-02

Similar Documents

Publication Publication Date Title
WO2019200786A1 (en) Method for forecasting public sentiment data, device, terminal, and storage medium
Marcoulides et al. Evaluation of variance inflation factors in regression models using latent variable modeling methods
US11049165B2 (en) System for clustering and aggregating data from multiple sources
US10878004B2 (en) Keyword extraction method, apparatus and server
Hridoy et al. Localized twitter opinion mining using sentiment analysis
JP2020527788A (en) Disease prediction methods and devices, computer devices and readable storage media
Duckworth et al. Using explainable machine learning to characterise data drift and detect emergent health risks for emergency department admissions during COVID-19
WO2019196280A1 (en) Disease prediction method and device, computer device and readable storage medium
WO2021175009A1 (en) Early warning event graph construction method and apparatus, device, and storage medium
EP3586251A1 (en) Method for determining news veracity
WO2022217713A1 (en) Syndrome monitoring and early warning method and apparatus, computer device, and storage medium
US20130246463A1 (en) Prediction and isolation of patterns across datasets
JP5180743B2 (en) Brand analysis method and apparatus
WO2014187076A1 (en) Natural language generating method and system
Tse et al. Social network based crowd sensing for intelligent transportation and climate applications
Gumpili et al. Sample size and its evolution in research
CN111259220B (en) Data acquisition method and system based on big data
CN113094477B (en) Data structuring method and device, computer equipment and storage medium
EP4295259A1 (en) Reputation management and machine learning systems and processes
Anwar Hridoy et al. Localized twitter opinion mining using sentiment analysis
CN114297478A (en) Page recommendation method, device, equipment and storage medium
Shepherd et al. Online health information seeking for Mpox in endemic and nonendemic Countries: Google trends study
CN113344723A (en) User insurance cognitive evolution path prediction method and device and computer equipment
CN110413842B (en) Content auditing method, system, electronic equipment and medium based on public opinion situation perception
Ehwerhemuepha et al. Development and validation of an early warning tool for sepsis and decompensation in children during emergency department triage

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application. Ref document number: 18915728; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established. Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 21.01.2021)
122 Ep: pct application non-entry in european phase. Ref document number: 18915728; Country of ref document: EP; Kind code of ref document: A1