WO2019200786A1

WO2019200786A1 - Method for forecasting public sentiment data, device, terminal, and storage medium

Info

Publication number: WO2019200786A1
Application number: PCT/CN2018/100229
Authority: WO
Inventors: 阮晓雯; 徐亮; 肖京
Original assignee: 平安科技（深圳）有限公司
Priority date: 2018-04-18
Filing date: 2018-08-13
Publication date: 2019-10-24
Also published as: CN108647249A; CN108647249B

Abstract

A method for forecasting public sentiment data, the method comprising: receiving at least one keyword of a disease input by a user; determining a data source related to the keyword on the Internet, and crawling the data source by means of a crawler program to obtain disease data related to the keyword; parsing the disease data to obtain a public sentiment factor with regards to the disease; performing data cleaning and outlier processing on the public sentiment factor of the disease; performing data normalization on the public sentiment factor of the disease that has undergone data cleaning and outlier processing, and obtaining new disease data; and calculating, according to the new disease data, a derivative variable of the public sentiment factor of the disease, and obtaining a disease forecast according to the derivative variable. The present application further provides a device for forecasting public sentiment data, a terminal, and a storage medium. The present application crawls comprehensive disease data, performs data organization, in-depth analysis and computation on the disease data, and achieves the goal of displaying data from basic data to decision-making data, thereby providing a useful reference in disease forecasting.

Description

Public opinion data prediction method, device, terminal and storage medium

This application claims priority to Chinese Patent Application No. 201810351128.0, entitled "Public Opinion Data Prediction Method, Apparatus, Terminal, and Storage Medium", filed on April 18, 2018, the entire contents of which are incorporated herein by reference.

Technical field

The present application relates to the field of data prediction technologies, and in particular, to a method, device, terminal, and storage medium for predicting public opinion data.

Background technique

With the rapid development of the Internet, computer technology has made life easier in every field, and the medical field is no exception. A large amount of professional disease data and patient medical records is available online, but this data is unsystematic and incomplete. When an epidemic breaks out rapidly, website information is often not updated in time, so information entry lags behind. As a result, people cannot keep up with the latest information or take preventive measures in time.

At present, web crawling technology is used to crawl public opinion data about diseases, but the crawling methods adopted are relatively simplistic. Moreover, the crawled data is not checked effectively or in a timely manner. In addition, the same data cleaning and filling method is applied to data with different distributions, so the data processing results are poor.

Summary of the invention

In view of the above, it is necessary to propose a method, device, terminal and storage medium for predicting public opinion data, which can crawl disease data in different data sources and adopt different data inspection, cleaning and outlier processing methods.

A first aspect of the present application provides a method for predicting public opinion data, the method comprising:

Receiving at least one keyword of a disease input by a user;

Determining a data source related to the keyword on the Internet, and crawling the disease data related to the keyword from the data source by using a crawler program;

Parsing the disease data to obtain a public opinion factor of the disease;

Performing data cleaning and outlier processing on the public opinion factor of the disease;

Performing data standardization on the public opinion factor of the disease after data cleaning and outlier processing to obtain new disease data;

Calculating a derived variable of the public opinion factor of the disease according to the new disease data, and predicting the disease according to the derived variable.

A second aspect of the present application provides a public opinion data prediction apparatus, the apparatus comprising:

a receiving module, configured to receive at least one keyword of a disease input by the user;

a crawling module, configured to determine a data source related to the keyword on the Internet, and crawl the disease data related to the keyword from the data source by using a crawler program;

a parsing module, configured to parse the disease data to obtain a public opinion factor of the disease;

a cleaning module, configured to perform data cleaning and outlier processing on the public opinion factor of the disease;

a standardizing module, configured to perform data standardization on the public opinion factor of the disease after data cleaning and outlier processing to obtain new disease data;

and a prediction module, configured to calculate a derived variable of the public opinion factor of the disease according to the new disease data, and predict the disease according to the derived variable.

A third aspect of the present application provides a terminal, the terminal including a processor and a memory, the processor implementing the method for predicting public opinion data when executing computer readable instructions stored in the memory.

A fourth aspect of the present application provides a non-volatile readable storage medium having stored thereon computer readable instructions that, when executed by a processor, implement the method for predicting public opinion data.

In the method, device, terminal and storage medium for predicting public opinion data according to the present application, different crawler programs are set for different types of data sources, and a multi-threaded crawler program crawls the disease data related to the input keywords from the corresponding data sources. Parallel crawling improves crawling efficiency, and the crawled disease data has a relatively uniform format, which avoids crawling difficulties, or failures to parse the crawled data, caused by the differing storage formats or other characteristics of different data sources. The public opinion factors of the disease are then organized, analyzed in depth and computed; after the crawled disease data has been refined in this way, it is presented as charts or tables, so that the results are clearer and problems can be analyzed intuitively. In addition, a number of variables are derived from the public opinion factors of the disease, which adds data indicators and provides a reference for disease prediction, so that prediction does not rely blindly on experience and the prediction results are more accurate.

DRAWINGS

In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description are only embodiments of the present application, and those skilled in the art can obtain other drawings from the provided drawings without any creative work.

FIG. 1 is a flowchart of a method for predicting public opinion data provided in Embodiment 1 of the present application.

FIG. 2 is a flowchart of a method for predicting public opinion data provided in Embodiment 2 of the present application.

FIG. 3 is a structural diagram of a public opinion data prediction apparatus according to Embodiment 3 of the present application.

FIG. 4 is a structural diagram of a public opinion data prediction apparatus provided in Embodiment 4 of the present application.

FIG. 5 is a structural diagram of a terminal provided in Embodiment 5 of the present application.

The present application will be further described in conjunction with the above drawings in the following detailed description.

Detailed description

The above described objects, features, and advantages of the present invention will be more clearly understood from the following detailed description. It should be noted that the embodiments in the present application and the features in the embodiments may be combined with each other without conflict.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.

All technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention applies, unless otherwise defined. The terminology used herein is for the purpose of describing particular embodiments, and is not intended to be limiting.

The public opinion data prediction method of the embodiment of the present application is applied to one or more terminals. The public opinion data prediction method can also be applied to a hardware environment composed of a terminal and a server connected to the terminal through a network. Networks include, but are not limited to, wide area networks, metropolitan area networks, or local area networks. The public opinion data prediction method in the embodiment of the present application may be performed by a server, by a terminal, or by a server and a terminal together.

For a terminal that needs to perform the public opinion data prediction method, the public opinion data prediction function provided by the method of the present application may be integrated directly on the terminal, or a client implementing the method of the present application may be installed on it. Alternatively, the method provided by the present application may run on a server or similar device in the form of a software development kit (SDK), which provides an interface to the public opinion data prediction function; the terminal or other devices can then predict public opinion data through the provided interface.

Embodiment 1

FIG. 1 is a flowchart of a method for predicting public opinion data provided in Embodiment 1 of the present application. The order of execution in the flowchart can be changed according to different requirements, and some steps can be omitted.

Step 11. Receive at least one keyword of a disease input by the user.

The keyword is a word related to the symptoms of the disease. For example, when the disease is a cold, the keywords may include: sneezing, runny nose, stuffy nose, headache, dizziness, cough, weakness, sore throat, and the like. As another example, when the disease is hand, foot and mouth disease, the keywords may include: mouth pain, anorexia, low-grade fever, herpes on the hands, small mouth ulcers, and the like.

To facilitate subsequent crawling of more disease-related data, users can enter multiple keywords for the disease. The keyword may be a symptom of a disease obtained by the user according to his or her own experience, or may be a symptom of a disease collected from a disease expert.

In this embodiment, the terminal presets a function for the user to input a keyword of the disease. For example, the terminal provides a text input box through which the user can input at least one keyword. Alternatively, the terminal provides a function of a voice assistant, and the user can input at least one keyword through the voice assistant.

Step 12: Determine a data source related to the keyword in the Internet, and use a crawler program to crawl disease data related to the keyword from the data source.

The data sources related to the keywords on the Internet may include, but are not limited to, Baidu, Google, Tencent, Weibo, hot search lists, and any website that supports user search access. The disease data crawled from these data sources by the crawler program may include: Baidu Index, Google Trends, Tencent Analysis, news, advertisement data, channel data, Weibo popularity, forum public opinion information, and the like.

In this embodiment, the user determines a Uniform Resource Locator (URL) of the data source in the Internet, and the crawler crawls the disease data related to the keyword according to the URL.

Step 13. Parse the disease data to obtain a public opinion factor of the disease.

The specific parsing work consists of public opinion analysis of the disease data, including text processing, text analysis, word frequency statistics, correlation analysis and the like, to obtain the public opinion factors of the disease.
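As an illustration of the word frequency statistics mentioned above, the following is a minimal sketch and not the applicant's implementation. The function name `count_symptom_mentions` and the sample posts are hypothetical, and the matching is a plain case-insensitive substring test that does not handle negation ("no headache" still counts as a mention).

```python
from collections import Counter


def count_symptom_mentions(posts, symptoms):
    """Count how many posts mention each symptom keyword (case-insensitive)."""
    counts = Counter({symptom: 0 for symptom in symptoms})
    for post in posts:
        text = post.lower()
        for symptom in symptoms:
            if symptom.lower() in text:
                counts[symptom] += 1
    return counts


# Hypothetical crawled posts, used only to exercise the function.
posts = [
    "Bad headache and a runny nose since Monday",
    "Fever and cough, no headache",
    "Runny nose again today",
]
factor_counts = count_symptom_mentions(
    posts, ["headache", "runny nose", "fever", "cough"]
)
```

Each count can then serve as the raw quantity of one sub-factor in the later cleaning and standardization steps.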

In this embodiment, the public opinion factor of the disease may include a plurality of sub-factors, for example, a first sub-factor, a second sub-factor, a third sub-factor, a fourth sub-factor, and the like.

For example, the first sub-factor may be a headache, the second sub-factor may be a runny nose, the third sub-factor may be a fever, and the fourth sub-factor may be a cough.

Step 14. Perform data cleaning and outlier processing on the public opinion factor of the disease.

Data cleaning and outlier processing are performed on the public opinion factors of the disease in order to eliminate redundant data and to obtain disease data in a consistent standard format, so that the cleaned and processed public opinion factors are usable and better suited to the subsequent analysis.

In this embodiment, the data cleaning of the public opinion factor of the disease comprises: performing data cleaning on the public opinion factor of the disease according to its type.

The types of public opinion factors of the disease include, but are not limited to: factors containing noise, non-conforming factors, factors containing duplicate information, factors with imbalanced data, inconsistent factors, incomplete factors, and the like.

For public opinion factors containing noise, data cleaning is performed by removing extremely large values and negative points; for non-conforming factors, by removing abnormal values; for factors containing duplicate information, by deleting duplicates; for factors with imbalanced data, by data denoising; for inconsistent factors, by classifying the data by data type; and for incomplete factors, by establishing reference values from the relevant standards.
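The type-dependent cleaning described above can be sketched as a small dispatch function. This is an illustrative sketch covering only two of the listed types; the type labels `"noisy"` and `"duplicated"`, the function names and the `upper_limit` parameter are assumptions introduced here, since the text does not say how "extremely large" is determined.

```python
def clean_noisy(values, upper_limit):
    """Drop negative points and values above a caller-supplied upper limit."""
    return [v for v in values if 0 <= v <= upper_limit]


def drop_duplicates(records):
    """Remove repeated records while keeping first-seen order."""
    seen = set()
    unique = []
    for record in records:
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique


def clean(values, factor_type, upper_limit=1000):
    """Choose a cleaning method according to the (hypothetical) type label."""
    if factor_type == "noisy":
        return clean_noisy(values, upper_limit)
    if factor_type == "duplicated":
        return drop_duplicates(values)
    raise ValueError(f"no cleaner registered for type {factor_type!r}")


noisy_cleaned = clean([50, -3, 52, 99999, 49], "noisy")
deduplicated = clean(
    [("2018-04-01", 50), ("2018-04-01", 50), ("2018-04-02", 53)], "duplicated"
)
```

The remaining types would be handled by further branches in the same function.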

In this embodiment, the outlier processing of the public opinion factor of the disease comprises: performing missing value replacement on the public opinion factor of the disease according to its distribution.

In this embodiment, the distribution of the public opinion factors of the disease includes, but is not limited to, a stable type and a severe type. A stably distributed public opinion factor is one whose trend is relatively steady, for example, 50, 53, 52, 49, 51 and the like. A severely distributed public opinion factor is one whose trend changes sharply and over a large range, for example, 50, 100, 43, 89, 4, and the like.
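The text does not state how a series is assigned to the stable or severe type. One possible criterion, shown here purely as an assumption, is the coefficient of variation (population standard deviation divided by the mean) compared against a hypothetical threshold of 0.2.

```python
import statistics


def classify_distribution(values, cv_threshold=0.2):
    """Label a series 'stable' or 'severe' by its coefficient of variation.

    The criterion and the threshold are assumptions, not taken from the text.
    """
    mean = statistics.mean(values)
    if mean == 0:
        # Avoid dividing by zero: an all-zero series does not fluctuate.
        return "stable" if all(v == 0 for v in values) else "severe"
    cv = statistics.pstdev(values) / abs(mean)
    return "stable" if cv <= cv_threshold else "severe"


stable_label = classify_distribution([50, 53, 52, 49, 51])
severe_label = classify_distribution([50, 100, 43, 89, 4])
```

With the two example series from the text, the first is labeled stable and the second severe under this threshold.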

For a stably distributed public opinion factor, the K-nearest neighbor method can be used: the K samples nearest to the sample with the missing public opinion factor are determined according to Euclidean distance or correlation analysis, and the weighted average of the public opinion factor values of those K samples is used to estimate the missing data of that sample. For a stably distributed public opinion factor, a predictive model can also be used to predict each missing public opinion factor: if the missing factor is numeric, the mean can be used to fill it; if the missing factor is non-numeric, the mode can be used to fill it.
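A minimal sketch of the K-nearest neighbor estimate follows, assuming each sample is a list of numeric sub-factor values and exactly one position of the target sample is missing. The function name `knn_impute`, the inverse-distance weighting and the sample values are assumptions; the text only requires a weighted average of the K nearest samples.

```python
import math


def knn_impute(target, complete_samples, missing_index, k=3):
    """Estimate target[missing_index] from the k nearest complete samples."""

    def distance(a, b):
        # Euclidean distance over every position except the missing one.
        return math.sqrt(
            sum((a[i] - b[i]) ** 2 for i in range(len(a)) if i != missing_index)
        )

    nearest = sorted(complete_samples, key=lambda s: distance(target, s))[:k]
    # Inverse-distance weights; the small epsilon avoids division by zero.
    weights = [1.0 / (distance(target, s) + 1e-9) for s in nearest]
    weighted_sum = sum(w * s[missing_index] for w, s in zip(weights, nearest))
    return weighted_sum / sum(weights)


# Hypothetical samples: each row holds three sub-factor values.
complete = [[50, 20, 30], [53, 21, 31], [52, 19, 29], [90, 60, 70]]
estimate = knn_impute([51, 20, None], complete, missing_index=2, k=3)
```

The distant fourth sample is excluded when k is 3, so the estimate stays within the range of the three nearby samples.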

For a severely distributed public opinion factor, the mean method can be used to replace the missing public opinion factor.

Preferably, because replacing missing public opinion factors by the mean method rests on the assumption that data is missing completely at random, it reduces the variance and standard deviation of the public opinion factor of the disease. The method may therefore further include: combining the public opinion factor obtained by mean substitution with a preset expansion coefficient, and taking the resulting new public opinion factor as the final public opinion factor of the disease.

The expansion coefficient is set in advance and is greater than 1.
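The following sketch shows mean substitution followed by the expansion step. It assumes that applying the expansion coefficient means multiplying, and that only the substituted values are multiplied; the text does not spell out either point. Missing values are represented as `None`, and the function name is hypothetical.

```python
import statistics


def mean_fill_with_expansion(values, expansion=1.5):
    """Replace None entries by the observed mean times an expansion coefficient.

    Multiplying only the imputed entries is an assumption made for this sketch.
    """
    if expansion <= 1:
        raise ValueError("the expansion coefficient must be greater than 1")
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [v if v is not None else mean * expansion for v in values]


filled = mean_fill_with_expansion([50, 100, None, 90], expansion=1.5)
```

Here the observed mean is 80, so the missing entry becomes 120 rather than 80.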

In other embodiments, the outlier processing of the public opinion factor of the disease further comprises: directly discarding abnormal public opinion factors. Discarding abnormal public opinion factors directly ensures that the crawled public opinion factors of the disease are clean and avoids interference when they are analyzed.

Step 15. Perform data standardization on the public opinion factor of the disease after data cleaning and outlier processing to obtain new disease data.

Data standardization converts the cleaned and processed public opinion factors of the disease into dimensionless pure values, so that indicators with different units or magnitudes can be compared and weighted.

In this embodiment, the method for data standardization includes, but is not limited to, sum standardization, standard deviation standardization, maximum value standardization, range standardization, and the like. Range standardization is preferred: after range standardization, the maximum value of the new data is 1, the minimum value is 0, and the remaining values lie between 0 and 1.
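Range standardization as described maps each value v to (v - min) / (max - min). A minimal sketch follows; the function name and the choice to return zeros for a constant series are assumptions.

```python
def range_standardize(values):
    """Scale values to [0, 1] using (v - min) / (max - min)."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A constant series has no range; returning zeros is a convention
        # chosen for this sketch.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


standardized = range_standardize([50, 100, 43, 89, 4])
```

For this series the value 100 maps to 1, the value 4 maps to 0, and the others fall in between.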

Step 16. Calculate a derived variable of the public opinion factor of the disease according to the new disease data, and predict the disease according to the derived variable.

In this embodiment, the derived variables include: maximum, minimum, mean, variance, standard deviation, covariance, range (maximum minus minimum), median, mode, and quartiles. The mean, median, mode and quartiles describe the degree of concentration of the public opinion factors of the disease; the greater the concentration, the more severe the disease is predicted to be. The variance and standard deviation characterize the degree of dispersion of the public opinion factors of the disease; the smaller the dispersion, the more severe the disease is predicted to be.
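The derived variables listed above can be computed with Python's standard `statistics` module, as in the following sketch. Population (rather than sample) variance and covariance are used here as an assumption, since the text does not specify which; `statistics.quantiles` requires Python 3.8 or later.

```python
import statistics


def covariance(xs, ys):
    """Population covariance of two equal-length series."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)


def derived_variables(values):
    """Compute the single-series derived variables named in the text."""
    return {
        "max": max(values),
        "min": min(values),
        "mean": statistics.mean(values),
        "variance": statistics.pvariance(values),
        "std": statistics.pstdev(values),
        "range": max(values) - min(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "quartiles": tuple(statistics.quantiles(values, n=4)),
    }


summary = derived_variables([0.0, 0.25, 0.25, 0.5, 1.0])
cov = covariance([1, 2, 3], [2, 4, 6])
```

Covariance is kept as a separate function because it relates two sub-factor series rather than describing one.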

In the public opinion data prediction method, at least one keyword of a disease input by a user is received; a data source related to the keyword is determined on the Internet, and disease data is crawled from the data source by a crawler program; the disease data is parsed to obtain the public opinion factors of the disease; data cleaning and outlier processing are performed on those factors, which are then standardized to obtain new disease data; and derived variables of the public opinion factors are calculated from the new disease data, so that the disease can be predicted according to the derived variables. From the user's rough input of disease-related keywords, the crawler program crawls disease data related to those keywords, yielding a more comprehensive set of public opinion factors for the disease. Organizing, analyzing in depth and computing on these factors refines the crawled disease data, moving from the display of basic data to the display of decision-making data, and provides a reference for disease prediction, so that the prediction result is accurate.

Embodiment 2

FIG. 2 is a flowchart of a method for predicting public opinion data provided in Embodiment 2 of the present application. The order of execution in the flowchart can be changed according to different requirements, and some steps can be omitted.

Step 21: Receive at least one keyword of a disease input by a user.

Step 21 in this embodiment is the same as step 11 in the first embodiment, and details are not described herein again.

Step 22: Determine a data source related to the keyword in the Internet, and classify the data source according to the type of the data source.

In this embodiment, the data sources related to the keyword may be classified into two categories according to their type: the first category is index data sources, and the second category is public opinion data sources. The index data sources include, but are not limited to, Baidu, Google, 360, and the like. The public opinion data sources include, but are not limited to, Weibo, forums, WeChat, hot search lists, and the like.

Step 23: Set a multi-threaded crawler program whose number of threads equals the number of categories obtained by classifying the data sources.

Setting a different crawler program for each type of data source makes it easier to crawl the data of that category smoothly, and avoids crawling difficulties, or failures to parse the crawled data, caused by differing storage formats or other problems of the data sources.

In this embodiment, if the data sources are divided into two categories, a corresponding dual-thread crawler is set. For example, Baidu and Weibo are two different types of data sources, each with its own text storage format; the first crawler program is set to crawl the disease data related to the keyword from Baidu, and the second crawler program is set to crawl the disease data related to the keyword from Weibo.

In other embodiments, the data sources related to the keyword on the Internet may be subdivided into a plurality of categories according to actual needs, and a corresponding crawler program is set for each category of data source.

Step 24: Crawl the disease data related to the keyword from the corresponding data sources by using the multi-threaded crawler program.

In this embodiment, the URL of the data source corresponding to each crawler program is placed in a crawl queue, and the multi-threaded crawler program crawls the disease data related to the keyword from the data sources in parallel.
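The parallel crawl can be sketched with one worker thread per data source category. This sketch uses `concurrent.futures.ThreadPoolExecutor` in place of an explicit crawl queue, and the two crawler functions are stubs that perform no network access; their names and the `example.com` URLs are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_index_source(url, keyword):
    """Stub crawler for index data sources; a real one would fetch the page."""
    return f"index:{url}?q={keyword}"


def crawl_opinion_source(url, keyword):
    """Stub crawler for public opinion data sources."""
    return f"opinion:{url}?q={keyword}"


def crawl_in_parallel(crawlers, keyword):
    """Run one thread per category; each thread works through its own URLs.

    `crawlers` maps a category name to a (crawl_function, url_list) pair.
    """

    def run(item):
        category, (crawl, urls) = item
        return category, [crawl(url, keyword) for url in urls]

    with ThreadPoolExecutor(max_workers=len(crawlers)) as pool:
        return dict(pool.map(run, crawlers.items()))


crawlers = {
    "index": (crawl_index_source, ["https://index.example.com/search"]),
    "public_opinion": (crawl_opinion_source, ["https://forum.example.com/search"]),
}
results = crawl_in_parallel(crawlers, "cough")
```

Because the results are keyed by category, the data crawled from each type of source stays separate for later parsing.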

Step 25: Parse the disease data to obtain a public opinion factor of the disease.

Step 26: Perform data cleaning and outlier processing on the public opinion factor of the disease.

Step 27: Perform data standardization on the public opinion factor of the disease after data cleaning and outlier processing to obtain new disease data.

Steps 25 to 27 in this embodiment correspond to steps 13 to 15 in the first embodiment, respectively, and details are not described herein again.

Step 28: Calculate a derived variable of the public opinion factor of the disease according to the new disease data, and create a chart from the calculated derived variable for visual display.

Preferably, step 24 may further include: classifying and storing the crawled disease data.

The disease data may be stored in a local database, in a storage server, or in the cloud. For example, the disease data crawled from Baidu is stored in a first storage location in the local database, and the disease data crawled from Weibo is stored in a second storage location in the local database. The first storage location and the second storage location may be located under the same root directory in the local database, or under different root directories. The first storage location and the second storage location may also be distinguished by different names. Classifying and storing the data crawled from different data sources makes it convenient to analyze the data of a single data source.

Preferably, in order to ensure that the crawled disease data is up to date, the disease data needs to be updated periodically, and the method may further include: crawling the disease data related to the keyword from the data source by using a crawler program during a preset crawl time period.

The crawl time period is set in advance, for example from 24:00 to 3:00 every night. Because the server of the data source generally receives few visits during this period, crawling then does not put heavy access pressure on the server, which helps the server run smoothly and improves crawling efficiency.
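A scheduler can decide whether the current time falls inside the preset crawl period with a check like the one below. The function name is hypothetical, the default window follows the 24:00 to 3:00 example, and the second branch is an added convenience for windows that start before midnight.

```python
from datetime import time


def in_crawl_window(now, start=time(0, 0), end=time(3, 0)):
    """Return True if `now` lies in the half-open window [start, end)."""
    if start <= end:
        return start <= now < end
    # The window wraps past midnight, e.g. 22:00 to 03:00.
    return now >= start or now < end


night_ok = in_crawl_window(time(1, 30))
noon_ok = in_crawl_window(time(12, 0))
```

A periodic job would call this check with the current local time and start the crawler only when it returns True.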

Preferably, after the disease data related to the keyword has been crawled from the data source during the preset crawl time period by using a crawler program, and the disease data has been parsed to obtain the public opinion factor of the disease, the method further comprises: quantifying each sub-factor of the disease to obtain its weight, and determining the sub-factors whose weight is greater than a preset weight threshold as the public opinion factors of the disease.

The specific process of quantifying each sub-factor to obtain its weight is: calculating the sum of the quantities of all sub-factors of the disease, and calculating the percentage of that sum accounted for by each sub-factor; this percentage is the weight of the corresponding sub-factor.

The weight threshold is set in advance. When the weight of a sub-factor is greater than the preset weight threshold, the sub-factor is determined as a public opinion factor of the disease, so that sub-factors with smaller weights are filtered out. This reduces the amount of computation and shortens the disease prediction time, and the low-weight sub-factors do not materially affect the outcome of the disease prediction.
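The weight calculation and threshold filter can be sketched as follows. The function name, the sample counts and the threshold of 0.1 are hypothetical; weights are expressed as fractions of the total rather than percentages.

```python
def select_sub_factors(counts, weight_threshold):
    """Keep the sub-factors whose share of the total exceeds the threshold."""
    total = sum(counts.values())
    if total == 0:
        return {}
    weights = {name: count / total for name, count in counts.items()}
    return {name: w for name, w in weights.items() if w > weight_threshold}


# Hypothetical quantities of each sub-factor.
counts = {"headache": 50, "runny nose": 30, "fever": 15, "cough": 5}
selected = select_sub_factors(counts, weight_threshold=0.1)
```

With these counts the cough sub-factor has a weight of 0.05 and is filtered out, while the other three are retained.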

In summary, in the method for predicting public opinion data, at least one keyword of a disease input by a user is received; the data sources related to the keyword are determined on the Internet and classified according to their type; a multi-threaded crawler program is set whose number of threads equals the number of categories obtained by the classification; the multi-threaded crawler program crawls the disease data related to the keyword from the corresponding data sources; the disease data is parsed to obtain the public opinion factors of the disease; data cleaning and outlier processing are performed on those factors, which are then standardized to obtain new disease data; derived variables of the public opinion factors are calculated from the new disease data; and the calculated derived variables are charted for visual display. Because different crawler programs correspond to different types of data sources and the multi-threaded crawler program crawls the disease data related to the input keywords from the corresponding data sources, parallel crawling improves crawling efficiency. The crawled disease data has a relatively uniform format, which avoids crawling difficulties, or failures to parse the crawled data, caused by the storage formats or other problems of different data sources. The public opinion factors of the disease are organized, analyzed in depth and computed; after the crawled disease data has been refined, it is made into charts or tables, so that the results are clearer and problems can be analyzed intuitively, which provides a reference for disease prediction, and the prediction result is accurate.

The above description covers only specific embodiments of the present application, but the scope of protection of the present application is not limited thereto. Those skilled in the art can also make improvements without departing from the concept of the present application, and these improvements all fall within the scope of protection of this application.

The functional modules and hardware structure of the terminal for implementing the above public opinion data prediction method are described below with reference to FIGS. 3 to 5.

Embodiment 3

FIG. 3 is a functional block diagram of a public opinion data prediction apparatus according to Embodiment 3 of the present application.

In some embodiments, the public opinion data prediction device 30 operates in a terminal. The public opinion data prediction device 30 can include a plurality of functional modules consisting of program code segments. The program code of each program segment in the public opinion data prediction device 30 can be stored in a memory and executed by at least one processor to perform prediction of the public opinion data (see FIG. 1 and its associated description).

In this embodiment, the public opinion data prediction device 30 of the terminal may be divided into a plurality of functional modules according to functions performed by the terminal. The function module may include: a receiving module 301, a crawling module 302, a parsing module 303, a cleaning module 304, an expanding module 305, a standardizing module 306, and a predicting module 307. A module as referred to in this application refers to a series of computer readable instruction segments that are executable by at least one processor and capable of performing a fixed function, which are stored in the memory. In some embodiments, the functionality of each module will be detailed in subsequent embodiments.

The receiving module 301 is configured to receive at least one keyword of a disease input by the user.

The crawling module 302 is configured to determine a data source related to the keyword in the Internet, and crawl the disease data related to the keyword from the data source by using a crawler program.

The parsing module 303 is configured to parse the disease data to obtain a sensation factor of the disease.

The cleaning module 304 is configured to perform data cleaning and abnormal value processing on the sensation factor of the disease.

The cleaning module 304 is further configured to perform data cleaning on the sensation factor of the disease according to the type of the sensation factor of the disease.

The cleaning module 304 is further configured to perform a missing value replacement on the sensation factor of the disease according to the distribution of the sensation factor of the disease.

The cleaning module 304 is also used to directly discard the sensation factor of the abnormal disease. Directly discarding the lyric factors of abnormal diseases can ensure that the lyric factors of the disease obtained by the climb are clean and avoid interference when analyzing the grievance factors of the disease.

The expansion module 305 is configured to combine the public opinion factor of the disease obtained by mean replacement with a preset expansion coefficient to obtain a new public opinion factor of the disease, which is used as the final public opinion factor of the disease. Replacing missing public opinion factors with the mean rests on the assumption that the values are missing completely at random, and it makes the variance and standard deviation of the public opinion factor smaller. The expansion coefficient is set in advance and is greater than 1.
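The effect described above can be checked numerically. The sketch below fills missing values with the mean of the observed values and shows that the standard deviation shrinks. It then applies one possible reading of "combining with the expansion coefficient", namely multiplying the imputed series by the coefficient; that reading, the helper names, and the coefficient value 1.2 are assumptions, since the application does not spell out the operation.

```python
import statistics


def mean_impute(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values], observed


def expand(values, coefficient):
    """Scale the series by an expansion coefficient greater than 1 (assumed reading)."""
    if coefficient <= 1:
        raise ValueError("expansion coefficient must be greater than 1")
    return [v * coefficient for v in values]


raw = [2.0, 4.0, None, 6.0, None, 8.0]  # hypothetical factor values
imputed, observed = mean_impute(raw)

std_observed = statistics.pstdev(observed)  # spread of the values actually seen
std_imputed = statistics.pstdev(imputed)    # smaller: imputed points sit on the mean
std_expanded = statistics.pstdev(expand(imputed, 1.2))

print(std_observed, std_imputed, std_expanded)
```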

The standardization module 306 is configured to perform data standardization on the public opinion factor of the disease after the data cleaning and outlier processing to obtain new disease data.

Data standardization of the public opinion factor of the disease after data cleaning and outlier processing converts the public opinion factor into a dimensionless pure value, so that indicators with different units or magnitudes can be compared and weighted.

The prediction module 307 is configured to calculate a derived variable of the public opinion factor of the disease according to the new disease data, and to predict the disease according to the derived variable.
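The claims name four standardization options: sum, standard deviation, maximum value, and range standardization. The sketch below uses the common textbook definition of each; the application gives no formulas, so these definitions and the function names are assumptions.

```python
import statistics


def sum_standardize(xs):
    """x / sum(x): values become shares of the total."""
    total = sum(xs)
    return [x / total for x in xs]


def std_standardize(xs):
    """(x - mean) / std: zero mean, unit standard deviation."""
    mean = statistics.fmean(xs)
    sd = statistics.pstdev(xs)  # zero for constant input; callers must guard
    return [(x - mean) / sd for x in xs]


def max_standardize(xs):
    """x / max(x): the largest value maps to 1."""
    peak = max(xs)
    return [x / peak for x in xs]


def range_standardize(xs):
    """(x - min) / (max - min): values map onto [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


factor = [10.0, 20.0, 30.0, 40.0]  # hypothetical factor values
print(sum_standardize(factor))
print(std_standardize(factor))
print(max_standardize(factor))
print(range_standardize(factor))
```

Each result is unit-free, which is what allows factors measured in different units or magnitudes to be compared and weighted.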
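The claims list the derived variables as maximum, minimum, mean, variance, standard deviation, covariance, range, median, mode, and quartile. A minimal sketch of computing them with the Python standard library follows. Covariance needs a second series, so the example pairs two hypothetical factors; the use of population (rather than sample) variance and the function name are also assumptions.

```python
import statistics


def derived_variables(xs, ys):
    """Summary statistics of xs, plus the covariance of xs with ys."""
    n = len(xs)
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    # Population covariance, computed by hand to avoid version-specific helpers.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    return {
        "maximum": max(xs),
        "minimum": min(xs),
        "mean": mean_x,
        "variance": statistics.pvariance(xs),
        "standard_deviation": statistics.pstdev(xs),
        "covariance": covariance,
        "range": max(xs) - min(xs),
        "median": statistics.median(xs),
        "mode": statistics.mode(xs),
        "quartiles": statistics.quantiles(xs, n=4),
    }


# Two hypothetical standardized factor series of equal length.
factor_a = [3, 5, 5, 8, 9]
factor_b = [1, 2, 2, 4, 5]
summary = derived_variables(factor_a, factor_b)
for name, value in summary.items():
    print(name, value)
```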

The public opinion data prediction device 30 receives at least one keyword of a disease input by the user through the receiving module 301; the crawling module 302 determines a data source related to the keyword on the Internet and crawls disease data related to the keyword from the data source by using a crawler program; the parsing module 303 parses the disease data to obtain a public opinion factor of the disease; the cleaning module 304 performs data cleaning and outlier processing on the public opinion factor of the disease; the standardization module 306 performs data standardization on the public opinion factor of the disease after the data cleaning and outlier processing to obtain new disease data; and the prediction module 307 calculates a derived variable of the public opinion factor of the disease according to the new disease data, so that the disease is predicted. From the disease-related keywords roughly input by the user, the crawler program crawls the disease data related to the input keywords, and relatively comprehensive public opinion factors related to the disease are obtained. Data organization, in-depth analysis and calculation are then performed on the public opinion factors of the disease. Refining the crawled disease data in this way moves from the display of basic data to the display of decision-making data and provides a reference for disease prediction, and the prediction result is accurate.

Embodiment 4

FIG. 4 is a functional block diagram of a public opinion data prediction apparatus according to Embodiment 4 of the present application.

In some embodiments, the public opinion data prediction device 40 runs in a terminal. The public opinion data prediction device 40 may include a plurality of functional modules composed of program code segments. The program code of each program segment in the public opinion data prediction device 40 may be stored in a memory and executed by at least one processor to perform the prediction of public opinion data (see FIG. 2 and its associated description).

In this embodiment, the public opinion data prediction device 40 of the terminal may be divided into a plurality of functional modules according to the functions it performs. The functional modules may include: a receiving module 401, a classification module 402, a setting module 403, a crawling module 404, a parsing module 405, a cleaning module 406, a standardization module 407, a visualization module 408, a storage module 409, and a quantization module 410. A module as referred to in this application is a series of computer readable instruction segments that are stored in the memory, are executable by at least one processor, and perform a fixed function. The functions of each module are detailed below.

The receiving module 401 is configured to receive at least one keyword of a disease input by the user.

The classification module 402 is configured to determine a data source related to the keyword on the Internet, and to classify the data source according to the type of the data source.

The setting module 403 is configured to set up a multi-threaded crawler program with as many crawler threads as there are categories obtained by classifying the data source.

The crawling module 404 is configured to use the multi-threaded crawler program to crawl disease data related to the keyword from the corresponding data sources respectively.
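The classify, set up, and crawl steps can be sketched with a thread pool sized to the number of source categories. This is only an illustration: `crawl_category` is a placeholder that fabricates a string instead of issuing network requests, and the source names, category labels, and function names are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def classify_sources(sources):
    """Group (source, category) pairs by category."""
    by_type = {}
    for source, kind in sources:
        by_type.setdefault(kind, []).append(source)
    return by_type


def crawl_category(kind, sources, keyword):
    """Placeholder crawler for one category; a real one would fetch and parse pages."""
    return [f"{kind}:{source}:{keyword}" for source in sources]


def crawl_all(sources, keyword):
    """Run one crawler thread per category and collect results by category."""
    by_type = classify_sources(sources)
    if not by_type:
        return {}
    # One worker per category, matching the number of categories.
    with ThreadPoolExecutor(max_workers=len(by_type)) as pool:
        futures = {
            kind: pool.submit(crawl_category, kind, items, keyword)
            for kind, items in by_type.items()
        }
        return {kind: future.result() for kind, future in futures.items()}


sources = [
    ("news-site-a", "news"),
    ("forum-b", "forum"),
    ("news-site-c", "news"),
]
results = crawl_all(sources, "influenza")
print(results)
```

Keeping one crawler per category means each thread only has to handle one storage format, which is the uniformity benefit the embodiment describes.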

The parsing module 405 is configured to parse the disease data to obtain a public opinion factor of the disease.

The cleaning module 406 is configured to perform data cleaning and outlier processing on the public opinion factor of the disease.

The standardization module 407 is configured to perform data standardization on the public opinion factor of the disease after the data cleaning and outlier processing to obtain new disease data.

The visualization module 408 is configured to calculate a derived variable of the public opinion factor of the disease according to the new disease data, and to perform visual display according to the calculated derived variable.

The storage module 409 is configured to classify and store the crawled disease data.

Preferably, to ensure that the crawled disease data is up to date, the disease data needs to be updated periodically, and the crawling module 404 is further configured to use the crawler program to crawl the disease data related to the keyword from the data source within a preset crawling time period.
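The application does not define the preset crawling time period. One possible reading is a daily time window during which the crawler is allowed to run; the sketch below checks whether a given time falls inside such a window, including windows that wrap past midnight. The function name and the example times are hypothetical.

```python
from datetime import time


def in_crawl_window(now, start, end):
    """Return True if `now` lies in the half-open daily window [start, end)."""
    if start <= end:
        return start <= now < end
    # The window wraps past midnight, for example 22:00 to 02:00.
    return now >= start or now < end


night_run = in_crawl_window(time(2, 30), time(1, 0), time(4, 0))
wrapped_run = in_crawl_window(time(23, 30), time(22, 0), time(2, 0))
midday_run = in_crawl_window(time(12, 0), time(22, 0), time(2, 0))
print(night_run, wrapped_run, midday_run)
```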

Preferably, after the disease data related to the keyword is crawled from the data source within the preset crawling time period by using the crawler program and the disease data is parsed to obtain the public opinion factor of the disease, the public opinion data prediction device 40 may further include a quantization module 410, configured to quantize each sub-factor of the public opinion factor of the disease separately, obtain the weight of each sub-factor, and determine the sub-factors whose weight is greater than a preset weight threshold as the public opinion factor of the disease.
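The claims describe the weighting as follows: sum the counts of all sub-factors, take each sub-factor's share of that sum as its weight, and keep the sub-factors whose weight exceeds the preset threshold. A minimal sketch of that calculation is shown here; the sub-factor names, counts, threshold value, and function name are hypothetical.

```python
def select_factors(counts, weight_threshold):
    """Return {sub_factor: weight} for weights strictly above the threshold."""
    total = sum(counts.values())
    if total == 0:
        return {}
    # Each weight is the sub-factor's share of the total count.
    weights = {name: count / total for name, count in counts.items()}
    return {name: w for name, w in weights.items() if w > weight_threshold}


# Hypothetical counts of sub-factors found in the parsed disease data.
sub_factor_counts = {"fever": 50, "cough": 30, "rash": 15, "fatigue": 5}
selected = select_factors(sub_factor_counts, weight_threshold=0.1)
print(selected)
```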

In summary, the public opinion data prediction device 40 receives at least one keyword of a disease input by the user through the receiving module 401; the classification module 402 determines a data source related to the keyword on the Internet and classifies the data source according to its type; the setting module 403 sets up a multi-threaded crawler program with as many crawler threads as there are categories obtained by classifying the data source; the crawling module 404 uses the multi-threaded crawler program to crawl disease data related to the keyword from the corresponding data sources respectively; the parsing module 405 parses the disease data to obtain a public opinion factor of the disease; the cleaning module 406 performs data cleaning and outlier processing on the public opinion factor of the disease; the standardization module 407 performs data standardization on the public opinion factor of the disease after the data cleaning and outlier processing to obtain new disease data; and the visualization module 408 calculates a derived variable of the public opinion factor of the disease according to the new disease data and displays it graphically according to the calculated derived variable, so that the disease is predicted. By assigning different crawler programs to different types of data sources and using the multi-threaded crawler program to crawl the disease data related to the input keyword from the corresponding data sources, the parallel crawling improves crawling efficiency. The data format of the crawled disease data is relatively uniform, which avoids difficulty in crawling, or failure to parse the crawled data, caused by differing storage formats or other problems across data sources. Data organization, in-depth analysis and calculation are performed on the public opinion factors of the disease. After the crawled disease data is refined, it is presented as charts or tables, so that the results are clearer and problems are easier to analyze intuitively, which provides a reference basis for disease prediction, and the prediction results are accurate.

The above integrated unit implemented in the form of a software function module may be stored in a non-volatile readable storage medium. The software function module is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a dual-screen device, a network device, or the like) or a processor to perform parts of the methods described in the embodiments of the present application.

Embodiment 5

FIG. 5 is a schematic diagram of a terminal according to Embodiment 5 of the present application.

The terminal 5 comprises a memory 51, at least one processor 52, computer readable instructions 53 stored in the memory 51 and operable on the at least one processor 52, and at least one communication bus 54.

When executing the computer readable instructions 53, the at least one processor 52 implements the steps of the above embodiments of the public opinion data prediction method; alternatively, when executing the computer readable instructions 53, the at least one processor 52 implements the functions of the modules/units in the above device embodiments.

Illustratively, the computer readable instructions 53 may be divided into one or more modules/units, which are stored in the memory 51 and executed by the at least one processor 52 to carry out the present application. The one or more modules/units may be a series of computer readable instruction segments capable of performing a particular function, and the instruction segments are used to describe the execution of the computer readable instructions 53 in the terminal 5.

The terminal 5 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. Those skilled in the art will understand that FIG. 5 is only an example of the terminal 5 and does not limit the terminal 5, which may include more or fewer components than illustrated, combine some components, or use different components. For example, the terminal 5 may further include an input/output device, a network access device, a bus, and the like.

The at least one processor 52 may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The processor 52 may be a microprocessor, or the processor 52 may be any conventional processor or the like. The processor 52 is the control center of the terminal 5 and connects the various parts of the entire terminal 5 by means of various interfaces and lines.

The memory 51 may be used to store the computer readable instructions 53 and/or modules/units. The processor 52 implements the various functions of the terminal 5 by running or executing the computer readable instructions and/or modules/units stored in the memory 51 and by calling the data stored in the memory 51. The memory 51 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application required for at least one function (such as a sound playing function or an image playing function), and the data storage area may store data (such as audio data or a phone book) created according to the use of the terminal 5. In addition, the memory 51 may include a high-speed random access memory, and may also include a non-volatile memory such as a hard disk, a memory, a plug-in hard disk, a smart memory card, a secure digital card, a flash memory card, at least one disk storage device, a flash memory device, or other volatile solid-state storage devices.

The modules/units integrated in the terminal 5, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. Based on this understanding, all or part of the processes in the methods of the foregoing embodiments of the present application may also be implemented by computer readable instructions, which may be stored in a non-volatile readable storage medium; when executed by a processor, the computer readable instructions implement the steps of the method embodiments described above. The computer readable instructions comprise computer readable instruction code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The non-volatile readable medium may include any entity or device capable of carrying the computer readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory, a random access memory, electrical carrier signals, telecommunication signals, and software distribution media. It should be noted that the content of the non-volatile readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, non-volatile readable media do not include electrical carrier signals and telecommunication signals.

In the several embodiments provided by the present application, it should be understood that the disclosed terminal and method may be implemented in other manners. For example, the terminal embodiment described above is only illustrative: the division of the units is only a division by logical function, and other divisions are possible in actual implementation.

In addition, each functional unit in each embodiment of the present application may be integrated in the same processing unit, or each unit may exist physically separately, or two or more units may be integrated in the same unit. The above integrated unit can be implemented in the form of hardware or in the form of hardware plus software function modules.

It is obvious to those skilled in the art that the present application is not limited to the details of the above exemplary embodiments, and that the present application can be implemented in other specific forms without departing from its spirit or essential characteristics. Therefore, the embodiments are to be considered in all respects as illustrative and not restrictive, and the scope of the present application is defined by the appended claims rather than by the above description; all changes that fall within the meaning and scope of equivalents of the claims are therefore intended to be included in the present application. Any reference sign in the claims should not be construed as limiting the claim concerned. In addition, it is to be understood that the term "comprising" does not exclude other elements, and the singular does not exclude the plural. A plurality of units or devices recited in the system claims may also be implemented by a single unit or device through software or hardware. Words such as first and second are used to denote names and do not denote any particular order.

Finally, it should be noted that the above embodiments are only used to explain the technical solutions of the present application and are not limiting. Although the present application has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions may be made to the technical solutions of the present application without departing from the spirit and scope of those technical solutions.

Claims

A method for predicting public opinion data, characterized in that the method comprises:

receiving at least one keyword of a disease input by a user;

determining a data source related to the keyword on the Internet, and crawling disease data related to the keyword from the data source by using a crawler program;

parsing the disease data to obtain a public opinion factor of the disease;

performing data cleaning and outlier processing on the public opinion factor of the disease;

performing data standardization on the public opinion factor of the disease after the data cleaning and outlier processing to obtain new disease data; and

calculating a derived variable of the public opinion factor of the disease according to the new disease data, and predicting the disease according to the derived variable.
The method of claim 1, wherein said determining a data source related to the keyword on the Internet and crawling disease data related to the keyword from the data source by using a crawler program comprises:

determining a data source related to the keyword on the Internet, and classifying the data source according to the type of the data source;

setting up a multi-threaded crawler program with as many crawler threads as there are categories obtained by classifying the data source; and

using the multi-threaded crawler program to crawl disease data related to the keyword from the corresponding data sources respectively.
The method of claim 1, wherein the method further comprises:

performing visual display according to the calculated derived variable, the derived variable including: maximum, minimum, mean, variance, standard deviation, covariance, range, median, mode, and quartile.
The method of claim 1, wherein said data standardization comprises one or a combination of the following:

sum standardization, standard deviation standardization, maximum value standardization, or range standardization.
The method of claim 1, wherein said crawling disease data related to the keyword from the data source by using a crawler program comprises:

crawling the disease data related to the keyword from the data source within a preset crawling time period by using the crawler program.
The method of claim 1, wherein said parsing the disease data to obtain a public opinion factor of the disease comprises:

calculating the sum of the counts of all sub-factors of the public opinion factor of the disease, and calculating the percentage of each sub-factor in the sum, the percentage being the weight of the corresponding sub-factor; and determining the sub-factors whose weight is greater than a preset weight threshold as the public opinion factor of the disease.
The method of claim 1, wherein said performing data cleaning and outlier processing on the public opinion factor of the disease comprises:

performing data cleaning on the public opinion factor of the disease according to the type of the public opinion factor of the disease;

replacing missing values of the public opinion factor of the disease according to the distribution of the public opinion factor of the disease; or

directly discarding abnormal public opinion factors of the disease.
A public opinion data prediction apparatus, characterized in that the apparatus comprises:

a receiving module, configured to receive at least one keyword of a disease input by a user;

a crawling module, configured to determine a data source related to the keyword on the Internet, and to crawl disease data related to the keyword from the data source by using a crawler program;

a parsing module, configured to parse the disease data to obtain a public opinion factor of the disease;

a cleaning module, configured to perform data cleaning and outlier processing on the public opinion factor of the disease;

a standardization module, configured to perform data standardization on the public opinion factor of the disease after the data cleaning and outlier processing to obtain new disease data; and

a prediction module, configured to calculate a derived variable of the public opinion factor of the disease according to the new disease data, and to predict the disease according to the derived variable.
A terminal, comprising a processor and a memory, wherein the processor, when executing computer readable instructions stored in the memory, implements the following steps:

receiving at least one keyword of a disease input by a user;

determining a data source related to the keyword on the Internet, and crawling disease data related to the keyword from the data source by using a crawler program;

parsing the disease data to obtain a public opinion factor of the disease;

performing data cleaning and outlier processing on the public opinion factor of the disease;

performing data standardization on the public opinion factor of the disease after the data cleaning and outlier processing to obtain new disease data; and

calculating a derived variable of the public opinion factor of the disease according to the new disease data, and predicting the disease according to the derived variable.
The terminal according to claim 9, wherein said determining a data source related to the keyword on the Internet and crawling disease data related to the keyword from the data source by using a crawler program comprises:

determining a data source related to the keyword on the Internet, and classifying the data source according to the type of the data source;

setting up a multi-threaded crawler program with as many crawler threads as there are categories obtained by classifying the data source; and

using the multi-threaded crawler program to crawl disease data related to the keyword from the corresponding data sources respectively.
The terminal according to claim 9, wherein the processor, when executing the computer readable instructions, further implements the following step:

performing visual display according to the calculated derived variable, the derived variable including: maximum, minimum, mean, variance, standard deviation, covariance, range, median, mode, and quartile.
The terminal according to claim 9, wherein said crawling disease data related to the keyword from the data source by using a crawler program comprises:

crawling the disease data related to the keyword from the data source within a preset crawling time period by using the crawler program.
The terminal according to claim 9, wherein said parsing the disease data to obtain a public opinion factor of the disease comprises:

calculating the sum of the counts of all sub-factors of the public opinion factor of the disease, and calculating the percentage of each sub-factor in the sum, the percentage being the weight of the corresponding sub-factor; and determining the sub-factors whose weight is greater than a preset weight threshold as the public opinion factor of the disease.
The terminal according to claim 9, wherein said performing data cleaning and outlier processing on the public opinion factor of the disease comprises:

performing data cleaning on the public opinion factor of the disease according to the type of the public opinion factor of the disease;

replacing missing values of the public opinion factor of the disease according to the distribution of the public opinion factor of the disease; or

directly discarding abnormal public opinion factors of the disease.
A non-volatile readable storage medium having computer readable instructions stored thereon, wherein the computer readable instructions, when executed by a processor, implement the following steps:

receiving at least one keyword of a disease input by a user;

determining a data source related to the keyword on the Internet, and crawling disease data related to the keyword from the data source by using a crawler program;

parsing the disease data to obtain a public opinion factor of the disease;

performing data cleaning and outlier processing on the public opinion factor of the disease;

performing data standardization on the public opinion factor of the disease after the data cleaning and outlier processing to obtain new disease data; and

calculating a derived variable of the public opinion factor of the disease according to the new disease data, and predicting the disease according to the derived variable.
The storage medium according to claim 15, wherein said determining a data source related to the keyword on the Internet and crawling disease data related to the keyword from the data source by using a crawler program comprises:

determining a data source related to the keyword on the Internet, and classifying the data source according to the type of the data source;

setting up a multi-threaded crawler program with as many crawler threads as there are categories obtained by classifying the data source; and

using the multi-threaded crawler program to crawl disease data related to the keyword from the corresponding data sources respectively.
The storage medium according to claim 15, wherein the computer readable instructions, when executed by the processor, further implement the following step:

performing visual display according to the calculated derived variable, the derived variable including: maximum, minimum, mean, variance, standard deviation, covariance, range, median, mode, and quartile.
The storage medium according to claim 15, wherein said crawling disease data related to the keyword from the data source by using a crawler program comprises:

crawling the disease data related to the keyword from the data source within a preset crawling time period by using the crawler program.
The storage medium according to claim 15, wherein said parsing the disease data to obtain a public opinion factor of the disease comprises:

calculating the sum of the counts of all sub-factors of the public opinion factor of the disease, and calculating the percentage of each sub-factor in the sum, the percentage being the weight of the corresponding sub-factor; and determining the sub-factors whose weight is greater than a preset weight threshold as the public opinion factor of the disease.
The storage medium according to claim 15, wherein said performing data cleaning and outlier processing on the public opinion factor of the disease comprises:

performing data cleaning on the public opinion factor of the disease according to the type of the public opinion factor of the disease;

replacing missing values of the public opinion factor of the disease according to the distribution of the public opinion factor of the disease; or

directly discarding abnormal public opinion factors of the disease.