CN108647249A - Public sentiment data prediction method, device, terminal and storage medium - Google Patents


Info

Publication number
CN108647249A
Authority
CN
China
Prior art keywords
disease
data
public sentiment
factor
keyword
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810351128.0A
Other languages
Chinese (zh)
Other versions
CN108647249B (en
Inventor
阮晓雯
徐亮
肖京
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201810351128.0A priority Critical patent/CN108647249B/en
Priority to PCT/CN2018/100229 priority patent/WO2019200786A1/en
Publication of CN108647249A publication Critical patent/CN108647249A/en
Application granted granted Critical
Publication of CN108647249B publication Critical patent/CN108647249B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/80 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for detecting, monitoring or modelling epidemics or pandemics, e.g. flu

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Public Health (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Pathology (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A public sentiment data prediction method, comprising: receiving at least one keyword of a disease input by a user; determining data sources on the Internet that are related to the keyword, and crawling disease data related to the keyword from the data sources using a crawler program; parsing the disease data to obtain public sentiment factors of the disease; performing data cleaning and outlier processing on the public sentiment factors of the disease; standardizing the public sentiment factors after data cleaning and outlier processing to obtain new disease data; and calculating derived variables of the public sentiment factors from the new disease data, and predicting the disease according to the derived variables. The present invention also provides a public sentiment data prediction device, a terminal and a storage medium. The present invention can crawl more comprehensive disease data and perform data organization, in-depth analysis and calculation on that data, moving from the display of basic data to the display of decision-support data and providing a reference basis for disease prediction.

Description

Public sentiment data prediction method, device, terminal and storage medium
Technical field
The present invention relates to the technical field of data prediction, and in particular to a public sentiment data prediction method, device, terminal and storage medium.
Background
With the rapid development of the Internet, computer technology has made life more convenient in every industry, and the medical field is no exception. A large amount of expert data on diseases and records of user consultations exists on the network, but these data are not sufficiently systematic or complete. When an epidemic breaks out rapidly, website information often cannot be updated in time, so information entry lags behind and users cannot learn the latest information promptly and take preventive measures in advance.
At present, public sentiment data about diseases is crawled using web crawler technology, but the crawling method is relatively simple and relies on a basic crawler. Furthermore, the crawled data is not checked effectively or promptly. In addition, data with different distributions is cleaned and filled in the same way, so the data processing results are poor.
Summary of the invention
In view of the foregoing, it is necessary to propose a public sentiment data prediction method, device, terminal and storage medium that can crawl disease data from different data sources and apply different data checking, cleaning and outlier processing methods.
A first aspect of the present invention provides a public sentiment data prediction method, the method comprising:
Receiving at least one keyword of a disease input by a user;
Determining data sources on the Internet that are related to the keyword, and crawling disease data related to the keyword from the data sources using a crawler program;
Parsing the disease data to obtain public sentiment factors of the disease;
Performing data cleaning and outlier processing on the public sentiment factors of the disease;
Standardizing the public sentiment factors of the disease after data cleaning and outlier processing to obtain new disease data; and
Calculating derived variables of the public sentiment factors of the disease from the new disease data, and predicting the disease according to the derived variables.
According to a preferred embodiment of the present invention, determining the data sources on the Internet related to the keyword and crawling the disease data related to the keyword from the data sources using a crawler program includes:
Determining the data sources on the Internet related to the keyword, and classifying the data sources according to their type;
Setting up, according to the number of categories into which the data sources are classified, a multi-threaded crawler program with the same number of threads as categories;
Crawling the disease data related to the keyword from the corresponding data sources respectively using the multi-threaded crawler program.
According to a preferred embodiment of the present invention, the method further includes:
Producing charts from the calculated derived variables for visual display, the derived variables including: maximum, minimum, mean, variance, standard deviation, covariance, range, median, mode and quartiles.
According to a preferred embodiment of the present invention, the data standardization includes one or a combination of the following:
Sum standardization, standard deviation standardization, maximum standardization or range standardization.
According to a preferred embodiment of the present invention, crawling the disease data related to the keyword from the data sources using a crawler program includes:
Crawling the disease data related to the keyword from the data sources using the crawler program within a preset crawl period.
According to a preferred embodiment of the present invention, parsing the disease data to obtain the public sentiment factors of the disease includes:
Calculating the total count of all public sentiment sub-factors of the disease, and calculating the percentage of that total accounted for by each sub-factor, the percentage being the weight of the corresponding sub-factor; and determining the sub-factors whose weight is greater than a preset weight threshold as the public sentiment factors of the disease.
According to a preferred embodiment of the present invention, performing data cleaning and outlier processing on the public sentiment factors of the disease includes:
Performing data cleaning on the public sentiment factors of the disease according to their type;
Replacing missing values in the public sentiment factors of the disease according to their distribution; or
Directly discarding any public sentiment factor of the disease that is abnormal.
A second aspect of the present invention provides a public sentiment data prediction device, the device comprising:
A receiving module, configured to receive at least one keyword of a disease input by a user;
A crawling module, configured to determine data sources on the Internet that are related to the keyword, and to crawl disease data related to the keyword from the data sources using a crawler program;
A parsing module, configured to parse the disease data to obtain public sentiment factors of the disease;
A cleaning module, configured to perform data cleaning and outlier processing on the public sentiment factors of the disease;
A standardization module, configured to standardize the public sentiment factors of the disease after data cleaning and outlier processing to obtain new disease data; and
A prediction module, configured to calculate derived variables of the public sentiment factors of the disease from the new disease data, and to predict the disease according to the derived variables.
A third aspect of the present invention provides a terminal, the terminal comprising a processor and a memory, the processor being configured to implement the public sentiment data prediction method when executing a computer program stored in the memory.
A fourth aspect of the present invention provides a computer-readable storage medium on which a computer program is stored, the computer program implementing the public sentiment data prediction method when executed by a processor.
In the public sentiment data prediction method, device, terminal and storage medium of the present invention, different crawler programs are set up for different categories of data sources, and a multi-threaded crawler program crawls the disease data related to the input keyword from the corresponding data sources. Crawling in parallel improves crawling efficiency, and the crawled disease data has a more uniform format, which avoids the situation where the storage format or other properties of different data sources make crawling difficult or make the crawled data impossible to parse. Data organization, in-depth analysis and calculation are performed on the public sentiment factors of the disease; after this fine-grained processing, the crawled disease data is presented as charts or tables, so the results are displayed more clearly and problems can be analyzed intuitively. In addition, deriving multiple variables from the public sentiment factors of the disease adds data indicators and provides a reference basis for disease prediction, so that prediction no longer relies blindly on experience and the prediction results are more accurate.
Description of the drawings
In order to explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from the drawings provided without creative effort.
Fig. 1 is a flow chart of the public sentiment data prediction method provided by Embodiment One of the present invention.
Fig. 2 is a flow chart of the public sentiment data prediction method provided by Embodiment Two of the present invention.
Fig. 3 is a structural diagram of the public sentiment data prediction device provided by Embodiment Three of the present invention.
Fig. 4 is a structural diagram of the public sentiment data prediction device provided by Embodiment Four of the present invention.
Fig. 5 is a structural diagram of the terminal provided by Embodiment Five of the present invention.
The following detailed description further illustrates the present invention with reference to the above drawings.
Detailed description
In order that the objects, features and advantages of the present invention may be understood more clearly, the present invention is described in detail below with reference to the drawings and specific embodiments. It should be noted that, where there is no conflict, the embodiments of the present invention and the features in the embodiments can be combined with each other.
Many specific details are set forth in the following description to facilitate a thorough understanding of the present invention. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field of the present invention. The terms used in the description of the present invention are only for the purpose of describing specific embodiments and are not intended to limit the present invention.
The public sentiment data prediction method of the embodiments of the present invention is applied in one or more terminals. The method can also be applied in a hardware environment consisting of a terminal and a server connected to the terminal through a network. The network includes, but is not limited to, a wide area network, a metropolitan area network or a local area network. The public sentiment data prediction method of the embodiments of the present invention can be executed by the server, by the terminal, or jointly by the server and the terminal.
A terminal that needs to perform public sentiment data prediction can integrate the public sentiment data prediction function provided by the method of the present invention directly, or can install a client that implements the method of the present invention. As another example, the method provided by the present invention can run on a device such as a server in the form of a Software Development Kit (SDK), which provides an interface to the public sentiment data prediction function; the terminal or other devices can then perform public sentiment data prediction through the provided interface.
Embodiment one
Fig. 1 is a flow chart of the public sentiment data prediction method provided by Embodiment One of the present invention. Depending on requirements, the order of the steps in the flow chart can be changed and certain steps can be omitted.
Step 11: receive at least one keyword of a disease input by a user.
The keyword is a word related to the symptoms of the disease. For example, when the disease is flu, the keywords may include: sneezing, runny nose, nasal congestion, headache and dizziness, dry cough, sore throat, etc. As another example, when the disease is hand, foot and mouth disease, the keywords may include: mouth pain, loss of appetite, low-grade fever, blisters on the hands, mouth ulcers, etc.
So that more disease-related data can be crawled later, the user can input multiple keywords for the disease. The keywords can be symptoms of the disease that the user knows from experience, or symptoms collected from disease experts.
In the present embodiment, the terminal provides in advance a function for the user to input keywords of the disease. For example, the terminal provides a text input box through which the user can input at least one keyword. Alternatively, the terminal provides a voice assistant through which the user can input at least one keyword.
Step 12: determine the data sources on the Internet that are related to the keyword, and crawl disease data related to the keyword from the data sources using a crawler program.
Data sources on the Internet related to the keyword may include, but are not limited to: Baidu, Google, Tencent, Weibo, hot-search lists, Zhihu and any website that supports user search and access. The disease data related to the keyword that is crawled from the various data sources using the crawler program may include: Baidu Index, Google Trends, Tencent analytics, domestic news, advertising data, channel data, Weibo popularity, forum public sentiment information, etc.
In the present embodiment, the user determines the Uniform Resource Locator (URL) of each data source on the Internet, and the crawler program crawls the disease data related to the keyword according to the URL.
Step 13: parse the disease data to obtain the public sentiment factors of the disease.
The disease data is subjected to the specific analysis work of public sentiment analysis, including text processing, text analysis, word frequency statistics, correlation analysis and similar processing, to obtain the public sentiment factors of the disease.
In the present embodiment, the public sentiment factors of the disease may include multiple public sentiment sub-factors, for example a first sub-factor, a second sub-factor, a third sub-factor, a fourth sub-factor, etc.
For example, the first sub-factor may be headache, the second sub-factor may be runny nose, the third sub-factor may be fever, and the fourth sub-factor may be cough.
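As a minimal sketch of the word frequency statistics mentioned in this step, assuming the crawled disease data is available as plain text strings and the candidate symptom terms are known in advance (the patent does not fix a data format, and the function name `count_sub_factors` is hypothetical), the sub-factor counts could be obtained as follows:

```python
def count_sub_factors(documents, symptom_terms):
    """Count how often each symptom term occurs across crawled documents.

    Assumption: simple case-insensitive substring counting stands in for
    the text processing and analysis that the embodiment leaves open.
    """
    counts = {term: 0 for term in symptom_terms}
    for text in documents:
        lowered = text.lower()
        for term in symptom_terms:
            counts[term] += lowered.count(term.lower())
    return counts


docs = ["Headache and cough since Monday", "Cough, fever, cough again"]
terms = ["headache", "runny nose", "fever", "cough"]
print(count_sub_factors(docs, terms))
```

Each key of the returned dictionary corresponds to one public sentiment sub-factor, and its value is the count used in later steps.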
Step 14: perform data cleaning and outlier processing on the public sentiment factors of the disease.
Data cleaning and outlier processing are performed on the public sentiment factors of the disease in order to eliminate redundant data and obtain disease data in a consistent standard format, so that the factors are usable and better suited to subsequent analysis.
In the present embodiment, performing data cleaning on the public sentiment factors of the disease includes: performing data cleaning according to the type of the public sentiment factor.
The types of public sentiment factor include, but are not limited to: factors containing noise, factors that do not conform to convention, factors containing duplicate information, factors with imbalanced data, inconsistent factors, incomplete factors, etc.
Factors containing noise are cleaned by removing extremely large values and negative values; factors that do not conform to convention are cleaned by removing outliers; factors containing duplicate information are cleaned by deleting the duplicate items; factors with imbalanced data are cleaned by data denoising; inconsistent factors are cleaned by sorting them by data type; and incomplete factors are cleaned by establishing relevant standard reference points.
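Two of the cleaning rules above can be sketched as follows. This is an illustration only: the `upper_limit` parameter and both function names are assumptions, since the embodiment does not say how large an "extremely large" value is.

```python
def remove_noise(values, upper_limit):
    """Drop negative values and values above upper_limit (noisy factors)."""
    return [v for v in values if 0 <= v <= upper_limit]


def drop_duplicates(records):
    """Delete duplicate records, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for record in records:
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique


print(remove_noise([50, -3, 52, 10000, 49], upper_limit=1000))
print(drop_duplicates([("cough", 3), ("fever", 1), ("cough", 3)]))
```

Records must be hashable (for example tuples) for the duplicate check to work.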
In the present embodiment, performing outlier processing on the public sentiment factors of the disease includes: replacing missing values in the public sentiment factors according to their distribution.
In the present embodiment, the distribution of the public sentiment factors of the disease includes, but is not limited to, a stable type and a volatile type. A stable distribution means that the trend of the public sentiment factor changes smoothly, for example 50, 53, 52, 49, 51. A volatile distribution means that the trend changes sharply with large swings, for example 50, 100, 43, 89, 4.
For public sentiment factors with a stable distribution, the K-nearest-neighbor method can be used: Euclidean distance or correlation analysis determines the K samples closest to the sample with the missing public sentiment factor, and the missing value is estimated as the weighted average of the factor values of those K samples. Alternatively, for stable distributions a prediction model can be used to predict each missing public sentiment factor: if the missing factor is numeric, it can be filled with the mean; if it is non-numeric, it can be filled with the mode.
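The two filling strategies for stable distributions can be sketched as below. The inverse-distance weighting in `knn_impute` is an assumption, because the embodiment only says "weighted average"; the function names and the row layout (one list of numbers per sample) are likewise hypothetical.

```python
import math
from statistics import mean, mode


def knn_impute(rows, target_index, missing_row, k=2):
    """Estimate missing_row[target_index] from the k nearest complete rows.

    Distance is Euclidean over all columns except target_index; neighbours
    are weighted by inverse distance (an assumed weighting scheme).
    """
    def distance(a, b):
        return math.sqrt(sum((a[i] - b[i]) ** 2
                             for i in range(len(a)) if i != target_index))

    neighbours = sorted(rows, key=lambda r: distance(r, missing_row))[:k]
    weights = [1.0 / (distance(r, missing_row) + 1e-9) for r in neighbours]
    weighted_sum = sum(w * r[target_index] for w, r in zip(weights, neighbours))
    return weighted_sum / sum(weights)


def fill_missing(values):
    """Fill None entries with the mean (numeric) or the mode (non-numeric)."""
    present = [v for v in values if v is not None]
    numeric = all(isinstance(v, (int, float)) for v in present)
    fill = mean(present) if numeric else mode(present)
    return [fill if v is None else v for v in values]


complete = [[50, 10, 5], [52, 12, 6], [90, 40, 30]]
print(knn_impute(complete, target_index=2, missing_row=[51, 11, None]))
print(fill_missing([50, None, 52]))
print(fill_missing(["high", None, "high", "low"]))
```

In the first call the two nearest rows are equally distant, so the estimate is the plain average of their third column.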
For public sentiment factors with a volatile distribution, the missing factor can be replaced using the mean value method.
Preferably, because replacing missing public sentiment factors with the mean is based on the assumption that values are missing completely at random, it reduces the variance and standard deviation of the public sentiment factors. The method can therefore also include: multiplying the public sentiment factor obtained after mean replacement by a preset sampling factor, and using the resulting new public sentiment factor as the final public sentiment factor of the disease.
The preset sampling factor is a sampling factor set in advance, and it is greater than 1.
In other embodiments, performing outlier processing on the public sentiment factors of the disease further includes: directly discarding any public sentiment factor that is abnormal. Discarding abnormal factors ensures that the crawled public sentiment factors are clean and avoids interference when they are analyzed.
Step 15: standardize the public sentiment factors of the disease after data cleaning and outlier processing to obtain new disease data.
Standardizing the public sentiment factors after data cleaning and outlier processing converts them into dimensionless pure numbers, so that indicators with different units or magnitudes can be compared and weighted.
In the present embodiment, the data standardization methods include, but are not limited to: sum standardization, standard deviation standardization, maximum standardization and range standardization. Range standardization is preferred: in the new data obtained after range standardization the maximum is 1, the minimum is 0, and every other value lies between 0 and 1.
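The four standardization methods named above correspond to common textbook formulas. The sketch below uses those usual definitions, which the patent does not spell out, so the exact formulas and the function names are assumptions.

```python
from statistics import mean, pstdev


def sum_standardize(xs):
    """Divide each value by the sum, so the results add up to 1."""
    total = sum(xs)
    return [x / total for x in xs]


def std_standardize(xs):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]


def max_standardize(xs):
    """Divide each value by the maximum, so the largest result is 1."""
    peak = max(xs)
    return [x / peak for x in xs]


def range_standardize(xs):
    """Map the minimum to 0 and the maximum to 1 (min-max scaling)."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


# A constant series would divide by zero in std_standardize and
# range_standardize; callers are assumed to filter such series first.
scaled = range_standardize([50, 100, 43, 89, 4])
print(min(scaled), max(scaled))
```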
Step 16: calculate the derived variables of the public sentiment factors of the disease from the new disease data, and predict the disease according to the derived variables.
In the present embodiment, the derived variables include: maximum, minimum, mean, variance, standard deviation, covariance, range (maximum minus minimum), median, mode and quartiles. The mean, median, mode and quartiles describe the degree of concentration of the public sentiment factors of the disease; the greater the degree of concentration, the more serious the predicted disease. The range, variance and standard deviation describe the degree of dispersion of the public sentiment factors; the smaller the degree of dispersion, the more serious the predicted disease.
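Most of these derived variables are available in the Python standard library. The following sketch computes them for a single series; covariance is left out because it needs two series, and the choice of population (rather than sample) variance is an assumption.

```python
import statistics


def derived_variables(xs):
    """Compute the single-series derived variables listed in step 16."""
    q1, q2, q3 = statistics.quantiles(xs, n=4)  # needs Python 3.8+
    return {
        "max": max(xs),
        "min": min(xs),
        "mean": statistics.mean(xs),
        "variance": statistics.pvariance(xs),
        "std_dev": statistics.pstdev(xs),
        "range": max(xs) - min(xs),
        "median": statistics.median(xs),
        "mode": statistics.mode(xs),
        "quartiles": (q1, q2, q3),
    }


summary = derived_variables([50, 53, 52, 49, 51])
print(summary["mean"], summary["range"], summary["variance"])
```

The resulting dictionary can be passed directly to a charting step for visual display.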
The public sentiment data prediction method receives at least one keyword of a disease input by a user, determines the data sources on the Internet related to the keyword, crawls disease data related to the keyword from the data sources using a crawler program, and parses the disease data to obtain the public sentiment factors of the disease. It then performs data cleaning and outlier processing on the public sentiment factors, standardizes the processed factors to obtain new disease data, calculates the derived variables of the public sentiment factors from the new disease data, and predicts the disease according to the derived variables. From a rough set of disease-related keywords input by the user, the crawler program obtains disease data related to those keywords and thus more comprehensive public sentiment factors for the disease. Data organization, in-depth analysis and calculation on the public sentiment factors refine the crawled disease data, moving from the display of basic data to the display of decision-support data and providing a reference basis for disease prediction, so the prediction result is accurate.
Embodiment two
Fig. 2 is a flow chart of the public sentiment data prediction method provided by Embodiment Two of the present invention. Depending on requirements, the order of the steps in the flow chart can be changed and certain steps can be omitted.
Step 21: receive at least one keyword of a disease input by a user.
Step 21 in the present embodiment is the same as step 11 in Embodiment One and is not described again in detail here.
Step 22: determine the data sources on the Internet that are related to the keyword, and classify the data sources according to their type.
In the present embodiment, the data sources related to the keyword can be divided into two major categories according to their type: the first category is index-type data sources, and the second category is public sentiment volume data sources. Index-type data sources include, but are not limited to: Baidu, Google, 360, etc. Public sentiment volume data sources include, but are not limited to: Weibo, forums, WeChat and hot-search lists.
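One simple way to realize this classification is a lookup from host name to category. The sketch below is purely illustrative: the host names are placeholders under `example.com`, and the table, category labels and function name are all assumptions rather than part of the embodiment.

```python
from urllib.parse import urlparse

# Hypothetical host-to-category table; the hosts are placeholders.
CATEGORY_BY_HOST = {
    "index.example.com": "index",
    "trends.example.com": "index",
    "weibo.example.com": "volume",
    "forum.example.com": "volume",
}


def classify_sources(urls):
    """Group data-source URLs by category; unknown hosts go to 'other'."""
    groups = {}
    for url in urls:
        category = CATEGORY_BY_HOST.get(urlparse(url).hostname, "other")
        groups.setdefault(category, []).append(url)
    return groups


print(classify_sources([
    "https://index.example.com/a",
    "https://weibo.example.com/b",
    "https://unknown.example.org/c",
]))
```

The number of keys in the result is the number of categories used in the next step to decide how many crawler threads to start.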
Step 23: according to the number of categories into which the data sources are classified, set up a multi-threaded crawler program with the same number of threads as categories.
Setting up a different crawler program for each category of data source makes it easier to crawl the data of that category smoothly, and avoids the situation where the storage format or other properties of different data sources make crawling difficult or make the crawled data impossible to parse.
In the present embodiment, if the data sources are divided into two categories, a two-thread crawler program is set up accordingly. For example, Baidu and Weibo are two different types of data source, each with its own text storage format, so a first crawler program is dedicated to crawling the disease data related to the keyword from Baidu and a second crawler program is dedicated to crawling the disease data related to the keyword from Weibo.
In other embodiments, the data sources on the Internet related to the keyword can also be subdivided into more categories as needed, with a corresponding crawler program set up for the data sources of each category.
Step 24: crawl the disease data related to the keyword from the corresponding data sources respectively using the multi-threaded crawler program.
In the present embodiment, the URLs of the data sources for each crawler program are placed in a crawl queue, and the multi-threaded crawler program crawls the disease data related to the keyword from the data sources in parallel.
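The one-thread-per-category arrangement can be sketched with the standard `concurrent.futures` module. The actual page download is passed in as a `fetch` callable so that the sketch stays independent of any HTTP library; `fetch`, the function names and the example URLs are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_category(category, urls, keyword, fetch):
    """Crawl every queued URL of one data-source category in order."""
    return category, [fetch(url, keyword) for url in urls]


def crawl_all(sources_by_category, keyword, fetch):
    """Run one crawler thread per data-source category, in parallel."""
    workers = len(sources_by_category)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(crawl_category, cat, urls, keyword, fetch)
                   for cat, urls in sources_by_category.items()]
        return dict(future.result() for future in futures)


def fake_fetch(url, keyword):
    """Stand-in for a real download; returns a string instead of a page."""
    return f"{url}?q={keyword}"


sources = {
    "index": ["https://index.example.com"],
    "volume": ["https://weibo.example.com", "https://forum.example.com"],
}
print(crawl_all(sources, "cough", fake_fetch))
```

In a real crawler `fetch` would issue the request and return the page content; keeping it injectable also makes the threading logic easy to test offline.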
Step 25: parse the disease data to obtain the public sentiment factors of the disease.
Step 26: perform data cleaning and outlier processing on the public sentiment factors of the disease.
Step 27: standardize the public sentiment factors of the disease after data cleaning and outlier processing to obtain new disease data.
Steps 25 to 27 in the present embodiment correspond to steps 13 to 15 in Embodiment One respectively and are not described again in detail here.
Step 28: calculate the derived variables of the public sentiment factors of the disease from the new disease data, and produce charts from the calculated derived variables for visual display.
Preferably, step 24 can also include: storing the crawled disease data by category.
The disease data can be stored in a local database, in a storage server or in the cloud. For example, the disease data crawled from Baidu is stored in a first storage location in the local database, and the disease data crawled from Weibo is stored in a second storage location in the local database. The first storage location and the second storage location can be under the same root directory of the local database or under different root directories. The two storage locations can also be distinguished by different names. Storing the data crawled from different data sources by category makes it easier to analyze the data from a single data source.
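As a small illustration of category-based storage, the sketch below writes the records of each data source to its own directory under a common root. Using the file system and JSON instead of a database is an assumption made only to keep the example self-contained; the function name and file name are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def store_by_source(root, source_name, records):
    """Write the records of one data source to its own directory."""
    target_dir = Path(root) / source_name
    target_dir.mkdir(parents=True, exist_ok=True)
    target_file = target_dir / "disease_data.json"
    target_file.write_text(json.dumps(records, ensure_ascii=False),
                           encoding="utf-8")
    return target_file


with tempfile.TemporaryDirectory() as root:
    first = store_by_source(root, "baidu", [{"keyword": "cough", "index": 120}])
    second = store_by_source(root, "weibo", [{"keyword": "cough", "posts": 45}])
    print(first.parent.name, second.parent.name)
```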
Preferably, to ensure that the crawled disease data is up to date, the disease data needs to be updated periodically. The method may further include: crawling disease data related to the keyword from the data source using the crawler program within a preset crawl period.
The preset crawl period is a crawl period set in advance, for example from 24:00 to 3:00 each night. Few users generally access the server of the data source during these hours, so crawling does not place heavy access load on that server, which helps the server run smoothly and can improve crawling efficiency.
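A check for the preset crawl period might look like the sketch below. The names `in_crawl_window`, `CRAWL_START`, and `CRAWL_END` are hypothetical, and the window is treated as half-open (the end time itself is excluded), which is an assumption. The second branch handles a window that wraps past midnight, in case a different period is configured.

```python
from datetime import datetime, time

CRAWL_START = time(0, 0)  # 24:00 in the text, i.e. midnight
CRAWL_END = time(3, 0)    # 3:00


def in_crawl_window(now: datetime, start: time = CRAWL_START, end: time = CRAWL_END) -> bool:
    """Return True if `now` falls inside the preset crawl period [start, end)."""
    current = now.time()
    if start <= end:
        return start <= current < end
    # Window wraps past midnight, e.g. 23:00 to 02:00.
    return current >= start or current < end
```

A scheduler would call this before starting each crawl and skip the run when it returns `False`.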
Preferably, after disease data related to the keyword is crawled from the data source using the crawler program within the preset crawl period and the disease data is parsed to obtain the public sentiment factors of the disease, the method may further include: quantifying each sub public sentiment factor of the disease to obtain its weight, and determining the sub public sentiment factors whose weight is greater than a preset weight threshold as the public sentiment factors of the disease.
The sub public sentiment factors of the disease are quantified, and their weights obtained, as follows: calculate the total count of all sub public sentiment factors of the disease, then calculate the percentage of that total accounted for by each sub public sentiment factor; this percentage is the weight of the corresponding sub public sentiment factor.
The preset weight threshold is a weight threshold set in advance. When the weight of a sub public sentiment factor is greater than the preset weight threshold, that sub public sentiment factor is determined as a public sentiment factor of the disease. This effectively screens out sub public sentiment factors with small weights, which reduces the amount of computation and shortens the disease prediction time, while the sub public sentiment factors with small weights do not affect the disease prediction result.
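The weighting and screening just described can be sketched as follows. The function name `select_factors` and the example counts are illustrative assumptions; the weight is expressed as a fraction of the total rather than a percentage.

```python
from typing import Dict


def select_factors(counts: Dict[str, int], weight_threshold: float) -> Dict[str, float]:
    """Weight each sub factor by its share of the total count and keep those above the threshold."""
    total = sum(counts.values())
    if total == 0:
        return {}  # nothing was observed, so no factor can be selected
    weights = {name: count / total for name, count in counts.items()}
    return {name: weight for name, weight in weights.items() if weight > weight_threshold}
```

For example, with counts of 50, 30, 15 and 5 for headache, runny nose, fever and cough, and a threshold of 0.1, the first three sub factors are kept and cough is screened out.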
In summary, the public sentiment data prediction method receives at least one keyword of a disease input by a user; determines the data sources on the Internet related to the keyword; classifies the data sources according to their type; sets, according to the number of categories into which the data sources are classified, the same number of multi-threaded crawler programs; crawls disease data related to the keyword from the corresponding data sources using the multi-threaded crawler programs; parses the disease data to obtain the public sentiment factors of the disease; performs data cleansing and outlier processing on the public sentiment factors of the disease; standardizes the public sentiment factors after data cleansing and outlier processing to obtain new disease data; calculates derived variables of the public sentiment factors of the disease from the new disease data; and produces charts from the calculated derived variables for visualization, so as to predict the disease. Because different crawler programs are set for different categories of data source and the multi-threaded crawler programs crawl the disease data related to the input keyword from their corresponding data sources, the parallel crawling improves crawling efficiency, the data format of the crawled disease data is more uniform, and the difficulty of crawling, or the inability to parse crawled data, caused by differences in storage format or other problems among data sources is avoided. Data organization, in-depth analysis, and calculation are performed on the public sentiment factors of the disease; after this refined processing of the crawled disease data, the results are rendered as figures or tables, which presents them more clearly, makes intuitive analysis easier, and provides a reference basis for disease prediction, so that the prediction result is accurate.
The above is only a specific implementation of the present invention, but the scope of protection of the present invention is not limited thereto. Those of ordinary skill in the art may make improvements without departing from the inventive concept, and all such improvements fall within the scope of protection of the present invention.
With reference to Figs. 3 to 5, the functional modules and hardware structure of the terminal that implements the above public sentiment data prediction method are introduced below.
Embodiment three
Fig. 3 is a functional block diagram of the public sentiment data prediction device provided by Embodiment Three of the present invention.
In some embodiments, the public sentiment data prediction device 30 runs in a terminal. The public sentiment data prediction device 30 may include multiple functional modules composed of program code segments. The program code of each segment in the public sentiment data prediction device 30 may be stored in a memory and executed by at least one processor to perform the prediction of public sentiment data (see Fig. 1 and its related description).
In this embodiment, the public sentiment data prediction device 30 of the terminal may be divided into multiple functional modules according to the functions it performs. The functional modules may include: a receiving module 301, a crawl module 302, a parsing module 303, a cleaning module 304, an extension module 305, a standardized module 306, and a prediction module 307. A module in the present invention refers to a series of computer program segments that can be executed by at least one processor, can complete a fixed function, and are stored in the memory. The function of each module is described in detail below.
Receiving module 301, configured to receive at least one keyword of a disease input by a user.
The keyword is a word related to the symptoms of the disease. For example, when the disease is the common cold, the keywords may include: sneezing, runny nose, nasal congestion, headache and dizziness, dry cough, sore throat, and so on. As another example, when the disease is hand, foot and mouth disease, the keywords may include: mouth pain, loss of appetite, low-grade fever, blisters on the hands, mouth ulcers, and so on.
To make it easier to subsequently crawl more disease-related data, the user may input multiple keywords of the disease. The keywords may be symptoms of the disease known to the user from experience, or symptoms of the disease collected from disease experts.
In this embodiment, the terminal provides in advance a function for the user to input keywords of the disease. For example, the terminal provides a text input box through which the user can input at least one keyword. Alternatively, the terminal provides a voice assistant through which the user can input at least one keyword.
Crawl module 302, configured to determine the data sources on the Internet related to the keyword, and to crawl disease data related to the keyword from the data sources using a crawler program.
The data sources on the Internet related to the keyword may include, but are not limited to: Baidu, Google, Tencent, Weibo, hot search lists, Zhihu, and any website that supports user search and access. The disease data related to the keyword that is crawled from the various data sources using the crawler program may include: the Baidu index, Google Trends, Tencent analytics, domestic news, advertising data, channel data, Weibo popularity, forum public sentiment information, and so on.
In this embodiment, the user determines the Uniform Resource Locator (URL) of the data source on the Internet, and the crawler program crawls disease data related to the keyword according to the URL.
Parsing module 303, configured to parse the disease data to obtain the public sentiment factors of the disease.
Parsing the disease data comprises the specific analysis work of public sentiment analysis, including text processing, text analysis, word frequency statistics, and correlation analysis, to obtain the public sentiment factors of the disease.
In this embodiment, the public sentiment factors of the disease may include multiple sub public sentiment factors, for example a first sub public sentiment factor, a second sub public sentiment factor, a third sub public sentiment factor, a fourth sub public sentiment factor, and so on.
For example, the first sub public sentiment factor may be headache, the second sub public sentiment factor may be runny nose, the third sub public sentiment factor may be fever, and the fourth sub public sentiment factor may be cough.
Cleaning module 304, configured to perform data cleansing and outlier processing on the public sentiment factors of the disease.
Data cleansing and outlier processing are performed on the public sentiment factors of the disease in order to eliminate redundant data in them and obtain disease data in a consistent standard format, so that the processed public sentiment factors are usable and better suited to subsequent analysis.
The cleaning module 304 is further configured to perform data cleansing on the public sentiment factors of the disease according to their type.
The types of public sentiment factor of the disease include, but are not limited to: noisy factors, factors that do not conform to convention, factors containing duplicate information, imbalanced factors, inconsistent factors, and incomplete factors.
Noisy public sentiment factors are cleaned by removing extremely large values and negative values; factors that do not conform to convention are cleaned by removing outliers; factors containing duplicate information are cleaned by deleting the duplicates; imbalanced factors are cleaned by data denoising; inconsistent factors are cleaned by sorting them by data type; and incomplete factors are cleaned by establishing relevant standard reference values.
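Two of the cleaning rules above, removing extremely large or negative values and deleting duplicates, can be sketched as follows. The function name `clean_records`, the representation of each record as a `(label, value)` pair, and the caller-supplied `upper_limit` are assumptions; the source does not say how "extremely large" is determined.

```python
from typing import List, Tuple

Record = Tuple[str, float]  # hypothetical shape: (sub factor label, observed value)


def clean_records(records: List[Record], upper_limit: float) -> List[Record]:
    """Drop negative or extremely large values, then drop exact duplicates, keeping order."""
    seen = set()
    cleaned = []
    for record in records:
        _, value = record
        if value < 0 or value > upper_limit:
            continue  # treated as noise
        if record in seen:
            continue  # duplicate information
        seen.add(record)
        cleaned.append(record)
    return cleaned
```

The remaining rules (denoising imbalanced data, sorting by data type, establishing reference values) depend on details the text does not give and are left out of this sketch.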
The cleaning module 304 is further configured to replace missing values in the public sentiment factors of the disease according to the distribution of the public sentiment factors.
In this embodiment, the distributions of the public sentiment factors of the disease include, but are not limited to: a stable type and a volatile type. A public sentiment factor with a stable distribution is one whose trend changes smoothly, for example 50, 53, 52, 49, 51. A public sentiment factor with a volatile distribution is one whose trend changes sharply, with large swings, for example 50, 100, 43, 89, 4.
For public sentiment factors of the disease with a stable distribution, the K-nearest-neighbor method may be used: the K samples closest to the sample with the missing public sentiment factor are determined by Euclidean distance or correlation analysis, and the missing data of that sample is estimated as a weighted average of the public sentiment factor values of those K samples. For public sentiment factors with a stable distribution, a prediction model may also be used to predict each missing public sentiment factor: if the missing factor is numeric, the mean may be used to fill it; if the missing factor is non-numeric, the mode may be used to fill it.
For public sentiment factors of the disease with a volatile distribution, the missing public sentiment factors may be replaced using the mean-substitution method.
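As an illustration of the K-nearest-neighbor fill for stable factors, the sketch below estimates each missing value from the K closest complete samples. The function name `knn_impute`, the representation of a sample as a feature vector plus an optional value, and the inverse-distance weighting are assumptions; the source states only that the K nearest samples are found by Euclidean distance (or correlation analysis) and combined by a weighted average.

```python
import math
from typing import List, Optional, Sequence, Tuple

Sample = Tuple[Sequence[float], Optional[float]]  # (feature vector, value or None if missing)


def knn_impute(samples: List[Sample], k: int) -> List[float]:
    """Fill missing values with a distance-weighted average of the k nearest complete samples."""
    if k < 1:
        raise ValueError("k must be at least 1")
    known = [(features, value) for features, value in samples if value is not None]
    if not known:
        raise ValueError("at least one sample with a known value is required")
    filled = []
    for features, value in samples:
        if value is not None:
            filled.append(value)
            continue
        # Euclidean distance from the incomplete sample to every complete one.
        ranked = sorted((math.dist(features, f), v) for f, v in known)[:k]
        # Closer neighbours get larger weights; the epsilon avoids division by zero.
        weights = [1.0 / (distance + 1e-9) for distance, _ in ranked]
        filled.append(sum(w * v for w, (_, v) in zip(weights, ranked)) / sum(weights))
    return filled
```

With samples at positions 1, 2 and 3, values 50 and 52 at the ends and the middle value missing, `k=2` gives an estimate of about 51 for the gap, since both neighbours are equally distant.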
The cleaning module 304 is further configured to directly discard public sentiment factors of the disease that are abnormal. Directly discarding abnormal public sentiment factors keeps the crawled public sentiment factors of the disease clean and avoids interference when they are analyzed.
Extension module 305, configured to multiply the public sentiment factors of the disease obtained after mean substitution by a preset sampling factor to obtain new public sentiment factors of the disease, which serve as the final public sentiment factors of the disease. Replacing missing public sentiment factors with the mean rests on the assumption that the data are missing completely at random, and it reduces the variance and standard deviation of the public sentiment factors of the disease. The preset sampling factor is a sampling factor set in advance, and it is greater than 1.
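Mean substitution followed by scaling with the sampling factor can be sketched as below. The function name `mean_fill_and_scale` is hypothetical, and the sketch assumes that every value in the series is multiplied by the sampling factor after the gaps are filled; the source does not say whether only the substituted values are scaled.

```python
from statistics import mean
from typing import List, Optional


def mean_fill_and_scale(values: List[Optional[float]], sampling_factor: float) -> List[float]:
    """Replace missing values with the mean of the observed ones, then scale the series."""
    if sampling_factor <= 1:
        raise ValueError("the sampling factor is required to be greater than 1")
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("at least one observed value is required")
    fill = mean(observed)  # mean substitution for the gaps
    # Assumption: the whole series is scaled, which widens the spread that
    # mean substitution had narrowed.
    return [(fill if v is None else v) * sampling_factor for v in values]
```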
Standardized module 306, configured to standardize the public sentiment factors of the disease after data cleansing and outlier processing to obtain new disease data.
Standardizing the public sentiment factors of the disease after data cleansing and outlier processing converts them into dimensionless pure numbers, so that indicators with different units or orders of magnitude can be compared and weighted.
In this embodiment, the data standardization methods include, but are not limited to: sum standardization, standard deviation standardization, maximum standardization, and range standardization. Range standardization is preferred: after range standardization, the maximum of the new data is 1, the minimum is 0, and every other value lies between 0 and 1.
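Range standardization maps each value x to (x - min) / (max - min). A minimal sketch follows; the function name `range_standardize` and the choice to return all zeros for a constant series (where the range is zero) are assumptions made here.

```python
from typing import List


def range_standardize(values: List[float]) -> List[float]:
    """Rescale values to [0, 1] using (x - min) / (max - min)."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # constant series: the range is zero
    return [(value - low) / (high - low) for value in values]
```

Applied to the volatile example series 50, 100, 43, 89, 4, the value 100 maps to 1 and the value 4 maps to 0, with the rest in between.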
Prediction module 307, configured to calculate derived variables of the public sentiment factors of the disease from the new disease data, and to predict the disease according to the derived variables.
In this embodiment, the derived variables include: maximum, minimum, mean, variance, standard deviation, covariance, range (maximum minus minimum), median, mode, and quartiles. The mean, median, mode, and quartiles describe the intensity of the public sentiment factors of the disease; the greater the intensity, the more serious the predicted disease. The range, variance, and standard deviation characterize the dispersion of the public sentiment factors of the disease; the smaller the dispersion, the more serious the predicted disease.
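Most of the derived variables listed above can be computed with Python's standard `statistics` module, as in this sketch. The function name `derived_variables` is hypothetical; population variance and standard deviation are used as an assumption, since the source does not say whether population or sample formulas are intended; and covariance is omitted because it requires a second series.

```python
import statistics
from typing import Dict, List


def derived_variables(values: List[float]) -> Dict[str, object]:
    """Compute summary statistics of one standardized public sentiment factor series."""
    q1, q2, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "max": max(values),
        "min": min(values),
        "mean": statistics.mean(values),
        "variance": statistics.pvariance(values),  # population variance (assumption)
        "std_dev": statistics.pstdev(values),
        "range": max(values) - min(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "quartiles": (q1, q2, q3),
    }
```

The resulting dictionary can be fed directly to a charting step or compared across diseases.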
In the public sentiment data prediction device 30, the receiving module 301 receives at least one keyword of a disease input by a user; the crawl module 302 determines the data sources on the Internet related to the keyword and crawls disease data related to the keyword from the data sources using a crawler program; the parsing module 303 parses the disease data to obtain the public sentiment factors of the disease; the cleaning module 304 performs data cleansing and outlier processing on the public sentiment factors of the disease; the standardized module 306 standardizes the public sentiment factors after data cleansing and outlier processing to obtain new disease data; and the prediction module 307 calculates derived variables of the public sentiment factors of the disease from the new disease data and predicts the disease according to the derived variables. From only a rough set of disease-related keywords input by the user, the crawler program crawls the disease data related to those keywords, which yields a more comprehensive set of public sentiment factors for the disease. Data organization, in-depth analysis, and calculation are then performed on the public sentiment factors of the disease; this refined processing of the crawled disease data moves from presenting basic data to presenting decision-support data, provides a reference basis for disease prediction, and makes the prediction result accurate.
Embodiment four
Fig. 4 is a functional block diagram of the public sentiment data prediction device provided by Embodiment Four of the present invention.
In some embodiments, the public sentiment data prediction device 40 runs in a terminal. The public sentiment data prediction device 40 may include multiple functional modules composed of program code segments. The program code of each segment in the public sentiment data prediction device 40 may be stored in a memory and executed by at least one processor to perform the prediction of public sentiment data (see Fig. 2 and its related description).
In this embodiment, the public sentiment data prediction device 40 of the terminal may be divided into multiple functional modules according to the functions it performs. The functional modules may include: a receiving module 401, a sort module 402, a setup module 403, a crawl module 404, a parsing module 405, a cleaning module 406, a standardized module 407, a visualization module 408, a memory module 409, and a quantization module 410. A module in the present invention refers to a series of computer program segments that can be executed by at least one processor, can complete a fixed function, and are stored in the memory. The function of each module is described in detail below.
Receiving module 401, configured to receive at least one keyword of a disease input by a user.
Sort module 402, configured to determine the data sources on the Internet related to the keyword, and to classify the data sources according to their type.
In this embodiment, the data sources related to the keyword may be divided into two major categories according to their type: the first category is index-type data sources, and the second category is public-sentiment-volume data sources. The index-type data sources include, but are not limited to: Baidu, Google, 360, and so on. The public-sentiment-volume data sources include, but are not limited to: Weibo, forums, WeChat, and hot search lists.
Setup module 403, configured to set, according to the number of categories into which the data sources are classified, the same number of multi-threaded crawler programs.
Setting different crawler programs for different categories of data source makes it easier to crawl the data of each category smoothly, and avoids the difficulty of crawling, or the inability to parse crawled data, that can arise from differences in storage format or other problems among data sources.
In this embodiment, if the data sources are divided into two categories, a dual-thread crawler program is set accordingly. For example, Baidu and Weibo are two different types of data source, each with its own text storage format, so a first crawler program is dedicated to crawling disease data related to the keyword from Baidu, and a second crawler program is dedicated to crawling disease data related to the keyword from Weibo.
In other embodiments, the data sources on the Internet related to the keyword may also be subdivided into multiple categories according to actual needs, with a corresponding crawler program set for the data sources of each category.
Crawl module 404, configured to crawl disease data related to the keyword from the corresponding data sources using the multi-threaded crawler programs.
In this embodiment, the URLs of the data sources assigned to each crawler program are placed in a crawl queue, and the multi-threaded crawler programs crawl disease data related to the keyword from the data sources in parallel.
Parsing module 405, configured to parse the disease data to obtain the public sentiment factors of the disease.
Cleaning module 406, configured to perform data cleansing and outlier processing on the public sentiment factors of the disease.
Standardized module 407, configured to standardize the public sentiment factors of the disease after data cleansing and outlier processing to obtain new disease data.
Visualization module 408, configured to calculate derived variables of the public sentiment factors of the disease from the new disease data, and to produce charts from the calculated derived variables for visualization.
Memory module 409, configured to store the crawled disease data by category.
The disease data may be stored in a local database, on a storage server, or in the cloud. For example, disease data crawled from Baidu is stored in a first storage location in the local database, and disease data crawled from Weibo is stored in a second storage location in the local database. The first storage location and the second storage location may be under the same root directory of the local database or under different root directories. The two storage locations may also be distinguished by different names. Storing data crawled from different data sources by category makes it easier to analyze the data of a single data source.
Preferably, to ensure that the crawled disease data is up to date, the disease data needs to be updated periodically. The crawl module 404 is further configured to crawl disease data related to the keyword from the data source using the crawler program within a preset crawl period.
The preset crawl period is a crawl period set in advance, for example from 24:00 to 3:00 each night. Few users generally access the server of the data source during these hours, so crawling does not place heavy access load on that server, which helps the server run smoothly and can improve crawling efficiency.
Preferably, after disease data related to the keyword is crawled from the data source using the crawler program within the preset crawl period and the disease data is parsed to obtain the public sentiment factors of the disease, the public sentiment data prediction device 40 may further include a quantization module 410, configured to quantify each sub public sentiment factor of the disease to obtain its weight, and to determine the sub public sentiment factors whose weight is greater than a preset weight threshold as the public sentiment factors of the disease.
The sub public sentiment factors of the disease are quantified, and their weights obtained, as follows: calculate the total count of all sub public sentiment factors of the disease, then calculate the percentage of that total accounted for by each sub public sentiment factor; this percentage is the weight of the corresponding sub public sentiment factor.
The preset weight threshold is a weight threshold set in advance. When the weight of a sub public sentiment factor is greater than the preset weight threshold, that sub public sentiment factor is determined as a public sentiment factor of the disease. This effectively screens out sub public sentiment factors with small weights, which reduces the amount of computation and shortens the disease prediction time, while the sub public sentiment factors with small weights do not affect the disease prediction result.
In summary, in the public sentiment data prediction device 40, the receiving module 401 receives at least one keyword of a disease input by a user; the sort module 402 determines the data sources on the Internet related to the keyword and classifies the data sources according to their type; the setup module 403 sets, according to the number of categories into which the data sources are classified, the same number of multi-threaded crawler programs; the crawl module 404 crawls disease data related to the keyword from the corresponding data sources using the multi-threaded crawler programs; the parsing module 405 parses the disease data to obtain the public sentiment factors of the disease; the cleaning module 406 performs data cleansing and outlier processing on the public sentiment factors of the disease; the standardized module 407 standardizes the public sentiment factors after data cleansing and outlier processing to obtain new disease data; and the visualization module 408 calculates derived variables of the public sentiment factors of the disease from the new disease data and produces charts from the calculated derived variables for visualization, so as to predict the disease. Because different crawler programs are set for different categories of data source and the multi-threaded crawler programs crawl the disease data related to the input keyword from their corresponding data sources, the parallel crawling improves crawling efficiency, the data format of the crawled disease data is more uniform, and the difficulty of crawling, or the inability to parse crawled data, caused by differences in storage format or other problems among data sources is avoided. Data organization, in-depth analysis, and calculation are performed on the public sentiment factors of the disease; after this refined processing of the crawled disease data, the results are rendered as figures or tables, which presents them more clearly, makes intuitive analysis easier, and provides a reference basis for disease prediction, so that the prediction result is accurate.
The above integrated units implemented in the form of software functional modules may be stored in a computer-readable storage medium. The software functional modules are stored in a storage medium and include several instructions that cause a computer device (which may be a personal computer, a dual-screen device, a network device, or the like) or a processor to execute part of the methods described in the embodiments of the present invention.
Embodiment five
Fig. 5 is a schematic diagram of the terminal provided by Embodiment Five of the present invention.
The terminal 5 includes: a memory 51, at least one processor 52, a computer program 53 stored in the memory 51 and executable on the at least one processor 52, and at least one communication bus 54.
When executing the computer program 53, the at least one processor 52 implements the steps in the above public sentiment data prediction method embodiments; alternatively, when executing the computer program 53, the at least one processor 52 implements the functions of the modules/units in the above device embodiments.
Illustratively, the computer program 53 may be divided into one or more modules/units, which are stored in the memory 51 and executed by the at least one processor 52 to carry out the present invention. The one or more modules/units may be a series of computer program instruction segments capable of completing specific functions, and the instruction segments are used to describe the execution process of the computer program 53 in the terminal 5.
The terminal 5 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. Those skilled in the art will understand that the schematic diagram of Fig. 5 is only an example of the terminal 5 and does not constitute a limitation on the terminal 5, which may include more or fewer components than illustrated, combine certain components, or use different components; for example, the terminal 5 may also include input/output devices, network access devices, buses, and the like.
The at least one processor 52 may be a central processing unit, or another general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The processor 52 may be a microprocessor or any conventional processor. The processor 52 is the control center of the terminal 5 and connects the various parts of the entire terminal 5 using various interfaces and lines.
The memory 51 may be used to store the computer program 53 and/or the modules/units. The processor 52 implements the various functions of the terminal 5 by running or executing the computer program and/or modules/units stored in the memory 51 and by calling data stored in the memory 51. The memory 51 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and the application programs required by at least one function (such as a sound playback function or an image playback function), and the data storage area may store data created through use of the terminal 5 (such as audio data or a phone book). In addition, the memory 51 may include high-speed random access memory, and may also include non-volatile memory such as a hard disk, internal memory, a plug-in hard disk, a smart media card, a secure digital card, a flash card, at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.
If the integrated modules/units of the terminal 5 are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the above method embodiments of the present invention may also be completed by a computer program instructing the relevant hardware. The computer program may be stored in a computer-readable storage medium, and when executed by a processor it can implement the steps of each of the above method embodiments. The computer program includes computer program code, which may be in source code form, object code form, an executable file, or certain intermediate forms. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, computer memory, read-only memory, random access memory, an electrical carrier signal, a telecommunication signal, a software distribution medium, and so on. It should be noted that the content included in the computer-readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in a jurisdiction; for example, in certain jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.
In the several embodiments provided by the present invention, it should be understood that the disclosed terminal and method may be implemented in other ways. For example, the terminal embodiment described above is merely illustrative: the division of the units is only a division by logical function, and other divisions are possible in actual implementation.
In addition, the functional units in the embodiments of the present invention may be integrated in the same processing unit, each unit may exist physically on its own, or two or more units may be integrated in the same unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
It is obvious to those skilled in the art that the present invention is not limited to the details of the above exemplary embodiments, and that the present invention can be realized in other specific forms without departing from its spirit or essential characteristics. Therefore, from whichever point of view, the embodiments are to be regarded as illustrative and not restrictive. The scope of the present invention is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and range of equivalents of the claims are therefore intended to be embraced in the present invention. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, the word "comprising" does not exclude other units, and the singular does not exclude the plural. Multiple units or devices recited in a system claim may also be implemented by a single unit or device through software or hardware. Words such as "first" and "second" are used to denote names and do not denote any particular order.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of the present invention may be modified or equivalently replaced without departing from the spirit and scope of those technical solutions.

Claims (10)

1. A public sentiment data prediction method, characterized in that the method comprises:
receiving at least one keyword of a disease input by a user;
determining data sources on the Internet that are related to the keyword, and crawling disease data related to the keyword from the data sources using a crawler program;
parsing the disease data to obtain public sentiment factors of the disease;
performing data cleaning and outlier processing on the public sentiment factors of the disease;
performing data standardization on the public sentiment factors of the disease after the data cleaning and outlier processing, to obtain new disease data; and
calculating derived variables of the public sentiment factors of the disease from the new disease data, and predicting the disease according to the derived variables.
2. The method of claim 1, characterized in that determining the data sources on the Internet that are related to the keyword, and crawling the disease data related to the keyword from the data sources using the crawler program, comprises:
determining the data sources on the Internet that are related to the keyword, and classifying the data sources according to the type of the data sources;
setting, according to the number of categories into which the data sources are classified, multithreaded crawler programs equal in number to the categories;
crawling, using the multithreaded crawler programs, the disease data related to the keyword from the corresponding data sources respectively.
3. The method of claim 1, characterized in that the method further comprises:
producing charts from the calculated derived variables for visualization, the derived variables comprising: maximum, minimum, mean, variance, standard deviation, covariance, range, median, mode, and quartiles.
4. The method of claim 1, characterized in that the data standardization comprises one or a combination of the following:
sum standardization, standard deviation standardization, maximum standardization, or range standardization.
5. The method of claim 1, characterized in that crawling the disease data related to the keyword from the data sources using the crawler program comprises:
crawling, using the crawler program, the disease data related to the keyword from the data sources within a preset crawling period.
6. The method of claim 1, characterized in that parsing the disease data to obtain the public sentiment factors of the disease comprises:
calculating the total count of all sub public sentiment factors of the disease; calculating the percentage of the total accounted for by each sub public sentiment factor, the percentage being the weight of the corresponding sub public sentiment factor; and determining the sub public sentiment factors whose weight is greater than a preset weight threshold to be the public sentiment factors of the disease.
7. The method of claim 1, characterized in that performing data cleaning and outlier processing on the public sentiment factors of the disease comprises:
performing data cleaning on the public sentiment factors of the disease according to the type of the public sentiment factors of the disease;
performing missing-value replacement on the public sentiment factors of the disease according to the distribution of the public sentiment factors of the disease; or
directly discarding the public sentiment factors of the disease that contain anomalies.
8. A public sentiment data prediction device, characterized in that the device comprises:
a receiving module, configured to receive at least one keyword of a disease input by a user;
a crawling module, configured to determine data sources on the Internet that are related to the keyword, and to crawl disease data related to the keyword from the data sources using a crawler program;
a parsing module, configured to parse the disease data to obtain public sentiment factors of the disease;
a cleaning module, configured to perform data cleaning and outlier processing on the public sentiment factors of the disease;
a standardization module, configured to perform data standardization on the public sentiment factors of the disease after the data cleaning and outlier processing, to obtain new disease data; and
a prediction module, configured to calculate derived variables of the public sentiment factors of the disease from the new disease data, and to predict the disease according to the derived variables.
9. A terminal, characterized in that the terminal comprises a processor and a memory, the processor being configured to implement the public sentiment data prediction method of any one of claims 1 to 7 when executing a computer program stored in the memory.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the public sentiment data prediction method of any one of claims 1 to 7.
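
For illustration only, the weighting step of claim 6, the range standardization named in claim 4, and a few of the derived variables listed in claim 3 might be sketched as follows. This is a minimal sketch and not the claimed implementation: the function names, the sample sub-factor names, and the threshold are hypothetical, the claims do not give formulas, so range standardization is assumed to take the common form (x - min) / (max - min), and the variance is assumed to be the population variance.

```python
from statistics import mean, median, pvariance, pstdev


def select_factors(counts, threshold):
    """Weight each sub-factor by its share of the total count and keep
    those whose weight is greater than the threshold (claim 6 reading)."""
    total = sum(counts.values())
    if total == 0:
        return {}  # nothing was counted, so no factor can pass the threshold
    weights = {name: n / total for name, n in counts.items()}
    return {name: w for name, w in weights.items() if w > threshold}


def range_standardize(values):
    """Assumed form of range standardization: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant series: avoid division by zero
    return [(v - lo) / (hi - lo) for v in values]


def derived_variables(values):
    """A subset of the derived variables listed in claim 3.
    Population variance and standard deviation are an assumption."""
    return {
        "max": max(values),
        "min": min(values),
        "mean": mean(values),
        "variance": pvariance(values),
        "std": pstdev(values),
        "range": max(values) - min(values),
        "median": median(values),
    }


# Hypothetical sub-factor counts parsed from crawled disease data.
counts = {"fever": 60, "cough": 30, "rash": 10}
kept = select_factors(counts, threshold=0.2)
scaled = range_standardize([2, 4, 6])
summary = derived_variables([2, 4, 6])
```

Under these assumptions, a sub-factor accounting for 10% of all counts falls below a 0.2 threshold and is dropped, while the standardized series always lies in the interval [0, 1].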
CN201810351128.0A 2018-04-18 2018-04-18 Public opinion data prediction method, device, terminal and storage medium Active CN108647249B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810351128.0A CN108647249B (en) 2018-04-18 2018-04-18 Public opinion data prediction method, device, terminal and storage medium
PCT/CN2018/100229 WO2019200786A1 (en) 2018-04-18 2018-08-13 Method for forecasting public sentiment data, device, terminal, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810351128.0A CN108647249B (en) 2018-04-18 2018-04-18 Public opinion data prediction method, device, terminal and storage medium

Publications (2)

Publication Number Publication Date
CN108647249A (en) 2018-10-12
CN108647249B (en) 2022-08-02

Family

ID=63746630

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810351128.0A Active CN108647249B (en) 2018-04-18 2018-04-18 Public opinion data prediction method, device, terminal and storage medium

Country Status (2)

Country Link
CN (1) CN108647249B (en)
WO (1) WO2019200786A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110299208A (en) * 2019-05-22 2019-10-01 平安科技(深圳)有限公司 Disease surveillance data exception detection method, system, equipment and storage medium
CN110321342A (en) * 2019-05-27 2019-10-11 平安科技(深圳)有限公司 Business valuation studies method, apparatus and storage medium based on intelligent characteristic selection
CN110569298A (en) * 2019-09-12 2019-12-13 成都中科大旗软件股份有限公司 data docking and visualization method and system
CN110675959A (en) * 2019-08-19 2020-01-10 平安科技(深圳)有限公司 Intelligent data analysis method and device, computer equipment and storage medium
CN111968753A (en) * 2020-08-06 2020-11-20 平安科技(深圳)有限公司 Epidemic situation monitoring method and device, computer equipment and storage medium
CN111986763A (en) * 2020-09-03 2020-11-24 平安国际智慧城市科技股份有限公司 Disease data analysis method and device, electronic device and storage medium

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112749341B (en) * 2021-01-22 2024-03-29 南京莱斯网信技术研究院有限公司 Important public opinion recommendation method, readable storage medium and data processing device
CN113590914B (en) * 2021-06-23 2024-02-20 北京百度网讯科技有限公司 Information processing method, apparatus, electronic device and storage medium
CN116629913B (en) * 2023-07-24 2023-10-03 山东青上化工有限公司 Data extraction system and processing method for compound fertilizer production process

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2335801A1 (en) * 1998-04-29 2002-05-14 Justin Winfield A system and method for text mining
CA2578513A1 (en) * 2006-02-14 2007-08-14 Accenture Global Services Gmbh System and method for online information analysis
CN102043893A (en) * 2009-10-13 2011-05-04 北京大学 Disease pre-warning method and system
US20120233229A1 (en) * 2011-03-03 2012-09-13 Zillian SA Method of Generating Statistical Opinion Data
US20120296974A1 (en) * 1999-04-27 2012-11-22 Joseph Akwo Tabe Social network for media topics of information relating to the science of positivism
CN103577557A (en) * 2013-10-21 2014-02-12 北京奇虎科技有限公司 Device and method for determining capturing frequency of network resource point
CN105653527A (en) * 2014-11-11 2016-06-08 江苏威盾网络科技有限公司 Public sentiment treatment and information deploying method based on web crawler technology
CN105740228A (en) * 2016-01-25 2016-07-06 云南大学 Internet public opinion analysis method
CN106096056A (en) * 2016-06-30 2016-11-09 西南石油大学 A kind of based on distributed public sentiment data real-time collecting method and system
CN106599553A (en) * 2016-11-29 2017-04-26 中国科学院深圳先进技术研究院 Disease early-warning method and device
CN106649270A (en) * 2016-12-19 2017-05-10 四川长虹电器股份有限公司 Public opinion monitoring and analyzing method
CN106951698A (en) * 2017-03-13 2017-07-14 成都育芽科技有限公司 A kind of disease risks forecasting system based on network big data platform
CN107220297A (en) * 2017-05-02 2017-09-29 北京大学 The multi-source heterogeneous automated data acquiistion method and system of software-oriented project
CN107239892A (en) * 2017-05-26 2017-10-10 山东省科学院情报研究所 Region talent's equilibrium of supply and demand quantitative analysis method based on big data
CN107330613A (en) * 2017-06-29 2017-11-07 平安万家医疗投资管理有限责任公司 A kind of public sentiment monitoring method, equipment and computer-readable recording medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170316080A1 (en) * 2016-04-29 2017-11-02 Quest Software Inc. Automatically generated employee profiles

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
XIAOHUI CUI: "Chinese social media analysis for disease surveillance", 《PERSONAL AND UBIQUITOUS COMPUTING》 *
于兴隆等: "基于用户行为的高校BBS热帖预测模型", 《计算机应用与软件》 *
张庆民等: "基于企业污染行为视角的网络舆情危机演化研究――以环境事件为例", 《情报杂志》 *
张长利: "面向特定领域的互联网舆情分析技术研究", 《中国博士学位论文全文数据库 信息科技辑》 *
彭艺等: "一种基于马尔可夫链的微信舆情热度预测模型", 《信息技术》 *
赵青等: "网络舆情研判高校群体性事件的预警监测及引导机制研究", 《经济与社会发展》 *
黄江妙: ""基于数据采集的SPC系统研究"", 《中国优秀硕士学位论文全文库 信息科技辑》 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110299208A (en) * 2019-05-22 2019-10-01 平安科技(深圳)有限公司 Disease surveillance data exception detection method, system, equipment and storage medium
CN110321342A (en) * 2019-05-27 2019-10-11 平安科技(深圳)有限公司 Business valuation studies method, apparatus and storage medium based on intelligent characteristic selection
CN110675959A (en) * 2019-08-19 2020-01-10 平安科技(深圳)有限公司 Intelligent data analysis method and device, computer equipment and storage medium
WO2020215671A1 (en) * 2019-08-19 2020-10-29 平安科技(深圳)有限公司 Method and device for smart analysis of data, and computer device and storage medium
CN110569298A (en) * 2019-09-12 2019-12-13 成都中科大旗软件股份有限公司 data docking and visualization method and system
CN110569298B (en) * 2019-09-12 2023-03-24 成都中科大旗软件股份有限公司 Data docking and visualization method and system
CN111968753A (en) * 2020-08-06 2020-11-20 平安科技(深圳)有限公司 Epidemic situation monitoring method and device, computer equipment and storage medium
CN111986763A (en) * 2020-09-03 2020-11-24 平安国际智慧城市科技股份有限公司 Disease data analysis method and device, electronic device and storage medium
CN111986763B (en) * 2020-09-03 2024-05-14 深圳平安智慧医健科技有限公司 Disease data analysis method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2019200786A1 (en) 2019-10-24
CN108647249B (en) 2022-08-02

Similar Documents

Publication Publication Date Title
CN108647249A (en) Public sentiment data prediction technique, device, terminal and storage medium
WO2022141861A1 (en) Emotion classification method and apparatus, electronic device, and storage medium
Chattopadhyay et al. A Case‐Based Reasoning system for complex medical diagnosis
CA2617954C (en) Method and system for extracting web data
WO2015085154A1 (en) Trend identification and reporting
Jeong et al. Intellectual structure of biomedical informatics reflected in scholarly events
Al Kilani et al. Automatic classification of apps reviews for requirement engineering: Exploring the customers need from healthcare applications
CN104598474B (en) Information recommendation method based on data semantic under cloud environment
Ohniwa et al. Generating process of emerging topics in the life sciences
Hsu et al. Topic analysis of studies on total quality management and business excellence: an update on research from 2010 to 2019
Rahkovsky et al. AI research funding portfolios and extreme growth
Poluru et al. Applications of Domain-Specific Predictive Analytics Applied to Big Data
Ali et al. Big data sentiment analysis of Twitter data
He et al. Research on the dynamic monitoring system model of university network public opinion under the big data environment
CN112733538B (en) Ontology construction method and device based on text
CN117786131A (en) Industrial chain safety monitoring analysis method, medium and equipment
CN111221881B (en) User characteristic data synthesis method and device and electronic equipment
Handali et al. Industry demand for analytics: A longitudinal study
Gabriel et al. Summarizing dynamic social tagging systems
US10459925B2 (en) Computer-enabled method of assisting to generate an innovation
CN116756373A (en) Project review expert screening method, system and medium based on knowledge graph update
KR20160121132A (en) Analysis apparatus and method for product trends and sale based on social big data
Voronov et al. Forecasting popularity of news article by title analyzing with BN-LSTM network
Mukherjee et al. Self-organization of the Sound Inventories: Analysis and Synthesis of the Occurrence and Co-occurrence Networks of Consonants
Savić et al. Analysis of co-authorship networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant