CN105068991A

CN105068991A - Big data based public sentiment discovery method

Info

Publication number: CN105068991A
Application number: CN201510458540.9A
Authority: CN
Inventors: 肖会
Original assignee: CHENGDU DINGZHIHUI SCIENCE AND TECHNOLOGY Co Ltd
Current assignee: CHENGDU DINGZHIHUI SCIENCE AND TECHNOLOGY Co Ltd
Priority date: 2015-07-30
Filing date: 2015-07-30
Publication date: 2015-11-18

Abstract

The invention proposes a big data based public sentiment discovery method. The method comprises: step S100, data acquisition: performing data acquisition on network data based on a distributed cloud computing mode; step S200, data preprocessing: performing preprocessing on the network data acquired in the step S100, performing word segmentation and part-of-speech tagging processing on the acquired network data first, then performing punctuation mark processing, expression character processing and stop word processing and finally obtaining a feature item used for representing a text; step S300, topic extraction: based on the preprocessed network data, extracting a topic from the network data; and step S400, public sentiment analysis: performing public sentiment analysis based on the topic extracted in the step S300. The method adopts the distributed cloud computing mode and can perform mining and analysis on various network data acquired on a large scale.

Description

A big data based public sentiment discovery method

Technical field

The present invention relates to the field of data processing, and in particular to a big data based public sentiment discovery method.

Background art

With the development of Web 2.0 technology, the internet has changed profoundly. It has evolved from a collection of static web pages and information into a platform for displaying the "collective intelligence" in which everyone takes part. Through blogs, microblogs, BBS forums, SNS sites, news comments and the like, internet users can freely publish their own views and comment on any event. In an interconnected age, any opinion or idea may influence a large number of people and give rise to online public opinion. A growing number of events show that online public opinion now shapes the direction of public opinion and has even become one of its main forms. The social conditions and public will that people express online, consciously or unconsciously, and the concerns, views and value orientations that internet users voice about social hot spots, are of increasing value for research and reference.

In recent years, the detection and extraction of hot topics and hot events has become one branch of topic detection and tracking research. Feature groups or behavior phrases that characterize hot spots are obtained from internet data, mainly news, blogs, forums, social networking sites and search logs, and are then abstracted to yield hot topics. Because the topics and data on blogs, microblogs and social networking sites grow daily, hot topics can serve as a reference for users' search keywords; the hot topic recommendations of the Baidu search engine, for example, are continuously updated. This undoubtedly helps hot news become the focus of public opinion, broadly influences what the public pays attention to, guides popular opinion, and to some extent reflects the social and political attitudes of ordinary people. Given the diversified demand for expressing opinion online, negative online public opinion that is left unguided can harm public safety.

In summary, analyzing online public opinion, keeping track of the public sentiment situation in real time and fostering a positive online environment are of practical significance for building a harmonious socialist society. At present, however, few methods for the internet are based on text opinion mining and classification, and existing methods still fall short of the requirements of discovering hot spots promptly and of analyzing, processing and assessing public opinion. It is therefore necessary to provide a method based on text opinion mining and classification. Applied to the timely discovery and monitoring of hot topics in online public sentiment, such a method helps the public follow social hot spots in real time and provides auxiliary support for policy-making by government and relevant departments. Better organization of information ensures that public sentiment information is timely, accurate and comprehensive, and can effectively prevent the spread of harmful information.

In addition, with the rapid development of applications such as the mobile internet and the Internet of Things, the global volume of data has grown explosively, which signals the arrival of the big data era. In the prior art, big data is processed on platforms based on Hadoop. Hadoop is an open-source distributed computing platform whose core includes HDFS (Hadoop Distributed File System). The advantages of HDFS, chiefly high fault tolerance and high scalability, allow users to deploy Hadoop on inexpensive hardware to build a distributed cluster and form a distributed system. HBase (Hadoop Database) is a distributed database system built on top of HDFS that provides highly reliable, high-performance, column-oriented, scalable storage with real-time reads and writes; it is mainly used to store unstructured and semi-structured data.

Summary of the invention

To solve the problems in the prior art, the present invention proposes a big data based public sentiment discovery method.

The big data based public sentiment discovery method proposed by the present invention comprises:

Step S100, data acquisition: network data is collected in a distributed cloud computing mode. The data acquisition is implemented by a web crawler, and the collected network data is stored in a distributed storage device implemented on HDFS;

Step S200, data preprocessing: the network data collected in step S100 is preprocessed. Word segmentation and part-of-speech tagging are performed first, followed by punctuation processing, emoticon character processing and stop-word processing, finally yielding the feature items used to represent a text;

Step S300, topic extraction: topics are extracted from the preprocessed network data;

Step S400, public sentiment analysis: public sentiment analysis is performed on the topics extracted in step S300.

Step S200 further comprises:

High-quality word selection: each feature item obtained in step S200 carries an implicit quality value that reflects the contribution of the feature item to the text. The quality Q(t) of a feature item t is expressed as:

Q(t) = l_t^{2} \left( \sum_{i=1}^{N} f_i^{2} - \frac{1}{N} \left( \sum_{i=1}^{N} f_i \right)^{2} \right),

where N is the total number of documents, f_i is the number of times feature item t occurs in document i, and l_t is the length of feature item t.

A threshold Q is set; feature items with Q(t) > Q are retained, and all others are deleted.

Step S300 comprises: performing text clustering on the text obtained from the preprocessing of step S200. A hierarchical clustering algorithm is used to compute the mean of the text objects in each class, which gives the initial cluster centers for the k-means algorithm. The k-means algorithm then recomputes the distance between each text object and the cluster centers and corrects the class assignments produced by hierarchical clustering. The algorithm steps are as follows:

(1) determine the number k of cluster centers;

(2) perform hierarchical clustering on the data set to obtain the means of k classes, and use them as the initial cluster centers for k-means;

(3) compute the distance between each text object and each cluster center, and assign each text object to the cluster represented by its nearest cluster center;

(4) recompute the center of each cluster from the resulting assignment;

(5) repeat (3) and (4) until the class of every text object no longer changes.

The classes obtained from the clustering are defined as topics.

Preferably, the present invention further comprises:

Determining hot topics: the hotness of each topic obtained in step S300 is calculated by the following formula:

R_i = \alpha_1 \cdot RF_i + \alpha_2 \cdot RT_i + \alpha_3 \cdot CN_i + \alpha_4 \cdot DN_i,

where R_i is the hotness of topic i; RF_i is the report frequency of topic i; RT_i is the ratio of the number of days on which topic i is reported to the total number of days within a predetermined period of N days; CN_i is the number of click-reads by internet users on topic i within the predetermined number of days; DN_i is the number of comments by internet users on topic i within the predetermined number of days; and α_1, α_2, α_3, α_4 are weight coefficients. When R_i exceeds a given threshold R, topic i is determined to be a hot topic.

Preferably, the network data comprises data of several categories: blogs, microblogs, forums and news report web pages. Topic extraction is performed independently on the data of each category. Let the hot topic sets extracted from the blog, microblog, forum and news report web page data be BLOG, M-BLOG, BBS and NEWS respectively. The intersection of BLOG, M-BLOG, BBS and NEWS is defined as the first hot topic set. The intersection of every three of the sets BLOG, M-BLOG, BBS and NEWS is computed, and the union of all the results, minus the first hot topic set, is defined as the second hot topic set. The intersection of every two of the sets is computed, and the union of all the results, minus the first and second hot topic sets, is defined as the third hot topic set. The union of BLOG, M-BLOG, BBS and NEWS, minus the first, second and third hot topic sets, is defined as the fourth hot topic set.

The present invention adopts a distributed cloud computing mode and can mine and analyze the many kinds of network data collected on a large scale. By analyzing the data of each source separately, it obtains the hot topics of the different data sources and then determines the hotness of each topic, so that current hot topics are obtained more objectively. The present invention provides automated, systematic and scientific information support to Party and government bodies, large enterprises and other organizations for discovering sensitive online information in time, grasping the focus of online public opinion, following its trend and responding to public opinion crises. It effectively improves the accuracy of the judgments made by an online public sentiment monitoring system and provides a more truthful and accurate basis for the subsequent processing of online and WeChat public sentiment information.

Brief description of the drawings

Fig. 1 is a flowchart of the public sentiment analysis of the present invention.

Detailed description

The technical solution of the present invention is described clearly and completely below with reference to the accompanying drawings. Exemplary embodiments are described in detail here, and examples of them are shown in the drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the present invention, as detailed in the appended claims.

Referring to Fig. 1, the present invention proposes a big data based public sentiment discovery method. The following embodiment is described mainly using microblogs as the example data source; where the data source is blogs, news report web pages, forums or the like, those skilled in the art can implement the method according to the given method and techniques known in the art.

Step S100, data acquisition

Network data is collected in a distributed cloud computing mode. The network data comprises data of several categories: blogs, microblogs, forums and news report web pages. The network data is labeled by these categories and stored separately for each category. Here, news report web pages are the news pages provided by portal sites such as Tencent News and Sina News and by news media websites such as People's Daily.

The data acquisition is implemented by a web crawler. The collected network data is stored in a distributed storage device implemented on HDFS.

Step S200, data preprocessing: the network data collected in step S100 is preprocessed.

Word segmentation and part-of-speech tagging are first performed on the collected network data.

Because microblog text is free in format, short and fragmented, its content needs to be normalized to facilitate analysis.

(1) Punctuation processing

In microblogs, punctuation marks are often combined or repeated. Although this usage does not conform to grammatical rules, it is frequently used in microblogs to express emotion that a single mark cannot convey. For example, "~" often expresses a rise, fall or lingering of emotion, as in "very glad~~~!". To reflect the tone and intensity of the author when processing microblog text, combined and repeated punctuation marks need to be marked.
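The marking of combined and repeated punctuation can be sketched with a regular expression. This is only an illustration: the character set, the function names and the `[PUNCT:...]` marker format are assumptions, not part of the described method.

```python
import re

# Assumed character set: common ASCII and full-width punctuation marks.
# A run is two or more of these marks in a row, e.g. "~~~!" or "？！".
PUNCT_RUN = re.compile(r"[~!?.,;:…～！？。，；：]{2,}")

def find_punctuation_runs(text):
    """Return (start, end, run) for every combined or repeated punctuation run."""
    return [(m.start(), m.end(), m.group()) for m in PUNCT_RUN.finditer(text)]

def tag_punctuation_runs(text):
    """Wrap each run in a marker so later steps can treat it as a single feature."""
    return PUNCT_RUN.sub(lambda m: "[PUNCT:" + m.group() + "]", text)
```

Keeping the original run inside the marker preserves its length, which could later serve as a rough indicator of intensity.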

(2) Emoticon character processing

Microblogs often contain characters or character combinations that depict facial expressions; for example, "--!" expresses embarrassment. Such symbols usually carry the emotion or attitude of the microblog user. They are marked by means of a mapping table that covers the main emoticons. Because these emoticon characters fall into a few classes with few exceptions, mappings for most of them are established statistically, then evaluated and annotated manually and assigned sentiment polarity weights.
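A mapping table of this kind might look like the following sketch. Apart from "--!", which the text gives as an example, the table entries, labels and polarity weights are hypothetical values chosen for illustration.

```python
import re

# Hypothetical mapping table: emoticon -> (label, sentiment polarity weight).
# Only "--!" comes from the text; the other entries and all weights are assumed.
EMOTICON_TABLE = {
    "--!": ("embarrassed", -0.5),
    ":)": ("happy", 0.8),
    ":(": ("sad", -0.8),
    "T_T": ("crying", -0.9),
}

def annotate_emoticons(text, table=EMOTICON_TABLE):
    """Return (position, emoticon, label, weight) for each emoticon in text."""
    # Try longer emoticons first so a short entry cannot shadow a longer one.
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return [(m.start(), m.group(), *table[m.group()]) for m in pattern.finditer(text)]

def emoticon_polarity(text, table=EMOTICON_TABLE):
    """Sum the polarity weights of all emoticons found in text."""
    return sum(weight for _, _, _, weight in annotate_emoticons(text, table))
```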

(3) Stop-word processing

In SVM-based automatic classification, stop words are function words with no substantive meaning and neutral words with little power to discriminate between classes; such words occur relatively frequently. Handling them separately speeds up word segmentation and improves the quality of subsequent analysis, although these words still need to be marked during segmentation. Processing stop words correctly requires a stop-word list, and the construction of that list and the identification of stop words are the keys to stop-word processing. At present, stop-word lists are built either manually or by statistics-based machine learning. The machine learning approach counts high-frequency words in a corpus and constructs the stop-word list automatically, or obtains stop words from a preliminary segmentation and keeps updating and verifying them in subsequent segmentation.
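The statistical construction of a stop-word list can be sketched as follows. The text only says that high-frequency words are counted; using document frequency with a cutoff ratio (`min_doc_ratio`) is an assumption made here for illustration.

```python
from collections import Counter

def build_stop_word_list(tokenized_docs, min_doc_ratio=0.8):
    """Treat words that occur in at least min_doc_ratio of the documents as stop words.

    tokenized_docs is a list of token lists (the output of word segmentation).
    """
    n_docs = len(tokenized_docs)
    if n_docs == 0:
        return set()
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each word once per document
    return {w for w, df in doc_freq.items() if df / n_docs >= min_doc_ratio}

def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop-word list."""
    return [t for t in tokens if t not in stop_words]
```

In practice an automatically built list would normally be reviewed and merged with a manually maintained one, as the text indicates.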

Step S300, topic extraction: hot topics are extracted from the preprocessed network data.

Microblog posts are short and concise: a standard post is limited to 140 characters, so the meaning must be expressed within that limit. Each microblog post is treated as a document, and since posts number in the thousands upon thousands, the document set is very large. Hot topic discovery means classifying this huge volume of microblog information, finding the focus of public discussion, examining the sentiment the public holds and keeping abreast of public demands.

Text clustering is the basis of topic discovery and tracking. Good clustering produces accurate classes and reduces the occurrence of oversized classes. Text clustering mainly comprises four steps: feature extraction, text representation, text clustering and opinion mining.

The features of a text are the words and characters in it that convey the information and semantics of the text. The features of unprocessed text are not obvious and must be extracted from the text by some method. The purpose of feature extraction is to make the dimension of the text vector as small as possible, so that the computer processes the information faster and more efficiently. Feature extraction comprises two main steps: preprocessing and high-quality word selection. Document preprocessing has three main parts: stop-word removal, part-of-speech filtering and invalid-word filtering. Stop-word removal first builds a stop-word dictionary and then, by matching against it, removes words that occur very frequently but carry no real meaning. Part-of-speech filtering builds on part-of-speech tagging and handles words of different kinds separately. Experiments show that adjectives and adverbs do little to improve clustering, so they are removed and only nouns, verbs and abbreviations are retained.
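The three preprocessing parts can be sketched as a single filter over tagged tokens. The tag labels `"n"`, `"v"` and `"j"` (noun, verb, abbreviation) and the rule for what counts as an invalid word are assumptions; the text does not specify a tag set or a definition of invalid words.

```python
# Assumed part-of-speech labels: "n" noun, "v" verb, "j" abbreviation.
KEEP_POS = {"n", "v", "j"}

def is_invalid(word):
    """Assumed rule: a word is invalid if it has no letter or digit in it."""
    return not any(ch.isalnum() for ch in word)

def preprocess(tagged_tokens, stop_words, keep_pos=KEEP_POS):
    """Apply stop-word removal, part-of-speech filtering and invalid-word filtering.

    tagged_tokens is a list of (word, pos) pairs from segmentation and tagging.
    """
    kept = []
    for word, pos in tagged_tokens:
        if word in stop_words:      # stop-word removal
            continue
        if pos not in keep_pos:     # drop adjectives, adverbs and other classes
            continue
        if is_invalid(word):        # invalid-word filtering
            continue
        kept.append(word)
    return kept
```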

The vocabulary that remains after preprocessing is still very large, so a second step, high-quality word selection, is required. Each feature item in a document carries an implicit quality value. The quality value is based mainly on the word-frequency characteristics of the feature item and reflects its contribution to the text. The larger the quality, the larger the contribution, and the feature item is kept for text clustering; otherwise it is discarded.

The quality Q(t) of a feature item t is expressed as:

Q(t) = l_t^{2} \left( \sum_{i=1}^{N} f_i^{2} - \frac{1}{N} \left( \sum_{i=1}^{N} f_i \right)^{2} \right),
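The quality formula can be evaluated directly from the per-document frequencies of a feature item. Note that the bracketed term equals N times the variance of f_i across documents, so a feature item that is spread evenly over all documents scores zero while one concentrated in a few documents scores high. The function names and the sample threshold below are illustrative assumptions.

```python
def feature_quality(term, freqs):
    """Q(t) = l_t^2 * (sum(f_i^2) - (1/N) * (sum(f_i))^2).

    term  : the feature item t; its length l_t is taken as the character count
    freqs : list of N counts, freqs[i] being occurrences of t in document i
    """
    n = len(freqs)
    if n == 0:
        return 0.0
    sum_of_squares = sum(f * f for f in freqs)
    square_of_sum = sum(freqs) ** 2
    return len(term) ** 2 * (sum_of_squares - square_of_sum / n)

def select_features(term_freqs, threshold):
    """Keep the feature items whose quality exceeds the threshold Q."""
    return {t for t, freqs in term_freqs.items() if feature_quality(t, freqs) > threshold}
```

For example, a five-character term that occurs three times in one of four documents has quality 25 × (9 − 9/4) = 168.75, whereas a term that occurs once in every document has quality 0.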

The vector space model represents a text in a multidimensional space spanned by the feature items, with each feature item forming one dimension of the space. A text can thus be expressed as:

D = {t_1, t_2, ..., t_n}, where n is the number of feature items.
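A minimal sketch of this representation follows. The text does not state how each dimension is weighted; raw term frequency is assumed here, and the function name is illustrative.

```python
def build_vectors(docs, features):
    """Map each document to a vector with one dimension per feature item.

    docs     : list of token lists
    features : ordered list of the retained feature items t_1..t_n
    Each component holds the raw frequency of that feature item (an assumption).
    """
    index = {t: j for j, t in enumerate(features)}
    vectors = []
    for tokens in docs:
        vec = [0] * len(features)
        for tok in tokens:
            j = index.get(tok)
            if j is not None:   # tokens outside the feature set are ignored
                vec[j] += 1
        vectors.append(vec)
    return vectors
```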

The k-means clustering algorithm is a widely used clustering algorithm. It takes the mean of all samples in each cluster as the cluster center and iteratively assigns the data to different classes so that the function used to evaluate clustering performance reaches its optimum. The algorithm clusters long documents with continuous attributes well, but it is very sensitive to noise and outliers: a small amount of such data can strongly affect the mean. For data with discrete attributes and sparse features, such as microblog text, its clustering results are unsatisfactory.

Hierarchical clustering is another widely used clustering algorithm. It decomposes a given set of data objects hierarchically and, depending on how the hierarchy is formed, is divided into agglomerative and divisive hierarchical clustering. Hierarchical clustering controls noise effectively and reduces the influence of outliers on the clustering result. It also has a defect, however: once a split or merge has been performed during clustering, it cannot be revised. If a step goes wrong during clustering, the error propagates, and the iterative operations of the hierarchy cause errors to accumulate, leading to poor clustering results.

To overcome the defects of both algorithms while making full use of their respective strengths in text clustering, the two algorithms are combined. The hierarchical clustering algorithm is used to compute the mean of the text objects in each class, which gives the initial cluster centers for the k-means algorithm. The k-means algorithm then recomputes the distance between each text object and the cluster centers and corrects the class assignments produced by hierarchical clustering. The algorithm steps are as follows:

(1) determine the number k of cluster centers;

(2) perform hierarchical clustering on the data set to obtain the means of k classes, and use them as the initial cluster centers for k-means;

(3) compute the distance between each text object and each cluster center, and assign each text object to the cluster represented by its nearest cluster center;

(4) recompute the center of each cluster from the resulting assignment;

(5) repeat (3) and (4) until the class of every text object no longer changes.

A text object in the above algorithm is a document (text d) in the network data.

The classes obtained from the above clustering are defined as topics.
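The combined procedure, hierarchical clustering to seed k-means followed by k-means refinement, can be sketched in plain Python. The text does not specify the linkage criterion or the distance measure; this sketch assumes agglomerative merging of the two clusters with the closest means and Euclidean distance, and all function names are illustrative. It is written for clarity rather than speed and is suitable only for small inputs.

```python
import math

def mean(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def dist(a, b):
    """Euclidean distance (assumed; the text does not name a distance measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hierarchical_means(points, k):
    """Step (2): merge the two clusters with the closest means until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist(mean(clusters[i]), mean(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [mean(c) for c in clusters]

def kmeans(points, initial_centers, max_iter=100):
    """Steps (3) to (5): reassign and recompute until assignments stop changing."""
    centers = [list(c) for c in initial_centers]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda c: dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = mean(members)
    return labels, centers

def extract_topics(points, k):
    """Each resulting cluster label corresponds to one topic."""
    return kmeans(points, hierarchical_means(points, k))
```

Here `points` would be the document vectors from the vector space model, and k must be between 1 and the number of documents.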

A hot topic is a body of information that uses the internet as its medium, draws wide public attention, can spread widely within a very short time and persists for a period, and reflects online public sentiment. It also includes a semantic description of the topic and the ways in which the topic spreads. An internet hot topic is usually information followed by many internet users, and related information appears on the network with high frequency. Hot words describe the general theme of a hot topic fairly directly. A hot topic must be described by several characteristic quantities, and these quantities share a certain similarity.

The characteristic quantities of a hot topic are:

The report frequency of the topic. For an important topic, the media publish more reports than usual, so the report frequency also influences the attention the topic receives. It is the ratio of the number of reports on a given topic within a period to the total number of reports; the larger the ratio, the more attention the topic receives.

The duration of the topic. If the media report on a hot topic over a longer period and internet users discuss it for a longer time, the topic receives more attention. Each topic has its own span of attention. The start time of a topic is defined as the time at which the topic first appears, and the extinction time as the moment at which the number of reports on the topic falls below a certain threshold. The time span of the topic is therefore the difference between its start time and its extinction time.

The read count of the topic. Because reports on a hot topic mostly come from websites, the more internet users click to read reports related to the topic, the more attention the topic receives. The number of click-reads of the related reports can therefore be used to measure the attention paid to the topic.

The comment count of the topic. The more comments internet users post on a hot topic, the more attention it draws, so the comment count is also a factor affecting the relevance of the topic.

For each text in this application, the report time of the text (for example the publication time of a news report, blog post, microblog post or forum post), its number of clicks and its number of comments are recorded. The report frequency and the duration of a topic obtained by clustering can be determined from the report times of its texts, its read count from the click counts of its texts, and its comment count from the comment counts of its texts.

RF_i: the report frequency of topic i;

RT_i: the ratio of the number of effective report days for topic i to the total number of days within a predetermined period of N days; a day counts as an effective report day for topic i when the number of reports on topic i on that day exceeds a certain threshold;

CN_i: the number of click-reads by internet users on topic i within the predetermined number of days;

DN_i: the number of comments by internet users on topic i within the predetermined number of days.

The topic hotness is calculated by the formula:

R_i = \alpha_1 \cdot RF_i + \alpha_2 \cdot RT_i + \alpha_3 \cdot CN_i + \alpha_4 \cdot DN_i,

where R_i is the hotness of topic i and α_1, α_2, α_3, α_4 are weight coefficients. When R_i exceeds a given threshold R, topic i is determined to be a hot topic.
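The hotness calculation can be sketched as follows. The weight values, the thresholds and the dictionary layout used for the topic features are hypothetical; the text leaves all of them to the implementer. Because RF_i and RT_i are ratios while CN_i and DN_i are raw counts, the weights in practice also have to balance these different scales.

```python
def effective_day_ratio(daily_report_counts, min_reports):
    """RT_i: share of days on which the report count exceeds min_reports."""
    if not daily_report_counts:
        return 0.0
    effective = sum(1 for c in daily_report_counts if c > min_reports)
    return effective / len(daily_report_counts)

def topic_hotness(rf, rt, cn, dn, weights):
    """R_i = a1*RF_i + a2*RT_i + a3*CN_i + a4*DN_i."""
    a1, a2, a3, a4 = weights
    return a1 * rf + a2 * rt + a3 * cn + a4 * dn

def hot_topics(topics, weights, threshold):
    """Return the names of the topics whose hotness exceeds the threshold R.

    topics maps a topic name to a dict with keys "RF", "RT", "CN", "DN"
    (a hypothetical layout chosen for this sketch).
    """
    return [
        name for name, f in topics.items()
        if topic_hotness(f["RF"], f["RT"], f["CN"], f["DN"], weights) > threshold
    ]
```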

The present invention performs topic extraction independently on the data of blogs, microblogs, forums and news report web pages. Let the hot topic sets extracted from the blog, microblog, forum and news report web page data be BLOG, M-BLOG, BBS and NEWS respectively. The intersection of BLOG, M-BLOG, BBS and NEWS is defined as the first hot topic set. The intersection of every three of the sets BLOG, M-BLOG, BBS and NEWS is computed, and the union of all the results, minus the first hot topic set, is defined as the second hot topic set. The intersection of every two of the sets is computed, and the union of all the results, minus the first and second hot topic sets, is defined as the third hot topic set. The union of BLOG, M-BLOG, BBS and NEWS, minus the first, second and third hot topic sets, is defined as the fourth hot topic set.

Because the hot spots reflected by blogs, microblogs, forums and news report web pages may differ, content that all four follow at the same time should be the hottest. Content followed by three of them at the same time comes next, content followed by two of them comes after that, and content followed by only one of them has relatively the lowest hotness.
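The four tiers can be computed with ordinary set operations. The sketch below generalizes to any number of sources; the function name and the sample topic labels are assumptions for illustration, and topics from different sources are assumed to have already been matched to common identifiers.

```python
from itertools import combinations

def tier_hot_topics(sources):
    """Split hot topics into disjoint tiers by how many sources share them.

    sources maps a source name (e.g. "BLOG") to its set of hot topics.
    Returns a list of sets: tiers[0] holds topics present in all sources,
    tiers[1] those in all but one, and so on down to a single source.
    """
    sets = [set(s) for s in sources.values()]
    tiers = []
    seen = set()
    for size in range(len(sets), 0, -1):
        covered = set()
        for combo in combinations(sets, size):
            covered |= set.intersection(*combo)  # intersection of this group
        tiers.append(covered - seen)             # subtract the higher tiers
        seen |= covered
    return tiers
```

With four sources, the returned list corresponds to the first, second, third and fourth hot topic sets in that order.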

Various public sentiment analysis tasks can then be carried out on the topics thus extracted and determined.

Other embodiments of the present invention will readily occur to those skilled in the art after considering the specification and practicing the invention disclosed herein. This application is intended to cover any variations, uses or adaptations of the present invention that follow its general principles and include common knowledge or customary technical means in the art that are not disclosed herein.

It should be understood that the present invention is not limited to the precise structure described above and shown in the drawings, and that various modifications and changes can be made without departing from its scope. The scope of the present invention is limited only by the appended claims.

Claims

1. A big data based public sentiment discovery method, comprising:

Step S300, topic extraction: extracting topics from the preprocessed network data;

2. The big data based public sentiment discovery method as claimed in claim 1, wherein step S200 further comprises:

Q(t) = l_t^{2} \left( \sum_{i=1}^{N} f_i^{2} - \frac{1}{N} \left( \sum_{i=1}^{N} f_i \right)^{2} \right),

3. The big data based public sentiment discovery method as claimed in claim 1, wherein step S300 comprises: performing text clustering on the text obtained from the preprocessing of step S200; using a hierarchical clustering algorithm to compute the mean of the text objects in each class to obtain the initial cluster centers for the k-means algorithm; and using the k-means algorithm to recompute the distance between each text object and the cluster centers and to correct the class assignments produced by hierarchical clustering, the algorithm steps being as follows:

(1) determine the number k of cluster centers;

(4) recompute the center of each cluster from the resulting assignment;

The classes obtained from the clustering are defined as topics.

4. The big data based public sentiment discovery method as claimed in claim 3, further comprising:

R_i = \alpha_1 \cdot RF_i + \alpha_2 \cdot RT_i + \alpha_3 \cdot CN_i + \alpha_4 \cdot DN_i,

5. The big data based public sentiment discovery method as claimed in claim 1, wherein the network data comprises data of several categories: blogs, microblogs, forums and news report web pages; topic extraction is performed independently on the data of each category; the hot topic sets extracted from the blog, microblog, forum and news report web page data are denoted BLOG, M-BLOG, BBS and NEWS respectively; the intersection of BLOG, M-BLOG, BBS and NEWS is defined as the first hot topic set; the intersection of every three of the sets BLOG, M-BLOG, BBS and NEWS is computed, and the union of all the results, minus the first hot topic set, is defined as the second hot topic set; the intersection of every two of the sets is computed, and the union of all the results, minus the first and second hot topic sets, is defined as the third hot topic set; and the union of BLOG, M-BLOG, BBS and NEWS, minus the first, second and third hot topic sets, is defined as the fourth hot topic set.