CN104537097A

CN104537097A - Microblog public opinion monitoring system

Info

Publication number: CN104537097A
Application number: CN201510009995.2A
Authority: CN
Inventors: 张鹏
Original assignee: BEIJING BLTSFE INFORMATION TECHNOLOGY Co Ltd
Current assignee: Shanghai Keyi Culture Communication Co.,Ltd.
Priority date: 2015-01-09
Filing date: 2015-01-09
Publication date: 2015-04-22
Anticipated expiration: 2035-01-09
Also published as: CN104537097B

Abstract

The invention discloses a microblog public opinion monitoring system comprising a public opinion heat acquisition module, an intelligent crawler module, an extraction and preprocessing module, a feature phrase filtering module, a public opinion analysis module, a sentiment orientation analysis module and a user interaction module. Using a distributed cloud computing mode, the system obtains microblog public opinion hotspots through multiple microblog public opinion monitoring algorithms and then comprehensively judges, classifies and assesses the hotspots obtained, so that hot microblog public opinion topics are monitored efficiently and accurately.

Description

Microblog public opinion monitoring system

Technical field

The present invention relates to the field of Internet information processing technology, and in particular to a microblog public opinion monitoring system.

Background

With the rapid development of the Internet worldwide, online media have come to be recognized as the "fourth medium" after newspapers, radio and television, and the network has become one of the main carriers reflecting public opinion in society.

Online public opinion consists of the influential and opinionated emotions, attitudes, suggestions, remarks or viewpoints that the public holds about hot and focal issues in real life and spreads over the Internet. It is formed and reinforced mainly through posts and follow-up comments on forums (BBS), news, blogs and the like. Because the Internet is virtual, concealed, diverse, pervasive and random, more and more netizens are willing to express their views and spread their ideas through this channel.

With the rapid development of Internet technology, a new generation of media represented by microblogs has broken the control and monopoly of information. On the network, people freely express their own attitudes and opinions and no longer accept information unconditionally as readily as in the past; instead, the interest demands of different social strata emerge one after another and different ideas collide head-on. For the government departments concerned, how to understand microblog public opinion promptly and accurately, strengthen its timely monitoring and guide it effectively has become a major difficulty in managing online microblog public opinion. Under these circumstances it is necessary to build a microblog public opinion monitoring system that covers microblog data sources. Such a system makes it possible, for the new microblog media environment, to study in greater depth the hotspot analysis methods for microblog public opinion and the impact brought by new media, and thereby to enrich and improve research on microblog public opinion.

Although many organizations have proposed different solutions for monitoring online microblog public opinion, the technical problem that those skilled in the art still need to solve is how to improve the efficiency and accuracy of judging microblog public opinion information, because so far there has been no sufficiently efficient and accurate online public opinion monitoring system for microblog media data.

In the prior art, the data sources of online public opinion monitoring are generally various websites or forums, and monitoring systems dedicated to microblog public opinion data are comparatively rare. Even the systems dedicated to microblog public opinion data often suffer from low accuracy or low efficiency for various reasons. The present invention therefore proposes a monitoring system dedicated to public opinion data from microblog network data sources.

Compared with the prior art, the present invention has the following advantages:

First, the microblog public opinion monitoring system of the present invention is oriented to microblog network resources. The collected microblog data passes through the data processing steps of public opinion heat acquisition, intelligent crawling, extraction and preprocessing, feature phrase filtering, public opinion analysis and sentiment orientation analysis, which effectively improves the efficiency of filtering public opinion data from microblog network data sources.

Second, the distributed cloud computing mode makes it possible to mine and analyze large-scale collected data, and microblog public opinion hotspots can be obtained from multiple microblog public opinion monitoring algorithm modules and then comprehensively judged and graded. This enables the discovery and tracking of hot microblog public opinion topics, social network analysis of microblogs and visual presentation of the analysis results, and gives Party and government organs, large enterprises and other organizations automated, systematic and scientific information support for discovering sensitive microblog information in time, grasping microblog public opinion hotspots, following public opinion trends and responding to public opinion crises. The judgment accuracy of the microblog public opinion monitoring system is effectively improved, which provides a more truthful and accurate basis for the subsequent processing of microblog public opinion information.

Summary of the invention

In view of the shortcomings described in the background above, the present invention proposes a public opinion monitoring system for microblog media that achieves a higher accuracy. The object of the present invention is achieved by the following technical measures.

The present invention proposes a microblog public opinion monitoring system comprising: a public opinion heat acquisition module 1, an intelligent crawler module 2, an extraction and preprocessing module 3, a feature phrase filtering module 4, a public opinion analysis module 5, a sentiment orientation analysis module 6 and a user interaction module 7, wherein:

The public opinion heat acquisition module 1 screens, according to the public opinion heat weight of each microblog, the microblog pages that need public opinion analysis;

The intelligent crawler module 2 crawls the microblog data of the specified microblog pages within a specified time, analyzes the crawled microblog data according to predefined events, and filters out the microblog data irrelevant to the public opinion to be monitored;

The extraction and preprocessing module 3 extracts and preprocesses the information in the microblog data obtained by the intelligent crawler module 2;

The feature phrase filtering module 4 filters and screens the feature phrases in the microblog data processed by the extraction and preprocessing module 3;

The public opinion analysis module 5 discovers microblog public opinion hotspots based on the microblog data processed by the feature phrase filtering module 4;

The sentiment orientation analysis module 6 performs sentiment orientation analysis on the discovered microblog public opinion hotspots;

The user interaction module 7 displays the microblog public opinion analysis results as charts or reports and implements the user interaction functions.

Preferably, the public opinion heat acquisition module 1 calculates the public opinion heat weight ρ of a microblog; if ρ is greater than a preset threshold Tρ, the microblog is used as a data source and analysis basis for public opinion analysis. Specifically:

Suppose the number of views of the microblog is K1, the number of comments is K2, the number of replies is K3, the number of supporting clicks is K4, the number of opposing clicks is K5, the number of reposts is K6 and the number of favorites is K7, and β1 to β4 are preset, adjustable coefficients; then

ρ = (lg(K1^(3/4)) + 0.03)·β1 + (lg(K2^(2/3) + K3^(2/3)) + 0.02)·β2 + (lg(K4^(1/2) + K5^(1/2)) + 0.01)·β3 + (lg(K6^(1/3) + K7^(1/3)) + 0.005)·β4;

where β1 to β4 may be set to: β1 = 0.4; β2 = 0.2; β3 = 0.1; β4 = 0.1.
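The heat weight calculation can be sketched in Python as follows. This is only an illustration under stated assumptions: each bracketed term is read as the base-10 logarithm of a sum of fractional powers (the first term as lg of K1 raised to 3/4), all counts are assumed large enough that every logarithm is defined, and the function names and the example threshold value are hypothetical, not taken from the specification.

```python
import math
from typing import Sequence


def heat_weight(k1: float, k2: float, k3: float, k4: float,
                k5: float, k6: float, k7: float,
                betas: Sequence[float] = (0.4, 0.2, 0.1, 0.1)) -> float:
    """Public opinion heat weight rho under the reading described above.

    k1: views, k2: comments, k3: replies, k4: supporting clicks,
    k5: opposing clicks, k6: reposts, k7: favorites.
    Assumption: the arguments of every logarithm are positive; otherwise
    math.log10 raises ValueError.
    """
    b1, b2, b3, b4 = betas
    return ((math.log10(k1 ** (3 / 4)) + 0.03) * b1
            + (math.log10(k2 ** (2 / 3) + k3 ** (2 / 3)) + 0.02) * b2
            + (math.log10(k4 ** (1 / 2) + k5 ** (1 / 2)) + 0.01) * b3
            + (math.log10(k6 ** (1 / 3) + k7 ** (1 / 3)) + 0.005) * b4)


def needs_analysis(rho: float, threshold: float) -> bool:
    """A microblog is kept for analysis only if rho exceeds the threshold."""
    return rho > threshold


# Hypothetical example values; the threshold 1.0 is an illustrative choice.
rho = heat_weight(12000, 300, 150, 800, 40, 500, 90)
print(round(rho, 4), needs_analysis(rho, threshold=1.0))
```

Because the coefficients are passed in as a parameter, the same function covers both the suggested defaults and any adjusted values of β1 to β4.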

Preferably, the intelligent crawler module 2 performs the following steps:

Step 2-1: analyze the microblog pages against the system's predefined events, filter out the links irrelevant to the predefined events to be monitored, retain the links relevant to the predefined events, and put the retained links into the URL queue of pages waiting to be fetched;

Step 2-2: according to a predefined search strategy, select from the URL queue the URL of the next page to fetch and repeat Step 2-1; stop crawling once the stop condition preset by the system is met.
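The two crawler steps can be sketched as a small focused crawler. The sketch below is not the patented implementation: it assumes a breadth-first search strategy and a page budget as the stop condition, neither of which the specification fixes, and the names `focused_crawl`, `fetch_links`, `is_relevant` and `max_pages` are hypothetical. An in-memory link graph stands in for real microblog pages so the example runs without network access.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List


def focused_crawl(seeds: Iterable[str],
                  fetch_links: Callable[[str], List[str]],
                  is_relevant: Callable[[str], bool],
                  max_pages: int) -> List[str]:
    """Crawl pages, keeping only links relevant to the predefined event."""
    queue = deque(seeds)           # URL queue of pages waiting to be fetched
    seen = set(queue)
    fetched: List[str] = []
    while queue and len(fetched) < max_pages:   # assumed stop condition
        url = queue.popleft()      # assumed search strategy: breadth-first
        fetched.append(url)
        for link in fetch_links(url):
            # Step 2-1: drop irrelevant links, enqueue the relevant ones.
            if link not in seen and is_relevant(link):
                seen.add(link)
                queue.append(link)
    return fetched


# Toy link graph used instead of real pages (illustrative data only).
LINK_GRAPH: Dict[str, List[str]] = {
    "seed": ["flood-report", "shoe-advert"],
    "flood-report": ["flood-rescue", "seed"],
    "shoe-advert": ["flood-hidden"],
    "flood-rescue": [],
}

crawled = focused_crawl(
    seeds=["seed"],
    fetch_links=lambda url: LINK_GRAPH.get(url, []),
    is_relevant=lambda url: "flood" in url,
    max_pages=10,
)
print(crawled)
```

In this toy run the advert link is filtered out in Step 2-1, so the page reachable only through it is never queued.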

Preferably, the extraction and preprocessing module 3 performs the following steps:

First, extract the information in the microblog body that is useful for public opinion analysis, restructure the microblog body, and group together the microblog data that is representative of the same topic;

Second, perform word segmentation, stop-word filtering, named entity recognition, syntactic parsing, part-of-speech tagging, sentiment recognition and feature word extraction on the microblog data, and then extract the feature phrases.

Preferably, the feature phrase filtering module 4 performs the following steps:

Step 4-1: deduplicate the feature phrases, which includes recording the repeated feature phrases that occur in the microblog text and the number of times each occurs, and filtering out the repeated feature phrases whose frequency of occurrence is below the repetition threshold and those whose length is below the corresponding threshold;

Step 4-2: group the feature phrases, which includes calculating the similarity value between each feature phrase and the other feature phrases and placing the feature phrases whose similarity value is higher than the similarity threshold in the same group; a feature phrase whose similarity value with every other feature phrase is 0 is filtered out. Specifically, either of the following two steps may be selected to calculate the similarity value Sims(X, Y) of two feature phrases X and Y before grouping the feature phrases:

Step 4-2-1:

First, let sum(XY) be the number of sentences in which feature phrases X and Y both occur, sum(X) the number of sentences in which only X occurs and Y does not, and sum(Y) the number of sentences in which only Y occurs and X does not. The similarity value Sims(X, Y) of feature phrases X and Y is then calculated as follows:

Sims(X, Y) = log2(sum(XY)) / log2(sum(X)) + log2(sum(XY)) / log2(sum(Y));

Second, if Sims(X, Y) ≤ threshold TD1, feature phrase Y is placed in the group containing feature phrase X;

Step 4-2-2:

First, let m and n be the numbers of characters contained in feature phrases X and Y respectively, and let k be the smaller of m and n. Let Xi and Yi denote the subphrases formed by the first i characters of X and Y respectively, where i = 1, 2, ..., k. Define |Xi-Yi| as the number of characters in the longest common character string of subphrases Xi and Yi. The similarity value Sims(X, Y) of feature phrases X and Y is then calculated as follows:

Sims(X, Y) = (|X1-Y1|³ + |X2-Y2|³ + … + |Xk-Yk|³)^(1/3);

Second, if Sims(X, Y) ≤ threshold TD2, feature phrase Y is placed in the group containing feature phrase X;

Step 4-3: apply entropy filtering to the feature phrases, which includes calculating the entropy of each feature phrase and filtering out the feature phrases whose entropy is below a preset lower threshold and those whose entropy is above a preset upper threshold.
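The deduplication of Step 4-1 and the two similarity measures of Steps 4-2-1 and 4-2-2 can be sketched as below. Several points are assumptions rather than statements of the specification: the function and parameter names are hypothetical, a sentence is modeled as a list of feature phrases, the "longest common character string" is read as the longest contiguous common substring, and the co-occurrence measure returns 0.0 whenever a logarithm would be undefined or a denominator would be zero. The grouping decision against TD1 or TD2 and the entropy filter of Step 4-3 are left out because the text does not define them in enough detail.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def dedupe_feature_phrases(phrases: Sequence[str],
                           min_count: int, min_length: int) -> Dict[str, int]:
    """Step 4-1: count each phrase, then drop rare or too-short phrases."""
    counts = Counter(phrases)
    return {p: c for p, c in counts.items()
            if c >= min_count and len(p) >= min_length}


def cooccurrence_similarity(sentences: Sequence[Sequence[str]],
                            x: str, y: str) -> float:
    """Step 4-2-1: Sims(X, Y) from sentence co-occurrence counts."""
    both = sum(1 for s in sentences if x in s and y in s)
    only_x = sum(1 for s in sentences if x in s and y not in s)
    only_y = sum(1 for s in sentences if y in s and x not in s)
    # Assumption: undefined cases (count 0, or count 1 in a denominator,
    # where log2 is 0) are mapped to a similarity of 0.0.
    if both < 1 or only_x < 2 or only_y < 2:
        return 0.0
    return (math.log2(both) / math.log2(only_x)
            + math.log2(both) / math.log2(only_y))


def longest_common_substring_len(a: str, b: str) -> int:
    """Length of the longest contiguous substring shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def prefix_similarity(x: str, y: str) -> float:
    """Step 4-2-2: cube root of the summed cubes of |Xi-Yi|, i = 1..k."""
    k = min(len(x), len(y))
    total = sum(longest_common_substring_len(x[:i], y[:i]) ** 3
                for i in range(1, k + 1))
    return total ** (1 / 3)


sentences = [["rain", "flood"]] * 4 + [["rain"]] * 2 + [["flood"]] * 4
print(cooccurrence_similarity(sentences, "rain", "flood"))
print(prefix_similarity("abc", "abd"))
```

For the prefixes of "abc" and "abd" the values |Xi-Yi| are 1, 2 and 2, so the second measure is the cube root of 1 + 8 + 8 = 17.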

Preferably, the public opinion analysis module 5 analyzes and discovers microblog public opinion hotspots through the following steps:

First, multiple microblog hotspot discovery submodules are used to obtain microblog public opinion hotspots through parallel MapReduce distributed computing. The microblog hotspot discovery submodules comprise:

1) a Single-Pass microblog hotspot discovery submodule 5.1, which uses the Single-Pass algorithm;

2) a KNN microblog hotspot discovery submodule 5.2, which uses the k-nearest neighbor (KNN) classification algorithm;

3) an SVM microblog hotspot discovery submodule 5.3, which uses the support vector machine (SVM) algorithm;

4) a K-means microblog hotspot discovery submodule 5.4, which uses the K-means clustering algorithm; and

5) an SOM microblog hotspot discovery submodule 5.5, which uses the self-organizing map (SOM) neural network clustering algorithm;

Second, all the microblog public opinion hotspots obtained by the individual hotspot discovery submodules are aggregated and graded as follows:

If a microblog public opinion hotspot is obtained from three or more of the above hotspot discovery submodules, it is labeled as a high-level microblog public opinion hotspot;

If a microblog public opinion hotspot is obtained from two of the above hotspot discovery submodules, it is labeled as an intermediate-level microblog public opinion hotspot;

If a microblog public opinion hotspot is obtained from only one of the above hotspot discovery submodules, it is labeled as an elementary-level microblog public opinion hotspot;

Finally, the high-level, intermediate-level and elementary-level microblog public opinion hotspots are sent, in that order, to the sentiment orientation analysis module 6.
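The grading rule can be illustrated with a short Python sketch. It assumes that each submodule reports its hotspots as a set of comparable topic labels, which the specification does not state, and it treats a hotspot reported by three or more submodules as the top grade. The grade labels `high`, `intermediate` and `elementary` and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set

GRADE_ORDER = {"high": 0, "intermediate": 1, "elementary": 2}


def grade_hotspots(results: Dict[str, Set[str]]) -> Dict[str, str]:
    """Grade each hotspot by how many submodules reported it."""
    sources: Dict[str, Set[str]] = defaultdict(set)
    for submodule, hotspots in results.items():
        for hotspot in hotspots:
            sources[hotspot].add(submodule)
    grades: Dict[str, str] = {}
    for hotspot, submodules in sources.items():
        n = len(submodules)
        if n >= 3:
            grades[hotspot] = "high"
        elif n == 2:
            grades[hotspot] = "intermediate"
        else:
            grades[hotspot] = "elementary"
    return grades


def dispatch_order(grades: Dict[str, str]) -> List[str]:
    """Order hotspots for the sentiment module: high grade first."""
    return sorted(grades, key=lambda h: (GRADE_ORDER[grades[h]], h))


# Illustrative submodule outputs (made-up topic labels).
results = {
    "single_pass": {"flood", "concert"},
    "knn": {"flood"},
    "svm": {"flood", "concert"},
    "kmeans": {"strike"},
    "som": set(),
}
grades = grade_hotspots(results)
print(grades, dispatch_order(grades))
```

Matching hotspots across submodules by an identical label is the simplest possible choice; a real system would need some way of deciding that two submodules found the same topic.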

Preferably, the sentiment orientation analysis module 6 performs sentiment orientation analysis on the microblog text through the following steps:

Step 6-1: manually select a number of common Chinese and English adjectives, nouns and verbs that carry sentiment orientation as the initial seed set; in the initial seed set, the number of adjectives may be 50 and the number of nouns and verbs may be 100;

Step 6-2: resolve every pronoun that has an anaphoric relation in the microblog text to the noun it originally refers to, so that no object is missed or misjudged during analysis;

Step 6-3: taking each sentence of the microblog as a unit, analyze the constituents of each sentence using part-of-speech (POS) tagging and semantic role labeling (SRL), and extract the subjective words in each sentence;

Step 6-4: input the subjective words of each sentence in turn and automatically label their sentiment orientation according to the seed set; for a subjective word that cannot be labeled automatically, judge its sentiment orientation manually and then add the word to the seed set.
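Step 6-4 amounts to a lexicon lookup with a manual fallback that grows the seed set. The sketch below assumes the seed set is a mapping from word to a polarity label and that the subjective words have already been extracted by Steps 6-2 and 6-3; the function names, the labels `positive` and `negative`, and the sample words are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def label_subjective_words(words: Sequence[str],
                           seed_set: Dict[str, str]
                           ) -> Tuple[Dict[str, str], List[str]]:
    """Label words found in the seed set; collect the rest for manual review."""
    labeled: Dict[str, str] = {}
    unresolved: List[str] = []
    for word in words:
        if word in seed_set:
            labeled[word] = seed_set[word]
        elif word not in unresolved:
            unresolved.append(word)
    return labeled, unresolved


def add_manual_judgement(seed_set: Dict[str, str],
                         word: str, polarity: str) -> None:
    """After a human judges the word, add it to the seed set."""
    seed_set[word] = polarity


# Tiny illustrative seed set (the specification suggests a far larger one).
seed_set = {"good": "positive", "terrible": "negative"}
labeled, unresolved = label_subjective_words(["good", "awful", "terrible"], seed_set)
print(labeled, unresolved)

# A reviewer judges the unresolved word; later lookups then succeed.
add_manual_judgement(seed_set, "awful", "negative")
print(label_subjective_words(["awful"], seed_set))
```

Because manually judged words are written back to the seed set, the share of words that can be labeled automatically grows as more microblogs are processed.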

Preferably, the user interaction module 7 implements the user interaction functions, and the charts or reports it can produce include: a microblog public opinion heat ranking report, a microblog public opinion early-warning information distribution report, a microblog public opinion geographic distribution report, a microblog public opinion sentiment analysis report, a microblog public opinion statistics report and a microblog public opinion trend analysis chart.

Brief description of the drawings

The technical solution of the present invention is further described below with reference to the accompanying drawing. The drawing serves only to illustrate a preferred embodiment and is not to be regarded as limiting the present invention.

Fig. 1 shows the functional structure diagram of a microblog public opinion monitoring system according to an embodiment of the present invention.

Detailed description of the embodiments

From the following detailed description of the preferred embodiments, various other advantages and benefits will become clear to those of ordinary skill in the art. The description is only an overview of the technical solution of the present invention, given so that the technical means of the invention can be better understood and implemented according to the content of the specification, and so that the above and other objects, features and advantages of the invention become more apparent.

Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawing. Although the drawing shows exemplary embodiments of the present disclosure, it should be understood that the disclosure can be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope can be fully conveyed to those skilled in the art.

The present invention claims a microblog public opinion monitoring system comprising: a public opinion heat acquisition module, an intelligent crawler module, an extraction and preprocessing module, a feature phrase filtering module, a public opinion analysis module, a sentiment orientation analysis module and a user interaction module. The public opinion analysis module uses a distributed cloud computing mode and multiple microblog public opinion monitoring algorithm submodules to obtain microblog public opinion hotspots, and comprehensively judges, classifies and assesses the hotspots obtained, so that hot microblog public opinion topics are monitored more efficiently and accurately.

Fig. 1 is the functional structure diagram of a microblog public opinion monitoring system according to an embodiment of the present invention.

As shown in Fig. 1, the microblog public opinion monitoring system comprises seven modules: a public opinion heat acquisition module 1, an intelligent crawler module 2, an extraction and preprocessing module 3, a feature phrase filtering module 4, a public opinion analysis module 5, a sentiment orientation analysis module 6 and a user interaction module 7, wherein:

Specifically, the public opinion heat acquisition module 1 calculates the public opinion heat weight ρ of a microblog; if ρ is greater than a preset threshold Tρ, the microblog is used as a data source and analysis basis for public opinion analysis, as follows:

Preferably, the above coefficients β1 to β4 may be set to: β1 = 0.4; β2 = 0.2; β3 = 0.1; β4 = 0.1.

Specifically, the intelligent crawler module 2 performs the following steps:

Specifically, the extraction and preprocessing module 3 performs the following steps:

Specifically, the feature phrase filtering module 4 performs the following steps:

Step 4-2-1:

Step 4-2-2:

Sims(X, Y) = (|X1-Y1|³ + |X2-Y2|³ + … + |Xk-Yk|³)^(1/3);

Specifically, the public opinion analysis module 5 analyzes and discovers microblog public opinion hotspots, and its working principle is as follows:

The present invention uses a distributed cloud computing mode, which makes it possible to mine and analyze the large-scale microblog data collected. Microblog public opinion hotspots can be obtained from multiple public opinion monitoring algorithm modules and then comprehensively judged and graded. This enables the discovery and tracking of hot microblog public opinion topics, social network analysis of microblogs and visual presentation of the analysis results, and gives Party and government organs, large enterprises and other organizations automated, systematic and scientific information support for discovering sensitive microblog information in time, grasping microblog public opinion hotspots, following public opinion trends and responding to public opinion crises. The judgment accuracy of the microblog public opinion monitoring system is effectively improved, which provides a more truthful and accurate basis for the subsequent processing of microblog public opinion information. Specifically:

The collected microblog data and the analysis results are stored in a distributed storage layer, which is implemented on HDFS;

At the distributed computing layer, the MapReduce parallel computing method is used to parallelize the computation;

By optimizing HDFS file storage and transfer and MapReduce parallel computation, the system optimizes the monitoring of massive microblog public opinion data, achieves stable and efficient big-data storage, optimizes the query processing of massive microblog public opinion data, and offers good scalability, reliability and security. Being based on a cloud platform, the system responds quickly and supports analysis and mining services for massive microblog data.
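To make the MapReduce computation model concrete, the following single-process Python sketch imitates its three phases (map, shuffle, reduce) on a term-counting job. It does not use Hadoop or HDFS and is not the system's actual job definition; the function names and the choice of term counting as the example job are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(post: str) -> List[Tuple[str, int]]:
    """Map: emit a (term, 1) pair for every term in one microblog post."""
    return [(term, 1) for term in post.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(term: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce: combine the values of one key into a single result."""
    return term, sum(counts)


def run_job(posts: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence within one process."""
    mapped = [pair for post in posts for pair in map_phase(post)]
    return dict(reduce_phase(term, counts)
                for term, counts in shuffle(mapped).items())


print(run_job(["flood rain", "flood rescue", "rain flood"]))
```

In a real cluster each map call and each reduce call can run on a different node, because map calls only see one input record and reduce calls only see one key.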

The public opinion analysis module 5 analyzes and discovers microblog public opinion hotspots through the following steps:

First, multiple microblog hotspot discovery submodules are used to obtain microblog public opinion hotspots through parallel distributed computing. The microblog hotspot discovery submodules comprise:

1) a Single-Pass microblog hotspot discovery submodule 5.1, which uses a MapReduce-based Single-Pass algorithm;

2) a KNN microblog hotspot discovery submodule 5.2, which uses a MapReduce-based k-nearest neighbor (KNN) classification algorithm;

3) an SVM microblog hotspot discovery submodule 5.3, which uses a MapReduce-based support vector machine (SVM) algorithm;

4) a K-means microblog hotspot discovery submodule 5.4, which uses a MapReduce-based K-means clustering algorithm; and

5) an SOM microblog hotspot discovery submodule 5.5, which uses a MapReduce-based self-organizing map (SOM) neural network clustering algorithm;

The algorithms used by the hotspot discovery submodules 5.1 to 5.5 are all general-purpose algorithms known in the art, so the improvement of the present invention does not lie in these algorithms themselves. Existing microblog public opinion monitoring systems typically use only one of these hotspot discovery algorithms; no system has yet been found that uses several of them simultaneously and grades the combined results of the algorithms. Moreover, although the microblog public opinion monitoring system of the present invention uses multiple hotspot discovery algorithms, it adopts a distributed architecture based on cloud computing and therefore does not incur unbearable overhead; and because several approaches are combined, the accuracy of the microblog public opinion monitoring system is greatly improved and a good technical effect is achieved.
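As an illustration of one of these general-purpose algorithms, the Single-Pass clustering idea can be sketched in a few lines of Python: each document is compared with the existing clusters in arrival order and either joins the most similar one or starts a new cluster. The specification does not state the features, the similarity measure or the threshold, so the term-set representation, the Jaccard similarity and the value 0.3 below are assumptions, and this single-process sketch is not the MapReduce-based variant the system is described as using.

```python
from typing import Dict, List, Sequence, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two term sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def single_pass_cluster(docs: Sequence[Set[str]],
                        threshold: float = 0.3) -> List[List[int]]:
    """Assign each document to its most similar cluster or open a new one."""
    clusters: List[Dict] = []   # each cluster: {"terms": set, "members": list}
    for index, terms in enumerate(docs):
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = jaccard(terms, cluster["terms"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(index)
            best["terms"] |= terms          # grow the cluster's term set
        else:
            clusters.append({"terms": set(terms), "members": [index]})
    return [cluster["members"] for cluster in clusters]


# Illustrative documents given as sets of already-extracted terms.
docs = [
    {"flood", "rain", "city"},
    {"flood", "rain", "rescue"},
    {"concert", "ticket"},
    {"ticket", "concert", "price"},
]
print(single_pass_cluster(docs))
```

Clusters that collect many documents in a short time window are natural hotspot candidates, which is why this family of algorithms is common in topic detection.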

Specifically, the sentiment orientation analysis module 6 performs sentiment orientation analysis on the microblog text through the following steps:

Step 6-1: manually select a number of common Chinese and English adjectives, nouns and verbs that carry sentiment orientation as the initial seed set; preferably, in the initial seed set, the number of adjectives may be 50 and the number of nouns and verbs may be 100;

Specifically, the charts or reports that the user interaction module 7 can produce for the user include: a microblog public opinion heat ranking report, a microblog public opinion early-warning information distribution report, a microblog public opinion geographic distribution report, a microblog public opinion sentiment analysis report, a microblog public opinion statistics report and a microblog public opinion trend analysis chart.

The embodiments of the system and its constituent modules described in this specification are only illustrative; some or all of the modules may be selected according to actual needs to achieve the purpose of the embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.

In summary, the above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or replacement that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. The scope of protection of the present invention shall therefore be determined by the scope of protection of the claims.

Claims

1. A microblog public opinion monitoring system, comprising: a public opinion heat acquisition module (1), an intelligent crawler module (2), an extraction and preprocessing module (3), a feature phrase filtering module (4), a public opinion analysis module (5), a sentiment orientation analysis module (6) and a user interaction module (7), wherein:

The public opinion heat acquisition module (1) screens, according to the public opinion heat weight of each microblog, the microblog pages that need public opinion analysis;

The intelligent crawler module (2) crawls the microblog data of the specified microblog pages within a specified time, analyzes the crawled microblog data according to predefined events, and filters out the microblog data irrelevant to the public opinion to be monitored;

The extraction and preprocessing module (3) extracts and preprocesses the information in the microblog data obtained by the intelligent crawler module (2);

The feature phrase filtering module (4) filters and screens the feature phrases in the microblog data processed by the extraction and preprocessing module (3);

The public opinion analysis module (5) discovers microblog public opinion hotspots based on the microblog data processed by the feature phrase filtering module (4);

The sentiment orientation analysis module (6) performs sentiment orientation analysis on the discovered microblog public opinion hotspots;

The user interaction module (7) displays the microblog public opinion analysis results as charts or reports and implements the user interaction functions.

2. The microblog public opinion monitoring system according to claim 1, characterized in that:

The public opinion heat acquisition module (1) calculates the public opinion heat weight ρ of a microblog; if ρ is greater than a preset threshold Tρ, the microblog is used as a data source and analysis basis for public opinion analysis. Specifically:

where β1 to β4 may be set to: β1 = 0.4; β2 = 0.2; β3 = 0.1; β4 = 0.1.

3. The microblog public opinion monitoring system according to claim 2, characterized in that:

The intelligent crawler module (2) performs the following steps:

4. The microblog public opinion monitoring system according to claim 3, characterized in that:

The extraction and preprocessing module (3) performs the following steps:

5. The microblog public opinion monitoring system according to claim 4, characterized in that:

The feature phrase filtering module (4) performs the following steps:

Step 4-2-1:

Step 4-2-2:

Sims(X, Y) = (|X1-Y1|³ + |X2-Y2|³ + … + |Xk-Yk|³)^(1/3);

6. The microblog public opinion monitoring system according to claim 5, characterized in that:

The public opinion analysis module (5) analyzes and discovers microblog public opinion hotspots through the following steps:

1) a Single-Pass microblog hotspot discovery submodule (5.1), which uses the Single-Pass algorithm;

2) a KNN microblog hotspot discovery submodule (5.2), which uses the k-nearest neighbor (KNN) classification algorithm;

3) an SVM microblog hotspot discovery submodule (5.3), which uses the support vector machine (SVM) algorithm;

4) a K-means microblog hotspot discovery submodule (5.4), which uses the K-means clustering algorithm; and

5) an SOM microblog hotspot discovery submodule (5.5), which uses the self-organizing map (SOM) neural network clustering algorithm;

Finally, the high-level, intermediate-level and elementary-level microblog public opinion hotspots are sent, in that order, to the sentiment orientation analysis module (6).

7. The microblog public opinion monitoring system according to claim 6, characterized in that:

The sentiment orientation analysis module (6) performs sentiment orientation analysis on the microblog text through the following steps:

8. The microblog public opinion monitoring system according to claim 7, characterized in that:

The user interaction module (7) implements the user interaction functions, and the charts or reports it can produce include: a microblog public opinion heat ranking report, a microblog public opinion early-warning information distribution report, a microblog public opinion geographic distribution report, a microblog public opinion sentiment analysis report, a microblog public opinion statistics report and a microblog public opinion trend analysis chart.