CN109903176B - Real-time public opinion analysis method based on streaming cloud platform - Google Patents

Real-time public opinion analysis method based on streaming cloud platform

Info

Publication number
CN109903176B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910109044.0A
Other languages
Chinese (zh)
Other versions
CN109903176A (en)
Inventor
王永生
赵禹萌
云静
邢红梅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inner Mongolia University of Technology
Original Assignee
Inner Mongolia University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-02-03
Filing date
2019-02-03
Publication date
2020-04-10
Application filed by Inner Mongolia University of Technology
Priority to CN201910109044.0A
Publication of CN109903176A
Application granted
Publication of CN109903176B

Abstract

The invention provides a real-time public opinion analysis method based on a streaming cloud platform. The method comprises: determining the key phrases of keyword clusters by singular non-backtracking clustering, which reduces the dimension of the key phrase set more efficiently; screening redundant keywords out of the key phrases with a keyword-based semantic similarity measure; and updating the public opinion representation model in real time on the Apache Storm platform so that public opinion is detected in real time. The method extracts the key phrases of public opinion with the singular non-backtracking algorithm, reduces the dimension of the public opinion representation model by means of key phrase weights, and uses the topology of the Apache Storm cloud platform to make the selection of key phrases more accurate, compress the dimension of the key phrase set, and improve the efficiency of real-time public opinion analysis. Public opinion information can therefore be represented accurately, yielding an efficient streaming public opinion analysis platform that finds public opinion information in a massive text stream in real time.

Description

Real-time public opinion analysis method based on streaming cloud platform
Technical Field
The invention belongs to the technical field of big data analysis and application, relates to public opinion analysis, and particularly relates to a real-time public opinion analysis method based on a streaming cloud platform.
Background
With the development of social networks and the popularization of 4G networks, Facebook, microblogs and Twitter have spread all over the world and become a main channel through which people obtain information. People increasingly turn to real-time public opinion on social media for the latest information, so this public opinion needs to be represented effectively and quickly in order to filter it in time.
Traditional public opinion representations are quite sparse, which greatly restricts the accuracy and efficiency of public opinion analysis. Social media users generally prefer to post brief public opinions. Like other types of text, these brief public opinions cover all types of social events, so the number of keywords used to describe them is the same as for other types of text, and the public opinion representation model has the same dimension as other text representation models. The keyword representation model built by conventional public opinion analysis methods therefore introduces a large number of redundant keywords and does not achieve a good analysis effect. More seriously, redundant key phrases carry useless information, which has a negative influence on public opinion analysis.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a real-time public opinion analysis method based on the streaming cloud platform Apache Storm, so as to achieve a better public opinion analysis effect.
In order to achieve this purpose, the invention adopts the following technical scheme:
a real-time public opinion analysis method based on a streaming cloud platform comprises the following steps:
step 1, for given public opinions, carrying out keyword clustering to obtain a plurality of key phrases, each grouping keywords with the same semantics;
step 2, screening out redundant key phrases to reduce the dimension of the public opinion representation model, wherein the public opinion representation model is the set of key phrases that are clustered according to public opinion characteristics and used to represent public opinions;
step 3, updating the key phrases by using the Apache Storm topology so as to ensure that real-time public opinion is represented by the key phrases. That is, the above method is embedded into the Apache Storm topology and the weight of each word is calculated from the value of KPF/NKP; if the weight of a word is larger, the word is added to the representation model, and the word with the smallest weight in the representation model is correspondingly screened out.
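The update rule at the end of step 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `update_model`, the fixed `capacity` of the model, and the sample weights are hypothetical and do not come from the patent, which gives no concrete data structure.

```python
def update_model(model, phrase, weight, capacity):
    """Insert `phrase` into `model` (phrase -> weight), evicting the weakest entry if needed.

    Assumes capacity >= 1. Names and the fixed-capacity policy are illustrative only.
    """
    if phrase in model or len(model) < capacity:
        model[phrase] = weight           # refresh an existing entry or fill a free slot
        return model
    weakest = min(model, key=model.get)  # entry with the smallest weight
    if weight > model[weakest]:
        del model[weakest]               # screen out the weakest word
        model[phrase] = weight           # and fill in the heavier one
    return model


model = {"fire": 0.8, "burn": 0.6}
update_model(model, "explosion", 0.4, capacity=2)  # lighter than every entry: model unchanged
update_model(model, "smoke", 0.7, capacity=2)      # heavier than "burn": replaces it
print(model)
```

A linear scan for the weakest entry keeps the sketch short; a heap would serve the same purpose for large models.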
In step 1, keyword clustering is performed by using a singular non-backtracking clustering algorithm, and key phrases are extracted according to the same semantic relationship as follows:
Figure BDA0001967365470000021
wherein $syn(t_k, l_{ij})$ denotes the key phrases with the same semantics; $M$ is the number of keywords; $t_k$ is a word in the public opinion, $k$ being its serial number; the $M$ keywords and $t_k$ share a common semantic tree; $l_{ij}$ is a keyword in key phrase $l_i$; $CN(t_k, l_{ij})$ is the number of common parent nodes of $t_k$ and $l_{ij}$ in the same semantic tree; $DN(t_k, l_{ij})$ is the distance between the node of $t_k$ and its nearest parent node in the same semantic tree; and $\beta$ and $\xi$ are constant dynamic influence factors, selected in real time according to the semantic relationship.
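The two tree quantities defined above, CN and DN, can be illustrated on a toy semantic tree. The sketch below is an assumption-laden illustration: the tree contents, the function names, and the reading of DN as the number of edges from one word up to the nearest ancestor it shares with the other word are all hypothetical. The similarity formula itself survives only as an image reference in this text, so the sketch stops at its two inputs and does not attempt to combine them.

```python
def ancestors(node, parent):
    """Return the ancestors of `node`, nearest first, following child -> parent links."""
    chain = []
    while parent.get(node) is not None:
        node = parent[node]
        chain.append(node)
    return chain


def common_parents(a, b, parent):
    """CN: number of ancestor nodes shared by `a` and `b`."""
    return len(set(ancestors(a, parent)) & set(ancestors(b, parent)))


def distance_to_nearest_common_parent(a, b, parent):
    """DN (assumed reading): edges from `a` up to the nearest ancestor shared with `b`."""
    shared = set(ancestors(b, parent))
    for depth, node in enumerate(ancestors(a, parent), start=1):
        if node in shared:
            return depth
    return None  # the two words do not share a tree


# Hypothetical semantic tree, stored as child -> parent.
parent = {
    "event": None,
    "disaster": "event",
    "fire_event": "disaster",
    "fire": "fire_event",
    "smoke": "fire_event",
    "explosion": "disaster",
}

print(common_parents("fire", "smoke", parent))                         # 3
print(distance_to_nearest_common_parent("fire", "smoke", parent))      # 1
print(common_parents("fire", "explosion", parent))                     # 2
print(distance_to_nearest_common_parent("fire", "explosion", parent))  # 2
```

In this toy tree, words that sit close together share more ancestors and reach a shared ancestor sooner, which is the intuition behind grouping them into one key phrase.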
In step 2, redundant key phrases are screened as follows: the occurrence frequency of each keyword in the public opinions is counted; redundant key phrases are screened by the value of KPF/NKP, for which KPF and NKP are counted and the weight of each key phrase is then calculated from them. The weight of a key phrase is calculated as follows:
Figure BDA0001967365470000022
wherein $W_{ij}$ is the weight of key phrase $l_i$ in real-time public opinion $r_j$; $|R|$ is the size of the public opinion set $R$; KPF is the sum of the frequencies of the keywords in a key phrase; NKP is the number of key phrases in the public opinions; $KPF(l_i, r_j)$ is the KPF value of key phrase $l_i$ in public opinion $r_j$; and $NKP(l_i)$ is the number of key phrases $l_i$ contained in the public opinions. The weight of each key phrase is calculated, the key phrases are then arranged in descending order of weight, and the top 5 percent are taken as valuable key phrases.
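The ranking and cut-off described above can be sketched as follows. The weight formula is only referenced as an image in this text, so the sketch assumes the plain ratio KPF/NKP as the weight; that choice, the function names, and the sample counts are hypothetical. Rounding the 5 percent cut-off up so that at least one phrase is kept is also an assumption.

```python
import math


def phrase_weights(kpf, nkp):
    """Assumed weight: the plain ratio KPF / NKP for every phrase with a non-zero NKP."""
    return {p: kpf[p] / nkp[p] for p in kpf if nkp.get(p, 0) > 0}


def top_fraction(weights, fraction=0.05):
    """Sort phrases by descending weight and keep the top `fraction` (at least one)."""
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    keep = max(1, math.ceil(len(ranked) * fraction))
    return [phrase for phrase, _ in ranked[:keep]]


# Hypothetical counts for three key phrases.
kpf = {"a": 10, "b": 4, "c": 3}
nkp = {"a": 2, "b": 4, "c": 1}

weights = phrase_weights(kpf, nkp)
print(weights)                       # {'a': 5.0, 'b': 1.0, 'c': 3.0}
print(top_fraction(weights))         # ['a']
print(top_fraction(weights, 0.5))    # ['a', 'c']
```

With only three phrases the 5 percent cut-off keeps a single entry; on a realistic vocabulary the same call keeps the heaviest twentieth of the key phrases.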
In step 3, the Apache Storm topology comprises six parts: Log Hub, IRich Spout, Cluster Bolt, Pattern Bolt, Increment Bolt and HBase Shell, and the workflow is as follows:
step 1, the Log Hub receives the public opinions and sends them to the IRich Spout;
step 2, the public opinions received by the IRich Spout are sent to the Cluster Bolt;
step 3, the Cluster Bolt sends the received data to the Pattern Bolt, and caches and updates the data received from the IRich Spout;
step 4, the new data sent to the Pattern Bolt is clustered by the keyword clustering method to obtain key phrases; the KPF/NKP value of each key phrase is then calculated step by step, the calculated values are sorted, key phrases are selected as elements of the representation model, and a new model is obtained as |R| data items are received;
step 5, the new model obtained in the previous step is updated incrementally: the Increment Bolt obtains the public opinion representation model from the Pattern Bolt, and after the Cluster Bolt receives a public opinion, the Increment Bolt describes that public opinion with the weight vector of the key phrases;
step 6, the weights are calculated from the KPF/NKP values of the key phrases in the public opinion, and the Increment Bolt finally sends the representation result to the HBase Shell.
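The data flow of the first four workflow steps can be mimicked with plain Python generators. This is a local simulation only, not Apache Storm code: no Storm API is used, the function names are stand-ins for the components named above, and raw term counts replace the KPF/NKP weight, which is an assumption made to keep the sketch short.

```python
from collections import Counter
from itertools import islice


def spout(source):
    """Stand-in for the IRich Spout: emit one tokenized public opinion at a time."""
    for opinion in source:
        yield opinion


def cluster_bolt(stream, batch_size):
    """Stand-in for the Cluster Bolt: buffer opinions and release them in batches of |R|."""
    stream = iter(stream)
    while True:
        batch = list(islice(stream, batch_size))
        if not batch:
            return
        yield batch


def pattern_bolt(batches, model_size):
    """Stand-in for the Pattern Bolt: build a new model from every batch.

    Raw term counts are used instead of the KPF/NKP weight (assumption).
    """
    for batch in batches:
        counts = Counter(term for opinion in batch for term in opinion)
        yield [term for term, _ in counts.most_common(model_size)]


# Hypothetical input standing in for what the Log Hub would deliver.
log_hub = [["fire", "smoke"], ["fire"], ["burn"], ["burn", "fire"]]

models = list(pattern_bolt(cluster_bolt(spout(log_hub), batch_size=2), model_size=1))
print(models)  # [['fire'], ['burn']]
```

Each batch of |R| opinions yields one model, which matches the description of a new model being obtained as |R| data items are received.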
Compared with existing public opinion analysis methods, the method first applies a singular non-backtracking clustering algorithm to real-time public opinions, obtains key phrases by calculating the same expression relationship among the feature words, and screens the key phrases according to the calculated KPF/NKP values; it then builds the model and finally implements a public opinion topology on Apache Storm to ensure that real-time public opinions are represented by key phrases, thereby effectively improving the accuracy and effectiveness of public opinion analysis.
Drawings
FIG. 1 is a system flow diagram according to an embodiment of the present invention.
FIG. 2 is a topology of an Apache Storm according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail by embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Fig. 1 presents a schematic flow chart according to an embodiment of the present invention. In summary, the method comprises:
step 1) carrying out keyword clustering on the given public opinions, the keywords being clustered on the basis of the same semantic relationship;
step 2) for the clustered key phrases, measuring the ability of each keyword to represent real-time public opinion by calculating the weight of each keyword in the key phrases;
step 3) using the calculated weights: a keyword with a low weight is considered unable to represent real-time public opinion well, whereas a keyword with a high weight represents it well, so the keywords with lower weights are screened out;
step 4) updating the dimension-reduced key phrases according to the designed Apache Storm topology so as to achieve real-time analysis.
More specifically, in step 1), the $M$ keywords and $t_k$ are first arranged in a tree sharing one and the same semantic relationship, on which $DN(t_k, l_{ij})$ is defined, $l_{ij}$ being a keyword in key phrase $l_i$; second, $CN(t_k, l_{ij})$ is set as the number of common parent nodes of $t_k$ and $l_{ij}$ in the same semantic tree; finally, two constant dynamic influence factors $\beta$ and $\xi$ are designed. The two constants are selected in real time according to the expression relationship; $\beta$ is usually taken as 1 and $\xi$ as 2. $syn(t_k, l_{ij})$ denotes the key phrases with the same expression relationship. The following formula extracts key phrases according to the same expression relationship.
Figure BDA0001967365470000041
More specifically, in step 2) and step 3), the frequency of each keyword in a key phrase is counted, the frequencies are summed, and the sum is named $KPF(l_i, r_j)$; the number of key phrases $l_i$ contained in the public opinions is named $NKP(l_i)$.
Redundant key phrases are screened by the value of KPF/NKP.
To this end, KPF and NKP are counted, the real-time public opinions are denoted by $|R|$, $r_j$ denotes each specific real-time public opinion, and $W_{ij}$ is the weight of one of the key phrases in $r_j$.
Finally, a key phrase with a small weight is considered unable to represent the public opinion well and is screened out. The following formula calculates the weight of a key phrase.
Figure BDA0001967365470000051
FIG. 2 depicts a schematic of the topology of an Apache Storm according to one embodiment of the present invention.
In summary, the method comprises the following steps:
step 1) first, to solve the problem of reading the Spout components in parallel, a Log Hub system receives the public opinions and sends them to the IRich Spout component;
step 2) the public opinions received by the IRich Spout are then sent to the Cluster Bolt;
step 3) the Cluster Bolt component sends the received data to the Pattern Bolt, and caches and updates the data received from the IRich Spout component before |R| is reached;
step 4) the data sent to the Pattern Bolt is used to obtain a representation model: the KPF/NKP value of each key phrase obtained is calculated step by step, the calculated values are sorted, key phrases are selected as elements of the representation model, and a new model is obtained as |R| data items are received;
step 5) the new model obtained in the previous step is updated incrementally and sent to the Increment Bolt component; the Increment Bolt component obtains the incremental public opinion representation model from the Pattern Bolt, and after the Cluster Bolt receives a public opinion, the Increment Bolt describes it with the key phrase weight vector;
step 6) the weights are calculated from the KPF/NKP values of the key phrases in the public opinion, and the Increment Bolt finally sends the representation result to the HBase Shell so as to support public opinion analysis.
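Steps 5) and 6) describe each public opinion as a vector of key phrase weights and hand the result to storage. A minimal sketch of that representation step follows, assuming the model is a mapping from key phrase to weight and using an ordinary dictionary in place of HBase. The function name `increment_bolt`, the alphabetical ordering of the vector, and the sample weights are all hypothetical.

```python
def increment_bolt(opinions, model_weights, store):
    """Describe each opinion as a weight vector over the model's key phrases.

    `opinions` maps an opinion id to its list of terms, `model_weights` maps a key
    phrase to its weight, and `store` is a dict standing in for the HBase sink.
    Returns the phrase order used for every vector.
    """
    phrases = sorted(model_weights)  # fixed order so that all vectors align
    for key, terms in opinions.items():
        present = set(terms)
        store[key] = [model_weights[p] if p in present else 0.0 for p in phrases]
    return phrases


# Hypothetical model and opinions.
model_weights = {"fire": 0.8, "burn": 0.6}
opinions = {"r1": ["explosion", "burn", "fire"], "r4": ["burn"]}

store = {}
order = increment_bolt(opinions, model_weights, store)
print(order)  # ['burn', 'fire']
print(store)  # {'r1': [0.6, 0.8], 'r4': [0.6, 0.0]}
```

Terms outside the model, such as "explosion" here, simply do not appear in the vector, which is the dimension reduction the method aims for.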
The dimension reduction method and the overall implementation process of the method are described below by a specific example.
The embodiment is built on a cloud computing platform consisting of 15 servers and including VMware ESXi 5, a 20T disk array and a 1000M network switch. All embodiments are deployed on virtual machines. On the virtual machines, a LogHub cluster comprising three nodes is configured and an Apache Storm cluster is deployed, and the embodiment runs in the designed Apache Storm topology. The public opinion set comprises 5 public opinions $\{r_1, r_2, r_3, r_4, r_5\}$. After word segmentation, public opinion $r_1$ includes the words explosion, combustion, fire and the like; $r_2$ includes heavy smoke and fire; $r_3$ includes explosion and heavy smoke; $r_4$ includes burns; and $r_5$ also includes burns. To reduce the dimension of the representation model, the words appearing in the different public opinions are first clustered. The clustering process is shown in the following table:
TABLE 1 keyword clustering step
Figure BDA0001967365470000061
On the basis of these results, the key phrases form a candidate key phrase set; the frequency of each key phrase is calculated with the key phrase weight calculation method, and the key phrases able to represent the five public opinions in the example are selected on that basis. The frequency statistics are shown in Table 2 below, and Table 3 gives the corresponding key phrase weights.
TABLE 2 keyword group frequency statistics
Figure BDA0001967365470000071
TABLE 3 Key phrase weights
Figure BDA0001967365470000072
Assume that the key phrases "fire, smoke" and "burn" in the example are retained in the final key phrase set, and that the key phrase "explosion" is deleted from the feature set.
The example shows that, because the dimension reduction of the public opinion representation model takes the same-semantics relationship between public opinions into account, eliminating the low-frequency key phrases reduces the number of key phrases from the initial 5 to 2 and effectively reduces the dimension of the public opinion model.
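The frequency side of this example can be replayed in a short Python sketch. Tables 1 to 3 survive only as image references in this text, so the numbers below come from the sketch and not from the patent. The grouping of the words into three key phrases is an assumption, in particular the placement of "combustion" and "burns" in one phrase, and only the words explicitly listed for each public opinion are used.

```python
from collections import Counter

# Words listed for the five public opinions in the example.
opinions = {
    "r1": ["explosion", "combustion", "fire"],
    "r2": ["heavy smoke", "fire"],
    "r3": ["explosion", "heavy smoke"],
    "r4": ["burns"],
    "r5": ["burns"],
}

# Assumed clustering result: key phrase -> member keywords.
clusters = {
    "fire, smoke": ["fire", "heavy smoke"],
    "burn": ["combustion", "burns"],
    "explosion": ["explosion"],
}


def kpf_per_phrase(opinions, clusters):
    """Sum the occurrence counts of each key phrase's member keywords (KPF)."""
    counts = Counter(term for terms in opinions.values() for term in terms)
    return {phrase: sum(counts[k] for k in members) for phrase, members in clusters.items()}


kpf = kpf_per_phrase(opinions, clusters)
kept = sorted(kpf, key=kpf.get, reverse=True)[:2]
print(kpf)   # {'fire, smoke': 4, 'burn': 3, 'explosion': 2}
print(kept)  # ['fire, smoke', 'burn']
```

Under these assumptions "explosion" has the lowest summed frequency, which agrees with the outcome stated in the example: the phrases "fire, smoke" and "burn" remain and "explosion" is dropped.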
Although the present invention has been described by way of preferred embodiments, the present invention is not limited to the embodiments described herein, and various changes and modifications may be made without departing from the scope of the present invention.

Claims (2)

1. A real-time public opinion analysis method based on a streaming cloud platform is characterized by comprising the following steps:
step 1, for given public opinions, on the basis of the same semantic relationship, carrying out keyword clustering by using a singular non-backtracking clustering algorithm to obtain a plurality of key phrases with the same semantics as follows:
Figure FDA0002371970320000011
wherein $syn(t_k, l_{ij})$ denotes the key phrases with the same semantics; $M$ is the number of keywords; $t_k$ is a word in the public opinion, $k$ being its serial number; the $M$ keywords and $t_k$ share a common semantic tree; $l_{ij}$ is a keyword in key phrase $l_i$; $CN(t_k, l_{ij})$ is the number of common parent nodes of $t_k$ and $l_{ij}$ in the same semantic tree; $DN(t_k, l_{ij})$ is the distance between the node of $t_k$ and its nearest parent node in the same semantic tree; $\beta$ and $\xi$ are constant dynamic influence factors, $\beta$ being taken as 1 and $\xi$ as 2;
step 2, for the clustered key phrases, measuring the ability of each keyword to represent real-time public opinion by calculating the weight of each keyword in the key phrases;
step 3, screening out redundant key phrases by screening out the keywords with lower weights, thereby reducing the dimension of the public opinion representation model, wherein the public opinion representation model is the set of key phrases that are clustered according to public opinion characteristics and used to represent public opinion;
step 4, updating the key phrases by using an Apache Storm topology to ensure that real-time public opinion is represented by the key phrases, wherein the Apache Storm topology comprises six parts, namely a Log Hub, an IRich Spout, a Cluster Bolt, a Pattern Bolt, an Increment Bolt and an HBase Shell, and the workflow is as follows:
step 1, the Log Hub receives the public opinions and sends them to the IRich Spout;
step 2, the public opinions received by the IRich Spout are sent to the Cluster Bolt;
step 3, the Cluster Bolt sends the received data to the Pattern Bolt, and caches and updates the data received from the IRich Spout;
step 4, the new data sent to the Pattern Bolt is clustered by the keyword clustering method to obtain key phrases; the KPF/NKP value of each key phrase is then calculated step by step, the calculated values are sorted, key phrases are selected as elements of the representation model, and a new model is obtained as |R| data items are received;
step 5, the new model obtained in the previous step is updated incrementally: the Increment Bolt obtains the public opinion representation model from the Pattern Bolt, and after the Cluster Bolt receives a public opinion, the Increment Bolt describes that public opinion with the key phrase weight vector;
step 6, the weights are calculated from the KPF/NKP values of the key phrases in the public opinion, and the Increment Bolt finally sends the representation result to the HBase Shell.
2. The real-time public opinion analysis method based on a streaming cloud platform as claimed in claim 1, wherein in step 3 the redundant key phrases are screened as follows: the occurrence frequency of each keyword in the public opinions is counted; redundant key phrases are screened by the value of KPF/NKP, for which KPF and NKP are counted and the weight of each key phrase is then calculated from them, the weight of a key phrase being calculated as follows:
Figure FDA0002371970320000021
wherein $W_{ij}$ is the weight of key phrase $l_i$ in real-time public opinion $r_j$; $|R|$ is the size of the public opinion set $R$; KPF is the sum of the frequencies of the keywords in a key phrase; NKP is the number of key phrases in the public opinions; $KPF(l_i, r_j)$ is the KPF value of key phrase $l_i$ in public opinion $r_j$; and $NKP(l_i)$ is the number of key phrases $l_i$ contained in the public opinions; the weight of each key phrase is calculated, the key phrases are then arranged in descending order of weight, and the top 5 percent are taken as valuable key phrases.
CN201910109044.0A 2019-02-03 2019-02-03 Real-time public opinion analysis method based on streaming cloud platform Active CN109903176B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910109044.0A CN109903176B (en) 2019-02-03 2019-02-03 Real-time public opinion analysis method based on streaming cloud platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910109044.0A CN109903176B (en) 2019-02-03 2019-02-03 Real-time public opinion analysis method based on streaming cloud platform

Publications (2)

Publication Number Publication Date
CN109903176A (en) 2019-06-18
CN109903176B (en) 2020-04-10

Family

ID=66944703

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910109044.0A Active CN109903176B (en) 2019-02-03 2019-02-03 Real-time public opinion analysis method based on streaming cloud platform

Country Status (1)

Country Link
CN (1) CN109903176B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant