CN104298765B

CN104298765B - Dynamic identification and tracking method for Internet public sentiment topics

Info

Publication number: CN104298765B
Application number: CN201410574419.8A
Authority: CN
Inventors: 陈海汉
Original assignee: Fuzhou University
Current assignee: Fuzhou University
Priority date: 2014-10-24
Filing date: 2014-10-24
Publication date: 2017-09-15
Anticipated expiration: 2034-10-24
Also published as: CN104298765A

Abstract

The present invention relates to a dynamic identification and tracking method for Internet public sentiment topics, comprising the following steps: 1. abstracting public sentiment topics into nodes, where a connecting arc between nodes indicates an association between public sentiment topics and the weight of the arc represents their degree of correlation; 2. assigning each public sentiment topic to the corresponding time slice according to its publication time, and building a dynamic evolution model of Internet public sentiment topics consisting of a topic information layer, a web page information layer and a netizen information layer; 3. performing feature extraction on newly entered web pages related to the public sentiment topics to obtain feature items, converting each web page into a multivariate vector space formed by the feature items, and calculating its topic relevance to the original public sentiment topics; 4. using incremental clustering to process the newly entered web pages in sequence, identifying new topics, and extending and updating the model with the tracked new public sentiment topics. The method helps overcome topic drift and variation during topic evolution and improves the tracking of Internet public sentiment topics.

Description

Dynamic identification and tracking method for Internet public sentiment topics

Technical Field

The invention relates to the technical field of Internet public sentiment, in particular to a dynamic identification and tracking method for Internet public sentiment topics.

Background

Network public opinion is the set of cognitions, attitudes, emotions and behavioral tendencies that the public expresses on the Internet toward a given event. Topic derivation is a main characteristic of the propagation and evolution of network public sentiment. In the decline period in particular, netizens shift their attention, the interests, appeals and needs tied to the original public sentiment topic fade, and the original topic loses vitality and is replaced by newly derived topics, which gives the public sentiment a secondary influence on society. Derived topics and original topics interweave into a dynamic derivation network that prolongs the life cycle of the original event, in particular its decline period, and makes emergency handling more difficult. Sometimes the social influence of a derived topic is far greater than that of the original event and causes great loss to society. Tracking public sentiment topics is therefore of great significance: it helps in understanding how an event develops, in preventing its unbounded derivation and spread, and in providing decision support for the emergency management of emergencies.

Research on topic identification and tracking falls mainly into three categories. The first is based on keyword matching and does not consider the semantic correlation of topics; to take the semantic information of a text into account, latent semantic analysis has been introduced to model the corpus, and topics receiving wide attention on the network are found through a two-stage clustering strategy. The second discretizes time into time points and then uses constraints on those time points to handle dynamic topic tracking over continuous time. The third extracts hot network topics with an LDA model and finds hot topics by means of time tags. Because Internet public sentiment is derivative and dynamic, it evolves in complex ways, whereas the topic models built by earlier researchers mostly focus on describing the structured text data of a topic and cannot describe its dynamic changes. In fact, besides structured text information, a public sentiment topic also includes web page link information and association information between the publishers (i.e., users) of the topic, and the temporal characteristics between topics are an important basis for describing how topics evolve. Because conventional methods lack an effective description of the dynamic process and microstructure of topic evolution, they cannot adequately reveal the evolution mechanism of public sentiment topics, and they suffer from topic drift and derivation, which cannot be ignored in the later stages of public sentiment development. Conventional Internet public sentiment topic identification and tracking methods therefore cannot meet practical application requirements.

Disclosure of Invention

The invention aims to provide a dynamic identification and tracking method for internet public sentiment topics, which is favorable for overcoming topic drift and derivation problems in topic evolution and improving the tracking effect of the internet public sentiment topics.

In order to achieve the purpose, the technical scheme of the invention is as follows: a dynamic identification and tracking method for Internet public sentiment topics comprises the following steps:

step 1: abstracting the public sentiment topics into nodes, wherein a connecting arc between nodes indicates that the public sentiment topics are associated, and the weight of the connecting arc represents the degree of correlation of the public sentiment topics;

step 2: dividing the time axis into time slices of a certain length, assigning each public sentiment topic to the corresponding time slice according to its publication time, and constructing an Internet public sentiment topic dynamic evolution model consisting of a topic information layer, a web page information layer and a netizen information layer;

step 3: performing feature extraction on the newly entered web pages related to the public sentiment topics to obtain feature items, describing each web page by the feature items whose weight is higher than the average value, converting the web page into a multivariate vector space formed by the feature items, and calculating the topic relevance between the web page and the original public sentiment topics;

step 4: identifying new topics by incremental clustering: the newly entered web pages are processed in sequence; if the topic relevance R is greater than a set threshold θ, the web page is regarded as a repeated report of an existing topic and is discarded; otherwise a new topic is considered to have appeared in the web page, and the tracked new public sentiment topic is extended and updated into the Internet public sentiment topic dynamic evolution model.

Further, in step 1, the topic information layer is an architecture of the topics corresponding to different time-series information, and is represented as:

$$
T = \begin{cases}
(t_1 \mid e_{11}, \ldots, e_{1j}, \ldots, e_{1h}), & e_{1j} \in E_1 \\
\cdots \\
(t_i \mid e_{i1}, \ldots, e_{ij}, \ldots, e_{in}), & e_{ij} \in E_i \\
\cdots \\
(t_m \mid e_{m1}, \ldots, e_{mj}, \ldots, e_{mn}), & e_{mj} \in E_m
\end{cases}
$$

wherein T is an emergency event, t_i is the corresponding time slice, e_ij is a public sentiment topic related to the emergency generated within time slice t_i and described in vector form, and E_i is the set of public sentiment topics generated within time slice t_i;

the web page information layer consists of the web page sets P = {P_1, P_2, …, P_T} corresponding to different time-series information and the sets of link relations between web pages PR = {PR_1, PR_2, …, PR_T}, wherein P_i is the set of web pages generated within time slice t_i, and PR_t is the set of link relations among the web pages in the first t time slices, in which web page p_i points to web page p_j through a link;

the netizen information layer is the set UG = {UG_1, UG_2, …, UG_T} of the information and relations of network users, wherein UG_i is the relation set of the topic discussants in the i-th time slice and includes the characteristics of netizens.

Further, in step 3, the topic relevance is calculated as follows:

the topic relevance between web pages is calculated from the link relations and the content similarity between the web pages, as shown in formula (1):

$$
R = R_L \oplus R_C \tag{1}
$$

wherein R_C is the relevance calculated from the web page content; R_L is the relevance between web page topics calculated from the link relations between web pages, with the nature of each link distinguished; ⊕ denotes that the operation between R_L and R_C is a generalized addition, i.e. the topic relevance R between web pages satisfies max(R_L, R_C) ≤ R ≤ min(1, R_L + R_C), with an adjustment factor set according to the relative importance of R_L and R_C;

the topic relevance R_L(P_a) between a newly entered web page P_a and the original public sentiment topic is calculated as shown in formula (2):

$$
R_L(P_a) = \frac{R_C(P_1) + R_C(P_2) + \cdots + R_C(P_n)}{N(a)} \tag{2}
$$

wherein R_C(P_i) is the content similarity between the newly entered web page P_a and the original web page P_i, and N(a) is the total number of links issued by the newly entered web page P_a.

Further, the topic model is updated as follows:

R^new_C(S, T) is defined as the content similarity between the Internet public opinion report corpus S and the public sentiment topic T, and represents the adjusted content similarity of a new public opinion report, as shown in formula (3):

$$
R^{\mathrm{new}}_C(S, T) = \frac{\sum r(e_t^S, e_t^T)}{N} \tag{3}
$$

wherein e_t^S represents the vector space formed after feature extraction of the public opinion reports at time t; e_t^T represents the topic existing at time t; N is the duration of the Internet public opinion report corpus S; and Σ r(e_t^S, e_t^T) represents the sum of the similarities between the topics involved in the corpus S and the topics existing in the same time slice;

R_L is adjusted mainly according to the link pointing relations between the web page of the new public opinion report and the original web pages; if the newly entered public opinion report web page P_a has a link pointing to the original topic T, R_L is adjusted according to formula (4);

（4）

R_C(P_a) is the content similarity calculated by formula (3);

after R_L and R_C of the new public opinion report have been calculated, the topic relevance R is adjusted.

Compared with the prior art, the invention has the following beneficial effects: based on the dynamic evolution characteristics and topological structure of public sentiment topics, it fully considers both the number of public sentiment topics and their change over time, addresses topic drift and derivation during topic evolution, and can markedly improve the identification and tracking of network public sentiment topics, thereby providing a decision basis for the emergency management of emergencies.

Drawings

FIG. 1 is a flow chart of an implementation of the method of the present invention.

FIG. 2 is a schematic structural diagram of a dynamic evolution model of Internet public sentiment topics in the method of the present invention.

Detailed Description

The invention is described in further detail below with reference to the figures and the embodiments.

The invention discloses a dynamic identification and tracking method of internet public sentiment topics, which comprises the following steps as shown in figure 1:

Step 1: the public sentiment topics are abstracted into nodes, a connecting arc between nodes indicates that the public sentiment topics are associated, and the weight of the connecting arc represents the degree of correlation of the public sentiment topics.

Step 2: dividing a time axis into time slices with a certain length, classifying the public sentiment topics into corresponding time slices according to the time for publishing the public sentiment topics, and constructing an internet public sentiment topic dynamic evolution model consisting of a topic information layer, a webpage information layer and a netizen information layer.

Step 3: feature extraction is performed on the newly entered web pages related to the public sentiment topics to obtain feature items, that is, several feature items for each newly entered web page; each web page is described by the feature items whose weight is higher than the average value and converted into a multivariate vector space formed by the feature items, and the topic relevance between that vector space and the original public sentiment topics is calculated.
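The feature selection in step 3 can be illustrated with a minimal sketch. The text does not specify how feature weights are computed, so the sketch assumes relative term frequency as the weight and whitespace tokenization; the function name `select_features` is hypothetical.

```python
from collections import Counter


def select_features(text: str) -> dict[str, float]:
    """Describe a page by the feature items whose weight exceeds the mean weight.

    Assumptions (not from the source): tokens are lowercase whitespace-separated
    words, and the weight of a term is its relative frequency in the page.
    """
    tokens = text.lower().split()
    if not tokens:
        return {}
    counts = Counter(tokens)
    weights = {term: n / len(tokens) for term, n in counts.items()}
    mean_weight = sum(weights.values()) / len(weights)
    # Keep only the terms weighted above the average, as step 3 describes.
    return {term: w for term, w in weights.items() if w > mean_weight}


page = "flood flood flood flood rescue city"
features = select_features(page)
print(sorted(features))
```

The resulting dictionary plays the role of the page's vector in the multivariate vector space; any other weighting scheme, such as TF-IDF, could be substituted for the term-frequency assumption.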

In a preferred embodiment of the invention, step 1 determines the mapping of the elements of the topic evolution model from an analysis of the microscopic composition and evolution characteristics of public sentiment topics: the model abstracts topics into nodes, a connecting arc between nodes indicates that the topics are associated, and the weight of the arc represents their degree of correlation. Based on the multiple kinds of information that make up a topic, the topology of the topic evolution model is a hierarchical structure in which each layer corresponds to one kind of topic information, as shown in FIG. 2.

Topic information layer: because topic evolution is chronological, the concept of time slices is introduced. Time slices are obtained by dividing the topic evolution process in time. A dynamic topic model built on time slices better reflects how a public sentiment topic evolves over time. The topic information layer is an architecture of the topics corresponding to different time-series information, and is expressed as follows:

$$
T = \begin{cases}
(t_1 \mid e_{11}, \ldots, e_{1j}, \ldots, e_{1h}), & e_{1j} \in E_1 \\
\cdots \\
(t_i \mid e_{i1}, \ldots, e_{ij}, \ldots, e_{in}), & e_{ij} \in E_i \\
\cdots \\
(t_m \mid e_{m1}, \ldots, e_{mj}, \ldots, e_{mn}), & e_{mj} \in E_m
\end{cases}
$$

wherein T is a particular emergency event, t_i is the corresponding time slice, e_ij is a public sentiment topic related to the emergency generated within time slice t_i and described in vector form, and E_i is the set of public sentiment topics generated within time slice t_i.
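Assigning topics to time slices by publication time can be sketched as follows. The slice length, the start of the time axis and the sample topics are illustrative assumptions, and `slice_index` and `group_by_slice` are hypothetical helper names.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def slice_index(published: datetime, start: datetime, length: timedelta) -> int:
    """Return the 0-based index of the time slice that contains `published`."""
    # Floor division of two timedeltas yields an integer number of slices.
    return (published - start) // length


def group_by_slice(topics, start: datetime, length: timedelta) -> dict[int, list[str]]:
    """Map each slice index i to the topics published within that slice (E_i)."""
    slices = defaultdict(list)
    for name, published in topics:
        slices[slice_index(published, start, length)].append(name)
    return dict(slices)


# Illustrative data: one-day slices starting at an assumed origin.
start = datetime(2014, 10, 1)
length = timedelta(days=1)
topics = [
    ("a", datetime(2014, 10, 1, 8)),
    ("b", datetime(2014, 10, 1, 20)),
    ("c", datetime(2014, 10, 3, 9)),
]
grouped = group_by_slice(topics, start, length)
print(grouped)
```

Slices with no topics are simply absent from the result, which keeps the structure sparse when an event goes quiet for a while.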

The web page information layer consists of the web page sets corresponding to different time-series information, P = {P_1, P_2, …, P_T}, and the sets of link relations between web pages, PR = {PR_1, PR_2, …, PR_T}. P_i is the set of web pages generated within time slice t_i, and PR_t is the set of link relations among the web pages in the first t time slices, in which web page p_i points to web page p_j through a link.

The netizen information layer is the set of information and relations of network users, UG = {UG_1, UG_2, …, UG_T}. UG_i is the relation set of the topic discussants in the i-th time slice and includes the characteristics of the netizens. The netizen information layer is added to the model because the interaction among network users plays a key role in how users' views evolve: when most users hold a negative attitude toward a user's view, that user is likely to give up the view, and when most users hold a supportive attitude toward it, that user is more likely to stick to it.
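One possible in-memory layout for the three layers is sketched below. The class and field names are hypothetical, pages are identified by plain strings, and a user relation is reduced to a pair of user names; none of these representation choices come from the source.

```python
from dataclasses import dataclass, field

Vector = dict[str, float]


@dataclass
class SliceLayers:
    """Everything recorded for one time slice t_i (illustrative layout)."""
    topics: list[Vector] = field(default_factory=list)                 # E_i
    pages: set[str] = field(default_factory=set)                       # P_i
    links: set[tuple[str, str]] = field(default_factory=set)           # (p_i, p_j)
    user_relations: set[tuple[str, str]] = field(default_factory=set)  # UG_i


@dataclass
class EvolutionModel:
    """Topic, web page and netizen layers, indexed by time slice."""
    slices: dict[int, SliceLayers] = field(default_factory=dict)

    def layer(self, i: int) -> SliceLayers:
        # Create the slice lazily the first time it is touched.
        return self.slices.setdefault(i, SliceLayers())

    def add_page(self, i: int, url: str, out_links: tuple[str, ...] = ()) -> None:
        layer = self.layer(i)
        layer.pages.add(url)
        layer.links.update((url, target) for target in out_links)

    def links_up_to(self, t: int) -> set[tuple[str, str]]:
        """Cumulative link relations over slices up to and including t (PR_t)."""
        return {link for i, layer in self.slices.items() if i <= t
                for link in layer.links}


model = EvolutionModel()
model.add_page(1, "a.example/1", out_links=("b.example/2",))
model.add_page(2, "b.example/2", out_links=("a.example/1", "c.example/3"))
print(len(model.links_up_to(1)), len(model.links_up_to(2)))
```

Storing links per slice and deriving the cumulative set on demand keeps updates cheap; a real system would likely persist these layers rather than hold them in memory.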

In a preferred embodiment of the present invention, in step 3, for the corpus of reports related to the public sentiment topic of an emergency, calculating the relevance between nodes in the topic information layer of the topic model and the public sentiment topic requires considering both the link relations and the content similarity between the node web pages. Based on the link relations and content relevance between web pages, the invention provides a method for calculating the topic relevance between web pages, as shown in formula (1):

$$
R = R_L \oplus R_C \tag{1}
$$

wherein R_C is the relevance calculated from the web page content, namely the similarity between the content space vector of the Internet news report corpus and the content space vector of the public sentiment topic. Some web page links serve only social purposes or are meant to attract attention, and the topics of the linked pages are only weakly related. If the different natures of links are ignored and link types are not distinguished, the model cannot deal effectively with topic drift, and derived topics cannot be detected effectively. R_L in formula (1) is therefore the relevance between web page topics calculated from the link relations between web pages with the nature of each link distinguished. ⊕ denotes that the operation between R_L and R_C is a generalized addition, i.e. the topic relevance R between web pages satisfies max(R_L, R_C) ≤ R ≤ min(1, R_L + R_C), with an adjustment factor set according to the relative importance of R_L and R_C.
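The source does not give a closed form for the generalized addition; the claims only bound it by max(R_L, R_C) ≤ R ≤ min(1, R_L + R_C). The sketch below is one assumed operator that satisfies this bound: a linear interpolation between the two limits, controlled by a hypothetical adjustment factor `weight` in [0, 1].

```python
def generalized_add(r_l: float, r_c: float, weight: float = 0.5) -> float:
    """Combine link relevance R_L and content relevance R_C into R.

    Assumption: R is interpolated between the lower bound max(R_L, R_C) and
    the upper bound min(1, R_L + R_C). The source only states the bounds,
    not this particular form.
    """
    for value in (r_l, r_c, weight):
        if not 0.0 <= value <= 1.0:
            raise ValueError("relevance values and weight must lie in [0, 1]")
    lower = max(r_l, r_c)
    upper = min(1.0, r_l + r_c)
    return lower + weight * (upper - lower)


r = generalized_add(0.5, 0.25, weight=0.5)
print(r)
```

With `weight=0` the result collapses to the larger of the two scores, and with `weight=1` it becomes a capped sum, so the factor expresses how much corroborating evidence from the weaker signal is allowed to raise the overall relevance.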

$$
R_L(P_a) = \frac{R_C(P_1) + R_C(P_2) + \cdots + R_C(P_n)}{N(a)} \tag{2}
$$

Because the original topic may involve multiple web pages, if the newly entered public opinion report web page has link relations with several web pages of the original reports, the topic similarity between the newly entered web page and the original topic is taken as the average of the relevance values of those original web pages. R_C(P_i) is the content similarity between the newly entered web page P_a and the web page P_i of an original report, and N(a) is the total number of links issued by the newly entered web page P_a.
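Formula (2), R_L(P_a) = (R_C(P_1) + … + R_C(P_n)) / N(a), can be sketched as below. Cosine similarity over sparse term vectors is assumed for R_C, since the text speaks of similarity between content space vectors without naming the measure; `cosine` and `link_relevance` are hypothetical names.

```python
import math

Vector = dict[str, float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two sparse vectors (assumed measure for R_C)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def link_relevance(new_page: Vector, linked_pages: list[Vector], total_links: int) -> float:
    """Formula (2): sum of R_C(P_i) over linked original pages, divided by N(a).

    `total_links` is N(a), the total number of links issued by the new page.
    """
    if total_links <= 0:
        return 0.0
    return sum(cosine(new_page, page) for page in linked_pages) / total_links


new_page = {"flood": 1.0}
original_pages = [{"flood": 2.0}, {"rescue": 1.0}]
# Assume the new page issues 4 links, 2 of which reach original report pages.
r_l = link_relevance(new_page, original_pages, total_links=4)
print(r_l)
```

Passing N(a) separately from the list of linked pages reflects the wording of the text, where the divisor is the total number of links issued by the page rather than the number of original pages it reaches.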

In a preferred embodiment of the present invention, in step 4, given the timeliness of the public opinion report corpus and the changing information stream on the network, the topic model calculates relevance from the link relations of newly entered web pages and adjusts the historical data accordingly in order to identify newly derived topics. The invention provides a topic model updating strategy based on this topic relevance adjustment. The topic model is updated as follows:

R^new_C(S, T) is defined as the content similarity between the Internet public opinion report corpus S and the public sentiment topic T, and represents the adjusted content similarity of a new public opinion report, as shown in formula (3):

$$
R^{\mathrm{new}}_C(S, T) = \frac{\sum r(e_t^S, e_t^T)}{N} \tag{3}
$$

wherein e_t^S represents the vector space formed after feature extraction of the public opinion reports at time t; e_t^T represents the topic existing at time t; N is the duration of the Internet public opinion report corpus S; and Σ r(e_t^S, e_t^T) is the sum of the similarities between the topics involved in the corpus S and the topics existing in the same time slice. Topic derivation and drift usually occur between topics that are close in time, and topics separated by a long interval are less likely to be in a derivation or subordinate relationship, so only topics in the same time slice need to be considered when calculating the topic similarity of a new public opinion report.
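The adjusted content similarity of formula (3) sums same-slice similarities between report vectors and topic vectors and divides by the duration N. The sketch below assumes that N is counted in time slices, that r is cosine similarity, and that each side has at most one vector per slice; these are simplifications, and the function names are hypothetical.

```python
import math

Vector = dict[str, float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two sparse vectors (assumed form of r)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def adjusted_content_similarity(
    report_by_slice: dict[int, Vector],
    topic_by_slice: dict[int, Vector],
    duration: int,
) -> float:
    """Formula (3): sum over t of r(e_t^S, e_t^T), divided by N.

    Only vectors that share a time slice are compared, following the remark
    that derivation and drift mostly occur between topics close in time.
    """
    if duration <= 0:
        raise ValueError("duration N must be positive")
    shared = report_by_slice.keys() & topic_by_slice.keys()
    total = sum(cosine(report_by_slice[t], topic_by_slice[t]) for t in shared)
    return total / duration


reports = {0: {"flood": 1.0}, 1: {"rescue": 1.0}}
topic = {0: {"flood": 3.0}, 1: {"election": 1.0}, 2: {"flood": 1.0}}
score = adjusted_content_similarity(reports, topic, duration=2)
print(score)
```

Slice 2 contributes nothing here because the corpus has no report in it, which is the same-slice restriction in action.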

（4）

R_C(P_a) is the content similarity calculated by formula (3).

After R_L and R_C of the new public opinion report have been calculated, the topic relevance R is adjusted according to formula (1).

To determine whether a new topic has been generated, a threshold θ is preset: when R ≤ θ, a new topic is considered to have appeared in the report; otherwise, the report is considered a repeated report of an existing topic.
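The threshold rule can be written directly as a predicate. The relevance scores and the value of θ below are made-up illustrative numbers, and `is_new_topic` is a hypothetical name.

```python
def is_new_topic(relevance: float, threshold: float) -> bool:
    """A report carries a new topic when its relevance R does not exceed θ."""
    return relevance <= threshold


# Illustrative relevance scores R for three incoming reports.
scores = {"report-1": 0.82, "report-2": 0.15, "report-3": 0.40}
theta = 0.40
new_topics = [name for name, r in scores.items() if is_new_topic(r, theta)]
print(new_topics)
```

Note that the boundary case R = θ counts as a new topic under the rule as stated, so the choice of θ directly trades missed derived topics against spurious ones.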

In a preferred embodiment of the present invention, the topic tracking method in step 4 captures the dynamic changes of public opinion reports in two ways. On the one hand, the topic information at the current moment, mainly the clustering results obtained through topic mining, is stored in the topic information layer of the model. On the other hand, the relevance of newly entered reports is calculated according to the topic model updating strategy, and the topic mining results of the tracked public opinion reports are used to extend the topic model dynamically with new information. The incremental topic clustering process amounts to a clustering algorithm over the whole report set: it clusters the report set incrementally in time-slice order and processes the report web pages in the public opinion report stream one by one. The algorithm is implemented as shown below.

The algorithm is as follows:

Input: (public opinion report set) Output: (topic set)

1 Take R_1 as the seed report, extract its features to obtain the seed topic, and initialize the topic model;

2 // R_i is the web page of a subsequent public opinion report;

3 // judge whether R_i is a report related to the original topic content;

4 if R_i is a related report, add R_i to the topic model and update the topic model;

5 // distinguish the types of the web page links issued by R_i and remove friend links and advertisement links;

6

7 // link L_j points to web page P_j, and P_j is not in the existing topic set;

8 // add web page P_j to the topic model;

9 // update the web page information layer of the topic model, adding the link information of R_i pointing to P_j;

10 // analyze the similarity of report R_i based on the link relations;

11

12 // adjust, according to formula (4), the relevance of all web pages P_j that have a link relation with report R_i;

13

14

15

16 // the relevance of report R_i exceeds the preset threshold, a new topic is considered to have appeared in public opinion report R_i, and the topic set is updated;

17 // return the tracked topic set;

18

As the algorithm shows, the topic model is adjusted continuously as new public opinion reports arrive. When an emergency occurs, the initial public opinion report is taken as the seed report, the topics it contains are the seed topics, and the topic model is built and updated step by step on that basis.

Step (1) of the algorithm is the model initialization, which determines the seed report and the seed topics. Steps (2) to (4) judge whether a newly entered report is related to the seed report and, if so, add the report to the topic model and update the report set. Steps (5) to (13) calculate, based on the link relations, the relevance of the report and of the web pages its links point to, and update the topic model according to the result. Steps (14) to (15) judge whether a new topic has been generated in the report, and the topic set within a given time slice is finally returned.
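The overall control flow can be approximated by a single-pass incremental clustering loop. This is a simplified sketch rather than the patented algorithm: it uses content similarity only (no link layer and no formula (4) adjustment), assumes cosine similarity, and merges a repeated report into its topic by adding term weights. It applies the rule that R ≤ θ signals a new topic; all names are hypothetical.

```python
import math

Vector = dict[str, float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two sparse vectors (assumed relevance measure)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def incremental_cluster(reports: list[Vector], threshold: float) -> list[Vector]:
    """Process reports in arrival order and return the tracked topic vectors."""
    if not reports:
        return []
    topics = [dict(reports[0])]  # the first report seeds the first topic
    for vec in reports[1:]:
        scores = [cosine(vec, topic) for topic in topics]
        best = max(range(len(topics)), key=scores.__getitem__)
        if scores[best] <= threshold:
            # Relevance to every known topic is low: record a new topic.
            topics.append(dict(vec))
        else:
            # Repeated report: fold its terms into the closest topic.
            for term, weight in vec.items():
                topics[best][term] = topics[best].get(term, 0.0) + weight
    return topics


stream = [
    {"flood": 1.0, "rescue": 1.0},  # seed report
    {"flood": 1.0},                 # repeats the seed topic
    {"election": 1.0},              # unrelated, becomes a new topic
]
tracked = incremental_cluster(stream, threshold=0.3)
print(len(tracked))
```

Because each report is compared only with the topics found so far, the result depends on arrival order, which matches the time-ordered processing the text describes.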

The above are preferred embodiments of the present invention. All changes made according to the technical scheme of the present invention fall within its scope of protection, provided the functional effects they produce do not exceed the scope of that technical scheme.

Claims

1. A dynamic identification and tracking method for Internet public sentiment topics is characterized by comprising the following steps:

step 4: identifying new topics by incremental clustering: the newly entered web pages are processed in sequence; if the topic relevance R is greater than a set threshold θ, the web page is regarded as a repeated report of an existing topic and is discarded; otherwise a new topic is considered to have appeared in the web page, and the tracked new public sentiment topic is extended and updated into the Internet public sentiment topic dynamic evolution model;

in step 2, the topic information layer is an architecture of the topics corresponding to different time-series information, and is represented as:

$$
T = \begin{cases}
(t_1 \mid e_{11}, \ldots, e_{1j}, \ldots, e_{1h}), & e_{1j} \in E_1 \\
\cdots \\
(t_i \mid e_{i1}, \ldots, e_{ij}, \ldots, e_{in}), & e_{ij} \in E_i \\
\cdots \\
(t_m \mid e_{m1}, \ldots, e_{mj}, \ldots, e_{mn}), & e_{mj} \in E_m
\end{cases}
$$

wherein T is an emergency event, t_i is the corresponding time slice, e_ij is a public sentiment topic related to the emergency generated within time slice t_i and described in vector form, and E_i is the set of public sentiment topics generated within time slice t_i;

the web page information layer consists of the web page sets P = {P_1, P_2, …, P_T} corresponding to different time-series information and the sets of link relations between web pages PR = {PR_1, PR_2, …, PR_T}, wherein P_i is the set of web pages generated within time slice t_i, and PR_t is the set of link relations among the web pages in the first t time slices, in which web page p_i points to web page p_j through a link;

the netizen information layer is the set UG = {UG_1, UG_2, …, UG_T} of the information and relations of network users, wherein UG_i is the relation set of the topic discussants in the i-th time slice and includes the characteristics of netizens;

in step 3, the topic relevance is calculated as follows:

$$
R = R_L \oplus R_C \tag{1}
$$

wherein R_C is the relevance calculated from the web page content; R_L is the relevance between web page topics calculated from the link relations between web pages, with the nature of each link distinguished; ⊕ denotes that the operation between R_L and R_C is a generalized addition, namely that the topic relevance R between web pages satisfies max(R_L, R_C) ≤ R ≤ min(1, R_L + R_C), with an adjustment factor set according to the relative importance of R_L and R_C;

the topic relevance R_L(P_a) between a newly entered web page P_a and the original public sentiment topic is calculated as shown in formula (2):

$$
R_L(P_a) = \frac{R_C(P_1) + R_C(P_2) + \cdots + R_C(P_n)}{N(a)} \tag{2}
$$

wherein R_C(P_i) is the content similarity between the newly entered web page P_a and the original web page P_i, and N(a) is the total number of links issued by the newly entered web page P_a;

updating the topic model as follows:

R^new_C(S, K) is defined as the content similarity between the Internet public opinion report corpus S and the public sentiment topic K, and represents the adjusted content similarity of a new public opinion report, as shown in formula (3):

$$
R^{\mathrm{new}}_C(S, K) = \frac{\sum r(e_t^S, e_t^K)}{N} \tag{3}
$$

wherein e_t^S represents the vector space formed after feature extraction of the public opinion reports at time t; e_t^K represents the topic existing at time t; N is the duration of the Internet public opinion report corpus S; and Σ r(e_t^S, e_t^K) represents the sum of the similarities between the topics involved in the corpus S and the topics existing in the same time slice;

R_L is adjusted mainly according to the link pointing relations between the web page of the new public opinion report and the original web pages; if the newly entered public opinion report web page P_a has a link pointing to the original topic K, R_L is adjusted according to formula (4);

R_C(P_a) is the content similarity calculated by formula (3);

after R_L and R_C of the new public opinion report have been calculated, the topic relevance R is adjusted.