CN103678670B - Micro-blog hot word and hot topic mining system and method

Info

Publication number: CN103678670B
Application number: CN201310725400.4A
Authority: CN
Inventors: 陈羽中; 郭文忠; 陈国龙; 方明月
Original assignee: Fuzhou University
Current assignee: Fuzhou University
Priority date: 2013-12-25
Filing date: 2013-12-25
Publication date: 2017-01-11
Anticipated expiration: 2033-12-25
Also published as: CN103678670A

Abstract

The invention relates to the technical field of social networks, in particular to a micro-blog hot word and hot topic mining system and method. The method includes the following steps that content data released in a micro-blog are preprocessed to acquire a candidate hot word sequence; according to the frequency of occurrence and suddenness of candidate hot words in a candidate hot word set at the current moment and in a given historical time window, the vitality of each candidate hot word is worked out, and a hot word set is formed by screening the candidate hot words; according to the hot word set formed by screening the candidate hot words, hot word correlation is worked out, and a hot word co-occurrence network is constructed; according to the hot word co-occurrence network, the hot word set is partitioned through the hot word clustering algorithm based on multi-label propagation to acquire a hot topic set. By means of the micro-blog hot word and hot topic mining system and method, efficient micro-blog hot word and hot topic mining is achieved, and mining precision and processing efficiency are improved.

Description

Microblog hotword and hot topic mining system and method

Technical Field

The invention relates to the technical field of social networks, in particular to a microblog hotword and hot topic mining system and method.

Background

With the rise of microblogs, public participation keeps growing: users can publish news they have witnessed at any time and from any place through computers and mobile phones, and share it instantly. Microblogs have become an internet fashion and an important place where hot topics are generated and discussed. A hot topic is a topic that appears frequently on the network within a period of time and is widely followed and discussed. With microblog information growing exponentially, effectively mastering massive information and extracting hot topics has become a problem to be solved urgently.

For hot topic detection, the traditional method is to cluster texts. However, text clusters do not let a user identify hot topics at a glance, and because microblogs are short texts whose data are sparse and unevenly distributed, text clustering finds hot topics poorly. The mainstream method is therefore to discover hot topics by extracting hot words and clustering them.

The classical methods used to weigh word importance and extract hot words include TFIDF and TFPDF. The main idea of TFIDF is that raw frequency of occurrence alone does not sufficiently represent text features: words such as "yes" and "mare" occur frequently but have little ability to represent a text. However, TFIDF is not suitable for weighting words in microblogs. Microblogs are short texts, so a word is rarely repeated within one microblog; and once a hot topic appears, users forward and discuss it widely, so a large number of microblogs contain the same keywords. Because TFIDF lowers the weight of words that appear in many documents, extracting keywords with TFIDF loses important words to some extent. Scholars have therefore proposed TFPDF, which gives higher weight to words that appear in many documents. TFPDF helps extract keywords related to hot topics, but it also extracts words that appear frequently yet have no ability to express a topic. Hot words are words whose frequency increases sharply within a period of time; neither method considers how a word's distribution changes over time, so neither is well suited to hot word extraction.

For hot word clustering, the existing methods include: 1) adopting a Bisecting K-mean clustering algorithm insensitive to the initial cluster; 2) clustering is carried out by constructing a word similarity matrix and utilizing an Affinity Propagation algorithm under the condition of not needing to appoint the number of clusters, and the time complexity is close to that of the clustering; 3) density clustering based algorithms such as DBSCAN; 4) hierarchical clustering algorithms, and the like.

For the problem of finding hot spots of massive microblog data, the existing hot word clustering method mainly has the following problems: firstly, the words related to different topics in the clustering result are not allowed to have an intersection, which is not consistent with the actual situation, and thus some topics are not found or the recognition degree of the topics is very low. For example, in two topics, namely "expense problem in colleges and universities" and "leaderboard in colleges and universities", the term "colleges and universities" can only belong to one topic at most, and any one of the two topics lacks the keyword "colleges and universities", so that the original topic is difficult to identify. In addition, the time complexity of the traditional clustering algorithm is high, and the requirement of clustering mass microblog data is difficult to adapt to.

In summary, a relatively perfect technology and method have appeared for analyzing the influence of user individuals in a social network, but the methods for analyzing the influence at the level of communities in the social network are relatively few, and lack of comprehensive analysis and evaluation on the influence of each community in the social network, and the existing methods are difficult to meet the requirements in terms of analysis effect and efficiency in the face of a large-scale social network scenario.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides a microblog hot word and hot topic mining system and method, which are beneficial to improving the accuracy and processing efficiency of microblog hot spot discovery.

In order to achieve this purpose, the technical scheme of the invention is as follows: a microblog hotword and hot topic mining system comprising a preprocessing module, a hot word screening module, a hot word co-occurrence network construction module and a hot word clustering module;

the preprocessing module is used for preprocessing content data published in the social network to obtain candidate hot words and construct a candidate hot word set;

the hot word screening module is used for calculating the vitality of each candidate hot word according to the occurrence frequency and the burstiness of each candidate hot word in the candidate hot word set in the current time and a given historical time window, screening out the hot words and constructing a hot word set according to the vitality of each candidate hot word;

the hot word co-occurrence network construction module is used for calculating the correlation of each hot word in the hot word set and constructing a hot word co-occurrence network;

and the hot word clustering module is used for dividing the hot word set by using a hot word clustering algorithm based on multi-label propagation according to the hot word co-occurrence network to obtain a hot topic set.

The invention also provides a method for mining microblog hotwords and hot topics, which comprises the following steps:

step A: preprocessing content data published in a social network to obtain candidate hot words, and constructing a candidate hot word set according to the candidate hot words;

step B: calculating the vitality of each candidate hot word according to the occurrence frequency and the burstiness of each candidate hot word in the candidate hot word set at the current moment and in a given historical time window, screening out the hot words, and constructing a hot word set according to the vitality of each candidate hot word;

step C: calculating the correlation of each hot word in the hot word set, and constructing a hot word co-occurrence network according to the correlation;

step D: according to the hot word co-occurrence network, dividing the hot word set by using a hot word clustering algorithm based on multi-label propagation to obtain a hot topic set.

Further, in the step B, the process of screening hot words and constructing a hot word set specifically includes the following steps:

step B1: calculating the nutritional value of each candidate hot word in the time period t; the nutritional value Nutr_{w,t} of a candidate hot word w is the sum of the contributions of each microblog in the microblog set tw^t of the time period t to the nutritional value of w, and the calculation formula is as follows:

Nutr_{w,t} = \sum_{j \in tw^{t}} Contr_{w,j}

wherein Contr_{w,j} is the contribution of the j-th microblog to the nutritional value of the candidate hot word w in the time period t, j ∈ tw^t, and the calculation formula is as follows:

Contr_{w,j} = \frac{tf_{w,j}}{tf_{j}^{\max}}

wherein tf_{w,j} represents the number of times the candidate hot word w appears in the j-th microblog, and tf_{j}^{\max} represents the maximum word frequency in the j-th microblog;

step B2: using the burst value of the candidate hot word w to describe how sharply the word frequency of w changes between the current time period and the historical time periods; the burst value B_{w,t} of the candidate hot word w is calculated as follows: take the k historical time windows before the time period t, each of the same size as the time period t; then, based on a binomial-distribution discrete event model, count the number of microblogs containing the candidate hot word w in the time period t and in the k historical time windows before it, and use the χ² statistic to calculate the burst value of w in the time period t, with the following formula:

B_{w,t} = \frac{(A + B + C + D) (A D - B C)^{2}}{(A + B) (C + D) (A + C) (B + D)}

wherein A represents the number of microblogs containing the candidate hot word w in the time period t; B represents the average number of microblogs containing w in the k historical time windows; C represents the number of microblogs not containing w in the time period t; D represents the average number of microblogs not containing w in the k historical time windows;

step B3: calculating the vitality value of each candidate hot word by combining its nutritional value and its burst value; the normalized vitality value life_{w,t} of the candidate hot word w is calculated as follows:

life_{w,t} = \frac{B_{w,t} * Nutr_{w,t}}{\max_{w' \in terms} (B_{w',t} * Nutr_{w',t})}

wherein terms represents the candidate hot word set, and w' represents an element of the candidate hot word set terms;

step B4: sorting the candidate hot words in the candidate hot word set according to their vitality values, selecting the L top-ranked candidate hot words as hot words, and forming the hot word set from these L candidate hot words.
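The screening steps B1 to B4 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names, the representation of each microblog as a list of tokens, and the choice to return a burst value of 0 when the χ² denominator is zero are all assumptions made for the example.

```python
from collections import Counter


def nutrition(word, posts):
    """Nutr_{w,t}: sum over posts of tf_{w,j} / (max term frequency in post j)."""
    total = 0.0
    for tokens in posts:
        counts = Counter(tokens)
        if counts:
            total += counts[word] / max(counts.values())
    return total


def burst(a, b, c, d):
    """Chi-square burst value computed from the counts A, B, C, D of step B2."""
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # assumption: a degenerate table is treated as "no burst"
    return (a + b + c + d) * (a * d - b * c) ** 2 / denom


def vitality(posts_now, history_windows):
    """Return {word: life_{w,t}}, normalized by the largest B * Nutr product.

    history_windows must contain at least one window (k >= 1).
    """
    k = len(history_windows)
    vocab = {w for tokens in posts_now for w in tokens}
    raw = {}
    for w in vocab:
        a = sum(1 for tokens in posts_now if w in tokens)
        c = len(posts_now) - a
        # B and D are averages over the k historical windows.
        b = sum(1 for win in history_windows for tokens in win if w in tokens) / k
        d = sum(len(win) for win in history_windows) / k - b
        raw[w] = burst(a, b, c, d) * nutrition(w, posts_now)
    top = max(raw.values(), default=0.0)
    return {w: (v / top if top > 0 else 0.0) for w, v in raw.items()}


def top_hot_words(life, L):
    """Step B4: keep the L candidates with the highest vitality values."""
    return sorted(life, key=life.get, reverse=True)[:L]
```

Note that the χ² statistic is symmetric, so in this sketch a word whose frequency drops sharply also receives a high burst value; the source does not state whether such words are filtered separately.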

Further, in the step C, the correlation c_{z,k} of hot word z and hot word k within a given time period t is defined as:

c_{z,k} = \log \frac{r_{z,k} / (R_{k} - r_{z,k})}{(n_{z} - r_{z,k}) / (N - n_{z} - R_{k} + r_{z,k})} \times \left| \frac{r_{z,k}}{R_{k}} - \frac{n_{z} - r_{z,k}}{N - R_{k}} \right|

wherein r_{z,k} represents the number of microblogs containing both hot word z and hot word k, n_z represents the number of microblogs containing hot word z, R_k represents the number of microblogs containing hot word k, and N represents the number of all microblogs in the time period t, i.e. N = |tw^t|;

The hot word co-occurrence network is defined as G(V, E, W), wherein V = {v_1, v_2, ..., v_m} is the node set representing the hot word set obtained in the step B, and m represents the number of nodes; E represents the set of edges between nodes: for any two nodes v_i, v_j ∈ {v_1, v_2, ..., v_m}, if the words represented by the two nodes have a co-occurrence relationship, an edge e_{i,j} ∈ E is constructed between the two nodes; W represents the mapping from the edge set E to the set of real numbers R: if there is an edge e_{i,j} ∈ E between v_i and v_j, the edge weight is the similarity sim(i, j) between the i-th hot word and the j-th hot word, defined as:

sim(i, j) = c_{i,j} / \sqrt{\sum_{i,j \in V, i \neq j} c_{i,j}^{2}}
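A possible way to compute the correlation and the edge weights of step C is sketched below in Python. The function names and the token-list representation of microblogs are hypothetical. Two points are assumptions not stated in the source: pairs for which any cell of the underlying 2×2 count table is zero are skipped (the source gives no smoothing rule), and because the correlation is not symmetric in z and k, weights are stored per ordered pair and the normalizing sum runs over all ordered pairs that have an edge.

```python
import math
from itertools import permutations


def correlation(r_zk, n_z, R_k, N):
    """c_{z,k} from the co-occurrence counts; None when a term is undefined."""
    a = r_zk                      # posts containing both z and k
    b = R_k - r_zk                # posts containing k but not z
    c = n_z - r_zk                # posts containing z but not k
    d = N - n_z - R_k + r_zk      # posts containing neither
    if min(a, b, c, d) <= 0:
        return None  # assumption: skip pairs with an empty cell
    odds_ratio = (a / b) / (c / d)
    return math.log(odds_ratio) * abs(r_zk / R_k - (n_z - r_zk) / (N - R_k))


def build_network(posts, hot_words):
    """Return {(z, k): sim} for every ordered pair of co-occurring hot words."""
    post_sets = [set(tokens) for tokens in posts]
    N = len(post_sets)
    count = {w: sum(1 for s in post_sets if w in s) for w in hot_words}
    c = {}
    for z, k in permutations(hot_words, 2):
        r = sum(1 for s in post_sets if z in s and k in s)
        if r == 0:
            continue  # no co-occurrence, hence no edge
        value = correlation(r, count[z], count[k], N)
        if value is not None:
            c[(z, k)] = value
    norm = math.sqrt(sum(v * v for v in c.values()))
    return {pair: v / norm for pair, v in c.items()} if norm > 0 else {}
```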

further, in the step D, each hot word in the hot word set, that is, each node has a label membership set, and the label membership set of the node is updated in each iteration until the algorithm converges, specifically including the following steps:

step D1: initializing a label of a node according to the hot word co-occurrence network;

step D2: randomly selecting a node v whose labels have not yet been updated, traversing the neighbor nodes of the node v, updating the membership degree of each label in the label set of the node v according to the label sets of its neighbor nodes, and normalizing the label membership degrees of the node v;

step D3: repeating iteration until an iteration termination condition is met;

step D4: and classifying the nodes according to the label membership set of the nodes obtained by iteration to obtain a hot topic set.

Further, in step D1, the method for initializing the labels is: each node is assigned a unique label number and belongs to that label with a membership degree of 1.0; the set of unique label numbers is denoted uniqueLabels.

Further, in step D2, the update rule of the label membership degree is: randomly select a node v whose labels have not yet been updated, obtain the neighbor node set Nb(v) of the node, and from it the label set labels owned by the neighbor nodes; then at the h-th iteration, the membership degree with which the node v belongs to label number lb_i ∈ labels is:

b_{h}(lb_{i}, v) = \frac{\sum_{u \in Nb(v)} sim(u, v) * b_{h-1}(lb_{i}, u)}{\sum_{u \in Nb(v)} sim(u, v)}

wherein sim(u, v) represents the similarity between node u and node v, and the denominator \sum_{u \in Nb(v)} sim(u, v) normalizes the label membership degrees, ensuring that the label membership degrees of node v sum to 1.
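The initialization of step D1 and the membership update of step D2 can be illustrated with the following Python sketch. The names `initialize_labels` and `update_node`, the dictionary-based graph representation, and the decision to leave a node unchanged when the total weight of its neighbors is zero are assumptions made for the example.

```python
def initialize_labels(nodes):
    """Step D1: every node starts with its own unique label, membership 1.0."""
    return {v: {v: 1.0} for v in nodes}


def update_node(v, neighbors, sim, memberships):
    """Step D2: weighted average of the neighbors' label memberships.

    neighbors:   {node: list of neighbor nodes}
    sim:         {(u, v): similarity of the edge from u to v}
    memberships: {node: {label: membership degree from the previous iteration}}
    """
    total_weight = sum(sim[(u, v)] for u in neighbors[v])
    if total_weight == 0:
        return dict(memberships[v])  # assumption: isolated nodes keep their labels
    updated = {}
    for u in neighbors[v]:
        for label, degree in memberships[u].items():
            updated[label] = updated.get(label, 0.0) + sim[(u, v)] * degree
    # Dividing by the total weight makes the memberships of v sum to 1,
    # provided each neighbor's memberships already sum to 1.
    return {label: value / total_weight for label, value in updated.items()}
```

For example, a node with two neighbors connected by weights 1.0 and 3.0 receives the first neighbor's label with membership 0.25 and the second neighbor's label with membership 0.75.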

Further, in the step D3, the iteration termination condition is:

whereinr _hIs defined as:

When m_h == m_{h-1}, the iteration ends.

Compared with the prior art, the invention has the beneficial effects that: calculating the vitality of each candidate hot word according to the occurrence frequency and the burstiness of the candidate hot words in the current time and a given historical time window, screening out a hot word set, calculating the hot word correlation according to the screened hot word set, constructing a hot word co-occurrence network, and dividing the hot word set by using a hot word clustering algorithm of multi-label propagation to obtain a hot topic set. The system and the method can realize the efficient mining of the hot topics of the social network, and improve the topic detection precision and the processing efficiency.

Drawings

FIG. 1 is a schematic block diagram of the system of the present invention.

FIG. 2 is a flow chart of the method of the present invention.

FIG. 3 is a flow chart of implementation of microblog hotword clustering in the method of the present invention.

Detailed Description

The invention is further described with reference to the following figures and specific embodiments.

FIG. 1 is a schematic diagram of a module structure of a microblog hotword and hot topic mining system. As shown in fig. 1, the system includes: the system comprises a preprocessing module 100, a hot word screening module 200, a hot word co-occurrence network construction module 300 and a hot word clustering module 400.

The preprocessing module 100 is configured to preprocess content data published in a social network to obtain candidate hot words, and to construct a candidate hot word set from them. The hot word screening module 200 is configured to calculate the vitality of each candidate hot word according to its occurrence frequency and burstiness at the current time and in a given historical time window, to screen out the hot words, and to construct a hot word set according to the vitality of each candidate hot word. The hot word co-occurrence network construction module 300 is configured to calculate the correlation of each hot word in the hot word set and to construct a hot word co-occurrence network based on the correlation. The hot word clustering module 400 is configured to divide the hot word set by using a hot word clustering algorithm based on multi-label propagation according to the hot word co-occurrence network, so as to obtain a hot topic set.

FIG. 2 is a flowchart of a microblog hotword and hot topic mining method according to the invention. As shown in fig. 2, the method comprises the steps of:

step A: preprocessing content data published in the social network to obtain candidate hot words, and constructing a candidate hot word set according to the candidate hot words.

Specifically, ICTCLAS, the word segmentation system of the Chinese Academy of Sciences, can be used for word segmentation and part-of-speech tagging. Nouns and verbs, which have strong expressive capacity for topics, are extracted and then further filtered with a stop word list to obtain the candidate hot word set, denoted terms, where r indicates the number of candidate words.
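The filtering part of step A can be sketched in Python as follows. The sketch assumes the microblogs have already been segmented and tagged by an external tool and are supplied as lists of (word, tag) pairs; it does not call any segmenter itself. The function name, the tag prefixes "n" and "v" for nouns and verbs, and the English sample tokens are assumptions made for illustration.

```python
def candidate_hot_words(tagged_posts, stop_words, keep_prefixes=("n", "v")):
    """Keep nouns and verbs that are not stop words.

    tagged_posts: list of posts, each a list of (word, pos_tag) pairs.
    Returns the candidate hot word set and the filtered token list per post.
    """
    terms = set()
    filtered_posts = []
    for post in tagged_posts:
        kept = [
            word
            for word, tag in post
            if tag.startswith(keep_prefixes) and word not in stop_words
        ]
        filtered_posts.append(kept)
        terms.update(kept)
    return terms, filtered_posts
```

The filtered token lists returned here are the per-microblog input that the later vitality and co-occurrence calculations operate on.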

Step B: calculating the vitality of each candidate hot word according to the occurrence frequency and the burstiness of each candidate hot word in the candidate hot word set at the current moment and in a given historical time window, screening out the hot words, and constructing the hot word set according to the vitality of each candidate hot word.

In the step B, the process of screening hot words and constructing a hot word set specifically includes the following steps:

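step B1: calculating the nutritional value of each candidate hot word in the time period t; the nutritional value Nutr_{w,t} of a candidate hot word w is the sum of the contributions of each microblog in the microblog set tw^t of the time period t to the nutritional value of w, and the calculation formula is as follows:

Nutr_{w,t} = \sum_{j \in tw^{t}} Contr_{w,j}, \quad Contr_{w,j} = \frac{tf_{w,j}}{tf_{j}^{\max}}

wherein tf_{w,j} represents the number of times the candidate hot word w appears in the j-th microblog, and tf_{j}^{\max} represents the maximum word frequency in the j-th microblog;

step B2: calculating the burst value B_{w,t} of the candidate hot word w in the time period t with the χ² statistic over the time period t and the k historical time windows before it:

B_{w,t} = \frac{(A + B + C + D) (A D - B C)^{2}}{(A + B) (C + D) (A + C) (B + D)}

wherein A represents the number of microblogs containing the candidate hot word w in the time period t; B represents the average number of microblogs containing w in the k historical time windows; C represents the number of microblogs not containing w in the time period t; D represents the average number of microblogs not containing w in the k historical time windows;

step B3: calculating the normalized vitality value life_{w,t} of the candidate hot word w by combining its nutritional value and its burst value:

life_{w,t} = \frac{B_{w,t} * Nutr_{w,t}}{\max_{w' \in terms} (B_{w',t} * Nutr_{w',t})}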

Specifically, after the vitality values of the candidate hot words are obtained, the candidate hot words can be sorted in descending order of vitality value using the quicksort algorithm, and, according to a given threshold M, the M candidate hot words with the highest vitality values are selected as the hot words of the time period t.

Step C: calculating the correlation of each hot word in the hot word set, and constructing a hot word co-occurrence network according to the correlation.

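In the step C, the correlation c_{z,k} of hot word z and hot word k within a given time period t is defined as:

c_{z,k} = \log \frac{r_{z,k} / (R_{k} - r_{z,k})}{(n_{z} - r_{z,k}) / (N - n_{z} - R_{k} + r_{z,k})} \times \left| \frac{r_{z,k}}{R_{k}} - \frac{n_{z} - r_{z,k}}{N - R_{k}} \right|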

The hot word clustering algorithm based on multi-label propagation rests on the following observation: a vocabulary co-occurrence network constructed from human language or text documents exhibits high clustering and short path lengths. A topic can therefore be regarded as a set of nodes (words) that are densely connected internally and sparsely linked externally, which matches the definition of a community in a complex network. Moreover, because different topics may share keywords, the topic discovery problem can be converted into the problem of dividing a word co-occurrence network into overlapping word communities. Multi-label means that one node is allowed to carry several community labels and belong to several hot word communities, that is, one hot word may belong to several topics. Each label carries a membership degree. During label propagation the labels and membership degrees of the nodes are updated, the label set of each node is pruned according to a set threshold, and finally the nodes are divided into communities (hot topics) according to the labels each node owns.

In the step D, each hot word in the hot word set, that is, each node has a label membership set, and the label membership set of the node is updated in each iteration until the algorithm converges. Fig. 3 is a flowchart of the implementation of step D in the method of the present invention, which specifically includes the following steps:

step D1: initializing labels of nodes (hot words) according to the hot word co-occurrence network;

in step D1, the method for initializing the labels is: each node is assigned a unique label number and belongs to that label with a membership degree of 1.0; the set of unique label numbers is denoted uniqueLabels.

in step D2, the update rule of the label membership degree is: randomly select a node v whose labels have not yet been updated, obtain the neighbor node set Nb(v) of the node, and from it the label set labels owned by the neighbor nodes; then at the h-th iteration, the membership degree with which the node v belongs to label number lb_i ∈ labels is:

b_{h}(lb_{i}, v) = \frac{\sum_{u \in Nb(v)} sim(u, v) * b_{h-1}(lb_{i}, u)}{\sum_{u \in Nb(v)} sim(u, v)}

Step D3: filtering the label set of the node v according to a given threshold p, and then normalizing the membership values of the retained labels again;

specifically, step D3 requires a given parameter p. During iteration, the label set of each node whose label memberships have been updated is filtered so that only some labels are retained, which prevents the label set of a node from growing too large. The value of p represents the maximum number of labels a node is allowed to have, and the filtering rule is: delete from the node's label membership set every label whose membership degree is lower than 1/p. Since the membership degrees of a node sum to 1, at most p labels can reach 1/p. The label set obtained after filtering is then normalized again so that the membership degrees of all labels of the node sum to 1.
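The filtering rule of step D3 can be sketched in Python as follows. The function name is hypothetical, and one behavior is an assumption that the source does not specify: if every label falls below 1/p, the sketch keeps the single label with the highest membership so that the node is never left without a label.

```python
def filter_labels(membership, p):
    """Drop labels with membership below 1/p, then renormalize to sum to 1.

    membership: {label: membership degree} for one node.
    p:          maximum number of labels a node may keep.
    """
    threshold = 1.0 / p
    kept = {label: d for label, d in membership.items() if d >= threshold}
    if not kept:
        # Assumption: fall back to the strongest label (first one on ties).
        best = max(membership, key=membership.get)
        kept = {best: membership[best]}
    total = sum(kept.values())
    return {label: d / total for label, d in kept.items()}
```

With p = 4 the threshold is 0.25, so a node with memberships 0.6, 0.3 and 0.1 keeps the first two labels, which are rescaled to 2/3 and 1/3.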

Step D4: repeating iteration until an iteration termination condition is met;

in the step D4, the iteration termination condition is: under the condition that the generated label sets in two adjacent iterations are identical, if the number of internal nodes of each label in the history record is not changed any more, the iteration is ended, namely:

whereinr _hIs defined as:

When m_h == m_{h-1}, the iteration ends.
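The formulas for the termination condition are not reproduced in this text, so the following Python sketch is only one possible reading of the verbal description of step D4: the number of nodes carrying each label is counted after every iteration; while the label set stays the same, the per-label minimum of these counts is tracked; and iteration stops when that record no longer changes between two consecutive iterations. The function names and this interpretation of m_h are assumptions, not the patent's definition.

```python
def label_sizes(memberships):
    """Count, for each label, how many nodes currently carry it."""
    sizes = {}
    for labels in memberships.values():
        for label in labels:
            sizes[label] = sizes.get(label, 0) + 1
    return sizes


def next_minimum(prev_min, sizes):
    """Assumed m_h: per-label minimum node count since the label set last changed."""
    if prev_min is not None and set(prev_min) == set(sizes):
        return {label: min(prev_min[label], sizes[label]) for label in sizes}
    return dict(sizes)  # label set changed: restart the record


def should_stop(prev_min, current_min):
    """Stop when the record is unchanged between consecutive iterations."""
    return prev_min is not None and prev_min == current_min
```

In a propagation loop, `label_sizes` would be called on the memberships after each full pass over the nodes, and the loop would exit as soon as `should_stop` returns True.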

Step D5: and classifying the nodes (hot words) according to the label membership set of the nodes obtained by iteration to obtain a hot topic set.

Specifically, after iteration is finished, the label set of each node is detected, the nodes (hotwords) are divided into corresponding categories (communities), and according to a given threshold value M, each category (community) only needs to take M hotwords with the top ranking of life values to express the corresponding hot topics. M defaults to 10.
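The final classification step can be illustrated with the Python sketch below. It groups hot words by the labels they retain after iteration and keeps the M words with the highest vitality values per topic. The function name and the data layout are hypothetical; the default M = 10 follows the text above.

```python
def hot_topics(memberships, life, M=10):
    """Group hot words by label and keep the top M words per topic.

    memberships: {hot word: {label: membership degree}} after iteration.
    life:        {hot word: vitality value}.
    """
    communities = {}
    for word, labels in memberships.items():
        for label in labels:
            communities.setdefault(label, []).append(word)
    return {
        label: sorted(words, key=lambda w: life[w], reverse=True)[:M]
        for label, words in communities.items()
    }
```

Because a word is added to every community whose label it carries, a shared keyword such as "university" can appear in both a tuition topic and a ranking topic, which is the overlapping behavior the multi-label approach is meant to provide.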

According to the microblog hot topic detection system and method, the occurrence frequency and the burst of words are comprehensively considered, a novel word life value calculation model is designed for hot word extraction, then a word co-occurrence network is constructed, and hot word clustering is carried out based on multi-label propagation close to linear time complexity to obtain the hot topic. In conclusion, the system and the method can effectively extract the hot words and the hot topics, and greatly improve the precision and the time efficiency of hot topic detection.

The above are preferred embodiments of the present invention. Any change made according to the technical scheme of the present invention whose functional effect does not exceed the scope of that technical scheme falls within the protection scope of the present invention.

Claims

1. A microblog hotword and hot topic mining method is characterized by comprising the following steps:

step D: according to the hot word co-occurrence network, dividing the hot word set by using a hot word clustering algorithm based on multi-label propagation to obtain a hot topic set;

step B1: calculating the nutritional value of each candidate hot word in the time period t; the nutritional value Nutr_{w,t} of a candidate hot word w is the sum of the contributions of each microblog in the microblog set tw^t of the time period t to the nutritional value of w, and is calculated by the following formula:

Nutr_{w,t} = \sum_{j \in tw^{t}} Contr_{w,j}

wherein Contr_{w,j} is the contribution of the j-th microblog to the nutritional value of the candidate hot word w in the time period t, j ∈ tw^t, and the calculation formula is as follows:

Contr_{w,j} = \frac{tf_{w,j}}{tf_{j}^{\max}}

wherein tf_{w,j} represents the number of times the candidate hot word w appears in the j-th microblog, and tf_{j}^{\max} represents the maximum word frequency in the j-th microblog;

step B2: using the burst value of the candidate hot word w to describe how sharply the word frequency of w changes between the current time period and the historical time periods; the burst value B_{w,t} of the candidate hot word w is calculated as follows: taking the k historical time windows before the time period t, each of the same size as the time period t, counting, based on a binomial-distribution discrete event model, the number of microblogs containing the candidate hot word w in the time period t and in the k historical time windows before it, and using the χ² statistic to calculate the burst value of w in the time period t, with the following formula:

B_{w,t} = \frac{(A + B + C + D) (A D - B C)^{2}}{(A + B) (C + D) (A + C) (B + D)}

wherein A represents the number of microblogs containing the candidate hot word w in the time period t; B represents the average number of microblogs containing w in the k historical time windows; C represents the number of microblogs not containing w in the time period t; D represents the average number of microblogs not containing w in the k historical time windows;

step B3: calculating the vitality value of each candidate hot word by combining its nutritional value and its burst value; the normalized vitality value life_{w,t} of the candidate hot word w is calculated as follows:

life_{w,t} = \frac{B_{w,t} * Nutr_{w,t}}{\max_{w' \in terms} (B_{w',t} * Nutr_{w',t})}

wherein terms represents the candidate hot word set, and w' represents an element of the candidate hot word set terms;

step B4: according to the vital force values of the candidate hot words, sorting the candidate hot words in the candidate hot word set, screening L candidate hot words with the top sorting as hot words, and forming a hot word set according to the L candidate hot words;

the microblog hotword and hot topic mining system corresponding to the method comprises the following steps:

2. The microblog hotword and hot topic mining method according to claim 1, wherein in the step C, the correlation c_{z,k} of the hot word z and the hot word k in a given time period t is defined as:

c_{z,k} = \log \frac{r_{z,k} / (R_{k} - r_{z,k})}{(n_{z} - r_{z,k}) / (N - n_{z} - R_{k} + r_{z,k})} \times \left| \frac{r_{z,k}}{R_{k}} - \frac{n_{z} - r_{z,k}}{N - R_{k}} \right|

wherein r_{z,k} represents the number of microblogs containing both the hot word z and the hot word k, n_z represents the number of microblogs containing the hot word z, R_k represents the number of microblogs containing the hot word k, and N represents the number of all microblogs in the time period t, i.e. N = |tw^t|;

The hot word co-occurrence network is defined as G(V, E, W), where V = {v_1, v_2, ..., v_m} is the node set representing the hot word set obtained in the step B, and m represents the number of nodes; E represents the set of edges between nodes: for any two nodes v_i, v_j ∈ {v_1, v_2, ..., v_m}, if the words represented by the two nodes have a co-occurrence relationship, an edge e_{i,j} ∈ E is constructed between the two nodes; W represents the mapping from the edge set E to the set of real numbers R: if there is an edge e_{i,j} ∈ E between v_i and v_j, the edge weight is the similarity sim(i, j) between the i-th hot word and the j-th hot word, defined as:

sim(i, j) = c_{i,j} / \sqrt{\sum_{i,j \in V, i \neq j} c_{i,j}^{2}} .

3. the microblog hotword and hot topic mining method according to claim 2, wherein in the step D, each hotword in the hotword set, that is, each node has a tag membership set, and the tag membership set of the node is updated in each iteration until the algorithm converges, specifically comprising the following steps:

step D2: randomly acquiring a node v without updated labels, traversing neighbor nodes of the node v, updating the membership degree of each label in the label set of the node v according to the label set of the neighbor nodes, and performing label membership degree normalization on the node v;

step D3: repeating iteration until an iteration termination condition is met;

4. The method for mining microblog hotwords and hot topics according to claim 3, wherein in the step D1, the method for initializing the tag comprises the following steps: each node is assigned a unique tag number and is respectively subordinate to the tag number with the degree of membership of 1.0, and the unique tag numbers are collected as unique tags.

5. The microblog hotword and hot topic mining method according to claim 4, wherein in the step D2, the updating rule of the label membership degree is as follows: randomly selecting a node v whose labels have not yet been updated, obtaining the neighbor node set Nb(v) of the node, and from it the label set labels owned by the neighbor nodes, wherein in the h-th iteration, the membership degree with which the node v belongs to label number lb_i ∈ labels is:

b_{h}(lb_{i}, v) = \frac{\sum_{u \in Nb(v)} sim(u, v) * b_{h-1}(lb_{i}, u)}{\sum_{u \in Nb(v)} sim(u, v)}

where sim(u, v) represents the similarity between node u and node v, and the denominator \sum_{u \in Nb(v)} sim(u, v) normalizes the label membership degrees, ensuring that the label membership degrees of the node v sum to 1.

6. The microblog hotword and hot topic mining method according to claim 5, wherein in the step D3, iteration termination conditions are as follows:

wherein r_h is defined as:

when m_h == m_{h-1}, the iteration ends.