CN103678670A

CN103678670A - Micro-blog hot word and hot topic mining system and method

Info

Publication number: CN103678670A
Application number: CN201310725400.4A
Authority: CN
Inventors: 陈羽中; 郭文忠; 陈国龙; 方明月
Original assignee: Fuzhou University
Current assignee: Fuzhou University
Priority date: 2013-12-25
Filing date: 2013-12-25
Publication date: 2014-03-26
Anticipated expiration: 2033-12-25
Also published as: CN103678670B

Abstract

The invention relates to the technical field of social networks, and in particular to a micro-blog hot word and hot topic mining system and method. The method comprises the following steps: content data released in a micro-blog are preprocessed to obtain a candidate hot word set; the vitality of each candidate hot word is computed from its frequency of occurrence and burstiness at the current moment and in a given historical time window, and the candidate hot words are screened to form a hot word set; hot word correlations are computed over the hot word set, and a hot word co-occurrence network is constructed; and the hot word set is partitioned over this network by a hot word clustering algorithm based on multi-label propagation to obtain a hot topic set. The system and method achieve efficient mining of micro-blog hot words and hot topics, improving mining precision and processing efficiency.

Description

A micro-blog hot word and hot topic mining system and method

Technical field

The present invention relates to the technical field of social networks, and in particular to a micro-blog hot word and hot topic mining system and method.

Background art

With the rise of microblogging, public participation keeps growing: users can post what they see and hear from a computer or mobile phone at any time and place, and share it instantly. Microblogging has become an Internet fashion and is also an important place where hot topics emerge and are discussed. A hot topic is a topic that, within a period of time, appears frequently online and is widely followed and discussed. With micro-blog information growing exponentially, how to handle this mass of information effectively and extract hot topics has become a pressing problem.

For hot topic detection, the traditional approach is to cluster texts. However, this does not help users identify hot topics intuitively, and because microblogs are short texts with sparse data and unbalanced distribution, such methods perform poorly at hot topic discovery. The mainstream approach is therefore to extract hot words and cluster them in order to discover hot topics.

Classical methods for measuring word importance and extracting hot words include TFIDF and TFPDF. The main idea of TFIDF is that raw frequency alone cannot fully represent text features: words such as "Yes" and "refreshing horse" occur frequently yet have almost no ability to characterize a text. If a word occurs very frequently in one text but rarely in other texts, it better reflects the features of that text. However, this method is not suitable for weighting words in microblogs. Microblogs are short, so a single microblog rarely contains repeated words; and once a hot topic emerges on a microblog platform, it triggers widespread reposting and discussion, so that a large number of microblogs contain the same keywords. Extracting keywords with TFIDF would therefore lose important words to some extent. Scholars have accordingly proposed TFPDF, which gives higher weight to words that appear in most documents in order to extract hot vocabulary. This helps extract key words related to hot topics, but it also extracts words that occur frequently yet do not characterize a topic. Hot vocabulary refers to words whose frequency surges within a period of time; neither of the two methods considers how words are distributed over time, which hinders hot word extraction.

For hot word clustering, existing methods include: 1) the Bisecting K-means clustering algorithm, which is insensitive to the initial clusters; 2) building a word similarity matrix and clustering with the Affinity Propagation algorithm without specifying the number of clusters, whose time complexity approaches a bound that is not preserved in this text; 3) density-based clustering algorithms such as DBSCAN; 4) hierarchical clustering algorithms, and so on.

For hot spot discovery over massive microblog data, existing hot word clustering methods have two main problems. First, the word sets of different topics in the clustering result are not allowed to intersect. This does not match reality and easily causes some topics to go undetected or to be poorly recognizable. For example, with the two topics "college fees" and "college ranking list", the word "college" can belong to at most one topic, and whichever topic loses this keyword becomes hard to recognize as the original topic. Second, traditional clustering algorithms have high time complexity and can hardly meet the requirements of clustering massive microblog data.

In summary, fairly mature techniques exist for analyzing the influence of individual users in social networks, but methods for analyzing influence at the community level are comparatively few, and a comprehensive assessment of the influence of each community in a social network is lacking. For large-scale social networks, existing methods can hardly meet requirements in either analysis quality or efficiency.

Summary of the invention

The object of the present invention is to overcome the deficiencies of the prior art by providing a micro-blog hot word and hot topic mining system and method that improve the accuracy and processing efficiency of microblog hot spot discovery.

To achieve this object, the technical solution of the present invention is a micro-blog hot word and hot topic mining system comprising a preprocessing module, a hot word screening module, a hot word co-occurrence network construction module and a hot word clustering module;

the preprocessing module is configured to preprocess content data published in a social network, obtain candidate hot words, and build a candidate hot word set from them;

the hot word screening module is configured to compute the vitality of each candidate hot word in the candidate hot word set from its frequency of occurrence and burstiness at the current time and in a given historical time window, screen out hot words, and build a hot word set from them;

the hot word co-occurrence network construction module is configured to compute the correlation between the hot words in the hot word set and construct a hot word co-occurrence network from it;

the hot word clustering module is configured to partition the hot word set over the hot word co-occurrence network using a hot word clustering algorithm based on multi-label propagation, obtaining a hot topic set.

The present invention also provides a micro-blog hot word and hot topic mining method comprising the following steps:

Step A: preprocess the content data published in a social network, obtain candidate hot words, and build a candidate hot word set from them;

Step B: compute the vitality of each candidate hot word in the candidate hot word set from its frequency of occurrence and burstiness at the current time and in a given historical time window, screen out hot words, and build a hot word set from them;

Step C: compute the correlation between the hot words in the hot word set, and construct a hot word co-occurrence network from it;

Step D: partition the hot word set over the hot word co-occurrence network using a hot word clustering algorithm based on multi-label propagation, obtaining a hot topic set.

Further, in step B, screening hot words and building the hot word set specifically comprises the following steps:

Step B1: compute the nutrition value of each candidate hot word in time period $t$. The nutrition value $nutr_{w,t}$ of candidate hot word $w$ is the sum, over every microblog in the microblog set $TW^t$ of time period $t$, of that microblog's contribution to the nutrition value of $w$:

$$nutr_{w,t} = \sum_{j \in TW^t} contr_{w,j}$$

where $contr_{w,j}$ is the contribution of the $j$-th microblog in time period $t$ to the nutrition value of candidate hot word $w$, with $j \in TW^t$. The formula for $contr_{w,j}$ is given as an image in the source and is not reproduced here; it is expressed in terms of two quantities: the number of times candidate hot word $w$ appears in the $j$-th microblog, and the maximum term frequency in the $j$-th microblog;
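As a non-limiting illustration, step B1 can be sketched in Python as follows. The per-microblog contribution used here, `0.5 + 0.5 * tf / tf_max` (augmented normalized term frequency), is an assumption made for the sketch, since the contribution formula itself is not preserved in this text; the function names `contribution` and `nutrition` are likewise hypothetical. Only the summation over the microblog set follows directly from the description.

```python
from collections import Counter


def contribution(tokens, word):
    """Contribution of one microblog (a list of tokens) to the nutrition of `word`.

    ASSUMPTION: augmented normalized term frequency, 0.5 + 0.5 * tf / tf_max.
    The source formula is not preserved; only its two inputs (the count of
    `word` in the microblog and the maximum term frequency) are known.
    """
    counts = Counter(tokens)
    if counts[word] == 0:
        return 0.0  # a microblog that lacks the word contributes nothing
    return 0.5 + 0.5 * counts[word] / max(counts.values())


def nutrition(microblogs, word):
    """nutr_{w,t}: sum of the contributions of every microblog in TW^t."""
    return sum(contribution(tokens, word) for tokens in microblogs)


microblogs = [["a", "b", "a"], ["b", "c"], ["c"]]
print(nutrition(microblogs, "b"))
```

Because microblogs rarely repeat a word, the ratio `tf / tf_max` is usually 1 whenever the word is present, so under this assumption the nutrition value behaves much like a count of microblogs containing the word.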

Step B2: use the burst value of candidate hot word $w$ to describe how sharply its term frequency changes between the current time period and the historical time periods. The burst value $b_{w,t}$ of candidate hot word $w$ is computed as follows: take the $k$ historical time windows preceding time period $t$, each of the same size as time period $t$; then, based on a discrete event model with a binomial distribution, count the number of microblogs containing candidate hot word $w$ in time period $t$ and in the $k$ historical time windows preceding $t$, and apply a statistical formula (given as an image in the source; not reproduced here) to compute the burst value of candidate hot word $w$ in time period $t$.

In that formula, $a$ denotes the number of microblogs in time period $t$ that contain candidate hot word $w$; $b$ denotes the average number of microblogs in the $k$ historical time windows that contain candidate hot word $w$; $c$ denotes the number of microblogs in time period $t$ that do not contain candidate hot word $w$; $d$ denotes the average number of microblogs in the $k$ historical time windows that do not contain candidate hot word $w$;
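As a non-limiting illustration of step B2, the sketch below computes a burst value from the four counts $a$, $b$, $c$ and $d$. It assumes the statistic is the standard Pearson chi-square for a 2×2 contingency table; the source formula is not preserved, so this choice and the function name `burst_value` are assumptions.

```python
def burst_value(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]].

    a: microblogs in period t containing w
    b: average microblogs per historical window containing w
    c: microblogs in period t not containing w
    d: average microblogs per historical window not containing w

    ASSUMPTION: the standard formula n * (ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d)).
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table: no evidence of change
    return n * (a * d - b * c) ** 2 / denom


# Same share of microblogs mention w now and historically: no burst.
print(burst_value(10, 10, 90, 90))
# Share rises from 10% to 50%: large burst value.
print(burst_value(50, 10, 50, 90))
```

Note that this statistic is symmetric: a sharp drop in frequency scores as high as a sharp rise. An implementation would probably also check that $a/(a+c) > b/(b+d)$ before treating a word as bursting; whether the original method does so is not stated in this text.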

Step B3: combine the nutrition value and burst value of each candidate hot word to compute its vitality value. The normalized vitality value $life_{w,t}$ of candidate hot word $w$ is computed by a formula that is given as an image in the source and is not reproduced here; in it, $terms$ denotes the candidate hot word set and $w'$ denotes an element of $terms$;

Step B4: sort the candidate hot words in the candidate hot word set by vitality value, select the L top-ranked candidate hot words as hot words, and form the hot word set from them.
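As a non-limiting illustration of steps B3 and B4, the sketch below combines nutrition and burst values into a normalized vitality and keeps the top-ranked words. The combination rule (product of the two values, divided by the sum of that product over all candidate words) is an assumption; the source only states that the two values are combined and normalized over the candidate set. The names `vitality` and `top_hot_words` are hypothetical.

```python
import heapq


def vitality(nutr, burst):
    """Normalized vitality per candidate word.

    ASSUMPTION: life_w = nutr_w * burst_w / sum over w' of (nutr_w' * burst_w').
    `nutr` and `burst` are dicts keyed by candidate word.
    """
    raw = {w: nutr[w] * burst[w] for w in nutr}
    total = sum(raw.values())
    if total == 0:
        return {w: 0.0 for w in raw}
    return {w: value / total for w, value in raw.items()}


def top_hot_words(life, L):
    """Step B4: the L candidate words with the highest vitality, best first."""
    return heapq.nlargest(L, life, key=life.get)


life = vitality({"a": 2, "b": 1, "c": 1}, {"a": 3, "b": 1, "c": 0})
print(top_hot_words(life, 2))
```

`heapq.nlargest` avoids sorting the whole candidate set when L is small relative to the number of candidates; a full sort followed by slicing gives the same result.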

Further, in step C, the correlation $c_{z,k}$ between hot word $z$ and hot word $k$ within a given time period $t$ is defined by a formula that is given as an image in the source and is not reproduced here. In it, $r_{z,k}$ denotes the number of microblogs containing both hot word $z$ and hot word $k$, $n_z$ denotes the number of microblogs containing hot word $z$, $n_k$ denotes the number of microblogs containing hot word $k$, and $n$ denotes the total number of microblogs in time period $t$, $n = |TW^t|$;

The hot word co-occurrence network is defined as $G(V, E, W)$, where $V = \{v_1, v_2, \dots, v_m\}$ is the node set, representing the hot word set obtained in step B, and $m$ is the number of nodes; $E$ is the set of edges between nodes: for any two nodes $v_i, v_j \in V$, if the words represented by the two nodes co-occur, an edge is built between these two vertices; $W$ is a mapping from the edge set $E$ to the set of real numbers $R$: if there is an edge between $v_i$ and $v_j$, its weight is the similarity $sim(i, j)$ between the $i$-th hot word and the $j$-th hot word. The defining formula of $sim(i, j)$ is given as an image in the source and is not reproduced here.
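As a non-limiting illustration of step C, the sketch below gathers the counts $n_z$, $r_{z,k}$ and $n$ from tokenized microblogs and builds a weighted co-occurrence network as a nested dictionary. Since neither the correlation formula nor the similarity formula is preserved in this text, the edge weight uses the Jaccard coefficient $r_{z,k} / (n_z + n_k - r_{z,k})$ purely as a stand-in; the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(microblogs, hot_words):
    """Return (n_word, r_pair, n): per-word counts, pair counts, total microblogs."""
    hot = set(hot_words)
    n_word = Counter()  # n_z: microblogs containing hot word z
    r_pair = Counter()  # r_{z,k}: microblogs containing both z and k
    for tokens in microblogs:
        present = sorted(hot.intersection(tokens))
        n_word.update(present)
        r_pair.update(combinations(present, 2))  # keys are sorted pairs
    return n_word, r_pair, len(microblogs)


def build_network(n_word, r_pair):
    """Adjacency dict {word: {neighbour: weight}} with an edge per co-occurring pair.

    ASSUMPTION: Jaccard coefficient as a stand-in for the unpreserved sim(i, j).
    """
    graph = {w: {} for w in n_word}
    for (z, k), r in r_pair.items():
        weight = r / (n_word[z] + n_word[k] - r)
        graph[z][k] = weight
        graph[k][z] = weight
    return graph


microblogs = [["college", "fee"], ["college", "rank"], ["college", "fee", "x"]]
n_word, r_pair, n = cooccurrence_counts(microblogs, ["college", "fee", "rank"])
print(build_network(n_word, r_pair))
```

In this toy input, "college" is linked to both "fee" and "rank", while "fee" and "rank" never co-occur and so share no edge; this is the overlapping-keyword situation the clustering step is meant to handle.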

Further, in step D, each hot word in the hot word set, that is, each node, has a set of label membership degrees. The label membership set of each node is updated in every iteration until the algorithm converges. Step D specifically comprises the following steps:

Step D1: initialize the node labels according to the hot word co-occurrence network;

Step D2: randomly pick a node $v$ whose labels have not yet been updated, traverse the neighbor nodes of $v$, update the membership degree of each label in the label set of $v$ according to the label sets of its neighbors, and normalize the label membership degrees of $v$;

Step D3: iterate repeatedly until the iteration termination condition is met;

Step D4: classify the nodes according to the label membership sets obtained by iteration, obtaining the hot topic set.

Further, in step D1, labels are initialized as follows: each node is assigned a unique label number and belongs to that label number with an initial membership degree (value given as an image in the source; not reproduced here); the set of these unique label numbers is denoted uniqueLabels.

Further, in step D2, the update rule for label membership degrees is as follows: randomly pick a node $v$ whose labels have not yet been updated, obtain the neighbor node set $Nb(v)$ of this node, and then obtain the label set $Labels$ held by the neighbor nodes. The membership degree with which node $v$ belongs to a given label number at the $h$-th iteration is defined by a formula that is given as an image in the source and is not reproduced here. In it, $sim(u, v)$ denotes the similarity between node $u$ and node $v$, and the denominator normalizes the label membership degrees so that the label membership degrees of node $v$ sum to 1.
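As a non-limiting illustration of steps D1 and D2, the sketch below initializes one unique label per node and performs a single membership update for one node. The update used here, a similarity-weighted sum of the neighbors' membership degrees divided by the total over all labels, is an assumption consistent with the description (it uses $sim(u, v)$ and normalizes to 1) but is not the preserved formula; the initial membership degree of 1.0 is likewise assumed, and the function names are hypothetical.

```python
def init_memberships(graph):
    """Step D1: each node gets its own label. ASSUMPTION: initial degree 1.0."""
    return {v: {v: 1.0} for v in graph}


def update_membership(graph, memberships, v):
    """Step D2 for one node v: new {label: degree} from the neighbours' labels.

    ASSUMPTION: score(label) = sum over neighbours u of sim(u, v) * degree_u(label),
    then divide by the sum of all scores so the degrees of v add up to 1.
    `graph` is an adjacency dict {node: {neighbour: similarity}}.
    """
    scores = {}
    for u, sim in graph[v].items():
        for label, degree in memberships[u].items():
            scores[label] = scores.get(label, 0.0) + sim * degree
    total = sum(scores.values())
    if total == 0:
        return dict(memberships[v])  # isolated node keeps its current labels
    return {label: score / total for label, score in scores.items()}


graph = {"a": {"b": 1.0, "c": 3.0}, "b": {"a": 1.0}, "c": {"a": 3.0}}
memberships = init_memberships(graph)
memberships["a"] = update_membership(graph, memberships, "a")
print(memberships["a"])
```

After this update node "a" holds two labels at once, weighted by how strongly it is tied to each neighbor; this is what lets a hot word belong to more than one topic.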

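Further, in step D3, the iteration termination condition is defined by a formula that is given as an image in the source and is not reproduced here. It is expressed in terms of a quantity $r_h$, whose definition is likewise not reproduced; when the condition on $r_h$ holds, the iteration ends.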

Compared with the prior art, the beneficial effects of the present invention are as follows: the vitality of each candidate hot word is computed from its frequency of occurrence and burstiness at the current time and in a given historical time window, and a hot word set is screened out; hot word correlations are computed over the screened hot word set, and a hot word co-occurrence network is constructed; and the hot word set is partitioned using a hot word clustering algorithm based on multi-label propagation to obtain the hot topic set. The system and method enable efficient mining of hot topics in social networks, with improved topic detection precision and processing efficiency.

Brief description of the drawings

Fig. 1 is a schematic diagram of the module structure of the system of the present invention.

Fig. 2 is a flow chart of the method of the present invention.

Fig. 3 is an implementation flow chart of micro-blog hot word clustering in the method of the present invention.

Detailed description of the embodiments

The present invention is further described below with reference to the drawings and specific embodiments.

Fig. 1 is a schematic diagram of the module structure of the micro-blog hot word and hot topic mining system of the present invention. As shown in Fig. 1, the system comprises a preprocessing module 100, a hot word screening module 200, a hot word co-occurrence network construction module 300 and a hot word clustering module 400.

The preprocessing module 100 preprocesses the content data published in a social network, obtains candidate hot words, and builds a candidate hot word set from them. The hot word screening module 200 computes the vitality of each candidate hot word in the candidate hot word set from its frequency of occurrence and burstiness at the current time and in a given historical time window, screens out hot words, and builds a hot word set from them. The hot word co-occurrence network construction module 300 computes the correlation between the hot words in the hot word set and constructs a hot word co-occurrence network from it. The hot word clustering module 400 partitions the hot word set over the hot word co-occurrence network using a hot word clustering algorithm based on multi-label propagation, obtaining a hot topic set.

Fig. 2 is a flow chart of the micro-blog hot word and hot topic mining method of the present invention. As shown in Fig. 2, the method comprises the following steps:

Step A: preprocess the content data published in a social network, obtain candidate hot words, and build a candidate hot word set from them.

Specifically, ICTCLAS, the Chinese lexical analysis system of the Chinese Academy of Sciences, may be used for word segmentation and part-of-speech tagging. Nouns and verbs, which discriminate topics comparatively well, are extracted and then further filtered with a stop word list to obtain the candidate hot word set (its notation is given as an image in the source and is not reproduced here), where $r$ denotes the number of candidate words.
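As a non-limiting illustration of step A, the sketch below filters already segmented and tagged tokens down to candidate hot words. Segmentation and tagging themselves are left to an external tool and are not shown. The sketch assumes tags follow the common convention in which noun tags start with `n` and verb tags with `v`; the stop word list and the function name `candidate_words` are hypothetical.

```python
# Hypothetical stop word list; a real deployment would load a full one.
STOP_WORDS = {"是", "的", "有"}


def candidate_words(tagged_tokens, stop_words=STOP_WORDS):
    """Keep nouns and verbs that are not stop words.

    `tagged_tokens` is a list of (word, pos_tag) pairs produced by a segmenter.
    ASSUMPTION: noun tags start with "n" and verb tags start with "v".
    """
    return [
        word
        for word, pos in tagged_tokens
        if pos[:1] in ("n", "v") and word not in stop_words
    ]


tagged = [("高校", "n"), ("是", "v"), ("排行榜", "n"), ("的", "u"), ("公布", "v")]
print(candidate_words(tagged))
```

The candidate hot word set for a time period would then be the union of the words returned for every microblog in that period.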

Step B: compute the vitality of each candidate hot word in the candidate hot word set from its frequency of occurrence and burstiness at the current time and in a given historical time window, screen out hot words, and build a hot word set from them.

In step B, screening hot words and building the hot word set specifically comprises steps B1 to B4 as set out above.

Specifically, after the vitality value of each candidate hot word has been computed, the quicksort (Quick Sort) algorithm may be used to sort the candidate hot words by vitality value from high to low, and, for a given threshold M, the M candidate hot words with the highest vitality values are selected as the hot words of time period $t$.

Step C: compute the correlation between the hot words in the hot word set, and construct a hot word co-occurrence network from it.

In step C, the correlation $c_{z,k}$ between hot word $z$ and hot word $k$ within a given time period $t$ and the hot word co-occurrence network $G(V, E, W)$ are defined as set out above: if the words represented by two nodes co-occur, an edge is built between these two vertices, and $W$ is a mapping from the edge set $E$ to the set of real numbers $R$ that assigns a weight to each edge between $v_i$ and $v_j$.

The hot word clustering algorithm based on multi-label propagation rests on the following observations. Word co-occurrence networks built from human language or text documents exhibit high clustering and short path lengths. A topic can therefore be regarded as a set of points (words) that are densely connected internally and sparsely connected externally, which matches the definition of a community in complex networks. Moreover, topics may share keywords, so the topic discovery problem can be converted into the problem of overlapping community detection on the word co-occurrence network. Multi-label means that a node is allowed to carry several community labels and thus belong to several hot word communities, so that a hot word may belong to several topics. Each label carries a membership degree. During label propagation, the labels and membership degrees of the nodes are updated, the label set of each node is pruned according to a set threshold, and finally each node is assigned to one or more communities (hot topics) according to the labels it holds.

In step D, each hot word in the hot word set, that is, each node, has a set of label membership degrees, and the label membership set of each node is updated in every iteration until the algorithm converges. Fig. 3 is an implementation flow chart of step D in the method of the present invention, which specifically comprises the following steps:

Step D1: initialize the labels of the nodes (hot words) according to the hot word co-occurrence network;

In step D1, labels are initialized as follows: each node is assigned a unique label number and belongs to that label number with an initial membership degree, as set out above.

In step D2, the update rule for label membership degrees is as follows: randomly pick a node $v$ whose labels have not yet been updated, obtain the neighbor node set $Nb(v)$ of this node, and then obtain the label set $Labels$ held by the neighbor nodes; the membership degree with which node $v$ belongs to a given label number at the $h$-th iteration is computed as set out above.

Step D3: filter the label set of node $v$ according to a given threshold $p$, then normalize the membership degrees of the retained labels again;

Specifically, step D3 requires a given parameter $p$. During iteration, after the label membership degrees have been updated, the label set of the node is filtered and only part of the labels are retained, to prevent the label set of a node from growing too large. The value of $p$ represents the maximum number of labels a node is allowed to hold. The filtering rule is: delete from the node's label membership set every element whose membership degree is below $1/p$. The label set obtained after filtering is normalized again so that the label membership degrees of the node sum to 1.
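As a non-limiting illustration of the filtering rule in step D3, the sketch below drops labels whose membership degree is below $1/p$ and renormalizes the remainder. The fallback for the case where every label falls below the threshold (keep the single strongest label) is an assumption, since the text does not say what happens then; the function name `filter_labels` is hypothetical.

```python
def filter_labels(membership, p):
    """Drop labels with degree < 1/p, then renormalize so degrees sum to 1.

    `membership` maps label -> membership degree for one node.
    ASSUMPTION: if no label reaches 1/p, the strongest label is kept alone.
    """
    threshold = 1.0 / p
    kept = {label: d for label, d in membership.items() if d >= threshold}
    if not kept:
        best = max(membership, key=membership.get)
        kept = {best: membership[best]}
    total = sum(kept.values())
    return {label: d / total for label, d in kept.items()}


print(filter_labels({"x": 0.6, "y": 0.3, "z": 0.1}, 4))
```

Because the retained degrees are each at least $1/p$ and sum to at most 1 before renormalization, at most $p$ labels can survive, which is why $p$ acts as the maximum number of labels per node.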

Step D4: iterate repeatedly until the iteration termination condition is met;

In step D4, the iteration termination condition is as follows: if, over two consecutive iterations, the number of nodes recorded for each label no longer changes, the iteration ends. That is, the condition is evaluated when the label sets produced by the two iterations are the same. The formal condition and the definition of the quantity $r_h$ on which it relies are given as images in the source and are not reproduced here; when the condition holds, the iteration ends.
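As a non-limiting illustration of the termination test described in words for step D4, the sketch below records how many nodes hold each label and reports convergence when that record is unchanged between two consecutive iterations. The formal condition on $r_h$ is not preserved in this text, so this follows only the verbal description; the function names are hypothetical.

```python
from collections import Counter


def label_sizes(memberships):
    """Number of nodes holding each label, given {node: {label: degree}}."""
    sizes = Counter()
    for labels in memberships.values():
        sizes.update(labels.keys())
    return sizes


def has_converged(previous_sizes, memberships):
    """True when per-label node counts match those of the previous iteration."""
    return previous_sizes == label_sizes(memberships)


before = {"a": {"x": 1.0}, "b": {"x": 0.5, "y": 0.5}}
after = {"a": {"x": 1.0}, "b": {"x": 0.7, "y": 0.3}}
print(has_converged(label_sizes(before), after))
```

Membership degrees may still shift slightly between iterations; under this test the loop stops once the assignment of labels to nodes is stable in size, not when the degrees themselves stop changing.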

Step D5: classify the nodes (hot words) according to the label membership sets obtained by iteration, obtaining the hot topic set.

Specifically, after the iteration ends, the label set of each node is examined and the nodes (hot words) are divided into the corresponding categories (communities). For a given threshold M, each category (community) only needs to keep its M hot words with the highest vitality values to express the corresponding hot topic. M defaults to 10.
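As a non-limiting illustration of step D5, the sketch below groups hot words by the labels they hold after iteration and keeps, for each community, the M words with the highest vitality values. The data layout (a dictionary of label membership sets and a dictionary of vitality values) and the function name `topics_from_memberships` are assumptions of the sketch.

```python
def topics_from_memberships(memberships, life, M=10):
    """Group words by label and keep the M highest-vitality words per label.

    memberships: {word: {label: degree}} after the iteration has ended.
    life: {word: vitality value}.
    A word holding several labels appears in several topics.
    """
    communities = {}
    for word, labels in memberships.items():
        for label in labels:
            communities.setdefault(label, []).append(word)
    return {
        label: sorted(words, key=lambda w: life[w], reverse=True)[:M]
        for label, words in communities.items()
    }


memberships = {
    "college": {"t1": 0.5, "t2": 0.5},
    "fee": {"t1": 1.0},
    "rank": {"t2": 1.0},
}
life = {"college": 0.5, "fee": 0.3, "rank": 0.2}
print(topics_from_memberships(memberships, life))
```

Here the word "college" appears in both topics, which is the overlap that a hard partition of the hot word set could not express.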

The micro-blog hot topic detection system and method of the present invention take into account both the frequency and the burstiness of words, design a novel word vitality computation model for hot word extraction, then build a word co-occurrence network and cluster hot words by multi-label propagation with near-linear time complexity to obtain hot topics. In summary, the system and method can effectively extract hot words and hot topics, and considerably improve the precision and time efficiency of hot topic detection.

The above are preferred embodiments of the present invention. Any change made according to the technical solution of the present invention whose resulting function does not go beyond the scope of that technical solution falls within the protection scope of the present invention.

Claims

1. A micro-blog hot word and hot topic mining system, characterized in that the system comprises:

2. A micro-blog hot word and hot topic mining method, characterized in that the method comprises the following steps:

3. The micro-blog hot word and hot topic mining method according to claim 2, characterized in that, in step B, screening hot words and building the hot word set specifically comprises the following steps:

(The steps and formulas of this claim are given as images in the source and are not reproduced here; the surviving text defines one quantity as the maximum term frequency in the $j$-th microblog.)

4. The micro-blog hot word and hot topic mining method according to claim 2, characterized in that, in step C, the correlation $c_{z,k}$ between hot word $z$ and hot word $k$ within a given time period $t$ is defined by a formula that is given as an image in the source and is not reproduced here;

the hot word co-occurrence network is defined as $G(V, E, W)$, where $V$ is the node set, representing the hot word set obtained in step B, and $m$ is the number of nodes; $E$ is the set of edges between nodes, defined for any two nodes (the remainder of this definition, including the edge weights, is given as images in the source and is not reproduced here).

5. The micro-blog hot word and hot topic mining method according to claim 4, characterized in that, in step D, each hot word in the hot word set, that is, each node, has a set of label membership degrees, and the label membership set of each node is updated in every iteration until the algorithm converges, which specifically comprises the following steps:

Step D3: iterate repeatedly until the iteration termination condition is met;

6. The micro-blog hot word and hot topic mining method according to claim 5, characterized in that, in step D1, labels are initialized as follows: each node is assigned a unique label number and belongs to that label number with an initial membership degree (value given as an image in the source; not reproduced here).

7. The micro-blog hot word and hot topic mining method according to claim 6, characterized in that, in step D2, the update rule for label membership degrees is as follows: randomly pick a node $v$ whose labels have not yet been updated, obtain the neighbor node set $Nb(v)$ of this node, and then obtain the label set $Labels$ held by the neighbor nodes; the membership degree with which node $v$ belongs to a given label number at the $h$-th iteration is defined by a formula that is given as an image in the source and is not reproduced here.

8. The micro-blog hot word and hot topic mining method according to claim 7, characterized in that, in step D3, the iteration termination condition is defined in terms of a quantity $r_h$; the condition and the definition of $r_h$ are given as images in the source and are not reproduced here. When the condition holds, the iteration ends.