CN103678670A - Micro-blog hot word and hot topic mining system and method - Google Patents


Info

Publication number
CN103678670A
Authority
CN
China
Prior art keywords
hot
hot word
candidate
word
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310725400.4A
Other languages
Chinese (zh)
Other versions
CN103678670B (en)
Inventor
陈羽中
郭文忠
陈国龙
方明月
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University
Priority to CN201310725400.4A
Publication of CN103678670A
Application granted
Publication of CN103678670B
Legal status: Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Abstract

The invention relates to the technical field of social networks, and in particular to a micro-blog hot word and hot topic mining system and method. The method comprises the following steps: content data published on a micro-blog is preprocessed to obtain a candidate hot word set; the vitality of each candidate hot word is computed from its frequency of occurrence and burstiness in the current time period and in a given historical time window, and the candidates are screened to form a hot word set; the correlation between the hot words in that set is computed and a hot word co-occurrence network is constructed; the hot word set is then partitioned over this network by a hot word clustering algorithm based on multi-label propagation, yielding a hot topic set. The system and method achieve efficient mining of micro-blog hot words and hot topics and improve mining precision and processing efficiency.

Description

A micro-blog hot word and hot topic mining system and method
Technical field
The present invention relates to the technical field of social networks, and in particular to a micro-blog hot word and hot topic mining system and method.
Background art
With the rise of micro-blogs, public participation keeps growing: users can post what they see and hear at any time and place from a computer or mobile phone and share it instantly. Micro-blogging has become an Internet fashion and is also an important place where hot topics emerge and are discussed. A hot topic is a topic that appears frequently on the network within a period of time and is widely followed and discussed. With micro-blog information growing exponentially, how to manage this mass of information effectively and extract hot topics from it has become a problem in urgent need of a solution.
The traditional approach to hot topic detection is to cluster the texts. This approach does not help users identify hot topics intuitively, and because micro-blog posts are short texts, the data are sparse and unevenly distributed, so methods of this kind perform poorly at discovering hot topics. The mainstream approach is therefore to discover hot topics by extracting hot words and clustering them.
Classical methods for measuring word importance and extracting hot words include TF-IDF and TF-PDF. The main idea of TF-IDF is that the frequency of a word alone cannot fully represent the features of a text: words such as "是" ("is") or "神马" (Internet slang for "what") occur frequently but have almost no power to explain the text. If a word occurs very often in one text and rarely in other texts, it characterizes that text much better. This method, however, is not suited to weighting words in micro-blogs. Posts are short, so a word is seldom repeated within a single post; and once a hot topic appears, it is widely forwarded and discussed, so a large number of posts contain the same keywords. Extracting keywords with TF-IDF would therefore lose important vocabulary to some extent. For this reason, some researchers proposed TF-PDF, which gives higher weight to words that appear in most documents in order to extract hot vocabulary. TF-PDF helps extract the key vocabulary related to a hot topic, but it also extracts words that occur frequently without explaining any topic. A hot word is a word whose frequency rises sharply within a period of time; neither of the two methods considers how a word is distributed over time, which hinders hot word extraction.
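The weakness of TF-IDF on widely forwarded posts can be seen in a few lines. The following is a minimal sketch, assuming the common `tf * log(N / df)` variant of TF-IDF; the function name `tf_idf` and the toy posts are illustrative and do not come from the patent.

```python
import math

def tf_idf(term, doc, docs):
    """TF-IDF of `term` in `doc`, using the common tf * log(N / df) variant."""
    tf = doc.count(term) / len(doc)            # term frequency inside this post
    df = sum(1 for d in docs if term in d)     # number of posts containing the term
    return tf * math.log(len(docs) / df) if df else 0.0

# Three of four posts discuss the same event; one is unrelated.
posts = [
    ["earthquake", "news"],
    ["earthquake", "rescue"],
    ["earthquake", "help"],
    ["lunch", "photo"],
]
hot_score = tf_idf("earthquake", posts[0], posts)   # word shared by most posts
rare_score = tf_idf("lunch", posts[3], posts)       # word found in one post only
```

In this toy corpus the widely shared word scores lower than the one-off word, which is the loss of important vocabulary the paragraph above describes.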
For hot word clustering, existing methods include: 1) the bisecting K-means clustering algorithm, which is insensitive to the initial cluster centers; 2) building a word similarity matrix and clustering with the Affinity Propagation algorithm without specifying the number of clusters, at a time complexity approaching [expression missing in the extracted text]; 3) density-based clustering algorithms such as DBSCAN; 4) hierarchical clustering algorithms, and so on.
For hot spot discovery over massive micro-blog data, the main problems of existing hot word clustering methods are as follows. First, the clustering result does not allow the word sets of different topics to intersect. This does not match reality and easily causes some topics to go undetected or to be hard to recognize. For example, in the two topics "college tuition" and "college rankings", the word "college" can belong to at most one topic, yet whichever topic loses the keyword "college" becomes difficult to recognize as the original topic. In addition, traditional clustering algorithms have high time complexity and can hardly meet the requirements of clustering massive micro-blog data.
In summary, relatively mature techniques and methods exist for analyzing the influence of individual users in social networks, but methods for analyzing influence at the community level are still comparatively few, and a comprehensive assessment of the influence of each community in a social network is lacking. For large-scale social networks, existing methods can hardly meet the requirements in either analysis quality or efficiency.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art by providing a micro-blog hot word and hot topic mining system and method that improve the accuracy and processing efficiency of micro-blog hot spot discovery.
To achieve this object, the technical solution of the present invention is a micro-blog hot word and hot topic mining system comprising a preprocessing module, a hot word screening module, a hot word co-occurrence network construction module and a hot word clustering module:
the preprocessing module preprocesses the content data published in the social network, obtains candidate hot words and builds a candidate hot word set from them;
the hot word screening module computes the vitality of each candidate hot word from its frequency of occurrence and burstiness in the current time period and in a given historical time window, screens out the hot words and builds a hot word set from them;
the hot word co-occurrence network construction module computes the correlation between the hot words in the hot word set and constructs a hot word co-occurrence network from it;
the hot word clustering module partitions the hot word set over the hot word co-occurrence network using a hot word clustering algorithm based on multi-label propagation, obtaining a hot topic set.
The present invention also provides a micro-blog hot word and hot topic mining method comprising the following steps:
Step A: preprocess the content data published in the social network, obtain candidate hot words and build a candidate hot word set from them;
Step B: compute the vitality of each candidate hot word from its frequency of occurrence and burstiness in the current time period and in a given historical time window, screen out the hot words and build a hot word set from them;
Step C: compute the correlation between the hot words in the hot word set and construct a hot word co-occurrence network from it;
Step D: partition the hot word set over the hot word co-occurrence network using a hot word clustering algorithm based on multi-label propagation, obtaining a hot topic set.
Further, in step B, the process of screening hot words and building the hot word set specifically comprises the following steps:
Step B1: compute the nutrition value of each candidate hot word in time period t. The nutrition value nutr_{w,t} of candidate hot word w is the sum, over every post in the post set tw_t of time period t, of that post's contribution to the nutrition value of w:
$$nutr_{w,t} = \sum_{j \in tw_t} contr_{w,j}$$
where contr_{w,j} is the contribution of the j-th post in time period t to the nutrition value of candidate hot word w, j ∈ tw_t, computed as:
[formula image not reproduced]
where [symbol image] denotes the number of times candidate hot word w occurs in the j-th post and [symbol image] denotes the maximum term frequency in the j-th post;
Step B2: use the burst value of candidate hot word w to describe how sharply its frequency changes between the current time period and the historical time periods. The burst value b_{w,t} of candidate hot word w is computed as follows: take the k historical time windows preceding time period t, each of the same size as t; then, based on a discrete event model with a binomial distribution, count the posts containing w in time period t and in the k preceding historical windows respectively, and use the [symbol image] statistic to compute the burst value of w in time period t:
[formula image not reproduced]
where a is the number of posts in time period t that contain candidate hot word w; b is the average number of posts containing w over the k historical windows; c is the number of posts in time period t that do not contain w; and d is the average number of posts not containing w over the k historical windows;
Step B3: combine the nutrition value and the burst value of each candidate hot word to compute its vitality value. The normalized vitality value life_{w,t} of candidate hot word w is computed as:
[formula image not reproduced]
where terms denotes the candidate hot word set and w' denotes an element of terms;
Step B4: sort the candidate hot words in the candidate hot word set by vitality value, select the L top-ranked candidates as hot words and form the hot word set from them.
Further, in step C, the correlation c_{z,k} between hot word z and hot word k within a given time period t is defined as:
[formula image not reproduced]
where r_{z,k} is the number of posts containing both hot word z and hot word k, n_z is the number of posts containing z, n_k is the number of posts containing k, and n is the total number of posts in time period t, n = |tw_t|;
The hot word co-occurrence network is defined as G(V, E, W), where the node set V [symbol image] represents the hot word set obtained in step B and m is the number of nodes; E is the set of edges between nodes: for any two nodes [symbol image], if the words they represent co-occur, an edge [symbol image] is built between the two vertices; W is a mapping from the edge set E to the real numbers R: if an edge [symbol image] exists between v_i and v_j, its weight is the similarity sim(i, j) between the i-th and the j-th hot word, defined as: [formula missing from the extracted text]
Further, in step D, each hot word in the hot word set corresponds to a node, and each node holds a set of label membership degrees; the membership set of a node is updated in every iteration until the algorithm converges. This specifically comprises the following steps:
Step D1: initialize the node labels according to the hot word co-occurrence network;
Step D2: randomly pick a node v whose labels have not yet been updated, traverse the neighbors of v, update the membership degree of each label in v's label set according to the neighbors' label sets, and normalize the label membership degrees of v;
Step D3: iterate until the termination condition is met;
Step D4: classify the nodes according to the label membership sets obtained by the iteration, obtaining the hot topic set.
Further, in step D1, labels are initialized as follows: each node is assigned a unique label number and belongs to that label with membership degree [symbol image]; the set of these unique label numbers is denoted uniqueLabels.
Further, in step D2, the update rule for label membership is: randomly pick a node v whose labels have not yet been updated, obtain its neighbor set nb(v), and from it the label set labels held by the neighbors; at the h-th iteration, the membership degree of node v in label [symbol image] is:
[formula image not reproduced]
where sim(u, v) is the similarity between node u and node v, and the denominator [symbol image] normalizes the label membership degrees so that those of node v sum to 1.
Further, in step D3, the termination condition is:
[formula image not reproduced]
where r_h is defined as:
[formula image not reproduced]
When [symbol image] holds, the iteration ends.
Compared with the prior art, the beneficial effects of the present invention are as follows: the vitality of each candidate hot word is computed from its frequency of occurrence and burstiness in the current time period and in a given historical time window, and a hot word set is screened out; the hot word correlations are computed from this set and a hot word co-occurrence network is constructed; the hot word set is then partitioned by a multi-label propagation clustering algorithm to obtain the hot topic set. The system and method achieve efficient mining of hot topics in social networks and improve both topic detection precision and processing efficiency.
Brief description of the drawings
Fig. 1 is a schematic diagram of the module structure of the system of the present invention.
Fig. 2 is a flow chart of the method of the present invention.
Fig. 3 is a flow chart of the micro-blog hot word clustering in the method of the present invention.
Detailed description of the embodiments
The present invention is further described below with reference to the drawings and specific embodiments.
Fig. 1 is a schematic diagram of the module structure of the micro-blog hot word and hot topic mining system of the present invention. As shown in Fig. 1, the system comprises a preprocessing module 100, a hot word screening module 200, a hot word co-occurrence network construction module 300 and a hot word clustering module 400.
The preprocessing module 100 preprocesses the content data published in the social network, obtains candidate hot words and builds a candidate hot word set from them. The hot word screening module 200 computes the vitality of each candidate hot word from its frequency of occurrence and burstiness in the current time period and in a given historical time window, screens out the hot words and builds a hot word set from them. The hot word co-occurrence network construction module 300 computes the correlation between the hot words in the hot word set and constructs a hot word co-occurrence network from it. The hot word clustering module 400 partitions the hot word set over the hot word co-occurrence network using a hot word clustering algorithm based on multi-label propagation, obtaining a hot topic set.
Fig. 2 is a flow chart of the micro-blog hot word and hot topic mining method of the present invention. As shown in Fig. 2, the method comprises the following steps:
Step A: preprocess the content data published in the social network, obtain candidate hot words and build a candidate hot word set from them.
Specifically, the ICTCLAS tool of the Chinese Academy of Sciences can be used for word segmentation and part-of-speech tagging. The nouns and verbs, which best characterize a topic, are extracted and then filtered further with a stop word list, giving the candidate hot word set [symbol image], where r denotes the number of candidate words.
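The filtering part of step A can be sketched as follows. This is a minimal illustration that assumes segmentation and part-of-speech tagging have already been done (by ICTCLAS or any other tagger) and that noun and verb tags start with "n" and "v"; the function name `candidate_hot_words` and the tag convention are assumptions, not part of the patent.

```python
def candidate_hot_words(tagged_posts, stop_words):
    """Collect nouns and verbs that are not stop words.

    tagged_posts: list of posts, each a list of (word, pos_tag) pairs.
    The tag prefixes "n" (noun) and "v" (verb) are an assumed convention.
    """
    candidates = set()
    for post in tagged_posts:
        for word, tag in post:
            if tag[:1] in ("n", "v") and word not in stop_words:
                candidates.add(word)
    return candidates

# Hypothetical pre-tagged input.
tagged = [
    [("college", "n"), ("of", "u"), ("ranking", "n")],
    [("is", "v"), ("discuss", "v")],
]
result = candidate_hot_words(tagged, stop_words={"is"})
```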
Step B: compute the vitality of each candidate hot word from its frequency of occurrence and burstiness in the current time period and in a given historical time window, screen out the hot words and build a hot word set from them.
In step B, the process of screening hot words and building the hot word set specifically comprises the following steps:
Step B1: compute the nutrition value of each candidate hot word in time period t. The nutrition value nutr_{w,t} of candidate hot word w is the sum, over every post in the post set tw_t of time period t, of that post's contribution to the nutrition value of w:
$$nutr_{w,t} = \sum_{j \in tw_t} contr_{w,j}$$
where contr_{w,j} is the contribution of the j-th post in time period t to the nutrition value of candidate hot word w, j ∈ tw_t, computed as:
[formula image not reproduced]
where [symbol image] denotes the number of times candidate hot word w occurs in the j-th post and [symbol image] denotes the maximum term frequency in the j-th post;
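Step B1 can be sketched in code. The per-post contribution formula is an image that did not survive extraction, so the sketch below substitutes an augmented term frequency, `0.5 + 0.5 * tf / max_tf`, built from the two quantities the text names (the count of w in post j and the maximum term frequency in post j). That choice, and the function names, are assumptions for illustration only; only the outer sum over posts is taken from the text.

```python
from collections import Counter

def contribution(word, post_tokens):
    """Assumed stand-in for contr_{w,j}: augmented term frequency.

    Posts that do not contain the word contribute nothing.
    """
    counts = Counter(post_tokens)
    if counts[word] == 0:
        return 0.0
    return 0.5 + 0.5 * counts[word] / max(counts.values())

def nutrition(word, posts):
    """nutr_{w,t}: sum of the contributions of every post in the period."""
    return sum(contribution(word, post) for post in posts)

posts = [["a", "b", "a"], ["b", "c"], ["c"]]
value = nutrition("b", posts)  # 0.75 from the first post, 1.0 from the second
```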
Step B2: use the burst value of candidate hot word w to describe how sharply its frequency changes between the current time period and the historical time periods. The burst value b_{w,t} of candidate hot word w is computed as follows: take the k historical time windows preceding time period t, each of the same size as t; then, based on a discrete event model with a binomial distribution, count the posts containing w in time period t and in the k preceding historical windows respectively, and use the [symbol image] statistic to compute the burst value of w in time period t:
[formula image not reproduced]
where a is the number of posts in time period t that contain candidate hot word w; b is the average number of posts containing w over the k historical windows; c is the number of posts in time period t that do not contain w; and d is the average number of posts not containing w over the k historical windows;
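The four counts a, b, c, d form a 2x2 contingency table (contains w or not, current period or history). The statistic itself is an image missing from the extracted text; a natural candidate for such a table is Pearson's chi-square, and the sketch below uses it as an assumption rather than as the patent's confirmed formula. The function name `burst_value` is likewise illustrative.

```python
def burst_value(a, b, c, d):
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]] (assumed statistic).

    a: posts containing w in the current period
    b: average posts containing w per historical window
    c: posts not containing w in the current period
    d: average posts not containing w per historical window
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table: no evidence of a change
    return n * (a * d - b * c) ** 2 / denom

rising = burst_value(30, 10, 70, 90)   # word appears three times as often now
steady = burst_value(10, 10, 90, 90)   # same share of posts as in history
```

Note that a chi-square statistic is symmetric: a sharp drop in frequency scores as high as a sharp rise, so an implementation following this sketch may also want to check that a exceeds b.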
Step B3: combine the nutrition value and the burst value of each candidate hot word to compute its vitality value. The normalized vitality value life_{w,t} of candidate hot word w is computed as:
[formula image not reproduced]
where terms denotes the candidate hot word set and w' denotes an element of terms;
Step B4: sort the candidate hot words in the candidate hot word set by vitality value, select the L top-ranked candidates as hot words and form the hot word set from them.
Specifically, once the vitality value of each candidate hot word has been computed, the quicksort (Quick Sort) algorithm can be used to sort the candidates by vitality value from high to low; given a threshold M, the M candidates with the highest vitality values are selected as the hot words of time period t.
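Steps B3 and B4 can be sketched as below. The vitality formula is missing from the extracted text; the sketch assumes the product of nutrition and burst, normalized by the sum of that product over all candidates, which matches the description "normalized" and the role of the set terms but is not confirmed by the source. Python's built-in `sorted` is used in place of quicksort, which does not change the ranking.

```python
def vitality(nutr, burst):
    """Assumed life_{w,t}: nutr * burst, normalized over all candidate words."""
    raw = {w: nutr[w] * burst[w] for w in nutr}
    total = sum(raw.values())
    return {w: (v / total if total else 0.0) for w, v in raw.items()}

def top_hot_words(life, m):
    """Return the m candidates with the highest vitality, best first."""
    return sorted(life, key=life.get, reverse=True)[:m]

nutr = {"a": 2.0, "b": 1.0, "c": 1.0}
burst = {"a": 3.0, "b": 1.0, "c": 3.0}
life = vitality(nutr, burst)
hot_words = top_hot_words(life, 2)
```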
Step C: compute the correlation between the hot words in the hot word set and construct a hot word co-occurrence network from it.
In step C, the correlation c_{z,k} between hot word z and hot word k within a given time period t is defined as:
[formula image not reproduced]
where r_{z,k} is the number of posts containing both hot word z and hot word k, n_z is the number of posts containing z, n_k is the number of posts containing k, and n is the total number of posts in time period t, n = |tw_t|;
The hot word co-occurrence network is defined as G(V, E, W), where the node set V [symbol image] represents the hot word set obtained in step B and m is the number of nodes; E is the set of edges between nodes: for any two nodes [symbol image], if the words they represent co-occur, an edge [symbol image] is built between the two vertices; W is a mapping from the edge set E to the real numbers R: if an edge [symbol image] exists between v_i and v_j, its weight is the similarity sim(i, j) between the i-th and the j-th hot word, defined as: [formula missing from the extracted text]
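Constructing the network amounts to counting n_z and r_{z,k} over the posts and attaching a weight to every pair of hot words that co-occur. Because the correlation and similarity formulas are missing from the extracted text, the sketch below uses the Jaccard coefficient r / (n_z + n_k - r) as a stand-in weight; that choice and the function name are assumptions.

```python
from itertools import combinations

def build_cooccurrence_network(posts, hot_words):
    """Return an adjacency dict {word: {neighbor: weight}} over the hot words."""
    hot = set(hot_words)
    n = {w: 0 for w in hot}      # n_z: posts containing z
    r = {}                       # r_{z,k}: posts containing both z and k
    for post in posts:
        present = sorted(hot & set(post))
        for w in present:
            n[w] += 1
        for z, k in combinations(present, 2):
            r[(z, k)] = r.get((z, k), 0) + 1
    graph = {w: {} for w in hot}
    for (z, k), both in r.items():
        sim = both / (n[z] + n[k] - both)   # assumed Jaccard weight
        graph[z][k] = sim
        graph[k][z] = sim
    return graph

posts = [
    ["college", "tuition"],
    ["college", "ranking"],
    ["college", "tuition", "other"],
    ["ranking"],
]
graph = build_cooccurrence_network(posts, ["college", "tuition", "ranking"])
```

Only pairs that actually co-occur receive an edge, so "tuition" and "ranking" stay unconnected here while both remain linked to "college".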
Step D: partition the hot word set over the hot word co-occurrence network using a hot word clustering algorithm based on multi-label propagation, obtaining a hot topic set.
The hot word clustering algorithm based on multi-label propagation rests on the following observations. Word co-occurrence networks built from human language or text documents exhibit high clustering and short path lengths. A topic can therefore be regarded as a set of points (words) that are densely connected internally and sparsely connected to the outside, which matches the definition of a community in a complex network. Moreover, topics may share keywords, so topic discovery can be converted into the problem of dividing a word co-occurrence network into overlapping communities. "Multi-label" means that a node is allowed to hold several community labels and thus to belong to several hot word communities, so that one hot word can belong to several topics. Each label carries a membership degree. During label propagation, the labels of the nodes and their membership degrees are updated, the label set of each node is pruned according to a preset threshold, and the nodes are finally assigned to one or more communities (hot topics) according to the labels they hold.
In step D, each hot word in the hot word set corresponds to a node, and each node holds a set of label membership degrees; the membership set of a node is updated in every iteration until the algorithm converges. Fig. 3 is a flow chart of the implementation of step D, which specifically comprises the following steps:
Step D1: initialize the labels of the nodes (hot words) according to the hot word co-occurrence network;
In step D1, labels are initialized as follows: each node is assigned a unique label number and belongs to that label with membership degree [symbol image]; the set of these unique label numbers is denoted uniqueLabels.
Step D2: randomly pick a node v whose labels have not yet been updated, traverse the neighbors of v, update the membership degree of each label in v's label set according to the neighbors' label sets, and normalize the label membership degrees of v;
In step D2, the update rule for label membership is: randomly pick a node v whose labels have not yet been updated, obtain its neighbor set nb(v), and from it the label set labels held by the neighbors; at the h-th iteration, the membership degree of node v in label [symbol image] is:
[formula image not reproduced]
where sim(u, v) is the similarity between node u and node v, and the denominator [symbol image] normalizes the label membership degrees so that those of node v sum to 1.
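Steps D1 and D2 can be sketched as follows. The initial membership degree and the exact update formula are images missing from the extracted text. The sketch assumes an initial degree of 1.0 and a similarity-weighted sum of the neighbors' membership degrees, divided by the total so that the result sums to 1, as the text requires. The function names are illustrative.

```python
def init_labels(graph):
    """Step D1: every node starts with its own unique label (assumed degree 1.0)."""
    return {v: {v: 1.0} for v in graph}

def update_node(v, graph, memberships):
    """Step D2 for one node: similarity-weighted vote of the neighbors' labels."""
    scores = {}
    for u, sim in graph[v].items():
        for label, degree in memberships[u].items():
            scores[label] = scores.get(label, 0.0) + sim * degree
    total = sum(scores.values())
    if total == 0:
        return dict(memberships[v])      # isolated node keeps its labels
    return {label: s / total for label, s in scores.items()}

# Hypothetical weighted network: {node: {neighbor: sim}}.
graph = {
    "a": {"b": 0.6, "c": 0.2},
    "b": {"a": 0.6},
    "c": {"a": 0.2},
    "d": {},
}
memberships = init_labels(graph)
updated_a = update_node("a", graph, memberships)
```

A full pass would visit the nodes in random order and write each result back into `memberships` before moving on.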
Step D3: filter the label set of node v according to a given threshold p, then normalize the membership degrees of the retained labels again;
Specifically, step D3 requires a given parameter p. During the iteration, the label set of a node is filtered after its membership degrees have been updated, and only part of the labels are retained so that the label set does not grow too large. The value of p is the maximum number of labels a node is allowed to hold. The filtering rule is: delete from the node's membership set every element whose membership degree is lower than 1/p. The label set obtained after filtering is normalized again so that the membership degrees of the node sum to 1.
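The filtering rule of step D3 is fully stated in the text and can be sketched directly. One case the text does not cover is a node whose labels all fall below 1/p; the sketch keeps the single strongest label in that case, which is an assumption borrowed from common practice in overlapping label propagation, not from the patent.

```python
def prune_labels(membership, p):
    """Drop labels with degree below 1/p, then renormalize to sum to 1."""
    threshold = 1.0 / p
    kept = {label: d for label, d in membership.items() if d >= threshold}
    if not kept:
        # Assumed fallback: keep the strongest label rather than none at all.
        best = max(membership, key=membership.get)
        kept = {best: membership[best]}
    total = sum(kept.values())
    return {label: d / total for label, d in kept.items()}

pruned = prune_labels({"x": 0.5, "y": 0.4, "z": 0.1}, p=3)
fallback = prune_labels({"x": 0.2, "y": 0.2, "z": 0.2, "w": 0.2, "u": 0.2}, p=2)
```

Since every retained degree is at least 1/p and the degrees sum to at most 1 before pruning, no node can end up with more than p labels.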
Step D4: iterate until the termination condition is met;
In step D4, the termination condition is as follows: the number of nodes under each label is recorded at every iteration; if two consecutive iterations produce the same label set and the recorded number of nodes under each label no longer changes, the iteration ends, that is:
[formula image not reproduced]
where r_h is defined as: [formula missing from the extracted text]
When [symbol image] holds, the iteration ends.
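The termination test described in words above can be sketched as a comparison of per-label node counts between consecutive iterations. The formal definition of r_h is missing from the extracted text, so the sketch follows the prose only; the function names are illustrative.

```python
from collections import Counter

def label_counts(memberships):
    """Number of nodes holding each label, for {node: {label: degree}}."""
    return Counter(label for labels in memberships.values() for label in labels)

def has_converged(prev_counts, memberships):
    """True when the label set and the node count per label are both unchanged."""
    return prev_counts is not None and label_counts(memberships) == prev_counts

step1 = {"a": {"x": 1.0}, "b": {"x": 0.5, "y": 0.5}, "c": {"y": 1.0}}
step2 = {"a": {"x": 1.0}, "b": {"x": 0.4, "y": 0.6}, "c": {"y": 1.0}}
step3 = {"a": {"x": 1.0}, "b": {"x": 1.0}, "c": {"y": 1.0}}
counts1 = label_counts(step1)
```

Membership degrees may still drift between iterations, as in `step2`; under this criterion only the label set and the counts matter.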
Step D5: classify the nodes (hot words) according to the label membership sets obtained by the iteration, obtaining the hot topic set.
Specifically, after the iteration ends, the label set of each node is examined and the node (hot word) is assigned to the corresponding categories (communities). Given a threshold M, each category (community) only needs its M hot words with the highest vitality values to express the corresponding hot topic. M defaults to 10.
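Step D5 can be sketched as grouping words by every label they hold and keeping the M words with the highest vitality in each group. The function name `extract_topics` and the sample values are illustrative assumptions; the default M = 10 is taken from the text.

```python
def extract_topics(memberships, life, m=10):
    """Group hot words by label and keep the m most vital words per topic.

    memberships: {word: {label: degree}} after the final iteration
    life: {word: vitality value}
    """
    communities = {}
    for word, labels in memberships.items():
        for label in labels:
            communities.setdefault(label, []).append(word)
    return {
        label: sorted(words, key=lambda w: life[w], reverse=True)[:m]
        for label, words in communities.items()
    }

memberships = {
    "college": {"t1": 0.5, "t2": 0.5},   # overlapping word: belongs to both topics
    "tuition": {"t1": 1.0},
    "ranking": {"t2": 1.0},
}
life = {"college": 0.5, "tuition": 0.3, "ranking": 0.2}
topics = extract_topics(memberships, life, m=2)
```

Here "college" appears in both topics, which is the overlap that single-label clustering methods cannot express.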
The micro-blog hot topic detection system and method of the present invention consider both the frequency and the burstiness of words. A novel word vitality model is designed for hot word extraction, a word co-occurrence network is then built, and hot words are clustered by multi-label propagation with near-linear time complexity to obtain the hot topics. In summary, the system and method extract hot words and hot topics effectively and considerably improve the precision and time efficiency of hot topic detection.
The above are preferred embodiments of the present invention. Any change made according to the technical solution of the present invention whose resulting function does not go beyond the scope of that technical solution falls within the protection scope of the present invention.

Claims (8)

1. the hot word of microblogging and a much-talked-about topic digging system, is characterized in that, described system comprises:
Pretreatment module, carries out pre-service for the content-data that social networks is issued, and obtains the hot word of candidate, and builds the hot set of words of candidate with this;
Hot word screening module, for the frequency of occurrences in current time and given historical time window and sudden according to the hot word of each candidate of the hot set of words of described candidate, calculates the vitality of the hot word of each candidate, filters out hot word, and builds hot set of words with this;
Hot word co-occurrence net structure module, for calculating the correlativity of hot each hot word of set of words, and constructs hot word co-occurrence network with this;
Hot term clustering module, for according to described hot word co-occurrence network, is used the hot term clustering algorithm of propagating based on many labels to divide hot set of words, obtains much-talked-about topic collection.
2. A micro-blog hot word and hot topic mining method, characterized in that the method comprises the following steps:
Step A: preprocess the content data published in a social network, obtain candidate hot words, and build a candidate hot word set from them;
Step B: according to the occurrence frequency and burstiness of each candidate hot word in the candidate hot word set at the current time and in a given historical time window, calculate the vitality of each candidate hot word, screen out hot words, and build a hot word set from them;
Step C: calculate the correlation between the hot words in the hot word set, and construct a hot word co-occurrence network from it;
Step D: according to the hot word co-occurrence network, partition the hot word set using a hot word clustering algorithm based on multi-label propagation, and obtain a hot topic set.
3. The micro-blog hot word and hot topic mining method according to claim 2, characterized in that, in step B, the process of screening hot words and building the hot word set specifically comprises the following steps:
Step B1: calculate the nutrition value of each candidate hot word in time period $t$. The nutrition value $nutr_{w,t}$ of candidate hot word $w$ is the sum of the contributions that every microblog in the microblog set $tw_t$ of time period $t$ makes to the nutrition value of $w$, computed as:
$$nutr_{w,t} = \sum_{j \in tw_t} contr_{w,j}$$
where $contr_{w,j}$ is the contribution of the $j$-th microblog in time period $t$ to the nutrition value of candidate hot word $w$, with $j \in tw_t$, computed as:
Figure 2013107254004100001DEST_PATH_IMAGE004
where
Figure 2013107254004100001DEST_PATH_IMAGE006
denotes the number of times candidate hot word $w$ occurs in the $j$-th microblog, and
Figure 2013107254004100001DEST_PATH_IMAGE008
denotes the maximum word frequency in the $j$-th microblog;
Step B2: use the burst value of candidate hot word $w$ to describe how sharply its word frequency changes between the current time period and the historical time periods. The burst value $b_{w,t}$ of candidate hot word $w$ is computed as follows: take the $k$ historical time windows preceding time period $t$, each of the same size as time period $t$; then, based on a discrete event model with a binomial distribution, count the microblogs containing candidate hot word $w$ in time period $t$ and in the $k$ historical time windows preceding it, and apply the
Figure 2013107254004100001DEST_PATH_IMAGE010
statistical formula to calculate the burst value of candidate hot word $w$ in time period $t$:
Figure 2013107254004100001DEST_PATH_IMAGE012
where $a$ is the number of microblogs in time period $t$ that contain candidate hot word $w$; $b$ is the average number of microblogs in the $k$ historical time windows that contain $w$; $c$ is the number of microblogs in time period $t$ that do not contain $w$; and $d$ is the average number of microblogs in the $k$ historical time windows that do not contain $w$;
Step B3: combine the nutrition value and the burst value of each candidate hot word to calculate its vitality value. The normalized vitality value $life_{w,t}$ of candidate hot word $w$ is computed as:
Figure 2013107254004100001DEST_PATH_IMAGE014
where $terms$ denotes the candidate hot word set and $w'$ denotes an element of $terms$;
Step B4: sort the candidate hot words in the candidate hot word set by vitality value, select the $L$ top-ranked candidate hot words as hot words, and form the hot word set from them.
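Steps B1 to B4 can be sketched in Python as follows. This is only an illustrative sketch, not the claimed implementation: the formulas for the contribution, the burst value and the vitality value are given as images that are not reproduced in this text, so the code assumes an augmented normalized term frequency for the contribution, a chi-square statistic over the counts a, b, c, d for the burst value, and a product of nutrition and burst normalized over all candidates for the vitality. All function names are hypothetical, and each microblog is assumed to be an already tokenized list of words.

```python
from collections import Counter


def contribution(tokens, word):
    """Contribution of one microblog (a token list) to the nutrition of `word`.

    ASSUMPTION: augmented normalized term frequency, 0.5 + 0.5 * tf / max_tf.
    """
    counts = Counter(tokens)
    if counts[word] == 0:
        return 0.0  # a microblog without the word contributes nothing
    return 0.5 + 0.5 * counts[word] / max(counts.values())


def nutrition(microblogs, word):
    """Step B1: sum of per-microblog contributions over the period's microblogs."""
    return sum(contribution(tokens, word) for tokens in microblogs)


def window_counts(current, history, word):
    """Counts a, b, c, d of step B2.

    current: token lists of period t; history: k windows, each a list of token lists.
    """
    a = sum(word in tokens for tokens in current)
    hits = [sum(word in tokens for tokens in window) for window in history]
    k = len(history)
    b = sum(hits) / k
    c = len(current) - a
    d = sum(len(window) - hit for window, hit in zip(history, hits)) / k
    return a, b, c, d


def burst_value(a, b, c, d):
    """Step B2. ASSUMPTION: chi-square statistic of the 2x2 table [[a, b], [c, d]]."""
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table: no evidence of a change
    return (a + b + c + d) * (a * d - b * c) ** 2 / denom


def vitality(nutr, burst):
    """Step B3. ASSUMPTION: product nutr * burst, normalized over all candidates."""
    raw = {w: nutr[w] * burst[w] for w in nutr}
    total = sum(raw.values())
    return {w: (v / total if total else 0.0) for w, v in raw.items()}


def top_hot_words(life, L):
    """Step B4: the L candidates with the highest vitality (ties broken by word)."""
    return sorted(life, key=lambda w: (-life[w], w))[:L]
```

Under these assumptions the burst value is symmetric: a word whose frequency drops sharply scores as high as one whose frequency rises, so a practical pipeline would probably also compare a with b before treating a word as bursting.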
4. The micro-blog hot word and hot topic mining method according to claim 2, characterized in that, in step C, the correlation $c_{z,k}$ between hot word $z$ and hot word $k$ within the given time period $t$ is defined as:
Figure 2013107254004100001DEST_PATH_IMAGE016
where $r_{z,k}$ is the number of microblogs containing both hot word $z$ and hot word $k$, $n_z$ is the number of microblogs containing hot word $z$, $r_k$ is the number of microblogs containing hot word $k$, and $n$ is the total number of microblogs in time period $t$, $n = |tw_t|$;
The hot word co-occurrence network is defined as $g(v, e, w)$, where $v$ is the node set, representing the hot word set obtained in step B, and $m$ is the number of nodes; $e$ is the set of edges between nodes: for any two nodes
Figure 2013107254004100001DEST_PATH_IMAGE020
, if the words represented by the two nodes co-occur, the edge between the two vertices is built:
Figure 2013107254004100001DEST_PATH_IMAGE022
; $w$ is a mapping from the edge set $e$ to the set of real numbers $r$: if there is an edge
Figure 275677DEST_PATH_IMAGE022
between $v_i$ and $v_j$, the edge weight is the similarity $sim(i, j)$ between the $i$-th hot word and the $j$-th hot word, defined as:
Figure 2013107254004100001DEST_PATH_IMAGE024
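The construction of the co-occurrence network in claim 4 can be illustrated with a short Python sketch. It collects the counts named in the claim (microblogs containing both words, microblogs containing each word, and the total number of microblogs) and creates an edge for every pair of hot words that co-occur at least once. Because the formulas for the correlation and for sim(i, j) are images that are not reproduced in this text, the edge weight is left as a pluggable function; the default used here, a Jaccard coefficient, is only a stand-in and is not the similarity defined in the claim. The function names are hypothetical.

```python
from itertools import combinations


def jaccard(r_zk, n_z, n_k, n):
    """Stand-in edge weight (ASSUMPTION, not the claimed formula)."""
    return r_zk / (n_z + n_k - r_zk)


def cooccurrence_network(microblogs, hot_words, sim_fn=jaccard):
    """Return {(z, k): weight} for every pair of hot words that co-occur.

    microblogs: list of token lists for time period t.
    sim_fn receives (r_zk, n_z, n_k, n) and returns the edge weight.
    """
    hot = set(hot_words)
    containing = {w: 0 for w in hot}  # microblogs containing each hot word
    both = {}                         # microblogs containing both words of a pair
    for tokens in microblogs:
        present = sorted(hot & set(tokens))
        for w in present:
            containing[w] += 1
        for z, k in combinations(present, 2):
            both[(z, k)] = both.get((z, k), 0) + 1
    n = len(microblogs)               # all microblogs in the period
    return {
        (z, k): sim_fn(r_zk, containing[z], containing[k], n)
        for (z, k), r_zk in both.items()
    }
```

Each edge key is an alphabetically ordered pair, so the undirected edge between two hot words is stored only once.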
5. The micro-blog hot word and hot topic mining method according to claim 4, characterized in that, in step D, each hot word in the hot word set, i.e. each node, has a set of label membership degrees, and the label membership degree set of each node is updated in every iteration until the algorithm converges, specifically comprising the following steps:
Step D1: initialize the node labels according to the hot word co-occurrence network;
Step D2: randomly pick a node $v$ whose labels have not yet been updated, traverse the neighbor nodes of $v$, update the membership degree of each label in the label set of $v$ according to the label sets of its neighbor nodes, and normalize the label membership degrees of $v$;
Step D3: iterate repeatedly until the iteration termination condition is met;
Step D4: classify the nodes according to the label membership degree sets obtained from the iteration, and obtain the hot topic set.
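Step D4 does not state how a node is assigned to topics from its label membership degrees. One plausible reading, sketched below in Python, lets a node join every label whose membership degree reaches a threshold and always keeps its strongest label, so that a hot word can belong to several overlapping topics. The threshold value and the function name are assumptions made for illustration only.

```python
def assign_topics(memberships, threshold=0.3):
    """Group nodes into topics from their label membership degrees.

    memberships: {node: {label: degree}} as produced by the iteration.
    ASSUMPTION: a node joins a label's topic if its degree >= threshold,
    and it always joins the topic of its strongest label.
    """
    topics = {}
    for node, degrees in memberships.items():
        if not degrees:
            continue  # a node without labels belongs to no topic
        best = max(degrees, key=degrees.get)
        for label, degree in degrees.items():
            if degree >= threshold or label == best:
                topics.setdefault(label, set()).add(node)
    return topics
```

With this rule a word such as "rescue" that is split evenly between two labels appears in both topics, which is the overlapping behaviour that multi-label propagation is meant to allow.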
6. The micro-blog hot word and hot topic mining method according to claim 5, characterized in that, in step D1, the labels are initialized as follows: each node is assigned a unique label number and belongs to that label number with membership degree
Figure 2013107254004100001DEST_PATH_IMAGE026
; the set of these unique label numbers is denoted $uniqueLabels$.
7. The micro-blog hot word and hot topic mining method according to claim 6, characterized in that, in step D2, the label membership degrees are updated as follows: randomly pick a node $v$ whose labels have not yet been updated, obtain the neighbor node set $nb(v)$ of this node, and from it the set $labels$ of labels held by the neighbor nodes; at the $h$-th iteration, the membership degree with which node $v$ belongs to label number
Figure 2013107254004100001DEST_PATH_IMAGE028
is:
Figure 2013107254004100001DEST_PATH_IMAGE030
where $sim(u, v)$ is the similarity between node $u$ and node $v$, and the denominator
Figure 2013107254004100001DEST_PATH_IMAGE032
normalizes the label membership degrees, ensuring that the label membership degrees of node $v$ sum to 1.
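The initialization of claim 6 and the update rule of claim 7 can be sketched in Python as follows. The exact update formula is an image that is not reproduced in this text, so the sketch assumes the form suggested by the surrounding description: each label's new degree for node v is the similarity-weighted sum of that label's degrees over the neighbors of v, divided by the same sum taken over all labels so that the degrees of v add up to 1. The initial membership degree of 1.0 and all names are likewise assumptions.

```python
def init_labels(nodes):
    """Step D1: every node starts with its own unique label.

    ASSUMPTION: the initial membership degree is 1.0.
    """
    return {v: {v: 1.0} for v in nodes}


def update_node(v, neighbours, sim, memberships):
    """Step D2 for a single node v.

    neighbours:  {node: iterable of neighbour nodes}
    sim:         {frozenset({u, v}): edge weight}
    memberships: {node: {label: degree}} from the previous state
    Returns the new {label: degree} mapping for v, normalized to sum to 1.
    """
    scores = {}
    for u in neighbours[v]:
        weight = sim[frozenset((u, v))]
        for label, degree in memberships[u].items():
            scores[label] = scores.get(label, 0.0) + weight * degree
    total = sum(scores.values())
    if total == 0:
        return dict(memberships[v])  # isolated node keeps its current labels
    return {label: score / total for label, score in scores.items()}
```

A driver loop would call `update_node` once per node in random order within each iteration and write the result back into `memberships` before moving on to the next node.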
8. The micro-blog hot word and hot topic mining method according to claim 7, characterized in that, in step D3, the iteration termination condition is:
where $r_h$ is defined as:
Figure 2013107254004100001DEST_PATH_IMAGE036
When
Figure 2013107254004100001DEST_PATH_IMAGE038
, the iteration ends.
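The quantity that claim 8 uses for the termination condition is defined only in formula images that are not reproduced in this text. As a purely illustrative substitute, the Python sketch below stops when the largest change in any label membership degree between two consecutive iterations falls below a small tolerance, or when an iteration cap is reached. The tolerance, the cap and the function names are assumptions and should not be read as the claimed criterion.

```python
def max_membership_change(prev, curr):
    """Largest absolute change of any (node, label) degree between two iterations."""
    change = 0.0
    for node, degrees in curr.items():
        old = prev.get(node, {})
        for label in set(old) | set(degrees):
            delta = abs(degrees.get(label, 0.0) - old.get(label, 0.0))
            change = max(change, delta)
    return change


def should_stop(prev, curr, h, eps=1e-3, max_iter=100):
    """ASSUMED stopping rule: converged within eps, or iteration h hit the cap."""
    return h >= max_iter or max_membership_change(prev, curr) < eps
```

The iteration cap guards against oscillation, which label propagation can exhibit when nodes are updated in a different random order on every pass.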
CN201310725400.4A 2013-12-25 2013-12-25 Micro-blog hot word and hot topic mining system and method Expired - Fee Related CN103678670B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310725400.4A CN103678670B (en) 2013-12-25 2013-12-25 Micro-blog hot word and hot topic mining system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310725400.4A CN103678670B (en) 2013-12-25 2013-12-25 Micro-blog hot word and hot topic mining system and method

Publications (2)

Publication Number Publication Date
CN103678670A true CN103678670A (en) 2014-03-26
CN103678670B CN103678670B (en) 2017-01-11

Family

ID=50316214

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310725400.4A Expired - Fee Related CN103678670B (en) 2013-12-25 2013-12-25 Micro-blog hot word and hot topic mining system and method

Country Status (1)

Country Link
CN (1) CN103678670B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2700629A1 (en) * 2010-05-13 2011-11-13 Gerard Voon Shopping enabler
CN102346766A (en) * 2011-09-20 2012-02-08 北京邮电大学 Method and device for detecting network hot topics found based on maximal clique
CN103294818A (en) * 2013-06-12 2013-09-11 北京航空航天大学 Multi-information fusion microblog hot topic detection method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
龙志祎等: "基于词聚类的热点话题检测算法", 《计算机工程与设计》, vol. 32, no. 6, 30 June 2011 (2011-06-30), pages 2214 - 2217 *

Cited By (39)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104063428A (en) * 2014-06-09 2014-09-24 国家计算机网络与信息安全管理中心 Method for detecting unexpected hot topics in Chinese microblogs
CN104156436B (en) * 2014-08-13 2017-05-10 福州大学 Social association cloud media collaborative filtering and recommending method
CN104156436A (en) * 2014-08-13 2014-11-19 福州大学 Social association cloud media collaborative filtering and recommending method
CN105095988A (en) * 2015-07-01 2015-11-25 中国科学院计算技术研究所 Method and system for detecting social network information explosion
CN106610989B (en) * 2015-10-22 2021-06-01 北京国双科技有限公司 Search keyword clustering method and device
CN106610989A (en) * 2015-10-22 2017-05-03 北京国双科技有限公司 Search keyword clustering method and apparatus
CN105488196B (en) * 2015-12-07 2019-01-22 中国人民大学 A kind of hot topic automatic mining system based on interconnection corpus
CN105488196A (en) * 2015-12-07 2016-04-13 中国人民大学 Automatic hot topic mining system based on internet corpora
CN106919627A (en) * 2015-12-28 2017-07-04 北京国双科技有限公司 The treating method and apparatus of hot word
CN106446191B (en) * 2016-09-30 2019-11-05 浙江工业大学 A kind of multiple features network flow row label prediction technique returned based on Logistic
CN106446191A (en) * 2016-09-30 2017-02-22 浙江工业大学 Logistic regression based multi-feature network popular tag prediction method
CN108170693A (en) * 2016-12-07 2018-06-15 北京国双科技有限公司 Push the method and device of hot word
CN108170693B (en) * 2016-12-07 2020-07-31 北京国双科技有限公司 Hot word pushing method and device
CN108182191A (en) * 2016-12-08 2018-06-19 腾讯科技(深圳)有限公司 A kind of hot spot data processing method and its equipment
CN108182191B (en) * 2016-12-08 2022-01-18 腾讯科技(深圳)有限公司 Hotspot data processing method and device
CN108241611A (en) * 2016-12-26 2018-07-03 北京国双科技有限公司 A kind of keyword extracting method and extraction equipment
CN108241611B (en) * 2016-12-26 2021-08-17 北京国双科技有限公司 Keyword extraction method and extraction equipment
CN108804432A (en) * 2017-04-26 2018-11-13 慧科讯业有限公司 It is a kind of based on network media data Stream Discovery and to track the mthods, systems and devices of much-talked-about topic
CN107122478A (en) * 2017-05-03 2017-09-01 成都云数未来信息科学有限公司 A kind of method based on keyword extraction much-talked-about topic
CN107122478B (en) * 2017-05-03 2020-05-08 成都云数未来信息科学有限公司 Method for extracting hot topics based on keywords
CN108304371A (en) * 2017-07-14 2018-07-20 腾讯科技(深圳)有限公司 Method, apparatus, computer equipment and the storage medium that Hot Contents excavate
CN108304371B (en) * 2017-07-14 2021-07-13 腾讯科技(深圳)有限公司 Method and device for mining hot content, computer equipment and storage medium
CN109509110A (en) * 2018-07-27 2019-03-22 福州大学 Method is found based on the hot microblog topic for improving BBTM model
CN109509110B (en) * 2018-07-27 2021-08-31 福州大学 Microblog hot topic discovery method based on improved BBTM model
CN110377823A (en) * 2019-06-28 2019-10-25 厦门美域中央信息科技有限公司 A kind of building of hot spot digging system under Hadoop frame
CN110765239A (en) * 2019-10-29 2020-02-07 腾讯科技(深圳)有限公司 Hot word recognition method, device and storage medium
CN110765239B (en) * 2019-10-29 2023-03-28 腾讯科技(深圳)有限公司 Hot word recognition method, device and storage medium
CN111125484A (en) * 2019-12-17 2020-05-08 网易(杭州)网络有限公司 Topic discovery method and system and electronic device
CN111125484B (en) * 2019-12-17 2023-06-30 网易(杭州)网络有限公司 Topic discovery method, topic discovery system and electronic equipment
CN112668836A (en) * 2020-12-07 2021-04-16 数据地平线(广州)科技有限公司 Risk graph-oriented associated risk evidence efficient mining and monitoring method and device
CN112668836B (en) * 2020-12-07 2024-04-05 数据地平线(广州)科技有限公司 Risk spectrum-oriented associated risk evidence efficient mining and monitoring method and apparatus
CN113673224B (en) * 2021-08-19 2022-04-05 北京三快在线科技有限公司 Method and device for recognizing popular vocabulary, computer equipment and readable storage medium
CN113673224A (en) * 2021-08-19 2021-11-19 北京三快在线科技有限公司 Method and device for recognizing popular vocabulary, computer equipment and readable storage medium
CN113836307A (en) * 2021-10-15 2021-12-24 国网北京市电力公司 Power supply service work order hotspot discovery method, system and device and storage medium
CN113836307B (en) * 2021-10-15 2024-02-20 国网北京市电力公司 Power supply service work order hot spot discovery method, system, device and storage medium
CN114938477A (en) * 2022-06-23 2022-08-23 阿里巴巴(中国)有限公司 Video topic determination method, device and equipment
CN114938477B (en) * 2022-06-23 2024-05-03 阿里巴巴(中国)有限公司 Video topic determination method, device and equipment
CN117076963A (en) * 2023-10-17 2023-11-17 北京国科众安科技有限公司 Information heat analysis method based on big data platform
CN117076963B (en) * 2023-10-17 2024-01-02 北京国科众安科技有限公司 Information heat analysis method based on big data platform

Also Published As

Publication number Publication date
CN103678670B (en) 2017-01-11

Legal Events

Date Code Title Description
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170111

Termination date: 20191225
