CN103678670A - Micro-blog hot word and hot topic mining system and method - Google Patents
Micro-blog hot word and hot topic mining system and method
- Publication number
- CN103678670A, CN201310725400.4A
- Authority
- CN
- China
- Prior art keywords
- hot
- hot word
- candidate
- word
- node
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
Abstract
The invention relates to the technical field of social networks, and in particular to a micro-blog hot word and hot topic mining system and method. The method comprises the following steps: content data published on a micro-blog is preprocessed to obtain a set of candidate hot words; the vitality of each candidate hot word is computed from its frequency of occurrence and its burstiness at the current moment and within a given historical time window, and the candidates are screened to form a hot word set; the correlation between the hot words in that set is computed and a hot word co-occurrence network is constructed; and the hot word set is partitioned over that network by a hot word clustering algorithm based on multi-label propagation to obtain a hot topic set. The system and method mine micro-blog hot words and hot topics efficiently and improve both mining precision and processing efficiency.
Description
Technical field
The present invention relates to the technical field of social networks, and in particular to a micro-blog hot word and hot topic mining system and method.
Background art
With the rise of micro-blogging, public participation keeps growing: users can post what they see and hear from a computer or mobile phone anytime and anywhere, and share it instantly. Micro-blogging has become an Internet fashion and is also an important place where hot topics emerge and are discussed. A hot topic is a topic that appears frequently on the network over a period of time and is widely followed and discussed. The exponential growth of micro-blog information makes handling this mass of information effectively and extracting hot topics from it a pressing problem.
For hot topic detection, the traditional approach is to cluster texts. This does not let users identify hot topics intuitively, and because micro-blog posts are short, the data is sparse and unevenly distributed, so such methods perform poorly at discovering hot topics. The mainstream approach is therefore to discover hot topics by extracting hot words and clustering them.
Classical methods for measuring word importance and extracting hot words include TFIDF and TFPDF. The main idea of TFIDF is that raw word frequency does not fully represent the features of a text: words such as "yes" or the slang term rendered here as "refreshing horse" occur frequently but explain almost nothing about the text. If a word occurs very frequently in one text and rarely in other texts, it better reflects the features of that text. This method, however, is not suitable for weighting words in micro-blogs. Posts are short, so a word is seldom repeated within a single post, and once a hot topic appears it is widely forwarded and discussed, so a large number of posts contain the same keywords. Extracting keywords with TFIDF therefore loses important vocabulary to some extent. For this reason, some scholars have proposed TFPDF, which gives higher weight to words that occur in most documents in order to extract hot vocabulary. This helps extract the key vocabulary related to hot topics, but it also extracts words that occur frequently yet do not explain any topic. Hot vocabulary consists of words whose frequency surges within a period of time, and neither of the two methods considers how words are distributed over time, which hampers hot word extraction.
For hot word clustering, existing methods include: 1) the Bisecting K-means clustering algorithm, which is insensitive to the initial clusters; 2) building a word similarity matrix and clustering with the Affinity Propagation algorithm without specifying the number of clusters, with a time complexity approaching [expression not reproduced in the source text]; 3) density-based clustering algorithms such as DBSCAN; 4) hierarchical clustering algorithms, and so on.
For hot spot discovery over massive micro-blog data, existing hot word clustering methods have two main problems. First, in the clustering result the words related to different topics are not allowed to overlap. This does not match reality and easily causes some topics to be missed or to be hard to recognize. For example, for the two topics "university fees" and "university rankings", the word "university" can belong to at most one topic, and whichever topic loses the keyword "university" becomes hard to recognize as the original topic. Second, traditional clustering algorithms have high time complexity and struggle to meet the requirements of clustering massive micro-blog data.
In summary, fairly mature techniques and methods exist for analyzing the influence of individual users in social networks, but methods for analyzing influence at the community level are comparatively few, and a comprehensive assessment of the influence of each community in a social network is lacking. In large-scale social network scenarios, existing methods struggle to meet requirements in both analytical effect and efficiency.
Summary of the invention
The object of the invention is to overcome the deficiencies of the prior art by providing a micro-blog hot word and hot topic mining system and method that improve the accuracy and processing efficiency of micro-blog hot spot discovery.
To achieve this object, the technical scheme of the invention is a micro-blog hot word and hot topic mining system comprising a preprocessing module, a hot word screening module, a hot word co-occurrence network construction module and a hot word clustering module;
The preprocessing module preprocesses the content data published in the social network, obtains candidate hot words, and builds the candidate hot word set from them;
The hot word screening module computes the vitality of each candidate hot word in the candidate hot word set from its frequency of occurrence and burstiness at the current moment and within a given historical time window, screens out the hot words, and builds the hot word set from them;
The hot word co-occurrence network construction module computes the correlation between the hot words in the hot word set and constructs the hot word co-occurrence network from it;
The hot word clustering module partitions the hot word set over the hot word co-occurrence network using a hot word clustering algorithm based on multi-label propagation, and obtains the hot topic set.
The invention also provides a micro-blog hot word and hot topic mining method comprising the following steps:
Step A: preprocess the content data published in the social network, obtain candidate hot words, and build the candidate hot word set from them;
Step B: compute the vitality of each candidate hot word in the candidate hot word set from its frequency of occurrence and burstiness at the current moment and within a given historical time window, screen out the hot words, and build the hot word set from them;
Step C: compute the correlation between the hot words in the hot word set and construct the hot word co-occurrence network from it;
Step D: partition the hot word set over the hot word co-occurrence network using a hot word clustering algorithm based on multi-label propagation, and obtain the hot topic set.
Further, in step B, the process of screening hot words and building the hot word set comprises the following steps:
Step B1: compute the nutrition value of each candidate hot word within time period $t$. The nutrition value $nutr_{w,t}$ of candidate hot word $w$ is the sum of the contributions that every post in the micro-blog post set $TW_t$ of time period $t$ makes to the nutrition value of $w$:
$$nutr_{w,t} = \sum_{j \in TW_t} contr_{w,j}$$
where $contr_{w,j}$ is the contribution of the $j$-th post in time period $t$ to the nutrition value of candidate hot word $w$, with $j \in TW_t$, computed as:
[formula not reproduced in the source text]
where the two quantities in the formula are the number of times candidate hot word $w$ occurs in the $j$-th post and the maximum word frequency in the $j$-th post;
Step B2: use the burst value of candidate hot word $w$ to describe how sharply its word frequency changes between the current time period and the historical time periods. The burst value $b_{w,t}$ of candidate hot word $w$ is computed as follows: take the $k$ historical time windows preceding time period $t$, each of the same size as $t$; then, based on a discrete event model with a binomial distribution, count the posts that contain candidate hot word $w$ in time period $t$ and in the $k$ historical time windows preceding $t$, and apply a statistical formula (its name is not reproduced in the source text) to compute the burst value of $w$ within time period $t$:
[formula not reproduced in the source text]
where $A$ is the number of posts in time period $t$ that contain candidate hot word $w$; $B$ is the average number of posts per window, over the $k$ historical time windows, that contain $w$; $C$ is the number of posts in time period $t$ that do not contain $w$; and $D$ is the average number of posts per window, over the $k$ historical time windows, that do not contain $w$;
Step B3: combine the nutrition value and the burst value of each candidate hot word to compute its vitality value. The normalized vitality value $life_{w,t}$ of candidate hot word $w$ is computed as:
[formula not reproduced in the source text]
where $Terms$ denotes the candidate hot word set and $w'$ denotes an element of $Terms$;
Step B4: sort the candidate hot words in the candidate hot word set by vitality value, select the $L$ top-ranked candidate hot words as hot words, and form the hot word set from them.
Further, in step C, the correlation $c_{z,k}$ between hot word $z$ and hot word $k$ within a given time period $t$ is defined as:
[formula not reproduced in the source text]
where $r_{z,k}$ is the number of posts that contain both hot word $z$ and hot word $k$, $n_z$ is the number of posts that contain hot word $z$, $r_k$ is the number of posts that contain hot word $k$, and $N$ is the total number of posts in time period $t$, $N = |TW_t|$;
The hot word co-occurrence network is defined as $G(V, E, W)$, where $V = \{v_1, v_2, \dots, v_M\}$ is the node set, representing the hot word set obtained in step B, and $M$ is the number of nodes; $E$ is the set of edges between nodes: for any two nodes $v_i, v_j \in V$, if the words they represent co-occur, an edge $e_{ij}$ is built between the two vertices; $W$ is a mapping from the edge set $E$ to the set of real numbers $\mathbb{R}$: if there is an edge $e_{ij}$ between $v_i$ and $v_j$, its weight is the similarity $sim(i, j)$ between the $i$-th hot word and the $j$-th hot word, defined as:
[formula not reproduced in the source text]
Further, in step D, each hot word in the hot word set is a node, and each node has a label membership set. The label membership sets of the nodes are updated in every iteration until the algorithm converges. The step comprises the following sub-steps:
Step D1: initialize the node labels according to the hot word co-occurrence network;
Step D2: randomly pick a node $v$ whose labels have not yet been updated, traverse the neighbor nodes of $v$, update the membership degree of each label in the label set of $v$ according to the label sets of its neighbors, and normalize the label membership degrees of $v$;
Step D3: iterate until the termination condition is met;
Step D4: classify the nodes according to the label membership sets obtained by the iteration, and obtain the hot topic set.
Further, in step D1, the labels are initialized as follows: each node is assigned a unique label and belongs to that label with membership degree 1; the set of these unique labels is denoted $uniqueLabels$.
Further, in step D2, the label membership degrees are updated as follows: randomly pick a node $v$ whose labels have not yet been updated, obtain its neighbor node set $Nb(v)$, and from it the label set $Labels$ held by the neighbors. At the $h$-th iteration, the degree to which node $v$ belongs to a given label is:
[formula not reproduced in the source text]
where $sim(u, v)$ denotes the similarity between node $u$ and node $v$, and the denominator normalizes the label membership degrees so that those of node $v$ sum to 1.
Further, in step D3, the termination condition is:
[formula not reproduced in the source text]
where $R_h$ is defined as:
[formula not reproduced in the source text]
Compared with the prior art, the invention has the following beneficial effects: the vitality of each candidate hot word is computed from its frequency of occurrence and burstiness at the current moment and within a given historical time window, and the hot word set is screened out; the correlation between hot words is computed from the screened hot word set and the hot word co-occurrence network is constructed; and the hot word set is partitioned by a hot word clustering algorithm based on multi-label propagation to obtain the hot topic set. The system and method mine social network hot topics efficiently and improve both topic detection precision and processing efficiency.
Brief description of the drawings
Fig. 1 is a schematic diagram of the module structure of the system of the invention.
Fig. 2 is a flow chart of the method of the invention.
Fig. 3 is a flow chart of the micro-blog hot word clustering in the method of the invention.
Detailed description of the embodiments
The invention is further described below with reference to the drawings and specific embodiments.
Fig. 1 is a schematic diagram of the module structure of the micro-blog hot word and hot topic mining system of the invention. As shown in Fig. 1, the system comprises a preprocessing module 100, a hot word screening module 200, a hot word co-occurrence network construction module 300 and a hot word clustering module 400.
The preprocessing module 100 preprocesses the content data published in the social network, obtains candidate hot words, and builds the candidate hot word set from them. The hot word screening module 200 computes the vitality of each candidate hot word in the candidate hot word set from its frequency of occurrence and burstiness at the current moment and within a given historical time window, screens out the hot words, and builds the hot word set from them. The hot word co-occurrence network construction module 300 computes the correlation between the hot words in the hot word set and constructs the hot word co-occurrence network from it. The hot word clustering module 400 partitions the hot word set over the hot word co-occurrence network using a hot word clustering algorithm based on multi-label propagation, and obtains the hot topic set.
Fig. 2 is a flow chart of the micro-blog hot word and hot topic mining method of the invention. As shown in Fig. 2, the method comprises the following steps:
Step A: preprocess the content data published in the social network, obtain candidate hot words, and build the candidate hot word set from them.
Specifically, the ICTCLAS tool of the Chinese Academy of Sciences can be used for word segmentation and part-of-speech tagging. The nouns and verbs, which characterize topics most strongly, are extracted and then further filtered with a stop word list to obtain the candidate hot word set, denoted $Terms = \{w_1, w_2, \dots, w_R\}$, where $R$ is the number of candidate words.
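The filtering part of step A can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the posts have already been segmented and tagged, and the function name `candidate_hot_words`, the tag codes in `KEEP_TAGS` and the sample data are all hypothetical.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical part-of-speech codes for nouns and verbs; a real segmenter
# defines its own tag set.
KEEP_TAGS = {"n", "v"}


def candidate_hot_words(
    tagged_posts: Iterable[List[Tuple[str, str]]],
    stop_words: Set[str],
) -> List[str]:
    """Collect distinct nouns and verbs that are not stop words, in first-seen order."""
    seen: Set[str] = set()
    candidates: List[str] = []
    for post in tagged_posts:
        for word, tag in post:
            if tag in KEEP_TAGS and word not in stop_words and word not in seen:
                seen.add(word)
                candidates.append(word)
    return candidates


# Two already-segmented posts, each a list of (word, tag) pairs.
posts = [
    [("university", "n"), ("raise", "v"), ("the", "u"), ("fees", "n")],
    [("university", "n"), ("ranking", "n"), ("is", "v"), ("published", "v")],
]
terms = candidate_hot_words(posts, stop_words={"is"})
```

Here `terms` plays the role of the candidate hot word set $Terms$; segmentation and tagging themselves are left to whatever tool is used upstream.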
Step B: compute the vitality of each candidate hot word in the candidate hot word set from its frequency of occurrence and burstiness at the current moment and within a given historical time window, screen out the hot words, and build the hot word set from them.
In step B, the process of screening hot words and building the hot word set comprises the following steps:
Step B1: compute the nutrition value of each candidate hot word within time period $t$. The nutrition value $nutr_{w,t}$ of candidate hot word $w$ is the sum of the contributions that every post in the micro-blog post set $TW_t$ of time period $t$ makes to the nutrition value of $w$:
$$nutr_{w,t} = \sum_{j \in TW_t} contr_{w,j}$$
where $contr_{w,j}$ is the contribution of the $j$-th post in time period $t$ to the nutrition value of candidate hot word $w$, with $j \in TW_t$, computed as:
[formula not reproduced in the source text]
where the two quantities in the formula are the number of times candidate hot word $w$ occurs in the $j$-th post and the maximum word frequency in the $j$-th post;
Step B2: use the burst value of candidate hot word $w$ to describe how sharply its word frequency changes between the current time period and the historical time periods. The burst value $b_{w,t}$ of candidate hot word $w$ is computed as follows: take the $k$ historical time windows preceding time period $t$, each of the same size as $t$; then, based on a discrete event model with a binomial distribution, count the posts that contain candidate hot word $w$ in time period $t$ and in the $k$ historical time windows preceding $t$, and apply a statistical formula (its name is not reproduced in the source text) to compute the burst value of $w$ within time period $t$:
[formula not reproduced in the source text]
where $A$ is the number of posts in time period $t$ that contain candidate hot word $w$; $B$ is the average number of posts per window, over the $k$ historical time windows, that contain $w$; $C$ is the number of posts in time period $t$ that do not contain $w$; and $D$ is the average number of posts per window, over the $k$ historical time windows, that do not contain $w$;
Step B3: combine the nutrition value and the burst value of each candidate hot word to compute its vitality value. The normalized vitality value $life_{w,t}$ of candidate hot word $w$ is computed as:
[formula not reproduced in the source text]
where $Terms$ denotes the candidate hot word set and $w'$ denotes an element of $Terms$;
Step B4: sort the candidate hot words in the candidate hot word set by vitality value, select the $L$ top-ranked candidate hot words as hot words, and form the hot word set from them.
Specifically, once the vitality value of each candidate hot word has been computed, the Quick Sort algorithm can be used to sort the candidate hot words by vitality value from high to low, and, for a given threshold M, the M candidate hot words with the highest vitality values are selected as the hot words of time period $t$.
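Steps B1 to B4 can be sketched as below. Because the formulas are missing from this text, three choices here are assumptions made only for illustration: the per-post contribution is taken as word frequency divided by the maximum word frequency in the post, the burst value is taken as the standard chi-square statistic of the 2x2 table formed by $A$, $B$, $C$, $D$, and vitality is taken as the product of nutrition and burst normalized over all candidates. All function names are hypothetical, and Python's built-in `sorted` stands in for Quick Sort.

```python
from collections import Counter


def nutrition(word, posts):
    """Sum of per-post contributions (assumed: tf of word / max tf in the post)."""
    total = 0.0
    for tokens in posts:
        counts = Counter(tokens)
        if word in counts:
            total += counts[word] / max(counts.values())
    return total


def burst(word, current, history_windows):
    """Assumed burst value: chi-square statistic over the counts A, B, C, D."""
    k = len(history_windows)
    a = sum(word in post for post in current)           # current posts with word
    c = len(current) - a                                # current posts without word
    b = sum(word in post for win in history_windows for post in win) / k
    d = sum(word not in post for win in history_windows for post in win) / k
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def vitality(terms, current, history_windows):
    """Assumed vitality: nutrition * burst, normalized so the values sum to 1."""
    raw = {w: nutrition(w, current) * burst(w, current, history_windows) for w in terms}
    total = sum(raw.values())
    return {w: (v / total if total else 0.0) for w, v in raw.items()}


def top_hot_words(life, m):
    """Keep the m candidates with the highest vitality."""
    return sorted(life, key=life.get, reverse=True)[:m]


# One current window of four posts and one historical window (k = 1).
current = [["quake", "rescue"], ["quake", "news"], ["quake"], ["weather"]]
history = [[["weather"], ["news"], ["weather", "news"], ["music"]]]
life = vitality(["quake", "weather", "news"], current, history)
hot_words = top_hot_words(life, 1)
```

In this toy data "quake" appears in three of four current posts and in none of the historical ones, so it dominates the ranking. Note that a chi-square statistic is symmetric: a word whose frequency collapses also scores highly, so a real implementation may need to check the direction of the change.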
Step C: compute the correlation between the hot words in the hot word set and construct the hot word co-occurrence network from it.
In step C, the correlation $c_{z,k}$ between hot word $z$ and hot word $k$ within a given time period $t$ is defined as:
[formula not reproduced in the source text]
where $r_{z,k}$ is the number of posts that contain both hot word $z$ and hot word $k$, $n_z$ is the number of posts that contain hot word $z$, $r_k$ is the number of posts that contain hot word $k$, and $N$ is the total number of posts in time period $t$, $N = |TW_t|$;
The hot word co-occurrence network is defined as $G(V, E, W)$, where $V = \{v_1, v_2, \dots, v_M\}$ is the node set, representing the hot word set obtained in step B, and $M$ is the number of nodes; $E$ is the set of edges between nodes: for any two nodes $v_i, v_j \in V$, if the words they represent co-occur, an edge $e_{ij}$ is built between the two vertices; $W$ is a mapping from the edge set $E$ to the set of real numbers $\mathbb{R}$: if there is an edge $e_{ij}$ between $v_i$ and $v_j$, its weight is the similarity $sim(i, j)$ between the $i$-th hot word and the $j$-th hot word, defined as:
[formula not reproduced in the source text]
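The construction of the co-occurrence network in step C can be sketched as follows. The similarity formula is not available in this text, so the edge weight below uses the Jaccard coefficient of the post sets of two words purely as a stand-in; `build_cooccurrence_network` and the sample data are hypothetical.

```python
from itertools import combinations


def build_cooccurrence_network(hot_words, posts):
    """Return {node: {neighbor: weight}} with an edge wherever two hot words co-occur."""
    # Index of the posts that contain each hot word.
    containing = {w: {i for i, post in enumerate(posts) if w in post} for w in hot_words}
    graph = {w: {} for w in hot_words}
    for z, k in combinations(hot_words, 2):
        both = containing[z] & containing[k]
        if both:  # the two words co-occur in at least one post
            # Stand-in similarity (an assumption): Jaccard coefficient.
            weight = len(both) / len(containing[z] | containing[k])
            graph[z][k] = weight
            graph[k][z] = weight
    return graph


posts = [
    ["university", "fees"],
    ["university", "ranking"],
    ["university", "fees", "ranking"],
    ["weather"],
]
graph = build_cooccurrence_network(["university", "fees", "ranking", "weather"], posts)
```

The adjacency-dictionary form keeps the network undirected by storing each weight in both directions, and a word that never co-occurs with another hot word ends up as an isolated node.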
Step D: partition the hot word set over the hot word co-occurrence network using a hot word clustering algorithm based on multi-label propagation, and obtain the hot topic set.
The hot word clustering algorithm based on multi-label propagation rests on the following observations. Word co-occurrence networks built from human language or text documents exhibit high clustering and short path lengths. A topic can therefore be regarded as a set of points (words) that are tightly connected internally and sparsely connected to the outside, which matches the definition of a community in a complex network. Moreover, topics may share keywords, so topic discovery can be converted into the problem of dividing a word co-occurrence network into overlapping communities. Multi-label means that a node may hold several community labels and belong to several hot word communities, that is, a hot word may belong to several topics. Each label carries a label membership degree. During label propagation the labels of the nodes and their membership degrees are updated, the label set of each node is pruned according to a set threshold, and finally each node is assigned to one or more communities (hot topics) according to the labels it holds.
In step D, each hot word in the hot word set is a node, and each node has a label membership set. The label membership sets of the nodes are updated in every iteration until the algorithm converges. Fig. 3 is a flow chart of the implementation of step D, which comprises the following sub-steps:
Step D1: initialize the labels of the nodes (hot words) according to the hot word co-occurrence network;
In step D1, the labels are initialized as follows: each node is assigned a unique label and belongs to that label with membership degree 1; the set of these unique labels is denoted $uniqueLabels$.
Step D2: randomly pick a node $v$ whose labels have not yet been updated, traverse the neighbor nodes of $v$, update the membership degree of each label in the label set of $v$ according to the label sets of its neighbors, and normalize the label membership degrees of $v$;
In step D2, the label membership degrees are updated as follows: randomly pick a node $v$ whose labels have not yet been updated, obtain its neighbor node set $Nb(v)$, and from it the label set $Labels$ held by the neighbors. At the $h$-th iteration, the degree to which node $v$ belongs to a given label is:
[formula not reproduced in the source text]
where $sim(u, v)$ denotes the similarity between node $u$ and node $v$, and the denominator normalizes the label membership degrees so that those of node $v$ sum to 1.
Step D3: filter the label set of node $v$ according to a given threshold $p$, then normalize the membership degrees of the retained labels again;
Specifically, step D3 takes a given parameter $p$ and, during the iteration, filters the label set of each node after its label membership degrees have been updated, retaining only some of the labels so that the label set does not grow too large. The value of $p$ is the maximum number of labels a node is allowed to hold. The filtering rule is: delete from the node's label membership set every element whose membership degree is below $1/p$. The label set that remains after filtering is normalized again so that the label membership degrees of the node sum to 1.
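The filtering rule of step D3 is simple enough to show directly. The sketch below follows the rule as stated (drop memberships below $1/p$, then renormalize); the fallback for the case where every label falls below the threshold is an assumption added here, since the text does not say what happens then, and `filter_labels` is a hypothetical name.

```python
def filter_labels(memberships, p):
    """Drop labels whose membership degree is below 1/p, then renormalize to sum to 1."""
    threshold = 1.0 / p
    kept = {label: b for label, b in memberships.items() if b >= threshold}
    if not kept:
        # Assumption: if every label is below the threshold, keep the strongest one.
        best = max(memberships, key=memberships.get)
        kept = {best: memberships[best]}
    total = sum(kept.values())
    return {label: b / total for label, b in kept.items()}


# With p = 4 the threshold is 0.25, so label "t3" (0.2) is removed.
filtered = filter_labels({"t1": 0.5, "t2": 0.3, "t3": 0.2}, p=4)
```

Because the retained membership degrees are each at least $1/p$ before renormalization and they sum to at most 1, a node can never keep more than $p$ labels, which is why $p$ acts as the maximum label count.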
Step D4: iterate until the termination condition is met;
In step D4, the termination condition is as follows: if, over two adjacent iterations, the recorded number of nodes inside each label no longer changes, the iteration ends. That is, in the case where the label sets produced are the same:
[formula not reproduced in the source text]
where $R_h$ is defined as:
[formula not reproduced in the source text]
Step D5: classify the nodes (hot words) according to the label membership sets obtained by the iteration, and obtain the hot topic set.
Specifically, after the iteration ends, the label set of each node is examined and the nodes (hot words) are assigned to the corresponding categories (communities). For a given threshold M, each category (community) only needs its M top-ranked hot words by vitality value to express the corresponding hot topic. M defaults to 10.
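Steps D1 to D5 can be put together in a short sketch. The update and termination formulas are missing from this text, so the code assumes a similarity-weighted sum of the neighbors' membership degrees for the update and compares per-label node counts between consecutive iterations for termination. The function name `propagate_labels`, the `max_iter` cap and the fixed `seed` are additions for illustration only.

```python
import random


def propagate_labels(graph, p, max_iter=50, seed=0):
    """Overlapping label propagation on {node: {neighbor: similarity}}; returns {label: nodes}."""
    rng = random.Random(seed)
    # D1: every node starts with its own unique label at membership degree 1.
    labels = {v: {v: 1.0} for v in graph}
    previous_sizes = None
    for _ in range(max_iter):
        order = list(graph)
        rng.shuffle(order)  # visit the not-yet-updated nodes in random order
        for v in order:
            if not graph[v]:
                continue  # an isolated node keeps its own label
            # D2 (assumed rule): similarity-weighted sum of neighbor memberships.
            scores = {}
            for u, sim in graph[v].items():
                for label, b in labels[u].items():
                    scores[label] = scores.get(label, 0.0) + sim * b
            total = sum(scores.values())
            scores = {label: s / total for label, s in scores.items()}
            # D3: drop labels below 1/p (keep the best if none survive), renormalize.
            kept = {label: s for label, s in scores.items() if s >= 1.0 / p}
            if not kept:
                best = max(scores, key=scores.get)
                kept = {best: scores[best]}
            total = sum(kept.values())
            labels[v] = {label: s / total for label, s in kept.items()}
        # D4 (assumed reading): stop when the node count per label is unchanged.
        sizes = {}
        for memberships in labels.values():
            for label in memberships:
                sizes[label] = sizes.get(label, 0) + 1
        if sizes == previous_sizes:
            break
        previous_sizes = sizes
    # D5: group nodes by the labels they hold; a node may appear in several topics.
    topics = {}
    for v, memberships in labels.items():
        for label in memberships:
            topics.setdefault(label, set()).add(v)
    return topics


graph = {
    "university": {"fees": 0.6, "ranking": 0.6},
    "fees": {"university": 0.6, "tuition": 0.8},
    "tuition": {"fees": 0.8},
    "ranking": {"university": 0.6},
    "weather": {},
}
topics = propagate_labels(graph, p=2)
```

The result depends on the random visiting order, so the fixed seed only makes a run repeatable; it does not make the partition canonical. Selecting the M most vital hot words per topic would be a final pass over `topics` using the vitality values from step B.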
The micro-blog hot topic detection system and method of the invention take into account both the frequency of occurrence and the burstiness of words. A novel word vitality computation model is designed to extract hot words, a word co-occurrence network is then built, and hot words are clustered by multi-label propagation with near-linear time complexity to obtain hot topics. In summary, the system and method extract hot words and hot topics effectively and substantially improve both the precision and the time efficiency of hot topic detection.
The above are preferred embodiments of the invention. Any change made according to the technical scheme of the invention whose resulting function does not go beyond the scope of that technical scheme falls within the protection scope of the invention.
Claims (8)
1. A micro-blog hot word and hot topic mining system, characterized in that the system comprises:
a preprocessing module, which preprocesses the content data published in the social network, obtains candidate hot words, and builds the candidate hot word set from them;
a hot word screening module, which computes the vitality of each candidate hot word in the candidate hot word set from its frequency of occurrence and burstiness at the current moment and within a given historical time window, screens out the hot words, and builds the hot word set from them;
a hot word co-occurrence network construction module, which computes the correlation between the hot words in the hot word set and constructs the hot word co-occurrence network from it;
a hot word clustering module, which partitions the hot word set over the hot word co-occurrence network using a hot word clustering algorithm based on multi-label propagation, and obtains the hot topic set.
2. A micro-blog hot word and hot topic mining method, characterized in that the method comprises the following steps:
Step A: preprocess the content data published in the social network, obtain candidate hot words, and build the candidate hot word set from them;
Step B: compute the vitality of each candidate hot word in the candidate hot word set from its frequency of occurrence and burstiness at the current moment and within a given historical time window, screen out the hot words, and build the hot word set from them;
Step C: compute the correlation between the hot words in the hot word set and construct the hot word co-occurrence network from it;
Step D: partition the hot word set over the hot word co-occurrence network using a hot word clustering algorithm based on multi-label propagation, and obtain the hot topic set.
3. the hot word of a kind of microblogging according to claim 2 and much-talked-about topic method for digging, is characterized in that, in described step B, screens hot word and build the process of hot set of words, specifically comprises the following steps:
Step B1: calculate in the time period
tin, the nutritive value of the hot word of each candidate; The hot word of candidate
wnutritive value
nutr w,
t for in the time period
tin, microblogging set
tw t in every microblogging to the hot word of candidate
wthe contribution sum of nutritive value, computing formula is:
Wherein,
contr w,
j for in the time period
tin, the
jbar microblogging is to the hot word of candidate
wthe contribution of nutritive value,
j∈
tw t , computing formula is:
Wherein,
represent the
jin bar microblogging, there is the hot word of candidate
wnumber of times,
represent the
jmaximum word frequency in bar microblogging;
Step B2: use the burst value of candidate hot word $w$ to describe how sharply the term frequency of $w$ changes between the current time period and the historical time periods. The burst value $b_{w,t}$ of candidate hot word $w$ is computed as follows: take the $k$ historical time windows preceding time period $t$, each of the same size as time period $t$; then, based on a discrete event model with a binomial distribution, count the number of microblogs containing candidate hot word $w$ in time period $t$ and in the $k$ historical time windows preceding $t$, respectively; and apply a statistical formula (not reproduced in this text) to compute the burst value of $w$ within time period $t$, where:
$a$ denotes the number of microblogs in time period $t$ that contain candidate hot word $w$;
$b$ denotes the average number of microblogs in the $k$ historical time windows that contain candidate hot word $w$;
$c$ denotes the number of microblogs in time period $t$ that do not contain candidate hot word $w$;
$d$ denotes the average number of microblogs in the $k$ historical time windows that do not contain candidate hot word $w$;
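The four counts in step B2 form a 2x2 contingency table (contains $w$ or not, current period or history). The sketch below computes them and then applies Pearson's chi-square statistic for a 2x2 table. The choice of chi-square is an assumption suggested by the shape of the $a$, $b$, $c$, $d$ definitions; the source does not show which statistic is used. Function names and sample data are hypothetical.

```python
def burst_counts(current, history, word):
    """Return (a, b, c, d) as defined in step B2.

    current: list of tokenized microblogs in period t.
    history: list of k windows, each a list of tokenized microblogs.
    """
    k = len(history)
    a = sum(1 for m in current if word in m)
    c = len(current) - a
    # b and d are averages over the k historical windows.
    b = sum(sum(1 for m in win if word in m) for win in history) / k
    d = sum(sum(1 for m in win if word not in m) for win in history) / k
    return a, b, c, d


def chi_square_2x2(a, b, c, d):
    """ASSUMPTION: burst value taken as Pearson's chi-square for [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


# Hypothetical data: the word appears in 3 of 4 current microblogs,
# but in only 1 of 4 microblogs in each of k = 2 historical windows.
current = [["flood"], ["flood", "rain"], ["flood"], ["weather"]]
history = [
    [["flood"], ["news"], ["sport"], ["music"]],
    [["rain"], ["flood"], ["sport"], ["film"]],
]
counts = burst_counts(current, history, "flood")
burst = chi_square_2x2(*counts)
```

A word whose share of microblogs is the same in the current period and in history gives $ad = bc$ and therefore a burst value of zero under this assumed statistic.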
Step B3: combine the nutritive value and the burst value of each candidate hot word to compute its vitality value. The normalized vitality value $life_{w,t}$ of candidate hot word $w$ is computed by a formula (not reproduced in this text) in which $terms$ denotes the candidate hot word set and $w'$ denotes an element of $terms$;
Step B4: sort the candidate hot words in the candidate hot word set by vitality value, select the top $L$ candidate hot words as hot words, and form the hot word set from them.
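The selection in step B4 can be sketched as below. The vitality values are taken as given inputs, because the formula that combines nutritive value and burst value is not reproduced in this text. The function name, the alphabetical tie-break, and the sample scores are assumptions made for illustration.

```python
def select_hot_words(vitality, top_l):
    """Return the top_l candidate hot words ranked by vitality value.

    vitality: dict mapping candidate hot word -> vitality value.
    Ties are broken alphabetically (an assumption; the source does not say).
    """
    ranked = sorted(vitality.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_l]]


# Hypothetical normalized vitality values.
vitality = {"flood": 0.5, "rain": 0.3, "weather": 0.1, "rescue": 0.1}
hot_words = select_hot_words(vitality, 3)
```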
4. The microblog hot word and hot topic mining method according to claim 2, characterized in that, in step C, the correlation $c_{z,k}$ between hot word $z$ and hot word $k$ within a given time period $t$ is defined by a formula (not reproduced in this text) over the following quantities: $r_{z,k}$ denotes the number of microblogs containing both hot word $z$ and hot word $k$; $n_z$ denotes the number of microblogs containing hot word $z$; $r_k$ denotes the number of microblogs containing hot word $k$; $n$ denotes the total number of microblogs in time period $t$, $n = |TW_t|$;
The hot word co-occurrence network is defined as $G(V, E, W)$, where $V$ is the node set, representing the hot word set obtained in step B, and $m$ denotes the number of nodes; $E$ denotes the set of edges between nodes: for any two nodes, if the words they represent co-occur, an edge is built between the two vertices; $W$ denotes a mapping from the edge set $E$ to the set of real numbers $R$: if there is an edge between $v_i$ and $v_j$, its weight is the similarity $sim(i, j)$ between the $i$-th hot word and the $j$-th hot word, which is defined by a formula not reproduced in this text.
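The network construction in claim 4 can be sketched as follows: count, for each hot word, the microblogs containing it, count the microblogs containing each pair of hot words, and add an edge wherever the pair co-occurs at least once. The correlation and similarity formulas are not reproduced in this text, so the default edge weight below is a Jaccard coefficient used purely as a stand-in assumption. All function and variable names are hypothetical.

```python
from itertools import combinations


def jaccard(pair_count, count_z, count_k, total):
    """ASSUMPTION: stand-in edge weight; the source's sim(i, j) is not shown."""
    return pair_count / (count_z + count_k - pair_count)


def build_cooccurrence_network(microblogs, hot_words, weight=jaccard):
    """Return (containing, edges) for the hot word co-occurrence network.

    containing: hot word -> number of microblogs that contain it.
    edges: (word_a, word_b) with word_a < word_b -> edge weight.
    """
    hot = set(hot_words)
    containing = {word: 0 for word in hot}
    together = {}
    for tokens in microblogs:
        present = sorted(hot & set(tokens))
        for word in present:
            containing[word] += 1
        for z, k in combinations(present, 2):
            together[(z, k)] = together.get((z, k), 0) + 1
    total = len(microblogs)
    edges = {
        (z, k): weight(count, containing[z], containing[k], total)
        for (z, k), count in together.items()
    }
    return containing, edges


# Hypothetical tokenized microblogs and hot word set.
microblogs = [
    ["flood", "rain"],
    ["flood", "rescue"],
    ["flood", "rain", "rescue"],
    ["weather"],
]
containing, edges = build_cooccurrence_network(
    microblogs, ["flood", "rain", "rescue"]
)
```

Keeping the weight as a parameter means the counts stay reusable once the actual correlation formula is known.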
5. The microblog hot word and hot topic mining method according to claim 4, characterized in that, in step D, each hot word in the hot word set, i.e. each node, has a set of label membership degrees, and the label membership degree set of each node is updated in every iteration until the algorithm converges, specifically comprising the following steps:
Step D1: initialize the node labels according to the hot word co-occurrence network;
Step D2: randomly select a node $v$ whose labels have not yet been updated, traverse the neighbor nodes of $v$, update the membership degree of each label in the label set of $v$ according to the label sets of the neighbor nodes, and normalize the label membership degrees of $v$;
Step D3: iterate until the iteration termination condition is met;
Step D4: classify the nodes according to the label membership degree sets obtained by iteration, obtaining the hot topic set.
6. The microblog hot word and hot topic mining method according to claim 5, characterized in that, in step D1, the label initialization method is: assign each node a unique label number, each node belonging to its own label number with an initial membership degree (the value is not reproduced in this text); the set of these unique label numbers is denoted $uniqueLabels$.
7. The microblog hot word and hot topic mining method according to claim 6, characterized in that, in step D2, the update rule for label membership degrees is: randomly select a node $v$ whose labels have not yet been updated, obtain the neighbor node set $nb(v)$ of that node, and from it obtain the label set $labels$ held by the neighbor nodes; at the $h$-th iteration, the membership degree with which node $v$ belongs to a given label number is computed by a formula that is not reproduced in this text.
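Steps D1 to D4 of claims 5 to 7 can be sketched as follows. Several details are assumptions, because the corresponding formulas are missing from this text: the initial membership degree is taken as 1.0, the update is a weight-averaged sum of the neighbors' membership degrees, termination is a fixed number of iterations, and each node is finally assigned to its highest-membership label. Function names and the sample network are hypothetical, and hot words without any edge are not part of the network here.

```python
import random


def propagate_labels(edges, iterations=20, seed=0):
    """Multi-label propagation sketch over a weighted, undirected network.

    edges: dict mapping (node_a, node_b) -> positive edge weight.
    Returns (membership, topics).
    """
    rng = random.Random(seed)
    neighbours = {}
    for (u, v), w in edges.items():
        neighbours.setdefault(u, {})[v] = w
        neighbours.setdefault(v, {})[u] = w

    # D1: each node starts with its own unique label (ASSUMED degree 1.0).
    membership = {node: {node: 1.0} for node in neighbours}

    # D3: ASSUMED termination rule, a fixed number of iterations.
    for _ in range(iterations):
        order = list(neighbours)
        rng.shuffle(order)  # D2: visit not-yet-updated nodes in random order
        for node in order:
            scores = {}
            for nb, w in neighbours[node].items():
                for label, degree in membership[nb].items():
                    scores[label] = scores.get(label, 0.0) + w * degree
            total = sum(scores.values())
            if total > 0:
                # Normalize so the node's membership degrees sum to 1.
                membership[node] = {lb: s / total for lb, s in scores.items()}

    # D4: ASSUMED assignment, each node joins its strongest label.
    topics = {}
    for node, degrees in membership.items():
        best = max(sorted(degrees), key=degrees.get)
        topics.setdefault(best, set()).add(node)
    return membership, topics


# Hypothetical network: two disconnected triangles of hot words.
example_edges = {
    ("a", "b"): 1.0, ("b", "c"): 1.0, ("a", "c"): 1.0,
    ("x", "y"): 1.0, ("y", "z"): 1.0, ("x", "z"): 1.0,
}
membership, topics = propagate_labels(example_edges)
```

Because labels only travel along edges, hot words in disconnected parts of the network can never end up in the same topic, whatever the update rule.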
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310725400.4A CN103678670B (en) | 2013-12-25 | 2013-12-25 | Micro-blog hot word and hot topic mining system and method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103678670A (en) | 2014-03-26 |
CN103678670B (en) | 2017-01-11 |
Family
ID=50316214
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310725400.4A Expired - Fee Related CN103678670B (en) | 2013-12-25 | 2013-12-25 | Micro-blog hot word and hot topic mining system and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103678670B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CA2700629A1 (en) * | 2010-05-13 | 2011-11-13 | Gerard Voon | Shopping enabler |
CN102346766A (en) * | 2011-09-20 | 2012-02-08 | 北京邮电大学 | Method and device for detecting network hot topics found based on maximal clique |
CN103294818A (en) * | 2013-06-12 | 2013-09-11 | 北京航空航天大学 | Multi-information fusion microblog hot topic detection method |
Non-Patent Citations (1)
Title |
---|
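Long Zhiyi (龙志祎) et al.: "Hot topic detection algorithm based on word clustering" (基于词聚类的热点话题检测算法), Computer Engineering and Design (《计算机工程与设计》), vol. 32, no. 6, 30 June 2011 (2011-06-30), pages 2214 - 2217 * |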
Cited By (39)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104063428A (en) * | 2014-06-09 | 2014-09-24 | 国家计算机网络与信息安全管理中心 | Method for detecting unexpected hot topics in Chinese microblogs |
CN104156436B (en) * | 2014-08-13 | 2017-05-10 | 福州大学 | Social association cloud media collaborative filtering and recommending method |
CN104156436A (en) * | 2014-08-13 | 2014-11-19 | 福州大学 | Social association cloud media collaborative filtering and recommending method |
CN105095988A (en) * | 2015-07-01 | 2015-11-25 | 中国科学院计算技术研究所 | Method and system for detecting social network information explosion |
CN106610989B (en) * | 2015-10-22 | 2021-06-01 | 北京国双科技有限公司 | Search keyword clustering method and device |
CN106610989A (en) * | 2015-10-22 | 2017-05-03 | 北京国双科技有限公司 | Search keyword clustering method and apparatus |
CN105488196B (en) * | 2015-12-07 | 2019-01-22 | 中国人民大学 | Automatic hot topic mining system based on internet corpora |
CN105488196A (en) * | 2015-12-07 | 2016-04-13 | 中国人民大学 | Automatic hot topic mining system based on internet corpora |
CN106919627A (en) * | 2015-12-28 | 2017-07-04 | 北京国双科技有限公司 | Method and apparatus for processing hot words |
CN106446191B (en) * | 2016-09-30 | 2019-11-05 | 浙江工业大学 | Logistic regression based multi-feature network popular tag prediction method |
CN106446191A (en) * | 2016-09-30 | 2017-02-22 | 浙江工业大学 | Logistic regression based multi-feature network popular tag prediction method |
CN108170693A (en) * | 2016-12-07 | 2018-06-15 | 北京国双科技有限公司 | Push the method and device of hot word |
CN108170693B (en) * | 2016-12-07 | 2020-07-31 | 北京国双科技有限公司 | Hot word pushing method and device |
CN108182191A (en) * | 2016-12-08 | 2018-06-19 | 腾讯科技(深圳)有限公司 | Hotspot data processing method and device |
CN108182191B (en) * | 2016-12-08 | 2022-01-18 | 腾讯科技(深圳)有限公司 | Hotspot data processing method and device |
CN108241611A (en) * | 2016-12-26 | 2018-07-03 | 北京国双科技有限公司 | Keyword extraction method and extraction equipment |
CN108241611B (en) * | 2016-12-26 | 2021-08-17 | 北京国双科技有限公司 | Keyword extraction method and extraction equipment |
CN108804432A (en) * | 2017-04-26 | 2018-11-13 | 慧科讯业有限公司 | Method, system and device for discovering and tracking hot topics based on network media data streams |
CN107122478A (en) * | 2017-05-03 | 2017-09-01 | 成都云数未来信息科学有限公司 | Method for extracting hot topics based on keywords |
CN107122478B (en) * | 2017-05-03 | 2020-05-08 | 成都云数未来信息科学有限公司 | Method for extracting hot topics based on keywords |
CN108304371A (en) * | 2017-07-14 | 2018-07-20 | 腾讯科技(深圳)有限公司 | Method and device for mining hot content, computer equipment and storage medium |
CN108304371B (en) * | 2017-07-14 | 2021-07-13 | 腾讯科技(深圳)有限公司 | Method and device for mining hot content, computer equipment and storage medium |
CN109509110A (en) * | 2018-07-27 | 2019-03-22 | 福州大学 | Microblog hot topic discovery method based on improved BBTM model |
CN109509110B (en) * | 2018-07-27 | 2021-08-31 | 福州大学 | Microblog hot topic discovery method based on improved BBTM model |
CN110377823A (en) * | 2019-06-28 | 2019-10-25 | 厦门美域中央信息科技有限公司 | Construction of a hot spot mining system under the Hadoop framework |
CN110765239A (en) * | 2019-10-29 | 2020-02-07 | 腾讯科技(深圳)有限公司 | Hot word recognition method, device and storage medium |
CN110765239B (en) * | 2019-10-29 | 2023-03-28 | 腾讯科技(深圳)有限公司 | Hot word recognition method, device and storage medium |
CN111125484A (en) * | 2019-12-17 | 2020-05-08 | 网易(杭州)网络有限公司 | Topic discovery method and system and electronic device |
CN111125484B (en) * | 2019-12-17 | 2023-06-30 | 网易(杭州)网络有限公司 | Topic discovery method, topic discovery system and electronic equipment |
CN112668836A (en) * | 2020-12-07 | 2021-04-16 | 数据地平线(广州)科技有限公司 | Risk graph-oriented associated risk evidence efficient mining and monitoring method and device |
CN112668836B (en) * | 2020-12-07 | 2024-04-05 | 数据地平线(广州)科技有限公司 | Risk spectrum-oriented associated risk evidence efficient mining and monitoring method and apparatus |
CN113673224B (en) * | 2021-08-19 | 2022-04-05 | 北京三快在线科技有限公司 | Method and device for recognizing popular vocabulary, computer equipment and readable storage medium |
CN113673224A (en) * | 2021-08-19 | 2021-11-19 | 北京三快在线科技有限公司 | Method and device for recognizing popular vocabulary, computer equipment and readable storage medium |
CN113836307A (en) * | 2021-10-15 | 2021-12-24 | 国网北京市电力公司 | Power supply service work order hotspot discovery method, system and device and storage medium |
CN113836307B (en) * | 2021-10-15 | 2024-02-20 | 国网北京市电力公司 | Power supply service work order hot spot discovery method, system, device and storage medium |
CN114938477A (en) * | 2022-06-23 | 2022-08-23 | 阿里巴巴(中国)有限公司 | Video topic determination method, device and equipment |
CN114938477B (en) * | 2022-06-23 | 2024-05-03 | 阿里巴巴(中国)有限公司 | Video topic determination method, device and equipment |
CN117076963A (en) * | 2023-10-17 | 2023-11-17 | 北京国科众安科技有限公司 | Information heat analysis method based on big data platform |
CN117076963B (en) * | 2023-10-17 | 2024-01-02 | 北京国科众安科技有限公司 | Information heat analysis method based on big data platform |
Also Published As
Publication number | Publication date |
---|---|
CN103678670B (en) | 2017-01-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103678670A (en) | Micro-blog hot word and hot topic mining system and method | |
CN103745000B (en) | Hot topic detection method of Chinese micro-blogs | |
CN103617169B (en) | Hot microblog topic extraction method based on Hadoop | |
Li et al. | Filtering out the noise in short text topic modeling | |
Hai et al. | Identifying features in opinion mining via intrinsic and extrinsic domain relevance | |
CN106156286B (en) | Type extraction system and method for knowledge entities in technical literature | |
CN101593200B (en) | Method for classifying Chinese webpages based on keyword frequency analysis | |
CN103514183B (en) | Information search method and system based on interactive document clustering | |
CN111143576A (en) | Event-oriented dynamic knowledge graph construction method and device | |
WO2020108430A1 (en) | Weibo sentiment analysis method and system | |
CN103605665A (en) | Keyword based evaluation expert intelligent search and recommendation method | |
CN104699766A (en) | Implicit attribute mining method integrating word correlation and context deduction | |
CN104484343A (en) | Topic detection and tracking method for microblog | |
CN104978332B (en) | User-generated content label data generation method, device and correlation technique and device | |
CN107291886A (en) | A kind of microblog topic detecting method and system based on incremental clustering algorithm | |
CN109214454B (en) | Microblog-oriented emotion community classification method | |
CN111914087A (en) | Public opinion analysis method | |
CN104281565A (en) | Semantic dictionary constructing method and device | |
CN107203513A (en) | Microblogging text data fine granularity topic evolution analysis method based on probabilistic model | |
CN110929683B (en) | Video public opinion monitoring method and system based on artificial intelligence | |
CN103488637A (en) | Method for carrying out expert search based on dynamic community mining | |
CN105117466A (en) | Internet information screening system and method | |
CN102063497A (en) | Open type knowledge sharing platform and entry processing method thereof | |
Campbell et al. | Content+ context networks for user classification in twitter | |
Lim et al. | ClaimFinder: A Framework for Identifying Claims in Microblogs. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20170111; Termination date: 20191225 |