CN103678670B - Micro-blog hot word and hot topic mining system and method - Google Patents

Micro-blog hot word and hot topic mining system and method

Info

Publication number
CN103678670B
Authority
CN
China
Prior art keywords
hot
candidate
hot word
word
hotword
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201310725400.4A
Other languages
Chinese (zh)
Other versions
CN103678670A (en)
Inventor
陈羽中
郭文忠
陈国龙
方明月
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University
Priority to CN201310725400.4A
Publication of CN103678670A
Application granted
Publication of CN103678670B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of social networks, and in particular to a microblog hot word and hot topic mining system and method. The method comprises the following steps: content data published on a microblog is preprocessed to obtain a candidate hot word set; the vitality of each candidate hot word is calculated from its occurrence frequency and burstiness at the current moment and within a given historical time window, and the candidate hot words are screened to form a hot word set; the correlation between the hot words in the hot word set is calculated, and a hot word co-occurrence network is constructed; and, based on the hot word co-occurrence network, the hot word set is partitioned by a hot word clustering algorithm based on multi-label propagation to obtain a hot topic set. The system and method achieve efficient mining of microblog hot words and hot topics and improve mining precision and processing efficiency.

Description

Microblog hotword and hot topic mining system and method
Technical Field
The invention relates to the technical field of social networks, in particular to a microblog hotword and hot topic mining system and method.
Background
With the rise of microblogs, public participation keeps growing: users can post news they have witnessed at any time and place through computers and mobile phones and share it instantly. Microblogs have become an internet fashion and an important place where hot topics arise and are discussed; a hot topic is a topic that appears frequently on the network within a period of time and attracts wide attention and discussion. Because microblog information grows exponentially, effectively managing this mass of information and extracting hot topics from it has become an urgent problem.
For hot topic detection, the traditional method is to cluster texts. This does not help users identify hot topics at a glance, and because microblogs are short texts whose data are sparse and unevenly distributed, the method performs poorly at finding hot topics. The mainstream method is therefore to discover hot topics by extracting hot words and clustering them.
The classical methods for weighing word importance and extracting hot words include TFIDF and TFPDF. The main idea of TFIDF is that occurrence frequency alone does not sufficiently represent text features: words such as "yes" and "mare" occur frequently but have little ability to represent a text. However, TFIDF is not suitable for weighting words in microblogs. Microblogs are short texts, so a word is rarely repeated within a single microblog; and once a hot topic appears, users widely forward and discuss it, so a large number of microblogs contain the same keywords. If TFIDF is used to extract keywords, important words are therefore lost to some extent. Scholars have accordingly proposed TFPDF, which gives higher weight to words that appear in most documents in order to extract hot words. TFPDF helps extract keywords related to hot topics, but it also extracts words that appear frequently yet cannot express a topic. Hot words are words whose frequency increases sharply within a period of time; neither method considers how words are distributed over time, so neither is well suited to hot word extraction.
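The objection to TFIDF can be illustrated with a small sketch. It assumes the common logarithmic form idf = log(N / df), which the text does not specify, and the counts are made up for illustration: a keyword that a hot topic has pushed into many microblogs receives a much lower inverse document frequency than an incidental rare word.

```python
import math


def idf(num_docs: int, docs_with_word: int) -> float:
    """Inverse document frequency in its common form log(N / df) (assumed variant)."""
    return math.log(num_docs / docs_with_word)


# Hypothetical counts for one time window of 1000 microblogs.
hot_keyword_idf = idf(1000, 400)  # keyword of a widely forwarded hot topic
rare_word_idf = idf(1000, 5)      # incidental word seen in only a few posts

# With a term frequency of 1 in a short post, the TFIDF weight equals the IDF,
# so the hot keyword is ranked below the incidental word.
print(hot_keyword_idf < rare_word_idf)
```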
For hot word clustering, the existing methods include: 1) the bisecting K-means clustering algorithm, which is insensitive to the initial clusters; 2) constructing a word similarity matrix and clustering with the Affinity Propagation algorithm, which does not require the number of clusters to be specified, and the time complexity is close to that of the clustering; 3) density-based clustering algorithms such as DBSCAN; 4) hierarchical clustering algorithms, and the like.
For hot spot discovery in massive microblog data, the existing hot word clustering methods have the following main problems. First, the words related to different topics in the clustering result are not allowed to intersect, which does not match the actual situation, so some topics are not found or are very hard to recognize. For example, for the two topics "expense problem in colleges and universities" and "leaderboard in colleges and universities", the term "colleges and universities" can belong to at most one topic, so one of the two topics lacks the keyword "colleges and universities" and the original topic is difficult to identify. In addition, traditional clustering algorithms have high time complexity and are difficult to adapt to the clustering of massive microblog data.
In summary, relatively mature techniques exist for analyzing the influence of individual users in a social network, but few methods analyze influence at the level of communities; comprehensive analysis and evaluation of the influence of each community is lacking, and existing methods struggle to meet requirements on analysis quality and efficiency in large-scale social network scenarios.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a microblog hot word and hot topic mining system and method, which are beneficial to improving the accuracy and processing efficiency of microblog hot spot discovery.
In order to achieve this purpose, the technical scheme of the invention is as follows: a microblog hotword and hot topic mining system, comprising a preprocessing module, a hot word screening module, a hot word co-occurrence network construction module and a hot word clustering module;
the preprocessing module is used for preprocessing content data published in the social network to obtain candidate hot words and construct a candidate hot word set;
the hot word screening module is used for calculating the vitality of each candidate hot word according to the occurrence frequency and the burstiness of each candidate hot word in the candidate hot word set in the current time and a given historical time window, screening out the hot words and constructing a hot word set according to the vitality of each candidate hot word;
the hot word co-occurrence network construction module is used for calculating the correlation of each hot word in the hot word set and constructing a hot word co-occurrence network;
and the hot word clustering module is used for dividing the hot word set by using a hot word clustering algorithm based on multi-label propagation according to the hot word co-occurrence network to obtain a hot topic set.
The invention also provides a method for mining microblog hotwords and hot topics, which comprises the following steps:
step A: preprocessing content data published in a social network to obtain candidate hot words, and constructing a candidate hot word set according to the candidate hot words;
step B: calculating the vitality of each candidate hot word according to the occurrence frequency and the burstiness of each candidate hot word in the candidate hot word set at the current moment and in a given historical time window, screening out the hot words, and constructing a hot word set according to the vitality of each candidate hot word;
step C: calculating the correlation of each hot word in the hot word set, and constructing a hot word co-occurrence network according to the correlation;
step D: according to the hot word co-occurrence network, dividing the hot word set by using a hot word clustering algorithm based on multi-label propagation to obtain a hot topic set.
Further, in the step B, the process of screening hot words and constructing a hot word set specifically includes the following steps:
step B1: calculating the nutritional value of each candidate hot word in the time period $t$; the nutritional value $Nutr_{w,t}$ of a candidate hot word $w$ is the sum of the contributions of each microblog in the microblog set $tw_t$ of the time period $t$ to the nutritional value of $w$, calculated as follows:
$$Nutr_{w,t} = \sum_{j \in tw_t} Contr_{w,j}$$
wherein $Contr_{w,j}$ is the contribution of the $j$-th microblog to the nutritional value of the candidate hot word $w$ in the time period $t$, $j \in tw_t$, calculated as follows:
$$Contr_{w,j} = \frac{tf_{w,j}}{tf_j^{max}}$$
wherein $tf_{w,j}$ denotes the number of times the candidate hot word $w$ appears in the $j$-th microblog, and $tf_j^{max}$ denotes the maximum word frequency in the $j$-th microblog;
step B2: using the burst value of the candidate hot word $w$ to describe how sharply the word frequency of $w$ changes between the current time period and the historical time periods; the burst value $B_{w,t}$ of the candidate hot word $w$ is calculated as follows: take the $k$ historical time windows before the time period $t$, each of the same size as the time period $t$; then, based on a binomial-distribution discrete event model, count the number of microblogs containing the candidate hot word $w$ in the time period $t$ and in the $k$ historical time windows before $t$, and calculate the burst value of $w$ in the time period $t$ with the $\chi^2$ statistic formula:
$$B_{w,t} = \frac{(A+B+C+D)(AD-BC)^2}{(A+B)(C+D)(A+C)(B+D)}$$
wherein $A$ denotes the number of microblogs containing the candidate hot word $w$ in the time period $t$; $B$ denotes the average number of microblogs containing $w$ in the $k$ historical time windows; $C$ denotes the number of microblogs not containing $w$ in the time period $t$; $D$ denotes the average number of microblogs not containing $w$ in the $k$ historical time windows;
step B3: calculating the vitality value of each candidate hot word by combining its nutritional value and its burst value; the normalized vitality value $life_{w,t}$ of the candidate hot word $w$ is calculated as follows:
$$life_{w,t} = \frac{B_{w,t} \cdot Nutr_{w,t}}{\max_{w' \in terms}\left(B_{w',t} \cdot Nutr_{w',t}\right)}$$
wherein $terms$ denotes the candidate hot word set, and $w'$ denotes an element of the candidate hot word set $terms$;
step B4: sorting the candidate hot words in the candidate hot word set according to their vitality values, screening the top L candidate hot words as hot words, and forming the hot word set from these L candidate hot words.
Further, in the step C, the correlation $c_{z,k}$ between hot word $z$ and hot word $k$ within a given time period $t$ is defined as:
wherein $r_{z,k}$ denotes the number of microblogs containing both hot word $z$ and hot word $k$, $n_z$ denotes the number of microblogs containing hot word $z$, $R_k$ denotes the number of microblogs containing hot word $k$, and $N$ denotes the number of microblogs in the time period $t$, i.e. $N = |tw_t|$.
The hot word co-occurrence network is defined as $G(V, E, W)$, wherein $V$ represents the node set, i.e. the hot word set obtained in the step B, and $m$ represents the number of nodes; $E$ represents the set of edges between nodes: for any two nodes $v_i$ and $v_j$, if the words they represent have a co-occurrence relationship, an edge is constructed between the two vertices; $W$ represents a mapping from the edge set $E$ to the set of real numbers $R$: if there is an edge between $v_i$ and $v_j$, the edge weight is the similarity $sim(i,j)$ between the $i$-th hot word and the $j$-th hot word, defined as:
Further, in the step D, each hot word in the hot word set, i.e. each node, has a label membership set, and the label membership set of the node is updated in each iteration until the algorithm converges, specifically including the following steps:
step D1: initializing the labels of the nodes according to the hot word co-occurrence network;
step D2: randomly selecting a node $v$ whose labels have not yet been updated, traversing the neighbor nodes of $v$, updating the membership degree of each label in the label set of $v$ according to the label sets of the neighbor nodes, and normalizing the label membership degrees of $v$;
step D3: repeating the iteration until an iteration termination condition is met;
step D4: classifying the nodes according to the label membership sets obtained by iteration to obtain a hot topic set.
Further, in step D1, the method for initializing the labels is: assigning a unique label number to each node and assigning the node's membership degree for that label, wherein the unique label number belongs to $uniqueLabels$.
Further, in step D2, the update rule of the label membership degree is: randomly select a node $v$ whose labels have not yet been updated, obtain its neighbor node set $Nb(v)$, and further obtain the label set $labels$ owned by the neighbor nodes; then, at the $h$-th iteration, the membership degree of node $v$ in a given label number is:
wherein $sim(u,v)$ represents the similarity between node $u$ and node $v$, and the denominator normalizes the label membership degrees so that the label membership degrees of node $v$ sum to 1.
Further, in the step D3, the iteration termination condition is:
wherein $r_h$ is defined as:
When this condition holds, the iteration ends.
Compared with the prior art, the invention has the beneficial effects that: calculating the vitality of each candidate hot word according to the occurrence frequency and the burstiness of the candidate hot words in the current time and a given historical time window, screening out a hot word set, calculating the hot word correlation according to the screened hot word set, constructing a hot word co-occurrence network, and dividing the hot word set by using a hot word clustering algorithm of multi-label propagation to obtain a hot topic set. The system and the method can realize the efficient mining of the hot topics of the social network, and improve the topic detection precision and the processing efficiency.
Drawings
FIG. 1 is a schematic block diagram of the system of the present invention.
FIG. 2 is a flow chart of the method of the present invention.
FIG. 3 is a flow chart of implementation of microblog hotword clustering in the method of the present invention.
Detailed Description
The invention is further described with reference to the following figures and specific embodiments.
FIG. 1 is a schematic diagram of the module structure of the microblog hotword and hot topic mining system. As shown in FIG. 1, the system comprises a preprocessing module 100, a hot word screening module 200, a hot word co-occurrence network construction module 300 and a hot word clustering module 400.
The preprocessing module 100 is configured to preprocess content data published in a social network to obtain candidate hot words, and to construct a candidate hot word set from them; the hot word screening module 200 is configured to calculate the vitality of each candidate hot word according to its occurrence frequency and burstiness at the current time and in a given historical time window, screen out the hot words, and construct a hot word set according to the vitality of each candidate hot word; the hot word co-occurrence network construction module 300 is configured to calculate the correlation of each hot word in the hot word set and construct a hot word co-occurrence network based on the correlation; the hot word clustering module 400 is configured to divide the hot word set by using a hot word clustering algorithm based on multi-label propagation according to the hot word co-occurrence network, so as to obtain a hot topic set.
FIG. 2 is a flowchart of a microblog hotword and hot topic mining method according to the invention. As shown in fig. 2, the method comprises the steps of:
step A: and preprocessing content data published in the social network to obtain candidate hot words, and constructing a candidate hot word set according to the candidate hot words.
Specifically, ICTCLAS, the word segmentation system of the Chinese Academy of Sciences, can be used for word segmentation and part-of-speech tagging; nouns and verbs, which express topics strongly, are extracted and then further filtered with a stop word list to obtain the candidate hot word set, wherein $r$ denotes the number of candidate words.
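The filtering in step A can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the segmenter has already produced (word, tag) pairs, that noun tags start with "n" and verb tags with "v", and the function names `candidate_hot_words` and `candidate_set` are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag); this tagger output format is assumed


def candidate_hot_words(tagged_posts: Iterable[List[Token]],
                        stop_words: Set[str]) -> List[List[str]]:
    """Keep, post by post, the nouns and verbs that are not stop words."""
    kept_posts = []
    for tokens in tagged_posts:
        kept = [word for word, pos in tokens
                if pos[:1] in ("n", "v") and word not in stop_words]
        kept_posts.append(kept)
    return kept_posts


def candidate_set(kept_posts: Iterable[List[str]]) -> Set[str]:
    """Union of the surviving words over all posts: the candidate hot word set."""
    return {word for post in kept_posts for word in post}


# Hypothetical tagged input standing in for segmenter output.
posts = [[("college", "n"), ("of", "u"), ("charge", "v")],
         [("is", "v"), ("ranking", "n")]]
kept = candidate_hot_words(posts, stop_words={"is"})
print(kept, len(candidate_set(kept)))
```

Keeping the per-post word lists, rather than only the set, preserves the word frequencies that step B needs.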
Step B: calculating the vitality of each candidate hot word according to the occurrence frequency and the burstiness of each candidate hot word in the candidate hot word set at the current moment and in a given historical time window, screening out the hot words, and constructing the hot word set according to the vitality of each candidate hot word.
In the step B, the process of screening hot words and constructing a hot word set specifically includes the following steps:
step B1: calculating the nutritional value of each candidate hot word in the time period $t$; the nutritional value $Nutr_{w,t}$ of a candidate hot word $w$ is the sum of the contributions of each microblog in the microblog set $tw_t$ of the time period $t$ to the nutritional value of $w$, calculated as follows:
$$Nutr_{w,t} = \sum_{j \in tw_t} Contr_{w,j}$$
wherein $Contr_{w,j}$ is the contribution of the $j$-th microblog to the nutritional value of the candidate hot word $w$ in the time period $t$, $j \in tw_t$, calculated as follows:
$$Contr_{w,j} = \frac{tf_{w,j}}{tf_j^{max}}$$
wherein $tf_{w,j}$ denotes the number of times the candidate hot word $w$ appears in the $j$-th microblog, and $tf_j^{max}$ denotes the maximum word frequency in the $j$-th microblog;
step B2: using the burst value of the candidate hot word $w$ to describe how sharply the word frequency of $w$ changes between the current time period and the historical time periods; the burst value $B_{w,t}$ of the candidate hot word $w$ is calculated as follows: take the $k$ historical time windows before the time period $t$, each of the same size as the time period $t$; then, based on a binomial-distribution discrete event model, count the number of microblogs containing the candidate hot word $w$ in the time period $t$ and in the $k$ historical time windows before $t$, and calculate the burst value of $w$ in the time period $t$ with the $\chi^2$ statistic formula:
$$B_{w,t} = \frac{(A+B+C+D)(AD-BC)^2}{(A+B)(C+D)(A+C)(B+D)}$$
wherein $A$ denotes the number of microblogs containing the candidate hot word $w$ in the time period $t$; $B$ denotes the average number of microblogs containing $w$ in the $k$ historical time windows; $C$ denotes the number of microblogs not containing $w$ in the time period $t$; $D$ denotes the average number of microblogs not containing $w$ in the $k$ historical time windows;
step B3: calculating the vitality value of each candidate hot word by combining its nutritional value and its burst value; the normalized vitality value $life_{w,t}$ of the candidate hot word $w$ is calculated as follows:
$$life_{w,t} = \frac{B_{w,t} \cdot Nutr_{w,t}}{\max_{w' \in terms}\left(B_{w',t} \cdot Nutr_{w',t}\right)}$$
wherein $terms$ denotes the candidate hot word set, and $w'$ denotes an element of the candidate hot word set $terms$;
step B4: sorting the candidate hot words in the candidate hot word set according to their vitality values, screening the top L candidate hot words as hot words, and forming the hot word set from these L candidate hot words.
Specifically, after the vitality values of the candidate hot words are calculated, the candidate hot words can be sorted in descending order of vitality value using the Quick Sort algorithm, and, according to a given threshold M, the top M candidate hot words with the highest vitality values are selected as the hot words in the time period $t$.
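Steps B1 to B4 can be sketched in a few functions. This is a minimal illustration under stated assumptions: each microblog is given as the list of its candidate hot words, the function names are hypothetical, a degenerate contingency table (zero denominator) is assumed to yield a burst value of 0, and Python's built-in sort stands in for Quick Sort.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Post = List[str]  # one microblog as a list of candidate hot words (assumed format)


def nutrition(word: str, posts: Sequence[Post]) -> float:
    """B1: sum over posts of tf(word) / max tf in that post."""
    total = 0.0
    for post in posts:
        counts = Counter(post)
        if counts:
            total += counts[word] / max(counts.values())
    return total


def doc_count(word: str, posts: Sequence[Post]) -> int:
    return sum(1 for post in posts if word in post)


def burst(word: str, current: Sequence[Post],
          history: Sequence[Sequence[Post]]) -> float:
    """B2: chi-square statistic over the current window and k history windows."""
    if not history:
        raise ValueError("at least one historical window is required")
    k = len(history)
    a = doc_count(word, current)
    c = len(current) - a
    b = sum(doc_count(word, win) for win in history) / k
    d = sum(len(win) - doc_count(word, win) for win in history) / k
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # assumption: degenerate table counts as no burst
    return (a + b + c + d) * (a * d - b * c) ** 2 / denom


def top_hot_words(current: Sequence[Post], history: Sequence[Sequence[Post]],
                  top_l: int) -> List[Tuple[str, float]]:
    """B3 + B4: normalized vitality, then the top_l words by vitality."""
    words = {w for post in current for w in post}
    raw = {w: burst(w, current, history) * nutrition(w, current) for w in words}
    peak = max(raw.values(), default=0.0)
    life = {w: (v / peak if peak > 0 else 0.0) for w, v in raw.items()}
    ranked = sorted(life, key=life.get, reverse=True)
    return [(w, life[w]) for w in ranked[:top_l]]


current = [["quake", "rescue"], ["quake"], ["quake", "news"], ["weather"]]
history = [[["weather"], ["news"], ["weather", "news"], ["music"]]]
print(top_hot_words(current, history, top_l=2))
```

Because the statistic squares $AD - BC$, a word whose frequency drops sharply also receives a high burst value; multiplying by the nutritional value of the current window limits the effect of such words in this sketch.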
Step C: calculating the correlation of each hot word in the hot word set, and constructing a hot word co-occurrence network according to the correlation.
In the step C, the correlation $c_{z,k}$ between hot word $z$ and hot word $k$ within a given time period $t$ is defined as:
wherein $r_{z,k}$ denotes the number of microblogs containing both hot word $z$ and hot word $k$, $n_z$ denotes the number of microblogs containing hot word $z$, $R_k$ denotes the number of microblogs containing hot word $k$, and $N$ denotes the number of microblogs in the time period $t$, i.e. $N = |tw_t|$.
The hot word co-occurrence network is defined as $G(V, E, W)$, wherein $V$ represents the node set, i.e. the hot word set obtained in the step B, and $m$ represents the number of nodes; $E$ represents the set of edges between nodes: for any two nodes $v_i$ and $v_j$, if the words they represent have a co-occurrence relationship, an edge is constructed between the two vertices; $W$ represents a mapping from the edge set $E$ to the set of real numbers $R$: if there is an edge between $v_i$ and $v_j$, the edge weight is the similarity $sim(i,j)$ between the $i$-th hot word and the $j$-th hot word, defined as:
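Building the co-occurrence network can be sketched as below. The counting of co-occurrences and per-word microblog counts follows the definitions above, but the edge weight is an assumption: the similarity formula is not reproduced in this text, so the Jaccard coefficient is used purely as a stand-in. The function name `build_cooccurrence_network` is hypothetical.

```python
from itertools import combinations
from typing import Dict, Iterable, List, Sequence

Graph = Dict[str, Dict[str, float]]  # node -> {neighbor: edge weight}


def build_cooccurrence_network(hot_words: Iterable[str],
                               posts: Sequence[List[str]]) -> Graph:
    hot = set(hot_words)
    n = {w: 0 for w in hot}  # number of microblogs containing each hot word
    r = {}                   # number of microblogs containing both words of a pair
    for post in posts:
        present = sorted(hot & set(post))
        for w in present:
            n[w] += 1
        for z, k in combinations(present, 2):
            r[(z, k)] = r.get((z, k), 0) + 1

    graph: Graph = {w: {} for w in hot}
    for (z, k), both in r.items():
        # ASSUMED similarity: Jaccard coefficient, a stand-in for the
        # patent's sim(i, j), whose formula is missing from this text.
        weight = both / (n[z] + n[k] - both)
        graph[z][k] = weight
        graph[k][z] = weight
    return graph


posts = [["college", "fees"], ["college", "ranking"],
         ["college", "fees", "ranking"], ["music"]]
graph = build_cooccurrence_network(["college", "fees", "ranking"], posts)
print(graph["college"])
```

An edge exists only for pairs that co-occur in at least one microblog, which matches the co-occurrence condition stated for the edge set.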
Step D: according to the hot word co-occurrence network, dividing the hot word set by using a hot word clustering algorithm based on multi-label propagation to obtain a hot topic set.
The hot word clustering algorithm based on multi-label propagation rests on the following observations. A word co-occurrence network built from human language or text documents exhibits high clustering and short path lengths. A topic can therefore be regarded as a set of points (words) that are densely connected internally and sparsely linked externally, which matches the definition of a community in a complex network. Moreover, because topics may share keywords, the topic discovery problem can be converted into the problem of dividing a word co-occurrence network into overlapping word communities. "Multi-label" means that a node is allowed to hold several community labels and belong to several hot word communities, i.e. a hot word is allowed to belong to several topics. Each label carries a label membership degree; the labels and membership degrees of the nodes are updated during label propagation, the label set of each node is pruned according to a set threshold, and finally the nodes are divided into several communities (hot topics) according to the labels each node holds.
In the step D, each hot word in the hot word set, that is, each node has a label membership set, and the label membership set of the node is updated in each iteration until the algorithm converges. Fig. 3 is a flowchart of the implementation of step D in the method of the present invention, which specifically includes the following steps:
Step D1: initializing the labels of the nodes (hot words) according to the hot word co-occurrence network;
in step D1, the method for initializing the labels is: assigning a unique label number to each node and assigning the node's membership degree for that label, wherein the unique label number belongs to $uniqueLabels$.
Step D2: randomly selecting a node $v$ whose labels have not yet been updated, traversing the neighbor nodes of $v$, updating the membership degree of each label in the label set of $v$ according to the label sets of the neighbor nodes, and normalizing the label membership degrees of $v$;
in step D2, the update rule of the label membership degree is: randomly select a node $v$ whose labels have not yet been updated, obtain its neighbor node set $Nb(v)$, and further obtain the label set $labels$ owned by the neighbor nodes; then, at the $h$-th iteration, the membership degree of node $v$ in a given label number is:
wherein $sim(u,v)$ represents the similarity between node $u$ and node $v$, and the denominator normalizes the label membership degrees so that the label membership degrees of node $v$ sum to 1.
Step D3: according to a given threshold valuepTo nodevFiltering the label set, and then normalizing the membership value of the reserved label again;
specifically, step D3 requires a parameter to be givenpFiltering the label set of the node with the updated label membership degree in the iteration process, only reserving partial labels, preventing the label set of the node from being too large,pthe size of (2) represents the maximum number of labels that a node is allowed to have, and the specific filtering rule is as follows: the membership degree of the label membership set of the deleted node is lower than 1pOf (2) is used. And normalizing the label set obtained after filtering again to ensure that the sum of the membership degrees of all labels of the nodes is 1。
Step D4: repeating the iteration until an iteration termination condition is met;
in the step D4, the iteration termination condition is: provided that the label sets generated in two adjacent iterations are identical, the iteration ends if the number of nodes within each label in the history no longer changes, namely:
wherein $r_h$ is defined as:
When this condition holds, the iteration ends.
Step D5: and classifying the nodes (hot words) according to the label membership set of the nodes obtained by iteration to obtain a hot topic set.
Specifically, after the iteration ends, the label set of each node is examined and the nodes (hot words) are divided into the corresponding categories (communities); according to a given threshold M, each category (community) only needs the M hot words with the highest vitality values to express the corresponding hot topic. M defaults to 10.
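Steps D1 to D5 can be sketched as follows. This is an illustration under several assumptions, not the patented algorithm: the membership update formula is not reproduced in this text, so a similarity-weighted sum of the neighbors' memberships (then normalized) is assumed; the initial membership is assumed to be 1; nodes are updated asynchronously in random order; and the termination test is simplified to comparing per-label node counts between consecutive iterations. All function names are hypothetical.

```python
import random
from typing import Dict, List, Set

Graph = Dict[str, Dict[str, float]]        # node -> {neighbor: similarity}
Memberships = Dict[str, Dict[int, float]]  # node -> {label: membership degree}


def label_sizes(member: Memberships) -> Dict[int, int]:
    """Number of nodes currently holding each label."""
    sizes: Dict[int, int] = {}
    for labels in member.values():
        for label in labels:
            sizes[label] = sizes.get(label, 0) + 1
    return sizes


def propagate_labels(graph: Graph, p: int, max_iter: int = 100,
                     seed: int = 0) -> Memberships:
    rng = random.Random(seed)
    nodes = sorted(graph)
    # D1: one unique label per node; initial membership assumed to be 1.
    member: Memberships = {v: {i: 1.0} for i, v in enumerate(nodes)}
    prev_sizes = label_sizes(member)
    for _ in range(max_iter):
        order = nodes[:]
        rng.shuffle(order)
        for v in order:
            # D2 (assumed rule): similarity-weighted sum over neighbors.
            scores: Dict[int, float] = {}
            for u, sim in graph[v].items():
                for label, degree in member[u].items():
                    scores[label] = scores.get(label, 0.0) + sim * degree
            total = sum(scores.values())
            if total <= 0:
                continue  # isolated node keeps its own label
            scores = {l: s / total for l, s in scores.items()}
            # D3: drop labels below 1/p, keeping at least the strongest one.
            kept = {l: s for l, s in scores.items() if s >= 1.0 / p}
            if not kept:
                best = max(scores, key=scores.get)
                kept = {best: scores[best]}
            kept_total = sum(kept.values())
            member[v] = {l: s / kept_total for l, s in kept.items()}
        # D4 (simplified): stop when per-label node counts are unchanged.
        sizes = label_sizes(member)
        if sizes == prev_sizes:
            break
        prev_sizes = sizes
    return member


def topics_from(member: Memberships) -> List[Set[str]]:
    """D5: group nodes by label; a node with several labels joins several topics."""
    topics: Dict[int, Set[str]] = {}
    for v, labels in member.items():
        for label in labels:
            topics.setdefault(label, set()).add(v)
    return list(topics.values())


# Two disconnected triangles as a toy co-occurrence network.
tri = lambda a, b, c: {a: {b: 1.0, c: 1.0}, b: {a: 1.0, c: 1.0}, c: {a: 1.0, b: 1.0}}
graph = {**tri("a", "b", "c"), **tri("x", "y", "z")}
member = propagate_labels(graph, p=2)
print(topics_from(member))
```

Selecting the M hot words with the highest vitality values per topic would be a final sort over each returned set; it is omitted here to keep the sketch short.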
According to the microblog hot topic detection system and method, the occurrence frequency and the burst of words are comprehensively considered, a novel word life value calculation model is designed for hot word extraction, then a word co-occurrence network is constructed, and hot word clustering is carried out based on multi-label propagation close to linear time complexity to obtain the hot topic. In conclusion, the system and the method can effectively extract the hot words and the hot topics, and greatly improve the precision and the time efficiency of hot topic detection.
The above are preferred embodiments of the invention; any change made according to the technical scheme of the invention whose functional effect does not exceed the scope of that technical scheme falls within the protection scope of the invention.

Claims (6)

1. A microblog hotword and hot topic mining method is characterized by comprising the following steps:
step A: preprocessing content data published in a social network to obtain candidate hot words, and constructing a candidate hot word set according to the candidate hot words;
step B: calculating the vitality of each candidate hot word according to the occurrence frequency and the burstiness of each candidate hot word in the candidate hot word set at the current moment and in a given historical time window, screening out the hot words, and constructing a hot word set according to the vitality of each candidate hot word;
step C: calculating the correlation of each hot word in the hot word set, and constructing a hot word co-occurrence network according to the correlation;
step D: according to the hot word co-occurrence network, dividing the hot word set by using a hot word clustering algorithm based on multi-label propagation to obtain a hot topic set;
in the step B, the process of screening hot words and constructing a hot word set specifically includes the following steps:
step B1: calculating the nutritional value of each candidate hot word in the time period t; nutr for nutrition value of candidate hotword ww,tIn order to collect the microblogs tw in the time period ttThe sum of the contributions of each microblog to the nutrition value of the candidate hotword w is calculated by the following formula:
Nutr w , t = Σ j ∈ tw t Contr w , j
wherein, Contrw,jContribution of jth microblog to nutrition value of candidate hotword w in time period t, j ∈ twtThe calculation formula is as follows:
Contr w , j = tf w , j tf j max
wherein, tfw,jRepresenting the frequency of occurrence of the candidate hotword w in the jth microblog,representing the maximum word frequency in the jth microblog;
step B2: using the burst value of the candidate hot word w to describe how sharply its word frequency changes between the current time period and the historical time periods; the burst value $B_{w,t}$ of the candidate hot word w is calculated as follows: taking k historical time windows before the time period t, each of the same size as the time period t; counting, on the basis of a binomial-distribution discrete event model, the number of microblogs containing the candidate hot word w in the time period t and in the k historical time windows before it; and calculating the burst value of the candidate hot word w in the time period t with the $\chi^2$ statistic:
$$B_{w,t} = \frac{(A + B + C + D)(AD - BC)^2}{(A + B)(C + D)(A + C)(B + D)}$$
wherein A represents the number of microblogs containing the candidate hot word w in the time period t; B represents the average number of microblogs containing the candidate hot word w in the k historical time windows; C represents the number of microblogs not containing the candidate hot word w in the time period t; D represents the average number of microblogs not containing the candidate hot word w in the k historical time windows;
step B3: calculating the vitality value of each candidate hot word by combining its nutrition value and its burst value; the normalized vitality value $\mathrm{life}_{w,t}$ of the candidate hot word w is calculated as follows:
$$\mathrm{life}_{w,t} = \frac{B_{w,t} \cdot \mathrm{Nutr}_{w,t}}{\max_{w' \in terms} \left( B_{w',t} \cdot \mathrm{Nutr}_{w',t} \right)}$$
wherein terms represents the candidate hot word set, and w' represents an element of the candidate hot word set terms;
step B4: sorting the candidate hot words in the candidate hot word set by vitality value, selecting the top L candidate hot words as hot words, and forming the hot word set from these L hot words;
the microblog hotword and hot topic mining system corresponding to the method comprises the following modules:
the preprocessing module is used for preprocessing content data published in the social network to obtain candidate hot words and construct a candidate hot word set;
the hot word screening module is used for calculating the vitality of each candidate hot word according to the occurrence frequency and the burstiness of each candidate hot word in the candidate hot word set in the current time and a given historical time window, screening out the hot words and constructing a hot word set according to the vitality of each candidate hot word;
the hot word co-occurrence network construction module is used for calculating the correlation of each hot word in the hot word set and constructing a hot word co-occurrence network;
and the hot word clustering module is used for partitioning the hot word set by using a hot word clustering algorithm based on multi-label propagation according to the hot word co-occurrence network to obtain a hot topic set.
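As an illustration of steps B1 to B4 of claim 1, the following Python sketch computes the nutrition value, the χ² burst value and the normalized vitality value, then keeps the top L candidates. It is not part of the claims. The function names, the representation of a microblog as a list of tokens, the tie-breaking by word, and the handling of a zero denominator are all assumptions made for this example.

```python
from collections import Counter


def nutrition(word, microblogs):
    """Step B1: sum over microblogs of tf(word) / max tf in that microblog."""
    total = 0.0
    for tokens in microblogs:
        counts = Counter(tokens)
        if counts:  # skip empty microblogs (assumption)
            total += counts[word] / max(counts.values())
    return total


def burst(a, b, c, d):
    """Step B2: chi-square statistic over the counts A, B, C, D."""
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # assumption: a degenerate table counts as "no burst"
    return (a + b + c + d) * (a * d - b * c) ** 2 / denom


def screen_hot_words(current, history, top_l):
    """Steps B3 and B4: normalized vitality, then the top-L candidates.

    current: microblogs (token lists) of time period t.
    history: list of k earlier windows, each a list of microblogs.
    """
    if not history:
        raise ValueError("at least one historical window is required")
    k = len(history)
    candidates = {w for tokens in current for w in tokens}
    raw = {}
    for w in candidates:
        a = sum(1 for tokens in current if w in tokens)
        c = len(current) - a
        # B and D are averages over the k historical windows.
        b = sum(sum(1 for tokens in win if w in tokens) for win in history) / k
        d = sum(len(win) for win in history) / k - b
        raw[w] = burst(a, b, c, d) * nutrition(w, current)
    peak = max(raw.values(), default=0.0)
    life = {w: (v / peak if peak > 0 else 0.0) for w, v in raw.items()}
    # Sort by descending vitality; ties are broken alphabetically (assumption).
    ranked = sorted(life, key=lambda w: (-life[w], w))
    return [(w, life[w]) for w in ranked[:top_l]]
```

Note that the χ² statistic is unsigned, so a word whose frequency drops sharply also receives a high burst value; a real system might additionally require A to exceed B, which the claim does not state.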
2. The microblog hotword and hot topic mining method according to claim 1, wherein in the step C, the correlation $c_{z,k}$ between the hot word z and the hot word k in a given time period t is defined as:
$$c_{z,k} = \log \frac{r_{z,k} / (R_k - r_{z,k})}{(n_z - r_{z,k}) / (N - n_z - R_k + r_{z,k})} \times \left| \frac{r_{z,k}}{R_k} - \frac{n_z - r_{z,k}}{N - R_k} \right|$$
wherein $r_{z,k}$ represents the number of microblogs containing both the hot word z and the hot word k, $n_z$ represents the number of microblogs containing the hot word z, $R_k$ represents the number of microblogs containing the hot word k, and N represents the number of all microblogs in the time period t, i.e. $N = |tw_t|$.
The hot word co-occurrence network is defined as G(V, E, W), where $V = \{v_1, v_2, \ldots, v_m\}$ is the node set representing the hot word set obtained in the step B, and m represents the number of nodes; E represents the set of edges between nodes: for any two nodes $v_i, v_j \in \{v_1, v_2, \ldots, v_m\}$, if the words represented by the two nodes have a co-occurrence relationship, an edge $e_{i,j} \in E$ is constructed between the two nodes; W represents the mapping from the edge set E to the set of real numbers R: if there is an edge $e_{i,j} \in E$ between $v_i$ and $v_j$, the edge weight is the similarity sim(i, j) between the i-th hot word and the j-th hot word, defined as:
$$\mathrm{sim}(i, j) = \frac{c_{i,j}}{\sum_{i,j \in V,\, i \neq j} c_{i,j}^{2}}.$$
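The correlation and edge-weight definitions of claim 2 can be sketched in Python as follows. This is only an illustration under several assumptions that the claim does not settle: the correlation is treated as 0 whenever one of its terms is undefined or non-positive; because the formula is not symmetric in z and k, the two directions are averaged to obtain an undirected edge; only pairs with a positive value become edges; and the normalizing sum runs over unordered pairs. The function names and the token-list representation of microblogs are likewise hypothetical.

```python
import math
from itertools import combinations


def correlation(r, n_z, R_k, N):
    """c_{z,k} of claim 2; returns 0.0 where the formula is undefined (assumption)."""
    only_k = R_k - r              # microblogs with k but not z
    only_z = n_z - r              # microblogs with z but not k
    neither = N - n_z - R_k + r   # microblogs with neither word
    if r <= 0 or only_k <= 0 or only_z <= 0 or neither <= 0 or N - R_k <= 0:
        return 0.0
    odds = (r / only_k) / (only_z / neither)
    return math.log(odds) * abs(r / R_k - only_z / (N - R_k))


def build_network(microblogs, hot_words):
    """Return {(word_i, word_j): sim(i, j)} for pairs with positive correlation."""
    docs = [set(tokens) for tokens in microblogs]
    n = len(docs)
    count = {w: sum(1 for d in docs if w in d) for w in hot_words}
    c = {}
    for z, k in combinations(sorted(hot_words), 2):
        r = sum(1 for d in docs if z in d and k in d)
        # Average both directions to get one undirected weight (assumption).
        value = 0.5 * (correlation(r, count[z], count[k], n)
                       + correlation(r, count[k], count[z], n))
        if value > 0:
            c[(z, k)] = value
    # Normalize by the sum of squared correlations over unordered pairs.
    norm = sum(v * v for v in c.values())
    return {edge: v / norm for edge, v in c.items()} if norm > 0 else {}
```

For example, with ten microblogs of which three contain both words, four contain the first and five contain the second, `correlation(3, 4, 5, 10)` evaluates `log(6) * 0.4`.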
3. The microblog hotword and hot topic mining method according to claim 2, wherein in the step D, each hot word in the hot word set, that is, each node, has a label membership set, and the label membership set of each node is updated in each iteration until the algorithm converges, specifically comprising the following steps:
step D1: initializing a label of a node according to the hot word co-occurrence network;
step D2: randomly acquiring a node v without updated labels, traversing neighbor nodes of the node v, updating the membership degree of each label in the label set of the node v according to the label set of the neighbor nodes, and performing label membership degree normalization on the node v;
step D3: repeating iteration until an iteration termination condition is met;
step D4: and classifying the nodes according to the label membership set of the nodes obtained by iteration to obtain a hot topic set.
4. The microblog hotword and hot topic mining method according to claim 3, wherein in the step D1, the label initialization method is as follows: each node is assigned a unique label number, to which it belongs with a membership degree of 1.0, and the unique label numbers together form the initial label set.
5. The microblog hotword and hot topic mining method according to claim 4, wherein in the step D2, the update rule for the label membership degree is as follows: randomly selecting a node v whose labels have not been updated, obtaining the neighbor node set Nb(v) of the node, and further obtaining the label set labels owned by the neighbor nodes; in the h-th iteration, the membership degree of the node v in a label $lb_i \in labels$ is:
$$b_h(lb_i, v) = \frac{\sum_{u \in Nb(v)} \mathrm{sim}(u, v) \cdot b_{h-1}(lb_i, u)}{\sum_{u \in Nb(v)} \mathrm{sim}(u, v)}$$
where sim(u, v) represents the similarity between node u and node v, and the denominator $\sum_{u \in Nb(v)} \mathrm{sim}(u, v)$ normalizes the label membership degrees, ensuring that the label membership degrees of the node v sum to 1.
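A minimal Python sketch of the label initialization of claim 4 and the membership update of claim 5 is given below. It is an illustration, not the claimed implementation. The adjacency-dictionary input, the function names, the fixed number of sweeps used in place of a termination test, and the rule that an isolated node keeps its labels are assumptions. Because the update reads only the memberships of iteration h-1, the order in which nodes are visited within a sweep does not change the result, so the random node selection is omitted here.

```python
def init_labels(nodes):
    """Claim 4: each node starts with its own unique label, membership 1.0."""
    return {v: {v: 1.0} for v in nodes}


def update_node(v, neighbors, prev):
    """Claim 5: similarity-weighted average of the neighbors' memberships.

    neighbors: dict mapping each neighbor u of v to sim(u, v).
    prev: memberships of all nodes from iteration h-1.
    """
    total = sum(neighbors.values())
    if total == 0:
        return dict(prev[v])  # isolated node keeps its labels (assumption)
    new = {}
    for u, weight in neighbors.items():
        for label, degree in prev[u].items():
            new[label] = new.get(label, 0.0) + weight * degree
    # Dividing by the total similarity makes the memberships sum to 1.
    return {label: value / total for label, value in new.items()}


def propagate(adjacency, sweeps):
    """Run a fixed number of sweeps (assumption; no convergence test here).

    adjacency: dict mapping node -> {neighbor: similarity}.
    """
    memberships = init_labels(adjacency)
    for _ in range(sweeps):
        prev = memberships
        memberships = {v: update_node(v, adjacency[v], prev) for v in adjacency}
    return memberships
```

On a triangle whose edge a-b has weight 2.0 and whose other edges have weight 1.0, one sweep gives node a the memberships 2/3 for label b and 1/3 for label c, since a's own label is not part of the sum over its neighbors.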
6. The microblog hotword and hot topic mining method according to claim 5, wherein in the step D3, iteration termination conditions are as follows:
wherein r ishIs defined as:
when $m_h = m_{h-1}$, the iteration ends.
CN201310725400.4A 2013-12-25 2013-12-25 Micro-blog hot word and hot topic mining system and method Expired - Fee Related CN103678670B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310725400.4A CN103678670B (en) 2013-12-25 2013-12-25 Micro-blog hot word and hot topic mining system and method

Publications (2)

Publication Number Publication Date
CN103678670A CN103678670A (en) 2014-03-26
CN103678670B true CN103678670B (en) 2017-01-11

Family

ID=50316214

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310725400.4A Expired - Fee Related CN103678670B (en) 2013-12-25 2013-12-25 Micro-blog hot word and hot topic mining system and method

Country Status (1)

Country Link
CN (1) CN103678670B (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104063428A (en) * 2014-06-09 2014-09-24 国家计算机网络与信息安全管理中心 Method for detecting unexpected hot topics in Chinese microblogs
CN104156436B (en) * 2014-08-13 2017-05-10 福州大学 Social association cloud media collaborative filtering and recommending method
CN105095988A (en) * 2015-07-01 2015-11-25 中国科学院计算技术研究所 Method and system for detecting social network information explosion
CN106610989B (en) * 2015-10-22 2021-06-01 北京国双科技有限公司 Search keyword clustering method and device
CN105488196B (en) * 2015-12-07 2019-01-22 中国人民大学 A kind of hot topic automatic mining system based on interconnection corpus
CN106919627A (en) * 2015-12-28 2017-07-04 北京国双科技有限公司 The treating method and apparatus of hot word
CN106446191B (en) * 2016-09-30 2019-11-05 浙江工业大学 A kind of multiple features network flow row label prediction technique returned based on Logistic
CN108170693B (en) * 2016-12-07 2020-07-31 北京国双科技有限公司 Hot word pushing method and device
CN108182191B (en) * 2016-12-08 2022-01-18 腾讯科技(深圳)有限公司 Hotspot data processing method and device
CN108241611B (en) * 2016-12-26 2021-08-17 北京国双科技有限公司 Keyword extraction method and extraction equipment
CN108804432A (en) * 2017-04-26 2018-11-13 慧科讯业有限公司 It is a kind of based on network media data Stream Discovery and to track the mthods, systems and devices of much-talked-about topic
CN107122478B (en) * 2017-05-03 2020-05-08 成都云数未来信息科学有限公司 Method for extracting hot topics based on keywords
CN108304371B (en) * 2017-07-14 2021-07-13 腾讯科技(深圳)有限公司 Method and device for mining hot content, computer equipment and storage medium
CN109509110B (en) * 2018-07-27 2021-08-31 福州大学 Microblog hot topic discovery method based on improved BBTM model
CN110377823A (en) * 2019-06-28 2019-10-25 厦门美域中央信息科技有限公司 A kind of building of hot spot digging system under Hadoop frame
CN110765239B (en) * 2019-10-29 2023-03-28 腾讯科技(深圳)有限公司 Hot word recognition method, device and storage medium
CN111125484B (en) * 2019-12-17 2023-06-30 网易(杭州)网络有限公司 Topic discovery method, topic discovery system and electronic equipment
CN112668836B (en) * 2020-12-07 2024-04-05 数据地平线(广州)科技有限公司 Risk spectrum-oriented associated risk evidence efficient mining and monitoring method and apparatus
CN113673224B (en) * 2021-08-19 2022-04-05 北京三快在线科技有限公司 Method and device for recognizing popular vocabulary, computer equipment and readable storage medium
CN113836307B (en) * 2021-10-15 2024-02-20 国网北京市电力公司 Power supply service work order hot spot discovery method, system, device and storage medium
CN114938477B (en) * 2022-06-23 2024-05-03 阿里巴巴(中国)有限公司 Video topic determination method, device and equipment
CN117076963B (en) * 2023-10-17 2024-01-02 北京国科众安科技有限公司 Information heat analysis method based on big data platform

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2700629A1 (en) * 2010-05-13 2011-11-13 Gerard Voon Shopping enabler
CN102346766A (en) * 2011-09-20 2012-02-08 北京邮电大学 Method and device for detecting network hot topics found based on maximal clique
CN103294818A (en) * 2013-06-12 2013-09-11 北京航空航天大学 Multi-information fusion microblog hot topic detection method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于词聚类的热点话题检测算法;龙志祎等;《计算机工程与设计》;20110630;第32卷(第6期);第2214-2217页 *

Also Published As

Publication number Publication date
CN103678670A (en) 2014-03-26

Legal Events

Date Code Title Description
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170111

Termination date: 20191225
