CN103390051A

CN103390051A - Topic detection and tracking method based on microblog data

Info

Publication number: CN103390051A
Application number: CN2013103163167A
Authority: CN
Inventors: 孙国梓; 黄斯琪; 杨一涛; 陈国兰; 仇呈燕; 郑冬亚
Original assignee: Nanjing Post and Telecommunication University
Current assignee: Nanjing Post and Telecommunication University; Nanjing University of Posts and Telecommunications
Priority date: 2013-07-25
Filing date: 2013-07-25
Publication date: 2013-11-13
Anticipated expiration: 2033-07-25
Also published as: CN103390051B

Abstract

The invention discloses a topic detection and tracking method based on microblog data. In the method, potential hidden subjects in large-scale social network information are mined. The method comprises the following steps: firstly, partitioning microblog data increasing massively according to time sequence properties, and filtering redundant information; secondly, analyzing and classifying text contents in time windows, returning key subject descriptions with independent semantics after extraction, and extracting topics in different time windows; and lastly, analyzing the inheritance and the identity of topics among the time windows to conclude the variation tendency of microblog topics. According to the method, the dynamic developing process of topic contents, namely, the generation, development, climax and extinction of topics can be shown, and topics are described more accurately and fully.

Description

A topic detection and tracking method based on microblog data

Technical field

The present invention relates to the technical field of data mining, and in particular to a topic detection and tracking method based on microblog data.

Background technology

With the rapid development of Web 2.0 and advances in the means of information dissemination, microblogging has in recent years grown into a fast-developing and highly influential form of online social media. As a new information carrier and channel of dissemination, microblogs make it easier for Internet users to comment on products and services and to join discussions of hot topics, and they play an increasingly important role in the emergence and spread of online public opinion. Not all of the microblog information that grows in real time and at large scale is valuable to users. Hot topics that interest users need to be extracted automatically from the massive volume of microblog information, and redundant data of no practical value needs to be filtered out.

A topic is a set of reports related to an event. Information sources on the Internet are diverse: they include hot topics of public concern, and may also include sensitive topics relating to public safety and social stability. The state of an event changes over time under the influence of cultural and other factors. Topic evolution reflects the process by which a topic emerges, rises, declines and ends. As time passes, both the intensity and the content of a topic change, that is, the topic drifts. Public opinion analysis grasps the evolution trend of a subject by analyzing massive text data on the Internet, so that timely and correct predictions can be made for decision makers' reference.

At present, traditional topic evolution research mainly takes newswire, broadcast, television, blogs, forum communities and similar media as data sources, and detects topics by applying a series of data mining methods and similarity comparisons. In this line of research, the text of the source information is very important. A microblog text is a short text limited to 140 characters; such texts are produced at any time and in enormous numbers. Because of the length limit, users usually post in a more condensed manner. The text is free-form and colloquial; abbreviations, Internet slang and misspellings are very common, and hypertext such as emoticons, pictures, videos and web links is often embedded. If topics are analyzed in the traditional way, by constructing a term-document feature matrix, the particular properties of microblog text make the feature matrix highly sparse, and the quality of the detection results suffers accordingly. The present invention solves the above problems well.

Summary of the invention

The object of the present invention is to provide a topic detection and tracking method based on microblog data. The method performs real-time data analysis on large-scale incremental microblog information, generates topics automatically by clustering through topic modeling, and, according to how topic content and topic intensity change over time, establishes the association and variation of topics along the time axis and summarizes the dynamic trend of topic evolution.

The technical solution adopted by the present invention to solve its technical problem is as follows. The invention provides a topic detection and tracking method based on microblog data. The method partitions the massively growing microblog data into blocks according to their temporal order, mines and analyzes the text content within each time window, extracts the topics in the different time windows, and finally summarizes the variation trend of microblog topics by analyzing the inheritance and identity of topics between time windows. The method consists mainly of three steps: data preprocessing, topic generation within time windows, and topic association analysis between time windows.

Method flow:

Step 1: data preprocessing

1. Ignore interactive messages of directed dialogue. That is, microblog messages of the form "@username" are ignored. Posts of this kind usually do not reflect a general topic, and ignoring them removes, as far as possible, noise data that serves only for interaction between individuals.

2. Expand the original microblog data. Information is extracted from the URLs referenced in the microblog text and added to the microblog message, to support the description of the user's viewpoint.

3. Formalize the microblog text: the microblog text is segmented into words, and stop words, low-frequency words and high-frequency words are removed. Taking into account the comments, forwards, user-defined tags (hashtags of the form "#topic#") and embedded external links (URLs) in the microblog text, a modified TF-IDF weighting algorithm is used. Each microblog post is formalized and corresponds to a multidimensional word vector W_i.

4. Remove sparsity: because microblog texts are short, they are clustered on the basis of their word vectors. (That is, each microblog is first segmented and represented as a word vector, and the microblogs are then clustered on these word vectors with the K-means algorithm. Supposing that the clustering yields K classes, the microblog messages in each class are merged into a single document, giving K synthetic microblog documents D.)
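As an illustration of sub-steps 1 and 3 of Step 1, the following is a minimal Python sketch, not the patented implementation. The function names, the regular expression for "@username", the whitespace split that stands in for word segmentation, and the thresholds `min_df` and `max_df_ratio` are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumed pattern for a directed "@username" mention.
MENTION = re.compile(r"@\w+")

def drop_directed_messages(posts):
    """Sub-step 1: discard posts that contain an "@username" mention."""
    return [p for p in posts if not MENTION.search(p)]

def filter_tokens(tokenized_posts, stop_words, min_df=2, max_df_ratio=0.5):
    """Sub-step 3: remove stop words, low-frequency and high-frequency words.

    min_df and max_df_ratio are illustrative thresholds, not values from the text.
    """
    n_docs = len(tokenized_posts)
    df = Counter()
    for tokens in tokenized_posts:
        df.update(set(tokens))          # document frequency of each word
    keep = {
        w for w, c in df.items()
        if w not in stop_words and c >= min_df and c / n_docs <= max_df_ratio
    }
    return [[w for w in tokens if w in keep] for tokens in tokenized_posts]

posts = [
    "@alice see you tomorrow",
    "flood warning issued for the river district",
    "river flood closes the main bridge",
    "bridge traffic delayed by flood",
    "new phone released today",
]
kept = drop_directed_messages(posts)
tokens = [p.split() for p in kept]      # whitespace split stands in for word segmentation
cleaned = filter_tokens(tokens, stop_words={"the", "for", "by"})
print(cleaned)
```

With these example thresholds, "flood" is dropped as a high-frequency word and words that occur in only one post are dropped as low-frequency words. In practice the thresholds would need tuning to the data.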

Step 2: topic generation within time windows

1. All preprocessed data are discretized, according to their time information, into the corresponding time windows t on the time series. The set in each time window is S_t = {W_1, W_2, ..., W_{M_t}}, so that the originally continuous text stream is divided into several time windows; the number of documents M_t in each time window may be the same or different.

2. Remove sparsity. Microblog data consist mostly of short sentences or even phrases; because their content is sparse, they are clustered on the basis of their word vectors.

3. For the microblog text cut into time slices, the text collection in each time window is processed in turn. The LDA model is used for topic modeling, several topics T are extracted, and the topic content and topic intensity are obtained for each. The number of topics generated in each window may be the same or different; the number of topics N is determined dynamically by a model selection method from the microblog text content in each time window.

4. Because a topic that has already appeared may appear again with a certain probability in subsequent time windows, the posterior probability of the word distribution in the historical time window is used as prior knowledge for topic mining in the current time window. A first-order discrete method based on unconditional dependence is adopted: for the current time window t, the word distribution in time window t-1, together with a certain weighting value w, serves as the prior of the word distribution in time window t.
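Sub-step 1 of Step 2 can be sketched as follows. This is a minimal illustration that assumes fixed-length windows measured from the earliest timestamp; the text does not fix the window length, so the function name `partition_by_window` and the one-hour default are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def partition_by_window(posts, window=timedelta(hours=1)):
    """Group (timestamp, text) pairs into consecutive fixed-length time windows.

    Returns {window_index: [text, ...]}; window 0 starts at the earliest timestamp.
    """
    if not posts:
        return {}
    start = min(ts for ts, _ in posts)
    windows = defaultdict(list)
    for ts, text in sorted(posts, key=lambda p: p[0]):
        index = int((ts - start) / window)   # timedelta / timedelta gives a float
        windows[index].append(text)
    return dict(windows)

stream = [
    (datetime(2013, 7, 25, 9, 5), "a"),
    (datetime(2013, 7, 25, 9, 50), "b"),
    (datetime(2013, 7, 25, 10, 10), "c"),
    (datetime(2013, 7, 25, 12, 0), "d"),
]
windows = partition_by_window(stream)
print(windows)
```

The windows in this example hold two, one and one documents, which matches the remark that M_t may differ from window to window.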

Step 3: topic association analysis between time windows

Topic evolution mainly refers to the trend over time of topics with the same semantics across different time periods, together with the extinction of old topics and the emergence of new ones. The topic association between time windows, comprising inheritance and identity between topics, is analyzed to obtain the evolution path of a topic. Inheritance between topics is measured by semantic similarity, and identity is measured by the similarity of the microblog vector information. Several window topics that have a temporal relationship and are associated are combined into one topic; through the changes in window topic content and intensity, the topic is divided into several stages from emergence to extinction, which describes the variation trend and evolution process of the topic.

Beneficial effects:

1. In the data preprocessing stage, the characteristics of microblog messages themselves are fully taken into account: forwards, comments, tags and the like are considered, useless noise data is filtered out, and data that contributes to describing the topic is weighted, so that a vector that better reflects microblog features is constructed.

2. For an embedded URL contained in a microblog, the data that the URL points to is filled into the original microblog content, enriching the information in the original text.

3. Microblog data differs from general text data: it is limited to 140 characters and is therefore short. A clustering method is used to address the resulting text sparsity.

4. Topics are extracted on the basis of local time windows, the number of topics is determined dynamically by a model selection method, and topics are described by window topics that have a temporal relationship and are associated, so that the semantics of a topic can be described fairly accurately.

5. A weighted combination of similarities is used to measure the association between topics. It combines the different ideas and perspectives of three kinds of similarity and avoids the shortcomings of using any single method.

Description of drawings

Fig. 1 is a flow chart of the microblog data topic detection and tracking method of the present invention.

Fig. 2 is a schematic diagram of the LDA topic generation model of the present invention.

Embodiment

The invention is described in further detail below with reference to the accompanying drawings.

Step 1: data preprocessing

1. Ignore interactive messages of directed dialogue. That is, microblog messages of the form "@username" are ignored. Such messages are mostly directed conversations between users and are unlikely to describe a general topic. Removing them eliminates noise data as far as possible.

2. Expand the original microblog data. Information is extracted from the embedded external links (URLs) referenced in the microblog text and added to the microblog message, to support the description of the user's viewpoint. The extracted data is used in the TF-IDF calculation of the next step.

3. Formalize the microblog text. To normalize the microblog data, the data is first preprocessed: it is segmented into words, stop words are removed, high-frequency and low-frequency words are removed, and the modified TF-IDF weights are calculated.

Because microblogs differ from other, traditional text data, a microblog can be clearly divided into three parts: the forwarded source text, the current microblog text, and the comment information. Although the subject of a microblog is what its own text expresses, analyzing the words that appear in the forwarded source text and in the comments makes it possible to extract, more effectively and more accurately, the vocabulary that characterizes the post. For example, if a word appears in the forwarded source text, in the microblog text and in the comments, it is very likely a subject word that represents the features of this microblog, whatever its TF-IDF value. In the body, a tag field of the form "#topic#" is also a kind of summary of the subject, and can often sum up the subject that the current microblog is meant to express.

For the above situation, the traditional TF-IDF weighting method is modified to make it better suited to constructing the microblog text vector space. It is calculated as follows:

{tf}_{ij} = \frac{n_{i,j}}{\sum_{k} n_{i,k}}

Formula (1)

n_{i,j} = n\_post_{i,j} + n\_hash_{i,j} \times w_{hashtag} + n\_url_{i,j} \times w_{url}

In formula (1), tf_{ij} denotes the term frequency of feature word j in microblog i; n_{i,j} denotes the weighted number of occurrences of feature word j in microblog i; n_post_{i,j} denotes the number of occurrences of feature word j in the text data of microblog i (including forwards and comments, excluding hashtags and URLs); n_hash_{i,j} and n_url_{i,j} denote the number of occurrences of feature word j in the hashtags and in the URLs of microblog i respectively; w_{hashtag} and w_{url} are the corresponding weights. \sum_{k} n_{i,k} denotes the total number of words in microblog i.

{idf}_{j} = \log \left( \frac{N}{n} + 0.01 \right)

Formula (2)

In formula (2), N denotes the total number of microblogs, n denotes the number of microblogs in which feature word j appears, and 0.01 is a constant that prevents the idf result from being 0.

V_{ij} = {tf}_{ij} \times {idf}_{j}

Formula (3)

This gives the formalized text. After formalization, each microblog corresponds to a multidimensional word vector W_i:

W_{i} \sim (V_{i1}, V_{i2}, \ldots, V_{ik})

Formula (4)

In formula (4), k denotes the dimension of the word vector, and V_{ij} denotes the TF-IDF weight of feature word j in microblog i, obtained from formula (3).
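The weighting of formulas (1) to (3) can be sketched in Python as below. This is a hedged illustration rather than the patented implementation. It takes the denominator of formula (1) as the total weighted word count of the microblog and the idf as the natural logarithm of N/n + 0.01; both readings, the weight values `w_hashtag=2.0` and `w_url=1.5`, and the input layout (token lists under the keys `body`, `hashtags` and `url_words`) are assumptions made for the example.

```python
import math
from collections import Counter

def weighted_counts(body, hashtags, url_words, w_hashtag=2.0, w_url=1.5):
    """n_{i,j} = n_post + n_hash * w_hashtag + n_url * w_url for one microblog."""
    counts = Counter()
    for word, n in Counter(body).items():
        counts[word] += n
    for word, n in Counter(hashtags).items():
        counts[word] += n * w_hashtag
    for word, n in Counter(url_words).items():
        counts[word] += n * w_url
    return counts

def tfidf_vectors(posts, w_hashtag=2.0, w_url=1.5):
    """Return one {word: V_ij} dict per microblog.

    posts: list of dicts with 'body', 'hashtags' and 'url_words' token lists.
    """
    counts = [
        weighted_counts(p["body"], p["hashtags"], p["url_words"], w_hashtag, w_url)
        for p in posts
    ]
    n_posts = len(posts)
    df = Counter()
    for c in counts:
        df.update(c.keys())                 # number of microblogs containing each word
    vectors = []
    for c in counts:
        total = sum(c.values())             # total (weighted) word count of the microblog
        vectors.append({
            word: (n / total) * math.log(n_posts / df[word] + 0.01)
            for word, n in c.items()
        })
    return vectors

example_posts = [
    {"body": ["flood", "river"], "hashtags": ["flood"], "url_words": []},
    {"body": ["phone"], "hashtags": [], "url_words": ["phone", "launch"]},
]
vectors = tfidf_vectors(example_posts)
print(vectors)
```

In the first example post, "flood" appears once in the body and once as a hashtag, so its weighted count is 3.0 against 1.0 for "river", and it receives the larger weight.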

Step 2: topic generation within time windows

1. The preprocessed information is discretized by its time attribute into several time-dependent message blocks, each corresponding to a time window on the time series. The set in time window t is S_t = {W_1, W_2, ..., W_{M_t}}. The number of documents M_t in each time window depends on the actual information stream and may be the same or different.

2. Remove sparsity. Microblog data consist mostly of short sentences or even phrases; because their content is sparse, they are clustered on the basis of their word vectors. In time window t, the word vectors W_j in S_t are clustered with the K-means algorithm. Supposing that the clustering yields K classes, the microblog data in each class are merged into a single document, giving K synthetic microblog documents D_t.

3. For the microblog text D_t cut into time slices, the text collection in each time window is processed in turn. The LDA (Latent Dirichlet Allocation) model proposed by D. M. Blei in 2003 is used for topic modeling; several topics T are extracted, and the topic content and topic intensity are obtained for each. The detailed process is shown in Fig. 2.

The number of topics generated in each window may be the same or different; the number of topics N is determined dynamically by a model selection method from the microblog text content in each time window:

P(w \mid z) = \left( \frac{\Gamma(V\beta)}{\Gamma(\beta)^{V}} \right)^{N} \prod_{j=1}^{N} \frac{\prod_{w} \Gamma(f_{j}^{w} + \beta)}{\Gamma(f_{j} + V\beta)}

Formula (4)

Here \Gamma(\cdot) is the standard Gamma function, f_{j}^{w} denotes the number of times word w is assigned to topic j, and f_{j} denotes the total number of words assigned to topic j. The N that minimizes P(w | z) in the above formula is taken as the best number of topics.

4. The posterior probability of the previous time window is used to influence the prior probability of the current time window, which maintains continuity between topics and handles the fact that a topic that has already appeared may appear again with some probability in subsequent time windows. A first-order discrete method based on unconditional dependence is used: for the current time window t, the word distribution in time window t-1, together with a certain weighting value w, serves as the prior of the word distribution in time window t.

Namely

Formula (5)
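Two parts of Step 2 lend themselves to a short sketch: the sparsity removal of sub-step 2 (K-means clustering followed by merging each cluster into one synthetic document) and the evaluation of P(w | z) from sub-step 3. The Python below is an illustration under stated assumptions, not the patented implementation. The K-means routine is a plain textbook version seeded with the first k vectors; V is read as the vocabulary size and β as a symmetric Dirichlet hyperparameter, neither of which is defined in the text; and the probability is computed in log space with `math.lgamma` to avoid overflow. The sketch does not cover the prior of sub-step 4.

```python
from math import lgamma

def kmeans(vectors, k, iterations=20):
    """Plain K-means on dense vectors; returns one cluster label per vector.

    Centroids start from the first k vectors (a simplification for determinism).
    """
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for idx, v in enumerate(vectors):
            distances = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centroids]
            labels[idx] = distances.index(min(distances))
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def merge_clusters(token_lists, labels, k):
    """Concatenate the tokens of all microblogs in a cluster into one synthetic document."""
    merged = [[] for _ in range(k)]
    for tokens, lab in zip(token_lists, labels):
        merged[lab].extend(tokens)
    return merged

def log_p_w_given_z(topic_word_counts, vocab_size, beta=0.1):
    """Logarithm of P(w | z); topic_word_counts[j][w] is f_j^w for topic j."""
    n_topics = len(topic_word_counts)
    log_p = n_topics * (lgamma(vocab_size * beta) - vocab_size * lgamma(beta))
    for counts in topic_word_counts:
        f_j = sum(counts.values())
        unseen = vocab_size - len(counts)   # words never assigned to this topic
        log_p += sum(lgamma(f + beta) for f in counts.values())
        log_p += unseen * lgamma(beta)      # each unseen word contributes Gamma(0 + beta)
        log_p -= lgamma(f_j + vocab_size * beta)
    return log_p

word_vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
token_lists = [["flood"], ["phone"], ["river", "flood"], ["launch"]]
labels = kmeans(word_vectors, k=2)
documents = merge_clusters(token_lists, labels, k=2)
print(labels, documents)
print(log_p_w_given_z([{"a": 1, "b": 1}], vocab_size=2, beta=1.0))
```

A model selection loop would fit LDA for several candidate values of N, evaluate this quantity for each, and pick N according to the criterion stated above; the fitting itself is outside the scope of this sketch.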

Step 3: topic association analysis between time windows

Topic evolution mainly refers to the trend over time of topics with the same semantics across different time periods, together with the extinction of old topics and the emergence of new ones. The relationship between topic content in different time windows therefore needs to be analyzed, including the inheritance and identity between topics, to obtain the variation trend of a topic. Inheritance between topics is measured by semantic similarity, and identity is measured by the similarity of the microblog vector information.

1. Topic inheritance between windows: inheritance between topics shows up as similarity of topic content, and is measured by a semantic similarity algorithm.

2. Identity between window topics: high semantic similarity between two topics does not by itself show that they form a trend of topic change. To avoid pairs that are coupled purely at the semantic level but do not actually describe the same topic, a weighted combination of similarities is used to measure the inheritance between topics. The algorithm combines the different ideas and perspectives of two similarities, the cosine of the vector angle and the Jaccard coefficient, and avoids the shortcomings of using any single method. It also keeps the similarity within the interval [0, 1], where a larger value indicates higher similarity.

Sim_{inh}(T_{1}, T_{2}) = Sim_{cos}(T_{1}, T_{2}) \times \alpha + Sim_{jac}(T_{1}, T_{2}) \times \beta

Formula (6)

In the formula, Sim_{cos}(T_1, T_2) and Sim_{jac}(T_1, T_2) denote the similarity of topics T_1 and T_2, in time window 1 and time window 2 respectively, under the cosine similarity and the Jaccard coefficient algorithm. α and β are weighting coefficients that reflect the weight of each of the two similarities in the overall similarity.

Taking both the inheritance and the identity measures between topics into account gives the combined similarity used to judge the association between topics:

Sim_{com}(T_{1}, T_{2}) = Sim_{inh}(T_{1}, T_{2}) \times \lambda + Sim_{sem}(T_{1}, T_{2}) \times \mu

Formula (7)

where Sim_{sem}(T_1, T_2) and Sim_{inh}(T_1, T_2) are the measures of inheritance and identity between topics respectively, and λ and μ are weighting coefficients.

3. Association analysis between window topics: several window topics that have a temporal relationship and are associated are combined into one topic; through the changes in window topic content and intensity, the topic is divided into several stages from emergence to extinction, which describes its evolution process.

In the association analysis, each window topic T_i is compared with the topic T_{i-1} of the preceding time window and the topic T_{i+1} of the following time window in order to identify newly generated topics. Sim_{com}(T_i, T_{i+1}) < ε indicates that T_i is an old topic that disappears, and Sim_{com}(T_i, T_{i-1}) ≥ ε indicates that the topic has been inherited. This process yields the course of a topic from emergence to extinction. Applied to a microblog platform, the topic detection and tracking method can draw on collective input, track hot topics quickly and update topic popularity, making up for the shortcomings of traditional media in the real-time tracking and analysis of hot topics.
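Formulas (6) and (7) and the threshold test above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation. It assumes that each window topic is represented as a dict mapping words to weights, that the Jaccard coefficient is taken over the topics' word sets, and that a topic with no sufficiently similar predecessor counts as new. The text does not specify the semantic similarity algorithm, so `sim_sem` is supplied by the caller and the example passes the Jaccard coefficient purely as a stand-in. The coefficient values and the threshold `epsilon=0.5` are illustrative.

```python
import math

def cosine_similarity(p, q):
    """Cosine of the angle between two word-weight vectors stored as dicts."""
    dot = sum(w * q.get(word, 0.0) for word, w in p.items())
    norm = math.sqrt(sum(w * w for w in p.values())) * math.sqrt(sum(w * w for w in q.values()))
    return dot / norm if norm else 0.0

def jaccard_similarity(p, q):
    """Jaccard coefficient of the two topics' word sets."""
    a, b = set(p), set(q)
    return len(a & b) / len(a | b) if a | b else 0.0

def sim_inh(p, q, alpha=0.5, beta=0.5):
    """Formula (6): weighted sum of cosine and Jaccard similarity."""
    return alpha * cosine_similarity(p, q) + beta * jaccard_similarity(p, q)

def sim_com(p, q, sim_sem, lam=0.5, mu=0.5):
    """Formula (7): combine Sim_inh with a caller-supplied semantic similarity."""
    return lam * sim_inh(p, q) + mu * sim_sem(p, q)

def classify(prev_topic, topic, next_topic, sim, epsilon=0.5):
    """Label window topic T_i by comparing it with T_{i-1} and T_{i+1}."""
    labels = []
    if prev_topic is not None and sim(topic, prev_topic) >= epsilon:
        labels.append("inherited")
    else:
        labels.append("new")            # assumption: no similar predecessor means a new topic
    if next_topic is not None and sim(topic, next_topic) < epsilon:
        labels.append("disappears")
    return labels

def sim(p, q):
    # Stand-in: Jaccard is used as the semantic similarity for this example only.
    return sim_com(p, q, sim_sem=jaccard_similarity)

t1 = {"flood": 0.6, "river": 0.4}
t2 = {"flood": 0.5, "river": 0.3, "bridge": 0.2}
t3 = {"phone": 0.7, "launch": 0.3}
print(classify(None, t1, t2, sim))
print(classify(t1, t2, t3, sim))
print(classify(t2, t3, None, sim))
```

With coefficients that are non-negative and sum to 1 in each formula, and non-negative topic weights, the combined value stays within [0, 1], which is what makes a single threshold ε usable across windows.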

Claims

1. the method for the topic detection and tracking based on the microblogging data, is characterized in that, is divided into following steps:

Step 1: data pre-service;

1. ignore the interactive message of directive property dialogue;

2. former microblogging data extending;

3. microblogging text type: the processing that the microblogging text is carried out participle, removes stop words, removes low-frequency word and high frequency words;

4. go sparse property: for the shorter data text of microblogging, it is carried out clustering processing based on term vector;

Step 2: in time window, topic generates;

1. will be discrete according to its time information in time window t corresponding on the time sequence through pretreated all data messages;

2. go sparse property: the microblogging data mostly are even phrase of short sentence, for its comparatively sparse data content, it are carried out clustering processing based on term vector;

3. be the microblogging text of timeslice for cutting, process successively the text collection in each time window, use the LDA model to carry out the topic model modeling, therefrom extract several themes T, and obtain respectively topic content and topic intensity; The topic quantity that wherein generates in each window can be the same or different, and topic quantity N is dynamically generated according to the microblogging content of text in each time window by model selection method;

4., because certain topic that had occurred still can occur with certain probability in ensuing time window, therefore utilize the priori of the posterior probability of the distribution of word in the historical time window as topic excavation in the current time window; Taking the first discrete method that relies on based on non-condition,, for current time window t, is that the interior word of t-1 distributes and the priori of certain weighted value w as word distribution in time window t with the time window;

Step 3: topic association analysis between time window;

Topic develops and mainly to refer on different time sections, and the topic with identical semanteme is trend over time, and the destruction of old topic, the generation of new topic; Topic relevance between the analysis time window, comprise inheritance and homogeneity between topic, thereby obtain the Evolution Paths of topic; Wherein, the inheritance between topic is weighed by semantic similarity, and homogeneity is weighed by the similarity in the microblogging vector information; By the variation of window topic content and intensity, topic is drawn the some stages, the newsy variation tendency of shape of being described as by producing to wither away; Some window topics that will have sequential relationship and relevance are combined into topic, by the variation of window topic content and intensity, a topic are divided into some stages by producing to wither away, and describe out the evolutionary process of topic.

2. The topic detection and tracking method based on microblog data according to claim 1, characterized in that: in step 1 of the method, in the data preprocessing stage, interactive messages of directed dialogue are ignored; that is, microblog messages of the form "@username" are ignored, since such messages are mostly directed conversations between users and are unlikely to describe a general topic.

3. The topic detection and tracking method based on microblog data according to claim 1, characterized in that: in step 1 of the method, the originally sparse microblog data is expanded; information is extracted from the embedded external links (URLs) referenced in the microblog text and added to the microblog message, to support the description of the user's viewpoint; the extracted data is used in the calculation of the TF-IDF value improved for microblog features, which assigns different weights to the text, the comments and the forwards in the microblog message.

4. The topic detection and tracking method based on microblog data according to claim 1, characterized in that: in step 1 of the method, sparsity is removed from the microblog data; microblog data consist mostly of short sentences or even phrases, and because their content is sparse, they are clustered on the basis of their word vectors; in time window t, the word vectors W_j in S_t are clustered with the K-means algorithm; supposing that the clustering yields K classes, the microblog data in each class are merged into a single document, giving K synthetic microblog documents D_t.

5. The topic detection and tracking method based on microblog data according to claim 1, characterized in that: in step 2 of the method, for the microblog text D_t cut into time slices, the text collection in each time window is processed in turn; the LDA (Latent Dirichlet Allocation) model proposed by D. M. Blei in 2003 is applied for topic modeling, several topics T are extracted, and the topic content and topic intensity are obtained for each; the number of topics generated in each window may be the same or different, and the number of topics N is determined dynamically by a model selection method from the microblog text content in each time window:

P(w \mid z) = \left( \frac{\Gamma(V\beta)}{\Gamma(\beta)^{V}} \right)^{N} \prod_{j=1}^{N} \frac{\prod_{w} \Gamma(f_{j}^{w} + \beta)}{\Gamma(f_{j} + V\beta)}.

6. The topic detection and tracking method based on microblog data according to claim 1, characterized in that: in step 3 of the method, the posterior probability of the previous time window is used to influence the prior probability of the current time window, which maintains continuity between topics; a first-order discrete method based on unconditional dependence is applied: for the current time window t, the word distribution in time window t-1

Namely

7. The topic detection and tracking method based on microblog data according to claim 1, characterized in that: in step 2 of the method, a weighted combination of similarities is used to measure the inheritance between topics; the method combines the different ideas and perspectives of two similarities, the cosine of the vector angle and the Jaccard coefficient, avoiding the shortcomings of using any single method, and keeps the similarity within the interval [0, 1], where a larger value indicates higher similarity;

Sim_{inh}(T_{1}, T_{2}) = Sim_{cos}(T_{1}, T_{2}) \times \alpha + Sim_{jac}(T_{1}, T_{2}) \times \beta

in the formula, Sim_{cos}(T_1, T_2) and Sim_{jac}(T_1, T_2) denote the similarity of topics T_1 and T_2, in time window 1 and time window 2 respectively, under the cosine similarity and the Jaccard coefficient algorithm; α and β are weighting coefficients that reflect the weight of each of the two similarities in the overall similarity;

Sim_{com}(T_{1}, T_{2}) = Sim_{inh}(T_{1}, T_{2}) \times \lambda + Sim_{sem}(T_{1}, T_{2}) \times \mu

where Sim_{sem}(T_1, T_2) and Sim_{inh}(T_1, T_2) are the measures of inheritance and identity between topics respectively, and λ and μ are weighting coefficients;

inheritance between topics shows up as similarity of topic content, and is measured by a semantic similarity algorithm.

8. The topic detection and tracking method based on microblog data according to claim 1, characterized in that: in step 3 of the method, the association analysis between window topics combines several window topics that have a temporal relationship and are associated into one topic; through the changes in window topic content and intensity, the topic is divided into several stages from emergence to extinction, which describes its evolution process;

in the association analysis, each window topic T_i is compared with the topic T_{i-1} of the preceding time window and the topic T_{i+1} of the following time window in order to identify newly generated topics; Sim_{com}(T_i, T_{i+1}) < ε indicates that T_i is an old topic that disappears, and Sim_{com}(T_i, T_{i-1}) ≥ ε indicates that the topic has been inherited; this process yields the course of a topic from emergence to extinction.