Background Art
With the rapid development of Web 2.0 and of information and communication technology, microblogging has in recent years grown quickly into a highly influential form of online social media. As a new information carrier and channel of dissemination, microblogging makes it easier for users to comment on products and services and to join discussions of trending topics, and it plays an increasingly important role in the emergence and spread of online public opinion. However, not all of the microblog messages that accumulate in real time and at large scale are valuable to users. Hot topics that may interest users therefore need to be extracted automatically from the mass of microblog messages, and redundant data of no practical value need to be filtered out.
A topic is a set of reports related to an event. Information sources on the network are diverse: they include hot topics of public concern and may also include sensitive topics relating to public safety and social stability. The state of an event changes over time under the influence of cultural and other factors. Topic evolution reflects the process by which a topic emerges, rises, declines, and ends. As time passes, both the intensity and the content of a topic change, that is, the topic drifts. Public opinion analysis examines the mass of text data on the Internet to grasp how themes evolve and to make timely and correct predictions for the reference of decision makers.
At present, traditional topic evolution analysis mainly takes newswire, radio, television, blogs, forums, and similar media as data sources, and detects topics through a series of data mining methods combined with similarity comparison. In this line of research, the text of the source material is the key information. A microblog post is a short text limited to 140 characters; posts are produced at any moment and in enormous numbers. Because of the length limit, users usually write in a condensed way. The text is free in form and colloquial; abbreviations, Internet slang, and misspellings are common, and posts often embed hypertext elements such as emoticons, pictures, videos, and web links. If topics are analyzed in the traditional way, by building a term-document feature matrix, these characteristics of microblog text make the feature matrix highly sparse, and the detection results degrade considerably. The present invention addresses these problems.
Summary of the invention
The object of the invention is to provide a method for topic detection and tracking based on microblog data. The method performs real-time analysis on large-scale, incrementally growing microblog data, uses topic modeling to cluster posts into topics automatically, establishes the association and variation of topics along the time axis according to how topic content and topic intensity change over time, and summarizes the dynamic trend of topic evolution.
The technical solution adopted by the invention is as follows. The method partitions the massively growing microblog data into blocks in temporal order, mines the text content within each time window, extracts the topics in the different time windows, and finally summarizes the trend of microblog topics by analyzing the inheritance and homogeneity of topics between time windows. The method mainly comprises three steps: data preprocessing, topic generation within time windows, and topic association analysis between time windows.
Method flow:
Step 1: Data preprocessing
1. Ignore directed conversational messages. Posts of the form "@username" are ignored; such posts usually do not reflect a general topic, and ignoring them removes as much as possible of the noise that is merely interaction between individuals.
2. Expand the original microblog data. Information is extracted from the URLs that appear in the microblog text and added to the post, supporting the description of the user's viewpoint.
3. Formalize the microblog text. The text is segmented into words, and stop words, low-frequency words, and high-frequency words are removed. A modified TF-IDF weighting algorithm is used that takes into account comments, reposts, user-defined tags (hashtags of the form "#topic#"), and embedded external links (URLs) in the microblog text. Each post is formalized as a multidimensional term vector $W_i$.
4. Reduce sparsity. Because microblog texts are short, they are clustered on the basis of their term vectors. That is, each post is first segmented and represented as a term vector, and the posts are then clustered on these vectors with the K-means algorithm. If the clustering yields K classes, the posts in each class are merged into a single document, giving K synthetic microblog documents D.
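The sparsity-reduction step in item 4 can be illustrated with a minimal sketch. It assumes the posts have already been converted to equal-length numeric term vectors; the function name `kmeans_merge`, the fixed iteration count, and the seeded random initialization are illustrative choices, not details given in the description.

```python
import math
import random


def kmeans_merge(posts, vectors, k, iterations=20, seed=0):
    """Cluster term vectors with K-means and merge each cluster's posts.

    posts   -- list of post texts
    vectors -- list of equal-length numeric term vectors, one per post
    Returns up to k merged documents (empty clusters are dropped).
    """
    rng = random.Random(seed)
    # Hypothetical initialization: pick k of the input vectors as centroids.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignment = [0] * len(vectors)

    for _ in range(iterations):
        # Assignment step: attach every vector to its nearest centroid.
        for idx, vec in enumerate(vectors):
            assignment[idx] = min(
                range(k), key=lambda c: math.dist(vec, centroids[c])
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [vectors[i] for i in range(len(vectors)) if assignment[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]

    # Merge the posts of each cluster into one synthetic document.
    merged = [[] for _ in range(k)]
    for idx, post in enumerate(posts):
        merged[assignment[idx]].append(post)
    return [" ".join(group) for group in merged if group]
```

Merging by simple concatenation is the most direct reading of "merged into a single document"; a real system might also carry over the timestamps or identifiers of the merged posts.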
Step 2: Topic generation within time windows
1. All preprocessed data are discretized by their timestamps into the corresponding time windows t on the time axis. The set in each time window is $S_t = \{W_1, W_2, \ldots, W_{M_t}\}$. The originally continuous text stream is thereby divided into several time windows, and the number of documents $M_t$ may differ from window to window.
2. Reduce sparsity. Microblog posts are mostly short sentences or even phrases; because their content is sparse, they are clustered on the basis of their term vectors.
3. For the microblog text cut into time slices, the text collection in each time window is processed in turn. An LDA model is used for topic modeling, several topics T are extracted, and the content and intensity of each topic are obtained. The number of topics generated may differ from window to window; the number of topics N is determined dynamically by a model selection method from the microblog text in each time window.
4. Because a topic that has already appeared may, with some probability, appear again in subsequent time windows, the posterior word distribution of the historical time window is used as prior knowledge for topic mining in the current window. A discretize-first approach based on unconditional dependence is adopted: for the current time window t, the word distribution of time window t-1, together with a weight w, serves as the prior for the word distribution in time window t.
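As an illustration of item 1, the following sketch groups timestamped posts into fixed-length windows. The window length of one hour, the function name `split_into_windows`, and the choice to anchor the first window at the earliest timestamp are assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def split_into_windows(posts, window=timedelta(hours=1)):
    """Group (timestamp, term_vector) pairs into consecutive time windows.

    Returns a dict mapping the window index t to the list S_t of term
    vectors whose timestamps fall into that window. Windows that receive
    no posts are simply absent, so the sizes M_t may differ freely.
    """
    if not posts:
        return {}
    start = min(ts for ts, _ in posts)
    windows = defaultdict(list)
    for ts, vec in posts:
        t = (ts - start) // window  # timedelta // timedelta gives an int
        windows[t].append(vec)
    return dict(sorted(windows.items()))
```

For example, posts stamped 00:05, 00:50, and 02:10 fall into windows 0, 0, and 2 respectively.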
Step 3: Topic association analysis between time windows
Topic evolution refers mainly to how topics with the same semantics change over different time periods, including the disappearance of old topics and the emergence of new ones. The association of topics between time windows, comprising inheritance and homogeneity, is analyzed to obtain the evolution path of each topic. Inheritance between topics is measured by semantic similarity, and homogeneity is measured by similarity of the microblog vector information. Window topics that are related and ordered in time are combined into one topic; according to the changes in the content and intensity of the window topics, the topic is divided into several stages from emergence to extinction, which describes its evolution.
Beneficial effects:
1. In the data preprocessing stage, the characteristics of microblog messages are fully taken into account. Reposts, comments, tags, and similar elements are considered, useless noise is filtered out, and data that help describe the topic are weighted, yielding vectors that better reflect the features of microblog posts.
2. For URLs embedded in a post, the data the URL points to are added to the original post content, enriching the information in the original text.
3. Microblog data differ from ordinary text: they are limited to 140 characters and are therefore short. Clustering is used to address the resulting text sparsity.
4. Topics are extracted per local time window, and the number of topics is determined dynamically by a model selection method. Describing a topic through window topics that are related and ordered in time captures its semantics fairly accurately.
5. A weighted combination of similarities is used to measure the association between topics. It combines three similarity measures with different underlying ideas and perspectives, avoiding the shortcomings of relying on any single method.
Embodiment
The invention is described in further detail below with reference to the accompanying drawings.
Step 1: Data preprocessing
1. Ignore directed conversational messages. Posts of the form "@username" are ignored. Such posts are mostly directed conversations between users and are unlikely to describe a general topic, so removing them eliminates as much noise as possible.
2. Expand the original microblog data. Information is extracted from the embedded external links (URLs) in the microblog text and added to the post, supporting the description of the user's viewpoint. The extracted data are used in the TF-IDF calculation of the next step.
3. Formalize the microblog text. To normalize the microblog data, the data are first preprocessed: word segmentation, stop-word removal, and removal of low- and high-frequency words, followed by the modified TF-IDF weight calculation.
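The filtering and normalization steps listed above can be sketched as follows. This is a simplified illustration: whitespace splitting stands in for a real word segmenter (Chinese text would need a dedicated one), and the thresholds `min_count` and `max_ratio`, which are applied to document frequency, are hypothetical parameters not specified in the description.

```python
import re
from collections import Counter

# Matches a directed mention such as "@username".
MENTION = re.compile(r"@\w+")


def preprocess(posts, stop_words, min_count=2, max_ratio=0.8):
    """Drop @-directed posts, tokenize, and prune the vocabulary.

    A word is kept only if it is not a stop word, occurs in at least
    min_count posts, and occurs in at most max_ratio of all kept posts.
    """
    # 1. Ignore directed conversational messages.
    kept = [p for p in posts if not MENTION.search(p)]
    # 2. Tokenize (placeholder for word segmentation) and remove stop words.
    tokenized = [
        [w for w in p.lower().split() if w not in stop_words] for p in kept
    ]
    # 3. Remove low-frequency and high-frequency words by document frequency.
    doc_freq = Counter(w for tokens in tokenized for w in set(tokens))
    limit = max_ratio * len(tokenized)
    return [
        [w for w in tokens if min_count <= doc_freq[w] <= limit]
        for tokens in tokenized
    ]
```

URL expansion (item 2) is omitted here because it depends on how the linked pages are fetched and parsed, which the description leaves open.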
Unlike other, traditional text data, a microblog post can be divided explicitly into three parts: the source text being reposted, the text of the current post, and the comments. Although the theme of a post is what its own text expresses, analyzing the words that appear in the reposted source text and in the comments allows the words that characterize the post to be extracted more effectively and more accurately. For example, if a word appears in the reposted source text, in the post text, and in the comments, it is very likely a keyword that represents the post, whatever its TF-IDF value. In the body of the post, a tag field of the form "#topic#" is also a summary of the theme and often captures the theme that the post intends to express.
To handle these cases, the traditional TF-IDF weighting method is modified so that it better suits the construction of the vector space of microblog text. It is calculated as follows:
$$tf_{ij} = \frac{n_{i,j}}{\sum_k n_{i,k}}$$ Formula (1)
$$n_{i,j} = n\_post_{i,j} + n\_hash_{i,j} \times w_{hashtag} + n\_url_{i,j} \times w_{url}$$
In Formula (1), $tf_{ij}$ is the term frequency of feature word j in post i, and $n_{i,j}$ is the number of times feature word j occurs in post i. $n\_post_{i,j}$ is the number of times feature word j occurs in the text data of post i (including reposts and comments, excluding hashtags and URLs). $n\_hash_{i,j}$ and $n\_url_{i,j}$ are the numbers of times feature word j occurs in the hashtags and in the URLs of post i, and $w_{hashtag}$ and $w_{url}$ are the corresponding weights. $\sum_k n_{i,k}$ is the total number of words in post i.
Formula (2)
In Formula (2), N is the total number of posts, n is the number of posts in which feature word j occurs, and 0.01 is a constant that prevents the idf result from being 0.
$$V_{ij} = tf_{ij} \times idf_j$$ Formula (3)
This yields the formalized text. After formalization, each post corresponds to a multidimensional term vector $W_i$:
$$W_i \sim (V_{i1}, V_{i2}, \ldots, V_{ik})$$ Formula (4)
In Formula (4), k is the dimension of the term vector, and $V_{ij}$ is the TF-IDF weight of feature word j in post i, obtained from Formula (3).
Step 2: Topic generation within time windows
1. The preprocessed data are discretized by their time attribute into several time-dependent blocks, each corresponding to a time window on the time axis. The set in time window t is $S_t = \{W_1, W_2, \ldots, W_{M_t}\}$. The number of documents $M_t$ in each time window depends on the actual information flow and may differ from window to window.
2. Reduce sparsity. Microblog posts are mostly short sentences or even phrases; because their content is sparse, they are clustered on the basis of their term vectors. In time window t, the term vectors $W_j$ in $S_t$ are clustered with the K-means algorithm. If the clustering yields K classes, the posts in each class are merged into a single document, giving K synthetic microblog documents $D_t$.
3. For the microblog text $D_t$ cut into time slices, the text collection in each time window is processed in turn. The LDA (Latent Dirichlet Allocation) model proposed by D. M. Blei in 2003 is used for topic modeling; several topics T are extracted, and the content and intensity of each topic are obtained. The detailed process is shown in Figure 2.
The number of topics generated may differ from window to window. The number of topics N is determined dynamically by a model selection method from the microblog text in each time window:
Formula (4)
where Γ(·) is the standard Gamma function, the per-word count term is the number of times word w is assigned to topic j, and $n_j$ is the total number of words assigned to topic j. The value of N that minimizes p(w|z) in the above formula is taken as the optimal number of topics.
4. The posterior probability of the previous time window is used to influence the prior probability of the current time window. This maintains continuity between topics and accounts for the fact that a topic that has already appeared may, with some probability, appear again in subsequent time windows. A discretize-first approach based on unconditional dependence is used: for the current time window t, the word distribution of time window t-1, together with a weight w, serves as the prior for the word distribution in time window t.
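The exact form of the prior is not spelled out in the text above, so the following is only a sketch of one plausible reading of item 4: each topic's word distribution from window t-1 is scaled by the weight w and added to a small base value to give the Dirichlet prior parameters for window t. The function name `next_window_prior`, the `base` smoothing term, and the additive form are all assumptions.

```python
def next_window_prior(prev_topic_word, vocabulary, w=0.5, base=0.01):
    """Build per-topic word priors for window t from window t-1.

    prev_topic_word -- list of dicts, one per topic of window t-1,
                       mapping word -> posterior probability
    vocabulary      -- words of window t
    Returns a list of dicts mapping word -> prior parameter.
    """
    prior = []
    for phi in prev_topic_word:
        # Assumed form: base smoothing plus the weighted previous posterior.
        # Words unseen in window t-1 receive only the base value.
        prior.append({word: base + w * phi.get(word, 0.0) for word in vocabulary})
    return prior
```

A larger w makes topics in window t stay closer to those of window t-1, while w = 0 reduces to an uninformed symmetric prior.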
Step 3: Topic association analysis between time windows
Topic evolution refers mainly to how topics with the same semantics change over different time periods, including the disappearance of old topics and the emergence of new ones. The relationship between the topic contents of different time windows, comprising inheritance and homogeneity, therefore needs to be analyzed to obtain the trend of each topic. Inheritance between topics is measured by semantic similarity, and homogeneity is measured by similarity of the microblog vector information.
1. Topic inheritance between windows: inheritance between topics shows as similarity of topic content and is measured by a semantic similarity algorithm.
2. Homogeneity between window topics: high semantic similarity between two topics does not by itself indicate that they form a trend of topic change. To avoid matches that are coupled purely at the semantic level without actually describing the same topic, a weighted combination of similarities is used to measure the homogeneity between topics. The algorithm combines the different ideas and perspectives of two similarity measures, cosine similarity and the Jaccard coefficient, avoiding the shortcomings of relying on any single method. It also keeps the similarity within the interval [0,1], where a larger value indicates higher similarity.
$$Sim_{inh}(T_1, T_2) = Sim_{cos}(T_1, T_2) \times \alpha + Sim_{jac}(T_1, T_2) \times \beta$$ Formula (6)
In the formula, $Sim_{cos}(T_1, T_2)$ and $Sim_{jac}(T_1, T_2)$ are the similarities between topic $T_1$ in time window 1 and topic $T_2$ in time window 2 under the cosine similarity and the Jaccard coefficient, respectively. α and β are weighting coefficients that reflect how much each of the two similarities contributes to the overall similarity.
Combining the inheritance and homogeneity measures between topics gives the combined similarity used to judge whether topics are associated:
$$Sim_{com}(T_1, T_2) = Sim_{inh}(T_1, T_2) \times \lambda + Sim_{sem}(T_1, T_2) \times \mu$$ Formula (7)
where $Sim_{sem}(T_1, T_2)$ and $Sim_{inh}(T_1, T_2)$ are the measures of inheritance and homogeneity between topics, respectively, and λ and μ are weighting coefficients.
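Formulas (6) and (7) can be sketched as follows. The sketch assumes that each topic is represented as a dictionary mapping words to non-negative weights, that the Jaccard coefficient is computed over the word sets, and that the coefficients satisfy α + β = 1 and λ + μ = 1 so that the results stay in [0,1]. Because the semantic similarity algorithm is not specified, $Sim_{sem}$ is passed in as a precomputed number.

```python
import math


def cosine_sim(a, b):
    """Cosine similarity of two sparse word-weight vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def jaccard_sim(a, b):
    """Jaccard coefficient of the word sets of two topics."""
    union = a.keys() | b.keys()
    return len(a.keys() & b.keys()) / len(union) if union else 0.0


def sim_inh(a, b, alpha=0.5, beta=0.5):
    """Formula (6): weighted combination of cosine and Jaccard similarity."""
    return alpha * cosine_sim(a, b) + beta * jaccard_sim(a, b)


def sim_com(a, b, sim_sem, lam=0.5, mu=0.5, alpha=0.5, beta=0.5):
    """Formula (7): combine Sim_inh with a precomputed Sim_sem value."""
    return lam * sim_inh(a, b, alpha, beta) + mu * sim_sem
```

The equal default weights are placeholders; in practice they would be tuned on labeled topic pairs.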
3. Association analysis between window topics: window topics that are related and ordered in time are combined into one topic. According to the changes in the content and intensity of the window topics, the topic is divided into several stages from emergence to extinction, which describes its evolution.
In the association analysis, each window topic $T_i$ is compared with the topic $T_{i-1}$ of the preceding time window and the topic $T_{i+1}$ of the following time window to identify newly emerging topics. $Sim_{com}(T_i, T_{i+1}) < \varepsilon$ indicates that $T_i$ is an old topic that disappears, and $Sim_{com}(T_i, T_{i-1}) \geq \varepsilon$ indicates that the topic has been inherited. This procedure yields the course of a topic from emergence to extinction. Applied to a microblog platform, the topic detection and tracking method can draw on the collective activity of users, track hot topics quickly, and update topic popularity, making up for the weakness of traditional media in tracking and analyzing hot topics in real time.
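The threshold test on the combined similarity can be illustrated with a small sketch. It assumes that a window may contain several topics, that a topic counts as inherited if its similarity to at least one topic of the preceding window reaches ε, and that a topic with no such predecessor is treated as new; this last rule is an interpretation rather than a statement from the text. The function name `link_topics` and the event dictionary layout are hypothetical.

```python
def link_topics(windows, sim, eps=0.5):
    """Label each window topic as new/inherited and continuing/ending.

    windows -- list of topic lists, in time order
    sim     -- function (topic_a, topic_b) -> combined similarity in [0, 1]
    eps     -- association threshold
    """
    events = []
    for t, topics in enumerate(windows):
        prev = windows[t - 1] if t > 0 else []
        nxt = windows[t + 1] if t + 1 < len(windows) else []
        for i, topic in enumerate(topics):
            # Inherited: similarity to some topic of the preceding window >= eps.
            inherited = any(sim(topic, p) >= eps for p in prev)
            # Continued: similarity to some topic of the following window >= eps.
            continued = any(sim(topic, n) >= eps for n in nxt)
            events.append({
                "window": t,
                "topic": i,
                "new": not inherited,   # assumed rule for newly emerging topics
                "ends": not continued,  # below eps everywhere: topic disappears
            })
    return events
```

Topics in the last window are always marked as ending, simply because no later window is available yet; in a streaming setting this flag would be revised once the next window arrives.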