CN110096704A - A dynamic topic discovery algorithm for short text streams - Google Patents

A dynamic topic discovery algorithm for short text streams Download PDF

Info

Publication number
CN110096704A
CN110096704A · Application CN201910354228.3A · Granted as CN110096704B
Authority
CN
China
Prior art keywords
theme
document
doc
formula
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910354228.3A
Other languages
Chinese (zh)
Other versions
CN110096704B (en)
Inventor
强继朋
李云
袁运浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yangzhou University
Original Assignee
Yangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yangzhou University filed Critical Yangzhou University
Priority to CN201910354228.3A priority Critical patent/CN110096704B/en
Publication of CN110096704A publication Critical patent/CN110096704A/en
Application granted granted Critical
Publication of CN110096704B publication Critical patent/CN110096704B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities

Abstract

The invention discloses a dynamic topic discovery algorithm for short text streams in the field of topic models, carried out as follows: step 1, initialize the topic of every document in the document set at the 1st time point; step 2, iteratively learn the topic of every document in the document set at the 1st time point; step 3, obtain the topic distribution and the word distribution of each topic at the 1st time point; step 4, initialize the topic of every document in the document set at time point t (t > 1); step 5, iteratively learn the topic of every document in the document set at time point t; step 6, obtain the topic distribution and the word distribution of each topic at time point t; step 7, delete the topics of time point t-1; step 8, for the document sets at subsequent time points, learn in turn using steps 4, 5, 6 and 7. The invention fully accounts for the sparsity of short texts and learns jointly with the topic distributions of the documents at the previous time point, so that the latent topics in a short text stream can be discovered more effectively.

Description

A dynamic topic discovery algorithm for short text streams
Technical field
The present invention relates to topic models, and in particular to a dynamic topic discovery algorithm for short text streams.
Background technique
In recent years, short-text content on the Internet, such as microblogs and comments, has grown explosively, and topic models that can quickly analyze short text streams have attracted increasing attention. A topic model for short text streams can discover the topics hidden in the stream and the topic distribution of each document, and can be applied to short text classification, event detection and tracking, document summarization, and so on. Topic discovery over short text streams faces the following challenges: (1) the sparsity of short texts; (2) texts arrive continuously, so storing all texts and iterating over them, as methods for static text do, is infeasible for a short text stream; (3) the topics in a text stream keep evolving, so new topics should be detected automatically and topics that are never updated should be removed.
Existing topic models for short text streams either fail to solve the problem of topic drift or do not exploit the correlation between the topics of document sets at different time points, so the topics finally discovered are often unsatisfactory to users. The existing topic model algorithms for short text streams can be roughly divided into two categories: those based on dynamic Dirichlet multinomial mixture models and those based on Dirichlet process multinomial mixture models. Both categories use a Dirichlet distribution as the prior over the latent topics and then sample from a multinomial distribution. The first category considers the correlation between the topics of document sets at different time points, i.e., many topics of the previous time point continue to appear at the current time point; but because the number of topics must be specified in advance, it cannot solve the problem of topic drift. The second category does not require the number of topics to be specified in advance, but it ignores the correlation between the topics of document sets at different time points.
Summary of the invention
The object of the present invention is to provide a dynamic topic discovery algorithm for short text streams that can handle the sparsity of short texts and the problem of topic drift, and that fully exploits the correlation between the topics of the previous time point and those of the current time point, thereby effectively improving the accuracy of topic discovery over short text streams.
The object of the present invention is achieved as follows: a dynamic topic discovery algorithm for short text streams, characterized in that the document set of the short text stream is assumed to be D = {D_1, D_2, …, D_{t-1}, D_t, …}, where D_t denotes the text collection arriving at time t. Each topic k in D_t is represented by the variables m_{t,k}, n_{t,k}, doc_{t,k} and n^w_{t,k}, where m_{t,k} is the number of documents in D_t belonging to topic k, n_{t,k} is the total number of words belonging to topic k at time t, doc_{t,k} is the set of documents in topic k at time t, and n^w_{t,k} is the total number of times word w belongs to topic k at time t. The topic set of D_t is denoted TS_t; it records the topic labels contained in D_t and is initially the empty set. The discovery algorithm proceeds as follows:
Step 1: the short document set at the 1st time point (t=1) is D_1 = {d_1, d_2, …, d_{B_1}}, containing B_1 short documents; create the topic set TS_1 of D_1; for D_1, learn the topic of every document d_m (1 ≤ m ≤ B_1) in turn by the initialization method.
Step 2: iteratively learn the latent topic of every document in the document set D_1 of the 1st time point (t=1), where the iteration number i runs from 1 to I; I is a user-set parameter. In the i-th iteration, relearn the topic label of every document d_m in D_1, recomputing it according to the sub-steps below.
Step 3: assuming TS_1 has K topics, infer the topic distribution θ_1 at the 1st time point (t=1) and the word distribution of each topic, Θ_1 = {φ_{1,1}, φ_{1,2}, …, φ_{1,k}, …, φ_{1,K}}.
Step 4: process the document set of the next time point, i.e., t = t+1. Assume the topic set TS_{t-1} of D_{t-1} has K topics; the short document set at the current time point t (t ≥ 2) is D_t = {d_1, d_2, …, d_{B_t}}, containing B_t short documents. Set TS_t to the empty set; for D_t, learn the topic of every document d_m (1 ≤ m ≤ B_t) in turn by the initialization method.
Step 5: iteratively learn the topic of every document in the document set D_t, where the iteration number i runs from 1 to I. In the i-th iteration, relearn the topic label of every document d_m in D_t, recomputing it according to the steps below.
Step 6: compute the topic distribution θ_t of D_t and the word distribution of each topic, Θ_t = {φ_{t,1}, φ_{t,2}, …, φ_{t,k}, …, φ_{t,K}}, where the topic distribution θ_t uses formula (12) and the word distribution φ_{t,k} of topic k uses formula (13).
Step 7: delete the topics corresponding to D_{t-1} and the values of the variables m_{t-1,k}, n_{t-1,k}, doc_{t-1,k} and n^w_{t-1,k} under each such topic.
Step 8: for the document sets at subsequent time points, learn in turn using steps 4, 5, 6 and 7.
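The eight steps above can be sketched end to end. Since the patent's formulas (1)-(13) are published only as images, the sampling and update expressions below follow the standard collapsed Gibbs sampler of a Dirichlet process multinomial mixture, with the previous time point's counts folded in as a prior; the class name, data layout, and exact expressions are assumptions for illustration, not the patent's verbatim formulas.

```python
import random
from collections import Counter

class StreamDPMM:
    """Sketch of the streaming algorithm of steps 1-8 (assumed DPMM forms)."""

    def __init__(self, alpha=0.1, beta=0.1, vocab_size=50, iters=20, seed=0):
        self.alpha, self.beta, self.V, self.I = alpha, beta, vocab_size, iters
        self.rng = random.Random(seed)
        # current-time statistics: m_{t,k}, n_{t,k}, n^w_{t,k}, doc_{t,k}
        self.m, self.n, self.nw, self.doc = {}, {}, {}, {}
        # previous-time statistics (m_{t-1,k}, ...), empty at t = 1
        self.pm, self.pn, self.pnw = {}, {}, {}
        self.next_k = 0

    def _prob(self, cnts, k, B):
        # Unnormalized membership probability; k=None means "open a new topic".
        denom_docs = B - 1 + sum(self.pm.values()) + self.alpha
        if k is None:
            p, n_k = self.alpha / denom_docs, 0
            get_w = lambda w: 0
        else:
            p = (self.m.get(k, 0) + self.pm.get(k, 0)) / denom_docs
            n_k = self.n.get(k, 0) + self.pn.get(k, 0)
            cw, pw = self.nw.get(k, {}), self.pnw.get(k, {})
            get_w = lambda w: cw.get(w, 0) + pw.get(w, 0)
        pos = 0
        for w, c in cnts.items():
            for j in range(c):
                p *= (get_w(w) + self.beta + j) / (n_k + self.V * self.beta + pos)
                pos += 1
        return p

    def _add(self, d, cnts, k):        # the role of formulas (1), (4), (10)
        self.m[k] = self.m.get(k, 0) + 1
        self.n[k] = self.n.get(k, 0) + sum(cnts.values())
        bag = self.nw.setdefault(k, {})
        for w, c in cnts.items():
            bag[w] = bag.get(w, 0) + c
        self.doc.setdefault(k, set()).add(d)

    def _remove(self, d, cnts, k):     # the role of formulas (5), (11)
        self.m[k] -= 1
        self.n[k] -= sum(cnts.values())
        for w, c in cnts.items():
            self.nw[k][w] -= c
        self.doc[k].discard(d)
        if not self.doc[k]:            # topic emptied: drop it from TS_t
            for s in (self.m, self.n, self.nw, self.doc):
                del s[k]

    def _sample(self, cnts, B):
        # Candidate topics: those alive now plus those carried from t-1.
        ks = sorted(set(self.m) | set(self.pm)) + [None]
        ps = [self._prob(cnts, k, B) for k in ks]
        r = self.rng.random() * sum(ps)
        for k, p in zip(ks, ps):
            r -= p
            if r <= 0:
                break
        if k is None:                  # brand-new topic label (steps 1.4/4.3)
            k, self.next_k = self.next_k, self.next_k + 1
        return k

    def process(self, docs):
        """One time point: initialization pass (steps 1/4), Gibbs iterations
        (steps 2/5), distribution estimates (steps 3/6), then shift the
        statistics and discard the t-1 variables (step 7)."""
        counts = [Counter(d) for d in docs]
        self.next_k = max(list(self.m) + list(self.pm), default=-1) + 1
        labels = []
        for i, c in enumerate(counts):           # initialization
            k = self._sample(c, len(docs))
            self._add(i, c, k)
            labels.append(k)
        for _ in range(self.I):                  # iterative relearning
            for i, c in enumerate(counts):
                self._remove(i, c, labels[i])
                labels[i] = self._sample(c, len(docs))
                self._add(i, c, labels[i])
        theta = {k: self.m[k] / len(docs) for k in self.m}
        phi = {k: {w: (self.nw[k][w] + self.beta) / (self.n[k] + self.V * self.beta)
                   for w in self.nw[k]} for k in self.m}
        self.pm, self.pn, self.pnw = self.m, self.n, self.nw   # step 7
        self.m, self.n, self.nw, self.doc = {}, {}, {}, {}
        return labels, theta, phi
```

Calling `process(batch)` once per arriving batch returns the topic label of every document plus the estimated θ_t and Φ_t, and then discards the t-1 statistics in the spirit of step 7.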
As a further refinement of the invention, step 1 specifically includes:
Step 1.1: set the topic label k of the first document d_1 to 0; create topic k=0 and the variables m_{1,0}, n_{1,0}, doc_{1,0} and n^w_{1,0} representing topic k=0; using formula (1), update the variables m_{1,0}, n_{1,0}, doc_{1,0} and n^w_{1,0} corresponding to topic k=0, and add topic 0 to the topic set TS_1, i.e., TS_1 = {0}.
In formula (1), N_{d_1} is the number of words in document d_1 and N^w_{d_1} is the number of occurrences of word w in document d_1.
Step 1.2: for D_1, learn the topic label of each document d_m (2 ≤ m ≤ B_1) in turn. Assume the number of topics in the current topic set TS_1 is K. During initialization, the probability that document d_m selects topic k (1 ≤ k ≤ K) from the K topics as its topic label is given by formula (2).
In formula (2), z_{d_m} denotes the topic label of document d_m, α and β are user-set parameters, N_{d_m} is the number of words in document d_m, N^w_{d_m} is the number of occurrences of word w in document d_m, and V is the size of the vocabulary.
Step 1.3: during initialization, the probability that document d_m selects a new topic label K+1 as its topic label is given by formula (3).
Step 1.4: sample the topic label k of document d_m according to the K+1 probability values above. If k = K+1, create topic k and the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k} representing topic k, where m_{1,k}, n_{1,k} and n^w_{1,k} are initialized to 0 and doc_{1,k} to the empty set, and add k to the topic set, i.e., TS_1 = {TS_1, k}. Finally, update the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k} using formula (4).
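For a concrete sense of steps 1.2-1.4, the functions below compute the two membership probabilities in the form standard for Dirichlet process multinomial mixtures; the actual formulas (2) and (3) are published only as images, so these expressions, and the function names, are assumptions for illustration.

```python
import random

def existing_topic_prob(word_counts, m_k, n_k, nw_k, B, V, alpha, beta):
    # Unnormalized probability that document d_m joins an existing topic k
    # (the role of formula (2)): the first factor favors topics that already
    # hold more documents; the word product favors topics whose vocabulary
    # overlaps the document's.
    p = m_k / (B - 1 + alpha)
    pos = 0
    for w, cnt in word_counts.items():
        for j in range(cnt):
            p *= (nw_k.get(w, 0) + beta + j) / (n_k + V * beta + pos)
            pos += 1
    return p

def new_topic_prob(word_counts, B, V, alpha, beta):
    # Unnormalized probability of opening a new topic K+1 (formula (3)'s role).
    p = alpha / (B - 1 + alpha)
    pos = 0
    for w, cnt in word_counts.items():
        for j in range(cnt):
            p *= (beta + j) / (V * beta + pos)
            pos += 1
    return p

def sample_label(probs, rng=random):
    # Step 1.4: draw a label index proportionally to the K+1 probabilities.
    r = rng.random() * sum(probs)
    for k, p in enumerate(probs):
        r -= p
        if r <= 0:
            return k
    return len(probs) - 1
```

A topic that already contains the document's words scores higher than one that does not, which is exactly the "shares more similarity" tendency the description attributes to formula (2).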
As a further refinement of the invention, step 2 specifically includes:
Step 2.1: assume the topic label originally assigned to document d_m is k; first use formula (5) to update the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k}. If, after the update, doc_{1,k} becomes the empty set, remove topic label k from the topic set TS_1 and remove the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k} associated with topic k.
Step 2.2: recompute, using formulas (2) and (3), the probabilities that document d_m belongs to the existing topics and to a new topic, and sample a new topic label k for document d_m. If k = K+1, create topic k and the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k} representing topic k, where m_{1,k}, n_{1,k} and n^w_{1,k} are initialized to 0 and doc_{1,k} to the empty set, and add k to the topic set, i.e., TS_1 = {TS_1, k}. Then update the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k} using formula (4).
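The bookkeeping of step 2.1 — subtract a document's counts from its topic and delete the topic once empty — can be shown in isolation. The `stats` layout (one dict per topic with keys 'm', 'n', 'nw', 'docs') is an assumed representation of the variables m_{1,k}, n_{1,k}, n^w_{1,k} and doc_{1,k}, not taken from the patent.

```python
from collections import Counter

def remove_document(stats, doc_id, word_counts, k):
    """Step 2.1 bookkeeping (the role of formula (5)): subtract one document's
    counts from topic k and delete the topic entirely if it becomes empty."""
    s = stats[k]
    s['m'] -= 1
    s['n'] -= sum(word_counts.values())
    for w, c in word_counts.items():
        s['nw'][w] -= c
    s['docs'].discard(doc_id)
    if not s['docs']:          # doc_{1,k} became empty: remove topic k
        del stats[k]

# Two identical two-word documents in topic 0; remove them one by one.
stats = {0: {'m': 2, 'n': 4, 'nw': {'a': 2, 'b': 2}, 'docs': {0, 1}}}
remove_document(stats, 0, Counter({'a': 1, 'b': 1}), 0)
remove_document(stats, 1, Counter({'a': 1, 'b': 1}), 0)
```

After the second removal, topic 0 holds no documents and has been dropped, which is how unused topics disappear during the Gibbs iterations.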
As a further refinement of the invention, step 3 specifically includes:
Step 3.1: the component θ_{1,k} of the topic distribution θ_1 = {θ_{1,1}, θ_{1,2}, …, θ_{1,k}, …, θ_{1,K}} is computed as follows.
Step 3.2: the component φ^w_{1,k} of the word distribution φ_{1,k} of topic k is computed as follows.
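Step 3's point estimates can be sketched directly from the per-topic statistics. The patent's exact formulas are published as images, so the standard mixture-model estimates below — θ_{1,k} as the fraction of documents in topic k, φ_{1,k} as β-smoothed word frequencies — are an assumption.

```python
def estimate_distributions(m, n, nw, B, V, beta):
    """Assumed point estimates after sampling: theta_k = m_k / B and
    phi_k(w) = (n^w_k + beta) / (n_k + V * beta)."""
    theta = {k: m[k] / B for k in m}
    phi = {k: {w: (nw[k][w] + beta) / (n[k] + V * beta) for w in nw[k]}
           for k in m}
    return theta, phi

# Four documents split evenly over two topics; topic 0 holds 'a', topic 1 'b'.
theta, phi = estimate_distributions(
    m={0: 2, 1: 2}, n={0: 3, 1: 1},
    nw={0: {'a': 3}, 1: {'b': 1}}, B=4, V=2, beta=0.5)
```

Here θ sums to 1 over topics, and the smoothing keeps every φ value strictly positive even for rarely seen words.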
As a further refinement of the invention, step 4 specifically includes:
Step 4.1: during the initialization of D_t, the probability that document d_m selects topic k (1 ≤ k ≤ K) from the K topics as its topic label is given by formula (8).
In formula (8), m_{t-1,k} is the number of documents in D_{t-1} of the previous time point that belong to topic k, and n^w_{t-1,k} is the total number of occurrences of word w in topic k in D_{t-1} of the previous time point.
Step 4.2: during initialization, the probability that document d_m selects a new topic label K+1 as its topic label is given by formula (9).
Step 4.3: sample the topic label k of document d_m according to the K+1 probability values above. If TS_t does not contain topic k, add it to the topic set, i.e., TS_t = {TS_t, k}. Then update the variables m_{t,k}, n_{t,k}, doc_{t,k} and n^w_{t,k} using formula (10).
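The distinguishing feature of step 4 is that the previous time point's counts act as a prior on the current membership probability. Formula (8) itself is published as an image, so the expression below is an assumed form in the spirit of the standard collapsed DPMM sampler; the function name and the `cur`/`prev` layout are illustrative.

```python
def prob_topic_with_prior(word_counts, k, cur, prev, B, V, alpha, beta):
    """Unnormalized probability that a document at time t joins topic k,
    combining current-time counts with the previous time point's counts
    (the role formula (8) plays). cur and prev map each topic k to a dict
    with keys 'm' (doc count), 'n' (word count), 'nw' (per-word counts)."""
    c = cur.get(k, {'m': 0, 'n': 0, 'nw': {}})
    p0 = prev.get(k, {'m': 0, 'n': 0, 'nw': {}})
    prior_docs = sum(v['m'] for v in prev.values())
    p = (c['m'] + p0['m']) / (B - 1 + prior_docs + alpha)
    n_k, pos = c['n'] + p0['n'], 0
    for w, cnt in word_counts.items():
        for j in range(cnt):
            nw = c['nw'].get(w, 0) + p0['nw'].get(w, 0)
            p *= (nw + beta + j) / (n_k + V * beta + pos)
            pos += 1
    return p
```

With this form, a topic that dominated time t-1 and matches the new document's words is preferred over an equally large carried-over topic with no word overlap, which is the continuity property the description claims for formula (8).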
As a further refinement of the invention, step 5 specifically includes:
Step 5.1: assume the topic label originally assigned to document d_m is k; first use formula (11) to update the variables m_{t,k}, n_{t,k}, doc_{t,k} and n^w_{t,k}. If, after the update, doc_{t,k} becomes the empty set, remove topic label k from the topic set TS_t and remove the variables m_{t,k}, n_{t,k}, doc_{t,k} and n^w_{t,k} associated with topic k.
Step 5.2: compute, using formulas (8) and (9), the probabilities that document d_m belongs to the topics in TS_t and to a new topic.
Step 5.3: sample a new topic label k for document d_m according to the probability values computed in the previous step. If a new topic is selected, i.e., k = K+1, add k to the topic set, i.e., TS_t = {TS_t, k}. Then update the variables m_{t,k}, n_{t,k}, doc_{t,k} and n^w_{t,k} using formula (10).
As a further refinement of the invention, step 6 specifically includes:
Step 6.1: the component θ_{t,k} of the topic distribution θ_t = {θ_{t,1}, θ_{t,2}, …, θ_{t,k}, …, θ_{t,K}} is computed by formula (12).
Step 6.2: the component φ^w_{t,k} of the word distribution φ_{t,k} of topic k is computed by formula (13).
Compared with the prior art, the invention has the following advantages:
1. When learning the latent topics of each time point, the present invention can learn the number of topics by itself through the Dirichlet process, without manually specifying the number of latent topics. This overcomes the difficulty many traditional topic models have in determining the number of topics and greatly enhances the applicability and convenience of the model.
2. When learning the latent topics of the short document set at the current time point, the present invention can combine the latent topic distribution of the previous time point with that of the current time point, overcoming the limitation of some conventional models that consider only the current time point; a topic that is important at the previous time point is very likely also important at the current time point, so this effectively improves the performance of the algorithm.
3. When learning the latent topics of the short document set at the current time point, the present invention considers only the latent topic distribution of the previous time point and ignores the latent topic distributions of earlier time points, which greatly improves the efficiency of the algorithm on massive text streams and overcomes the poor timeliness of conventional models that process the whole document collection.
4. The present invention can automatically identify new topics and automatically delete outdated topics, which not only improves the algorithm's handling of topic drift but also prevents the number of topics from growing without bound as data keeps arriving, greatly improving the practicality of the algorithm.
Specific embodiment
The present invention will be further described below in combination with specific embodiments.
In this embodiment, a dynamic topic discovery algorithm for short text streams is carried out as follows:
Assume the document set of the short text stream is D = {D_1, D_2, …, D_{t-1}, D_t, …}, where D_t denotes the text collection arriving at time t. The document set of each time point learns its topic labels using a Dirichlet process mixture model, whose generative process is constructed with the stick-breaking construction. For a document d, the generative process is as follows:
θ | α ~ GEM(α)
φ_k | β ~ Dir(β), k = 1, …, ∞
z_d | θ ~ Mult(θ), d = 1, …, ∞
Here, GEM is the Griffiths-Engen-McCloskey distribution, Dir is the Dirichlet distribution, and Mult is the multinomial distribution; α and β are hyperparameters, θ denotes the topic distribution of documents, φ_k denotes the word distribution of topic k, and z_d denotes the topic label of document d. It is assumed that each document belongs to only one topic; this assumption suits short document collections, because each short document carries only very limited information. Given the topic label z_d, the probability of generating document d is computed as follows.
The present method learns the latent variables θ, φ and z_d by integrating out θ and φ and inferring z_d through collapsed Gibbs sampling. Finally, θ and φ are computed from z_d.
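The GEM prior in the generative process above can be simulated with the stick-breaking construction: break a unit-length stick repeatedly, where each break fraction is Beta(1, α)-distributed. The sketch below is purely illustrative (truncated to a finite number of sticks, although the process itself is infinite); the function name is hypothetical.

```python
import random

def stick_breaking_weights(alpha, num_sticks, seed=0):
    """Draw the first num_sticks weights of theta ~ GEM(alpha):
    v_k ~ Beta(1, alpha) and theta_k = v_k * prod_{j<k} (1 - v_j)."""
    rng = random.Random(seed)
    weights, remaining = [], 1.0
    for _ in range(num_sticks):
        v = rng.betavariate(1.0, alpha)
        weights.append(v * remaining)   # take a v-fraction of what is left
        remaining *= 1.0 - v            # the rest stays for later topics
    return weights

weights = stick_breaking_weights(alpha=2.0, num_sticks=200)
```

The weights are nonnegative and sum to at most 1, with almost all mass on the first few hundred sticks; a smaller α concentrates the mass on fewer topics, which is why α controls how readily new topics appear.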
Each topic k in D_t is represented by the variables m_{t,k}, n_{t,k}, doc_{t,k} and n^w_{t,k}, where m_{t,k} is the number of documents in D_t belonging to topic k, n_{t,k} is the total number of words belonging to topic k at time t, doc_{t,k} is the set of documents in topic k at time t, and n^w_{t,k} is the total number of times word w belongs to topic k at time t. The topic set of D_t is denoted TS_t; it records the topic labels contained in D_t and is initially the empty set.
Step 1: the short document set at the 1st time point (t=1) is D_1 = {d_1, d_2, …, d_{B_1}}, containing B_1 short documents; create the topic set TS_1 of D_1; for D_1, learn the topic of every document d_m (1 ≤ m ≤ B_1) in turn by the initialization method. It should be emphasized that the algorithm does not need the number of topics to be specified in advance.
Step 1.1: set the topic label k of the first document d_1 to 0; create topic k=0 and the variables m_{1,0}, n_{1,0}, doc_{1,0} and n^w_{1,0} representing topic k=0; using formula (1), update these variables and add topic 0 to the topic set TS_1, i.e., TS_1 = {0}.
In formula (1), N_{d_1} is the number of words in document d_1 and N^w_{d_1} is the number of occurrences of word w in document d_1.
Step 1.2: for D_1, learn the topic label of each document d_m (2 ≤ m ≤ B_1) in turn. Assume the number of topics in the current topic set TS_1 is K. During initialization, the probability that document d_m selects topic k (1 ≤ k ≤ K) from the K topics as its topic label is given by formula (2).
In formula (2), z_{d_m} denotes the topic label of document d_m, α and β are user-set parameters, N_{d_m} is the number of words in document d_m, N^w_{d_m} is the number of occurrences of word w in document d_m, and V is the size of the vocabulary. In this example, α and β are both set to 0.1. The first half of formula (2) means that each document tends to select a topic to which more documents already belong, and the second half means that it tends to select a topic that shares more similarity with the document. Since the text collection of the first time point is not preceded by any other collection, learning can only be based on this collection itself.
Step 1.3: during initialization, the probability that document d_m selects a new topic label K+1 as its topic label is given by formula (3).
Step 1.4: sample the topic label k of document d_m according to the K+1 probability values above. If k = K+1, create topic k and the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k} representing topic k, where m_{1,k}, n_{1,k} and n^w_{1,k} are initialized to 0 and doc_{1,k} to the empty set, and add k to the topic set, i.e., TS_1 = {TS_1, k}. Finally, update the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k} using formula (4).
Step 2: iteratively learn the latent topic of every document in the document set D_1 of the 1st time point (t=1), where the iteration number i runs from 1 to I; I is a user-set parameter, set to 100 in this example. In the i-th iteration, relearn the topic label of every document d_m in D_1, recomputing it according to the following sub-steps:
Step 2.1: assume the topic label originally assigned to document d_m is k; first use formula (5) to update the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k}. If, after the update, doc_{1,k} becomes the empty set, remove topic label k from the topic set TS_1 and remove the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k} associated with topic k.
Step 2.2: recompute, using formulas (2) and (3), the probabilities that document d_m belongs to the existing topics and to a new topic, and sample a new topic label k for document d_m. If k = K+1, create topic k and the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k} representing topic k, where m_{1,k}, n_{1,k} and n^w_{1,k} are initialized to 0 and doc_{1,k} to the empty set, and add k to the topic set, i.e., TS_1 = {TS_1, k}. Then update the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k} using formula (4).
Step 3: assuming TS_1 has K topics (i.e., K is the number of topics in TS_1), infer the topic distribution θ_1 at the 1st time point (t=1) and the word distribution of each topic, Θ_1 = {φ_{1,1}, φ_{1,2}, …, φ_{1,k}, …, φ_{1,K}}:
Step 3.1: the component θ_{1,k} of the topic distribution θ_1 = {θ_{1,1}, θ_{1,2}, …, θ_{1,k}, …, θ_{1,K}} is computed as follows.
Step 3.2: the component φ^w_{1,k} of the word distribution φ_{1,k} of topic k is computed as follows.
Step 4: process the document set of the next time point, i.e., t = t+1. Assume the topic set TS_{t-1} of D_{t-1} has K topics; the short document set at the current time point t (t ≥ 2) is D_t = {d_1, d_2, …, d_{B_t}}, containing B_t short documents. Set TS_t to the empty set; for D_t, learn the topic of every document d_m (1 ≤ m ≤ B_t) in turn by the initialization method.
Step 4.1: during the initialization of D_t, the probability that document d_m selects topic k (1 ≤ k ≤ K) from the K topics as its topic label is given by formula (8).
In formula (8), m_{t-1,k} is the number of documents in D_{t-1} of the previous time point that belong to topic k, and n^w_{t-1,k} is the total number of occurrences of word w in topic k in D_{t-1} of the previous time point. Unlike formula (2), formula (8) also considers the topic distribution of the previous time point's documents, because across two consecutive time points many topics are correlated.
Step 4.2: during initialization, the probability that document d_m selects a new topic label K+1 as its topic label is given by formula (9).
Formula (9) ensures that the topics of the current time point's document set are not drawn solely from the topic set of the previous time point; new topics can also be learned, thus accommodating topic drift.
Step 4.3: sample the topic label k of document d_m according to the K+1 probability values above. If TS_t does not contain topic k, add it to the topic set, i.e., TS_t = {TS_t, k}. Then update the variables m_{t,k}, n_{t,k}, doc_{t,k} and n^w_{t,k} using formula (10).
Step 5: iteratively learn the topic of every document in the document set D_t, where the iteration number i runs from 1 to I. In the i-th iteration, relearn the topic label of every document d_m in D_t, recomputing it according to the following steps:
Step 5.1: assume the topic label originally assigned to document d_m is k; first use formula (11) to update the variables m_{t,k}, n_{t,k}, doc_{t,k} and n^w_{t,k}. If, after the update, doc_{t,k} becomes the empty set, remove topic label k from the topic set TS_t and remove the variables m_{t,k}, n_{t,k}, doc_{t,k} and n^w_{t,k} associated with topic k. This step deletes topics that are no longer updated, so the number of topics does not grow without bound.
Step 5.2: compute, using formulas (8) and (9), the probabilities that document d_m belongs to the topics in TS_t and to a new topic.
Step 5.3: sample a new topic label k for document d_m according to the probability values computed in the previous step. If a new topic is selected, i.e., k = K+1, add k to the topic set, i.e., TS_t = {TS_t, k}. Then update the variables m_{t,k}, n_{t,k}, doc_{t,k} and n^w_{t,k} using formula (10).
Step 6: compute the topic distribution θ_t of D_t and the word distribution of each topic, Θ_t = {φ_{t,1}, φ_{t,2}, …, φ_{t,k}, …, φ_{t,K}}, where the topic distribution θ_t uses formula (12) and the word distribution φ_{t,k} of topic k uses formula (13).
Step 6.1: the component θ_{t,k} of the topic distribution θ_t = {θ_{t,1}, θ_{t,2}, …, θ_{t,k}, …, θ_{t,K}} is computed by formula (12).
Step 6.2: the component φ^w_{t,k} of the word distribution φ_{t,k} of topic k is computed by formula (13).
Step 7: delete the topics corresponding to D_{t-1} and the values of the variables m_{t-1,k}, n_{t-1,k}, doc_{t-1,k} and n^w_{t-1,k} under each such topic. This step removes the topics and variables of the previous time point, which not only reduces memory consumption but also deletes some outdated topics.
Step 8: for the document sets at subsequent time points, learn in turn using steps 4, 5, 6 and 7.
The present invention is not limited to the above embodiment. On the basis of the technical solution disclosed by the invention, those skilled in the art can, according to the disclosed technical content, make certain replacements and variations to some of the technical features without creative work; such replacements and variations fall within the scope of the invention.

Claims (7)

1. A dynamic topic discovery algorithm for short text streams, characterized in that the document set of the short text stream is assumed to be D = {D_1, D_2, …, D_{t-1}, D_t, …}, where D_t denotes the text collection arriving at time t; each topic k in D_t is represented by the variables m_{t,k}, n_{t,k}, doc_{t,k} and n^w_{t,k}, where m_{t,k} is the number of documents in D_t belonging to topic k, n_{t,k} is the total number of words belonging to topic k at time t, doc_{t,k} is the set of documents in topic k at time t, and n^w_{t,k} is the total number of times word w belongs to topic k at time t; the topic set of D_t is denoted TS_t, records the topic labels contained in D_t, and is initially the empty set; the discovery algorithm proceeds as follows:
step 1: the short document set at the 1st time point (t=1) is D_1 = {d_1, d_2, …, d_{B_1}}, containing B_1 short documents; create the topic set TS_1 of D_1; for D_1, learn the topic of every document d_m (1 ≤ m ≤ B_1) in turn by the initialization method;
step 2: iteratively learn the latent topic of every document in the document set D_1 of the 1st time point (t=1), where the iteration number i runs from 1 to I, and I is a user-set parameter; in the i-th iteration, relearn the topic label of every document d_m in D_1, recomputing it according to the following sub-steps;
step 3: assuming TS_1 has K topics, infer the topic distribution θ_1 at the 1st time point (t=1) and the word distribution of each topic, Θ_1 = {φ_{1,1}, φ_{1,2}, …, φ_{1,k}, …, φ_{1,K}};
step 4: process the document set of the next time point, i.e., t = t+1; assume the topic set TS_{t-1} of D_{t-1} has K topics; the short document set at the current time point t (t ≥ 2) is D_t = {d_1, d_2, …, d_{B_t}}, containing B_t short documents; set TS_t to the empty set; for D_t, learn the topic of every document d_m (1 ≤ m ≤ B_t) in turn by the initialization method;
step 5: iteratively learn the topic of every document in the document set D_t, where the iteration number i runs from 1 to I; in the i-th iteration, relearn the topic label of every document d_m in D_t, recomputing it according to the following steps;
step 6: compute the topic distribution θ_t of D_t and the word distribution of each topic, Θ_t = {φ_{t,1}, φ_{t,2}, …, φ_{t,k}, …, φ_{t,K}}, where the topic distribution θ_t uses formula (12) and the word distribution φ_{t,k} of topic k uses formula (13);
step 7: delete the topics corresponding to D_{t-1} and the values of the variables m_{t-1,k}, n_{t-1,k}, doc_{t-1,k} and n^w_{t-1,k} under each such topic;
step 8: for the document sets at subsequent time points, learn in turn using steps 4, 5, 6 and 7.
2. The dynamic topic discovery algorithm for short text streams according to claim 1, characterized in that step 1 specifically includes:
step 1.1: set the topic label k of the first document d_1 to 0; create topic k=0 and the variables m_{1,0}, n_{1,0}, doc_{1,0} and n^w_{1,0} representing topic k=0; using formula (1), update the variables m_{1,0}, n_{1,0}, doc_{1,0} and n^w_{1,0} corresponding to topic k=0, and add topic 0 to the topic set TS_1, i.e., TS_1 = {0};
in formula (1), N_{d_1} is the number of words in document d_1 and N^w_{d_1} is the number of occurrences of word w in document d_1;
step 1.2: for D_1, learn the topic label of each document d_m (2 ≤ m ≤ B_1) in turn; assume the number of topics in the current topic set TS_1 is K; during initialization, the probability that document d_m selects topic k (1 ≤ k ≤ K) from the K topics as its topic label is given by formula (2);
in formula (2), z_{d_m} denotes the topic label of document d_m, α and β are user-set parameters, N_{d_m} is the number of words in document d_m, N^w_{d_m} is the number of occurrences of word w in document d_m, and V is the size of the vocabulary;
step 1.3: during initialization, the probability that document d_m selects a new topic label K+1 as its topic label is given by formula (3);
step 1.4: sample the topic label k of document d_m according to the K+1 probability values above; if k = K+1, create topic k and the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k} representing topic k, where m_{1,k}, n_{1,k} and n^w_{1,k} are initialized to 0 and doc_{1,k} to the empty set, and add k to the topic set, i.e., TS_1 = {TS_1, k}; finally, update the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k} using formula (4).
3. The dynamic topic discovery algorithm for short text streams according to claim 2, characterized in that step 2 specifically includes:
step 2.1: assume the topic label originally assigned to document d_m is k; first use formula (5) to update the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k}; if, after the update, doc_{1,k} becomes the empty set, remove topic label k from the topic set TS_1 and remove the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k} associated with topic k;
step 2.2: recompute, using formulas (2) and (3), the probabilities that document d_m belongs to the existing topics and to a new topic, and sample a new topic label k for document d_m; if k = K+1, create topic k and the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k} representing topic k, where m_{1,k}, n_{1,k} and n^w_{1,k} are initialized to 0 and doc_{1,k} to the empty set, and add k to the topic set, i.e., TS_1 = {TS_1, k}; then update the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k} using formula (4).
4. The dynamic topic discovery algorithm for short text streams according to claim 3, characterized in that step 3 specifically includes:
step 3.1: compute the component θ_{1,k} of the topic distribution θ_1 = {θ_{1,1}, θ_{1,2}, …, θ_{1,k}, …, θ_{1,K}} as follows;
step 3.2: compute the component φ^w_{1,k} of the word distribution φ_{1,k} of topic k as follows.
5. The dynamic topic discovery algorithm for short text streams according to claim 4, characterized in that step 4 specifically includes:
step 4.1: during the initialization of D_t, the probability that document d_m selects topic k (1 ≤ k ≤ K) from the K topics as its topic label is given by formula (8);
in formula (8), m_{t-1,k} is the number of documents in D_{t-1} of the previous time point that belong to topic k, and n^w_{t-1,k} is the total number of occurrences of word w in topic k in D_{t-1} of the previous time point;
step 4.2: during initialization, the probability that document d_m selects a new topic label K+1 as its topic label is given by formula (9);
step 4.3: sample the topic label k of document d_m according to the K+1 probability values above; if TS_t does not contain topic k, add it to the topic set, i.e., TS_t = {TS_t, k}; then update the variables m_{t,k}, n_{t,k}, doc_{t,k} and n^w_{t,k} using formula (10).
6. The dynamic topic discovery algorithm for short text streams according to claim 5, characterized in that step 5 specifically includes:
step 5.1: assume the topic label originally assigned to document d_m is k; first use formula (11) to update the variables m_{t,k}, n_{t,k}, doc_{t,k} and n^w_{t,k}; if, after the update, doc_{t,k} becomes the empty set, remove topic label k from the topic set TS_t and remove the variables m_{t,k}, n_{t,k}, doc_{t,k} and n^w_{t,k} associated with topic k;
step 5.2: compute, using formulas (8) and (9), the probabilities that document d_m belongs to the topics in TS_t and to a new topic;
step 5.3: sample a new topic label k for document d_m according to the probability values computed in the previous step; if a new topic is selected, i.e., k = K+1, add k to the topic set, i.e., TS_t = {TS_t, k}; then update the variables m_{t,k}, n_{t,k}, doc_{t,k} and n^w_{t,k} using formula (10).
7. The dynamic topic discovery algorithm for short text streams according to claim 5, characterized in that step 6 specifically includes:
step 6.1: compute the component θ_{t,k} of the topic distribution θ_t = {θ_{t,1}, θ_{t,2}, …, θ_{t,k}, …, θ_{t,K}} as follows;
step 6.2: compute the component φ^w_{t,k} of the word distribution φ_{t,k} of topic k as follows.
CN201910354228.3A 2019-04-29 2019-04-29 Dynamic theme discovery method for short text stream Active CN110096704B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910354228.3A CN110096704B (en) 2019-04-29 2019-04-29 Dynamic theme discovery method for short text stream

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910354228.3A CN110096704B (en) 2019-04-29 2019-04-29 Dynamic theme discovery method for short text stream

Publications (2)

Publication Number Publication Date
CN110096704A true CN110096704A (en) 2019-08-06
CN110096704B CN110096704B (en) 2023-05-05

Family

ID=67446310

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910354228.3A Active CN110096704B (en) 2019-04-29 2019-04-29 Dynamic theme discovery method for short text stream

Country Status (1)

Country Link
CN (1) CN110096704B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112507713A (en) * 2020-12-15 2021-03-16 北京京航计算通讯研究所 Text aggregation system based on dynamic self-aggregation topic model

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140129510A1 (en) * 2011-07-13 2014-05-08 Huawei Technologies Co., Ltd. Parameter Inference Method, Calculation Apparatus, and System Based on Latent Dirichlet Allocation Model
CN103942340A (en) * 2014-05-09 2014-07-23 电子科技大学 Microblog user interest recognizing method based on text mining
CN104199974A (en) * 2013-09-22 2014-12-10 中科嘉速(北京)并行软件有限公司 Microblog-oriented dynamic topic detection and evolution tracking method
CN107798043A (en) * 2017-06-28 2018-03-13 贵州大学 The Text Clustering Method of long text auxiliary short text based on the multinomial mixed model of Di Li Crays
CN107992549A (en) * 2017-11-28 2018-05-04 南京信息工程大学 Dynamic short text stream Clustering Retrieval method
CN109063030A (en) * 2018-07-16 2018-12-21 南京信息工程大学 A method of theme and descriptor are implied based on streaming LDA topic model discovery document


Also Published As

Publication number Publication date
CN110096704B (en) 2023-05-05

Similar Documents

Publication Publication Date Title
CN105653706B (en) A kind of multilayer quotation based on literature content knowledge mapping recommends method
TWI653542B (en) Method, system and device for discovering and tracking hot topics based on network media data flow
CN106649260B (en) Product characteristic structure tree construction method based on comment text mining
CN106294593B (en) In conjunction with the Relation extraction method of subordinate clause grade remote supervisory and semi-supervised integrated study
CN106599054B (en) Method and system for classifying and pushing questions
CN109902159A (en) A kind of intelligent O&M statement similarity matching process based on natural language processing
EP2833271A1 (en) Multimedia question and answer system and method
CN106991127B (en) Knowledge subject short text hierarchical classification method based on topological feature expansion
CN111241294A (en) Graph convolution network relation extraction method based on dependency analysis and key words
CN107273348B (en) Topic and emotion combined detection method and device for text
CN108647322B (en) Method for identifying similarity of mass Web text information based on word network
CN109213925B (en) Legal text searching method
CN111143672B (en) Knowledge graph-based professional speciality scholars recommendation method
CN108614897B (en) Content diversification searching method for natural language
CN108804701A (en) Personage's portrait model building method based on social networks big data
CN107688630B (en) Semantic-based weakly supervised microbo multi-emotion dictionary expansion method
CN107562772A (en) Event extraction method, apparatus, system and storage medium
CN107329954B (en) Topic detection method based on document content and mutual relation
CN108090077A (en) A kind of comprehensive similarity computational methods based on natural language searching
CN103559193A (en) Topic modeling method based on selected cell
CN107247739A (en) A kind of financial publication text knowledge extracting method based on factor graph
CN100543735C (en) File similarity measure method based on file structure
CN106874397B (en) Automatic semantic annotation method for Internet of things equipment
CN110717042A (en) Method for constructing document-keyword heterogeneous network model
CN113282689A (en) Retrieval method and device based on domain knowledge graph and search engine

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant