CN110096704A - Dynamic topic discovery algorithm for short text streams - Google Patents
- Publication number: CN110096704A (application CN201910354228.3A)
- Authority: CN (China)
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F40/258 (Physics; Computing; Electric digital data processing; Handling natural language data; Natural language analysis) — Heading extraction; Automatic titling; Numbering
- G06F40/279 (Physics; Computing; Electric digital data processing; Handling natural language data; Natural language analysis) — Recognition of textual entities
Abstract
The invention discloses a dynamic topic discovery algorithm for short text streams, in the field of topic models, which proceeds as follows. Step 1: initialize the topic of every document in the document set of the 1st time point. Step 2: iteratively learn the topic of every document in the document set of the 1st time point. Step 3: obtain the topic distribution and the per-topic word distributions at the 1st time point. Step 4: initialize the topic of every document in the document set of time point t (t > 1). Step 5: iteratively learn the topic of every document in the document set of time point t. Step 6: obtain the topic distribution and the per-topic word distributions at time point t. Step 7: delete the topics of time point t-1. Step 8: process the document sets of subsequent time points by repeating steps 4, 5, 6 and 7. The invention fully accounts for the sparsity of short texts and learns jointly with the topic distribution of the documents at the previous time point, so that the latent topics in a short text stream can be discovered more effectively.
Description
Technical field
The present invention relates to topic models, and in particular to a dynamic topic discovery algorithm for short text streams.
Background art
In recent years, short text streams on the Internet, such as microblogs and comments, have grown explosively, and topic models that can quickly analyze short text streams have attracted increasing attention. A topic model for short text streams can discover the latent topics in the stream and the topic distribution of each document, and can be applied to short text classification, event detection and tracking, document summarization, and so on. Topic discovery in short text streams faces the following challenges: (1) the sparsity of short texts; (2) texts arrive continuously, so storing all texts and iterating over them, as methods for static text do, is infeasible for a short text stream; (3) the topics in a text stream keep evolving, so new topics must be detected automatically and topics that are never updated must be removed. Existing topic models for short text streams either fail to solve the problem of topic drift or fail to exploit the correlation between the topics of document sets at different time points, so the topics they finally discover leave users unsatisfied.
Currently, existing topic model algorithms for short text streams can be roughly divided into two categories: those based on dynamic Dirichlet multinomial mixture models and those based on Dirichlet process multinomial mixture models. Both use a Dirichlet distribution as the prior over the latent topics and then sample from a multinomial distribution. Algorithms of the first kind do consider the correlation between the topics of document sets at different time points, i.e., many topics of the previous time point continue to appear at the current time point; but because the number of topics must be specified in advance, they cannot solve the problem of topic drift. Algorithms of the second kind do not require the number of topics to be specified in advance, but they ignore the correlation between the topics of document sets at different time points.
Summary of the invention
The object of the present invention is to provide a dynamic topic discovery algorithm for short text streams that addresses the sparsity of short texts and the problem of topic drift, and that fully exploits the correlation between the topics of the previous time point and the topics of the current time point, thereby effectively improving the accuracy of topic discovery in short text streams.
The object of the present invention is achieved as follows: a dynamic topic discovery algorithm for short text streams, characterized in that the document collection of the short text stream is assumed to be D = {D_1, D_2, ..., D_{t-1}, D_t, ...}, where D_t denotes the text set arriving at time t. Each topic k in D_t is represented by the variables m_{t,k}, n_{t,k}, doc_{t,k} and n^w_{t,k}, where m_{t,k} denotes the number of documents in D_t belonging to topic k, n_{t,k} is the total number of words belonging to topic k at time t, doc_{t,k} denotes the set of documents in topic k at time t, and n^w_{t,k} is the total number of times word w belongs to topic k at time t. The set of topics in D_t is denoted TS_t; it records the topic labels contained in D_t and is initially the empty set. The discovery algorithm proceeds as follows:
Step 1: the short document set at the 1st time point (t = 1) is D_1 = {d_1, ..., d_{B_1}}, containing B_1 short documents. Create the topic set TS_1 of D_1. For D_1, learn the topic of every document d_m (1 ≤ m ≤ B_1) in turn by the initialization method.
Step 2: iteratively learn the latent topic of every document in the document set D_1 of the 1st time point (t = 1), where the iteration number i runs from 1 to I; I is a user-set parameter. In the i-th iteration, relearn the topic label of every document d_m in D_1 and recalculate it according to the sub-steps below.
Step 3: assuming TS_1 has K topics, infer the topic distribution θ_1 of D_1 at the 1st time point (t = 1) and the word distributions of the topics Θ_1 = {φ_{1,1}, φ_{1,2}, ..., φ_{1,k}, ..., φ_{1,K}}.
Step 4: process the document set of the next time point, i.e., t = t + 1. Assume the topic set TS_{t-1} of D_{t-1} has K topics. The short document set of the current time point t (t ≥ 2) is D_t = {d_1, ..., d_{B_t}}, containing B_t short documents. Set TS_t to the empty set. For D_t, learn the topic of every document d_m (1 ≤ m ≤ B_t) in turn by the initialization method.
Step 5: iteratively learn the topic of every document in the document set D_t, where the iteration number i runs from 1 to I. In the i-th iteration, relearn the topic label of every document d_m in D_t and recalculate it by the steps below.
Step 6: compute the topic distribution θ_t of D_t and the word distributions of the topics Θ_t = {φ_{t,1}, φ_{t,2}, ..., φ_{t,k}, ..., φ_{t,K}}, where the topic distribution θ_t uses formula (12) and the word distribution φ_{t,k} of topic k uses formula (13).
Step 7: delete the topics of D_{t-1} and the values of the corresponding variables m_{t-1,k}, n_{t-1,k}, doc_{t-1,k} and n^w_{t-1,k} under each topic.
Step 8: for the document sets of subsequent time points, learn by applying steps 4, 5, 6 and 7 in turn.
As a further refinement of the invention, step 1 specifically includes:
Step 1.1: set the topic label k of the first document d_1 to 0; create topic k = 0 and the variables m_{1,0}, n_{1,0}, doc_{1,0} and n^w_{1,0} representing it; using formula (1), update the variables m_{1,0}, n_{1,0}, doc_{1,0} and n^w_{1,0} of topic k = 0; and add topic 0 to the topic set TS_1, i.e., TS_1 = {0}.
In formula (1), N_{d_1} is the number of words in document d_1 and N^w_{d_1} is the number of occurrences of word w in document d_1.
Step 1.2: for D_1, learn the topic label of each document d_m (2 ≤ m ≤ B_1) in turn. Assume the current topic set TS_1 contains K topics. During initialization, the probability that document d_m selects topic k (1 ≤ k ≤ K) from the K topics as its topic label is given by formula (2).
In formula (2), z_{d_m} denotes the topic label of document d_m, α and β are user-set parameters, N_{d_m} is the number of words in document d_m, N^w_{d_m} is the number of occurrences of word w in document d_m, and V is the size of the vocabulary.
Step 1.3: during initialization, the probability that document d_m selects a new topic label K + 1 as its topic label is given by formula (3).
Step 1.4: according to the K + 1 probability values above, sample the topic label k of document d_m. If k = K + 1, create the new topic k and the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k} representing it, where m_{1,k}, n_{1,k} and n^w_{1,k} are initialized to 0 and doc_{1,k} to the empty set, and add k to the topic set, i.e., TS_1 = {TS_1, k}. Finally, update the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k} using formula (4).
As a further refinement of the invention, step 2 specifically includes:
Step 2.1: assume the topic label originally assigned to document d_m is k. First use formula (5) to update the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k}. If after the update doc_{1,k} becomes the empty set, remove topic label k from the topic set TS_1 and remove the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k} associated with topic k.
Step 2.2: recompute, using formulas (2) and (3), the probabilities that document d_m belongs to each existing topic and to a new topic, and sample a new topic label k for document d_m. If k = K + 1, create the new topic k and the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k} representing it, where m_{1,k}, n_{1,k} and n^w_{1,k} are initialized to 0 and doc_{1,k} to the empty set, and add k to the topic set, i.e., TS_1 = {TS_1, k}. Then update the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k} using formula (4).
As a further refinement of the invention, step 3 specifically includes:
Step 3.1: each component θ_{1,k} of the topic distribution θ_1 = {θ_{1,1}, θ_{1,2}, ..., θ_{1,k}, ..., θ_{1,K}} is computed by formula (6).
Step 3.2: each component φ^w_{1,k} of the word distribution φ_{1,k} of topic k is computed by formula (7).
As a further refinement of the invention, step 4 specifically includes:
Step 4.1: during the initialization of D_t, the probability that document d_m selects topic k (1 ≤ k ≤ K) from the K topics as its topic label is given by formula (8).
In formula (8), m_{t-1,k} denotes the number of documents belonging to topic k in D_{t-1} of the previous time point, and n^w_{t-1,k} is the total number of occurrences of word w in topic k in D_{t-1} of the previous time point.
Step 4.2: during initialization, the probability that document d_m selects a new topic label K + 1 as its topic label is given by formula (9).
Step 4.3: according to the K + 1 probability values above, sample the topic label k of document d_m. If TS_t does not contain topic k, add it to the topic set, i.e., TS_t = {TS_t, k}. Then update the variables m_{t,k}, n_{t,k}, doc_{t,k} and n^w_{t,k} using formula (10).
As a further refinement of the invention, step 5 specifically includes:
Step 5.1: assume the topic label originally assigned to document d_m is k. First use formula (11) to update the variables m_{t,k}, n_{t,k}, doc_{t,k} and n^w_{t,k}. If after the update doc_{t,k} becomes the empty set, remove topic label k from the topic set TS_t and remove the variables m_{t,k}, n_{t,k}, doc_{t,k} and n^w_{t,k} associated with topic k.
Step 5.2: compute, using formulas (8) and (9), the probabilities that document d_m belongs to each topic in TS_t and to a new topic.
Step 5.3: according to the probability values computed in the previous step, sample a new topic label k for document d_m. If a new topic is selected, i.e., k = K + 1, add k to the topic set, i.e., TS_t = {TS_t, k}. Then update the variables m_{t,k}, n_{t,k}, doc_{t,k} and n^w_{t,k} using formula (10).
As a further refinement of the invention, step 6 specifically includes:
Step 6.1: each component θ_{t,k} of the topic distribution θ_t = {θ_{t,1}, θ_{t,2}, ..., θ_{t,k}, ..., θ_{t,K}} is computed by formula (12).
Step 6.2: each component φ^w_{t,k} of the word distribution φ_{t,k} of topic k is computed by formula (13).
Compared with the prior art, the invention has the following advantages:
1. When learning the latent topics of each time point, the present invention learns the number of topics by itself through the Dirichlet process and does not need the number of latent topics to be specified manually, overcoming the difficulty many traditional topic models have in determining the number of topics and greatly strengthening the applicability and convenience of the model.
2. When learning the latent topics of the short document set of the current time point, the present invention combines the latent topic distribution of the previous time point with that of the current time point, overcoming the limitation of some conventional models that consider only the current time point. A topic that was important at the previous time point is probably also important at the current time point, so this effectively improves the performance of the algorithm.
3. When learning the latent topics of the short document set of the current time point, the present invention considers only the latent topic distribution of the previous time point and ignores the latent topic distributions of earlier time points, which greatly improves efficiency when processing massive text streams and overcomes the poor timeliness of conventional models that process the entire document collection.
4. The present invention can automatically identify new topics and automatically delete outdated topics, which not only handles topic drift but also prevents the number of topics from growing without bound as data keep arriving, greatly improving the practicality of the algorithm.
Specific embodiment
The present invention will be further described below with reference to a specific embodiment.
In the present embodiment, a dynamic topic discovery algorithm for short text streams is carried out as follows.
Assume the document collection of the short text stream is D = {D_1, D_2, ..., D_{t-1}, D_t, ...}, where D_t denotes the text set arriving at time t. The document set of each time point learns its topic labels with a Dirichlet process mixture model, whose generative process is constructed with the stick-breaking construction. For a document d, the generative process is as follows:
θ | α ~ GEM(α)
φ_k | β ~ Dir(β), k = 1, ..., ∞
z_d | θ ~ Mult(θ), d = 1, ..., ∞
Here GEM is the Griffiths-Engen-McCloskey distribution, Dir is the Dirichlet distribution, Mult is the multinomial distribution, α and β are hyperparameters, θ denotes the topic distribution over documents, φ_k denotes the word distribution of topic k, and z_d denotes the topic label of document d. It is assumed that each document belongs to exactly one topic; this assumption suits short document sets, because each short document carries only very limited information. Given the topic label z_d, the probability of generating document d is computed accordingly. The present goal is to learn the latent variables θ, φ and z_d: θ and φ are integrated out and z_d is inferred by collapsed Gibbs sampling; finally, θ and φ are computed from z_d.
Each topic k in D_t is represented by the variables m_{t,k}, n_{t,k}, doc_{t,k} and n^w_{t,k}, where m_{t,k} denotes the number of documents in D_t belonging to topic k, n_{t,k} is the total number of words belonging to topic k at time t, doc_{t,k} denotes the set of documents in topic k at time t, and n^w_{t,k} is the total number of times word w belongs to topic k at time t. The set of topics in D_t is denoted TS_t; it records the topic labels contained in D_t and is initially the empty set.
Step 1: the short document set at the 1st time point (t = 1) is D_1 = {d_1, ..., d_{B_1}}, containing B_1 short documents. Create the topic set TS_1 of D_1. For D_1, learn the topic of every document d_m (1 ≤ m ≤ B_1) in turn by the initialization method. It should be emphasized that the algorithm does not need the number of topics to be specified in advance.
Step 1.1: set the topic label k of the first document d_1 to 0; create topic k = 0 and the variables m_{1,0}, n_{1,0}, doc_{1,0} and n^w_{1,0} representing it; using formula (1), update the variables m_{1,0}, n_{1,0}, doc_{1,0} and n^w_{1,0} of topic k = 0; and add topic 0 to the topic set TS_1, i.e., TS_1 = {0}.
In formula (1), N_{d_1} is the number of words in document d_1 and N^w_{d_1} is the number of occurrences of word w in document d_1.
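The bookkeeping in formulas (1) and (4) amounts to simple count updates when a document is assigned to a topic. A sketch, assuming a hypothetical `TopicStats` container for m_{t,k}, n_{t,k}, doc_{t,k} and n^w_{t,k}:

```python
from collections import Counter, defaultdict

class TopicStats:
    # Illustrative container; the field names mirror the patent's variables.
    def __init__(self):
        self.m = defaultdict(int)        # m_{t,k}: number of documents in topic k
        self.n = defaultdict(int)        # n_{t,k}: total words in topic k
        self.doc = defaultdict(set)      # doc_{t,k}: document ids in topic k
        self.nw = defaultdict(Counter)   # n^w_{t,k}: per-word counts in topic k

    def add(self, doc_id, words, k):
        # Assigning a document to topic k increments the four variables,
        # in the spirit of formulas (1) and (4).
        self.m[k] += 1
        self.n[k] += len(words)
        self.doc[k].add(doc_id)
        self.nw[k].update(words)
```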
Step 1.2: for D_1, learn the topic label of each document d_m (2 ≤ m ≤ B_1) in turn. Assume the current topic set TS_1 contains K topics. During initialization, the probability that document d_m selects topic k (1 ≤ k ≤ K) from the K topics as its topic label is given by formula (2).
In formula (2), z_{d_m} denotes the topic label of document d_m, α and β are user-set parameters, N_{d_m} is the number of words in document d_m, N^w_{d_m} is the number of occurrences of word w in document d_m, and V is the size of the vocabulary. In this embodiment, α and β are both set to 0.1. The first factor of formula (2) means that each document tends to select a topic to which more documents already belong, and the second factor means that it tends to select a topic that shares more similar words with the document. Since the text set of the first time point is not preceded by any earlier set, learning can rely only on this set itself.
Step 1.3: during initialization, the probability that document d_m selects a new topic label K + 1 as its topic label is given by formula (3).
Step 1.4: according to the K + 1 probability values above, sample the topic label k of document d_m. If k = K + 1, create the new topic k and the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k} representing it, where m_{1,k}, n_{1,k} and n^w_{1,k} are initialized to 0 and doc_{1,k} to the empty set, and add k to the topic set, i.e., TS_1 = {TS_1, k}. Finally, update the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k} using formula (4).
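Formulas (2) and (3) are not reproduced here, but the behaviour they describe matches the standard collapsed-Gibbs scores of a Dirichlet process multinomial mixture: an existing topic is weighted by its document count and by how well its word counts match the document, while a new topic is opened with weight α. A sketch under that assumption (the smoothing constants are the conventional choice and may differ from the patent's exact formulas):

```python
from collections import Counter

def score_existing(words, m_k, n_k, nw_k, beta, V):
    # First factor: topics holding more documents are preferred.
    num = float(m_k)
    # Second factor: topics sharing more of the document's words are preferred.
    for w, c in Counter(words).items():
        for r in range(c):
            num *= nw_k.get(w, 0) + beta + r
    den = 1.0
    for i in range(len(words)):
        den *= n_k + V * beta + i
    return num / den

def score_new(words, alpha, beta, V):
    # A brand-new topic is opened with weight alpha and uniform word smoothing.
    num = float(alpha)
    for _, c in Counter(words).items():
        for r in range(c):
            num *= beta + r
    den = 1.0
    for i in range(len(words)):
        den *= V * beta + i
    return num / den
```

In step 1.4 these K + 1 unnormalised scores would be normalised and a label sampled from them.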
Step 2: iteratively learn the latent topic of every document in the document set D_1 of the 1st time point (t = 1), where the iteration number i runs from 1 to I; I is a user-set parameter, set to 100 in this embodiment. In the i-th iteration, relearn the topic label of every document d_m in D_1 and recalculate it according to the following sub-steps:
Step 2.1: assume the topic label originally assigned to document d_m is k. First use formula (5) to update the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k}. If after the update doc_{1,k} becomes the empty set, remove topic label k from the topic set TS_1 and remove the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k} associated with topic k.
Step 2.2: recompute, using formulas (2) and (3), the probabilities that document d_m belongs to each existing topic and to a new topic, and sample a new topic label k for document d_m. If k = K + 1, create the new topic k and the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k} representing it, where m_{1,k}, n_{1,k} and n^w_{1,k} are initialized to 0 and doc_{1,k} to the empty set, and add k to the topic set, i.e., TS_1 = {TS_1, k}. Then update the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k} using formula (4).
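The removal step of step 2.1 (formula (5)) can be sketched as follows; the dictionary-based state and function name are illustrative assumptions:

```python
from collections import Counter

def remove_document(doc_id, words, k, m, n, doc, nw, topics):
    # Subtract the document's counts from its current topic k
    # before re-sampling its label (step 2.1 / formula (5)).
    m[k] -= 1
    n[k] -= len(words)
    doc[k].discard(doc_id)
    for w, c in Counter(words).items():
        nw[k][w] -= c
    if not doc[k]:
        # The topic is now empty: remove k and all its variables.
        topics.discard(k)
        for table in (m, n, doc, nw):
            del table[k]
```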
Step 3: assume TS_1 has K topics, K being the number of topics in TS_1. At the 1st time point (t = 1), infer the topic distribution θ_1 of D_1 and the word distributions of the topics Θ_1 = {φ_{1,1}, φ_{1,2}, ..., φ_{1,k}, ..., φ_{1,K}}:
Step 3.1: each component θ_{1,k} of the topic distribution θ_1 = {θ_{1,1}, θ_{1,2}, ..., θ_{1,k}, ..., θ_{1,K}} is computed by formula (6).
Step 3.2: each component φ^w_{1,k} of the word distribution φ_{1,k} of topic k is computed by formula (7).
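Steps 3.1 and 3.2 are point estimates of θ_{1,k} and φ^w_{1,k} from the collected counts. The Dirichlet-smoothed estimates below are the conventional form; the exact constants are an assumption, not a reproduction of the patent's formulas:

```python
def estimate_theta(m, alpha, B, K):
    # theta_{1,k}: smoothed fraction of the B documents assigned to each topic.
    return {k: (m_k + alpha) / (B + K * alpha) for k, m_k in m.items()}

def estimate_phi_k(nw_k, n_k, beta, V):
    # phi^w_{1,k}: smoothed fraction of topic k's n_k words that are word w.
    return {w: (c + beta) / (n_k + V * beta) for w, c in nw_k.items()}
```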
Step 4: process the document set of the next time point, i.e., t = t + 1. Assume the topic set TS_{t-1} of D_{t-1} has K topics. The short document set of the current time point t (t ≥ 2) is D_t = {d_1, ..., d_{B_t}}, containing B_t short documents. Set TS_t to the empty set. For D_t, learn the topic of every document d_m (1 ≤ m ≤ B_t) in turn by the initialization method.
Step 4.1: during the initialization of D_t, the probability that document d_m selects topic k (1 ≤ k ≤ K) from the K topics as its topic label is given by formula (8).
In formula (8), m_{t-1,k} denotes the number of documents belonging to topic k in D_{t-1} of the previous time point, and n^w_{t-1,k} is the total number of occurrences of word w in topic k in D_{t-1} of the previous time point. Unlike formula (2), formula (8) also considers the distribution of topics over the documents of the previous time point, because at two consecutive time points many topics are correlated.
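Formula (8)'s coupling to the previous time point can be sketched by folding m_{t-1,k} and n^w_{t-1,k} into the current counts as pseudo-counts, so topics that were prominent at time t-1 stay easy to select at time t. The additive combination shown here is an assumption about the formula's shape, not a reproduction of it:

```python
from collections import Counter

def score_existing_with_prior(words, m_tk, m_prev_k, n_tk, n_prev_k,
                              nw_tk, nw_prev_k, beta, V):
    # Previous-time-point statistics m_{t-1,k} and n^w_{t-1,k} act as
    # pseudo-counts added to the current counts (sketch of formula (8)).
    num = float(m_tk + m_prev_k)
    for w, c in Counter(words).items():
        base = nw_tk.get(w, 0) + nw_prev_k.get(w, 0)
        for r in range(c):
            num *= base + beta + r
    den = 1.0
    for i in range(len(words)):
        den *= n_tk + n_prev_k + V * beta + i
    return num / den
```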
Step 4.2: during initialization, the probability that document d_m selects a new topic label K + 1 as its topic label is given by formula (9).
Formula (9) ensures that the topics of the document set of the current time point are not drawn only from the topic set of the previous time point; new topics can also be learned, thereby achieving topic drift.
Step 4.3: according to the K + 1 probability values above, sample the topic label k of document d_m. If TS_t does not contain topic k, add it to the topic set, i.e., TS_t = {TS_t, k}. Then update the variables m_{t,k}, n_{t,k}, doc_{t,k} and n^w_{t,k} using formula (10).
Step 5: iteratively learn the topic of every document in the document set D_t, where the iteration number i runs from 1 to I. In the i-th iteration, relearn the topic label of every document d_m in D_t and recalculate it by the following steps:
Step 5.1: assume the topic label originally assigned to document d_m is k. First use formula (11) to update the variables m_{t,k}, n_{t,k}, doc_{t,k} and n^w_{t,k}. If after the update doc_{t,k} becomes the empty set, remove topic label k from the topic set TS_t and remove the variables m_{t,k}, n_{t,k}, doc_{t,k} and n^w_{t,k} associated with topic k. This step ensures that topics that are no longer updated are deleted, so the number of topics does not grow without bound.
Step 5.2: compute, using formulas (8) and (9), the probabilities that document d_m belongs to each topic in TS_t and to a new topic.
Step 5.3: according to the probability values computed in the previous step, sample a new topic label k for document d_m. If a new topic is selected, i.e., k = K + 1, add k to the topic set, i.e., TS_t = {TS_t, k}. Then update the variables m_{t,k}, n_{t,k}, doc_{t,k} and n^w_{t,k} using formula (10).
Step 6: compute the topic distribution θ_t of D_t and the word distributions of the topics Θ_t = {φ_{t,1}, φ_{t,2}, ..., φ_{t,k}, ..., φ_{t,K}}, where the topic distribution θ_t uses formula (12) and the word distribution φ_{t,k} of topic k uses formula (13).
Step 6.1: each component θ_{t,k} of the topic distribution θ_t = {θ_{t,1}, θ_{t,2}, ..., θ_{t,k}, ..., θ_{t,K}} is computed by formula (12).
Step 6.2: each component φ^w_{t,k} of the word distribution φ_{t,k} of topic k is computed by formula (13).
Step 7: delete the topics of D_{t-1} and the values of the corresponding variables m_{t-1,k}, n_{t-1,k}, doc_{t-1,k} and n^w_{t-1,k} under each topic. This step discards the topics and variables of the previous time point, which not only reduces memory consumption but also removes some outdated topics.
Step 8: for the document sets of subsequent time points, learn by applying steps 4, 5, 6 and 7 in turn.
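Putting steps 1 through 8 together, the following is a heavily simplified, self-contained sketch. Deviations from the embodiment are deliberate: labels are chosen by argmax rather than sampled (for determinism), only m_{t-1,k} is carried across time points whereas the text also carries n^w_{t-1,k}, and all names are illustrative assumptions:

```python
import math
from collections import Counter, defaultdict

class StreamingDPMM:
    def __init__(self, alpha=2.0, beta=0.1, V=50, sweeps=5):
        self.alpha, self.beta, self.V, self.sweeps = alpha, beta, V, sweeps
        self.prev_m = {}   # m_{t-1,k}: the only statistic kept from time t-1 in this sketch
        self.next_k = 0    # fresh labels for newly created topics

    def _score(self, words, m_k, n_k, nw_k):
        # Log of the collapsed-Gibbs score: document-count factor times word-count factor.
        s = math.log(m_k)
        for i, w in enumerate(words):
            s += math.log(nw_k.get(w, 0) + self.beta)
            s -= math.log(n_k + self.V * self.beta + i)
        return s

    def process_batch(self, docs):
        z, m, n, nw = {}, defaultdict(int), defaultdict(int), defaultdict(Counter)
        for _ in range(self.sweeps + 1):               # sweep 0 plays the role of steps 1/4
            for d, words in enumerate(docs):
                if d in z:                             # subtract current counts (steps 2.1/5.1)
                    k = z.pop(d)
                    m[k] -= 1
                    n[k] -= len(words)
                    for w in words:
                        nw[k][w] -= 1
                    if m[k] == 0:                      # drop emptied topics immediately
                        for table in (m, n, nw):
                            table.pop(k, None)
                best_k, best_s = None, -math.inf
                for k in set(m) | set(self.prev_m):    # topics of time t-1 stay selectable
                    s = self._score(words,
                                    m.get(k, 0) + self.prev_m.get(k, 0),
                                    n.get(k, 0), nw.get(k, {}))
                    if s > best_s:
                        best_k, best_s = k, s
                if self._score(words, self.alpha, 0, {}) > best_s:
                    best_k, self.next_k = self.next_k, self.next_k + 1   # open a new topic
                z[d] = best_k
                m[best_k] += 1
                n[best_k] += len(words)
                for w in words:
                    nw[best_k][w] += 1
        self.prev_m = dict(m)                          # step 7: other t-1 variables discarded
        return z
```

On a toy batch of two word clusters, the sketch separates the documents into two topics without the number of topics being specified in advance, which is the behaviour the steps above describe.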
The present invention is not limited to the above embodiment. On the basis of the technical solution disclosed by the invention, those skilled in the art can, according to the disclosed technical content, make certain replacements and modifications to some of the technical features without creative labor, and these replacements and modifications all fall within the protection scope of the invention.
Claims (7)
1. A dynamic topic discovery algorithm for short text streams, characterized in that: assume the document collection of the short text stream is D = {D_1, D_2, ..., D_{t-1}, D_t, ...}, where D_t denotes the text set arriving at time t; each topic k in D_t is represented by the variables m_{t,k}, n_{t,k}, doc_{t,k} and n^w_{t,k}, where m_{t,k} denotes the number of documents in D_t belonging to topic k, n_{t,k} is the total number of words belonging to topic k at time t, doc_{t,k} denotes the set of documents in topic k at time t, and n^w_{t,k} is the total number of times word w belongs to topic k at time t; the set of topics in D_t is denoted TS_t, records the topic labels contained in D_t, and is initially the empty set; the discovery algorithm proceeds as follows:
step 1: the short document set at the 1st time point (t = 1) is D_1 = {d_1, ..., d_{B_1}}, containing B_1 short documents; create the topic set TS_1 of D_1; for D_1, learn the topic of every document d_m (1 ≤ m ≤ B_1) in turn by the initialization method;
step 2: iteratively learn the latent topic of every document in the document set D_1 of the 1st time point (t = 1), where the iteration number i runs from 1 to I, I being a user-set parameter; in the i-th iteration, relearn the topic label of every document d_m in D_1 and recalculate it according to the sub-steps below;
step 3: assuming TS_1 has K topics, infer, at the 1st time point (t = 1), the topic distribution θ_1 of D_1 and the word distributions of the topics Θ_1 = {φ_{1,1}, φ_{1,2}, ..., φ_{1,k}, ..., φ_{1,K}};
step 4: process the document set of the next time point, i.e., t = t + 1; assume the topic set TS_{t-1} of D_{t-1} has K topics; the short document set of the current time point t (t ≥ 2) is D_t = {d_1, ..., d_{B_t}}, containing B_t short documents; set TS_t to the empty set; for D_t, learn the topic of every document d_m (1 ≤ m ≤ B_t) in turn by the initialization method;
step 5: iteratively learn the topic of every document in the document set D_t, where the iteration number i runs from 1 to I; in the i-th iteration, relearn the topic label of every document d_m in D_t and recalculate it by the steps below;
step 6: compute the topic distribution θ_t of D_t and the word distributions of the topics Θ_t = {φ_{t,1}, φ_{t,2}, ..., φ_{t,k}, ..., φ_{t,K}}, where the topic distribution θ_t uses formula (12) and the word distribution φ_{t,k} of topic k uses formula (13);
step 7: delete the topics of D_{t-1} and the values of the corresponding variables m_{t-1,k}, n_{t-1,k}, doc_{t-1,k} and n^w_{t-1,k} under each topic;
step 8: for the document sets of subsequent time points, learn by applying steps 4, 5, 6 and 7 in turn.
2. The dynamic topic discovery algorithm for short text streams according to claim 1, characterized in that step 1 specifically includes:
step 1.1: set the topic label k of the first document d_1 to 0; create topic k = 0 and the variables m_{1,0}, n_{1,0}, doc_{1,0} and n^w_{1,0} representing it; using formula (1), update the variables m_{1,0}, n_{1,0}, doc_{1,0} and n^w_{1,0} of topic k = 0; and add topic 0 to the topic set TS_1, i.e., TS_1 = {0};
in formula (1), N_{d_1} is the number of words in document d_1 and N^w_{d_1} is the number of occurrences of word w in document d_1;
step 1.2: for D_1, learn the topic label of each document d_m (2 ≤ m ≤ B_1) in turn; assume the current topic set TS_1 contains K topics; during initialization, the probability that document d_m selects topic k (1 ≤ k ≤ K) from the K topics as its topic label is given by formula (2);
in formula (2), z_{d_m} denotes the topic label of document d_m, α and β are user-set parameters, N_{d_m} is the number of words in document d_m, N^w_{d_m} is the number of occurrences of word w in document d_m, and V is the size of the vocabulary;
step 1.3: during initialization, the probability that document d_m selects a new topic label K + 1 as its topic label is given by formula (3);
step 1.4: according to the K + 1 probability values above, sample the topic label k of document d_m; if k = K + 1, create the new topic k and the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k} representing it, where m_{1,k}, n_{1,k} and n^w_{1,k} are initialized to 0 and doc_{1,k} to the empty set, and add k to the topic set, i.e., TS_1 = {TS_1, k}; finally, update the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k} using formula (4).
3. The dynamic topic discovery algorithm for short text streams according to claim 2, characterized in that step 2 specifically includes:
step 2.1: assume the topic label originally assigned to document d_m is k; first use formula (5) to update the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k}; if after the update doc_{1,k} becomes the empty set, remove topic label k from the topic set TS_1 and remove the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k} associated with topic k;
step 2.2: recompute, using formulas (2) and (3), the probabilities that document d_m belongs to each existing topic and to a new topic, and sample a new topic label k for document d_m; if k = K + 1, create the new topic k and the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k} representing it, where m_{1,k}, n_{1,k} and n^w_{1,k} are initialized to 0 and doc_{1,k} to the empty set, and add k to the topic set, i.e., TS_1 = {TS_1, k}; then update the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k} using formula (4).
4. The dynamic topic discovery algorithm for short text streams according to claim 3, characterized in that step 3 specifically includes:
step 3.1: each component θ_{1,k} of the topic distribution θ_1 = {θ_{1,1}, θ_{1,2}, ..., θ_{1,k}, ..., θ_{1,K}} is computed by formula (6);
step 3.2: each component φ^w_{1,k} of the word distribution φ_{1,k} of topic k is computed by formula (7).
5. The dynamic topic discovery algorithm for short text streams according to claim 4, characterized in that step 4 specifically includes:
step 4.1: during the initialization of D_t, the probability that document d_m selects topic k (1 ≤ k ≤ K) from the K topics as its topic label is given by formula (8);
in formula (8), m_{t-1,k} denotes the number of documents belonging to topic k in D_{t-1} of the previous time point, and n^w_{t-1,k} is the total number of occurrences of word w in topic k in D_{t-1} of the previous time point;
step 4.2: during initialization, the probability that document d_m selects a new topic label K + 1 as its topic label is given by formula (9);
step 4.3: according to the K + 1 probability values above, sample the topic label k of document d_m; if TS_t does not contain topic k, add it to the topic set, i.e., TS_t = {TS_t, k}; then update the variables m_{t,k}, n_{t,k}, doc_{t,k} and n^w_{t,k} using formula (10).
6. The dynamic topic discovery algorithm for short text streams according to claim 5, characterized in that step 5 specifically includes:
step 5.1: assume the topic label originally assigned to document d_m is k; first use formula (11) to update the variables m_{t,k}, n_{t,k}, doc_{t,k} and n^w_{t,k}; if after the update doc_{t,k} becomes the empty set, remove topic label k from the topic set TS_t and remove the variables m_{t,k}, n_{t,k}, doc_{t,k} and n^w_{t,k} associated with topic k;
step 5.2: compute, using formulas (8) and (9), the probabilities that document d_m belongs to each topic in TS_t and to a new topic;
step 5.3: according to the probability values computed in the previous step, sample a new topic label k for document d_m; if a new topic is selected, i.e., k = K + 1, add k to the topic set, i.e., TS_t = {TS_t, k}; then update the variables m_{t,k}, n_{t,k}, doc_{t,k} and n^w_{t,k} using formula (10).
7. The dynamic topic discovery algorithm for short text streams according to claim 5, characterized in that step 6 specifically includes:
step 6.1: each component θ_{t,k} of the topic distribution θ_t = {θ_{t,1}, θ_{t,2}, ..., θ_{t,k}, ..., θ_{t,K}} is computed by formula (12);
step 6.2: each component φ^w_{t,k} of the word distribution φ_{t,k} of topic k is computed by formula (13).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910354228.3A CN110096704B (en) | 2019-04-29 | 2019-04-29 | Dynamic theme discovery method for short text stream |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110096704A true CN110096704A (en) | 2019-08-06 |
CN110096704B CN110096704B (en) | 2023-05-05 |
Family
ID=67446310
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910354228.3A Active CN110096704B (en) | 2019-04-29 | 2019-04-29 | Dynamic theme discovery method for short text stream |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110096704B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112507713A (en) * | 2020-12-15 | 2021-03-16 | 北京京航计算通讯研究所 | Text aggregation system based on dynamic self-aggregation topic model |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140129510A1 (en) * | 2011-07-13 | 2014-05-08 | Huawei Technologies Co., Ltd. | Parameter Inference Method, Calculation Apparatus, and System Based on Latent Dirichlet Allocation Model |
CN103942340A (en) * | 2014-05-09 | 2014-07-23 | 电子科技大学 | Microblog user interest recognizing method based on text mining |
CN104199974A (en) * | 2013-09-22 | 2014-12-10 | 中科嘉速(北京)并行软件有限公司 | Microblog-oriented dynamic topic detection and evolution tracking method |
CN107798043A (en) * | 2017-06-28 | 2018-03-13 | 贵州大学 | The Text Clustering Method of long text auxiliary short text based on the multinomial mixed model of Di Li Crays |
CN107992549A (en) * | 2017-11-28 | 2018-05-04 | 南京信息工程大学 | Dynamic short text stream Clustering Retrieval method |
CN109063030A (en) * | 2018-07-16 | 2018-12-21 | 南京信息工程大学 | A method of theme and descriptor are implied based on streaming LDA topic model discovery document |
Similar Documents
Publication | Title
---|---
CN105653706B (en) | Multi-layer citation recommendation method based on a document-content knowledge graph
TWI653542B (en) | Method, system and device for discovering and tracking hot topics based on network media data streams
CN106649260B (en) | Product feature structure tree construction method based on review text mining
CN106294593B (en) | Relation extraction method combining clause-level distant supervision and semi-supervised ensemble learning
CN106599054B (en) | Method and system for classifying and pushing questions
CN109902159A (en) | Intelligent operation-and-maintenance statement similarity matching method based on natural language processing
EP2833271A1 (en) | Multimedia question and answer system and method
CN106991127B (en) | Hierarchical classification method for knowledge-topic short texts based on topological feature expansion
CN111241294A (en) | Graph convolutional network relation extraction method based on dependency analysis and keywords
CN107273348B (en) | Joint topic and sentiment detection method and device for text
CN108647322B (en) | Word-network-based similarity identification method for massive Web text information
CN109213925B (en) | Legal text retrieval method
CN111143672B (en) | Knowledge-graph-based recommendation method for domain-expert scholars
CN108614897B (en) | Content-diversified search method for natural language
CN108804701A (en) | Person-profile model construction method based on social network big data
CN107688630B (en) | Semantics-based weakly supervised microblog multi-sentiment dictionary expansion method
CN107562772A (en) | Event extraction method, apparatus, system and storage medium
CN107329954B (en) | Topic detection method based on document content and mutual relations
CN108090077A (en) | Comprehensive similarity computation method based on natural language retrieval
CN103559193A (en) | Topic modeling method based on selected units
CN107247739A (en) | Factor-graph-based knowledge extraction method for financial publication texts
CN100543735C (en) | Document similarity measurement method based on document structure
CN106874397B (en) | Automatic semantic annotation method for Internet of Things devices
CN110717042A (en) | Method for constructing a document-keyword heterogeneous network model
CN113282689A (en) | Retrieval method and device based on domain knowledge graph and search engine
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||