CN110096704A - A dynamic topic discovery algorithm for short text streams - Google Patents

A dynamic topic discovery algorithm for short text streams Download PDF

Info

Publication number
CN110096704A
CN110096704A · Application CN201910354228.3A · Granted as CN110096704B
Authority
CN
China
Prior art keywords
theme
document
doc
formula
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910354228.3A
Other languages
Chinese (zh)
Other versions
CN110096704B (en)
Inventor
强继朋
李云
袁运浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yangzhou University
Original Assignee
Yangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yangzhou University filed Critical Yangzhou University
Priority to CN201910354228.3A priority Critical patent/CN110096704B/en
Publication of CN110096704A publication Critical patent/CN110096704A/en
Application granted granted Critical
Publication of CN110096704B publication Critical patent/CN110096704B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities

Abstract

The invention discloses a dynamic topic discovery algorithm for short text streams in the field of topic models, carried out as follows: step 1, initialize the topic of every document in the document set at the 1st time point; step 2, iteratively learn the topic of every document in the document set at the 1st time point; step 3, obtain the topic distribution and the word distribution of each topic at the 1st time point; step 4, initialize the topic of every document in the document set at time point t (t > 1); step 5, iteratively learn the topic of every document in the document set at time point t; step 6, obtain the topic distribution and the word distribution of each topic at time point t; step 7, delete the topics of time point t-1; step 8, for the document sets at subsequent time points, learn in turn using steps 4, 5, 6 and 7. The invention fully accounts for the sparsity of short texts and learns jointly with the topic distributions of the documents at the previous time point, so that the latent topics in a short text stream can be discovered more effectively.

Description

A dynamic topic discovery algorithm for short text streams
Technical field
The present invention relates to topic models, and in particular to a dynamic topic discovery algorithm for short text streams.
Background technique
In recent years, short-text content on the Internet, such as microblogs and comments, has grown explosively, and topic models that can quickly analyze short text streams have attracted increasing attention. A topic model for short text streams can discover the topics hidden in the stream and the topic distribution of each document, and can be applied to short text classification, event detection and tracking, document summarization, and so on. Topic discovery over short text streams faces the following challenges: (1) the sparsity of short texts; (2) texts arrive continuously, so storing all texts and iterating over them, as methods for static text do, is infeasible for a short text stream; (3) the topics in a text stream keep evolving, so new topics should be detected automatically and topics that are never updated should be removed.
Existing topic models for short text streams either fail to solve the problem of topic drift or do not exploit the correlation between the topics of document sets at different time points, so the topics finally discovered are often unsatisfactory to users. The existing topic model algorithms for short text streams can be roughly divided into two categories: those based on dynamic Dirichlet multinomial mixture models and those based on Dirichlet process multinomial mixture models. Both categories use a Dirichlet distribution as the prior over the latent topics and then sample from a multinomial distribution. The first category considers the correlation between the topics of document sets at different time points, i.e., many topics of the previous time point continue to appear at the current time point; but because the number of topics must be specified in advance, it cannot solve the problem of topic drift. The second category does not require the number of topics to be specified in advance, but it ignores the correlation between the topics of document sets at different time points.
Summary of the invention
The object of the present invention is to provide a dynamic topic discovery algorithm for short text streams that can handle the sparsity of short texts and the problem of topic drift, and that fully exploits the correlation between the topics of the previous time point and those of the current time point, thereby effectively improving the accuracy of topic discovery over short text streams.
The object of the present invention is achieved as follows: a dynamic topic discovery algorithm for short text streams, characterized in that the document set of the short text stream is assumed to be D = {D_1, D_2, …, D_{t-1}, D_t, …}, where D_t denotes the text collection arriving at time t. Each topic k in D_t is represented by the variables m_{t,k}, n_{t,k}, doc_{t,k} and n^w_{t,k}, where m_{t,k} is the number of documents in D_t belonging to topic k, n_{t,k} is the total number of words belonging to topic k at time t, doc_{t,k} is the set of documents in topic k at time t, and n^w_{t,k} is the total number of times word w belongs to topic k at time t. The topic set of D_t is denoted TS_t; it records the topic labels contained in D_t and is initially the empty set. The discovery algorithm proceeds as follows:
Step 1: the short document set at the 1st time point (t=1) is D_1 = {d_1, d_2, …, d_{B_1}}, containing B_1 short documents; create the topic set TS_1 of D_1; for D_1, learn the topic of every document d_m (1 ≤ m ≤ B_1) in turn by the initialization method.
Step 2: iteratively learn the latent topic of every document in the document set D_1 of the 1st time point (t=1), where the iteration number i runs from 1 to I; I is a user-set parameter. In the i-th iteration, relearn the topic label of every document d_m in D_1, recomputing it according to the sub-steps below.
Step 3: assuming TS_1 has K topics, infer the topic distribution θ_1 at the 1st time point (t=1) and the word distribution of each topic, Θ_1 = {φ_{1,1}, φ_{1,2}, …, φ_{1,k}, …, φ_{1,K}}.
Step 4: process the document set of the next time point, i.e., t = t+1. Assume the topic set TS_{t-1} of D_{t-1} has K topics; the short document set at the current time point t (t ≥ 2) is D_t = {d_1, d_2, …, d_{B_t}}, containing B_t short documents. Set TS_t to the empty set; for D_t, learn the topic of every document d_m (1 ≤ m ≤ B_t) in turn by the initialization method.
Step 5: iteratively learn the topic of every document in the document set D_t, where the iteration number i runs from 1 to I. In the i-th iteration, relearn the topic label of every document d_m in D_t, recomputing it according to the steps below.
Step 6: compute the topic distribution θ_t of D_t and the word distribution of each topic, Θ_t = {φ_{t,1}, φ_{t,2}, …, φ_{t,k}, …, φ_{t,K}}, where the topic distribution θ_t uses formula (12) and the word distribution φ_{t,k} of topic k uses formula (13).
Step 7: delete the topics corresponding to D_{t-1} and the values of the variables m_{t-1,k}, n_{t-1,k}, doc_{t-1,k} and n^w_{t-1,k} under each such topic.
Step 8: for the document sets at subsequent time points, learn in turn using steps 4, 5, 6 and 7.
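The eight steps above can be sketched end to end. Since the patent's formulas (1)-(13) are published only as images, the sampling and update expressions below follow the standard collapsed Gibbs sampler of a Dirichlet process multinomial mixture, with the previous time point's counts folded in as a prior; the class name, data layout, and exact expressions are assumptions for illustration, not the patent's verbatim formulas.

```python
import random
from collections import Counter

class StreamDPMM:
    """Sketch of the streaming algorithm of steps 1-8 (assumed DPMM forms)."""

    def __init__(self, alpha=0.1, beta=0.1, vocab_size=50, iters=20, seed=0):
        self.alpha, self.beta, self.V, self.I = alpha, beta, vocab_size, iters
        self.rng = random.Random(seed)
        # current-time statistics: m_{t,k}, n_{t,k}, n^w_{t,k}, doc_{t,k}
        self.m, self.n, self.nw, self.doc = {}, {}, {}, {}
        # previous-time statistics (m_{t-1,k}, ...), empty at t = 1
        self.pm, self.pn, self.pnw = {}, {}, {}
        self.next_k = 0

    def _prob(self, cnts, k, B):
        # Unnormalized membership probability; k=None means "open a new topic".
        denom_docs = B - 1 + sum(self.pm.values()) + self.alpha
        if k is None:
            p, n_k = self.alpha / denom_docs, 0
            get_w = lambda w: 0
        else:
            p = (self.m.get(k, 0) + self.pm.get(k, 0)) / denom_docs
            n_k = self.n.get(k, 0) + self.pn.get(k, 0)
            cw, pw = self.nw.get(k, {}), self.pnw.get(k, {})
            get_w = lambda w: cw.get(w, 0) + pw.get(w, 0)
        pos = 0
        for w, c in cnts.items():
            for j in range(c):
                p *= (get_w(w) + self.beta + j) / (n_k + self.V * self.beta + pos)
                pos += 1
        return p

    def _add(self, d, cnts, k):        # the role of formulas (1), (4), (10)
        self.m[k] = self.m.get(k, 0) + 1
        self.n[k] = self.n.get(k, 0) + sum(cnts.values())
        bag = self.nw.setdefault(k, {})
        for w, c in cnts.items():
            bag[w] = bag.get(w, 0) + c
        self.doc.setdefault(k, set()).add(d)

    def _remove(self, d, cnts, k):     # the role of formulas (5), (11)
        self.m[k] -= 1
        self.n[k] -= sum(cnts.values())
        for w, c in cnts.items():
            self.nw[k][w] -= c
        self.doc[k].discard(d)
        if not self.doc[k]:            # topic emptied: drop it from TS_t
            for s in (self.m, self.n, self.nw, self.doc):
                del s[k]

    def _sample(self, cnts, B):
        # Candidate topics: those alive now plus those carried from t-1.
        ks = sorted(set(self.m) | set(self.pm)) + [None]
        ps = [self._prob(cnts, k, B) for k in ks]
        r = self.rng.random() * sum(ps)
        for k, p in zip(ks, ps):
            r -= p
            if r <= 0:
                break
        if k is None:                  # brand-new topic label (steps 1.4/4.3)
            k, self.next_k = self.next_k, self.next_k + 1
        return k

    def process(self, docs):
        """One time point: initialization pass (steps 1/4), Gibbs iterations
        (steps 2/5), distribution estimates (steps 3/6), then shift the
        statistics and discard the t-1 variables (step 7)."""
        counts = [Counter(d) for d in docs]
        self.next_k = max(list(self.m) + list(self.pm), default=-1) + 1
        labels = []
        for i, c in enumerate(counts):           # initialization
            k = self._sample(c, len(docs))
            self._add(i, c, k)
            labels.append(k)
        for _ in range(self.I):                  # iterative relearning
            for i, c in enumerate(counts):
                self._remove(i, c, labels[i])
                labels[i] = self._sample(c, len(docs))
                self._add(i, c, labels[i])
        theta = {k: self.m[k] / len(docs) for k in self.m}
        phi = {k: {w: (self.nw[k][w] + self.beta) / (self.n[k] + self.V * self.beta)
                   for w in self.nw[k]} for k in self.m}
        self.pm, self.pn, self.pnw = self.m, self.n, self.nw   # step 7
        self.m, self.n, self.nw, self.doc = {}, {}, {}, {}
        return labels, theta, phi
```

Calling `process(batch)` once per arriving batch returns the topic label of every document plus the estimated θ_t and Φ_t, and then discards the t-1 statistics in the spirit of step 7.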
As a further refinement of the invention, step 1 specifically includes:
Step 1.1: set the topic label k of the first document d_1 to 0; create topic k=0 and the variables m_{1,0}, n_{1,0}, doc_{1,0} and n^w_{1,0} representing topic k=0; using formula (1), update the variables m_{1,0}, n_{1,0}, doc_{1,0} and n^w_{1,0} corresponding to topic k=0, and add topic 0 to the topic set TS_1, i.e., TS_1 = {0}.
In formula (1), N_{d_1} is the number of words in document d_1 and N^w_{d_1} is the number of occurrences of word w in document d_1.
Step 1.2: for D_1, learn the topic label of each document d_m (2 ≤ m ≤ B_1) in turn. Assume the number of topics in the current topic set TS_1 is K. During initialization, the probability that document d_m selects topic k (1 ≤ k ≤ K) from the K topics as its topic label is given by formula (2).
In formula (2), z_{d_m} denotes the topic label of document d_m, α and β are user-set parameters, N_{d_m} is the number of words in document d_m, N^w_{d_m} is the number of occurrences of word w in document d_m, and V is the size of the vocabulary.
Step 1.3: during initialization, the probability that document d_m selects a new topic label K+1 as its topic label is given by formula (3).
Step 1.4: sample the topic label k of document d_m according to the K+1 probability values above. If k = K+1, create topic k and the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k} representing topic k, where m_{1,k}, n_{1,k} and n^w_{1,k} are initialized to 0 and doc_{1,k} to the empty set, and add k to the topic set, i.e., TS_1 = {TS_1, k}. Finally, update the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k} using formula (4).
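For a concrete sense of steps 1.2-1.4, the functions below compute the two membership probabilities in the form standard for Dirichlet process multinomial mixtures; the actual formulas (2) and (3) are published only as images, so these expressions, and the function names, are assumptions for illustration.

```python
import random

def existing_topic_prob(word_counts, m_k, n_k, nw_k, B, V, alpha, beta):
    # Unnormalized probability that document d_m joins an existing topic k
    # (the role of formula (2)): the first factor favors topics that already
    # hold more documents; the word product favors topics whose vocabulary
    # overlaps the document's.
    p = m_k / (B - 1 + alpha)
    pos = 0
    for w, cnt in word_counts.items():
        for j in range(cnt):
            p *= (nw_k.get(w, 0) + beta + j) / (n_k + V * beta + pos)
            pos += 1
    return p

def new_topic_prob(word_counts, B, V, alpha, beta):
    # Unnormalized probability of opening a new topic K+1 (formula (3)'s role).
    p = alpha / (B - 1 + alpha)
    pos = 0
    for w, cnt in word_counts.items():
        for j in range(cnt):
            p *= (beta + j) / (V * beta + pos)
            pos += 1
    return p

def sample_label(probs, rng=random):
    # Step 1.4: draw a label index proportionally to the K+1 probabilities.
    r = rng.random() * sum(probs)
    for k, p in enumerate(probs):
        r -= p
        if r <= 0:
            return k
    return len(probs) - 1
```

A topic that already contains the document's words scores higher than one that does not, which is exactly the "shares more similarity" tendency the description attributes to formula (2).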
As a further refinement of the invention, step 2 specifically includes:
Step 2.1: assume the topic label originally assigned to document d_m is k; first use formula (5) to update the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k}. If, after the update, doc_{1,k} becomes the empty set, remove topic label k from the topic set TS_1 and remove the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k} associated with topic k.
Step 2.2: recompute, using formulas (2) and (3), the probabilities that document d_m belongs to the existing topics and to a new topic, and sample a new topic label k for document d_m. If k = K+1, create topic k and the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k} representing topic k, where m_{1,k}, n_{1,k} and n^w_{1,k} are initialized to 0 and doc_{1,k} to the empty set, and add k to the topic set, i.e., TS_1 = {TS_1, k}. Then update the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k} using formula (4).
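The bookkeeping of step 2.1 — subtract a document's counts from its topic and delete the topic once empty — can be shown in isolation. The `stats` layout (one dict per topic with keys 'm', 'n', 'nw', 'docs') is an assumed representation of the variables m_{1,k}, n_{1,k}, n^w_{1,k} and doc_{1,k}, not taken from the patent.

```python
from collections import Counter

def remove_document(stats, doc_id, word_counts, k):
    """Step 2.1 bookkeeping (the role of formula (5)): subtract one document's
    counts from topic k and delete the topic entirely if it becomes empty."""
    s = stats[k]
    s['m'] -= 1
    s['n'] -= sum(word_counts.values())
    for w, c in word_counts.items():
        s['nw'][w] -= c
    s['docs'].discard(doc_id)
    if not s['docs']:          # doc_{1,k} became empty: remove topic k
        del stats[k]

# Two identical two-word documents in topic 0; remove them one by one.
stats = {0: {'m': 2, 'n': 4, 'nw': {'a': 2, 'b': 2}, 'docs': {0, 1}}}
remove_document(stats, 0, Counter({'a': 1, 'b': 1}), 0)
remove_document(stats, 1, Counter({'a': 1, 'b': 1}), 0)
```

After the second removal, topic 0 holds no documents and has been dropped, which is how unused topics disappear during the Gibbs iterations.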
As a further refinement of the invention, step 3 specifically includes:
Step 3.1: the component θ_{1,k} of the topic distribution θ_1 = {θ_{1,1}, θ_{1,2}, …, θ_{1,k}, …, θ_{1,K}} is computed as follows.
Step 3.2: the component φ^w_{1,k} of the word distribution φ_{1,k} of topic k is computed as follows.
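Step 3's point estimates can be sketched directly from the per-topic statistics. The patent's exact formulas are published as images, so the standard mixture-model estimates below — θ_{1,k} as the fraction of documents in topic k, φ_{1,k} as β-smoothed word frequencies — are an assumption.

```python
def estimate_distributions(m, n, nw, B, V, beta):
    """Assumed point estimates after sampling: theta_k = m_k / B and
    phi_k(w) = (n^w_k + beta) / (n_k + V * beta)."""
    theta = {k: m[k] / B for k in m}
    phi = {k: {w: (nw[k][w] + beta) / (n[k] + V * beta) for w in nw[k]}
           for k in m}
    return theta, phi

# Four documents split evenly over two topics; topic 0 holds 'a', topic 1 'b'.
theta, phi = estimate_distributions(
    m={0: 2, 1: 2}, n={0: 3, 1: 1},
    nw={0: {'a': 3}, 1: {'b': 1}}, B=4, V=2, beta=0.5)
```

Here θ sums to 1 over topics, and the smoothing keeps every φ value strictly positive even for rarely seen words.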
As a further refinement of the invention, step 4 specifically includes:
Step 4.1: during the initialization of D_t, the probability that document d_m selects topic k (1 ≤ k ≤ K) from the K topics as its topic label is given by formula (8).
In formula (8), m_{t-1,k} is the number of documents in D_{t-1} of the previous time point that belong to topic k, and n^w_{t-1,k} is the total number of occurrences of word w in topic k in D_{t-1} of the previous time point.
Step 4.2: during initialization, the probability that document d_m selects a new topic label K+1 as its topic label is given by formula (9).
Step 4.3: sample the topic label k of document d_m according to the K+1 probability values above. If TS_t does not contain topic k, add it to the topic set, i.e., TS_t = {TS_t, k}. Then update the variables m_{t,k}, n_{t,k}, doc_{t,k} and n^w_{t,k} using formula (10).
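The distinguishing feature of step 4 is that the previous time point's counts act as a prior on the current membership probability. Formula (8) itself is published as an image, so the expression below is an assumed form in the spirit of the standard collapsed DPMM sampler; the function name and the `cur`/`prev` layout are illustrative.

```python
def prob_topic_with_prior(word_counts, k, cur, prev, B, V, alpha, beta):
    """Unnormalized probability that a document at time t joins topic k,
    combining current-time counts with the previous time point's counts
    (the role formula (8) plays). cur and prev map each topic k to a dict
    with keys 'm' (doc count), 'n' (word count), 'nw' (per-word counts)."""
    c = cur.get(k, {'m': 0, 'n': 0, 'nw': {}})
    p0 = prev.get(k, {'m': 0, 'n': 0, 'nw': {}})
    prior_docs = sum(v['m'] for v in prev.values())
    p = (c['m'] + p0['m']) / (B - 1 + prior_docs + alpha)
    n_k, pos = c['n'] + p0['n'], 0
    for w, cnt in word_counts.items():
        for j in range(cnt):
            nw = c['nw'].get(w, 0) + p0['nw'].get(w, 0)
            p *= (nw + beta + j) / (n_k + V * beta + pos)
            pos += 1
    return p
```

With this form, a topic that dominated time t-1 and matches the new document's words is preferred over an equally large carried-over topic with no word overlap, which is the continuity property the description claims for formula (8).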
As a further refinement of the invention, step 5 specifically includes:
Step 5.1: assume the topic label originally assigned to document d_m is k; first use formula (11) to update the variables m_{t,k}, n_{t,k}, doc_{t,k} and n^w_{t,k}. If, after the update, doc_{t,k} becomes the empty set, remove topic label k from the topic set TS_t and remove the variables m_{t,k}, n_{t,k}, doc_{t,k} and n^w_{t,k} associated with topic k.
Step 5.2: compute, using formulas (8) and (9), the probabilities that document d_m belongs to the topics in TS_t and to a new topic.
Step 5.3: sample a new topic label k for document d_m according to the probability values computed in the previous step. If a new topic is selected, i.e., k = K+1, add k to the topic set, i.e., TS_t = {TS_t, k}. Then update the variables m_{t,k}, n_{t,k}, doc_{t,k} and n^w_{t,k} using formula (10).
As a further refinement of the invention, step 6 specifically includes:
Step 6.1: the component θ_{t,k} of the topic distribution θ_t = {θ_{t,1}, θ_{t,2}, …, θ_{t,k}, …, θ_{t,K}} is computed by formula (12).
Step 6.2: the component φ^w_{t,k} of the word distribution φ_{t,k} of topic k is computed by formula (13).
Compared with the prior art, the invention has the following advantages:
1. When learning the latent topics of each time point, the present invention can learn the number of topics by itself through the Dirichlet process, without manually specifying the number of latent topics. This overcomes the difficulty many traditional topic models have in determining the number of topics and greatly enhances the applicability and convenience of the model.
2. When learning the latent topics of the short document set at the current time point, the present invention can combine the latent topic distribution of the previous time point with that of the current time point, overcoming the limitation of some conventional models that consider only the current time point; a topic that is important at the previous time point is very likely also important at the current time point, so this effectively improves the performance of the algorithm.
3. When learning the latent topics of the short document set at the current time point, the present invention considers only the latent topic distribution of the previous time point and ignores the latent topic distributions of earlier time points, which greatly improves the efficiency of the algorithm on massive text streams and overcomes the poor timeliness of conventional models that process the whole document collection.
4. The present invention can automatically identify new topics and automatically delete outdated topics, which not only improves the algorithm's handling of topic drift but also prevents the number of topics from growing without bound as data keeps arriving, greatly improving the practicality of the algorithm.
Specific embodiment
The present invention will be further described below in combination with specific embodiments.
In this embodiment, a dynamic topic discovery algorithm for short text streams is carried out as follows:
Assume the document set of the short text stream is D = {D_1, D_2, …, D_{t-1}, D_t, …}, where D_t denotes the text collection arriving at time t. The document set of each time point learns its topic labels using a Dirichlet process mixture model, whose generative process is constructed with the stick-breaking construction. For a document d, the generative process is as follows:
θ | α ~ GEM(α)
φ_k | β ~ Dir(β), k = 1, …, ∞
z_d | θ ~ Mult(θ), d = 1, …, ∞
Here, GEM is the Griffiths-Engen-McCloskey distribution, Dir is the Dirichlet distribution, and Mult is the multinomial distribution; α and β are hyperparameters, θ denotes the topic distribution of documents, φ_k denotes the word distribution of topic k, and z_d denotes the topic label of document d. It is assumed that each document belongs to only one topic; this assumption suits short document collections, because each short document carries only very limited information. Given the topic label z_d, the probability of generating document d is computed as follows.
The present method learns the latent variables θ, φ and z_d by integrating out θ and φ and inferring z_d through collapsed Gibbs sampling. Finally, θ and φ are computed from z_d.
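The GEM prior in the generative process above can be simulated with the stick-breaking construction: break a unit-length stick repeatedly, where each break fraction is Beta(1, α)-distributed. The sketch below is purely illustrative (truncated to a finite number of sticks, although the process itself is infinite); the function name is hypothetical.

```python
import random

def stick_breaking_weights(alpha, num_sticks, seed=0):
    """Draw the first num_sticks weights of theta ~ GEM(alpha):
    v_k ~ Beta(1, alpha) and theta_k = v_k * prod_{j<k} (1 - v_j)."""
    rng = random.Random(seed)
    weights, remaining = [], 1.0
    for _ in range(num_sticks):
        v = rng.betavariate(1.0, alpha)
        weights.append(v * remaining)   # take a v-fraction of what is left
        remaining *= 1.0 - v            # the rest stays for later topics
    return weights

weights = stick_breaking_weights(alpha=2.0, num_sticks=200)
```

The weights are nonnegative and sum to at most 1, with almost all mass on the first few hundred sticks; a smaller α concentrates the mass on fewer topics, which is why α controls how readily new topics appear.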
Each topic k in D_t is represented by the variables m_{t,k}, n_{t,k}, doc_{t,k} and n^w_{t,k}, where m_{t,k} is the number of documents in D_t belonging to topic k, n_{t,k} is the total number of words belonging to topic k at time t, doc_{t,k} is the set of documents in topic k at time t, and n^w_{t,k} is the total number of times word w belongs to topic k at time t. The topic set of D_t is denoted TS_t; it records the topic labels contained in D_t and is initially the empty set.
Step 1: the short document set at the 1st time point (t=1) is D_1 = {d_1, d_2, …, d_{B_1}}, containing B_1 short documents; create the topic set TS_1 of D_1; for D_1, learn the topic of every document d_m (1 ≤ m ≤ B_1) in turn by the initialization method. It should be emphasized that the algorithm does not need the number of topics to be specified in advance.
Step 1.1: set the topic label k of the first document d_1 to 0; create topic k=0 and the variables m_{1,0}, n_{1,0}, doc_{1,0} and n^w_{1,0} representing topic k=0; using formula (1), update these variables and add topic 0 to the topic set TS_1, i.e., TS_1 = {0}.
In formula (1), N_{d_1} is the number of words in document d_1 and N^w_{d_1} is the number of occurrences of word w in document d_1.
Step 1.2: for D_1, learn the topic label of each document d_m (2 ≤ m ≤ B_1) in turn. Assume the number of topics in the current topic set TS_1 is K. During initialization, the probability that document d_m selects topic k (1 ≤ k ≤ K) from the K topics as its topic label is given by formula (2).
In formula (2), z_{d_m} denotes the topic label of document d_m, α and β are user-set parameters, N_{d_m} is the number of words in document d_m, N^w_{d_m} is the number of occurrences of word w in document d_m, and V is the size of the vocabulary. In this example, α and β are both set to 0.1. The first half of formula (2) means that each document tends to select a topic to which more documents already belong, and the second half means that it tends to select a topic that shares more similarity with the document. Since the text collection of the first time point is not preceded by any other collection, learning can only be based on this collection itself.
Step 1.3: during initialization, the probability that document d_m selects a new topic label K+1 as its topic label is given by formula (3).
Step 1.4: sample the topic label k of document d_m according to the K+1 probability values above. If k = K+1, create topic k and the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k} representing topic k, where m_{1,k}, n_{1,k} and n^w_{1,k} are initialized to 0 and doc_{1,k} to the empty set, and add k to the topic set, i.e., TS_1 = {TS_1, k}. Finally, update the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k} using formula (4).
Step 2: iteratively learn the latent topic of every document in the document set D_1 of the 1st time point (t=1), where the iteration number i runs from 1 to I; I is a user-set parameter, set to 100 in this example. In the i-th iteration, relearn the topic label of every document d_m in D_1, recomputing it according to the following sub-steps:
Step 2.1: assume the topic label originally assigned to document d_m is k; first use formula (5) to update the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k}. If, after the update, doc_{1,k} becomes the empty set, remove topic label k from the topic set TS_1 and remove the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k} associated with topic k.
Step 2.2: recompute, using formulas (2) and (3), the probabilities that document d_m belongs to the existing topics and to a new topic, and sample a new topic label k for document d_m. If k = K+1, create topic k and the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k} representing topic k, where m_{1,k}, n_{1,k} and n^w_{1,k} are initialized to 0 and doc_{1,k} to the empty set, and add k to the topic set, i.e., TS_1 = {TS_1, k}. Then update the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k} using formula (4).
Step 3: assuming TS_1 has K topics (i.e., K is the number of topics in TS_1), infer the topic distribution θ_1 at the 1st time point (t=1) and the word distribution of each topic, Θ_1 = {φ_{1,1}, φ_{1,2}, …, φ_{1,k}, …, φ_{1,K}}:
Step 3.1: the component θ_{1,k} of the topic distribution θ_1 = {θ_{1,1}, θ_{1,2}, …, θ_{1,k}, …, θ_{1,K}} is computed as follows.
Step 3.2: the component φ^w_{1,k} of the word distribution φ_{1,k} of topic k is computed as follows.
Step 4: process the document set of the next time point, i.e., t = t+1. Assume the topic set TS_{t-1} of D_{t-1} has K topics; the short document set at the current time point t (t ≥ 2) is D_t = {d_1, d_2, …, d_{B_t}}, containing B_t short documents. Set TS_t to the empty set; for D_t, learn the topic of every document d_m (1 ≤ m ≤ B_t) in turn by the initialization method.
Step 4.1: during the initialization of D_t, the probability that document d_m selects topic k (1 ≤ k ≤ K) from the K topics as its topic label is given by formula (8).
In formula (8), m_{t-1,k} is the number of documents in D_{t-1} of the previous time point that belong to topic k, and n^w_{t-1,k} is the total number of occurrences of word w in topic k in D_{t-1} of the previous time point. Unlike formula (2), formula (8) also considers the topic distribution of the previous time point's documents, because across two consecutive time points many topics are correlated.
Step 4.2: during initialization, the probability that document d_m selects a new topic label K+1 as its topic label is given by formula (9).
Formula (9) ensures that the topics of the current time point's document set are not drawn solely from the topic set of the previous time point; new topics can also be learned, thus accommodating topic drift.
Step 4.3: sample the topic label k of document d_m according to the K+1 probability values above. If TS_t does not contain topic k, add it to the topic set, i.e., TS_t = {TS_t, k}. Then update the variables m_{t,k}, n_{t,k}, doc_{t,k} and n^w_{t,k} using formula (10).
Step 5: iteratively learn the topic of every document in the document set D_t, where the iteration number i runs from 1 to I. In the i-th iteration, relearn the topic label of every document d_m in D_t, recomputing it according to the following steps:
Step 5.1: assume the topic label originally assigned to document d_m is k; first use formula (11) to update the variables m_{t,k}, n_{t,k}, doc_{t,k} and n^w_{t,k}. If, after the update, doc_{t,k} becomes the empty set, remove topic label k from the topic set TS_t and remove the variables m_{t,k}, n_{t,k}, doc_{t,k} and n^w_{t,k} associated with topic k. This step deletes topics that are no longer updated, so the number of topics does not grow without bound.
Step 5.2: compute, using formulas (8) and (9), the probabilities that document d_m belongs to the topics in TS_t and to a new topic.
Step 5.3: sample a new topic label k for document d_m according to the probability values computed in the previous step. If a new topic is selected, i.e., k = K+1, add k to the topic set, i.e., TS_t = {TS_t, k}. Then update the variables m_{t,k}, n_{t,k}, doc_{t,k} and n^w_{t,k} using formula (10).
Step 6: compute the topic distribution θ_t of D_t and the word distribution of each topic, Θ_t = {φ_{t,1}, φ_{t,2}, …, φ_{t,k}, …, φ_{t,K}}, where the topic distribution θ_t uses formula (12) and the word distribution φ_{t,k} of topic k uses formula (13).
Step 6.1: the component θ_{t,k} of the topic distribution θ_t = {θ_{t,1}, θ_{t,2}, …, θ_{t,k}, …, θ_{t,K}} is computed by formula (12).
Step 6.2: the component φ^w_{t,k} of the word distribution φ_{t,k} of topic k is computed by formula (13).
Step 7: delete the topics corresponding to D_{t-1} and the values of the variables m_{t-1,k}, n_{t-1,k}, doc_{t-1,k} and n^w_{t-1,k} under each such topic. This step removes the topics and variables of the previous time point, which not only reduces memory consumption but also deletes some outdated topics.
Step 8: for the document sets at subsequent time points, learn in turn using steps 4, 5, 6 and 7.
The present invention is not limited to the above embodiment. On the basis of the technical solution disclosed by the invention, those skilled in the art can, according to the disclosed technical content, make certain replacements and variations to some of the technical features without creative work; such replacements and variations fall within the scope of the invention.

Claims (7)

1. A dynamic topic discovery algorithm for short text streams, characterized in that the document set of the short text stream is assumed to be D = {D_1, D_2, …, D_{t-1}, D_t, …}, where D_t denotes the text collection arriving at time t; each topic k in D_t is represented by the variables m_{t,k}, n_{t,k}, doc_{t,k} and n^w_{t,k}, where m_{t,k} is the number of documents in D_t belonging to topic k, n_{t,k} is the total number of words belonging to topic k at time t, doc_{t,k} is the set of documents in topic k at time t, and n^w_{t,k} is the total number of times word w belongs to topic k at time t; the topic set of D_t is denoted TS_t, records the topic labels contained in D_t, and is initially the empty set; the discovery algorithm proceeds as follows:
step 1: the short document set at the 1st time point (t=1) is D_1 = {d_1, d_2, …, d_{B_1}}, containing B_1 short documents; create the topic set TS_1 of D_1; for D_1, learn the topic of every document d_m (1 ≤ m ≤ B_1) in turn by the initialization method;
step 2: iteratively learn the latent topic of every document in the document set D_1 of the 1st time point (t=1), where the iteration number i runs from 1 to I, and I is a user-set parameter; in the i-th iteration, relearn the topic label of every document d_m in D_1, recomputing it according to the following sub-steps;
step 3: assuming TS_1 has K topics, infer the topic distribution θ_1 at the 1st time point (t=1) and the word distribution of each topic, Θ_1 = {φ_{1,1}, φ_{1,2}, …, φ_{1,k}, …, φ_{1,K}};
step 4: process the document set of the next time point, i.e., t = t+1; assume the topic set TS_{t-1} of D_{t-1} has K topics; the short document set at the current time point t (t ≥ 2) is D_t = {d_1, d_2, …, d_{B_t}}, containing B_t short documents; set TS_t to the empty set; for D_t, learn the topic of every document d_m (1 ≤ m ≤ B_t) in turn by the initialization method;
step 5: iteratively learn the topic of every document in the document set D_t, where the iteration number i runs from 1 to I; in the i-th iteration, relearn the topic label of every document d_m in D_t, recomputing it according to the following steps;
step 6: compute the topic distribution θ_t of D_t and the word distribution of each topic, Θ_t = {φ_{t,1}, φ_{t,2}, …, φ_{t,k}, …, φ_{t,K}}, where the topic distribution θ_t uses formula (12) and the word distribution φ_{t,k} of topic k uses formula (13);
step 7: delete the topics corresponding to D_{t-1} and the values of the variables m_{t-1,k}, n_{t-1,k}, doc_{t-1,k} and n^w_{t-1,k} under each such topic;
step 8: for the document sets at subsequent time points, learn in turn using steps 4, 5, 6 and 7.
2. The dynamic topic discovery algorithm for short text streams according to claim 1, characterized in that step 1 specifically includes:
step 1.1: set the topic label k of the first document d_1 to 0; create topic k=0 and the variables m_{1,0}, n_{1,0}, doc_{1,0} and n^w_{1,0} representing topic k=0; using formula (1), update the variables m_{1,0}, n_{1,0}, doc_{1,0} and n^w_{1,0} corresponding to topic k=0, and add topic 0 to the topic set TS_1, i.e., TS_1 = {0};
in formula (1), N_{d_1} is the number of words in document d_1 and N^w_{d_1} is the number of occurrences of word w in document d_1;
step 1.2: for D_1, learn the topic label of each document d_m (2 ≤ m ≤ B_1) in turn; assume the number of topics in the current topic set TS_1 is K; during initialization, the probability that document d_m selects topic k (1 ≤ k ≤ K) from the K topics as its topic label is given by formula (2);
in formula (2), z_{d_m} denotes the topic label of document d_m, α and β are user-set parameters, N_{d_m} is the number of words in document d_m, N^w_{d_m} is the number of occurrences of word w in document d_m, and V is the size of the vocabulary;
step 1.3: during initialization, the probability that document d_m selects a new topic label K+1 as its topic label is given by formula (3);
step 1.4: sample the topic label k of document d_m according to the K+1 probability values above; if k = K+1, create topic k and the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k} representing topic k, where m_{1,k}, n_{1,k} and n^w_{1,k} are initialized to 0 and doc_{1,k} to the empty set, and add k to the topic set, i.e., TS_1 = {TS_1, k}; finally, update the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k} using formula (4).
3. The dynamic topic discovery algorithm for short text streams according to claim 2, characterized in that step 2 specifically includes:
step 2.1: assume the topic label originally assigned to document d_m is k; first use formula (5) to update the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k}; if, after the update, doc_{1,k} becomes the empty set, remove topic label k from the topic set TS_1 and remove the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k} associated with topic k;
step 2.2: recompute, using formulas (2) and (3), the probabilities that document d_m belongs to the existing topics and to a new topic, and sample a new topic label k for document d_m; if k = K+1, create topic k and the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k} representing topic k, where m_{1,k}, n_{1,k} and n^w_{1,k} are initialized to 0 and doc_{1,k} to the empty set, and add k to the topic set, i.e., TS_1 = {TS_1, k}; then update the variables m_{1,k}, n_{1,k}, doc_{1,k} and n^w_{1,k} using formula (4).
4. The dynamic topic discovery algorithm for short text streams according to claim 3, characterized in that step 3 specifically includes:
step 3.1: compute the component θ_{1,k} of the topic distribution θ_1 = {θ_{1,1}, θ_{1,2}, …, θ_{1,k}, …, θ_{1,K}} as follows;
step 3.2: compute the component φ^w_{1,k} of the word distribution φ_{1,k} of topic k as follows.
5. The dynamic topic discovery algorithm for short text streams according to claim 4, characterized in that step 4 specifically includes:
step 4.1: during the initialization of D_t, the probability that document d_m selects topic k (1 ≤ k ≤ K) from the K topics as its topic label is given by formula (8);
in formula (8), m_{t-1,k} is the number of documents in D_{t-1} of the previous time point that belong to topic k, and n^w_{t-1,k} is the total number of occurrences of word w in topic k in D_{t-1} of the previous time point;
step 4.2: during initialization, the probability that document d_m selects a new topic label K+1 as its topic label is given by formula (9);
step 4.3: sample the topic label k of document d_m according to the K+1 probability values above; if TS_t does not contain topic k, add it to the topic set, i.e., TS_t = {TS_t, k}; then update the variables m_{t,k}, n_{t,k}, doc_{t,k} and n^w_{t,k} using formula (10).
6. The dynamic topic discovery algorithm for short text streams according to claim 5, characterized in that step 5 specifically includes:
step 5.1: assume the topic label originally assigned to document d_m is k; first use formula (11) to update the variables m_{t,k}, n_{t,k}, doc_{t,k} and n^w_{t,k}; if, after the update, doc_{t,k} becomes the empty set, remove topic label k from the topic set TS_t and remove the variables m_{t,k}, n_{t,k}, doc_{t,k} and n^w_{t,k} associated with topic k;
step 5.2: compute, using formulas (8) and (9), the probabilities that document d_m belongs to the topics in TS_t and to a new topic;
step 5.3: sample a new topic label k for document d_m according to the probability values computed in the previous step; if a new topic is selected, i.e., k = K+1, add k to the topic set, i.e., TS_t = {TS_t, k}; then update the variables m_{t,k}, n_{t,k}, doc_{t,k} and n^w_{t,k} using formula (10).
7. The dynamic topic discovery algorithm for short text streams according to claim 5, characterized in that step 6 specifically includes:
step 6.1: compute the component θ_{t,k} of the topic distribution θ_t = {θ_{t,1}, θ_{t,2}, …, θ_{t,k}, …, θ_{t,K}} as follows;
step 6.2: compute the component φ^w_{t,k} of the word distribution φ_{t,k} of topic k as follows.
CN201910354228.3A 2019-04-29 2019-04-29 Dynamic theme discovery method for short text stream Active CN110096704B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910354228.3A CN110096704B (en) 2019-04-29 2019-04-29 Dynamic theme discovery method for short text stream

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910354228.3A CN110096704B (en) 2019-04-29 2019-04-29 Dynamic theme discovery method for short text stream

Publications (2)

Publication Number Publication Date
CN110096704A true CN110096704A (en) 2019-08-06
CN110096704B CN110096704B (en) 2023-05-05

Family

ID=67446310

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910354228.3A Active CN110096704B (en) 2019-04-29 2019-04-29 Dynamic theme discovery method for short text stream

Country Status (1)

Country Link
CN (1) CN110096704B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112507713A (en) * 2020-12-15 2021-03-16 北京京航计算通讯研究所 Text aggregation system based on dynamic self-aggregation topic model

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140129510A1 (en) * 2011-07-13 2014-05-08 Huawei Technologies Co., Ltd. Parameter Inference Method, Calculation Apparatus, and System Based on Latent Dirichlet Allocation Model
CN103942340A (en) * 2014-05-09 2014-07-23 电子科技大学 Microblog user interest recognizing method based on text mining
CN104199974A (en) * 2013-09-22 2014-12-10 中科嘉速(北京)并行软件有限公司 Microblog-oriented dynamic topic detection and evolution tracking method
CN107798043A (en) * 2017-06-28 2018-03-13 贵州大学 The Text Clustering Method of long text auxiliary short text based on the multinomial mixed model of Di Li Crays
CN107992549A (en) * 2017-11-28 2018-05-04 南京信息工程大学 Dynamic short text stream Clustering Retrieval method
CN109063030A (en) * 2018-07-16 2018-12-21 南京信息工程大学 A method of theme and descriptor are implied based on streaming LDA topic model discovery document


Also Published As

Publication number Publication date
CN110096704B (en) 2023-05-05

Similar Documents

Publication Publication Date Title
CN105653706B (en) A kind of multilayer quotation based on literature content knowledge mapping recommends method
TWI653542B (en) Method, system and device for discovering and tracking hot topics based on network media data flow
CN106649260B (en) Product characteristic structure tree construction method based on comment text mining
CN106294593B (en) In conjunction with the Relation extraction method of subordinate clause grade remote supervisory and semi-supervised integrated study
CN106599054B (en) Method and system for classifying and pushing questions
CN109902159A (en) A kind of intelligent O&M statement similarity matching process based on natural language processing
EP2833271A1 (en) Multimedia question and answer system and method
CN106991127B (en) Knowledge subject short text hierarchical classification method based on topological feature expansion
CN111241294A (en) Graph convolution network relation extraction method based on dependency analysis and key words
CN107273348B (en) Topic and emotion combined detection method and device for text
CN108647322B (en) Method for identifying similarity of mass Web text information based on word network
CN109213925B (en) Legal text searching method
CN111143672B (en) Knowledge graph-based professional speciality scholars recommendation method
CN108614897B (en) Content diversification searching method for natural language
CN108804701A (en) Personage's portrait model building method based on social networks big data
CN107688630B (en) Semantic-based weakly supervised microbo multi-emotion dictionary expansion method
CN107562772A (en) Event extraction method, apparatus, system and storage medium
CN107329954B (en) Topic detection method based on document content and mutual relation
CN108090077A (en) A kind of comprehensive similarity computational methods based on natural language searching
CN103559193A (en) Topic modeling method based on selected cell
CN107247739A (en) A kind of financial publication text knowledge extracting method based on factor graph
CN100543735C (en) File similarity measure method based on file structure
CN106874397B (en) Automatic semantic annotation method for Internet of things equipment
CN110717042A (en) Method for constructing document-keyword heterogeneous network model
CN113282689A (en) Retrieval method and device based on domain knowledge graph and search engine

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant