CN112446220A - Short text aggregation method based on dynamic semantic modeling - Google Patents
Info
- Publication number
- CN112446220A (application number CN202011479885.XA)
- Authority
- CN
- China
- Prior art keywords
- distribution
- topic
- word
- short text
- aggregation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Abstract
The invention relates to a short text aggregation method based on dynamic semantic modeling, which comprises the following steps: acquiring short text data to be aggregated on time slices of a set interval, and preprocessing the data to form a dataset; capturing the multinomial distributions of topics and of words in the dataset by building a dynamic self-aggregation topic model on each time slice; inferring the multinomial distributions in the dynamic self-aggregation topic model by Gibbs sampling, and counting the topic distribution and word distribution on each time slice once sampling converges; and calculating, from the topic distribution and word distribution on each time slice, the probability that a short text aggregates under a topic, thereby adaptively aggregating the short texts. The method automatically aggregates short texts into standard long documents, so that more coherent topics can be captured and the sparsity problem of short texts is alleviated; no heuristic pre-processing or post-processing techniques are needed, the model is simple, and processing efficiency is high.
Description
Technical Field
The invention belongs to the technical field of short text aggregation, and particularly relates to a short text aggregation method based on dynamic semantic modeling.
Background
Short text semantic modeling processes massive short text data so as to model short text topics automatically, without additional preprocessing or post-processing operations, and to accurately infer the topics a short text may contain. Dynamic topic modeling further requires that the method support streaming data, model the temporal attributes of topics, and dynamically infer current topics from previous topics and newly acquired data. In today's social media environment in particular, large volumes of time-stamped short text data are generated every day, so a method is needed that supports dynamic topic modeling while both handling standard long text data and overcoming the data sparsity inherent in short text topic modeling. Conventional topic model approaches are designed to model long text data statically, and are less effective both at short text topic modeling and at modeling the dynamics of topics.
Disclosure of Invention
In view of the above analysis, the present invention aims to disclose a method for short text aggregation based on dynamic semantic modeling, and solve the problem of short text aggregation.
The invention discloses a short text aggregation method based on dynamic semantic modeling, which comprises the following steps:
acquiring short text data to be aggregated on a time slice with a set interval, and performing data preprocessing to form a data set;
capturing, on each time slice, the multinomial distributions θ_{t,k} of topics and φ_{t,v} of words in the dataset by building a dynamic self-aggregation topic model; t is the time slice, k = 1, 2, …, K, where K is the number of topics in the dataset, and v = 1, 2, …, V, where V is the number of words in the dataset;
in building the dynamic self-aggregation topic model, the multinomial topic distribution θ_{t,k} and word distribution φ_{t,v} of a time slice depend on the multinomial distributions θ_{t-1,k} and φ_{t-1,v} of the previous time slice;
inferring the multinomial distributions in the dynamic self-aggregation topic model by Gibbs sampling, and counting the topic distribution and word distribution on each time slice once sampling converges;
and calculating, from the topic distribution and word distribution on each time slice, the probability that a short text aggregates under a topic, thereby adaptively aggregating the short texts.
Further, in the dynamic self-aggregation topic model, Dirichlet priors are adopted to construct the persistence accuracy α_t of topics and the persistence accuracy β_t of words.
Further, the topic distribution is inferred with the Gibbs sampling algorithm as θ_{t,k} ~ Dirichlet(α_t θ_{t-1,k}), and the persistence accuracy α_t of the topic is obtained when sampling converges, wherein θ_{t-1,k} is the previous topic distribution on which the current topic distribution θ_{t,k} depends;
the word distribution φ_t of the current time slice t is inferred from the word distribution φ_{t-1} of the previous time slice t-1; the word distribution is inferred with the Gibbs sampling algorithm as φ_{t,v} ~ Dirichlet(β_t φ_{t-1,v}), and the persistence accuracy β_t of the word is obtained when sampling converges, wherein φ_{t-1,v} is the previous word distribution on which the current word distribution φ_{t,v} depends.
Further, the short text dataset within the dataset is represented by {R, S}, wherein R denotes the unordered short texts and S is the assignment to aggregated documents; the set of aggregated documents is D.
At the current time slice t, the dynamic self-aggregation topic model is built as follows:
1) for each topic k = 1, …, K in the text dataset, using the learned word persistence accuracy β_t and the word distribution φ_{t-1} of the previous time slice t-1, sample the word distribution of the current time slice t: φ_{t,k} ~ Dirichlet(β_t φ_{t-1,k}), wherein the word distribution φ_t is a multinomial distribution and Dirichlet denotes the Dirichlet distribution;
2) for each aggregated document d ∈ D in the text dataset, using the learned topic persistence accuracy α_t and the previous topic distribution θ_{t-1}, sample the topic distribution θ_{t,d} ~ Dirichlet(α_t θ_{t-1,d}), wherein the topic distribution θ_t is a multinomial distribution;
3) generate topics using the co-occurrence of word pairs: extract word pairs B_w independently from the short text R, sample the topic assignment of the short text R from a multinomial distribution, k_Rn ~ Multinomial(θ_{s,d}), and sample the word pair B_w from a multinomial distribution, w_i, w_j ~ Multinomial(φ_{z,d}); z is the topic assignment of word w_i or w_j; θ_{s,d} is the topic distribution of aggregated document d based on short text assignment and aggregation; φ_{z,d} is the topic distribution of document d based on word assignment and aggregation.
Further, word pairs are extracted from w_i, w_j, w_l, three consecutive words in the short text;
the construction of word pairs is formally represented as B_w = {(w_i, w_j) | w_i, w_j ∈ d, i ≠ j}.
Further, inferring the multinomial distributions in the dynamic self-aggregation topic model with Gibbs sampling comprises:
at the initial time, first assigning the values θ_{0,k} = 1/K and φ_{0,v} = 1/V;
obtaining the conditional probability distributions through Gibbs sampling, iteratively sampling the aggregated-document assignment S and the topic k;
computing the new topic persistence accuracy α_t and word persistence accuracy β_t in combination with maximum likelihood estimation;
counting the topic distribution and word distribution on time slice t according to the new topic persistence accuracy α_t and word persistence accuracy β_t.
Further, iteratively sampling the aggregation assignment S of the short texts and the assignment of topic k:
according to Gibbs sampling and the chain rule, a conditional distribution is adopted to sample the assignment S of the short texts;
wherein k denotes a topic, N_R denotes the total number of word pairs B_w in the short text R, N_{R,k} denotes the number of word pairs in the short text R assigned to topic k, N_{t,d,k} denotes the number of word pairs B_w in aggregated document d assigned to topic k, N_d denotes the total number of word pairs B_w in aggregated document d, a superscript ¬R denotes a count with the short text R removed, and N denotes the current count;
according to Gibbs sampling and the chain rule, a conditional distribution is adopted to sample the assignment k_{dm} of topic k, wherein dm denotes the position coordinate of the m-th word pair of aggregated document d;
w_i, w_j are the i-th and j-th words in aggregated document d, N_k denotes the total number of word pairs in topic k, N_{d,k} denotes the number of word pairs in aggregated document d assigned to topic k, a superscript ¬dm denotes a count that excludes the assignment k_{dm} at position dm, N_{k,w_i}^{¬dm} denotes the total number of times word w_i is assigned to topic k excluding k_{dm}, N_{k,w_j}^{¬dm} denotes the same for word w_j, and V denotes the total number of words.
Further, the new topic persistence accuracy α_t and word persistence accuracy β_t are computed in combination with maximum likelihood estimation,
wherein N_{k,v} denotes the number of word pairs assigned to word v in topic k, and Ψ(·) denotes the digamma function.
Further, Gibbs sampling is adopted to infer the multinomial distributions in the dynamic self-aggregation topic model, obtaining the topic distribution and word distribution on time slice t.
Further, the probability of the aggregated document d associated with topic k is the probability of short text being assigned to topic k divided by the probability of the aggregated document being assigned to topic k.
The invention can realize at least one of the following beneficial effects:
1) the dynamic topic modeling method models topics by automatically aggregating short texts into standard long documents, so that more consistent topics can be captured, the problem of sparsity of the short texts is solved, heuristic pre-processing or post-processing technologies are not needed, the model is simple, and the processing efficiency is high.
2) The novel Gibbs sampling algorithm can rapidly and effectively derive unknown variables, and calculates topic distribution and word distribution through sampling results, so as to realize topic modeling.
3) The method can effectively model topics dynamically: the current topics are inferred using the previously captured topics and newly arrived data as prior knowledge, thereby realizing dynamic topic modeling and effectively overcoming the limitation that traditional topic modeling can only infer from static data.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, wherein like reference numerals are used to designate like parts throughout.
Fig. 1 is a flowchart of a short text aggregation method based on dynamic semantic modeling in this embodiment;
fig. 2 is a schematic representation diagram of the dynamic self-aggregation topic model in this embodiment.
Detailed Description
The preferred embodiments of the present invention will now be described in detail with reference to the accompanying drawings, which form a part hereof, and which together with the embodiments of the invention serve to explain the principles of the invention.
Typically, the topics of a social network change dynamically across time slices; the time slices can be formalized as {…, t-2, t-1, t, …}, and the interval of a time slice can be set to one day, one week, one month, etc. In order to adaptively aggregate short texts into long documents, this embodiment discloses a short text aggregation method based on dynamic semantic modeling, which, as shown in fig. 1, includes the following steps:
step S1, acquiring short text data from the social network on time slices of a set interval, and preprocessing the data to form a dataset;
step S2, on each time slice, capturing the multinomial distributions θ_{t,k} of topics and φ_{t,v} of words in the dataset by building a dynamic self-aggregation topic model; t is the time slice, k = 1, 2, …, K, where K is the number of topics in the dataset, and v = 1, 2, …, V, where V is the number of words in the dataset;
in building the dynamic self-aggregation topic model, the multinomial topic distribution θ_{t,k} and word distribution φ_{t,v} of a time slice depend on the multinomial distributions θ_{t-1,k} and φ_{t-1,v} of the previous time slice;
step S3, inferring the multinomial distributions in the dynamic self-aggregation topic model by Gibbs sampling, and counting the topic distribution and word distribution on each time slice when sampling converges;
step S4, calculating, from the topic distribution and word distribution on each time slice, the probability that a short text aggregates under a topic, and adaptively aggregating the short texts.
Specifically, in step S1, the short text data crawled from the social network are preprocessed: duplicate texts and short texts with fewer than 3 words are deleted; the texts are word-segmented, stop words are removed, and words occurring fewer than 8 times are deleted; a dataset for aggregation is thus obtained.
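The preprocessing of step S1 can be sketched as follows. The function name, the simple regex tokenizer, and the toy stop-word list are illustrative assumptions (a real pipeline would use a proper word segmenter); the thresholds of 3 words and 8 occurrences come from the description above.

```python
import re
from collections import Counter

def preprocess(raw_texts, min_words=3, min_freq=8,
               stopwords=frozenset({"the", "a", "is"})):
    """Deduplicate, tokenize, drop stop words and rare words, and drop
    texts with fewer than `min_words` remaining tokens (step S1 sketch)."""
    # Delete repeated texts while preserving order.
    seen, texts = set(), []
    for t in raw_texts:
        if t not in seen:
            seen.add(t)
            texts.append(t)
    # Naive word segmentation; stop-word deletion.
    tokenized = [[w for w in re.findall(r"\w+", t.lower()) if w not in stopwords]
                 for t in texts]
    # Delete words occurring fewer than `min_freq` times across the corpus.
    freq = Counter(w for doc in tokenized for w in doc)
    tokenized = [[w for w in doc if freq[w] >= min_freq] for doc in tokenized]
    # Delete short texts with fewer than `min_words` remaining tokens.
    return [doc for doc in tokenized if len(doc) >= min_words]

corpus = [f"apple banana cherry x{i}" for i in range(10)]
docs = preprocess(corpus)
```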
Specifically, in step S2, capturing the multinomial distributions θ_{t,k} of topics and φ_{t,v} of words by building a dynamic self-aggregation topic model on each time slice comprises the following substeps:
1) extracting word pairs;
The embodiment of the invention generates topics using the co-occurrence of word pairs, rather than the word co-occurrence of traditional approaches. Each word pair contains two unordered words, and word pairs are extracted independently from the same topic; the construction of word pairs is formalized as B_w = {(w_i, w_j) | w_i, w_j ∈ d, i ≠ j}, where B_w denotes a word pair and w_i, w_j, w_l denote three consecutive words in a short text.
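A minimal sketch of the word-pair extraction. Restricting co-occurrence to a sliding window of three consecutive words (w_i, w_j, w_l) is an assumption consistent with the remark above, and the function name is hypothetical.

```python
from itertools import combinations

def extract_word_pairs(doc, window=3):
    """Extract unordered co-occurring word pairs B_w from a tokenized short
    text, pairing words inside each window of three consecutive words."""
    pairs = set()
    for start in range(max(len(doc) - window + 1, 0)):
        for wi, wj in combinations(doc[start:start + window], 2):
            if wi != wj:                       # i != j in B_w's definition
                pairs.add(tuple(sorted((wi, wj))))  # unordered pair
    return sorted(pairs)

pairs = extract_word_pairs(["storm", "hits", "coast", "tonight"])
```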
2) in the dynamic self-aggregation topic model, constructing the persistence accuracy α_t of topics and the persistence accuracy β_t of words;
Because, in building the dynamic self-aggregation topic model, the multinomial topic distribution θ_{t,k} and word distribution φ_{t,v} of a time slice depend on the multinomial distributions θ_{t-1,k} and φ_{t-1,v} of the previous time slice, the persistence accuracy α_t of topics and the persistence accuracy β_t of words are constructed with Dirichlet priors.
Specifically, the topic persistence accuracy α_t represents the persistence of a topic, i.e., the significance of topic k at the current time slice t compared with the topics of the previous time slice t-1; the current topic distribution θ_{t,k} depends on the previous topic distribution θ_{t-1,k}; the topic distribution is inferred with the Gibbs sampling algorithm as θ_{t,k} ~ Dirichlet(α_t θ_{t-1,k}), and the topic persistence accuracy α_t is obtained at sampling convergence, where θ_{t-1,k} is the previous topic distribution on which the current topic distribution θ_{t,k} depends.
The word persistence accuracy β_t represents the persistence of a word, i.e., the persistence of word w assigned to topic k at the current time slice t compared with the previous time slice t-1; the word distribution φ_t of the current time slice t is inferred from the word distribution φ_{t-1} of the previous time slice t-1; the word distribution is inferred with the Gibbs sampling algorithm as φ_{t,v} ~ Dirichlet(β_t φ_{t-1,v}), and the word persistence accuracy β_t is obtained at sampling convergence, where φ_{t-1,v} is the previous word distribution on which the current word distribution φ_{t,v} depends.
3) establishing the dynamic self-aggregation topic model.
To realize self-aggregation of short texts, the short text dataset within the dataset is represented by {R, S}, where R denotes the unordered short texts and S is the assignment to aggregated documents, a hidden variable representing the assignment relation between aggregated documents and short texts; the set of aggregated documents is D.
At the current time slice t, the dynamic self-aggregation topic model is built as follows:
1) for each topic k = 1, …, K in the text dataset, using the learned word persistence accuracy β_t and the word distribution φ_{t-1} of the previous time slice t-1,
sample the word distribution of the current time slice t: φ_{t,v} ~ Dirichlet(β_t φ_{t-1,v}), where the word distribution φ_t is a multinomial distribution and Dirichlet denotes the Dirichlet distribution;
2) for each aggregated document d ∈ D in the text dataset, using the learned topic persistence accuracy α_t and the previous topic distribution θ_{t-1},
sample the topic distribution θ_{t,d} ~ Dirichlet(α_t θ_{t-1,d}), where the topic distribution θ_t is a multinomial distribution;
3) generate topics using the co-occurrence of word pairs, extracting word pairs B_w independently from the short text;
sample the topic assignment of the short text from a multinomial distribution: k_Rn ~ Multinomial(θ_{s,d});
sample the word pair B_w from a multinomial distribution: w_i, w_j ~ Multinomial(φ_{z,d}), where z is the topic assignment of word w_i or w_j;
θ_{s,d} is the topic distribution of aggregated document d based on short text assignment and aggregation; φ_{z,d} is the topic distribution of document d based on word assignment and aggregation.
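The generative process of steps 1)-3) above can be sketched with NumPy as follows. The toy sizes, the fixed persistence accuracies, and the uniform previous-slice distributions are all assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, D = 4, 50, 6          # topics, vocabulary size, aggregated documents (toy sizes)
alpha_t, beta_t = 5.0, 5.0  # persistence accuracies (assumed already learned)

# Previous-slice distributions; uniform here, matching the model's initialization.
theta_prev = np.full((D, K), 1.0 / K)
phi_prev = np.full((K, V), 1.0 / V)

# 1) phi_{t,k} ~ Dirichlet(beta_t * phi_{t-1,k}) for each topic k
phi_t = np.array([rng.dirichlet(beta_t * phi_prev[k]) for k in range(K)])

# 2) theta_{t,d} ~ Dirichlet(alpha_t * theta_{t-1,d}) for each aggregated document d
theta_t = np.array([rng.dirichlet(alpha_t * theta_prev[d]) for d in range(D)])

# 3) for a short text assigned to aggregated document d, draw a topic for a
#    word pair, then draw the pair's two words from that topic's distribution
d = 0
k = rng.choice(K, p=theta_t[d])               # topic assignment ~ Multinomial(theta_{t,d})
w_i, w_j = rng.choice(V, size=2, p=phi_t[k])  # word pair ~ Multinomial(phi_{t,k})
```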
Specifically, the representation of the dynamic self-aggregation topic model in this embodiment is shown in fig. 2.
In step S2, the dynamic self-aggregation topic model contains the hidden variable S; to compute this hidden variable, this embodiment adopts Gibbs sampling to infer the multinomial distributions in the dynamic self-aggregation topic model.
Specifically, in step S3, inferring the multinomial distributions in the dynamic self-aggregation topic model with Gibbs sampling comprises:
1) at the initial time, first assigning the values θ_{0,k} = 1/K and φ_{0,v} = 1/V;
2) obtaining the conditional probability distributions through Gibbs sampling, iteratively sampling the aggregated-document assignment S and the topic k;
specifically, iteratively sampling the aggregation assignments S and k of the short texts comprises:
(1) according to Gibbs sampling and the chain rule, a conditional distribution is adopted to sample the assignment S of the short texts;
where k denotes a topic, N_R denotes the total number of word pairs B_w in the short text R, N_{R,k} denotes the number of word pairs in the short text R assigned to topic k, N_{t,d,k} denotes the number of word pairs B_w in aggregated document d assigned to topic k, N_d denotes the total number of word pairs B_w in aggregated document d, a superscript ¬R denotes a count with the short text R removed, and N denotes the current count;
(2) according to Gibbs sampling and the chain rule, a conditional distribution is adopted to sample the assignment k_{dm} of topic k, where dm denotes the position coordinate of the m-th word pair of aggregated document d;
w_i, w_j are the i-th and j-th words in aggregated document d, N_k denotes the total number of word pairs in topic k, N_{d,k} denotes the number of word pairs in aggregated document d assigned to topic k, a superscript ¬dm denotes a count that excludes the assignment k_{dm} at position dm, N_{k,w_i}^{¬dm} denotes the total number of times word w_i is assigned to topic k excluding k_{dm}, N_{k,w_j}^{¬dm} denotes the same for word w_j, and V denotes the total number of words.
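A count-ratio sketch of the draw of k_{dm} is shown below. It is modeled on collapsed Gibbs samplers for biterm-style models, with the previous slice's distributions acting as priors scaled by the persistence accuracies; the exact conditional is not reproduced here, so this form, and all the names in it, are assumptions rather than the patent's precise formula.

```python
import numpy as np

def sample_pair_topic(counts_dk, counts_kv, counts_k, d, wi, wj,
                      alpha_t, theta_prev_d, beta_t, phi_prev, rng):
    """One collapsed-Gibbs draw of the topic assignment k_dm for the word
    pair (wi, wj) in aggregated document d, using exclusion counts
    (the ¬dm counts) passed in as `counts_*`."""
    K = counts_dk.shape[1]
    p = np.empty(K)
    for k in range(K):
        # Document side: N_{d,k} smoothed by alpha_t * theta_{t-1,d,k}.
        doc_term = counts_dk[d, k] + alpha_t * theta_prev_d[k]
        # Word side: both words of the pair drawn from topic k.
        word_term = ((counts_kv[k, wi] + beta_t * phi_prev[k, wi]) *
                     (counts_kv[k, wj] + beta_t * phi_prev[k, wj]) /
                     ((counts_k[k] + beta_t) * (counts_k[k] + beta_t + 1)))
        p[k] = doc_term * word_term
    p /= p.sum()
    return rng.choice(K, p=p)

rng = np.random.default_rng(1)
K, V = 3, 4
k_dm = sample_pair_topic(np.zeros((2, K)), np.zeros((K, V)), np.zeros(K),
                         d=0, wi=1, wj=2, alpha_t=5.0,
                         theta_prev_d=np.full(K, 1.0 / K), beta_t=5.0,
                         phi_prev=np.full((K, V), 1.0 / V), rng=rng)
```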
3) computing the new topic persistence accuracy α_t and word persistence accuracy β_t in combination with maximum likelihood estimation;
specifically, maximum likelihood estimation is combined to compute the persistence accuracy α_t of the new topics and the persistence accuracy β_t of the words, making both persistence accuracies more accurate;
where N_{k,v} denotes the number of word pairs assigned to word v in topic k, and Ψ(·) denotes the digamma function.
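One plausible reading of this digamma-based maximum-likelihood step is a Minka-style fixed-point update of the Dirichlet scale with a fixed base measure φ_{t-1}; the update for α_t would be analogous over documents. The formula and names below are assumptions, and the finite-difference digamma stands in for `scipy.special.digamma` to keep the sketch dependency-free.

```python
from math import lgamma
import numpy as np

def digamma(x, h=1e-6):
    # Central difference on log-gamma; scipy.special.digamma would normally be used.
    return (lgamma(x + h) - lgamma(x - h)) / (2 * h)

def update_beta(beta_t, phi_prev, counts_kv, counts_k):
    """One fixed-point step for the word persistence accuracy beta_t, where
    counts_kv[k, v] = N_{k,v} and counts_k[k] = N_k (assumed update rule)."""
    K, V = phi_prev.shape
    num = sum(phi_prev[k, v] * (digamma(counts_kv[k, v] + beta_t * phi_prev[k, v])
                                - digamma(beta_t * phi_prev[k, v]))
              for k in range(K) for v in range(V))
    den = sum(digamma(counts_k[k] + beta_t) - digamma(beta_t) for k in range(K))
    return beta_t * num / den

phi_prev = np.full((2, 5), 0.2)
counts_kv = np.ones((2, 5))
new_beta = update_beta(2.0, phi_prev, counts_kv, counts_kv.sum(axis=1))
```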
4) counting the topic distribution and word distribution on time slice t according to the new topic persistence accuracy α_t and word persistence accuracy β_t.
Specifically, the topic distribution and word distribution on time slice t are derived as follows:
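The closed forms are not reproduced here; smoothing the converged sample counts with the previous slice's distributions scaled by the persistence accuracies gives the natural posterior-mean estimates, sketched below as an assumption.

```python
import numpy as np

def estimate_distributions(counts_dk, counts_kv, alpha_t, beta_t,
                           theta_prev, phi_prev):
    """Point estimates of theta_t (documents x topics) and phi_t
    (topics x words) from converged sample counts (assumed forms):
    theta_{t,d,k} ∝ N_{t,d,k} + alpha_t * theta_{t-1,d,k},
    phi_{t,k,v}  ∝ N_{k,v}   + beta_t  * phi_{t-1,k,v}."""
    theta_t = ((counts_dk + alpha_t * theta_prev) /
               (counts_dk.sum(axis=1, keepdims=True) + alpha_t))
    phi_t = ((counts_kv + beta_t * phi_prev) /
             (counts_kv.sum(axis=1, keepdims=True) + beta_t))
    return theta_t, phi_t

theta_prev = np.full((2, 3), 1.0 / 3)
phi_prev = np.full((3, 4), 0.25)
counts_dk = np.array([[2.0, 1.0, 0.0], [0.0, 0.0, 3.0]])
counts_kv = np.ones((3, 4))
theta_t, phi_t = estimate_distributions(counts_dk, counts_kv, 3.0, 4.0,
                                        theta_prev, phi_prev)
```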
In step S4, the probability of an aggregated document being related to a topic is calculated from the topic distribution and word distribution of each time slice inferred in step S3, and the short texts are adaptively aggregated.
Specifically, the probability of aggregated document d being related to topic k is the probability of the short text being assigned to topic k divided by the probability of the aggregated document being assigned to topic k;
i.e., the probability of aggregated document d being related to topic k is:
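The ratio described above can be computed directly from converged sample counts; estimating the two probabilities as count shares is an assumption, as are the names below.

```python
def aggregation_probability(n_Rk, n_R, n_dk, n_d):
    """Ratio of the short text R's probability of being assigned to topic k
    (share of R's word pairs under k) to the aggregated document d's
    probability of being assigned to topic k (share of d's word pairs
    under k); count-share estimates are an assumed reading of the text."""
    p_text = n_Rk / n_R   # P(short text R assigned to topic k)
    p_doc = n_dk / n_d    # P(aggregated document d assigned to topic k)
    return p_text / p_doc

# A short text with 2 of its 4 word pairs under topic k, against a document
# with 5 of its 20 word pairs under topic k:
score = aggregation_probability(2, 4, 5, 20)
```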
in summary, the short text aggregation method based on dynamic semantic modeling of the embodiment has the following effects:
1) the dynamic topic modeling method models topics by automatically aggregating short texts into standard long documents, so that more consistent topics can be captured, the problem of sparsity of the short texts is solved, heuristic pre-processing or post-processing technologies are not needed, the model is simple, and the processing efficiency is high.
2) The novel Gibbs sampling algorithm can rapidly and effectively derive unknown variables, and calculates topic distribution and word distribution through sampling results, so as to realize topic modeling.
3) The method can effectively model topics dynamically: the current topics are inferred using the previously captured topics and newly arrived data as prior knowledge, thereby realizing dynamic topic modeling and effectively overcoming the limitation that traditional topic modeling can only infer from static data.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.
Claims (10)
1. A short text aggregation method based on dynamic semantic modeling is characterized by comprising the following steps:
acquiring short text data to be aggregated on a time slice with a set interval, and performing data preprocessing to form a data set;
capturing, on each time slice, the multinomial distributions θ_{t,k} of topics and φ_{t,v} of words in the dataset by building a dynamic self-aggregation topic model; t is the time slice, k = 1, 2, …, K, where K is the number of topics in the dataset, and v = 1, 2, …, V, where V is the number of words in the dataset;
in building the dynamic self-aggregation topic model, the multinomial topic distribution θ_{t,k} and word distribution φ_{t,v} of a time slice depend on the multinomial distributions θ_{t-1,k} and φ_{t-1,v} of the previous time slice;
inferring the multinomial distributions in the dynamic self-aggregation topic model by Gibbs sampling, and counting the topic distribution and word distribution on each time slice once sampling converges;
and calculating, from the topic distribution and word distribution on each time slice, the probability that a short text aggregates under a topic, thereby adaptively aggregating the short texts.
3. The short text aggregation method based on dynamic semantic modeling according to claim 2, characterized in that the topic distribution is inferred with the Gibbs sampling algorithm as θ_{t,k} ~ Dirichlet(α_t θ_{t-1,k}), and the persistence accuracy α_t of the topic is obtained when sampling converges, wherein θ_{t-1,k} is the previous topic distribution on which the current topic distribution θ_{t,k} depends;
the word distribution φ_t of the current time slice t is inferred from the word distribution φ_{t-1} of the previous time slice t-1; the word distribution is inferred with the Gibbs sampling algorithm as φ_{t,v} ~ Dirichlet(β_t φ_{t-1,v}), and the persistence accuracy β_t of the word is obtained when sampling converges, wherein φ_{t-1,v} is the previous word distribution on which the current word distribution φ_{t,v} depends.
4. The short text aggregation method based on dynamic semantic modeling according to claim 3, characterized in that
the short text dataset within the dataset is represented by {R, S}, wherein R denotes the unordered short texts and S is the assignment to aggregated documents; the set of aggregated documents is D;
at the current time slice t, the dynamic self-aggregation topic model is built as follows:
1) for each topic k = 1, …, K in the text dataset, using the learned word persistence accuracy β_t and the word distribution φ_{t-1} of the previous time slice t-1, sampling the word distribution of the current time slice t: φ_{t,k} ~ Dirichlet(β_t φ_{t-1,k}), wherein the word distribution φ_t is a multinomial distribution and Dirichlet denotes the Dirichlet distribution;
2) for each aggregated document d ∈ D in the text dataset, using the learned topic persistence accuracy α_t and the previous topic distribution θ_{t-1}, sampling the topic distribution θ_{t,d} ~ Dirichlet(α_t θ_{t-1,d}), wherein the topic distribution θ_t is a multinomial distribution;
3) generating topics using the co-occurrence of word pairs: extracting word pairs B_w independently from the short text R, sampling the topic assignment of the short text R from a multinomial distribution, k_Rn ~ Multinomial(θ_{s,d}), and sampling the word pair B_w from a multinomial distribution, w_i, w_j ~ Multinomial(φ_{z,d}); z is the topic assignment of word w_i or w_j; θ_{s,d} is the topic distribution of aggregated document d based on short text assignment and aggregation; φ_{z,d} is the topic distribution of document d based on word assignment and aggregation.
6. The short text aggregation method based on semantic dynamic modeling according to claim 5, characterized in that inferring the multinomial distributions in the dynamic self-aggregation topic model with Gibbs sampling comprises:
at the initial time, first assigning the values θ_{0,k} = 1/K and φ_{0,v} = 1/V;
obtaining the conditional probability distributions through Gibbs sampling, iteratively sampling the aggregated-document assignment S and the topic k;
computing the new topic persistence accuracy α_t and word persistence accuracy β_t in combination with maximum likelihood estimation.
7. The short text aggregation method based on semantic dynamic modeling according to claim 6, characterized in that iteratively sampling the aggregation assignment S of the short texts and the assignment of topic k comprises:
according to Gibbs sampling and the chain rule, adopting a conditional distribution to sample the assignment S of the short texts;
wherein k denotes a topic, N_R denotes the total number of word pairs B_w in the short text R, N_{R,k} denotes the number of word pairs in the short text R assigned to topic k, N_{t,d,k} denotes the number of word pairs B_w in aggregated document d assigned to topic k, N_d denotes the total number of word pairs B_w in aggregated document d, a superscript ¬R denotes a count with the short text R removed, and N denotes the current count;
according to Gibbs sampling and the chain rule, adopting a conditional distribution to sample the assignment k_{dm} of topic k, wherein dm denotes the position coordinate of the m-th word pair of aggregated document d;
w_i, w_j are the i-th and j-th words in aggregated document d, N_k denotes the total number of word pairs in topic k, N_{d,k} denotes the number of word pairs in aggregated document d assigned to topic k, a superscript ¬dm denotes a count that excludes the assignment k_{dm} at position dm, N_{k,w_i}^{¬dm} denotes the total number of times word w_i is assigned to topic k excluding k_{dm}, N_{k,w_j}^{¬dm} denotes the same for word w_j, and V denotes the total number of words.
8. The short text aggregation method based on semantic dynamic modeling according to claim 7, characterized in that the persistence accuracy α_t of new topics and the persistence accuracy β_t of words are computed in combination with maximum likelihood estimation.
10. The short text aggregation method based on semantic dynamic modeling according to claim 9, wherein the probability of the aggregation document d related to the topic k is the probability of the short text being assigned to the topic k divided by the probability of the aggregation document being assigned to the topic k.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011479885.XA CN112446220A (en) | 2020-12-15 | 2020-12-15 | Short text aggregation method based on dynamic semantic modeling |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011479885.XA CN112446220A (en) | 2020-12-15 | 2020-12-15 | Short text aggregation method based on dynamic semantic modeling |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112446220A true CN112446220A (en) | 2021-03-05 |
Family
ID=74739113
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011479885.XA Pending CN112446220A (en) | 2020-12-15 | 2020-12-15 | Short text aggregation method based on dynamic semantic modeling |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112446220A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160110343A1 (en) * | 2014-10-21 | 2016-04-21 | At&T Intellectual Property I, L.P. | Unsupervised topic modeling for short texts |
CN107992549A (en) * | 2017-11-28 | 2018-05-04 | 南京信息工程大学 | Dynamic short text stream Clustering Retrieval method |
Non-Patent Citations (4)
Title |
---|
Li Lei et al.: "Topic Mining in Social Networks Based on the U_BTM Model", Application Research of Computers *
Niu Yanan: "Research on a Probabilistic Model for Short Text Clustering with Word Discriminative Power Learning", Application Research of Computers *
Shi Lei et al.: "Dynamic Topic Modeling via Self-aggregation for Short Text Streams", Peer-to-Peer Networking and Applications *
Shi Lei et al.: "Bursty Topic Detection in Social Networks Based on RNN and Topic Model", Journal on Communications *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
TWI653542B (en) | Method, system and device for discovering and tracking hot topics based on network media data flow | |
CN107391772B (en) | Text classification method based on naive Bayes | |
CN109165294B (en) | Short text classification method based on Bayesian classification | |
CN108897784B (en) | Emergency multidimensional analysis system based on social media | |
Ahmed et al. | Detecting sentiment dynamics and clusters of Twitter users for trending topics in COVID-19 pandemic | |
CN107430625B (en) | Classifying documents by clustering | |
US20180240036A1 (en) | Automatic segmentation of a collection of user profiles | |
CN109271520B (en) | Data extraction method, data extraction device, storage medium, and electronic apparatus | |
Wu et al. | Personalized microblog sentiment classification via multi-task learning | |
Perdana et al. | Combining likes-retweet analysis and naive bayes classifier within twitter for sentiment analysis | |
CN112131322B (en) | Time sequence classification method and device | |
US20180041765A1 (en) | Compact video representation for video event retrieval and recognition | |
Koo et al. | Partglot: Learning shape part segmentation from language reference games | |
CN111177559A (en) | Text travel service recommendation method and device, electronic equipment and storage medium | |
Sree et al. | Data analytics: Why data normalization | |
He et al. | Identifying user behavior on Twitter based on multi-scale entropy | |
WO2018157410A1 (en) | Efficient annotation of large sample group | |
WO2016106944A1 (en) | Method for creating virtual human on mapreduce platform | |
CN110264311B (en) | Business promotion information accurate recommendation method and system based on deep learning | |
CN107506475A (en) | A kind of magnanimity electric power customer service file classification method based on Spark | |
CN112446220A (en) | Short text aggregation method based on dynamic semantic modeling | |
CN112115712A (en) | Topic-based group emotion analysis method | |
CN112507713A (en) | Text aggregation system based on dynamic self-aggregation topic model | |
Assenmacher et al. | Textual one-pass stream clustering with automated distance threshold adaption | |
CN111178038B (en) | Document similarity recognition method and device based on latent semantic analysis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20210305 ||